About the job
About Us
At Vercept, we are an energetic, mission-focused team with a proven record of academic excellence. Our researchers have made significant contributions to artificial intelligence, earning best paper awards at leading AI conferences and top citation rankings. We are committed to pioneering transformative research that sets new standards for the industry, and we aim to change the world one breakthrough at a time.
What We Seek & Why You Should Join Us
We are looking for a Backend Engineer specializing in Inference Optimization who is passionate about tackling some of the hardest systems problems in AI. In this role, you will focus on improving the performance of foundation model inference, working at the intersection of machine learning and high-performance systems engineering. This is an exciting opportunity to set new standards for latency, throughput, and efficiency at scale.
Role Overview
As a Backend Engineer, you will own the design and optimization of inference pipelines for large-scale models. Working closely with researchers and infrastructure engineers, you will identify bottlenecks, implement techniques such as quantization and KV caching, and ship high-performance serving systems to production. Your work will directly shape how quickly and cost-effectively users interact with next-generation AI.
What We Expect From You
Essential Qualifications:
Extensive experience in optimizing model inference pipelines, including model quantization and KV caching.
Strong proficiency in backend systems and high-performance programming languages (Python, C++, or Rust).
Familiarity with distributed serving, GPU acceleration, and large-scale system architectures.
Proven ability to debug complex performance issues across model, runtime, and hardware layers.
Adaptability to work in fast-paced environments with ambitious technical objectives.
Preferred Qualifications:
Practical experience with vLLM or similar inference frameworks.
Background in GPU kernel optimization (CUDA, Triton, ROCm).
Experience in scaling inference across multi-node or heterogeneous clusters.
Prior experience with model compilation toolchains (e.g., TensorRT, TVM, ONNX Runtime).
Hands-on experience with model quantization strategies.