About Contextual AI
At Contextual AI, we are at the forefront of transforming how AI agents operate by addressing one of the most significant challenges in the field: context. By providing the right context at the right moment, we enable enterprises to achieve the accuracy and scalability required for effective AI deployment. Our innovative enterprise AI development platform bridges cutting-edge AI research with the practical needs of developers, allowing them to seamlessly ingest, query, and integrate data from various enterprise sources into their workflows.
Founded by the pioneers of Retrieval-Augmented Generation (RAG), our technology forms the backbone of the context layer that connects foundational AI models with relevant real-time information. Supported by visionary venture capital, we are not merely participating in the enterprise AI revolution; we are leading it. Join us as we create a future in which AI not only answers queries but also revolutionizes business operations.
About the Role
As a Member of Technical Staff focused on LLM Systems & Performance, you will join a dedicated, high-impact team responsible for building and optimizing LLM systems end to end. Your work will range from developing Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) pipelines to creating high-throughput inference clusters for production environments. You will collaborate with both researchers and engineers to design advanced models and the supporting infrastructure for our context layer.
What You'll Do
- Enhance and optimize components of our SFT and RL training pipelines (e.g., Verl, SkyRL), focusing on areas such as data loading, training loops, logging, and evaluation.
- Contribute to the development of LLM inference infrastructure (e.g., vLLM, SGLang), including optimizations for batching, KV-cache management, scheduling, and serving.
- Utilize profiling tools like Nsight to analyze and improve end-to-end performance (throughput, latency, and compute, memory, and bandwidth utilization) by identifying and resolving bottlenecks.
- Engage with distributed training and inference systems using technologies such as NCCL, NVLink, and various parallelism strategies on multi-GPU clusters.
- Assist in experimenting with and implementing quantization techniques (e.g., INT8, FP8, FP4, mixed-precision) for both training and inference.
- Write and optimize GPU kernels in CUDA or Triton, applying techniques such as FlashAttention and leveraging hardware features like Tensor Cores as appropriate.
- Collaborate with researchers to advance ideas from concept to prototype, through scaled experiments, and into production.

