About This Role
Join Databricks as a Software Engineer focused on GenAI inference, where you will play a pivotal role in designing, developing, and enhancing the inference engine that drives our Foundation Model API. Collaborating at the intersection of research and production, you will ensure our large language model (LLM) serving systems are optimized for speed, scalability, and efficiency. Your contributions will span the entire GenAI inference stack, from kernels and runtimes to orchestration and memory management.
What You Will Do
- Participate in the design and implementation of the inference engine, collaborating on a model-serving stack tailored for large-scale LLM inference.
- Work closely with researchers to integrate new model architectures and features, such as sparsity, activation compression, and mixture-of-experts, into the engine.
- Optimize latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators.
- Build and maintain tools for instrumentation, profiling, and tracing to identify bottlenecks and inform optimization efforts.
- Develop scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads.
- Ensure reliability, reproducibility, and fault tolerance in inference pipelines, including A/B launches, rollbacks, and model versioning.
- Integrate with federated and distributed inference infrastructure, orchestrating across nodes, balancing load, and managing communication overhead.
- Collaborate cross-functionally with platform engineering, cloud infrastructure, and security/compliance teams.
- Document and share insights, contributing to internal best practices and open-source initiatives as appropriate.

