About the job
At Databricks, our mission is to empower data teams to tackle some of the most pressing challenges of our time, from revolutionizing transportation to accelerating medical advances. We do this by building and operating a cutting-edge data and AI infrastructure platform that lets our customers draw deep insights from their data and improve their business operations.
About the Team: Become Part of Databricks' Core Infrastructure
The Compute Infra Team is the powerhouse behind all Databricks products and Control Plane services. We design and scale the essential compute infrastructure that ensures the success of every Databricks customer, managing one of the largest and most dynamic data and AI clouds globally.
Mission: Shape the Future of Cloud Compute Efficiency and Scalability
As the Technical Lead for Compute Fleet Management, you'll play a pivotal role in setting the standard for how Databricks uses and optimizes compute resources across the three major cloud platforms: AWS, Azure, and GCP. This position directly influences our gross margin and customer satisfaction. Your responsibilities will include:
- Leading Fleet Optimization: Provisioning and pooling billions of cloud resources to achieve peak performance, unmatched efficiency, and solid resource isolation.
- Ensuring Hyper-Scale Resilience: Designing an architecture that scales horizontally and withstands failures at both the zone and cloud-account level, so that Databricks remains operational at all times.
- Owning the Critical Path: Directing the development of low-dependency systems essential for bootstrapping and managing our extensive compute platform.
Outcomes: The Impact You Will Create
This position is ideal for an engineer who excels at taking ownership of the most complex and impactful challenges:
- High Availability: Achieving and maintaining 99.99% availability for all batch and serving workloads.
- Exceptional Efficiency: Driving utilization rates to 60% or higher, a vital metric that requires balancing high efficiency against a robust tolerance for cloud failures.
- Top-Tier Isolation: Architecting and enforcing stringent security and performance isolation across a wide array of customer workloads.

