
Principal Engineer, Compute Fleet Management at Databricks | Bellevue, WA

Databricks | Bellevue, Washington
On-site | Full-time | $264.3K/yr - $322.3K/yr

Experience Level

Senior

Qualifications

We are looking for a highly experienced Principal Engineer who has both built and successfully overseen large-scale, mission-critical infrastructure systems. Ideal candidates will have:

  • Extensive experience in cloud computing environments, particularly AWS, Azure, and GCP.
  • A strong background in designing resilient architectures for high-availability systems.
  • Proven expertise in fleet management and resource optimization techniques.
  • Exceptional problem-solving skills and the ability to lead engineering initiatives.

About the job

At Databricks, our mission is to empower data teams to tackle some of the most pressing challenges of our time—ranging from revolutionizing transportation to fast-tracking medical advancements. We achieve this by developing and maintaining a cutting-edge data and AI infrastructure platform that enables our customers to harness deep data insights and enhance their business operations.

About the Team: Become Part of Databricks' Core Infrastructure

The Compute Infra Team is the powerhouse behind all Databricks products and Control Plane services. We design and scale the essential compute infrastructure that ensures the success of every Databricks customer, managing one of the largest and most dynamic data and AI clouds globally.

Mission: Shape the Future of Cloud Compute Efficiency and Scalability

As the Technical Lead for Compute Fleet Management, you'll play a pivotal role in setting benchmarks for how Databricks utilizes and optimizes compute resources across the three major cloud platforms: AWS, Azure, and GCP. This critical position directly influences our gross margin and customer satisfaction. Your responsibilities will encompass:

  • Leading Fleet Optimization: Provisioning and pooling billions of cloud resources to achieve peak performance, unmatched efficiency, and solid resource isolation.
  • Ensuring Hyper-Scale Resilience: Designing architecture that guarantees horizontal scalability and resilience against failures at both zonal and cloud account levels, ensuring Databricks remains operational at all times.
  • Owning the Critical Path: Directing the development of low-dependency systems essential for bootstrapping and managing our extensive compute platform.

Outcomes: The Impact You Will Create

This position is ideal for an engineer who excels in taking ownership of the most complex and impactful challenges:

  • High Availability: Achieving and maintaining 99.99% availability for all batch and serving workloads.
  • Exceptional Efficiency: Driving utilization rates to 60% or higher—a vital metric that necessitates balancing high efficiency with a robust tolerance for cloud failures.
  • Top-Tier Isolation: Architecting and enforcing stringent security and performance isolation across a wide array of customer workloads.

About Databricks

Databricks is at the forefront of data innovation, providing a powerful platform that allows organizations to unlock the full potential of their data through AI and data analytics. Our commitment to excellence and passion for solving complex problems drive our continuous improvement and success.
