About the job
Job Reference: P-1485
At Databricks, we are dedicated to empowering data teams to address the world’s most complex challenges — from transforming future transportation to spearheading medical advancements. We accomplish this by developing and managing the premier data and AI infrastructure platform, which enables our clients to extract profound insights from data and elevate their business strategies. Founded by engineers with a strong customer focus, we eagerly embrace every opportunity to tackle technical hurdles, whether it's designing cutting-edge UI/UX for data interaction or scaling our services across millions of virtual machines. Our journey has just begun.
As Incident Manager, you will lead Databricks’ most critical production incidents, ensuring clear, precise, and timely communication with customers, executives, and engineers. Serving as both incident commander and reliability engineer, you will orchestrate cross-team responses, provide real-time status updates, and collaborate with engineering to analyze failures and prevent their recurrence. Your contributions will be instrumental in maintaining Databricks’ technical resilience and building customer and stakeholder trust during critical events.
This position merges operational leadership, technical systems expertise, and outstanding communication skills. You will be positioned at the nexus of engineering acumen and operational transparency, guaranteeing that every significant incident is managed with accuracy, openness, and a commitment to ongoing enhancement.
Your Impact:
- Lead Critical Incidents: Coordinate cross-disciplinary response efforts across Databricks’ cloud services to swiftly mitigate impact and restore normal operations.
- Drive Technical Root Cause Analysis and Reliability Improvements:
  - Collaborate with engineering teams to trace and document underlying causes across distributed systems, services, and data stores.
  - Summarize key learnings, communicate action items clearly, and ensure the implementation of technical and procedural enhancements.
- Own Incident Communications: Provide regular, high-quality updates to internal stakeholders (executives, engineering leadership, support) and craft customer-facing notifications that are accurate, timely, and empathetic.
- Mentor and Train Peers: Enhance the overall quality of Databricks’ incident response by mentoring peers in incident communication and technical response disciplines.

