About the Role:
Production Engineer
The Production Engineer at Rubrik is pivotal in ensuring operational excellence, managing alerts, addressing outages, and spearheading incident resolution as an Incident Manager. This position demands hands-on expertise in maintaining highly available critical services across multi-cloud environments while fostering continuous improvements through automation and intelligent monitoring.
What You Will Do:
- Become a key member of a 24/7 Production Operations team dedicated to managing and supporting vital infrastructure and services across multi-cloud environments.
- Monitor staging and production environments to maintain high uptime and reliability.
- Deploy and maintain comprehensive observability solutions for real-time monitoring, alerting, and metrics collection.
- Lead incident management initiatives by promptly responding to alerts and outages, coordinating teams for swift resolution.
- Investigate recurring incidents to identify root causes, reduce toil, and improve system resilience.
- Design and develop automation tools to proactively detect, triage, and rectify production issues.
- Update and maintain runbooks to facilitate incident response and address recurring issues.
- Exhibit strong decision-making abilities under pressure, managing critical situations with urgency and composure.

