About the job
Your Role
The Fleet Reliability Operations team serves as the core of CoreWeave’s capacity delivery and maintenance initiatives. This team is tasked with provisioning, updating, and managing server nodes, and with running the processes and tools that configure and validate our server fleet. As the first responders to hardware issues in production, this team is empowered to drive automation and observability design throughout our server fleet lifecycle.
We are looking for an Operations Engineering Manager to join the Fleet Reliability Operations team. This role will be pivotal in maintaining and increasing our delivery volume as we expand our fleet tenfold. You will cultivate a robust talent pipeline, oversee onboarding and training, lead process development, and advocate for reliability and customer satisfaction. As the manager of this team, you will have the chance to:
- Establish and lead a 24/7 team of process-oriented engineers focused on reliability and observability.
- Facilitate the development and documentation of clear, consistent processes for provisioning, validating, and troubleshooting nodes in our server fleet.
- Critically assess and champion process and automation improvements, prioritizing event-driven automated remediation.
- Provide a 24/7 engineering support function for critical, time-sensitive node delivery and maintenance.
- Enhance our onboarding, documentation, enablement, and performance management programs to elevate team members' growth and capabilities.
- Foster a culture of accountability and performance measurement within your team.

