About the job
About the Team
Join the innovative Frontier Systems team at OpenAI, where we design, implement, and maintain the world's largest supercomputers, essential for advancing our most groundbreaking model training initiatives.
We turn data center blueprints into operational systems and build the software needed to run large-scale frontier model training.
Our mission is to bring up and stabilize these hyperscale supercomputers, and to keep them reliable and efficient throughout the training of our frontier models.
About the Role
We are seeking passionate engineers to manage the next generation of compute clusters that underpin OpenAI’s frontier research.
This position combines distributed systems engineering with hands-on infrastructure work across our data centers. You will scale Kubernetes clusters to unprecedented levels, automate bare-metal setups, and build the software layer that hides the complexity of managing large numbers of nodes across multiple data centers.
Your work will sit at the intersection of hardware and software, where speed and reliability are paramount. Be prepared to run fast-moving operations, quickly diagnose and resolve urgent issues, and keep raising the bar for automation and uptime.
In this role, you will:
Provision and scale extensive Kubernetes clusters, including automation for deployment, bootstrapping, and lifecycle management
Create software abstractions that integrate multiple clusters and provide a cohesive interface for training workloads
Oversee node deployment from bare metal to firmware upgrades, ensuring rapid, repeatable setups at scale
Enhance operational metrics by reducing cluster restart times (e.g., from hours to minutes) and expediting firmware and OS upgrade cycles
Integrate networking and hardware health systems to ensure end-to-end reliability across servers, switches, and data center infrastructure
Develop monitoring and observability systems to identify issues early and maintain cluster stability under high loads
You might thrive in this role if you:
Have extensive experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
Possess strong programming skills in languages relevant to cloud and infrastructure management

