Software Engineer, Frontier Clusters Infrastructure

OpenAI, San Francisco
On-site, Full-time

Experience Level

Mid to Senior

Qualifications

  • Proven expertise in operating and scaling Kubernetes clusters or equivalent container orchestration systems in large-scale environments

  • Strong programming skills in relevant languages such as Python, Go, or similar

  • Experience with bare-metal provisioning and management

  • Familiarity with networking and data center infrastructure

  • Excellent problem-solving skills and the ability to work in fast-paced environments

About the job

About the Team

Join the innovative Frontier Systems team at OpenAI, where we design, implement, and maintain the world's largest supercomputers, essential for advancing our most groundbreaking model training initiatives.

We turn data center blueprints into operational systems and build the software needed to run large-scale frontier model training.

Our mission is to establish, stabilize, and ensure the reliability and efficiency of these hyperscale supercomputers throughout the training of our frontier models.

About the Role

We are seeking passionate engineers to manage the next generation of compute clusters that underpin OpenAI’s frontier research.

This position combines distributed systems engineering with hands-on infrastructure work across our data centers. You will scale Kubernetes clusters to unprecedented sizes, automate bare-metal provisioning, and build the software layer that hides the complexity of managing large numbers of nodes across multiple data centers.

Your work will be at the crossroads of hardware and software, where speed and reliability are paramount. Be prepared to oversee dynamic operations, swiftly identify and resolve pressing issues, and constantly elevate the standards for automation and uptime.

In this role, you will:

  • Provision and scale extensive Kubernetes clusters, including automation for deployment, bootstrapping, and lifecycle management

  • Create software abstractions that integrate multiple clusters and provide a cohesive interface for training workloads

  • Oversee node deployment from bare metal to firmware upgrades, ensuring rapid, repeatable setups at scale

  • Enhance operational metrics by reducing cluster restart times (e.g., from hours to minutes) and expediting firmware and OS upgrade cycles

  • Integrate networking and hardware health systems to ensure end-to-end reliability across servers, switches, and data center infrastructure

  • Develop monitoring and observability systems to identify issues early and maintain cluster stability under high loads

You might thrive in this role if you:

  • Have extensive experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments

  • Possess strong programming skills in languages relevant to cloud and infrastructure management

About OpenAI

At OpenAI, we are at the forefront of artificial intelligence research, dedicated to advancing technology for the benefit of humanity. Our Frontier Systems team is pivotal in pushing the boundaries of what's possible with supercomputing, creating scalable and efficient systems that empower our groundbreaking AI models.
