Software Engineer Frontier Clusters Infrastructure jobs in San Francisco – Browse 5,702 openings on RoboApply Jobs

Software Engineer, Frontier Clusters Infrastructure

OpenAISan Francisco

On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.

Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Mid to Senior

Qualifications

Proven expertise in operating and scaling Kubernetes clusters or equivalent container orchestration systems in large-scale environments. Strong programming skills in relevant languages such as Python, Go, or similar. Experience with bare-metal provisioning and management. Familiarity with networking and data center infrastructure. Excellent problem-solving skills and the ability to work in fast-paced environments.

About the job

About the Team

Join the innovative Frontier Systems team at OpenAI, where we design, implement, and maintain the world's largest supercomputers, essential for advancing our most groundbreaking model training initiatives.

We transform data center blueprints into operational systems while crafting the software necessary for executing large-scale frontier model trainings.

Our mission is to establish, stabilize, and ensure the reliability and efficiency of these hyperscale supercomputers throughout the training of our frontier models.

About the Role

We are seeking passionate engineers to manage the next generation of compute clusters that underpin OpenAI’s frontier research.

This position merges distributed systems engineering with practical infrastructure work across our expansive data centers. You will scale Kubernetes clusters to unprecedented levels, automate bare-metal setups, and create the software layer that simplifies the complexity of numerous nodes across various data centers.

Your work will be at the crossroads of hardware and software, where speed and reliability are paramount. Be prepared to oversee dynamic operations, swiftly identify and resolve pressing issues, and constantly elevate the standards for automation and uptime.

In this role, you will:

Provision and scale extensive Kubernetes clusters, including automation for deployment, bootstrapping, and lifecycle management
Create software abstractions that integrate multiple clusters and provide a cohesive interface for training workloads
Oversee node deployment from bare metal to firmware upgrades, ensuring rapid, repeatable setups at scale
Enhance operational metrics by reducing cluster restart times (e.g., from hours to minutes) and expediting firmware and OS upgrade cycles
Integrate networking and hardware health systems to ensure end-to-end reliability across servers, switches, and data center infrastructure
Develop monitoring and observability systems to identify issues early and maintain cluster stability under high loads

You might thrive in this role if you:

Have extensive experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
Possess strong programming skills in languages relevant to cloud and infrastructure management

About OpenAI

At OpenAI, we are at the forefront of artificial intelligence research, dedicated to advancing technology for the benefit of humanity. Our Frontier Systems team is pivotal in pushing the boundaries of what's possible with supercomputing, creating scalable and efficient systems that empower our groundbreaking AI models.

Similar jobs

1 - 20 of 5,702 Jobs

Select all on this page (20)

Apply

Software Engineer, Frontier Clusters Infrastructure

OpenAI

Full-time|On-site|San Francisco

About the TeamJoin the innovative Frontier Systems team at OpenAI, where we design, implement, and maintain the world's largest supercomputers, essential for advancing our most groundbreaking model training initiatives.We transform data center blueprints into operational systems while crafting the software necessary for executing large-scale frontier model trainings.Our mission is to establish, stabilize, and ensure the reliability and efficiency of these hyperscale supercomputers throughout the training of our frontier models.About the RoleWe are seeking passionate engineers to manage the next generation of compute clusters that underpin OpenAI’s frontier research.This position merges distributed systems engineering with practical infrastructure work across our expansive data centers. You will scale Kubernetes clusters to unprecedented levels, automate bare-metal setups, and create the software layer that simplifies the complexity of numerous nodes across various data centers.Your work will be at the crossroads of hardware and software, where speed and reliability are paramount. Be prepared to oversee dynamic operations, swiftly identify and resolve pressing issues, and constantly elevate the standards for automation and uptime.In this role, you will:Provision and scale extensive Kubernetes clusters, including automation for deployment, bootstrapping, and lifecycle managementCreate software abstractions that integrate multiple clusters and provide a cohesive interface for training workloadsOversee node deployment from bare metal to firmware upgrades, ensuring rapid, repeatable setups at scaleEnhance operational metrics by reducing cluster restart times (e.g., from hours to minutes) and expediting firmware and OS upgrade cyclesIntegrate networking and hardware health systems to ensure end-to-end reliability across servers, switches, and data center infrastructureDevelop monitoring and observability systems to identify issues early and maintain cluster stability under high loadsYou might thrive in this role if you:Have extensive experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environmentsPossess strong programming skills in languages relevant to cloud and infrastructure management

Nov 7, 2024

Apply

Software Engineer, Frontier Systems

OpenAI

Full-time|On-site|San Francisco

About Our TeamThe Frontier Systems team at OpenAI is at the forefront of technology, responsible for creating, deploying, and maintaining some of the world's largest supercomputers. These supercomputers are pivotal for training our most advanced AI models, pushing the boundaries of innovation.We transform sophisticated data center designs into operational systems and develop the software infrastructure necessary for extensive frontier model training. Our goal is to ensure these hyperscale supercomputers operate reliably and efficiently, supporting groundbreaking AI research.About the RoleAs a key member of the Frontier Systems team, you will be instrumental in designing the critical infrastructure that ensures our supercomputers function seamlessly for pioneering AI research. In this role, you'll address system-level challenges and implement automation solutions that minimize disruptions during large-scale training processes.Your responsibilities will encompass end-to-end ownership of your projects, allowing you to make significant contributions to our mission. This position is ideal for individuals who excel in diagnosing complex system issues and crafting automation strategies to proactively resolve problems across a vast network of machines.Your Responsibilities Include:Enhancing system health checks to maintain the stability of our hyperscale supercomputers during model training.Conducting in-depth investigations into hardware failures and system-level bugs to uncover root causes.Developing automation tools that monitor and resolve issues across thousands of systems, enabling uninterrupted research progress.You May Be a Great Fit If You Possess:7+ years of hands-on experience in software engineering.Strong proficiency in Python and shell scripting.Expertise in analyzing complex data sets using SQL, PromQL, Pandas, or other relevant tools.Experience in creating reproducible analyses.A solid balance of skills in both building and operationalizing systems.Prior experience with hardware is not a prerequisite for this role.Preferred Qualifications:Familiarity with the intricacies of hardware components, protocols, and Linux tools (e.g., PCIe, Infiniband, networking, power management, kernel performance tuning).Experience with system optimization and performance tuning.

May 9, 2025

Apply

Software Engineer, Frontier AI Infrastructure

Scale AI

Full-time|$138K/yr - $259.4K/yr|On-site|San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC

Scale AI is on the lookout for an exceptionally talented and driven Software Engineer, Frontier AI Infrastructure to become an integral part of our innovative Public Sector Engineering team. In this role, you will take charge of the model inference layer, enabling cutting-edge AI models, troubleshooting the latest AI tools, managing networking tasks, addressing latency issues, and monitoring pricing and usage metrics for AI models. You will spearhead technical discussions with cloud vendors and clients to fulfill critical contracts and resolve platform challenges. Additionally, you will collaborate closely with Product teams to anticipate feature requirements, transitioning from reactive 'infra-only debugging' to proactive integration testing.Your Responsibilities Include:Designing and implementing secure, scalable backend systems tailored for Public Sector clients, utilizing Scale's advanced cloud-native AI infrastructure.Owning services or systems while defining long-term health objectives and enhancing the health of related components.Redesigning the architecture to operate in compliant or restrictive environments, which entails creating swappable components (authentication, storage, logging) to adhere to government and security regulations without compromising product integrity.Collaborating with Product teams to develop integration tests that identify issues early, shifting focus from 'infra-only debugging' to preventing upstream failures.Actively participating in customer engagements, liaising with stakeholders to comprehend requirements and deliver innovative solutions.Contributing to the platform roadmap and product strategy for Scale AI's Public Sector division, playing a vital role in shaping the future trajectory of our offerings.

Mar 26, 2026

Apply

Software Engineer, Infrastructure

Sierra

Full-time|On-site|San Francisco, CA

About UsAt Sierra, we are revolutionizing the way businesses engage with their customers by building a cutting-edge platform that harnesses the power of AI. Our headquarters is located in the vibrant city of San Francisco, with additional offices expanding in Atlanta, New York, London, France, Singapore, and Japan.Our company culture is deeply rooted in our core values: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and foster an environment where innovation thrives.Sierra was co-founded by visionary leaders Bret Taylor, who currently serves as the Board Chair of OpenAI and has a rich history with Salesforce and Facebook, and Clay Bavor, who previously led Google Labs and spearheaded initiatives like Google Lens and Project Starline.Your RoleAs a Software Engineer focusing on Infrastructure at Sierra, you will play a pivotal role in designing, constructing, and maintaining the foundational systems that empower our AI platform. Your expertise will ensure that our infrastructure is not only secure and reliable but also scalable, allowing product teams to execute their work with agility and confidence.Guarantee the reliability, scalability, and performance of our platform and LLM inference serving in response to increasing traffic demands.Develop and oversee cloud infrastructure using Terraform to create secure, scalable, and reproducible environments.Establish and manage a self-service infrastructure platform to empower engineering teams in deploying and operating services independently.Take ownership of and improve CI/CD pipelines and release management processes, facilitating rapid and reliable deployments across Sierra’s platform.Design and manage distributed systems utilizing distributed databases, retrieval systems, and machine learning models.Develop and sustain core data serving abstractions along with essential authentication and security features (SSO, RBAC, authentication controls).Effectively navigate and integrate our technology stack with enterprise customer environments in a scalable and maintainable manner.

Oct 15, 2025

Apply