Software Engineer, Infrastructure Reliability at OpenAI | San Francisco

OpenAI, San Francisco
On-site Full-time

Experience Level

Entry Level

Qualifications

A Bachelor's degree in Computer Science or a related field is preferred. Candidates should have a strong foundation in coding, system design, and experience with tools used in distributed systems.

About the job

About Our Team

Join the Infrastructure organization at OpenAI, where we are seeking software engineers across several high-impact teams. Focus areas include Core Distributed Systems, Databases, Observability, and Cloud Infrastructure, so you can work on the problems that interest you most. Our teams operate with a high level of autonomy and a deeply collaborative culture, all dedicated to improving safety, reliability, and operational velocity across the organization.

About the Role

As a Software Engineer focused on Infrastructure Reliability, you will play a pivotal role in scaling and hardening the infrastructure that supports some of the world’s most widely used AI systems. Your work will keep our systems reliable, observable, performant, and secure, enabling researchers to iterate rapidly and allowing products like ChatGPT and the OpenAI API to serve millions of users.

This hands-on role suits engineers who enjoy ownership, excel at solving complex technical challenges across the entire stack, and want to contribute to systems that support cutting-edge research deployed at global scale. You will influence technical direction, improve system resilience, and collaborate closely with infrastructure, product, and research teams to turn complex infrastructure into dependable platforms.

Key Responsibilities

  • Design, build, and maintain reliable, high-performance systems used across engineering.

  • Identify and resolve performance bottlenecks and inefficiencies, ensuring our infrastructure scales appropriately.

  • Investigate and troubleshoot complex issues thoroughly.

  • Enhance automation to minimize manual tasks and improve internal developer tools.

  • Participate in incident response, postmortem analysis, and the development of best practices surrounding system reliability and scalability.

Ideal Candidate Profile

  • Possess a deep understanding of distributed systems principles, with a proven track record in developing and managing scalable, reliable systems.

  • Demonstrate a strong focus on performance and optimization, with the ability to maximize efficiency in complex, globally distributed systems.

  • Have experience managing orchestration systems such as Kubernetes at scale and creating abstractions over cloud platforms.

  • Be comfortable working within Linux environments and possess strong problem-solving skills.

About OpenAI

OpenAI is at the forefront of artificial intelligence research, dedicated to ensuring that AI benefits all of humanity. Our team fosters innovation and collaboration, and we are committed to building safe and effective AI systems.
