About the job
Join the Fleet team at OpenAI, where we empower groundbreaking research and innovative product development by maintaining a robust computing environment. Our team manages extensive systems that encompass data centers, GPUs, and networking, ensuring peak performance, high availability, and efficiency. Our mission is to facilitate the seamless operation of OpenAI's models at scale, supporting both internal research initiatives and external products such as ChatGPT, while prioritizing safety, reliability, and responsible AI deployment over unchecked expansion.
About the Position
As a Software Engineer specializing in Operating Systems & Orchestration, you will play a crucial role in building systems that manage our hardware, configurations, and vendors, and that support the teams using our infrastructure. Your work will involve designing and implementing solutions that fuse individual nodes and servers into cohesive clusters, directly enhancing the AI research experience. This role is located in San Francisco, CA, and follows a hybrid work model, requiring three days in the office each week. We also provide relocation assistance for new hires.
Key Responsibilities:
Architect and develop systems to manage extensive cloud and bare-metal infrastructures at scale.
Create tools that correlate low-level hardware metrics with high-level job scheduling and cluster management algorithms.
Utilize Large Language Models (LLMs) to streamline vendor operations and enhance infrastructure workflows.
Automate infrastructure processes to minimize repetitive tasks and bolster system reliability.
Work collaboratively with hardware, infrastructure, and research teams to ensure smooth integration across all components.
Continuously refine tools, automation, processes, and documentation to boost operational effectiveness.
Ideal Candidate Profile:
Demonstrates strong software engineering capabilities with experience in large-scale infrastructure environments.
Possesses extensive knowledge of cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud platforms).
Has deep expertise in server-level systems (e.g., systemd, containerization, Chef, Linux kernels, firmware management, host routing).
Is passionate about enhancing the performance and reliability of large compute fleets.
Thrives in fast-paced environments and is eager to tackle complex challenges.