About Our Team
The Stargate Infrastructure team at OpenAI builds and operates the systems that support cutting-edge AI workloads at an unprecedented scale. We deploy and manage clusters, networks, and data center infrastructure across both first-party and partner environments.
As our systems grow in complexity and scale, we are investing heavily in agentic systems and intelligent automation to improve how infrastructure is deployed, operated, and debugged. Our focus is on applying AI-driven methods to real-world infrastructure workflows so that execution is faster, reliability is higher, and operations scale.
About the Position
We are looking for an IC Agentic Engineering Manager to spearhead the development and implementation of agent-based systems for infrastructure delivery and operations within our Stargate team.
In this player-coach role, you will not only lead a small team but also engage directly in the design and implementation of systems. You will concentrate on integrating agentic systems into infrastructure workflows, including deployment orchestration, system initialization, issue triage, debugging, and capacity management.
This role focuses on applying agentic systems to concrete infrastructure problems, in close collaboration with hardware, networking, and clustering teams.
Key Responsibilities
Design and build agent-based systems that support infrastructure deployment and operations.
Identify high-impact opportunities for agent application across workflows, including:
    Cluster initialization and deployment readiness.
    Incident triage and root cause analysis.
    System validation and health monitoring.
    Capacity management and operational decision-making.
Lead a small team while also contributing as an individual contributor (IC) to system design, development, and integration.
Collaborate with infrastructure, hardware, and networking teams to incorporate agentic systems into production workflows.
Develop systems that use telemetry, logs, and system signals to enable closed-loop automation.
Establish evaluation frameworks to assess system performance, reliability, and operational impact.
Drive the transition from prototype to production, ensuring robustness and scalability.

