About xAI
At xAI, we are driven by our mission to develop AI systems that profoundly understand the universe and assist humanity in its quest for knowledge. Our team is composed of passionate, curious individuals who thrive on challenges and prize engineering excellence. We maintain a flat organizational structure in which every member is expected to contribute actively to our mission. Leadership is earned through initiative and the consistent delivery of excellent work, which demands a strong work ethic and sharp prioritization. Clear, effective communication is essential so that team members can share insights and knowledge.
About the Role
The Compute Infrastructure team at xAI designs, builds, and operates the large-scale clusters and orchestration platforms that power cutting-edge AI training, inference, and agent workloads at unprecedented scale. In this role, you will push container orchestration beyond current systems such as Kubernetes, manage exascale computing resources, optimize for high-performance training runs and production services, and work closely with research and systems teams to deliver reliable, ultra-scalable infrastructure for xAI's next-generation models and applications.
Responsibilities
- Construct and oversee large-scale clusters to host, persist, train, and serve AI workloads with exceptional reliability and performance.
- Design, develop, and enhance an in-house container orchestration platform that surpasses off-the-shelf solutions in scalability, isolation, resource efficiency, and fault-tolerance.
- Collaborate with research teams to architect and optimize compute clusters tailored for extensive training runs, inference services, and real-time applications.
- Profile, debug, and resolve intricate system-level performance bottlenecks, resource contention, scheduling problems, and reliability issues across the entire stack.
- Take ownership of end-to-end infrastructure initiatives employing first-principles design, rigorous testing, automation, and continuous optimization to meet the demands of frontier AI compute.

