Software Engineer, Compute Infrastructure

xAI, Palo Alto, CA
On-site, Full-time, $180K/yr - $440K/yr

Experience Level

Mid to Senior

Qualifications

We are looking for candidates with a strong background in software engineering, particularly in compute infrastructure. Ideal candidates will have experience with large-scale systems, container orchestration, and performance optimization. You should demonstrate a passion for problem-solving and a commitment to excellence in your work.

About the job

About xAI

At xAI, we are driven by our mission to develop AI systems that profoundly understand the universe and assist humanity in its quest for knowledge. Our team is composed of passionate individuals who thrive on challenges and curiosity and who emphasize engineering excellence. We maintain a flat organizational structure where every member is expected to actively contribute to our mission. Leadership is earned through initiative and consistent delivery of excellence. A strong work ethic and good prioritization skills are important. Effective communication is essential, enabling team members to share insights and knowledge clearly.

About the Role

The Compute Infrastructure team at xAI designs, builds, and manages the large clusters and orchestration platforms that run cutting-edge AI training, inference, and agent workloads at an unprecedented scale. In this role, you will redefine container orchestration beyond current systems like Kubernetes, manage exascale computing resources, optimize for high-performance training runs and production services, and work closely with research and systems teams to deliver reliable, ultra-scalable infrastructure that powers xAI's next-generation models and applications.

Responsibilities

  • Construct and oversee large-scale clusters to host, persist, train, and serve AI workloads with exceptional reliability and performance.
  • Design, develop, and enhance an in-house container orchestration platform that surpasses off-the-shelf solutions in scalability, isolation, resource efficiency, and fault-tolerance.
  • Collaborate with research teams to architect and optimize compute clusters tailored for extensive training runs, inference services, and real-time applications.
  • Profile, debug, and resolve intricate system-level performance bottlenecks, resource contention, scheduling problems, and reliability issues across the entire stack.
  • Take ownership of end-to-end infrastructure initiatives, applying first-principles design, rigorous testing, automation, and continuous optimization to meet the demands of frontier AI compute.

About xAI

xAI is at the forefront of AI innovation, dedicated to creating intelligent systems that enhance human understanding and drive knowledge acquisition. Our small, dynamic team is committed to pushing the boundaries of technology while fostering a culture of collaboration and continuous improvement.
