About the job
About xAI
At xAI, we're on a mission to develop AI systems that not only understand the universe but also empower humanity in its quest for knowledge. Our team is compact, driven, and dedicated to engineering excellence. We welcome individuals who thrive on intellectual challenges and are deeply curious. Operating within a flat organizational structure, we encourage all team members to be hands-on in contributing to our mission. Leadership opportunities are available for those who demonstrate initiative and consistently deliver outstanding results. A strong work ethic and prioritization skills are paramount. Effective communication is a must, as team members should be adept at sharing insights and knowledge with their peers.
About the Team
The Observability team is responsible for constructing and managing the essential infrastructure that allows engineers to monitor, troubleshoot, and enhance the performance and reliability of their systems. We process telemetry at an enormous scale, managing billions of time series and petabytes of logs, all while adhering to rigorous performance and availability standards.
About the Role
As part of a dynamic and impactful team, you will play a vital role in developing and maintaining xAI’s observability platform. You will take ownership of critical systems that support metrics, logs, tracing, and alerting, enabling engineering teams to operate services at scale, identify issues before they affect users, and drive systemic improvements in reliability.
What You’ll Do
- Design and implement scalable observability infrastructure for metrics, logging, and tracing.
- Build high-performance telemetry pipelines capable of handling high ingestion volumes.
- Develop APIs, query engines, and user interfaces that deliver real-time insights into services.
- Establish and reinforce best practices for instrumentation, alerting, and reliability throughout the organization.
- Collaborate with infrastructure and product teams to seamlessly integrate observability into our internal platforms.
- Maintain end-to-end ownership of the reliability, scalability, and performance of the observability stack.

