About the job
Mithril develops AI infrastructure aimed at making GPU computing more accessible and affordable for enterprises, AI startups, and researchers. Clients include LG AI Research, Saronic, and the Broad Institute. The company was founded by a former Google DeepMind research scientist and a Stanford CS PhD. Mithril has secured $80M in seed and Series A funding from Sequoia Capital and Lightspeed Venture Partners. Over the past year, platform revenue has grown more than sixfold. Fast Company recognized Mithril as the 8th Most Innovative Company in Artificial Intelligence for 2026.
The engineering team at Mithril is small, and each member has a significant impact. This Site Reliability Engineer (SRE) position is a foundational role focused on shaping how the platform scales across a multi-cloud environment.
Role overview
The SRE will play a central role in keeping Mithril's global GPU orchestration platform stable and high-performing. The responsibilities extend beyond day-to-day maintenance: the primary focus is designing and building automation, observability, and tooling to manage advanced compute resources across multiple cloud providers, so that customers have fast and dependable access to infrastructure.
Collaboration with Mithril's founding team is central to this job. The SRE will help set service level objectives (SLOs), orchestrate capacity, and make influential infrastructure decisions, gaining visibility into both technical and commercial aspects of the business.
What makes this SRE role unique
This position differs from many early-stage SRE roles that focus mainly on on-call rotations and incident response. Here, the emphasis is on building infrastructure that actively shapes Mithril's marketplace. The systems developed will determine how supply is sourced, allocated, and monitored across providers, directly affecting customer experience and company revenue.
The role offers genuine ownership, a fast feedback loop with leadership, and the opportunity to define how infrastructure engineering evolves as Mithril grows.
Core responsibilities
About 70–75% of the work centers on platform reliability and infrastructure automation.
Reliability & SLOs
- Implement and manage service level indicators (SLIs) and service level objectives (SLOs) for Mithril's API layer and internal orchestration services to maintain high reliability and performance.
