Site Reliability Engineer at Hammerhead AI | Redwood City
Hammerhead AI
Full-time | On-site | Redwood City

Join Hammerhead AI

At Hammerhead AI, we are revolutionizing the AI landscape by providing intelligent orchestration solutions that tackle one of the most significant challenges in the field: power accessibility. Our innovative platform enhances data center power infrastructure to maximize AI workload capacity without necessitating the construction of new power facilities or grid expansions.

Utilizing reinforcement learning, our platform orchestrates power, cooling, and computing resources in real-time, allowing data centers to efficiently manage a greater volume of AI workloads within their current electrical and thermal parameters. With a track record of optimizing over 8 gigawatts of critical power globally at AutoGrid, we are seizing a $64 billion annual market opportunity while significantly minimizing the environmental impact of AI operations.

At Hammerhead, you will:
Engage at the nexus of AI, energy, and computing to shape the future of AI infrastructure.
Collaborate with specialists in modern reinforcement learning, AI, IoT, and infrastructure technologies.
Play a vital role in creating a sustainable future for AI computing.
Join a forward-thinking company pioneering modern data center operations.
Enjoy competitive compensation, equity, and benefits in a rapidly growing, purpose-driven environment.
Learn from a seasoned team with a history of building and successfully exiting startups.

Your Role

We are on the lookout for a Site Reliability Engineer to spearhead the reliability, scalability, and operational excellence of our AI-powered power orchestration platform. Our software operates in production data centers globally, where real-time decisions profoundly impact gigawatts of computing infrastructure. Metrics such as availability, latency, and accuracy are vital.

You will be working at the intersection of software and infrastructure, developing systems that deploy, monitor, and safeguard Hammerhead's platform in production. Collaborating with engineering teams, you will help establish service level objectives (SLOs), automate operational tasks, streamline releases, and ensure we can quickly diagnose and rectify any issues that arise.

As a foundational SRE, you will be the first dedicated hire in this area, setting the benchmark for Hammerhead's software operations in production.

You will report directly to the Head of Engineering.

Key Responsibilities

Take ownership of production reliability for Hammerhead's platform: define and enforce SLOs, SLAs, and error budgets across services, and lead resolution efforts when targets are missed.
Apr 11, 2026