Site Reliability Engineer at Hammerhead AI | Redwood City

Hammerhead AI, Redwood City
On-site Full-time

Experience Level

Entry Level

Qualifications

  • Strong experience in site reliability engineering or a related field.
  • Proficiency in monitoring and performance tuning of cloud-based services.
  • Familiarity with automation tools and continuous integration/continuous deployment (CI/CD) practices.
  • Understanding of networking, cloud infrastructure, and data center operations.
  • Excellent problem-solving skills and a proactive mindset.
  • Effective communication and teamwork abilities.

About the job

Join Hammerhead AI

At Hammerhead AI, we are revolutionizing the AI landscape by providing intelligent orchestration solutions that tackle one of the most significant challenges in the field: power accessibility. Our innovative platform enhances data center power infrastructure to maximize AI workload capacity without necessitating the construction of new power facilities or grid expansions.

Using reinforcement learning, our platform orchestrates power, cooling, and computing resources in real time, allowing data centers to run a greater volume of AI workloads within their current electrical and thermal limits. With a track record of optimizing over 8 gigawatts of critical power globally at AutoGrid, we are pursuing a $64 billion annual market opportunity while significantly reducing the environmental impact of AI operations.

At Hammerhead, you will:

  • Engage at the nexus of AI, energy, and computing to shape the future of AI infrastructure.
  • Collaborate with specialists in modern reinforcement learning, AI, IoT, and infrastructure technologies.
  • Play a vital role in creating a sustainable future for AI computing.
  • Join a forward-thinking company pioneering modern data center operations.
  • Enjoy competitive compensation, equity, and benefits in a rapidly growing, purpose-driven environment.
  • Learn from a seasoned team with a history of building and successfully exiting startups.

Your Role

We are on the lookout for a Site Reliability Engineer to spearhead the reliability, scalability, and operational excellence of our AI-powered power orchestration platform. Our software operates in production data centers globally, where real-time decisions profoundly impact gigawatts of computing infrastructure. Metrics such as availability, latency, and accuracy are vital.

You will be working at the intersection of software and infrastructure, developing systems that deploy, monitor, and safeguard Hammerhead's platform in production. Collaborating with engineering teams, you will help establish service level objectives (SLOs), automate operational tasks, streamline releases, and ensure we can quickly diagnose and rectify any issues that arise.

As a foundational SRE, you will be the first dedicated hire in this area, setting the benchmark for Hammerhead's software operations in production.

You will report directly to the Head of Engineering.

Key Responsibilities

  • Take ownership of production reliability for Hammerhead's platform: define and enforce SLOs, SLAs, and error budgets across services, and lead resolution efforts when targets are missed.

About Hammerhead AI

Hammerhead AI is at the forefront of AI technology, focusing on optimizing energy efficiency in data centers through innovative power orchestration solutions. Our mission is to enable a sustainable future for AI operations while addressing significant market needs.
