Product Infrastructure Engineer - Site Reliability

Zyphra, San Francisco
On-site, Full-time

About the job

Zyphra is a cutting-edge artificial intelligence firm located in the heart of San Francisco, California.

The Opportunity:

As a Product Infrastructure Engineer specializing in Site Reliability, your primary focus will be on architecting and sustaining the frameworks that ensure Zyphra's infrastructure remains strong, observable, secure, and scalable. Your contributions will be pivotal in guaranteeing the dependability and reproducibility of machine learning workloads, managing deployment safety, and ensuring the long-term viability of our computational environments.

Your Responsibilities:

  • Enhancing and developing observability systems (monitoring, logging, alerting)

  • Creating resilient build and deployment systems across both research and production settings

  • Establishing secure release protocols with comprehensive audit trails and rollback capabilities

  • Collaborating closely with ML engineers, DevOps, and infrastructure teams to optimize system reliability and performance

  • Leading incident response efforts, conducting root-cause analysis, and facilitating postmortems with a strong emphasis on learning and prevention

This position is perfect for individuals who are passionate about creating systems that empower other teams to be faster, safer, and more efficient.

Qualifications:

  • Proven experience in high-performance computing environments, such as machine learning clusters or GPU farms

  • Strong background in infrastructure as code tools (e.g., Ansible, Terraform)

  • Familiarity with software release engineering tailored for ML/AI systems is advantageous

  • Experience in designing reliable environments for experimental workloads and reproducible executions

  • Understanding of compliance and auditing standards related to deployment and system security

  • Experience with load testing, fault injection, and chaos engineering to strengthen systems under pressure

  • A passion for developing tools that render infrastructure seamless and reliable for end users

Preferred Qualifications:

  • Experience with infrastructure as code (e.g., Ansible, Terraform)

  • Previous experience supporting ML/AI infrastructure, including GPU management and workload optimization

  • Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)

About Zyphra

Zyphra is at the forefront of artificial intelligence innovation, dedicated to developing solutions that harness the power of AI to transform various industries. Based in San Francisco, we are committed to building a robust technology infrastructure that supports our cutting-edge research and applications.
