About the job
Zyphra is a cutting-edge artificial intelligence firm located in the heart of San Francisco, California.
The Opportunity:
As a Product Infrastructure Engineer specializing in Site Reliability, you will architect and maintain the frameworks that keep Zyphra's infrastructure robust, observable, secure, and scalable. Your work will be central to ensuring the reliability and reproducibility of machine learning workloads, managing deployment safety, and sustaining the long-term health of our compute environments.
Your Responsibilities:
Enhancing and developing observability systems (monitoring, logging, alerting)
Creating resilient build and deployment systems across both research and production settings
Establishing secure release protocols with comprehensive audit trails and rollback capabilities
Collaborating closely with ML engineers, DevOps, and infrastructure teams to optimize system reliability and performance
Leading incident response efforts, conducting root-cause analysis, and facilitating postmortems with a strong emphasis on learning and prevention
This position is perfect for individuals who are passionate about creating systems that empower other teams to be faster, safer, and more efficient.
Qualifications:
Proven experience in high-performance computing environments, such as machine learning clusters or GPU farms
Strong background in infrastructure as code tools (e.g., Ansible, Terraform)
Familiarity with software release engineering tailored for ML/AI systems is advantageous
Experience in designing reliable environments for experimental workloads and reproducible executions
Understanding of compliance and auditing standards related to deployment and system security
Experience with load testing, fault injection, and chaos engineering to strengthen systems under pressure
A passion for building tools that make infrastructure seamless and reliable for end users
Preferred Qualifications:
Previous experience supporting ML/AI infrastructure, including GPU management and workload optimization
Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)