About the job
Zyphra is a cutting-edge artificial intelligence firm located in the heart of San Francisco, California.
The Opportunity:
As a Product Infrastructure Engineer specializing in Site Reliability, you will architect and maintain the frameworks that keep Zyphra's infrastructure robust, observable, secure, and scalable. Your work will be central to ensuring the reliability and reproducibility of machine learning workloads, managing deployment safety, and sustaining the long-term health of our compute environments.
Your Responsibilities:
Enhancing and developing observability systems (monitoring, logging, alerting)
Creating resilient build and deployment systems across both research and production settings
Establishing secure release protocols with comprehensive audit trails and rollback capabilities
Collaborating closely with ML engineers, DevOps, and infrastructure teams to optimize system reliability and performance
Leading incident response efforts, conducting root-cause analysis, and facilitating postmortems with a strong emphasis on learning and prevention
This position is perfect for individuals who are passionate about creating systems that empower other teams to be faster, safer, and more efficient.
Qualifications:
Proven experience in high-performance computing environments, such as machine learning clusters or GPU farms
Strong background in infrastructure as code tools (e.g., Ansible, Terraform)
Familiarity with software release engineering tailored for ML/AI systems is advantageous
Experience in designing reliable environments for experimental workloads and reproducible executions
Understanding of compliance and auditing standards related to deployment and system security
Experience with load testing, fault injection, and chaos engineering to strengthen systems under pressure
A passion for building tools that make infrastructure seamless and reliable for end users
Preferred Qualifications:
Previous experience supporting ML/AI infrastructure, including GPU management and workload optimization
Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)