Senior Site Reliability Engineer

alembic | San Francisco HQ

On-site Full-time

Experience Level

Senior

Qualifications

Key Responsibilities

Design, develop, and sustain scalable infrastructure to facilitate real-time analytics and machine learning workloads.
Enhance system reliability and performance through automation, observability, and proactive capacity planning.
Lead the evolution of CI/CD pipelines, deployment automation, rollback mechanisms, and configuration management.
Establish and maintain monitoring, alerting, and incident response protocols, including SLOs, runbooks, and on-call rotations.
Foster collaboration across engineering and data science teams to promote a culture of performance and reliability.
Ensure security, compliance, and operational readiness of our cloud infrastructure.
Drive post-incident analyses and continuous improvement efforts.

About the job

About the Role

Join alembic as a Senior Site Reliability Engineer (SRE) and become an integral part of our mission to enhance platform reliability, observability, and operational excellence. In this pivotal role, you will collaborate with engineers and data scientists to architect, automate, and maintain the robust infrastructure that drives our platform, including data pipelines, machine learning workloads, and real-time analytics systems.

This hands-on position offers significant visibility across the technology stack and provides you with the opportunity to shape the future of our infrastructure and operations.

About alembic

Alembic is dedicated to building innovative solutions that empower organizations to harness the full potential of their data. Our team is committed to fostering a collaborative and dynamic work environment where creativity and technical excellence thrive.

Quality Engineer - Rack Infrastructure & Site Operations

OpenAI

Full-time|On-site|San Francisco

About the Team

At OpenAI, we are revolutionizing the future of artificial intelligence. Together with our trusted capital and technology partners, we are constructing a state-of-the-art network of advanced datacenters tailored to meet the rigorous demands of AI workloads. Our Industrial Compute team is dedicated to ensuring that all datacenter systems are man…

Mar 24, 2026

Rack Product Engineer, AI Rack Infrastructure at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

About Our Team

At OpenAI, we are forging a global network of cutting-edge datacenters in collaboration with our technology and capital partners to meet the challenges of the most demanding AI workloads. The Industrial Compute team is dedicated to designing, manufacturing, and deploying datacenter infrastructure systems that prioritize reliability, scalability, and performance.

Our team collaborates with engineering departments, manufacturing partners, construction teams, and datacenter operations to ensure that our rack systems are efficiently built, validated, and deployed across our rapidly expanding global infrastructure.

Our responsibilities encompass product definition, design validation, manufacturing readiness, and field deployment, all aimed at ensuring that our rack infrastructure meets the performance, reliability, and operational requirements of OpenAI's compute platforms.

About This Role

We are on the lookout for a skilled Rack Product Engineer to spearhead the technical development, manufacturing readiness, and lifecycle performance of rack infrastructure utilized across OpenAI's datacenters. You will serve as the engineering subject matter expert, overseeing aspects of management, testing, quality assurance, and manufacturing engineering to scale operations and maintain on-time delivery.

This position operates at the intersection of hardware design, manufacturing, supplier engagement, and datacenter deployment. You will collaborate closely with compute, mechanical, power, and networking teams to define rack architectures that are manufacturable, scalable, and operationally reliable.

Your role will involve partnering with contract manufacturers and suppliers to ensure that rack systems are built to specifications while also driving internal design improvements, resolving field issues, and supporting rapid deployment of infrastructure on a global scale.

Travel Requirements

This position may require domestic and international travel as needed, estimated at 30%, to manufacturing sites, supplier facilities, and datacenter deployments.

Mar 16, 2026

Site Reliability Engineer - AI Infrastructure

Andromeda Cluster

Full-time|Remote|Global Remote / San Francisco, CA

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster, established by Nat Friedman and Daniel Gross, aims to democratize access to advanced AI infrastructure for early-stage startups, previously exclusive to hyperscalers. Our journey began with a single managed cluster that quickly reached capacity, propelling us to develop robust systems, networking, and orchestration layers to make AI infrastructure more accessible than ever.

Today, we collaborate with top AI laboratories, data centers, and cloud service providers to deliver compute resources precisely when and where they're needed the most. Our platform efficiently manages the routing of training and inference jobs across a global supply chain, facilitating flexibility and efficiency in one of the most rapidly expanding markets worldwide.

Our vision is to create a liquidity layer for global AI compute: a marketplace that dynamically moves the infrastructure and workloads essential for AGI, akin to the capital flows in global financial markets.

We are on the lookout for talented individuals who excel in AI infrastructure, research, and engineering to join our pioneering team.

Your Responsibilities

Provision, configure, and manage Kubernetes clusters for clients across various service providers.
Develop automation tools to enhance the deployment and integration of clusters.
Troubleshoot customer issues related to networking, storage, scheduling, and system layers.
Enhance the reliability and scalability of training and inference infrastructures.
Design and implement monitoring, alerting, and observability solutions for critical systems.
Work collaboratively with engineering and product teams to strategize and deliver infrastructure for new services.
Engage in on-call duties and incident response, leading postmortems and reliability enhancements.

Ideal Candidate Profile

A minimum of 5 years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles.
Solid foundation in Linux systems and networking principles.
Extensive expertise in Kubernetes and container orchestration at scale.
Proficient in Infrastructure-as-Code methodologies (Terraform, Helm, etc.).

Nov 6, 2025

Senior Site Reliability Engineer for AI Infrastructure

Andromeda

Full-time|Remote|Global Remote / San Francisco, CA

Join Andromeda as a Senior Site Reliability Engineer specializing in AI Infrastructure. In this pivotal role, you will be responsible for ensuring the reliability, scalability, and performance of our cutting-edge AI systems. Collaborate with cross-functional teams to design and implement robust infrastructure solutions that support our innovative AI initiatives. Your expertise will play a crucial role in maintaining optimal service availability and improving system performance.

Apr 9, 2026

Product Infrastructure Engineer - Site Reliability

Zyphra

Full-time|On-site|San Francisco

Zyphra is a cutting-edge artificial intelligence firm located in the heart of San Francisco, California.

The Opportunity:

As a Product Infrastructure Engineer specializing in Site Reliability, your primary focus will be on architecting and sustaining the frameworks that ensure Zyphra's infrastructure remains strong, observable, secure, and scalable. Your contributions will be pivotal in guaranteeing the dependability and reproducibility of machine learning workloads, managing deployment safety, and ensuring the long-term viability of our computational environments.

Your Responsibilities:

Enhancing and developing observability systems (monitoring, logging, alerting)
Creating resilient build and deployment systems across both research and production settings
Establishing secure release protocols with comprehensive audit trails and rollback capabilities
Collaborating closely with ML engineers, DevOps, and infrastructure teams to optimize system reliability and performance
Leading incident response efforts, conducting root-cause analysis, and facilitating postmortems with a strong emphasis on learning and prevention

This position is perfect for individuals who are passionate about creating systems that empower other teams to be faster, safer, and more efficient.

Qualifications:

Proven experience in high-performance computing environments, such as machine learning clusters or GPU farms
Strong background in infrastructure as code tools (e.g., Ansible, Terraform)
Familiarity with software release engineering tailored for ML/AI systems is advantageous
Experience in designing reliable environments for experimental workloads and reproducible executions
Understanding of compliance and auditing standards related to deployment and system security
Experience with load testing, fault injection, and chaos engineering to strengthen systems under pressure
A passion for developing tools that render infrastructure seamless and reliable for end users

Preferred Qualifications:

Experience with infrastructure as code (e.g., Ansible, Terraform)
Previous experience supporting ML/AI infrastructure, including GPU management and workload optimization
Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)

Aug 22, 2025

Senior Site Reliability Engineer - Infrastructure Security

MongoDB, Inc.

Full-time|On-site|Austin; San Francisco; Seattle; United States

Join MongoDB as a Senior Site Reliability Engineer specializing in Infrastructure Security. In this pivotal role, you'll be at the forefront of ensuring the reliability and security of our cloud infrastructure. Your expertise will help us to design and maintain systems that are robust, efficient, and secure, providing critical support to our engineering teams.

Your responsibilities will include monitoring system performance, implementing security protocols, and troubleshooting incidents to maintain high availability. You will collaborate with cross-functional teams to enhance our security posture, ensuring that our services are resilient and secure.

Mar 26, 2026

Infrastructure Operations Engineer

Baseten

Full-time|Remote|San Francisco

Join Baseten as an Infrastructure Operations Engineer and become an integral part of our innovative team. In this role, you will be responsible for maintaining and enhancing our infrastructure, ensuring optimal performance and reliability. You will work collaboratively with cross-functional teams to develop and implement solutions that drive efficiency and scalability.

If you are passionate about infrastructure management and seek to make a significant impact in a fast-paced environment, we want to hear from you!

Mar 10, 2026

Site Reliability Engineer - Infrastructure for Analytics Platform

OpenAI

Full-time|On-site|San Francisco

About the Team

The Scaling team at OpenAI is dedicated to designing, constructing, and managing essential infrastructure that propels research forward. Our mission is straightforward: to expedite the advancement of research toward Artificial General Intelligence (AGI). We achieve this by developing foundational systems that our researchers depend on, which range from fundamental infrastructure components to tailored applications for research. These systems are designed to scale with the growing complexity and volume of our workloads while maintaining reliability and user-friendliness.

About the Role

We are in search of a skilled Site Reliability Engineer to take ownership of our production-critical infrastructure from start to finish. This role focuses on managing data-intensive, low-latency workloads, particularly involving large-scale ClickHouse clusters, high-throughput Kafka pipelines, and dependable integrations with Snowflake. You will transform unclear operational challenges into actionable plans, deliver practical solutions swiftly, and refine them based on production feedback and iterations.

The ideal candidate will have the ability to independently establish and elevate operational standards across teams while remaining actively engaged with production systems.

Key Responsibilities

Oversee the lifecycle management of infrastructure, including provisioning, upgrades, scaling, and decommissioning with an Infrastructure as Code (IaC) approach.
Manage and scale ClickHouse clusters, focusing on sharding, replication, capacity planning, performance tuning, and maintenance.
Operate Kafka as the data ingestion backbone, enhancing throughput, lag management, backpressure handling, and failure recovery.
Enhance end-to-end latency and reliability for data-heavy serving and querying workloads.
Develop and sustain robust monitoring and alerting systems: SLIs/SLOs, dashboards, alert policies, and actionable runbooks.
Establish, implement, and continuously refine incident response protocols, on-call practices, and postmortem evaluations.
Manage backup/restore and disaster recovery strategies, including regular recovery drills.
Plan and execute safe rollouts across various environments (development, staging, production), including canary deployments and rollback strategies.
Collaborate daily with software engineers to embed reliability within design, implementation, and release processes.
Set the benchmark for operational readiness and runbook standards, driving their adoption across teams.
Enhance CI/CD pipelines and developer experience for improved speed and safety.

Apr 28, 2026

Infrastructure Engineering Lead, IT

OpenAI

Full-time|On-site|San Francisco

About Our Team

The Infrastructure Engineering team operates within the IT department, dedicated to the reliable construction, deployment, and management of critical on-premises and hybrid environments that empower our internal services and vital research and development projects.

This newly established team is committed to implementing rigorous Site Reliability Engineering (SRE) practices in environments where uptime, safety, recoverability, and security are paramount. We aim to replace unique, one-off infrastructure with standardized infrastructure-as-code components that enhance reliability and operational efficiency as OpenAI continues to grow.

About This Role

We are in search of an Infrastructure Engineering Lead who will architect, build, and maintain reliable, secure, and scalable infrastructure that supports identity, access, endpoint, and shared platform services throughout the organization.

You will take full ownership of infrastructure and identity systems from conceptual design and provisioning to policy enforcement, upgrades, recovery, and ongoing operations. Your goal will be to develop robust, production-grade platforms that minimize operational hurdles, enforce security by default, and empower teams to work more effectively and confidently.

This position is ideal for a senior engineer who excels in navigating ambiguity, relishes the challenge of overseeing complex systems from start to finish, and enhances reliability and security by transforming fragile implementations into standardized, repeatable infrastructure.

This role is based at our San Francisco headquarters and requires in-office attendance.

Key Responsibilities:

Define and refine infrastructure patterns for on-prem and hybrid environments, including self-hosted platforms, vendor-supported systems, and lab settings.
Establish standardized, production-grade deployment and operational models that replace custom-built solutions.
Collaborate with IT, Security, Identity, and Network teams to ensure infrastructure is designed to meet reliability, security, and access standards.
Design and enhance the production architecture for Identity and Access Management (IAM) adjacent platforms, such as Microsoft Entra, utilizing SRE principles.
Develop common management protocols and shared resources within Azure subscriptions to ensure uniformity and policy compliance in operations.

Jan 30, 2026

Manufacturing Quality Engineer for Datacenter Infrastructure at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

About Our Team

At OpenAI, we are at the forefront of innovation, collaborating with our capital and technology partners to develop a cutting-edge global network of data centers tailored to meet the rigorous demands of AI workloads. The Infrastructure Quality team plays a pivotal role in ensuring that all systems are manufactured, delivered, and commissioned to meet the highest benchmarks of quality, reliability, and performance.

Our team works collaboratively with manufacturing partners, general contractors, engineering teams, and operations staff to guarantee that every component is primed for installation, startup, and sustainable service. Our responsibilities encompass vendor qualification through to commissioning, ensuring operational readiness across our expansive global portfolio.

Role Overview

We are on the lookout for a skilled Manufacturing Quality Engineer (MQE) who will be instrumental in establishing, implementing, and managing a quality program focused on manufacturing for datacenter infrastructure. Your role will involve vendor oversight, quality assurance, process optimization, and addressing issues across all critical systems.

You will conduct vendor audits, monitor performance metrics, and coordinate corrective actions aimed at minimizing risks, enhancing predictability, and ensuring dependable delivery schedules. By fostering partnerships with vendors, construction teams, and internal stakeholders, you will contribute to the timely and high-standard delivery of OpenAI’s data centers.

Main Responsibilities

Conduct comprehensive vendor audits, evaluations, and performance reviews throughout the production, inspection, testing, and delivery phases.
Develop and monitor quality metrics that assess manufacturing performance and identify emerging trends.
Collaborate closely with vendors to refine processes, enhance training programs, and improve quality controls prior to shipment.
Establish and maintain a datacenter-focused manufacturing quality program that supports rapid global deployment timelines.
Incorporate quality requirements into sourcing, design, and construction workflows.
Lead and guide vendor and contractor quality teams, ensuring compliance with OpenAI standards.
Collaborate with engineering, sourcing, construction, and operations teams to align quality processes with project goals.
Mentor vendor teams and provide structured feedback to drive ongoing improvement.
Lead corrective and preventive action plans, including root cause analysis for recurring issues.

May 1, 2026

Senior Site Reliability Engineer at Crusoe | San Francisco, CA

Crusoe

Full-time|$172K/yr - $209K/yr|On-site|San Francisco, CA - US

At Crusoe, our mission is to drive the future of energy and intelligence. We are developing the infrastructure that empowers ambitious AI creations without compromising on scale, speed, or sustainability.

Join us in leading the AI revolution through sustainable technology. At Crusoe, you will be at the forefront of innovation, contributing to impactful projects and collaborating with a team dedicated to transforming cloud infrastructure responsibly.

About This Role:

As a Senior Site Reliability Engineer, you will play a crucial role in ensuring the operational excellence of Crusoe’s energy-efficient, AI-optimized GPU cloud. Your focus will be on maintaining stability, resilience, and performance, driving initiatives that enhance our cloud platform.

This position is perfect for engineers who thrive in dynamic environments, relish the challenge of solving operational issues, and seek to advance their technical careers while enhancing incident response and reliability for a large-scale distributed platform.

You will collaborate closely with senior SREs, infrastructure engineers, and platform teams to bolster reliability, minimize operational toil, and refine our incident management processes.

What You’ll Be Working On:

Work with cross-functional teams to establish and enhance availability metrics for our cloud infrastructure, including the development, tracking, and improvement of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Assist in incident response by diagnosing and resolving service disruptions, while supporting post-incident processes through root cause analysis documentation and participation in reviews.
Build, maintain, and monitor the health of our infrastructure using Crusoe’s observability tools (Prometheus, Grafana, Alertmanager, OpenTelemetry).
Identify and communicate reliability risks and performance bottlenecks, along with early indicators of potential incidents that may impact service availability.
Develop automation and tools to reduce operational toil, minimize manual processes, and improve service recovery and self-healing capabilities.
Collaborate with compute, network, storage, and platform teams to enhance service resilience and strengthen disaster recovery preparedness.
Engage in knowledge sharing and contribute to the development of operational best practices across the organization.

Dec 5, 2025

Infrastructure & Site Reliability Engineer at Atomic Semi | San Francisco

Atomic Semi

Full-time|$125K/yr - $195K/yr|On-site|San Francisco Office

About Atomic Semi

Atomic Semi is pioneering the development of a compact and agile semiconductor fabrication facility.

With today’s technology, alongside a few innovative simplifications, we are capable of realizing this vision. We will create our own tools, allowing for rapid iterations and enhancements.

Our goal is to assemble a small, exceptional team of hands-on engineers to drive this initiative forward. Our team is composed of experts in mechanical, electrical, hardware, computer, and process engineering. We will manage the entire stack, from atoms to architecture, with a forward-thinking approach that pushes the boundaries of technology.

Our philosophy emphasizes that smaller, faster, and self-built systems are superior.

We are confident that our team and lab can create anything we envision. Equipped with 3D printers, diverse microscopes, e-beam writers, and general fabrication tools, we are committed to inventing whatever tools we may need along the way.

Founded by Sam Zeloof and Jim Keller, Atomic Semi combines Sam's garage chip-making prowess with Jim's extensive 40-year leadership in the semiconductor industry.

About the Role

We are in search of an Infrastructure & Site Reliability Engineer to design, construct, deploy, and oversee the on-premises backend infrastructure that drives our rapid semiconductor fabrication process.

This multifaceted role encompasses all elements of backend infrastructure and services.

Our infrastructure philosophy prioritizes minimalism, clarity, on-site operations, and proximity to hardware. Expect a focus on bare-metal Linux, systemd, and single-file binaries rather than extensive use of Docker, cloud services, or Kubernetes. Proficiency in Rust, Go, and Python will be beneficial.

We welcome candidates from various experience levels, ranging from outstanding early-career engineers to seasoned professionals. We are not fixed on a specific background; what is paramount is your proven ability to build real systems, enthusiasm for hands-on engineering, and a strong display of engineering excellence. If you are passionate about performance engineering, developing complex features from the ground up, and swiftly mastering new domains, this is an exciting opportunity for you.

A portfolio or GitHub account is generally required to apply: demonstrate the projects you’ve undertaken!

Feb 13, 2026

Senior/Staff Site Reliability Engineer

fal

Full-time|On-site|San Francisco

Join our dynamic team at fal as a Senior/Staff Site Reliability Engineer. In this key role, you will leverage your expertise to enhance our systems' reliability and performance. If you are passionate about building scalable systems and enjoy working in a collaborative environment, we want to hear from you!

Feb 23, 2026

Site Reliability Engineer (SRE)

Baseten

Full-time|On-site|San Francisco Office

ABOUT BASETEN

Baseten is at the forefront of powering mission-critical AI inference for some of the most innovative companies globally, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. We integrate cutting-edge applied AI research with a flexible infrastructure and intuitive developer tools to empower companies at the leading edge of AI to deploy sophisticated models effectively. With our recent $300M Series E funding round (supported by prominent investors such as BOND, IVP, Spark Capital, Greylock, and Conviction), we are rapidly expanding. Join our dynamic team and contribute to creating an essential platform for engineers to launch AI products with ease.

THE ROLE

As a Site Reliability Engineer, you will design and implement resilient systems and processes that ensure our infrastructure is scalable, reliable, and efficient. Your responsibilities will encompass everything from automating deployments and monitoring systems to enhancing performance and managing incidents effectively.

Collaboration is key; you will work closely with our users to understand their challenges in operationalizing machine learning, facilitating their onboarding onto our platform, and leveraging these insights to inform improvements to Baseten.

EXAMPLE INITIATIVES

As part of our Infrastructure team, you will engage in exciting projects such as:

Innovative multi-cloud capacity management
Optimizing inference on B200 GPUs
Implementing multi-node inference
Utilizing fractional H100 GPUs for efficient model serving

RESPONSIBILITIES

Design and maintain scalable infrastructures to support the deployment and operational needs of machine learning models.
Establish standards and best practices to enhance reliability and performance across the infrastructure.
Proactively identify and resolve reliability issues using monitoring and alerting systems.
Collaborate with cross-functional teams to apply best practices in infrastructure management and incident response.
Create automation scripts to streamline processes and reduce manual intervention.

Oct 9, 2025

Infrastructure Engineer at Descript | Remote

Descript

Full-time|$191K/yr - $250K/yr|Remote|Remote, San Francisco, California, United States

Join Our Team as an Infrastructure Engineer

At Descript, we are dedicated to revolutionizing the way audio and video content is created and edited. Our mission is to make this process fast, easy, and accessible for everyone. We are developing a state-of-the-art media editor that leverages real-time collaboration, innovative user experience, and advanced AI technology. We envision a future where content creation is seamless and enjoyable.

As an Infrastructure Engineer, you will take ownership of the reliability and performance of our production systems. You will lead initiatives that empower our engineering team to enhance the quality and efficiency of their work. Your role will involve managing and optimizing our core production infrastructure, which serves as the foundation for all engineering efforts. We seek individuals with a strong grasp of systems fundamentals, a passion for teaching and mentoring, and the ability to make strategic decisions. In this position, you will collaborate closely with engineering leadership to define what reliability means at Descript as we scale. If you are excited about shaping the future of a beloved product, this role is perfect for you!

Your Responsibilities:

Develop technical and business solutions that enhance the quality and reliability of our product features and systems.
Lead initiatives to improve the reliability of our core infrastructure, including production clusters, networking, databases, and observability systems.
Promote best practices during code reviews, technical design discussions, and launch planning.
Oversee our incident management and fire drill processes.
Work with engineering leadership to set goals and prioritize production reliability efforts.

Jan 22, 2026

Software Engineer, Site Reliability (SRE)

Sierra

Full-time|On-site|San Francisco, CA

About Us

At Sierra, we are pioneering a transformative platform that empowers businesses to forge authentic customer experiences through AI technology. Headquartered in the vibrant city of San Francisco, we also boast a dynamic presence in Atlanta, New York, London, France, Singapore, and Japan.

Our operations are anchored in core values that shape our culture: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and are integral to our mission.

Our visionary founders, Bret Taylor and Clay Bavor, bring unparalleled expertise. Bret, currently the Board Chair of OpenAI, previously co-led Salesforce and served as CTO at Facebook, while Clay led numerous initiatives at Google, including AR/VR projects and Google Workspace.

Your Role

In your capacity as a Software Engineer on the Site Reliability team, you will play a crucial role in establishing and enhancing the reliability, observability, and scalability of Sierra’s AI-centric infrastructure. Collaborating closely with our engineering and product teams, your goal is to ensure our systems remain highly available, efficient, and primed for growth.

Lead the development of Sierra’s observability stack (including monitoring, alerting, logging, and tracing) to provide engineers with critical insights into system health and performance.
Collaborate with product and platform engineers to architect systems that prioritize reliability and scalability from the outset, not as an afterthought.
Design and implement robust, scalable, and secure cloud infrastructure on AWS, employing Terraform and cutting-edge DevOps tools.
Enhance the reliability and scalability of our LLM deployments, ensuring they operate efficiently and cost-effectively.
Drive improvements in deployment pipelines, CI/CD tooling, and incident management processes to minimize downtime and accelerate response times.
Define and cultivate SRE practices within Sierra, shaping culture, tooling, and best practices across the engineering organization.

Qualifications

Bachelor's degree in Computer Science or a related field, or equivalent experience.
Proven experience in Site Reliability Engineering or a similar role, with a strong understanding of cloud infrastructure (AWS).
Proficiency in Terraform and modern DevOps practices.
Experience with observability tools and techniques: monitoring, alerting, logging, and tracing.
Strong problem-solving skills with a focus on scalability and performance optimization.
Excellent collaboration and communication skills, with the ability to work effectively in a team environment.

Oct 21, 2025

Site Reliability Engineer, Frontier Systems Infrastructure

OpenAI

Full-time|On-site|San Francisco

About Our Team

The Frontier Systems team at OpenAI is at the forefront of technological innovation, responsible for designing, deploying, and maintaining state-of-the-art supercomputers that power our most advanced model training initiatives. We transform innovative data center designs into fully functional systems and develop the necessary software to support extensive frontier model training.

Our mission is to ensure the stability and efficiency of these hyperscale supercomputers, providing an uninterrupted environment for the training of frontier models.

About the Opportunity

We are seeking passionate engineers to manage the next generation of compute clusters that fuel OpenAI’s leading-edge research. This role merges distributed systems engineering with practical infrastructure expertise across our expansive data centers. You will be tasked with scaling Kubernetes clusters to unprecedented levels, automating bare-metal deployments, and creating software solutions that simplify interactions across a multitude of nodes in various data centers.

You will operate at the confluence of hardware and software, where speed and reliability are of utmost importance. Prepare to oversee dynamic operations, swiftly diagnose and resolve critical issues, and continuously enhance automation and system uptime.

Key Responsibilities:

Deploy and scale substantial Kubernetes clusters, implementing automation for provisioning, bootstrapping, and lifecycle management.
Create software abstractions that integrate multiple clusters, delivering a seamless interface for training workloads.
Oversee node deployment from bare metal to firmware upgrades, ensuring swift and repeatable processes at scale.
Enhance operational metrics, striving to minimize cluster restart times (e.g., reducing from hours to minutes) and expedite firmware or OS upgrades.
Integrate networking and hardware health systems to ensure comprehensive reliability across servers, switches, and data center infrastructure.
Develop monitoring and observability systems that proactively identify issues and maintain cluster stability under peak loads.
Be prepared to perform at the level of a software engineer in execution and problem-solving.

You May Be a Great Fit If You:

Possess extensive experience in operating or scaling Kubernetes clusters or similar container orchestration systems.

Nov 3, 2025

Infrastructure Engineer

HappyRobot

Full-time|On-site|San Francisco

About HappyRobot

HappyRobot is pioneering the AI-native operating system for the real economy, bridging the gap between intelligence and action. By harnessing real-time truths, specialized AI workers, and orchestrating intelligence, we empower enterprises to manage complex, mission-critical operations with unprecedented autonomy.

Our AI OS accumulates knowledge, optimizes processes at every level, and evolves continually. Our initial focus is on supply chain and industrial-scale operations, where resilience, speed, and ongoing improvement are paramount, liberating humans to engage in strategy, creativity, and other high-value endeavors.

To explore our vision further, check out our Manifesto. To date, HappyRobot has successfully raised $62 million, including a recent $44 million in Series B funding in September 2025, with support from esteemed investors like Y Combinator (YC), Andreessen Horowitz (a16z), and Base10, partners dedicated to our mission of redefining enterprise operations. We are using this investment to build a world-class team of individuals with relentless drive, exceptional problem-solving skills, and a passion for pushing boundaries in a dynamic, high-intensity environment. If this resonates with you, we invite you to join us at HappyRobot.

About the Role

We are in search of an Infrastructure Engineer to spearhead the enhancement of our operational resilience as we scale. You will be responsible for the stability, observability, and debugging processes that ensure our systems operate seamlessly. As the primary troubleshooter for complex failures in real-time, you will design tools that transform chaos into clarity and assist in transitioning our operations from reactive to proactive.

This role carries significant impact and trust, as you will influence how we approach reliability: reducing incident frequency, creating internal tools, and directly enhancing developer focus and system uptime. If you thrive on uncovering the root causes of challenging issues and fortifying systems (and teams), this is your opportunity.

Dec 30, 2025

Senior Site Reliability Engineer

alembic

Full-time|On-site|San Francisco HQ

About the Role

Join alembic as a Senior Site Reliability Engineer (SRE) and become an integral part of our mission to enhance platform reliability, observability, and operational excellence. In this pivotal role, you will collaborate with engineers and data scientists to architect, automate, and maintain the robust infrastructure that drives our platform, including data pipelines, machine learning workloads, and real-time analytics systems.

This hands-on position offers significant visibility across the technology stack and provides you with the opportunity to shape the future of our infrastructure and operations.

Dec 22, 2025

Senior Staff Engineer in Cloud Site Operations

Crusoe Energy Systems

Full-time|On-site|San Francisco, CA - US

Join Crusoe Energy Systems as a Senior Staff Engineer in Cloud Site Operations and take your career to new heights. In this pivotal role, you will spearhead the design, implementation, and maintenance of our cloud infrastructure, ensuring the reliability and scalability of our operations. You will collaborate with cross-functional teams to optimize processes and implement innovative solutions.

Mar 23, 2026
