Senior Software Engineer, Site Reliability Engineer (SRE)

Harvey | San Francisco

On-site Full-Time

Qualifications

Proven experience in software engineering with a focus on site reliability or infrastructure.
Strong proficiency in programming languages such as Python, Go, or Java.
Experience with cloud platforms like AWS, Azure, or Google Cloud.
Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes).
Demonstrated ability to troubleshoot complex systems and perform root cause analysis.
Excellent communication skills and a collaborative mindset.
A passion for automation and optimizing workflows.

About the job

Why Join Harvey?

At Harvey, we are revolutionizing the landscape of legal and professional services with a holistic approach. By integrating advanced AI technology, a robust enterprise platform, and extensive industry knowledge, we are redefining how essential knowledge work is conducted for years to come.

This is a unique opportunity to contribute to the foundation of a transformative company at a pivotal moment in its journey. With over 1,000 clients across more than 58 countries, a solid product-market fit, and outstanding investor backing, we are rapidly expanding and creating a new category in real time. The challenges are significant, expectations are high, and the potential for personal, professional, and financial development is unparalleled.

Our team comprises driven, intelligent individuals who are deeply passionate about our mission. We prioritize speed, intensity, and accountability in addressing challenges — from initial ideation to long-term solutions. By maintaining close relationships with our clients, from executives to engineers, we collaboratively address pressing issues with urgency and care. If you excel in uncertain environments, strive for excellence, and wish to shape the future of work alongside a team that raises the bar, we invite you to build alongside us.

At Harvey, we are currently writing the future of professional services — and we are just getting started.

Your Role

As a Senior Software Engineer on the Site Reliability team at Harvey, your mission will be to uphold the reliability, scalability, and performance of our innovative legal AI platform. You will become part of a high-impact team that operates at the crossroads of infrastructure and product, taking ownership of the systems that ensure our platform remains fast, secure, and continuously available. From scaling operations across 50+ regions to automating critical processes, your efforts will fortify Harvey's resilience as we expand. If you are enthusiastic about constructing robust systems and simplifying complexity through automation, we would love to collaborate with you.

This position is based in San Francisco, CA. We follow an in-person work model and provide relocation assistance to new employees.

Your Responsibilities

Design, implement, and oversee monitoring, alerting, and infrastructure resources (compute, storage, networking) across 50+ global regions.
Lead incident management processes, including postmortems and root cause analyses, and drive actionable improvements.
Automate operational tasks and workflows by developing tools and processes for capacity planning, seamless rollouts, and secure data access to maintain high reliability and minimize manual intervention.
Collaborate across teams to drive solutions that enhance system performance and reliability.

About Harvey

Harvey is at the forefront of transforming legal and professional services, merging cutting-edge AI with a powerful platform that streamlines operations across the board. With a commitment to innovation and excellence, we are setting the standard for the future of work in our industry. Our diverse team, dynamic culture, and unwavering focus on client success make us a unique place to grow your career and impact the world of professional services.

Dec 1, 2025

Apply

Senior Site Reliability Engineer

alembic

Full-time|On-site|San Francisco HQ

About the RoleJoin alembic as a Senior Site Reliability Engineer (SRE) and become an integral part of our mission to enhance platform reliability, observability, and operational excellence. In this pivotal role, you will collaborate with engineers and data scientists to architect, automate, and maintain the robust infrastructure that drives our platform, including data pipelines, machine learning workloads, and real-time analytics systems.This hands-on position offers significant visibility across the technology stack and provides you with the opportunity to shape the future of our infrastructure and operations.

Dec 22, 2025

Apply

Senior Site Reliability Engineer at Hyperbolic | San Francisco

Hyperbolic Labs

Full-time|On-site|San Francisco, CA

Who We AreAt Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.In anticipation of our growth following our Series A funding, our team — guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science — is set to transform the computing landscape.About the RoleWe are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. 
Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.

Mar 26, 2026

Apply

Senior Manager, Site Reliability Engineering

Tubi TV

Full-time|$227.2K/yr - $324.5K/yr|Hybrid|San Francisco, CA (Hybrid)

About the Role: At Tubi, our Site Reliability Engineering (SRE) team transcends traditional operations. We embody a software engineering ethos, leveraging a developer's toolkit to tackle the complexities of large-scale, distributed systems. Our core mission focuses on building resilience from the ground up, empowering our product teams to innovate swiftly while delivering an exceptional user experience. We oversee the availability, latency, performance, and capacity of our platform, driven by a culture of data-informed decision-making, blameless learning, and relentless automation. We are on the lookout for a seasoned and visionary Senior Manager of SRE to lead and expand our newly formed Site Reliability Engineering team. You will be more than just a people manager or tech lead; you will be the strategic architect behind our reliability roadmap. Your role will involve building and mentoring a team of skilled engineers, cultivating an environment of blameless learning and continuous improvement, while advocating for the engineering practices that balance rapid innovation with unwavering stability. You will play a pivotal role within our engineering leadership, collaborating with peers across the organization to embed reliability as a shared responsibility and a fundamental principle of our engineering culture.

Mar 17, 2026

Senior Site Reliability Engineer

Hive

Full-time|On-site|San Francisco

About Hive

Hive stands at the forefront of cloud-based AI innovation, providing cutting-edge solutions that enable organizations to understand, search, and generate content. Our platform is relied upon by some of the world's most prestigious and forward-thinking companies. We empower developers with an extensive suite of state-of-the-art, pre-trained AI models that handle billions of API requests each month. In addition to our robust model offerings, we deliver comprehensive software applications backed by proprietary AI models and datasets, unlocking transformative applications in various sectors such as content moderation, brand protection, sponsorship measurement, and context-based advertising.

With over $120 million in funding from esteemed investors like General Catalyst, 8VC, Glynn Capital, Bain & Company, and Visa Ventures, Hive has cultivated a vibrant global team of over 250 employees across our San Francisco, Seattle, and Delhi offices. If you’re passionate about shaping the future of AI, we invite you to join our dynamic team!

DevOps and Systems Team

In response to our distinctive machine learning demands, we have developed our own data centers focusing on distributed high-performance computing with GPU integration. While we harness the power of these data centers, our infrastructure remains hybrid, leveraging public cloud solutions when advantageous. As we scale our machine learning models for commercial use, we are expanding our DevOps and Site Reliability team to ensure the reliability of our enterprise SaaS offerings. Our ideal candidate thrives in dynamic environments, embraces automation, and believes that every task can be automated and every server can scale. You take pride in enhancing performance across all layers of our stack and are committed to never performing the same task manually twice.

Apr 20, 2022

Senior/Staff Site Reliability Engineer

fal

Full-time|On-site|San Francisco

Join our dynamic team at fal as a Senior/Staff Site Reliability Engineer. In this key role, you will leverage your expertise to enhance our systems' reliability and performance. If you are passionate about building scalable systems and enjoy working in a collaborative environment, we want to hear from you!

Feb 23, 2026

Senior Site Reliability Engineer at Carta | San Francisco, CA

Carta

Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA

Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.

Apr 3, 2026

Site Reliability Engineer III

Veeam Software

Full-time|On-site|San Francisco Bay, CA, USA

Join Veeam Software as a Site Reliability Engineer III, where you'll be at the forefront of ensuring the reliability, scalability, and performance of our software solutions. You will leverage your expertise in system administration and programming to improve our infrastructure and automate processes, making Veeam a leader in cloud data management.

Mar 22, 2026

Software Engineer, Generative AI Platform

Abridge

Full-time|On-site|SF Office

About Abridge

Founded in 2018, Abridge is dedicated to enhancing understanding in healthcare through our innovative AI-powered platform. We specialize in transforming medical conversations into structured clinical notes in real-time, enabling clinicians to prioritize patient care. Our enterprise-grade technology seamlessly integrates with electronic medical records (EMRs) to ensure accuracy and trust in AI-generated summaries.

As pioneers in generative AI for healthcare, we are setting the industry benchmarks for responsible AI deployment across health systems. Our diverse team consists of practicing MDs, AI scientists, PhDs, creatives, technologists, and engineers united in their mission to empower patients and make healthcare more comprehensible. We have offices located in San Francisco's Mission District, New York's SoHo neighborhood, and East Liberty in Pittsburgh.

The Role

Join us as an AI Platform Engineer, where your work will significantly impact the healthcare sector. You will collaborate with a multidisciplinary team of researchers, clinical scientists, and product engineers to design and develop the runtime, orchestration engine, and evaluation platform necessary for agentic orchestration and LLM-driven workflows.

What You’ll Do

Create GenAI systems that transform LLMs into composable, reliable tools, utilizing retrieval, tool use, agentic reasoning, and structured outputs.
Develop a highly reliable and scalable agent runtime that includes orchestration, shared state and memory, tool-calling interfaces, and scheduling focused on cost, latency, and quality.
Build secure, sandboxed environments for agent actions and code, optimizing for cold start, isolation, and observability.
Deliver unified interfaces for multiple model sizes and providers; integrate with open tool ecosystems such as MCP-style connectors.
Create an evaluation platform for both online and offline assessments, A/B testing, safety checks, and regression gates that enhance agent reliability over time.
Collaborate with Research to bring new agent capabilities from prototype to production.

What You’ll Bring

Demonstrated experience in building agent applications with tool-calling, context engineering, and related technologies.
Strong problem-solving skills and the ability to work in a fast-paced, collaborative environment.
Familiarity with generative AI technologies and their applications in healthcare.

Oct 7, 2025

Senior Staff Site Reliability Engineer - Observability

Okta, Inc.

Full-time|$194K/yr - $267K/yr|On-site|San Francisco, California

Discover Okta

Okta is recognized as The World’s Identity Company, empowering individuals to securely leverage any technology across various devices and applications. Our versatile Okta Platform and Auth0 Platform provide reliable access, authentication, and automation, placing identity at the forefront of business security and expansion.

At Okta, we value diverse perspectives and experiences. We seek continuous learners and individuals who can enhance our team with their distinct backgrounds.

Join us as we create a world where identity is truly yours.

We are in search of a highly skilled Observability Site Reliability Engineer specializing in Google Cloud, to take charge of and elevate our Observability ecosystem within GCP. In this position, you will progress beyond basic monitoring to develop a world-class, comprehensive, and scalable Observability Platform that supports our SRE teams and business collaborators. You will implement infrastructure as code by employing Terraform and demonstrating strong coding skills in Go, Python, or Ruby to automate the deployment of agents and collectors across intricate distributed systems.

Key Responsibilities

Automated Infrastructure: Design, build, and maintain scalable observability infrastructure utilizing tools such as Terraform.
GCP Observability Engineering: Enhance the collection, processing, and storage of Observability data to guarantee high reliability and low latency for our Splunk and Grafana services.
Incident Response: Engage in on-call rotations and conduct post-incident reviews to foster systemic improvements and promote 'observability-driven development.'
Automation: Minimize 'toil' by automating the deployment and scaling of observability agents and collectors.

Mar 11, 2026

Senior Staff Site Reliability Engineer - Tech Lead

Unify

Full-time|On-site|San Francisco Office

Join Unify as a Senior Staff Site Reliability Engineer and take the lead in transforming our technology landscape. In this pivotal role, you will spearhead initiatives to enhance our system reliability and performance, ensuring seamless operations across our platforms. Your expertise will guide a dynamic team, driving innovation and implementing best practices in site reliability engineering.

Mar 24, 2026
