1 - 20 of 2,268 Jobs

Search for Site Reliability Engineer (CloudOps) at Axon | Australia


Axon
Full-time|Remote|Australia

Join Axon and be a Force for Good. At Axon, we’re driven by a mission to Protect Life. We are innovators tackling some of society’s most pressing safety and justice challenges with our advanced ecosystem of devices and cloud software. Our collaborative environment thrives on diverse perspectives, fostering connections built on honesty and empathy. Life at Axon…

Mar 31, 2026
Axon
Full-time|Remote|Australia

Become a Catalyst for Change at Axon. At Axon, our mission is to Protect Life. We are pioneers tackling society's most pressing safety and justice challenges with our suite of devices and cloud-based software. Like our products, we believe in collaboration, fostering open communication and valuing diverse viewpoints from our clients, communities, and one another. Working at Axon is fast-paced, fulfilling, and impactful. Here, you will take charge and effect genuine change. You'll continuously develop while contributing to a mission that matters in a company where your input is valued.
Your Contribution
As an enthusiastic, technically adept, and personable Sales Engineer at Axon, you will serve as a trusted advisor regarding the value and technical aspects of Axon's industry-leading solutions. Your expertise in Software as a Service (SaaS), on-premise software, networking, installation, and configuration will empower you to support significant Proof of Concepts (PoCs), assisting Axon in our mission to Protect Life and Accelerate Justice. As a member of the Axon ANZ Sales Engineering Team, you will help convey the value of and deploy evidence capture devices and digital evidence management solutions, both SaaS and on-premise, to Public Safety Agencies throughout Australia. This role is customer-facing and quota-carrying, allowing you to influence product enhancements and the implementation of Axon Enterprise’s offerings.
Don’t worry if you don't meet every requirement; at Axon, we aim high. We dream big with a long-term vision because we aspire to reshape the world into a safer, better place. We are dedicated to building diverse teams that reflect the communities we serve. Research indicates that women and people of color often hesitate to apply for jobs unless they check every box in the description. If you're excited about this role and our mission to Protect Life but feel your experience doesn’t perfectly align with every qualification, we encourage you to apply anyway. You might be just the candidate we need!

Mar 27, 2026
Axon
Full-time|On-site|Australia

Join Axon and be a Force for Good. At Axon, our mission is to safeguard lives through innovative technology. We are pioneers addressing vital safety and justice challenges using a comprehensive ecosystem of devices and cloud-based solutions. Our collaboration is key; we connect with honesty and empathy, embracing diverse perspectives from our customers and communities. Life at Axon is dynamic, rewarding, and impactful. Here, you will have the opportunity to take charge and create meaningful change, all while growing alongside a team that is dedicated to a mission that truly matters.
Your Impact
As a Strategic Account Executive at Axon, you will be tasked with driving the sales of our cutting-edge products and services to large, complex Law Enforcement and Public Safety agencies. This role is externally focused and involves meeting sales quotas. You will need to confidently articulate intricate solutions, foster and maintain relationships with senior stakeholders, navigate across various customer agencies, and lead the Axon team towards achieving success. Your portfolio will include a significant number of customers across Australia, tackling major sales opportunities that can reach into the 7- and 8-digit range. Your achievements will hinge on your ability to build exceptional relationships at both operational and executive levels, both internally and externally.
At Axon, we aim to ensure every individual feels valued for their contributions to our mission of protecting lives. We seek intelligent, driven individuals eager to make a remarkable impact. We cultivate an environment that encourages success and makes work enjoyable. Join us in our fast-paced, challenging, and fulfilling journey.
What You’ll Do
Manage and expand revenue and market share within designated agencies to ensure customer satisfaction and meet Axon’s strategic goals. Develop and nurture client relationships to drive revenue growth. Collaborate with your account team to formulate and execute an account strategy that offers compelling value propositions. Foster customer relations and guarantee effective service delivery to accounts. Focus on customer satisfaction by understanding their business and workflows while establishing a proper contact network within accounts. Support the execution of strategy at the account level. Engage with experts and specialists as needed. Oversee all aspects of account management and growth.

Mar 27, 2026
ClickHouse
Full-time|Remote|Australia (Remote)

About ClickHouse
Celebrated on the 2025 Forbes Cloud 100 list, ClickHouse stands as one of the most pioneering and rapidly expanding private cloud enterprises. With over 3,000 clients and an Annual Recurring Revenue (ARR) that has soared by more than 250 percent year-over-year, ClickHouse is a frontrunner in real-time analytics, data warehousing, observability, and AI workloads. The company's sustained growth was recently validated by a significant $400 million Series D funding round. In just three months, notable clients such as Capital One, Lovable, Decagon, Polymarket, and Airwallex have adopted or expanded their use of our platform. These clients join an esteemed roster of AI innovators and global brands like Meta, Cursor, Sony, and Tesla. We are on a mission to revolutionize how organizations harness data. Join us on this exciting journey!
About the Role
We are dedicated to delivering our customers reliable and secure services, which is why we are expanding our central Site Reliability Engineering team. In this role, you will lead initiatives to guarantee the reliability, availability, scalability, and performance of our cloud infrastructure. You will collaborate with teams such as Control Plane, Data Plane, Core, Security, Support, and Operations, guiding them to design and implement scalable, secure, highly available, and fault-tolerant distributed systems. You will also take ownership of incident management, response, and post-mortem analysis, including conducting blameless postmortems and driving continuous improvement of our Cloud services. Utilize your software engineering expertise to develop software platforms and tools that enhance the operational and engineering efficiencies of ClickHouse Cloud. This is a unique opportunity to make a significant impact on our elastic, limitless scale, high-performance ClickHouse Cloud.
Your Responsibilities:
Collaborate with various engineering teams at ClickHouse to design and implement scalable, secure, and highly available systems. Establish and oversee service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud. Ensure all infrastructure components in ClickHouse Cloud (including Data Plane, Control Plane, and ClickHouse Core) have monitoring and alerting systems in place for timely incident detection and resolution. Refine incident response processes and post-mortem analyses for outages in ClickHouse Cloud, including liaising with the support team to communicate effectively with impacted customers. Drive continuous improvement of the reliability and performance of our ClickHouse services. Plan, enable, and lead Chaos Engineering initiatives.

Mar 13, 2026
IMC Trading
Full-time|On-site|Sydney, Australia

The Platform Engineering division at IMC Trading is dedicated to enhancing the productivity of technology teams by offering self-service tools, comprehensive documentation, and robust support. This team is tasked with the design, construction, and upkeep of the foundational runtime platforms essential for IMC's software applications. Our mission is to optimize development workflows, create a consistent technical framework globally, and provide teams with the resources they need to innovate effectively.
As a global entity, Platform Engineering serves as a crucial link between the technical demands of application development and the operational challenges of deploying and maintaining these applications in live environments. Our goal is to reduce friction and ensure that engineering teams can function seamlessly, driving our initiatives forward.
We are currently seeking a dynamic and dedicated Site Reliability Engineer who will be pivotal in enhancing and optimizing our developer services infrastructure. You will become part of a highly skilled team that supports a range of critical systems, including source control, continuous integration pipelines, and observability tools, all integral to the stability and performance of our trading platforms.

Mar 12, 2026
AlayaCare
Full-time|On-site|Brisbane, Queensland, Australia

Role overview
AlayaCare is hiring a Senior Site Reliability Engineer in Brisbane, Queensland. This position focuses on maintaining and improving the reliability and performance of AlayaCare’s cloud-based solutions. The role involves close collaboration with teams across the company to design, build, and support scalable systems that keep services running smoothly.

Apr 17, 2026
Future Secure AI
Full-time|On-site|Sydney

Future Secure AI creates AI Co-Workers that help enterprises automate operational tasks. The team builds production systems designed to handle real-world scale and reliability. Daily work emphasizes disciplined engineering, resilience, and a culture that encourages both collaboration and individual growth. Leadership maintains an entrepreneurial approach and remains accessible to support employees.
Role overview
The Site Reliability Engineer, based in Sydney, will design, build, and maintain the infrastructure that powers AI Co-Workers. This position works closely with product, AI, and engineering teams. The role involves taking responsibility for system reliability throughout the entire lifecycle.
Key responsibilities
Design, build, and manage reliable production infrastructure for AI Co-Workers. Oversee Kubernetes-based platforms for deploying and running AI workloads. Create and maintain infrastructure as code using Terraform. Implement and manage Helm-based deployment workflows. Define, measure, and improve system reliability using SLIs, SLOs, and SLAs. Participate in on-call rotations, handle incident response, conduct root cause analysis, and contribute to post-mortem reviews. Reduce operational toil through automation and engineering improvements. Develop and enhance observability, including monitoring, logging, and alerting. Work with engineers to keep systems resilient, scalable, and secure. Manage tasks across build, deploy, and operate phases of the software lifecycle.

Apr 21, 2026
ClickHouse
Full-time|Remote|Australia (remote)

About ClickHouse
Featured on the prestigious 2025 Forbes Cloud 100 list, ClickHouse stands at the forefront of innovation in the private cloud sector. Boasting over 3,000 clients and an astonishing annual recurring revenue (ARR) growth of more than 250% year-on-year, ClickHouse excels in real-time analytics, data warehousing, observability, and AI workloads. Our continuous and rapid growth was recently affirmed by a remarkable $400M Series D funding round. In the past three months, notable clients such as Capital One, Lovable, Decagon, Polymarket, and Airwallex have either adopted our platform or expanded their existing deployments, joining a distinguished clientele of AI pioneers and global brands like Meta, Cursor, Sony, and Tesla. Join us in our mission to revolutionize the way businesses harness data. Be part of our exciting journey!
Note: This position is fully remote and can be based in any country where ClickHouse has a hiring presence.
At ClickHouse, we prioritize delivering reliable and secure services to our customers. To support this commitment, we are expanding our Site Reliability Engineering team within ClickHouse Core. As a pivotal member of our Core Reliability Engineering Team, you will spearhead initiatives to ensure and enhance the reliability, availability, scalability, and performance of ClickHouse. You will collaborate with teams such as Control Plane, Dataplane, Security, Support, and Operations to guide them in implementing ClickHouse optimally for our customers. Furthermore, you will manage engineering escalation processes, conduct investigations, lead blameless post-mortem analyses, and drive continuous improvements in how ClickHouse operates and is optimized in the cloud. This role presents a unique opportunity to make a significant impact on our high-performance ClickHouse Cloud.

Apr 2, 2026
Future Secure AI
Full-time|On-site|Sydney

About Future Secure AI
Future Secure AI creates AI Co-Workers that automate key operational tasks for enterprises. Our systems run at scale in production, where reliability and disciplined engineering matter every day. The company values courage, rigor, and curiosity, and maintains an entrepreneurial, approachable leadership style. Professional growth is a priority here, and team members are supported to do their best work.
Role Overview: Site Reliability Engineer (Sydney)
This Site Reliability Engineer role focuses on building and supporting the platforms behind our AI Co-Workers. The position suits someone who takes responsibility for reliability from start to finish and enjoys collaborating with product, AI, and engineering teams.
What You Will Do
Design, build, and maintain production infrastructure for AI Co-Workers. Manage Kubernetes-based platforms for deploying and running AI workloads. Use Terraform to create and maintain infrastructure as code. Implement and manage deployment workflows using Helm. Define, measure, and improve system reliability with SLIs, SLOs, and SLAs. Participate in on-call rotation, incident response, root cause analysis, and post-mortem reviews. Reduce manual operational work by automating processes and making engineering improvements. Improve observability through better monitoring, logging, and alerting. Work closely with engineering teams to build resilient, scalable, and secure systems. Contribute across all phases of the software lifecycle: build, deploy, and operate.
Location
Sydney

Apr 16, 2026
The Trade Desk
Full-time|On-site|Sydney

Join Our Team: The Trade Desk is a pioneering technology firm dedicated to enhancing the internet experience through ethical and intelligent advertising solutions. With an impressive capacity of handling over 1 trillion queries daily, our platform operates on a scale that is unmatched in the industry. We are proud of our award-winning culture, which emphasizes trust, ownership, empathy, and collaboration. We celebrate the diverse experiences and perspectives of our team members and strive to create inclusive environments where everyone can show their true selves every day. Are you passionate about tackling complex challenges on a large scale? Do you want to be part of a vibrant, globally connected team where your efforts will significantly impact the media landscape? Discover why Fortune magazine consistently ranks The Trade Desk among the top small to medium-sized workplaces worldwide.
Your Future Team: The Trade Desk Network Team is responsible for managing comprehensive networking across one of the industry’s most challenging infrastructures, which includes extensive bare-metal datacenters and major public cloud platforms. We operate at the intersection of network engineering and software development, collaborating closely with application, datacenter, and Site Reliability Engineering (SRE) teams to design and maintain networks that facilitate a global, high-performance advertising technology platform. Our approach is software-first, and we seamlessly integrate modern AI-assisted development tools like Cursor and Claude into our workflows. You will be instrumental in shaping the future of network automation rather than merely maintaining existing systems.
What We Seek: We are in search of a Senior Software Engineer who excels at the intersection of advanced networking knowledge and software development. You will closely collaborate with SRE and infrastructure teams to define strategies and create the next generation of network automation, rooted in industry best practices and a focus on scalable, maintainable solutions. You possess a profound dedication to keeping networks healthy, efficient, and resilient.
Your Responsibilities: Design, develop, and expand a global network platform that encompasses physical datacenters and multi-cloud environments, including AWS, Azure, and Alibaba Cloud. Support thousands of hosts across the globe, engineering reliable and efficient solutions to manage petabyte-scale data challenges. Take ownership of troubleshooting and resolving intricate network issues, ensuring high availability and performance across the entire infrastructure.

Mar 31, 2026
AlayaCare
Full-time|On-site|Melbourne, Victoria, Australia

About the Role
AlayaCare is looking for a Senior Site Reliability Engineer in Melbourne, Victoria. This role focuses on keeping our cloud-based applications stable, scalable, and efficient. The position combines systems engineering and software development to strengthen platform performance and reliability.
What You Will Do
Monitor systems to maintain uptime and performance
Develop automation tools to streamline operations
Work with teams across the company to improve operational processes
Who We’re Looking For
Experience in systems engineering and software development
Strong problem-solving skills
Interest in technology and reliability engineering

Apr 17, 2026
IMC
Full-time|On-site|Sydney, Australia

Join IMC as a Graduate Site Reliability Engineer, where you will play a crucial role in ensuring the efficiency and reliability of our cutting-edge, low-latency Linux trading environment. Balancing speed and quality is essential in our operations, and your contributions will be pivotal in automating various aspects of our platform.
In this exciting position, you will embrace DevOps principles and implement SRE techniques to enhance IMC's operational capabilities. You will collaborate with a dynamic team, tackling the daily challenges of managing high-volume data flows in a complex distributed environment. Your role will involve addressing technical and trading challenges while ensuring the high availability, stability, and performance of our end-to-end systems.

Mar 30, 2026
Netwealth
Full-time|On-site|Melbourne Office

About Netwealth
At Netwealth, we are not just a financial services company; we are pioneers in transforming the wealth management landscape in Australia. Our award-winning platform, built on NextGen technology, empowers both advisers and investors to achieve exceptional results. Recognized as one of Australia's most innovative FinTech firms, we are proud of our rapid growth since our inception in 1999, driven by our relentless challenge to the status quo. We operate with agility, free from excessive bureaucracy, allowing us to deliver smarter solutions that create a tangible impact for our clients. What truly sets us apart is our people: a vibrant team of curious, optimistic, and courageous individuals who work collaboratively to enhance the lives of Australians. We prioritize authenticity and adaptability, fostering an environment where you can excel, develop your career, and contribute to meaningful work. If you seek a workplace where your contributions are valued, innovation is encouraged, and you can help shape a brighter financial future, consider joining us at Netwealth.
The Opportunity
As the Senior Manager of Site Reliability Engineering (SRE), reporting directly to the Head of Developer Platform, you will lead multiple SRE teams (1-3) and collaborate closely with engineering, product management, platform, and senior technology leaders. This is a highly impactful position where you will influence strategy, culture, and decision-making through data-driven reliability practices. Join a team that values your voice and encourages your ideas. Your growth is essential to us, and this role is more than just a title; it’s a chance to engage in meaningful work alongside individuals who are passionate about their impact. We understand that choosing your next career step is significant. As you explore this opportunity, we want you to envision stepping into our workspace, meeting your future colleagues, and confidently affirming, “This feels right for me.”
What You’ll Do
Integrate SRE principles and shared reliability ownership across engineering teams. Establish and lead SRE strategy, standards, and operational frameworks. Enhance incident management, on-call, and support practices as platforms expand. Drive reliability decisions utilizing SLIs, SLOs, SLAs, error budgets, and observability data. Balance reliability enhancements with project delivery speed and business results. Engage senior leaders through evidence-based discussions.

Apr 2, 2026
AlayaCare
Full-time|On-site|Sydney, New South Wales, Australia

About the Role
AlayaCare is hiring a Senior Site Reliability Engineer in Sydney, New South Wales. This role focuses on maintaining and improving the reliability and performance of AlayaCare’s software products. The position calls for strong experience in cloud infrastructure, automation, and monitoring.
What You Will Do
Support and enhance the reliability of software systems serving AlayaCare’s clients
Apply expertise in cloud infrastructure to strengthen system stability
Automate processes to streamline operations and reduce manual intervention
Monitor system health and performance to identify and resolve issues quickly
Location
This position is based in Sydney, New South Wales, Australia.

Apr 17, 2026
Freelancer Ltd.
Full-time|On-site|Sydney, New South Wales, Australia

Join our dynamic Systems Engineering team as a Senior DevOps Engineer / Site Reliability Engineer, where your expertise will play a vital role in designing and delivering mission-critical services and systems. Collaborate closely with software engineers to manage infrastructure and services at scale, utilizing an array of cutting-edge technologies to support the high-traffic Freelancer.com marketplace and various other business products deployed on Amazon Web Services. Our tech stack includes Nginx, MySQL, Redis, ElasticSearch, RabbitMQ, Consul, Docker, and Kubernetes, all aimed at building highly resilient, dynamically scalable, self-healing systems through automation and monitoring using Terraform, Puppet, Prometheus, Grafana, Kibana, and Jenkins.

Dec 15, 2025
coreflow
Full-time|On-site|Sydney

About Us
At coreflow, we are transforming the entertainment industry through the power of AI. As one of the fastest-growing startups on a global scale, we proudly serve 20 million users in our first year. Our team thrives in a collaborative, in-person environment based in Sydney, Australia. We adhere to core principles that guide our work and influence every decision we make:
User-First: Our focus is on creating products that resonate with our users. We dedicate time to understand their needs and prioritize delivering value.
High Agency, High Ownership: We take full responsibility for our work, from start to finish. We learn from our mistakes and are committed to finding solutions without placing blame.
Urgency: This is a unique opportunity in a fast-paced environment. We prioritize effectively, seek leverage, and maintain an inspiring pace of work.
Your Role
As our first dedicated Site Reliability Engineer, you will be pivotal in ensuring reliability and making core platform decisions as we scale to support hundreds of millions of users.
Key Projects
Enhance uptime and minimize RTO across essential services.
Manage and strengthen GPU clusters that facilitate millions of AI generations daily.
Establish platform-wide observability (metrics, tracing, alerting) and uphold SLOs.
Refine AWS infrastructure to optimize costs while ensuring top-tier performance.
Qualifications
5+ years of experience in operating production systems at scale.
Proficient in AWS (infrastructure as code, high-scale computing, K8s/ECS or similar).
Strong background in observability and incident response.
Expertise in CI/CD and deployment pipelines.
Familiarity with our technology stack: TypeScript, Next.js, React, TailwindCSS, tRPC, Postgres, Temporal, AWS.
A problem-solver who addresses root causes rather than just symptoms.
A relentless drive to succeed; this role will challenge you.
What We Offer
Competitive salary with significant growth potential.

Feb 24, 2026
Reliability Engineer

Griffith University

Full-time|On-site|Nathan

Role Overview
Griffith University is seeking a Reliability Engineer to strengthen the reliability and performance of key systems at the Nathan campus. This role focuses on maintaining operational excellence across the university’s infrastructure.
Main Responsibilities
Analyze data to spot trends affecting system reliability. Conduct failure mode and effects analysis (FMEA) to uncover potential issues before they arise. Develop and recommend strategies that improve reliability and reduce downtime. Work closely with teams from different disciplines to put best practices and new solutions into action.

Apr 20, 2026
Algolia
Full-time|Remote|Remote - Australia

Join Algolia, a trailblazer in AI Search, serving over 17,000 businesses with lightning-fast, predictive search capabilities at an internet scale. Our platform handles over 30 billion search queries weekly, outperforming major competitors like Microsoft Bing and Yahoo. With a recent Series D funding of $150 million, we have elevated our valuation to $2.25 billion, allowing us to continuously enhance our leading platform and support renowned clients such as Under Armour, PetSmart, and Stripe.
About the AI Research Team
The AI Research team at Algolia merges fundamental research with product engineering to create innovative AI-driven features. This dynamic team comprises PhD researchers, full-stack engineers, and infrastructure experts collaborating to explore groundbreaking ideas, assess their impact, and implement successful research outcomes into real-world applications, ensuring that our work translates into tangible, customer-facing systems.
The Opportunity
We are on the lookout for a dedicated Senior Site Reliability Engineer to embed within the AI Research team. In this role, you will enhance both the research and product engineering functions by ensuring the reliability, scalability, and operability of the infrastructure that underpins our innovative work. This position is a traditional SRE role focused on cloud-first, service-oriented architectures hosted on Google Cloud Platform. Although our team develops AI-powered systems, prior experience in AI or ML is not a prerequisite. Our primary focus is on strong SRE fundamentals, experience with production service management, and the ability to thrive in a setting characterized by ambiguity and high ownership. You will play a significant role in both daily operations and long-term (12-month) planning, influencing how the team builds and manages its platforms moving forward.
What You’ll Work On
Platform Reliability & Enablement
Enhance the reliability of platforms utilized by the AI Research team. Some examples of our infrastructure initiatives include:
A production inference service (embedding model serving API)
AI data feature store
Internal tools for innovative research and experimentation

Feb 18, 2026
Arista Networks
Full-time|On-site|Sydney

Join Our Team
Arista Networks is on the lookout for a talented Site Reliability Engineer (SRE) to enrich our Engineering Productivity (EngProd) team. You will play a pivotal role in maintaining and enhancing our growing infrastructure tailored for our internal user base. The ideal candidate will be adaptable, proactive, and eager to embrace new technologies. As part of our software engineering team, you will collaborate with fellow engineers to design, construct, and manage secure, scalable, and fault-tolerant tools within a hybrid cloud environment.
In the EngProd group, you will work closely with engineers to architect, build, scale, and manage systems utilized by Arista’s product development teams. These systems incorporate industry-standard technologies such as Ansible, Artifactory, Gerrit, Jenkins, Kubernetes, Grafana, Spinnaker, MySQL, ElasticSearch, Google Cloud, Varnish, and Perforce, along with bespoke internal systems designed to automate CI/CD, testing, analysis, and visualization.
Your Responsibilities
Safely build, deploy, and operate critical production systems with an emphasis on scalability, reliability, observability, performance, and security.
Monitor and enhance the developer experience across various services.
Automate processes to minimize toil and streamline production operations.
Proactively monitor, respond to, and improve alerts; establish automated alert handling.
Draft and maintain incident response documentation.
Triage platform and infrastructure issues, assisting Arista software engineers in their troubleshooting efforts while engaging with third-party vendor support.
Compose postmortem reports and devise solutions to prevent recurrence of incidents.
Plan and communicate maintenance schedules for production systems.
Collaborate with product development teams to identify and resolve infrastructural bottlenecks affecting their workflows.
Research and implement best practices for maintaining secure, scalable, and fault-tolerant systems.
Analyze the design and implementation details of open-source systems to improve triage and resolution processes.

Feb 24, 2026
Heidi Health
Full-time|On-site|Sydney

About Us
At Heidi Health, we believe that healthcare deserves a more harmonious approach, one that ensures continuous and deeply human care. Our mission is to develop an AI Care Partner that collaborates with clinicians to achieve this goal. Our diverse team comprises doctors, engineers, designers, researchers, and creatives dedicated to creating tools that empower clinicians to concentrate on what really counts: their patients. In just 18 months, we've reclaimed over 18 million hours for healthcare professionals, facilitating 73 million patient visits across 116 countries. Currently, over two million patient visits weekly are powered by Heidi around the globe. Supported by nearly $100 million in funding, we are expanding into the US, UK, Canada, and Europe. We collaborate with premier health systems, including the NHS, Beth Israel Lahey Health, and Monash Health.
The Position
The Senior Site Reliability Engineer will join our core Platform/SRE team responsible for production. You will directly engage in incident response, on-call duties, system reliability, and the daily operations of Heidi’s platform. We welcome strong mid-level SRE candidates eager to take on more responsibility, as well as seasoned SREs who thrive in hands-on operational roles. This position is purposefully operations-focused, with an emphasis on maintaining the health of real systems in production.
Your Responsibilities
Engage in on-call and incident response: Address production incidents, assist in service restoration, and ensure clear communication during incidents, gradually taking on more leadership in managing incidents.
Enhance operational reliability: Identify recurring issues and reliability risks, driving improvements through better alerting, automation, system adjustments, or process enhancements.
Oversee components of the production environment: Manage and enhance Kubernetes clusters, cloud infrastructure, and core platform services, with increasing ownership as you gain experience.
Bolster observability: Improve dashboards, alerts, logs, and traces to ensure quicker detection and diagnosis of issues, focusing on actionable insights.
Minimize operational toil: Automate repetitive tasks, streamline runbooks, and enhance tooling to make on-call and daily operations more efficient and secure.

Feb 10, 2026
