1 - 20 of 6,730 Jobs

Search for Site Reliability Engineer - Infrastructure

PostHog
Full-time|Remote|Remote

Join PostHog as a Site Reliability Engineer specializing in Infrastructure. In this role, you will be pivotal in ensuring the reliability, performance, and scalability of our systems. You'll collaborate with cross-functional teams to design and implement robust infrastructure solutions that support our growing user base.

Mar 26, 2026
xLabs
Full-time|Remote|Europe

About Us
At xLabs, we specialize in providing essential infrastructure and contributing to open-source technologies that empower the innovators of the Internet of Value. Our primary focus is on developing engineering tools and delivering resilient infrastructure that facilitates the seamless creation of decentralized applications. As key contributors to Wormhole, we are instrumental in constructing the backbone of the Internet of Value and actively participate in enhancing various major blockchains.

About the Role
- You will thrive in distributed teams, emphasizing clear and consistent communication.
- Critical thinking is essential; you will analyze challenges and propose multiple solutions, ideally documented for clarity.
- Proactivity is key; you will take the initiative while prioritizing the team's success.
- Fluency in English, both spoken and written, is required.
- Experience with Infrastructure-as-Code and GitOps is necessary.
- You should be familiar with workload orchestration tools, such as Kubernetes.
- Understanding software design and distributed systems is crucial.
- Proficiency in at least one programming language is expected.
- Experience operating Blockchain RPC nodes and Validators, such as Ethereum, Solana, or similar platforms, is advantageous.

What We Do
We manage the compute infrastructure and the applications that rely on it. Many applications, such as blockchains, are not developed by us. Therefore, we spend significant time reviewing documentation, code, Slack channels, and Discord servers to cultivate a thorough understanding of each application's architecture, optimal operating procedures, and debugging strategies when issues arise. Documenting our findings, including lessons learned from past mistakes and strategies to prevent future occurrences, is an integral part of our responsibilities.

Benefits
- Fully remote team: ideal candidates are located in US, South America, and European time zones.
- Team culture that is collaborative, transparent, and supportive, where all contributions are highly valued.
- Flexible working hours that respect your independence and encourage self-motivation.
- Generous vacation policy that allows you to take the time necessary to recharge, with paid time off that honors work-life balance.

Nov 6, 2025
getground
Full-time|Hybrid|London

Location: London, Waterloo (Hybrid, 4 days in-office - Wednesday is our designated work from home day, though you are welcome to join us in the office on Wednesdays if you prefer)

At getground, we are revolutionizing one of the world's most significant asset classes: property. With over £2 billion in assets on our platform and a community of more than 30,000 users across 70 countries, we are shaping the future of asset ownership and tackling wealth inequality.

Our innovative product streamlines property investing from start to finish, making real estate investment accessible to everyone.

Your Key Responsibilities:
- Collaborating within cross-functional product teams to transition infrastructure and reliability initiatives from concept to live deployment.
- Thriving in a dynamic environment where autonomy and ownership are fundamental to our operations.
- Developing and sustaining a robust, scalable infrastructure within our GCP cloud ecosystem. Utilizing Kubernetes, Terraform, Cloudflare, and cutting-edge observability tools to ensure seamless platform functionality.
- Working closely with engineering teams to formulate CI/CD pipelines, enhance deployment methodologies, and advocate for reliability as a core engineering principle.
- Contributing to the establishment of SRE practices for a rapidly growing fintech platform. Mentoring fellow engineers as we expand our teams and influence.

Your Day-to-Day Activities:
- Designing, implementing, and maintaining cloud infrastructure on Google Cloud Platform (GCP), ensuring it meets scalability, reliability, and security standards.
- Taking ownership of our Kubernetes clusters and containerization strategy, including Docker image optimization, cluster management, and deployment orchestration.
- Creating and optimizing Infrastructure as Code using Terraform, producing modular, testable, and well-documented configurations that adapt to our rapid growth.
- Managing and enhancing our Cloudflare infrastructure, including Workers for edge computing, DNS, CDN, security policies, and performance optimization.
- Implementing AI-powered product features in isolated and secure serverless environments.
- Establishing comprehensive monitoring and observability with Prometheus and Grafana, defining SLIs/SLOs, and proactively identifying potential issues before they affect users.
- Designing and maintaining CI/CD pipelines with appropriate quality gates, testing strategies, and deployment methodologies (blue-green, canary) to facilitate rapid deployments.

Feb 27, 2026
Axon Enterprise, Inc.
Full-time|Hybrid|London, England, United Kingdom

Become a Force for Good with Axon.
At Axon, our mission is to Protect Life. We are innovators tackling society’s most pressing safety and justice challenges through our integrated ecosystem of devices and cloud software. Like our products, we thrive on collaboration, connecting with transparency and empathy, and embracing diverse perspectives from our customers, communities, and each other.

Working at Axon is fast-paced, challenging, and purposeful. Here, you will take the initiative and make a tangible impact. Constantly develop your skills as you dedicate yourself to a mission that matters within a company that values your contributions.

Your Contribution
Join us in revolutionizing infrastructure automation for critical law enforcement systems. As a Senior Site Reliability Engineer, you will lead the creation of a cutting-edge infrastructure provisioning and automation platform. This platform allows engineering teams to independently access cloud infrastructure, ensuring safety and efficiency while minimizing manual interventions and operational risks.

Your role will involve hands-on contributions to build and enhance systems leveraging automation and intelligent agents to generate, validate, test, and manage infrastructure at scale. We seek an engineer with a strong software development background, proficiency in programming languages such as Go or Python, and extensive experience in designing and operating cloud platforms, with a drive to enhance developer productivity, reliability, and platform robustness.

Work Location:
This position is based in our London office and follows a hybrid work schedule. We emphasize in-person collaboration, requiring team members to be onsite from Tuesday to Friday, with the option to work remotely on Mondays, unless a workplace accommodation has been approved. We believe that connection fuels innovation, and our in-office culture is designed to promote meaningful teamwork, mentorship, and collective success.

Key Responsibilities
- Develop robust, user-friendly foundational platforms and tools that enable engineering teams to provision infrastructure quickly, consistently, and securely across diverse cloud providers.
- Write efficient, maintainable, and clear code in Go.
- Promote and uphold Infrastructure as Code (IaC) best practices and coding standards.
- Utilize strong problem-solving skills to troubleshoot issues in cloud-native distributed systems.
- Influence and educate the engineering organization on adopting new and improved architectural patterns.
- Provide comprehensive documentation to facilitate self-service by engineers.

Mar 27, 2026
Sectigo
Full-time|On-site|Manchester

Role Overview
Sectigo is hiring a Site Reliability Engineer in Manchester. This role focuses on maintaining and improving the reliability, availability, and performance of Sectigo's systems. The position sits within the development team and involves close collaboration to strengthen infrastructure and support scalable applications.

What You Will Do
- Work with development and operations teams to ensure systems remain reliable and available
- Enhance infrastructure to support growing and scalable applications
- Contribute technical expertise to ongoing projects and operational improvements

What Sectigo Looks For
- Technical background in site reliability or related fields
- Experience supporting scalable systems
- Commitment to operational excellence
- Strong teamwork and communication skills

Location
This position is based in Manchester.

Apr 17, 2026
Nice Ltd.
Full-time|Remote|United Kingdom - Remote

Role overview
Nice Ltd. is seeking a Site Reliability Engineer to join the team remotely from anywhere in the United Kingdom. The primary focus of this position is to enhance the reliability, scalability, and performance of the company’s services. Collaboration is central to the role, as you will work with colleagues across various teams to design, build, and maintain systems that deliver smooth user experiences.

What you will do
- Work with cross-functional teams to develop and maintain dependable systems
- Automate operational tasks to streamline processes and minimize manual intervention
- Identify and resolve technical issues that affect infrastructure and services
- Use both software engineering and systems administration skills to drive ongoing improvements

Impact
The work in this role directly supports Nice Ltd.’s commitment to delivering reliable, high-quality services to customers each day.

Apr 22, 2026
accesso
Full-time|On-site|United Kingdom

Role Overview
accesso is hiring a Site Reliability Engineer in the United Kingdom. This role focuses on keeping accesso’s solutions reliable, available, and performing well for clients. The Site Reliability Engineer works closely with teams across the company to design, build, and maintain systems that support a wide range of users.

What You Will Do
- Collaborate with cross-functional teams to create and maintain reliable systems
- Monitor system health and performance proactively
- Troubleshoot technical issues as they arise
- Develop and implement automation to improve operational efficiency

Apr 16, 2026
Orgvue
Full-time|On-site|London, England, United Kingdom

At Orgvue, we are at the forefront of organizational design and planning software, harnessing the transformative power of data visualization and modeling to help organizations become more adaptable and high-performing. Our platform empowers HR, finance, and business leaders to make swift, informed workforce decisions in an ever-evolving landscape.

Trusted by some of the world's largest enterprises and renowned management consulting firms, Orgvue enables organizations to visualize and proactively shape their futures. Headquartered in London, we also have offices in Philadelphia, The Hague, Toronto, and Sydney.

We are currently on the lookout for a Principal Site Reliability Engineer to join our team as a senior technical leader specializing in scaling and fortifying our AWS and Kubernetes-based infrastructure.

Role Overview
In this pivotal role, you will collaborate with product, platform, and operations teams to ensure our systems are reliable, observable, and resilient, even at scale. This position marries hands-on technical proficiency with strategic foresight, enabling us to cultivate a world-class reliability culture and a strong engineering framework for growth. We seek an individual with robust technical skills, exceptional communication abilities, and a passion for cross-team collaboration.

Key Responsibilities
- Establish and uphold SLOs, SLIs, and error budgets across vital services
- Design and execute a comprehensive cloud infrastructure and tooling strategy
- Elevate SRE practices organization-wide
- Implement effective observability metrics, logs, and traces using our observability tools
- Lead the team in creating automated, self-healing systems
- Manage and refine our incident response protocols, including on-call practices and a post-mortem culture
- Mentor engineers throughout the organization on reliability best practices, operational readiness, and scalable infrastructure
- Drive Infrastructure as Code (IaC) initiatives using Terraform, Kubernetes, CloudFormation, and GitOps methodologies
- Work closely with security, DevOps, and software teams to guarantee compliance, scalability, and operational excellence
- Assess and introduce tools, patterns, and practices that enhance the performance and reliability of our SaaS platform

Qualifications
- Proven experience leading SRE transformations
- Extensive hands-on expertise with Kubernetes (EKS preferred) in production settings
- Strong proficiency with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.)
- Expertise in Infrastructure as Code utilizing tools such as Terraform, with familiarity in GitOps workflows
- Solid background in observability: metrics, visualization, logging, and tracing
- Underst...

Feb 6, 2026
Bumble Inc.
On-site|On-site|UK London

Join our dynamic team as a Senior Site Reliability Engineer at Bumble Inc., where your expertise in Linux and system-level operations will be pivotal in managing complex production environments. We seek a proactive engineer capable of independently troubleshooting incidents, leading post-incident recovery efforts, and implementing enhancements to boost overall system stability, performance, and observability. This role is ideal for hands-on SREs with a solid foundation in Linux infrastructure and third-party system operations, focusing on optimizing large-scale environments of over 5,000 hosts utilizing technologies such as Kafka, Redis, and Kubernetes. Please note, this position centers on operational excellence rather than application development, requiring deep technical acumen and advanced troubleshooting capabilities.

Nov 19, 2025
Neo4j
Full-time|On-site|London

About Neo4j
Neo4j builds a graph intelligence platform used by 84 of the Fortune 100 and supported by the world’s largest graph community. The platform powers knowledge graphs for AI, delivers reliable graph capabilities across cloud environments, and integrates with a wide range of systems. Neo4j’s technology is designed for precision, accountability, and governance, helping organizations turn data into actionable insights for intelligent applications and AI systems. Engineered for seamless operation in any cloud, Neo4j supports dynamic, personalized, and autonomous AI solutions. The focus is on delivering swift results, contextual knowledge, and solutions that improve both customer and employee experiences.

Our Vision
Neo4j’s mission is to help the world understand data. As business and society become more interconnected, Neo4j’s technology enables organizations to find and understand relationships within their data. The company pioneered the graph database category and continues to lead in helping teams innovate and stay competitive.

About the Site Reliability Engineering Team
The Site Reliability Engineering (SRE) team supports Neo4j’s Database as a Service (DBaaS) product, Neo4j Aura. Aura operates globally across all major cloud providers, running hundreds of Kubernetes clusters and managing thousands of Neo4j instances in production. This team is redefining SRE within Neo4j Aura. Rather than simply reacting to incidents, the SRE group empowers teams to design for reliability from the start. The work centers on building tools, practices, and a culture that embed SRE principles into the foundation of Aura’s operations. Collaboration with product teams and a commitment to resilience and engineering excellence are central to the team’s approach.

What You Will Do
Automate for insight and scale: Build systems that enable fast, safe, and scalable troubleshooting across thousands of Neo4j instances. This includes developing internal tools that provide actionable insights.

Location
London

Apr 20, 2026
JobGether
Full-time|On-site|UK

Role overview
JobGether seeks an Engineering Manager to lead its Site Reliability Engineering team in the UK. This position carries responsibility for the reliability and performance of core systems, ensuring consistent service delivery and high system uptime across the company.

What you will do
- Lead and support a team of site reliability engineers
- Mentor engineers, helping them develop professionally
- Work with teams across the organization to strengthen system reliability
- Proactively identify and address reliability and performance concerns before users are affected
- Establish and maintain best practices in reliability engineering
- Encourage ongoing improvement within the team

Location
This role is based in the UK.

Apr 27, 2026
Customer.io
Full-time|$140K/yr - $180K/yr|Remote|EMEA Remote

About Customer.io
At Customer.io, we empower over 8,000 businesses, from agile startups to established global brands, to engage their audiences through billions of daily communications including emails, push notifications, in-app messages, and SMS. Our platform is designed to facilitate automated messaging that resonates with recipients.

We enhance team communications by leveraging real-time behavioral insights. Our technology stack includes Go, React, Ember, and AI, enabling us to deploy swiftly while ensuring scalability and reliability.

We are seeking a Senior Site Reliability Engineer to join our team in scaling our infrastructure, minimizing operational overhead, and enhancing system reliability as we expand. If you have experience with high-scale systems and a passion for improving platforms for both developers and customers, we invite you to apply.

Mar 6, 2026
Kraken Digital Asset Exchange
Full-time|On-site|United Kingdom

Kraken Digital Asset Exchange seeks an Engineering Manager to lead its Site Reliability Engineering (SRE) team in the United Kingdom. This position guides a group of engineers who focus on making Kraken’s platform more reliable, scalable, and high-performing.

What you will do
- Mentor and support SRE team members as they develop professionally
- Promote operational excellence in both daily activities and long-term projects
- Collaborate with other departments to ensure infrastructure meets demanding reliability and performance standards

Role overview
This role centers on leading engineers who enhance the stability and efficiency of Kraken’s systems. The Engineering Manager provides guidance, encourages best practices, and works across teams to maintain high operational standards.

Apr 24, 2026
Kaluza
Full-time|£40K/yr - £60K/yr|Hybrid|Bristol, England, United Kingdom; Edinburgh, Scotland, United Kingdom; London, England, United Kingdom

Join our dynamic Release Engineering team at Kaluza as a Site Reliability Engineer. In this pivotal role, you will play a crucial part in enhancing our software development lifecycle by developing innovative engineering solutions that empower our software teams to deploy high-quality code efficiently. Your efforts will significantly boost engineering productivity through the optimization of testing, deployment, and release processes across all Kaluza engineering teams.

Feb 23, 2026
bet365
Full-time|Hybrid|Manchester

Join bet365 as a Site Reliability Engineer and play a crucial role in enhancing system reliability, observability, and performance through a robust engineering approach. You will be instrumental in incident resolution and in the implementation of best practices.

Bringing strong software engineering skills to the table, you will focus on monitoring the health, performance, and availability of our critical systems, significantly impacting our operational efficiency.

Your engineering expertise will be key in implementing solutions that boost reliability, which includes service instrumentation using tools like OpenTelemetry, improving logging practices, and developing features that enhance maintainability. Additionally, you will help create tools and automation for effective service management.

Collaboration is essential in this role, as you will work across various functions to integrate reliability and observability best practices into the software development lifecycle. By supporting governance standards established by central teams, you will cultivate a culture where these principles are fundamental to development. Your contributions will ensure our systems meet user needs and enhance overall service performance.

This position is eligible for our hybrid working from home policy.

Dec 1, 2025
Wheely
Full-time|On-site|London, England, United Kingdom

About Wheely
Wheely is revolutionizing premium transportation in major cities across Europe, the United States, and the Middle East. We seamlessly integrate cutting-edge technology with the artistry of five-star chauffeuring to provide an unparalleled experience that has earned the trust of over 100,000 active riders and 1,200 corporate clients.

As a profitable and rapidly growing scale-up, we have raised $43M and surpassed $100M in annual revenue. Following our recent launch in New York City, we are swiftly expanding across the US and EMEA. If you take pride in your craft and are eager to contribute to our next phase of growth, we invite you to connect with us.

Our infrastructure has been rebuilt almost from the ground up over the past few years, and we are now seeking to further expand our infrastructure team.

As a valued member of our team, you will focus on minimizing incidents related to availability, performance, and security. You will accelerate the delivery of new features to customers by building flexible, highly available, and secure infrastructure, ensuring a smooth journey for every customer.

Apr 9, 2026
xAI
Full-time|On-site|London, UK

About xAI
At xAI, our mission is to develop advanced AI systems that can comprehend the universe and assist humanity in its quest for knowledge. Our dedicated team is small, highly motivated, and committed to engineering excellence, making it an ideal environment for individuals who thrive on challenges and curiosity. We foster a flat organizational structure where every employee plays a crucial role in driving our mission forward. We value initiative and excellence, rewarding those who consistently demonstrate strong work ethic and prioritization skills. Effective communication is essential, and all team members are expected to share their insights clearly and concisely.

About the Team
You will join a team responsible for the backend services that power our innovative products, including grok.com and our API. Our focus is on developing and maintaining highly scalable and reliable services capable of efficiently processing tens of thousands of queries per second, hosted across multiple Kubernetes clusters in both on-premises and cloud environments.

About the Role
We are looking for a candidate who meets the following criteria:
- In-depth expertise in Kubernetes.
- Proficiency with continuous deployment systems, including Buildkite and ArgoCD.
- Extensive experience with monitoring tools such as Prometheus, Grafana, and PagerDuty.
- Strong knowledge of infrastructure as code practices utilizing tools like Pulumi or Terraform.
- Familiarity with systems programming languages such as Rust, C++, or Go.
- Experience in traffic management and HTTP proxies, such as nginx and envoy.

Location
This position requires in-person attendance in London, UK. While we typically work from the office five days a week, we do provide flexibility for remote work when necessary. Candidates should be prepared to attend late meetings at least once a week to coordinate with our global teams.

Feb 4, 2026
Wayve Technologies Ltd.
Full-time|On-site|London

Join Wayve Technologies as a Staff Cloud Site Reliability Engineer and play a pivotal role in shaping the future of autonomous driving technology. In this position, you will leverage your expertise to enhance the reliability, performance, and scalability of our cloud infrastructure. Collaborate with cross-functional teams to design robust systems that can handle high traffic and ensure seamless operation.

Mar 11, 2026
Payward Services, Inc.
Full-time|On-site|United Kingdom

Role overview
Payward Services, Inc. seeks a Senior Site Reliability Engineer in the United Kingdom. This role centers on maintaining and improving the reliability and performance of the company’s core systems. The position plays a key part in supporting seamless user experiences by strengthening infrastructure and tackling technical challenges as they arise.

What you will do
- Collaborate with teams across Payward Services to enhance service offerings
- Identify and resolve complex issues affecting reliability and performance
- Shape and refine infrastructure to support system resilience and long-term stability

Location
This position is based in the United Kingdom.

Apr 24, 2026
Freelancer.com
Full-time|On-site|London, England, United Kingdom

Join our dynamic Systems Engineering team as a pivotal and trusted DevOps Engineer / Site Reliability Engineer. Collaborating closely with software engineers, you will design and implement mission-critical services and systems. Your role will involve managing infrastructure and services at scale, employing a diverse array of cutting-edge technologies that support our high-traffic, real-time Freelancer.com marketplace as well as various other business products deployed on Amazon Web Services. Our technology stack includes Nginx, MySQL, Redis, ElasticSearch, RabbitMQ, Consul, Docker, and Kubernetes. We aim to build highly resilient, dynamically scaling, self-healing systems by automating and monitoring all processes using tools such as Terraform, Puppet, Prometheus, Grafana, Kibana, and Jenkins.

Dec 3, 2025
