Experience Level
Senior
Qualifications
7+ years of experience in Site Reliability Engineering or infrastructure roles, focusing on the enhancement of production systems at scale. Extensive expertise in MySQL, including schema design, performance tuning, and operational tooling. Proficient in cloud-native technologies (GCP experience is beneficial) and Terraform. Strong programming skills in Go and Bash for scripting and systems tasks. Expertise in observability, incident management, and debugging distributed systems. A proactive approach with a focus on action and results.
At Customer.io, we empower over 8,000 businesses, from agile startups to established global brands, to engage their audiences through billions of daily communications including emails, push notifications, in-app messages, and SMS. Our platform is designed to facilitate automated messaging that resonates with recipients.
We enhance team communications by leveraging real-time behavioral insights. Our technology stack includes Go, React, Ember, and AI, enabling us to deploy swiftly while ensuring scalability and reliability.
We are seeking a Senior Site Reliability Engineer to join our team in scaling our infrastructure, minimizing operational overhead, and enhancing system reliability as we expand. If you have experience with high-scale systems and a passion for improving platforms for both developers and customers, we invite you to apply.
About Customer.io
Customer.io is a leading platform in automated messaging, trusted by thousands of companies worldwide. Our mission is to help businesses build meaningful relationships with their customers through intelligent communication strategies powered by data.
About Us
At xLabs, we specialize in providing essential infrastructure and contributing to open-source technologies that empower the innovators of the Internet of Value. Our primary focus is on developing engineering tools and delivering resilient infrastructure that facilitates the seamless creation of decentralized applications. As key contributors to Wormh…
Blockchain.com is at the forefront of revolutionizing finance, providing millions globally with secure access to cryptocurrency. Established in 2011, we have gained the trust of over 90 million wallet holders and more than 40 million verified users, facilitating over $1 trillion in crypto transactions.
Blockchain is the world's premier software platform for digital assets. We operate the largest production blockchain platform globally, driven by our passion for coding and building an open, accessible, and equitable financial future, one innovative software solution at a time.
We are seeking a Site Reliability Engineer to join our Core team. This role involves advocating for infrastructure best practices across our organization, enabling us to securely scale a distributed financial platform that serves millions daily.
Our distributed financial platform addresses some of the most fascinating challenges in the crypto space for our vast customer base and is experiencing rapid growth. The Site Reliability Engineering (SRE) team at Blockchain merges software and systems engineering to create a platform that simplifies complexity, enhancing security, reliability, and swift product delivery.
The SRE organization at Blockchain is a dynamic environment focused on continual improvement. We foster a culture where team members can propose, discuss, design, and implement changes with a high degree of autonomy. We value abstract thinking to develop exceptionally effective tools and strive to eliminate toil.
As a member of the Core team, you will gain a comprehensive understanding of our products' infrastructure needs. Your role will include establishing and maintaining innovative engineering solutions to enhance our customers' experience through the development of essential tools. Importantly, you will also mentor and guide developer teams to deliver new features in a rapid, secure, and scalable manner.
P2P.org is a leading institutional staking provider with a total value locked (TVL) of over $10 billion and more than 20% market share in the restaking sector. The team focuses on research and infrastructure development to maximize annual percentage returns (APR) and strengthen security. For example, average net return rates (NRR) at P2P.org exceed the market by 10% for ETH and SOL, and by 20% for DOT. P2P.org is launching new blockchain networks including TON, Avail, Monad, Babylon, Story, and Berachain, alongside yield products and aggregators for stablecoins. The company is the largest operator in restaking by market share. Clients such as BitGo, Copper, Crypto.com, Ledger, ByBit, Bitget, OKX, HTX, Bitvavo, and SBI trust P2P.org for its customer-focused approach and broad product suite, which includes unified APIs, customizable dApps, and widgets. The product portfolio is expanding to include real-world assets (RWA), data solutions, new yield opportunities, and services for banks, exchanges, custodians, and wallets.
P2P.org brings together professionals from around the world. The remote team shares a commitment to decentralized finance and works to build a fairer financial system. Team members code, learn, create, and collaborate to shape the future of finance. The company has built a strong reputation and network, keeping customer satisfaction at the center while developing new technologies to advance its brand.
Role Overview: Senior Site Reliability Engineer (SRE 3)
P2P.org is hiring a Senior Site Reliability Engineer to join the Launch Team. This group is responsible for bringing new blockchain networks into production, from initial design through deployment, stability, observability, and production readiness. The SRE will work at the intersection of infrastructure, protocol engineering, and operations. The goal is to help P2P.org scale to support multiple networks with reliability.
What You Will Do
Design, build, and operate infrastructure for new blockchain networks
Collaborate with protocol, infrastructure, and security teams
Oversee network launches from start to finish
Ensure solutions are reliable, repeatable, and meet platform standards
Location
This is a remote role based in the EU.
Join PostHog as a Site Reliability Engineer specializing in Infrastructure. In this role, you will be pivotal in ensuring the reliability, performance, and scalability of our systems. You'll collaborate with cross-functional teams to design and implement robust infrastructure solutions that support our growing user base.
Become a Force for Good with Axon.
At Axon, our mission is to Protect Life. We are innovators tackling society’s most pressing safety and justice challenges through our integrated ecosystem of devices and cloud software. Like our products, we thrive on collaboration, connecting with transparency and empathy, and embracing diverse perspectives from our customers, communities, and each other.
Working at Axon is fast-paced, challenging, and purposeful. Here, you will take the initiative and make a tangible impact. Constantly develop your skills as you dedicate yourself to a mission that matters within a company that values your contributions.
Your Contribution
Join us in revolutionizing infrastructure automation for critical law enforcement systems. As a Senior Site Reliability Engineer, you will lead the creation of a cutting-edge infrastructure provisioning and automation platform. This platform allows engineering teams to independently access cloud infrastructure, ensuring safety and efficiency while minimizing manual interventions and operational risks.
Your role will involve hands-on contributions to build and enhance systems leveraging automation and intelligent agents to generate, validate, test, and manage infrastructure at scale. We seek an engineer with a strong software development background, proficiency in programming languages such as Go or Python, and extensive experience in designing and operating cloud platforms, with a drive to enhance developer productivity, reliability, and platform robustness.
Work Location:
This position is based in our London office and follows a hybrid work schedule. We emphasize in-person collaboration, requiring team members to be onsite from Tuesday to Friday, with the option to work remotely on Mondays, unless a workplace accommodation has been approved. We believe that connection fuels innovation, and our in-office culture is designed to promote meaningful teamwork, mentorship, and collective success.
Key Responsibilities
Develop robust, user-friendly foundational platforms and tools that enable engineering teams to provision infrastructure quickly, consistently, and securely across diverse cloud providers.
Write efficient, maintainable, and clear code in Go.
Promote and uphold Infrastructure as Code (IaC) best practices and coding standards.
Utilize strong problem-solving skills to troubleshoot issues in cloud-native distributed systems.
Influence and educate the engineering organization on adopting new and improved architectural patterns.
Provide comprehensive documentation to facilitate self-service by engineers.
Join our dynamic team as a Senior Site Reliability Engineer at Bumble Inc., where your expertise in Linux and system-level operations will be pivotal in managing complex production environments. We seek a proactive engineer capable of independently troubleshooting incidents, leading post-incident recovery efforts, and implementing enhancements to boost overall system stability, performance, and observability. This role is ideal for hands-on SREs with a solid foundation in Linux infrastructure and third-party system operations, focusing on optimizing large-scale environments of over 5,000 hosts utilizing technologies such as Kafka, Redis, and Kubernetes. Please note, this position centers on operational excellence rather than application development, requiring deep technical acumen and advanced troubleshooting capabilities.
Location: London, Waterloo (Hybrid, 4 days in-office - Wednesday is our designated work from home day, though you are welcome to join us in the office on Wednesdays if you prefer)
At getground, we are revolutionizing one of the world's most significant asset classes: property. With over £2 billion in assets on our platform and a community of more than 30,000 users across 70 countries, we are shaping the future of asset ownership and tackling wealth inequality.
Our innovative product streamlines property investing from start to finish, making real estate investment accessible to everyone.
Your Key Responsibilities:
Collaborating within cross-functional product teams to transition infrastructure and reliability initiatives from concept to live deployment.
Thriving in a dynamic environment where autonomy and ownership are fundamental to our operations.
Developing and sustaining a robust, scalable infrastructure within our GCP cloud ecosystem. Utilizing Kubernetes, Terraform, Cloudflare, and cutting-edge observability tools to ensure seamless platform functionality.
Working closely with engineering teams to formulate CI/CD pipelines, enhance deployment methodologies, and advocate for reliability as a core engineering principle.
Contributing to the establishment of SRE practices for a rapidly growing fintech platform.
Mentoring fellow engineers as we expand our teams and influence.
Your Day-to-Day Activities:
Designing, implementing, and maintaining cloud infrastructure on Google Cloud Platform (GCP), ensuring it meets scalability, reliability, and security standards.
Taking ownership of our Kubernetes clusters and containerization strategy, including Docker image optimization, cluster management, and deployment orchestration.
Creating and optimizing Infrastructure as Code using Terraform, producing modular, testable, and well-documented configurations that adapt to our rapid growth.
Managing and enhancing our Cloudflare infrastructure, including Workers for edge computing, DNS, CDN, security policies, and performance optimization.
Implementing AI-powered product features in isolated and secure serverless environments.
Establishing comprehensive monitoring and observability with Prometheus and Grafana, defining SLIs/SLOs, and proactively identifying potential issues before they affect users.
Designing and maintaining CI/CD pipelines with appropriate quality gates, testing strategies, and deployment methodologies (blue-green, canary) to facilitate rapid deployments.
Join Wayve Technologies as a Staff Cloud Site Reliability Engineer and play a pivotal role in shaping the future of autonomous driving technology. In this position, you will leverage your expertise to enhance the reliability, performance, and scalability of our cloud infrastructure. Collaborate with cross-functional teams to design robust systems that can handle high traffic and ensure seamless operation.
Role overview
Payward Services, Inc. seeks a Senior Site Reliability Engineer in the United Kingdom. This role centers on maintaining and improving the reliability and performance of the company’s core systems. The position plays a key part in supporting seamless user experiences by strengthening infrastructure and tackling technical challenges as they arise.
What you will do
Collaborate with teams across Payward Services to enhance service offerings
Identify and resolve complex issues affecting reliability and performance
Shape and refine infrastructure to support system resilience and long-term stability
Location
This position is based in the United Kingdom.
Join our dynamic team at Blockchain as an Infrastructure Security Engineer. In this pivotal role, you will be responsible for designing, implementing, and maintaining robust security architectures to safeguard our cutting-edge blockchain technologies. Your expertise will be essential in identifying vulnerabilities, managing security incidents, and ensuring compliance with industry standards.
We are looking for a passionate individual who thrives in a fast-paced environment and is eager to contribute to the future of decentralized technologies.
About Axon
Axon’s mission is to safeguard life. The company develops devices and cloud-based software focused on public safety and justice. Teams at Axon work together to address complex challenges, valuing transparency, empathy, and a range of perspectives from users, communities, and colleagues.
Role Overview: Senior Site Reliability Engineer I
This position sits within the Site Reliability Engineering (SRE) team. The main focus: tackle real-time challenges across Axon’s mission-critical, cloud-native services. The work centers on maintaining the reliability and quality customers expect. Collaboration is key, both within the SRE group and across the wider engineering organization, to help product teams deliver new features consistently.
Work Location and Flexibility
This role is based in London, England, United Kingdom. Axon uses a hybrid working model. Team members are expected onsite from Tuesday to Friday, with remote work on Mondays (unless a workplace accommodation is granted). The company emphasizes in-person collaboration to support teamwork, mentorship, and shared success.
Role Overview
Sectigo is hiring a Site Reliability Engineer in Manchester. This role focuses on maintaining and improving the reliability, availability, and performance of Sectigo's systems. The position sits within the development team and involves close collaboration to strengthen infrastructure and support scalable applications.
What You Will Do
Work with development and operations teams to ensure systems remain reliable and available
Enhance infrastructure to support growing and scalable applications
Contribute technical expertise to ongoing projects and operational improvements
What Sectigo Looks For
Technical background in site reliability or related fields
Experience supporting scalable systems
Commitment to operational excellence
Strong teamwork and communication skills
Location
This position is based in Manchester.
Role overview
Nice Ltd. is seeking a Site Reliability Engineer to join the team remotely from anywhere in the United Kingdom. The primary focus of this position is to enhance the reliability, scalability, and performance of the company’s services. Collaboration is central to the role, as you will work with colleagues across various teams to design, build, and maintain systems that deliver smooth user experiences.
What you will do
Work with cross-functional teams to develop and maintain dependable systems
Automate operational tasks to streamline processes and minimize manual intervention
Identify and resolve technical issues that affect infrastructure and services
Use both software engineering and systems administration skills to drive ongoing improvements
Impact
The work in this role directly supports Nice Ltd.’s commitment to delivering reliable, high-quality services to customers each day.
Role Overview
accesso is hiring a Site Reliability Engineer in the United Kingdom. This role focuses on keeping accesso’s solutions reliable, available, and performing well for clients. The Site Reliability Engineer works closely with teams across the company to design, build, and maintain systems that support a wide range of users.
What You Will Do
Collaborate with cross-functional teams to create and maintain reliable systems
Monitor system health and performance proactively
Troubleshoot technical issues as they arise
Develop and implement automation to improve operational efficiency
At Orgvue, we are at the forefront of organizational design and planning software, harnessing the transformative power of data visualization and modeling to help organizations become more adaptable and high-performing. Our platform empowers HR, finance, and business leaders to make swift, informed workforce decisions in an ever-evolving landscape.
Trusted by some of the world's largest enterprises and renowned management consulting firms, Orgvue enables organizations to visualize and proactively shape their futures. Headquartered in London, we also have offices in Philadelphia, The Hague, Toronto, and Sydney.
We are currently on the lookout for a Principal Site Reliability Engineer to join our team as a senior technical leader specializing in scaling and fortifying our AWS and Kubernetes-based infrastructure.
Role Overview
In this pivotal role, you will collaborate with product, platform, and operations teams to ensure our systems are reliable, observable, and resilient, even at scale. This position marries hands-on technical proficiency with strategic foresight, enabling us to cultivate a world-class reliability culture and a strong engineering framework for growth. We seek an individual with robust technical skills, exceptional communication abilities, and a passion for cross-team collaboration.
Key Responsibilities
Establish and uphold SLOs, SLIs, and error budgets across vital services
Design and execute a comprehensive cloud infrastructure and tooling strategy
Elevate SRE practices organization-wide
Implement effective observability metrics, logs, and traces using our observability tools
Lead the team in creating automated, self-healing systems
Manage and refine our incident response protocols, including on-call practices and a post-mortem culture
Mentor engineers throughout the organization on reliability best practices, operational readiness, and scalable infrastructure
Drive Infrastructure as Code (IaC) initiatives using Terraform, Kubernetes, CloudFormation, and GitOps methodologies
Work closely with security, DevOps, and software teams to guarantee compliance, scalability, and operational excellence
Assess and introduce tools, patterns, and practices that enhance the performance and reliability of our SaaS platform
Qualifications
Proven experience leading SRE transformations
Extensive hands-on expertise with Kubernetes (EKS preferred) in production settings
Strong proficiency with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.)
Expertise in Infrastructure as Code utilizing tools such as Terraform, with familiarity in GitOps workflows
Solid background in observability: metrics, visualization, logging, and tracing
Underst...
About Neo4j
Neo4j builds a graph intelligence platform used by 84 of the Fortune 100 and supported by the world’s largest graph community. The platform powers knowledge graphs for AI, delivers reliable graph capabilities across cloud environments, and integrates with a wide range of systems. Neo4j’s technology is designed for precision, accountability, and governance, helping organizations turn data into actionable insights for intelligent applications and AI systems. Engineered for seamless operation in any cloud, Neo4j supports dynamic, personalized, and autonomous AI solutions. The focus is on delivering swift results, contextual knowledge, and solutions that improve both customer and employee experiences.
Our Vision
Neo4j’s mission is to help the world understand data. As business and society become more interconnected, Neo4j’s technology enables organizations to find and understand relationships within their data. The company pioneered the graph database category and continues to lead in helping teams innovate and stay competitive.
About the Site Reliability Engineering Team
The Site Reliability Engineering (SRE) team supports Neo4j’s Database as a Service (DBaaS) product, Neo4j Aura. Aura operates globally across all major cloud providers, running hundreds of Kubernetes clusters and managing thousands of Neo4j instances in production. This team is redefining SRE within Neo4j Aura. Rather than simply reacting to incidents, the SRE group empowers teams to design for reliability from the start. The work centers on building tools, practices, and a culture that embed SRE principles into the foundation of Aura’s operations. Collaboration with product teams and a commitment to resilience and engineering excellence are central to the team’s approach.
What You Will Do
Automate for insight and scale: Build systems that enable fast, safe, and scalable troubleshooting across thousands of Neo4j instances. This includes developing internal tools that provide actionable insights.
Location
London
Role overview
JobGether seeks an Engineering Manager to lead its Site Reliability Engineering team in the UK. This position carries responsibility for the reliability and performance of core systems, ensuring consistent service delivery and high system uptime across the company.
What you will do
Lead and support a team of site reliability engineers
Mentor engineers, helping them develop professionally
Work with teams across the organization to strengthen system reliability
Proactively identify and address reliability and performance concerns before users are affected
Establish and maintain best practices in reliability engineering
Encourage ongoing improvement within the team
Location
This role is based in the UK.
About Airalo
Alo! Airalo is the world’s pioneering eSIM store, dedicated to empowering individuals with seamless connectivity in over 200 countries and regions worldwide. We are on a mission to transform the telecommunications landscape through innovative digital services. As a travel-tech enterprise, we pride ourselves on fostering a diverse, inclusive, and equitable workplace, with a team that spans across 50+ countries and six continents. Our shared vision is to redefine the way people connect globally.
For detailed insights about our company culture and values, please explore our Public Handbook: https://airalo-public.notion.site/airalo-public-handbook
About You
We are in search of individuals who are passionate about delivering high-quality work and value the impact of their contributions to the team's success. You are self-motivated and thrive without the need for micromanagement. Every day, you strive to grow as an individual while nurturing a collaborative team atmosphere. Authenticity, honesty, positivity, and kindness are principles you uphold. Your communication is clear and concise, and you excel at managing multiple projects with a strong analytical mindset and meticulous attention to detail. You embrace diversity and are respectful of different cultural backgrounds.
About the Role
Position: Full-time / Employee
Location: Remote-first
Benefits: Comprehensive Health Insurance, work-from-anywhere stipend, annual wellness and learning credits, an all-expenses-paid annual company retreat in stunning locations, and additional perks.
On-Call Responsibilities:
As a crucial aspect of this role, you will participate in our on-call rotation. This ensures our global operations maintain 24/7 service reliability, enabling uninterrupted service for our customers across all time zones.
- Paid Rotation: Standby fees and overtime pay offered.
- Delayed Start: No on-call duties for the first six months.
- Rest & Recovery: Guaranteed downtime and flexible hours following night incidents.
- Shared Workload: Rotations split between weekdays and weekends to alleviate fatigue.
For comprehensive details, refer to the On-Call Policy in the Airalo Handbook: https://airalo-public.notion.site/our-approach-to-engineering-on-call-policy
We are excited to welcome a Senior Site Reliability Engineer to our dynamic engineering team.
About Us
At Heidi Health, we believe that healthcare deserves a more harmonious approach, one that ensures care remains continuous and deeply personalized. Our innovative AI Care Partner collaborates with healthcare providers to enhance the care experience for patients and clinicians alike.
Our diverse team includes doctors, engineers, designers, researchers, and creatives, all dedicated to creating tools that empower clinicians to focus on what matters most: their patients.
In just 18 months, we've reclaimed over 18 million hours for healthcare professionals, facilitating 73 million patient visits across 116 countries. Currently, our technology supports more than two million patient visits weekly worldwide.
With nearly $100 million in funding, we are expanding our presence in the US, UK, Canada, and Europe, partnering with prestigious health systems such as the NHS, Beth Israel Lahey Health, and Monash Health.
The Opportunity
Join our core Platform/SRE team, where you will take charge of production reliability. This role involves active incident response, on-call duties, system reliability, and daily operational oversight of Heidi’s platform.
We welcome applications from mid-level SREs eager to embrace greater responsibility, as well as senior SREs who relish hands-on operational roles. This position emphasizes operational involvement and aims to maintain the health of real systems in production.
Your Responsibilities
Engage in on-call and incident response: Address production incidents, assist in service restoration, and facilitate clear communication during incidents, escalating to leading incidents over time.
Enhance operational reliability: Identify recurring issues and reliability risks, driving improvements through better alerting, automation, system enhancements, and process refinements.
Manage production environment components: Operate and enhance Kubernetes clusters, cloud infrastructure, and core platform services, increasing responsibility as expertise grows.
Boost observability: Refine dashboards, alerts, logs, and traces to enable earlier detection and faster diagnosis of issues, concentrating on actionable insights.
Minimize operational toil: Automate repetitive tasks, streamline runbooks, and enhance tooling to facilitate smoother and safer on-call and daily operations.