Experience Level
Senior
Qualifications
Ideal Candidate Profile
Extensive experience in Site Reliability Engineering (SRE), DevOps, Software Engineering, or Systems Engineering.
Exceptional troubleshooting skills.
Strong analytical capability in system design.
Excellent communication abilities.
Proficient knowledge of major cloud platforms, especially Google Cloud.
Solid understanding of SQL.
Experience with containers, Kubernetes, and tools like Kustomize and Helm.
Familiarity with Service Mesh technologies, preferably Istio.
Networking knowledge, including DNS, TLS, certificates, and ingress management.
About the job
About the Role
Join Hopper's dynamic Cloud FinOps team as a Senior Site Reliability Engineer. We oversee an extensive infrastructure within Google Cloud, empowering hundreds of engineers to deliver exceptional experiences to millions of users globally.
If you are enthusiastic about automation and optimizing systems for performance and reliability, we want to hear from you.
You will focus on building scalable, secure, and optimized infrastructure while solving practical problems with straightforward, cost-effective solutions.
Daily Responsibilities
Engage in projects that enhance cost efficiency, such as:
Minimizing network egress costs by eliminating unnecessary headers.
Optimizing data storage solutions based on usage patterns, such as implementing cold storage for infrequently accessed data.
Ensuring optimal autoscaling configurations for databases and compute resources.
Enhance current cost attribution processes to provide transparency for all teams regarding their expenditures.
Participate in incident support, including on-call rotation for platform incidents, collaborating with teams across the Americas and Europe to ensure continuous support.
Contribute to a small but highly efficient team of SREs.
About Hopper
Hopper is revolutionizing the travel industry by using advanced technology and data to create a seamless experience for millions of users. Join us in our mission to provide a first-class experience powered by a robust cloud infrastructure.
About ClickHouse
As a leader in the cloud space, ClickHouse was recognized on the prestigious 2025 Forbes Cloud 100 list. We are a rapidly growing and innovative private cloud company that has garnered a strong customer base of over 3,000, achieving a remarkable 250% year-over-year growth in annual recurring revenue (ARR). Our expertise spans real-time analytics, data warehousing, observability, and AI workloads.
Our recent success was underscored by a $400 million Series D funding round, with notable clients such as Capital One, Lovable, Decagon, Polymarket, and Airwallex adopting or expanding their use of our platform. These companies join well-known AI innovators and global brands including Meta, Cursor, Sony, and Tesla.
Join us on our mission to revolutionize data utilization across industries!
About the Role
We are expanding our dedicated Site Reliability Engineering team to enhance the reliability and security of our services. As a Senior Site Reliability Engineer, you will play a crucial role in establishing processes that guarantee the reliability, availability, scalability, and performance of our cloud infrastructure. Your collaboration with teams such as Control Plane, Data Plane, Core, Security, Support, and Operations will be vital in designing and implementing robust distributed systems. You will lead incident management and response efforts, conduct blameless post-mortem analysis, and drive continuous improvement initiatives for our Cloud services. Utilize your software engineering skills to create platforms and tools that enhance operational and engineering efficiencies at ClickHouse Cloud.
This is a unique opportunity to significantly influence our high-performance, scalable ClickHouse Cloud.
What You'll Do
Collaborate with engineering teams to design and implement scalable, secure, and highly available systems.
Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
Ensure comprehensive monitoring and alerting for all infrastructure components to facilitate prompt incident detection and resolution.
Refine incident response processes and conduct post-mortem analyses for outages, coordinating with support teams to communicate with affected customers.
Continuously enhance the reliability and performance of ClickHouse services.
Drive Chaos Engineering initiatives to proactively identify potential system weaknesses.
Full-time|$158K/yr - $227K/yr|Remote|Remote - United States; United States of America
ABOUT JUUL LABS:
At Juul Labs, we are dedicated to revolutionizing the experience of adult smokers by transitioning them away from traditional combustible cigarettes. Our mission is to eliminate their use and prevent underage access to our products. We tackle this global health challenge with a focus on quality, innovation, and research. Supported by prominent technology investors, we aim for excellence not only in our products but also in our talent acquisition. We embrace diversity and are united by our mission. We are seeking the world's best engineers, scientists, designers, product managers, operations experts, and customer service professionals. If you are ready to advance your career with us, we encourage you to explore this opportunity.
ROLE OVERVIEW:
As a Senior Site Reliability Engineer (SRE), you will take ownership of the operational stability and performance of Juul's hybrid cloud infrastructure (Nutanix, AWS/GCP). Your responsibilities will include leading automation initiatives, ensuring reliability in architecture, and serving as the go-to expert for critical incident escalation to guarantee a scalable and efficient platform.
Nutanix Platform Management Responsibilities:
Design, deploy, and maintain enterprise-scale Nutanix AHV clusters and manage Prism Central for multi-cluster operations.
Exhibit expert-level proficiency with Nutanix CLI (nCLI and acli) for advanced operations and automation.
Create automation scripts using Nutanix REST APIs, Python SDK, PowerShell, and Terraform.
Manage VM templates, golden images, and standardized deployment catalogs.
Design disaster recovery solutions utilizing Leap, Protection Domains, and metro clustering.
Implement network micro-segmentation with Nutanix Flow, including RBAC and encryption tactics.
Lead Level 3 troubleshooting through advanced diagnostics and log analysis.
Configure high availability and optimize performance for critical workloads.
Oversee AHV networking with OVS bridges, VLANs, and implement resource reservations.
Architect and maintain hybrid cloud solutions across Nutanix HCI, AWS, and GCP environments.
Cloud Platform Engineering Responsibilities:
Further responsibilities in cloud platform engineering will be communicated during the interview process to ensure alignment with your expertise.
Join AbbVie, a global leader in biopharmaceutical innovation, as a Senior Site Reliability Engineer. In this role, you will be instrumental in enhancing our cloud infrastructure, ensuring optimal performance and reliability of our applications. Collaborate with cross-functional teams to design, develop, and implement solutions that support our mission to improve lives.
Join circleso as a Senior Site Reliability Engineer and be at the forefront of ensuring the reliability, availability, and performance of our cloud-based services. You will work closely with development teams to design, implement, and maintain scalable systems while proactively identifying and resolving issues.
Summary
The Wikimedia Foundation is on the lookout for a talented Senior Site Reliability Engineer to enhance and maintain the infrastructure that powers the world’s most beloved encyclopedia, Wikipedia, serving millions globally. Our Site Reliability Engineering (SRE) team is dedicated to ensuring that our globally recognized top-10 website operates smoothly while innovating to further our mission: to empower everyone to share in the sum of all knowledge. As a member of the SRE team, you will join a diverse and globally distributed group of engineers passionate about exploring, experimenting, and adopting new technologies. We believe in transparency, sharing our documentation, code, and configuration as open source. Our production systems are powered entirely by open-source software, and we encourage you to review our work without any login requirements. If you are intrigued by the challenge of improving the reliability and delivery of one of the Internet’s top websites and thrive in a remote-first environment, we invite you to consider joining us.
Responsibilities in Shipping & Handling
Architect, scale, and secure infrastructure to meet evolving business demands, employing fault-tolerant designs, performance testing, profiling, and strategic capacity planning.
Develop, implement, and sustain automation, monitoring, and alerting systems, alongside disaster recovery protocols.
Promote scalability and maintainability through microservices architecture, decoupling concerns, effective data modeling, job queuing, and application layering.
Enhance and oversee our CI/CD pipeline to ensure seamless and secure production deployments via automated testing and verification.
Evaluate and confirm system performance and accuracy concerning response times and throughput.
Engage in peer reviews and testing, contributing to automated testing suites and participating in design reviews for new features, products, and systems.
Partake in an on-call rotation for system support.
Position: Senior Site Reliability Engineer
Location: Seattle, WA
Duration: 12 months
Interview: In-person for local candidates or via Phone + Skype
As a Senior Site Reliability Engineer, you will play a pivotal role in the ongoing maintenance and administration of enterprise-level internet systems. Your primary responsibility will be to diagnose and resolve operational issues, ensuring the seamless functioning of our infrastructure. You will also be tasked with developing tools and scripts to enhance these processes.
Collaboration with various teams will be essential to document our enterprise infrastructure and monitoring systems effectively. Additionally, you'll oversee the planning and execution of projects ranging from small to large scale within our Technology teams, reporting directly to your manager. This role demands a high level of technical expertise in both traditional enterprise systems and cutting-edge cloud-native applications.
If you share our belief that a simple cup of coffee can transform lives and enhance experiences, we invite you to join us in delivering exceptional services to customers worldwide.
About Us
At Vultr, we are dedicated to revolutionizing cloud infrastructure by making it accessible, efficient, and cost-effective for enterprises and AI innovators globally. With 32 strategically located data centers worldwide, we proudly serve hundreds of thousands of customers across 185 countries, offering dynamic solutions such as Cloud Compute, Cloud GPU, Bare Metal, and Cloud Storage. As of December 2024, Vultr achieved a remarkable $3.5 billion valuation through equity financing, solidifying our position as the largest privately-held cloud infrastructure provider.
Our Commitment to Employees
Comprehensive medical benefits with 100% company-paid premiums for employee-only plans, including dental and vision coverage.
A robust 401(k) plan with 100% matching up to 4%, featuring immediate vesting.
Annual professional development reimbursement of $2,500.
Generous leave policy including 11 holidays, paid time off accrual, and rollover options.
Increased PTO at 3-year and 10-year anniversaries, a one-month paid sabbatical every five years, and annual anniversary bonuses.
$500 for remote office setup in the first year and $400 each subsequent year for new equipment.
Internet reimbursement of up to $75 per month.
Gym membership reimbursement up to $50 per month.
Company-paid subscription to Wellable for wellness initiatives.
Join Our Team
Vultr is seeking a Senior Site Reliability Engineer in our Core Cloud Engineering team, reporting directly to the Director of Core Cloud Engineering. This position requires extensive knowledge in large-scale distributed systems, infrastructure automation, and hypervisor platform operations. The ideal candidate will excel in systems engineering with an emphasis on reliability, scalability, and observability to ensure our cloud services deliver optimal performance and resilience for our 1.5 million users.
Key Responsibilities
Production Control Plane Operations: Manage and scale Vultr’s control plane, ensuring consistent availability, accuracy, and performance across our global data centers.
Hypervisor & Infrastructure Reliability: Develop, implement, and sustain automation processes for managing hypervisor fleets (KVM, QEMU, libvirt) and their supporting infrastructure.
As a Cloud Site Reliability Engineer, you will be responsible for deploying innovative solutions within the public cloud environment, specifically utilizing AWS services. You will create and manage configuration templates designed for scalable infrastructure, including AWS components like EFS, EC2, and RDS. Collaborating closely with the Scrum Master, you will ensure the project requirements are met within an agile development setting.
Key Responsibilities:
• Contribute to architectural design to enhance system consistency, security, maintainability, and flexibility.
• Assist architects in creating highly scalable and automated deployments for diverse applications.
• Develop configuration templates using established architectural blueprints.
• Ensure the development of robust and scalable services across public cloud platforms, including AWS and GCP.
• Monitor and assess system performance to ensure optimal operation.
AI is revolutionizing the operational landscape for businesses, yet many enterprises find themselves hindered in their efforts to effectively implement AI tools, agents, and workflows. At Runlayer, we are dedicated to dismantling these barriers.
Our innovative team has developed AI Actions for OpenAI, delivered Zapier Agents to millions, and launched the first remote MCP server in partnership with Anthropic. With the co-creator of MCP on our cap table, we are establishing the essential platform that enterprises need to leverage AI securely and effectively.
Runlayer serves as a unified platform for MCPs, Skills, and Agents. We provide purpose-built security, fine-grained governance, and complete observability, enabling organizations to advance their AI initiatives with confidence. With $11M raised from Khosla Ventures and Felicis, we proudly support clients such as Gusto, Instacart, and Opendoor.
As a compact team of 25, primarily engineers, we thrive on rapid deployment and innovation. If you aspire to be at the forefront of AI implementation, now is the time to join us.
In the role of Site Reliability Engineer, you will be responsible for ensuring the reliability, performance, and scalability of Runlayer's infrastructure as we expand to meet the needs of our enterprise customers across both cloud and on-prem environments.
Why You'll Thrive Here
Impact: Construct the foundational infrastructure for the enterprise MCP platform, directly facilitating large-scale AI adoption.
Excellence: Collaborate closely with founders and a small, experienced engineering team, delivering swiftly in a high-growth setting.
Ownership: Take full responsibility for reliability from database performance to incident response and CI/CD pipelines.
What You'll Do
Oversee the reliability and performance of our cloud infrastructure across AWS (ECS, Aurora, CloudWatch) and GCP.
Manage and optimize Kubernetes clusters and container orchestration.
Lead database reliability engineering efforts, including performance tuning and scaling.
Develop and maintain CI/CD pipelines for efficient and secure deployments.
Conduct incident response and participate in on-call rotations.
Collaborate with product engineers to design scalable and resilient systems.
What We're Looking For
Proven experience with AWS services including ECS, Aurora, and CloudWatch.
Expertise in Kubernetes management and container orchestration.
Strong background in database reliability engineering.
Solid understanding of CI/CD methodologies and tools.
Effective incident response skills and a proactive approach to system reliability.
Ability to work collaboratively in a fast-paced environment with a focus on innovation.
Join Zilliz, a pioneering startup at the forefront of developing cutting-edge vector database solutions designed for enterprise-grade AI applications. Founded by the visionary engineers behind Milvus, the leading open-source vector database, we are on a mission to revolutionize data management for AI applications, making vector databases accessible to every organization. At Zilliz, you will play a crucial role in shaping the future of AI.
Unifonic operates as a remote-first company in the CPaaS sector, providing communication solutions to over 5,000 businesses. With a team of 500, Unifonic supports clients in building stronger customer connections. The Engineering team at Unifonic is responsible for designing, building, and maintaining the systems that power the company’s products. Team members collaborate closely with other departments to ensure technology aligns with customer needs. Creativity and new ideas are encouraged across the group.
Role overview
The Senior Site Reliability Engineer joins the Production Operations (Live) team. This role centers on ensuring the reliability, scalability, and resilience of Unifonic’s cloud infrastructure and distributed messaging platforms. The SRE team works to keep systems running smoothly at all times and continually seeks ways to improve performance and stability.
What you will do
Maintain the reliability, uptime, and scalability of key production services around the clock.
Participate in the on-call rotation, respond to incidents, troubleshoot live production issues, and lead post-incident reviews.
Create and update operational playbooks and escalation paths to help reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
Monitor service level objectives (SLOs), conduct chaos testing, plan for capacity, and address reliability risks as they arise.
Join our dynamic team at akuity as a Senior Site Reliability Engineer, where you'll play a pivotal role in enhancing the reliability and performance of our systems. In this exciting remote position, you will collaborate with cross-functional teams to implement innovative solutions that ensure seamless service delivery.
Your expertise will be vital in monitoring system health, optimizing performance, and troubleshooting issues to provide exceptional user experiences. If you are passionate about building scalable and robust infrastructures, we want to hear from you!
Join Our Team:
At HavocAI, we are at the forefront of collaborative autonomy, leading the way in the development of autonomous surface vessels for a variety of defense and commercial maritime operations. Our mission is to rapidly expand and innovate solutions that address complex human challenges, while prioritizing life-saving technologies. We are in search of passionate individuals committed to pushing boundaries and making a meaningful impact.
Role Overview
We are looking for a Senior Site Reliability Engineer (SRE) with a minimum of 7 years of experience in designing, operating, and scaling robust distributed systems. In this pivotal role, you will serve as a technical leader in our Cloud Platform team, ensuring the reliability, performance, and resilience of critical services that support autonomy, simulation, and data-heavy workloads.
You will collaborate with various teams, including Cloud Platform, DevOps, Data Engineering, and Autonomy, to define reliability standards, enhance operational maturity, and create systems that effectively scale under real-world conditions. The ideal candidate will possess deep technical expertise, demonstrate composure under pressure, and be experienced in managing end-to-end reliability outcomes.
Company Background
Censys is dedicated to building the most comprehensive and reliable map of the Internet. Our mission is to empower users with real-time Internet intelligence and actionable threat insights, catering to global governments, over 50% of the Fortune 500, and leading threat intelligence providers worldwide.
Location
This is a fully remote position within the United States.
Role Summary
As a Senior Site Reliability Engineer (SRE) on the Infrastructure and Operations team, you will play a crucial role in designing, building, and deploying tools that enhance the efficiency of our development teams and production applications. We are seeking skilled engineers who are passionate about cloud-native technologies and committed to improving our microservice architecture's reliability and operational maturity.
Focusing on Developer Efficiency and Experience, you will help streamline engineering workflows, support our Software Development Life Cycle (SDLC), and empower developers to confidently build, deploy, and manage their services within the platform.
What You'll Do
Develop and maintain tools to support applications running on Kubernetes and Google Cloud Platform.
Collaborate with development teams to facilitate the building, shipping, and deploying of services and applications, ensuring resilience and reliability.
Monitor and ensure the smooth operation of our production environments, assisting developers in debugging complex issues and capturing the four golden signals of performance.
Contribute to the creation of a self-service platform that accelerates developer velocity, including service catalogs, repository tooling, and comprehensive documentation.
Participate in a shared on-call rotation, embracing end-to-end service ownership alongside development teams.
About Hashgraph:
Hashgraph is an innovative and rapidly growing software company dedicated to supporting, developing, and maintaining Hedera, an open-source proof-of-stake platform. Hedera is EVM-compatible and designed to cater to the demands of enterprise and web3 applications, focusing on speed, security, stability, and sustainability. The public network of Hedera is governed by leading organizations across 11 sectors and 14 regions, ensuring robust oversight of the decentralized platform's development and direction.
About the Role
We are seeking a Senior Site Reliability Engineer to join the HashSphere engineering team. In this pivotal role, you will assist in designing, building, and integrating essential product features for enterprises utilizing Hiero, our private distributed ledger technology. This greenfield project is at the forefront of decentralized systems and cloud technologies, with a strong emphasis on security, privacy, and scalability.
Your expertise in distributed systems engineering, coupled with your software development skills and knowledge of industry-standard SRE and DevOps practices, will be crucial in delivering core platform services. You will contribute to a highly scalable, mission-critical infrastructure product utilized by some of the largest organizations in finance, supply chain, and healthcare sectors.
If you possess experience in designing scalable, reliable, and secure distributed system architectures within AWS, GCP, or Azure, and are eager to collaborate with a passionate team to build pioneering technology, this could be the perfect opportunity for you.
Join Crexi as a Senior Site Reliability Engineer, where you will play a crucial role in maintaining and enhancing our infrastructure. You will be responsible for ensuring our systems are reliable, scalable, and secure. Collaborate with cross-functional teams to implement best practices in site reliability engineering, contribute to incident response, and drive automation initiatives. If you are passionate about optimizing system performance and enhancing user experience, we want to hear from you!
Full-time|$133.1K/yr - $148K/yr|Remote|New York City, NY
Site Reliability Engineer Overview:
Join Weedmaps as a Site Reliability Engineer and collaborate across departments, including application, infrastructure, and quality teams, to elevate the performance, reliability, resilience, and scalability of our web services at Weedmaps.com. As a cloud-native organization, we run 100% of our services in Docker on Kubernetes within AWS's public cloud. Our operations utilize observability, monitoring, CI/CD automation, and custom tooling, enabling us to deploy multiple production releases daily. Your daily responsibilities will focus on applying your engineering expertise to enhance system monitoring, minimize developer toil, configure CI workflows, and optimize our deployment pipelines. You will serve as a knowledge reference for development teams, ensuring they utilize consistent tools for metrics, logging, building, and deployment. Collaborating closely with both development and infrastructure teams, you will identify critical service-specific metrics that require monitoring, and you will help application development teams create libraries for seamless service instrumentation.
The impact you'll make:
Collaborate with stakeholders to establish and promote best practices for monitoring and CI/CD pipelines.
Troubleshoot issues related to deployment within our CI pipeline.
Actively promote the DevOps culture at Weedmaps.
Identify opportunities for automation and advocate for the codification of processes.
Promote best practices regarding collaboration, reliability, security, and performance across all partner teams.
Take ownership of application configuration and scaling for specified services, ensuring adherence to organizational practices.
Develop and optimize synthetic monitoring flows.
What you've accomplished:
A minimum of 2 years of development experience in startup or mid-sized environments.
Proficiency in programming languages such as Python, Go, Node, Ruby, or Elixir.
Knowledge of containerization technologies, particularly Docker (Kubernetes experience is a plus).
Strong communication skills, a positive demeanor, and the ability to provide and receive constructive feedback.
Professional experience with cloud-native observability standards including OpenMetrics, OpenTracing, and OpenCensus.
Expertise in using and configuring modern CI/CD workflows.
Deep understanding of SLIs, SLOs, and SLAs at both service and business levels.
Familiarity with golden signals and their significance in monitoring.