Experience Level
Entry Level
Qualifications
Key Responsibilities
Design and manage the infrastructure for our SaaS offerings on AWS through Terraform.
Perform IT-related tasks including user onboarding and account management.
Contribute to software projects that enhance our products, focusing on areas like authentication, reliability, and observability.
Required Qualifications
Some prior experience in Site Reliability Engineering or cloud operations, preferably with AWS.
A minimum of 1 year in software development or scripting.
A willingness to work across both cloud infrastructure and IT as needed.
Strong attention to detail with a passion for building high-quality systems.
About the job
At HomeVision, we're redefining the landscape of real estate valuation, striving to foster a more efficient, transparent, and equitable housing market. Utilizing cutting-edge technologies such as Natural Language Processing (NLP), computer vision, and large language models (LLMs), we enhance the appraisal process, empowering appraisers to work with greater efficiency. Supported by Initialized Capital, we are rapidly expanding and in search of a Site Reliability Engineer (SRE) to join our team and assist in scaling our innovative solutions.
This role presents an exciting opportunity to embark on an engineering career, providing exposure to a diverse range of skills. You'll collaborate with a dedicated team responsible for overseeing all facets of our technology infrastructure, including AWS, IT, and AI. Your primary focus will be on maintaining the security and scalability of our platform, while also engaging in larger projects that support product enhancements and AI integration.
About HomeVision
HomeVision is at the forefront of transforming the real estate industry by modernizing the valuation process. Our mission is to create a housing market that is fairer and more efficient, leveraging advanced technologies to achieve our goals.
Summary
The Wikimedia Foundation is on the lookout for a talented Senior Site Reliability Engineer to enhance and maintain the infrastructure that powers the world’s most beloved encyclopedia, Wikipedia, serving millions globally. Our Site Reliability Engineering (SRE) team is dedicated to ensuring that our globally recognized top-10 website operates smoothly while innovating to further our mission: to empower everyone to share in the sum of all knowledge. As a member of the SRE team, you will join a diverse and globally distributed group of engineers passionate about exploring, experimenting, and adopting new technologies. We believe in transparency, sharing our documentation, code, and configuration as open source. Our production systems are powered entirely by open-source software, and we encourage you to review our work without any login requirements. If you are intrigued by the challenge of improving the reliability and delivery of one of the Internet’s top websites and thrive in a remote-first environment, we invite you to consider joining us.
Responsibilities in Shipping & Handling
Architect, scale, and secure infrastructure to meet evolving business demands, employing fault-tolerant designs, performance testing, profiling, and strategic capacity planning.
Develop, implement, and sustain automation, monitoring, and alerting systems, alongside disaster recovery protocols.
Promote scalability and maintainability through microservices architecture, decoupling concerns, effective data modeling, job queuing, and application layering.
Enhance and oversee our CI/CD pipeline to ensure seamless and secure production deployments via automated testing and verification.
Evaluate and confirm system performance and accuracy concerning response times and throughput.
Engage in peer reviews and testing, contributing to automated testing suites and participating in design reviews for new features, products, and systems.
Partake in an on-call rotation for system support.
AI is revolutionizing the operational landscape for businesses, yet many enterprises find themselves hindered in their efforts to effectively implement AI tools, agents, and workflows. At Runlayer, we are dedicated to dismantling these barriers.
Our innovative team has developed AI Actions for OpenAI, delivered Zapier Agents to millions, and launched the first remote MCP server in partnership with Anthropic. With the co-creator of MCP on our cap table, we are establishing the essential platform that enterprises need to leverage AI securely and effectively.
Runlayer serves as a unified platform for MCPs, Skills, and Agents. We provide purpose-built security, fine-grained governance, and complete observability, enabling organizations to advance their AI initiatives with confidence. With $11M raised from Khosla Ventures and Felicis, we proudly support clients such as Gusto, Instacart, and Opendoor.
As a compact team of 25, primarily engineers, we thrive on rapid deployment and innovation. If you aspire to be at the forefront of AI implementation, now is the time to join us.
In the role of Site Reliability Engineer, you will be responsible for ensuring the reliability, performance, and scalability of Runlayer's infrastructure as we expand to meet the needs of our enterprise customers across both cloud and on-prem environments.
Why You'll Thrive Here
Impact: Construct the foundational infrastructure for the enterprise MCP platform, directly facilitating large-scale AI adoption.
Excellence: Collaborate closely with founders and a small, experienced engineering team, delivering swiftly in a high-growth setting.
Ownership: Take full responsibility for reliability from database performance to incident response and CI/CD pipelines.
What You'll Do
Oversee the reliability and performance of our cloud infrastructure across AWS (ECS, Aurora, CloudWatch) and GCP.
Manage and optimize Kubernetes clusters and container orchestration.
Lead database reliability engineering efforts, including performance tuning and scaling.
Develop and maintain CI/CD pipelines for efficient and secure deployments.
Conduct incident response and participate in on-call rotations.
Collaborate with product engineers to design scalable and resilient systems.
What We're Looking For
Proven experience with AWS services including ECS, Aurora, and CloudWatch.
Expertise in Kubernetes management and container orchestration.
Strong background in database reliability engineering.
Solid understanding of CI/CD methodologies and tools.
Effective incident response skills and a proactive approach to system reliability.
Ability to work collaboratively in a fast-paced environment with a focus on innovation.
Join our dynamic team at Akuity as a Senior Site Reliability Engineer, where you'll play a pivotal role in enhancing the reliability and performance of our systems. In this exciting remote position, you will collaborate with cross-functional teams to implement innovative solutions that ensure seamless service delivery.
Your expertise will be vital in monitoring system health, optimizing performance, and troubleshooting issues to provide exceptional user experiences. If you are passionate about building scalable and robust infrastructures, we want to hear from you!
Unifonic operates as a remote-first company in the CPaaS sector, providing communication solutions to over 5,000 businesses. With a team of 500, Unifonic supports clients in building stronger customer connections. The Engineering team at Unifonic is responsible for designing, building, and maintaining the systems that power the company’s products. Team members collaborate closely with other departments to ensure technology aligns with customer needs. Creativity and new ideas are encouraged across the group.
Role overview
The Senior Site Reliability Engineer joins the Production Operations (Live) team. This role centers on ensuring the reliability, scalability, and resilience of Unifonic’s cloud infrastructure and distributed messaging platforms. The SRE team works to keep systems running smoothly at all times and continually seeks ways to improve performance and stability.
What you will do
Maintain the reliability, uptime, and scalability of key production services around the clock.
Participate in the on-call rotation, respond to incidents, troubleshoot live production issues, and lead post-incident reviews.
Create and update operational playbooks and escalation paths to help reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
Monitor service level objectives (SLOs), conduct chaos testing, plan for capacity, and address reliability risks as they arise.
As a Principal Site Reliability Engineer at Arcadia, you will play a pivotal role in ensuring the reliability, scalability, and performance of our systems. You will lead initiatives to design and implement robust solutions while collaborating with cross-functional teams to drive operational excellence.
About Hashgraph:
Hashgraph is an innovative and rapidly growing software company dedicated to supporting, developing, and maintaining Hedera, an open-source proof-of-stake platform. Hedera is EVM-compatible and designed to cater to the demands of enterprise and web3 applications, focusing on speed, security, stability, and sustainability. The public network of Hedera is governed by leading organizations across 11 sectors and 14 regions, ensuring robust oversight of the decentralized platform's development and direction.
About the Role
We are seeking a Senior Site Reliability Engineer to join the HashSphere engineering team. In this pivotal role, you will assist in designing, building, and integrating essential product features for enterprises utilizing Hiero, our private distributed ledger technology. This greenfield project is at the forefront of decentralized systems and cloud technologies, with a strong emphasis on security, privacy, and scalability.
Your expertise in distributed systems engineering, coupled with your software development skills and knowledge of industry-standard SRE and DevOps practices, will be crucial in delivering core platform services. You will contribute to a highly scalable, mission-critical infrastructure product utilized by some of the largest organizations in finance, supply chain, and healthcare sectors.
If you possess experience in designing scalable, reliable, and secure distributed system architectures within AWS, GCP, or Azure, and are eager to collaborate with a passionate team to build pioneering technology, this could be the perfect opportunity for you.
Full-time|$158K/yr - $227K/yr|Remote|Remote - United States; United States of America
ABOUT JUUL LABS:
At Juul Labs, we are dedicated to revolutionizing the experience of adult smokers by transitioning them away from traditional combustible cigarettes. Our mission is to eliminate their use and prevent underage access to our products. We tackle this global health challenge with a focus on quality, innovation, and research. Supported by prominent technology investors, we aim for excellence not only in our products but also in our talent acquisition. We embrace diversity and are united by our mission. We are seeking the world's best engineers, scientists, designers, product managers, operations experts, and customer service professionals. If you are ready to advance your career with us, we encourage you to explore this opportunity.
ROLE OVERVIEW:
As a Senior Site Reliability Engineer (SRE), you will take ownership of the operational stability and performance of Juul's hybrid cloud infrastructure (Nutanix, AWS/GCP). Your responsibilities will include leading automation initiatives, ensuring reliability in architecture, and serving as the go-to expert for critical incident escalation to guarantee a scalable and efficient platform.
Nutanix Platform Management Responsibilities:
Design, deploy, and maintain enterprise-scale Nutanix AHV clusters and manage Prism Central for multi-cluster operations.
Exhibit expert-level proficiency with Nutanix CLI (nCLI and acli) for advanced operations and automation.
Create automation scripts using Nutanix REST APIs, Python SDK, PowerShell, and Terraform.
Manage VM templates, golden images, and standardized deployment catalogs.
Design disaster recovery solutions utilizing Leap, Protection Domains, and metro clustering.
Implement network micro-segmentation with Nutanix Flow, including RBAC and encryption tactics.
Lead Level 3 troubleshooting through advanced diagnostics and log analysis.
Configure high availability and optimize performance for critical workloads.
Oversee AHV networking with OVS bridges, VLANs, and implement resource reservations.
Architect and maintain hybrid cloud solutions across Nutanix HCI, AWS, and GCP environments.
Cloud Platform Engineering Responsibilities:
Further responsibilities in cloud platform engineering will be communicated during the interview process to ensure alignment with your expertise.
Join Arista Networks as a Site Reliability Engineer and play a critical role in ensuring the reliability and performance of our cutting-edge cloud networking solutions. In this fully remote position, you will collaborate with cross-functional teams to enhance our systems and foster a culture of continuous improvement.
Full-time|$133.1K/yr - $148K/yr|Remote|New York City, NY
Site Reliability Engineer
Overview:
Join Weedmaps as a Site Reliability Engineer and collaborate across departments, including application, infrastructure, and quality teams, to elevate the performance, reliability, resilience, and scalability of our web services at Weedmaps.com. As a cloud-native organization, we run 100% of our services in Docker on Kubernetes within AWS's public cloud. Our operations utilize observability, monitoring, CI/CD automation, and custom tooling, enabling us to deploy multiple production releases daily. Your daily responsibilities will focus on applying your engineering expertise to enhance system monitoring, minimize developer toil, configure CI workflows, and optimize our deployment pipelines. You will serve as a knowledge reference for development teams, ensuring they utilize consistent tools for metrics, logging, building, and deployment. Collaborating closely with both development and infrastructure teams, you will identify critical service-specific metrics that require monitoring, and you will help application development teams create libraries for seamless service instrumentation.
The impact you'll make:
Collaborate with stakeholders to establish and promote best practices for monitoring and CI/CD pipelines.
Troubleshoot issues related to deployment within our CI pipeline.
Actively promote the DevOps culture at Weedmaps.
Identify opportunities for automation and advocate for the codification of processes.
Promote best practices regarding collaboration, reliability, security, and performance across all partner teams.
Take ownership of application configuration and scaling for specified services, ensuring adherence to organizational practices.
Develop and optimize synthetic monitoring flows.
What you've accomplished:
A minimum of 2 years of development experience in startup or mid-sized environments.
Proficiency in programming languages such as Python, Go, Node, Ruby, or Elixir.
Knowledge of containerization technologies, particularly Docker (Kubernetes experience is a plus).
Strong communication skills, a positive demeanor, and the ability to provide and receive constructive feedback.
Professional experience with cloud-native observability standards including OpenMetrics, OpenTracing, and OpenCensus.
Expertise in using and configuring modern CI/CD workflows.
Deep understanding of SLIs, SLOs, and SLAs at both service and business levels.
Familiarity with golden signals and their significance in monitoring.
At HomeVision, we are pioneering innovations in real estate valuation to foster a more efficient, transparent, and equitable housing market. By harnessing advanced technologies such as Natural Language Processing (NLP), computer vision, and large language models (LLMs), we are transforming the appraisal process, enabling appraisers to enhance their productivity. Backed by Initialized Capital, we are experiencing rapid growth and are on the lookout for a dynamic Site Reliability Engineer (SRE) to aid in our scaling efforts.
Key Responsibilities
Design and manage the infrastructure supporting our SaaS offerings, predominantly utilizing AWS.
Develop tools and oversee platform components to assist our development teams.
Engage in software development initiatives, focusing on areas such as authentication, reliability, and observability.
Support daily operations, including setting up testing environments and overseeing deployments.
Address IT-related tasks like user onboarding and account management.
Maintain a flexible work schedule while ensuring availability until 6 PM Pacific Time for internal support and monitoring.
Qualifications
Minimum of 2 years of experience in Site Reliability Engineering or cloud operations, with AWS experience preferred.
At least 1 year of software development experience.
A data-driven mindset.
A readiness to work across cloud infrastructure and IT as required.
Meticulous attention to detail and a commitment to creating high-quality systems.
Eligibility Criteria
Candidates must reside in the US or Puerto Rico.
Currently, we are unable to sponsor work visas; thus, candidates must be authorized to work in the US without sponsorship.
Preferred Qualifications
Familiarity with Terraform or other Infrastructure as Code (IaC) tools.
Interest and experience in database administration.
Candidates located in Seattle or San Francisco will receive additional consideration.
Our Offerings
Competitive salary, equity, and comprehensive health benefits.
Significant ownership and autonomy in your role.
Support for your professional development and growth.
A fully remote and flexible work environment.
We request that no recruiters or automated submissions apply.
Site Reliability Engineer (SRE)
Global (UTC-3 preferred)
At Axiom, our mission is to empower developers by providing swift and insightful access to their data. As a remote-first, globally distributed organization, we are dedicated to creating a cloud-native, serverless data analytics platform. Axiom revolutionizes the way developers and organizations manage their data, allowing for unlimited data transmission with economical storage solutions and rapid querying capabilities.
As a Site Reliability Engineer at Axiom, you will play a crucial role in ensuring exceptional reliability and performance for our customers. Working alongside backend engineers and product teams, you will focus on designing and maintaining scalable and dependable systems. Our SRE philosophy emphasizes automation, measurement, and continuous enhancement of system reliability and efficiency.
Your core responsibilities include:
Design and maintain a robust, secure, and scalable infrastructure for Axiom Cloud.
Collaborate with engineering teams to establish and refine service level objectives.
Assist in disaster recovery planning, capacity engineering, performance analysis, and system optimization.
Promote best practices for code deployments, contributing to the education of the wider development team.
Implement tools and solutions that enhance system reliability and minimize manual efforts.
Investigate and resolve service incidents, contributing to postmortems and root cause analysis.
Cultivate a culture of monitoring, alerting, and observability within the organization.
As an Associate Principal Engineer specializing in Performance and Site Reliability at Nagarro, you will play a pivotal role in shaping our engineering practices and ensuring the reliability and performance of our systems. You will collaborate with cross-functional teams to design, implement, and optimize scalable solutions that meet our clients’ needs. Your expertise will contribute to enhancing the user experience and system efficiency.
Are you prepared to transform the advertising landscape? At Cognitiv, we are not merely another AdTech firm; we are pioneers reshaping media buying with our advanced Deep Learning Advertising Platform. Since our inception in 2015, we have been leveraging state-of-the-art deep learning technologies and data science to redefine how brands engage with their audiences. Our mission is clear: to infuse intelligence into advertising, delivering unmatched precision, relevance, and impact at scale. Our innovative platform provides advertisers with unparalleled flexibility, whether activating Dynamic Deals through their preferred DSP, utilizing our managed service DSP, or tapping into our groundbreaking ContextGPT product. Joining Cognitiv means being at the forefront of AI-driven advertising solutions, leading change, and achieving remarkable growth in a fast-paced industry. We are currently expanding!
The Role
We are seeking a Senior Site Reliability Engineer to enhance our global network of datacenters and elevate service management across Cognitiv. Your primary focus will be on rapidly expanding our hybrid cloud infrastructure. As a growing organization, we strive to adhere to industry best practices. This position requires an experienced engineer who is eager to learn our environment quickly and help shape our long-term service management strategy.
This role will be based in our Bellevue, WA office with a hybrid work schedule of 3 days in-office (Monday/Tuesday/Wednesday) and 2 days remote (Thursday/Friday).
Responsibilities
Design, implement, and maintain infrastructure across a widening footprint of co-located deployments.
Assess existing physical and network architectures to ensure long-term scalability and growth.
Collaborate with engineering and product teams to accurately scope projects based on core business requirements.
Lead company-wide initiatives to enhance service management surrounding deployments, monitoring, and disaster recovery.
Oversee and maintain shared infrastructure within our AWS environment.
Requirements
Understanding of contemporary datacenter practices with experience in configuring multi-datacenter deployments.
Extensive knowledge of AWS infrastructure, networking, and management practices.
Demonstrated experience with infrastructure as code and related tools.
InvestorFlow stands out as the premier provider of an industry-specialized CRM built on Salesforce, complemented by digital portals that empower alternative asset firms to discover opportunities, cultivate and manage relationships, and transform insights into actionable strategies, all while enhancing productivity and transparency.
We are seeking a dedicated Senior Site Reliability Engineer who will play a key role in ensuring our systems' reliability through operational excellence, configuration-as-code modifications, and active collaboration with Engineering and DevOps teams. This role involves participating in architectural design reviews, validating reliability standards, auditing production systems, and confirming that systems meet SRE production-readiness criteria. While the SRE will not be responsible for building infrastructure or Infrastructure-as-Code (IaC), familiarity with IaC concepts (especially Terraform/HCL) is advantageous for assessing and influencing configurations.
Join Our Team at Customer.io
At Customer.io, we empower over 8,000 companies, ranging from innovative startups to established global brands, to send billions of tailored emails, push notifications, in-app messages, and SMS daily. Our platform drives automated communication that resonates with users. Utilizing real-time behavioral data, we enable teams to craft smarter, more relevant messages. Our tech stack includes Go, React, Ember, and cutting-edge AI, allowing us to deliver quickly and scale confidently. We are seeking a Senior Site Reliability Engineer to enhance our infrastructure, minimize operational challenges, and boost reliability as we continue to grow. If you possess experience with high-scale systems and have a passion for optimizing platforms for both developers and customers, we want to connect with you!
Join Our Team:
At HavocAI, we are at the forefront of collaborative autonomy, leading the way in the development of autonomous surface vessels for a variety of defense and commercial maritime operations. Our mission is to rapidly expand and innovate solutions that address complex human challenges, while prioritizing life-saving technologies. We are in search of passionate individuals committed to pushing boundaries and making a meaningful impact.
Role Overview
We are looking for a Senior Site Reliability Engineer (SRE) with a minimum of 7 years of experience in designing, operating, and scaling robust distributed systems. In this pivotal role, you will serve as a technical leader in our Cloud Platform team, ensuring the reliability, performance, and resilience of critical services that support autonomy, simulation, and data-heavy workloads.
You will collaborate with various teams, including Cloud Platform, DevOps, Data Engineering, and Autonomy, to define reliability standards, enhance operational maturity, and create systems that effectively scale under real-world conditions. The ideal candidate will possess deep technical expertise, demonstrate composure under pressure, and be experienced in managing end-to-end reliability outcomes.
Company Background
Censys is dedicated to building the most comprehensive and reliable map of the Internet. Our mission is to empower users with real-time Internet intelligence and actionable threat insights, catering to global governments, over 50% of the Fortune 500, and leading threat intelligence providers worldwide.
Location
This is a fully remote position within the United States.
Role Summary
As a Senior Site Reliability Engineer (SRE) on the Infrastructure and Operations team, you will play a crucial role in designing, building, and deploying tools that enhance the efficiency of our development teams and production applications. We are seeking skilled engineers who are passionate about cloud-native technologies and committed to improving our microservice architecture's reliability and operational maturity.
Focusing on Developer Efficiency and Experience, you will help streamline engineering workflows, support our Software Development Life Cycle (SDLC), and empower developers to confidently build, deploy, and manage their services within the platform.
What You'll Do
Develop and maintain tools to support applications running on Kubernetes and Google Cloud Platform.
Collaborate with development teams to facilitate the building, shipping, and deploying of services and applications, ensuring resilience and reliability.
Monitor and ensure the smooth operation of our production environments, assisting developers in debugging complex issues and capturing the four golden signals of performance.
Contribute to the creation of a self-service platform that accelerates developer velocity, including service catalogs, repository tooling, and comprehensive documentation.
Participate in a shared on-call rotation, embracing end-to-end service ownership alongside development teams.
Join our dynamic team at Ditto as a Site Reliability Engineer, where you'll play a pivotal role in enhancing our platform's performance and reliability. You'll collaborate with cross-functional teams to ensure the seamless operation of our services while implementing best practices in automation, monitoring, and incident response.
Dec 17, 2025
At HomeVision, we're redefining the landscape of real estate valuation, striving to foster a more efficient, transparent, and equitable housing market. Utilizing cutting-edge technologies such as Natural Language Processing (NLP), computer vision, and large language models (LLMs), we enhance the appraisal process, empowering appraisers to work with greater e…
Summary The Wikimedia Foundation is on the lookout for a talented Senior Site Reliability Engineer to enhance and maintain the infrastructure that powers the world’s most beloved encyclopedia, Wikipedia, serving millions globally. Our Site Reliability Engineering (SRE) team is dedicated to ensuring that our globally recognized top-10 website operates smoothly while innovating to further our mission: to empower everyone to share in the sum of all knowledge. As a member of the SRE team, you will join a diverse and globally distributed group of engineers passionate about exploring, experimenting, and adopting new technologies. We believe in transparency, sharing our documentation, code, and configuration as open source. Our production systems are powered entirely by open-source software, and we encourage you to review our work without any login requirements. If you are intrigued by the challenge of improving the reliability and delivery of one of the Internet’s top websites and thrive in a remote-first environment, we invite you to consider joining us.
Responsibilities in Shipping & HandlingArchitect, scale, and secure infrastructure to meet evolving business demands, employing fault-tolerant designs, performance testing, profiling, and strategic capacity planning.Develop, implement, and sustain automation, monitoring, and alerting systems, alongside disaster recovery protocols.Promote scalability and maintainability through microservices architecture, decoupling concerns, effective data modeling, job queuing, and application layering.Enhance and oversee our CI/CD pipeline to ensure seamless and secure production deployments via automated testing and verification.Evaluate and confirm system performance and accuracy concerning response times and throughput.Engage in peer reviews and testing, contributing to automated testing suites and participating in design reviews for new features, products, and systems.Partake in an on-call rotation for system support.
AI is revolutionizing the operational landscape for businesses, yet many enterprises find themselves hindered in their efforts to effectively implement AI tools, agents, and workflows. At Runlayer, we are dedicated to dismantling these barriers. Our innovative team has developed AI Actions for OpenAI, delivered Zapier Agents to millions, and launched the first remote MCP server in partnership with Anthropic. With the co-creator of MCP on our cap table, we are establishing the essential platform that enterprises need to leverage AI securely and effectively. Runlayer serves as a unified platform for MCPs, Skills, and Agents. We provide purpose-built security, fine-grained governance, and complete observability, enabling organizations to advance their AI initiatives with confidence. With $11M raised from Khosla Ventures and Felicis, we proudly support clients such as Gusto, Instacart, and Opendoor. As a compact team of 25, primarily engineers, we thrive on rapid deployment and innovation. If you aspire to be at the forefront of AI implementation, now is the time to join us. In the role of Site Reliability Engineer, you will be responsible for ensuring the reliability, performance, and scalability of Runlayer's infrastructure as we expand to meet the needs of our enterprise customers across both cloud and on-prem environments.
Why You'll Thrive Here
Impact: Construct the foundational infrastructure for the enterprise MCP platform, directly facilitating large-scale AI adoption. Excellence: Collaborate closely with founders and a small, experienced engineering team, delivering swiftly in a high-growth setting. Ownership: Take full responsibility for reliability from database performance to incident response and CI/CD pipelines.
What You'll Do
Oversee the reliability and performance of our cloud infrastructure across AWS (ECS, Aurora, CloudWatch) and GCP. Manage and optimize Kubernetes clusters and container orchestration. Lead database reliability engineering efforts, including performance tuning and scaling. Develop and maintain CI/CD pipelines for efficient and secure deployments. Conduct incident response and participate in on-call rotations. Collaborate with product engineers to design scalable and resilient systems.
What We're Looking For
Proven experience with AWS services including ECS, Aurora, and CloudWatch. Expertise in Kubernetes management and container orchestration. Strong background in database reliability engineering. Solid understanding of CI/CD methodologies and tools. Effective incident response skills and a proactive approach to system reliability. Ability to work collaboratively in a fast-paced environment with a focus on innovation.
Join our dynamic team at Akuity as a Senior Site Reliability Engineer, where you'll play a pivotal role in enhancing the reliability and performance of our systems. In this exciting remote position, you will collaborate with cross-functional teams to implement innovative solutions that ensure seamless service delivery. Your expertise will be vital in monitoring system health, optimizing performance, and troubleshooting issues to provide exceptional user experiences. If you are passionate about building scalable and robust infrastructures, we want to hear from you!
Unifonic operates as a remote-first company in the CPaaS sector, providing communication solutions to over 5,000 businesses. With a team of 500, Unifonic supports clients in building stronger customer connections. The Engineering team at Unifonic is responsible for designing, building, and maintaining the systems that power the company’s products. Team members collaborate closely with other departments to ensure technology aligns with customer needs. Creativity and new ideas are encouraged across the group.
Role overview
The Senior Site Reliability Engineer joins the Production Operations (Live) team. This role centers on ensuring the reliability, scalability, and resilience of Unifonic’s cloud infrastructure and distributed messaging platforms. The SRE team works to keep systems running smoothly at all times and continually seeks ways to improve performance and stability.
What you will do
Maintain the reliability, uptime, and scalability of key production services around the clock. Participate in the on-call rotation, respond to incidents, troubleshoot live production issues, and lead post-incident reviews. Create and update operational playbooks and escalation paths to help reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR). Monitor service level objectives (SLOs), conduct chaos testing, plan for capacity, and address reliability risks as they arise.
As a Principal Site Reliability Engineer at Arcadia, you will play a pivotal role in ensuring the reliability, scalability, and performance of our systems. You will lead initiatives to design and implement robust solutions while collaborating with cross-functional teams to drive operational excellence.
About Hashgraph:
Hashgraph is an innovative and rapidly growing software company dedicated to supporting, developing, and maintaining Hedera, an open-source proof-of-stake platform. Hedera is EVM-compatible and designed to cater to the demands of enterprise and web3 applications, focusing on speed, security, stability, and sustainability. The public network of Hedera is governed by leading organizations across 11 sectors and 14 regions, ensuring robust oversight of the decentralized platform's development and direction.
About the Role
We are seeking a Senior Site Reliability Engineer to join the HashSphere engineering team. In this pivotal role, you will assist in designing, building, and integrating essential product features for enterprises utilizing Hiero, our private distributed ledger technology. This greenfield project is at the forefront of decentralized systems and cloud technologies, with a strong emphasis on security, privacy, and scalability. Your expertise in distributed systems engineering, coupled with your software development skills and knowledge of industry-standard SRE and DevOps practices, will be crucial in delivering core platform services. You will contribute to a highly scalable, mission-critical infrastructure product utilized by some of the largest organizations in finance, supply chain, and healthcare sectors. If you possess experience in designing scalable, reliable, and secure distributed system architectures within AWS, GCP, or Azure, and are eager to collaborate with a passionate team to build pioneering technology, this could be the perfect opportunity for you.
Full-time|$158K/yr - $227K/yr|Remote|Remote - United States; United States of America
ABOUT JUUL LABS: At Juul Labs, we are dedicated to revolutionizing the experience of adult smokers by transitioning them away from traditional combustible cigarettes. Our mission is to eliminate their use and prevent underage access to our products. We tackle this global health challenge with a focus on quality, innovation, and research. Supported by prominent technology investors, we aim for excellence not only in our products but also in our talent acquisition. We embrace diversity and are united by our mission. We are seeking the world's best engineers, scientists, designers, product managers, operations experts, and customer service professionals. If you are ready to advance your career with us, we encourage you to explore this opportunity. ROLE OVERVIEW: As a Senior Site Reliability Engineer (SRE), you will take ownership of the operational stability and performance of Juul's hybrid cloud infrastructure (Nutanix, AWS/GCP). Your responsibilities will include leading automation initiatives, ensuring reliability in architecture, and serving as the go-to expert for critical incident escalation to guarantee a scalable and efficient platform. Nutanix Platform Management Responsibilities: Design, deploy, and maintain enterprise-scale Nutanix AHV clusters and manage Prism Central for multi-cluster operations. Exhibit expert-level proficiency with Nutanix CLI (nCLI and acli) for advanced operations and automation. Create automation scripts using Nutanix REST APIs, Python SDK, PowerShell, and Terraform. Manage VM templates, golden images, and standardized deployment catalogs. Design disaster recovery solutions utilizing Leap, Protection Domains, and metro clustering. Implement network micro-segmentation with Nutanix Flow, including RBAC and encryption tactics. Lead Level 3 troubleshooting through advanced diagnostics and log analysis. Configure high availability and optimize performance for critical workloads. 
Oversee AHV networking with OVS bridges, VLANs, and implement resource reservations. Architect and maintain hybrid cloud solutions across Nutanix HCI, AWS, and GCP environments. Cloud Platform Engineering Responsibilities: Further responsibilities in cloud platform engineering will be communicated during the interview process to ensure alignment with your expertise.
Join Arista Networks as a Site Reliability Engineer and play a critical role in ensuring the reliability and performance of our cutting-edge cloud networking solutions. In this fully remote position, you will collaborate with cross-functional teams to enhance our systems and foster a culture of continuous improvement.
Full-time|$133.1K/yr - $148K/yr|Remote|New York City, NY
Site Reliability Engineer Overview: Join Weedmaps as a Site Reliability Engineer and collaborate across departments, including application, infrastructure, and quality teams, to elevate the performance, reliability, resilience, and scalability of our web services at Weedmaps.com. As a cloud-native organization, we run 100% of our services in Docker on Kubernetes within AWS's public cloud. Our operations utilize observability, monitoring, CI/CD automation, and custom tooling, enabling us to deploy multiple production releases daily. Your daily responsibilities will focus on applying your engineering expertise to enhance system monitoring, minimize developer toil, configure CI workflows, and optimize our deployment pipelines. You will serve as a knowledge reference for development teams, ensuring they utilize consistent tools for metrics, logging, building, and deployment. Collaborating closely with both development and infrastructure teams, you will identify critical service-specific metrics that require monitoring, and you will help application development teams create libraries for seamless service instrumentation. The impact you'll make: Collaborate with stakeholders to establish and promote best practices for monitoring and CI/CD pipelines. Troubleshoot issues related to deployment within our CI pipeline. Actively promote the DevOps culture at Weedmaps. Identify opportunities for automation and advocate for the codification of processes. Promote best practices regarding collaboration, reliability, security, and performance across all partner teams. Take ownership of application configuration and scaling for specified services, ensuring adherence to organizational practices. Develop and optimize synthetic monitoring flows. What you've accomplished: A minimum of 2 years of development experience in startup or mid-sized environments. Proficiency in programming languages such as Python, Go, Node, Ruby, or Elixir. 
Knowledge of containerization technologies, particularly Docker (Kubernetes experience is a plus). Strong communication skills, a positive demeanor, and the ability to provide and receive constructive feedback. Professional experience with cloud-native observability standards including OpenMetrics, OpenTracing, and OpenCensus. Expertise in using and configuring modern CI/CD workflows. Deep understanding of SLIs, SLOs, and SLAs at both service and business levels. Familiarity with golden signals and their significance in monitoring.
At HomeVision, we are pioneering innovations in real estate valuation to foster a more efficient, transparent, and equitable housing market. By harnessing advanced technologies such as Natural Language Processing (NLP), computer vision, and large language models (LLMs), we are transforming the appraisal process, enabling appraisers to enhance their productivity. Backed by Initialized Capital, we are experiencing rapid growth and are on the lookout for a dynamic Site Reliability Engineer (SRE) to aid in our scaling efforts.
Key Responsibilities
Design and manage the infrastructure supporting our SaaS offerings, predominantly utilizing AWS. Develop tools and oversee platform components to assist our development teams. Engage in software development initiatives, focusing on areas such as authentication, reliability, and observability. Support daily operations, including setting up testing environments and overseeing deployments. Address IT-related tasks like user onboarding and account management. Maintain a flexible work schedule while ensuring availability until 6 PM Pacific Time for internal support and monitoring.
Qualifications
Minimum of 2 years of experience in Site Reliability Engineering or cloud operations, with AWS experience preferred. At least 1 year of software development experience. A data-driven mindset. A readiness to work across cloud infrastructure and IT as required. Meticulous attention to detail and a commitment to creating high-quality systems.
Eligibility Criteria
Candidates must reside in the US or Puerto Rico. Currently, we are unable to sponsor work visas; thus, candidates must be authorized to work in the US without sponsorship.
Preferred Qualifications
Familiarity with Terraform or other Infrastructure as Code (IaC) tools. Interest and experience in database administration. Candidates located in Seattle or San Francisco will receive additional consideration.
Our Offerings
Competitive salary, equity, and comprehensive health benefits. Significant ownership and autonomy in your role. Support for your professional development and growth. A fully remote and flexible work environment.
We request that no recruiters or automated submissions apply.
Site Reliability Engineer (SRE)
Global (UTC-3 preferred)
At Axiom, our mission is to empower developers by providing swift and insightful access to their data. As a remote-first, globally distributed organization, we are dedicated to creating a cloud-native, serverless data analytics platform. Axiom revolutionizes the way developers and organizations manage their data, allowing for unlimited data transmission with economical storage solutions and rapid querying capabilities. As a Site Reliability Engineer at Axiom, you will play a crucial role in ensuring exceptional reliability and performance for our customers. Working alongside backend engineers and product teams, you will focus on designing and maintaining scalable and dependable systems. Our SRE philosophy emphasizes automation, measurement, and continuous enhancement of system reliability and efficiency.
Your core responsibilities include:
Design and maintain a robust, secure, and scalable infrastructure for Axiom Cloud. Collaborate with engineering teams to establish and refine service level objectives. Assist in disaster recovery planning, capacity engineering, performance analysis, and system optimization. Promote best practices for code deployments, contributing to the education of the wider development team. Implement tools and solutions that enhance system reliability and minimize manual efforts. Investigate and resolve service incidents, contributing to postmortems and root cause analysis. Cultivate a culture of monitoring, alerting, and observability within the organization.
As an Associate Principal Engineer specializing in Performance and Site Reliability at Nagarro, you will play a pivotal role in shaping our engineering practices and ensuring the reliability and performance of our systems. You will collaborate with cross-functional teams to design, implement, and optimize scalable solutions that meet our clients’ needs. Your expertise will contribute to enhancing the user experience and system efficiency.
Are you prepared to transform the advertising landscape? At Cognitiv, we are not merely another AdTech firm; we are pioneers reshaping media buying with our advanced Deep Learning Advertising Platform. Since our inception in 2015, we have been leveraging state-of-the-art deep learning technologies and data science to redefine how brands engage with their audiences. Our mission is clear: to infuse intelligence into advertising, delivering unmatched precision, relevance, and impact at scale. Our innovative platform provides advertisers with unparalleled flexibility, whether activating Dynamic Deals through their preferred DSP, utilizing our managed service DSP, or tapping into our groundbreaking ContextGPT product. Joining Cognitiv means being at the forefront of AI-driven advertising solutions, leading change, and achieving remarkable growth in a fast-paced industry. We are currently expanding!
The Role
We are seeking a Senior Site Reliability Engineer to enhance our global network of datacenters and elevate service management across Cognitiv. Your primary focus will be on rapidly expanding our hybrid cloud infrastructure. As a growing organization, we strive to adhere to industry best practices. This position requires an experienced engineer who is eager to learn our environment quickly and help shape our long-term service management strategy. This role will be based in our Bellevue, WA office with a hybrid work schedule of 3 days in-office (Monday/Tuesday/Wednesday) and 2 days remote (Thursday/Friday).
Responsibilities
Design, implement, and maintain infrastructure across a widening footprint of co-located deployments. Assess existing physical and network architectures to ensure long-term scalability and growth. Collaborate with engineering and product teams to accurately scope projects based on core business requirements. Lead company-wide initiatives to enhance service management surrounding deployments, monitoring, and disaster recovery. Oversee and maintain shared infrastructure within our AWS environment.
Requirements
Understanding of contemporary datacenter practices with experience in configuring multi-datacenter deployments. Extensive knowledge of AWS infrastructure, networking, and management practices. Demonstrated experience with infrastructure as code and related tools.
InvestorFlow stands out as the premier provider of an industry-specialized CRM built on Salesforce, complemented by digital portals that empower alternative asset firms to discover opportunities, cultivate and manage relationships, and transform insights into actionable strategies, all while enhancing productivity and transparency. We are seeking a dedicated Senior Site Reliability Engineer who will play a key role in ensuring our systems' reliability through operational excellence, configuration-as-code modifications, and active collaboration with Engineering and DevOps teams. This role involves participating in architectural design reviews, validating reliability standards, auditing production systems, and confirming that systems meet SRE production-readiness criteria. While the SRE will not be responsible for building infrastructure or Infrastructure-as-Code (IaC), familiarity with IaC concepts (especially Terraform/HCL) is advantageous for assessing and influencing configurations.
Join Our Team at Customer.io At Customer.io, we empower over 8,000 companies—ranging from innovative startups to established global brands—to send billions of tailored emails, push notifications, in-app messages, and SMS daily. Our platform drives automated communication that resonates with users. Utilizing real-time behavioral data, we enable teams to craft smarter, more relevant messages. Our tech stack includes Go, React, Ember, and cutting-edge AI, allowing us to deliver quickly and scale confidently. We are seeking a Senior Site Reliability Engineer to enhance our infrastructure, minimize operational challenges, and boost reliability as we continue to grow. If you possess experience with high-scale systems and have a passion for optimizing platforms for both developers and customers, we want to connect with you!
Join Our Team:
At HavocAI, we are at the forefront of collaborative autonomy, leading the way in the development of autonomous surface vessels for a variety of defense and commercial maritime operations. Our mission is to rapidly expand and innovate solutions that address complex human challenges, while prioritizing life-saving technologies. We are in search of passionate individuals committed to pushing boundaries and making a meaningful impact.
Role Overview
We are looking for a Senior Site Reliability Engineer (SRE) with a minimum of 7 years of experience in designing, operating, and scaling robust distributed systems. In this pivotal role, you will serve as a technical leader in our Cloud Platform team, ensuring the reliability, performance, and resilience of critical services that support autonomy, simulation, and data-heavy workloads. You will collaborate with various teams, including Cloud Platform, DevOps, Data Engineering, and Autonomy, to define reliability standards, enhance operational maturity, and create systems that effectively scale under real-world conditions. The ideal candidate will possess deep technical expertise, demonstrate composure under pressure, and be experienced in managing end-to-end reliability outcomes.
Company Background
Censys is dedicated to building the most comprehensive and reliable map of the Internet. Our mission is to empower users with real-time Internet intelligence and actionable threat insights, catering to global governments, over 50% of the Fortune 500, and leading threat intelligence providers worldwide.
Location
This is a fully remote position within the United States.
Role Summary
As a Senior Site Reliability Engineer (SRE) on the Infrastructure and Operations team, you will play a crucial role in designing, building, and deploying tools that enhance the efficiency of our development teams and production applications. We are seeking skilled engineers who are passionate about cloud-native technologies and committed to improving our microservice architecture's reliability and operational maturity. Focusing on Developer Efficiency and Experience, you will help streamline engineering workflows, support our Software Development Life Cycle (SDLC), and empower developers to confidently build, deploy, and manage their services within the platform.
What You'll Do
Develop and maintain tools to support applications running on Kubernetes and Google Cloud Platform. Collaborate with development teams to facilitate the building, shipping, and deploying of services and applications, ensuring resilience and reliability. Monitor and ensure the smooth operation of our production environments, assisting developers in debugging complex issues and capturing the four golden signals of performance. Contribute to the creation of a self-service platform that accelerates developer velocity, including service catalogs, repository tooling, and comprehensive documentation. Participate in a shared on-call rotation, embracing end-to-end service ownership alongside development teams.
Join our dynamic team at Ditto as a Site Reliability Engineer, where you'll play a pivotal role in enhancing our platform's performance and reliability. You'll collaborate with cross-functional teams to ensure the seamless operation of our services while implementing best practices in automation, monitoring, and incident response.
Dec 17, 2025