Experience Level
Senior
Qualifications
Ideal Candidate Profile
Robust experience in Site Reliability Engineering (SRE), DevOps, Software Engineering, or Systems Engineering.
Strong troubleshooting abilities.
Proficient in system design with excellent analytical skills.
Effective communication skills.
Familiarity with major cloud platforms, particularly Google Cloud.
Proficient in SQL.
Experience with containers, Kubernetes, and tools like Kustomize and Helm.
Knowledge of service mesh technologies, preferably Istio.
Understanding of networking concepts, including DNS, TLS, certificates, and ingress configurations.
About the job
About the Position
Join the innovative Cloud FinOps team at Hopper as a Senior Site Reliability Engineer. In this role, you'll oversee a vast infrastructure on Google Cloud, supporting hundreds of engineers and delivering exceptional experiences to millions of users globally.
Your enthusiasm for automation and system optimization will be crucial as you work to create scalable, reliable, and secure infrastructure.
You will tackle problems pragmatically, developing solutions that are not only effective but also user-friendly and economical.
Daily Responsibilities
Drive cost efficiency projects, including:
Minimizing network egress costs by eliminating unnecessary headers.
Optimizing data storage by ensuring efficient use of warehouse data, such as utilizing cold storage for infrequently accessed buckets.
Enhancing autoscaling for both databases and compute services.
Enhance cost attribution processes to provide all teams with transparent cost visibility.
Participate in incident support and share on-call responsibilities for platform incidents, collaborating with a geographically diverse engineering team.
Contribute to a dynamic and efficient team of SREs.
About Hopper
Hopper is a leading technology company that leverages data-driven insights to empower travelers across the globe. Our mission is to create a seamless and enjoyable travel experience for everyone. Join us as we innovate and enhance our platform to serve millions of users worldwide.
Rivian and Volkswagen Group Technologies are working together to set new standards for software-defined vehicles. This partnership focuses on electric vehicles, advanced operating systems, zonal controllers, and cloud connectivity. The goal is to create vehicles that are more connected, intelligent, and sustainable by drawing on deep expertise in connectivity, artificial intelligence, and security.
Role overview
The Senior Site Reliability Engineer for Developer Platform will design, build, and maintain infrastructure that powers build pipelines. This work ensures teams can efficiently produce and deliver firmware for next-generation vehicles. The role plays a key part in developing the technical backbone of the organization and fostering a DevOps mindset across teams.
What you will do
Architect and develop infrastructure to support software build pipelines
Maintain and improve systems that enable firmware delivery
Collaborate with engineering teams to implement architectural plans
Promote DevOps practices within the organization
Location
This position is based in Vancouver, British Columbia.
About ClickHouse
Recognized on the 2025 Forbes Cloud 100 list, ClickHouse stands out as a leading innovator in the realm of private cloud technology. With a rapidly expanding customer base exceeding 3,000 and an astounding annual recurring revenue (ARR) growth of over 250% year-on-year, ClickHouse is at the forefront of real-time analytics, data warehousing, observability, and AI workloads.
Our recent $400M Series D financing round validates our sustained momentum. Notable clients such as Capital One, Lovable, Decagon, Polymarket, and Airwallex have recently adopted or expanded their use of our platform, joining a prestigious roster of AI pioneers and global brands including Meta, Cursor, Sony, and Tesla.
Join us in our mission to revolutionize the way companies leverage data!
About the Role
As we enhance our commitment to delivering dependable and secure services, we are expanding our Site Reliability Engineering team. In this role, you will spearhead initiatives to maintain and improve the reliability, availability, scalability, and performance of our cloud infrastructure. Collaborate across various teams, including Control Plane, Data Plane, Core, Security, Support, and Operations, to design and implement robust, secure, and highly available distributed systems. You will take charge of incident management and response processes, conducting blameless postmortems and driving continuous improvements in our Cloud services. Your software engineering expertise will be vital in developing tools and platforms to enhance operational and engineering efficiencies within ClickHouse Cloud. This is a unique opportunity to make a substantial impact on our high-performance, elastic ClickHouse Cloud.
Your Responsibilities
Collaborate with diverse engineering teams at ClickHouse to architect and implement scalable, secure, and high-availability systems.
Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
Ensure all infrastructure components within ClickHouse Cloud, including Data Plane, Control Plane, and ClickHouse Core, have effective monitoring and alerting systems in place for timely incident detection and resolution.
Refine incident response processes and post-mortem analyses for outages in ClickHouse Cloud, including communication with impacted customers through the support team.
Continuously enhance the reliability and performance of ClickHouse services.
As a Senior Site Reliability Engineer at jobgether, the focus is on maintaining and improving the reliability and performance of cloud infrastructure and services. This position is based in Canada and works closely with multiple teams across the company.
Role overview
The Senior Site Reliability Engineer monitors systems, implements improvements, and automates key processes. The goal is to support a platform that scales smoothly as demands grow.
What you will do
Ensure the ongoing reliability and performance of cloud-based systems
Collaborate with other teams to address infrastructure needs and challenges
Automate operational processes to reduce manual work and improve efficiency
Identify and implement ways to improve scalability across the platform
Data is the new gold! At Lightspeed, we empower our data teams to construct and manage cutting-edge data and AI infrastructure platforms, alongside a robust governance framework that ensures seamless data flow across the organization. Our focus lies in data security, reliability, and high availability.
Please note: As a global organization serving clients beyond Quebec, proficiency in English is a prerequisite for this role.
Role:
Work collaboratively with cross-functional data teams to design and implement scalable, reliable cloud infrastructure solutions that prioritize security and cost-efficiency.
Advocate for a comprehensive security approach that encompasses infrastructure, supply chain, and third-party integrations, ensuring the protection of our entire data ecosystem.
Contribute significantly to the development of self-service workflows for data and infrastructure, enabling teams to access resources efficiently and boost productivity.
Promote and uphold best practices in Infrastructure as Code (IaC), ensuring High Availability, Disaster Recovery, and Security are integral to all design and deployment efforts.
Additionally, you will:
Engage in daily support and troubleshooting activities.
Collaborate with the wider team to achieve organizational goals, even if it involves tasks outside your primary responsibilities.
At Lightspeed, we believe that data is the new gold! Our mission is to empower our data teams to build and sustain a cutting-edge data and AI infrastructure platform alongside a robust governance framework, ensuring seamless data flow across the organization. Our commitment to data security, reliability, and high availability is our driving force.
Important Note: As an international company with employees and clients beyond Quebec, proficiency in English is essential for this role.
Key Responsibilities:
Work collaboratively with cross-functional data teams to design and implement cloud infrastructure solutions that are scalable, reliable, secure, and cost-effective.
Take a comprehensive approach to security, addressing all aspects from infrastructure and supply chain to integration with third-party systems, ensuring the integrity of the data ecosystem.
Drive the development of self-service workflows for data and infrastructure, enabling teams to efficiently access and use resources, thus enhancing agility and productivity.
Promote and uphold the highest standards of Infrastructure as Code (IaC), ensuring that High Availability, Disaster Recovery, and Security are prioritized in all infrastructure design and deployment efforts.
Additionally, you will:
Engage in daily support and troubleshooting tasks.
Collaborate with a wider team to achieve organizational goals, even if it requires stepping outside your defined role.
Full-time|CA$183K/yr - CA$203K/yr|Remote|Canada - Remote (BC, ON, AB, or NS only)
Transforming the Grocery Industry
At Instacart, we believe in sharing love through food, ensuring everyone has access to their favorite groceries and quality time with loved ones. We don’t just see grocery delivery as a necessity; we recognize the exciting complexities and opportunities it presents to meet the diverse needs of our community. We provide an essential service that customers depend on for groceries and household goods, while also offering safe and flexible earning opportunities to our Personal Shoppers.
Instacart has become a vital resource for millions, and we’re assembling a dynamic team to propel our shopping cart forward. If you are ready to deliver your best work, we invite you to join our team.
Flex First Work Environment
We embrace a flexible approach in how we perform our best work. Our team members can choose their work location—whether from home, an office, or their favorite coffee shop—while fostering connections and community through regular in-person events. Discover more about our flexible work approach.
Overview
About the Role
As a Senior Site Reliability Engineer II, you will be instrumental in ensuring the stability and performance of our platform. You will tackle challenges head-on, ensuring optimal performance and fostering a culture that emphasizes reliable and effective practices. We are seeking a proactive individual who is adept at solving complex problems and is enthusiastic about exploring innovative solutions to support our teams and services.
About the Team
The Site Reliability Engineering (SRE) team merges software and systems engineering to design and maintain large-scale, distributed, and fault-tolerant systems. Our mission is to guarantee high reliability, optimal performance, and continuous improvement for Instacart’s critical internal services and customer-facing systems.
The SRE team focuses on enhancing existing systems, constructing robust infrastructure, and automating processes to reduce manual efforts. Joining the SRE team means facing unique scaling challenges while applying your expertise in coding, algorithms, complexity analysis, and large-scale system design.
Full-time|CA$144K/yr - CA$200K/yr|Hybrid|Montreal; Toronto
The Storage Layer Services (SLS) team at MongoDB is embarking on an innovative journey to re-architect our cloud storage layer, forming the core of our next-generation cloud storage architecture. This newly established team is dedicated to creating high-performance, multi-tenant distributed storage services that not only enhance our current Atlas storage stack but also enable more efficient customer workloads. As a Senior Site Reliability Engineer, you will collaborate closely with teams responsible for these storage services to establish Service Level Objectives (SLOs), develop capacity plans, and guarantee the reliability, durability, and operational safety of the foundational storage layer supporting Atlas. By joining our small team of seasoned SREs, you will play an integral role in executing a multi-year roadmap for MongoDB’s cloud storage architecture. This position is open to candidates based in our Toronto or Montreal offices or those working remotely from anywhere in Canada, provided they are located in the Eastern or Central time zones.
Embracing the benefits of remote work, we at Tecsys promote a digital-first culture that enhances employee morale, boosts productivity, and reduces the environmental impact associated with commuting. Our commitment to remote work is complemented by our well-equipped offices and collaborative spaces, offering flexibility for our team to work in the most productive manner possible.
About Us
Tecsys is a rapidly growing innovator in supply chain solutions, serving leading healthcare systems, hospitals, pharmacies, distributors, retailers, and 3PLs. We partner with industry leaders to revolutionize their supply chains through cutting-edge technology. If you enjoy overcoming challenges and are eager for continuous learning, Tecsys may be the perfect place for you!
About the Role
We are seeking a Site Reliability Engineer to join our Network and Security Operations Center (NOC), which is integral to ensuring platform reliability for our mission-critical SaaS environments. In this role, you will be responsible for maintaining, optimizing, and ensuring the reliability and performance of our cloud infrastructure across AWS and Kubernetes. Your focus will be on automation, observability, and continuous improvement. This position combines reliability engineering with incident command, granting you significant ownership of uptime, performance, and innovation. You will join a team of highly skilled professionals who value creative problem-solving, operational excellence, and continual enhancement through automation and resilience engineering.
Your Responsibilities
Collaborate with Engineering teams to support services pre-launch through system design consulting, software platform development, capacity planning, and launch reviews.
Drive innovation: Identify issues, propose creative solutions, and implement initiatives to simplify, scale, and strengthen the platform.
Monitor and maintain live services by evaluating availability, latency, and overall system health.
Enhance observability: Expand monitoring and alerting with Datadog; define SLOs/SLIs and create actionable dashboards to promote reliability.
Automate processes: Develop and improve internal tools, IaC frameworks, and pipelines (e.g., Terraform, GitLab CI/CD) to minimize manual intervention and enable self-healing systems.
Achieve sustainable system scaling through automation and advocate for changes that enhance reliability and velocity.
Function as an orchestrator using Amazon Kiro: Execute multiple activities concurrently leveraging AI agents to expedite processes while personally validating outcomes.
Become a Force for Good with Axon.
At Axon, we are driven by our mission to Protect Life. We are innovators, tackling society's most pressing safety and justice challenges through our advanced ecosystem of devices and cloud-based software. Just like our products, we thrive on collaboration, embracing diverse perspectives from our customers, communities, and each other.
Working at Axon is dynamic, rewarding, and impactful. You will take initiative and drive substantial change, growing continually as you contribute to a mission that truly matters at a company where your contributions are valued.
Your Impact
As a vital member of the Site Reliability Engineering (SRE) team, you are dedicated to providing solutions to the real-time challenges faced by our mission-critical cloud-native services. You are committed to ensuring the high quality and reliability that our customers expect. You collaborate closely within the SRE team and beyond, and your technical contributions will empower the entire engineering organization, enabling product teams to consistently deliver cutting-edge features.
Location: Remote in Canada
Your Responsibilities
Develop robust, user-friendly foundational platforms and tools that allow engineering teams to provision services quickly, consistently, securely, and cost-effectively.
Implement best practices in cloud-native site reliability.
Write clean, maintainable, and efficient code.
Utilize strong problem-solving abilities to debug issues in cloud-native distributed systems.
Guide and educate the engineering organization in adopting innovative architectural patterns.
Create thorough documentation to facilitate self-service for engineers.
Embrace calculated risks, advocate for new ideas, and enhance your craft.
Become a Force for Good with Axon.
At Axon, our mission is to Protect Life. We tackle society's most pressing safety and justice challenges through our innovative ecosystem of devices and cloud software. We believe collaboration is key; we connect with honesty and empathy, valuing diverse perspectives from our customers, communities, and one another.
Life at Axon is dynamic, demanding, and deeply rewarding. Here, you'll take the initiative and drive meaningful change while continuously growing in a mission-driven environment.
Your Impact
As a senior member of the APX Site Reliability Engineering (SRE) team, your passion for delivering solutions to real-time challenges faced by our mission-critical cloud-native services will shine through. You will ensure the high standards of quality, reliability, and security that our customers expect. Your contributions will not only be pivotal within the APX SRE team but will also empower the entire engineering organization to deliver cutting-edge features consistently.
In this role, you will significantly influence Identity and Security by assisting teams in building and managing systems that safeguard user identity, enhance authentication and authorization processes, and comply with regulatory standards. You will collaborate closely with engineering, security, and identity stakeholders to elevate secure-by-default reliability practices across the organization.
Join Us in Powering Global Connections!
At Kong, we believe that technology should connect rather than divide. If you’re passionate about building robust systems that facilitate seamless API connectivity, we want to hear from you!
About the Position:
As a Senior Site Reliability Engineer, you will be an integral part of our global Platform SRE team, dedicated to developing, maintaining, and scaling Kong's multi-region SaaS platform that underpins the world's API connectivity.
You will design and automate production systems that cater to thousands of customers across AWS, GCP, and Azure. Your work will encompass everything from multi-region Kubernetes clusters to service mesh and gateway architectures, ensuring the utmost reliability, scalability, and security of our SaaS offerings.
This hands-on role is ideal for engineers who thrive in environments where they can optimize production SaaS systems at scale, automate operations, and enhance performance, resilience, and deployment pipelines.
Your Responsibilities Will Include:
Overseeing and scaling Kong's global SaaS platform (Konnect) to ensure reliability, availability, and performance across various regions and cloud environments.
Building, automating, and maintaining a Kubernetes-based infrastructure along with deployment workflows utilizing Terraform/Terragrunt, Helm, and ArgoCD.
Designing, maintaining, and optimizing multi-region data and caching layers, including PostgreSQL, Redis, ClickHouse, and Druid, for high availability and low latency.
Operating and enhancing Kong Gateway and Kong Mesh environments that support hybrid and distributed architectures.
Developing and maintaining CI/CD pipelines and GitOps workflows to automate service delivery and ensure consistent infrastructure modifications.
Enhancing observability and incident response readiness through tools such as Datadog, Prometheus, Grafana, and Thanos, while defining and tracking SLOs.
Collaborating effectively with development and security teams to ensure smooth operation of SaaS services adhering to reliability, security, and regulatory standards.
Role overview
StackAdapt seeks a Staff Engineer in Vancouver to focus on FinOps and Cost Platform initiatives. This position aims to improve financial operations and drive efficiency for engineering teams. Collaboration with cross-functional groups is a key part of the role. The Staff Engineer will help enhance platform features, implement cost-saving strategies, and streamline financial workflows.
What you will do
Work with teams from various functions to strengthen platform capabilities
Identify and introduce measures that reduce costs
Simplify and improve processes related to financial operations
Location
This position is based in Vancouver.
Full-time|CA$144K/yr - CA$200K/yr|Hybrid|Toronto; Vancouver
The Team
At MongoDB, our Platform Engineering division within Site Reliability Engineering (SRE) is tasked with managing essential infrastructure and operational functions that empower our engineering teams. This includes our robust, multi-cloud Kubernetes infrastructure, deployment systems, and advanced observability and alerting mechanisms.
The Fabric team is at the forefront of enabling secure communication across systems and from the public internet. Our responsibilities involve designing network architecture, implementing service mesh solutions, and optimizing edge load balancing to ensure the safety of customer data in transit. This team is vital in developing and maintaining a dependable and globally connected multi-cloud network that underpins MongoDB products.
This position can be based in our Toronto or Vancouver offices, or you can work completely remotely from anywhere in North America. We provide flexible hybrid work arrangements for those in our offices.
About Syndio
Syndio is a Series C technology company based in Calgary, Alberta, focused on helping organizations create smarter, fairer compensation strategies. Our platform uses advanced technology and ethical AI to support decision-making, simplify compliance, and provide insights that help companies maintain equitable pay practices worldwide. Syndio analyzes compensation data for more than 10 million employees across many countries, working with leading enterprises to ensure fair and defensible pay.
Role Overview: Senior Site Reliability Engineer
The Senior Site Reliability Engineer (SRE) will help design, implement, maintain, and evolve solutions that improve the reliability and availability of Syndio’s applications and systems. This role blends software engineering with systems engineering, focusing on eliminating single points of failure, maximizing observability, and responding quickly to incidents. The SRE will work closely with other engineers and teams, sharing ownership and promoting a culture of collaboration and continuous learning.
What You Will Do
Design and maintain systems that support high availability and reliability for Syndio’s cloud-based applications.
Apply software engineering principles to infrastructure and operations challenges.
Identify and resolve single points of failure in the stack.
Maximize observability and monitoring across platforms.
Respond to and resolve failures efficiently to minimize downtime.
Explore and implement new tools and techniques to improve reliability and performance.
Work across platform, data, security, and software engineering as needed.
Manage Kubernetes applications and infrastructure, primarily using Kubernetes and Terraform in a fully cloud-based environment.
What We’re Looking For
Experience managing Kubernetes applications in an SRE or similar capacity.
Comfort working with Terraform and cloud-native environments.
Interest in SRE practices and methodologies, with a drive to learn and adapt.
Ability to work in a startup environment and handle tasks that may extend beyond traditional SRE responsibilities.
Collaborative mindset and willingness to share ownership of systems and solutions.
Why Join Syndio as an SRE?
Play a key role in a growing engineering organization.
Work on meaningful challenges that impact fair pay for millions of employees worldwide.
Grow your skills across platform, data, security, and software engineering.
Be part of a team that values learning, innovation, and ethical technology.
Location: Calgary, Alberta, Canada
Veeva Systems Inc. is a pioneering mission-driven company in the industry cloud, dedicated to accelerating the delivery of therapies to patients within the life sciences sector. As one of the fastest-growing SaaS companies ever, we achieved over $2 billion in revenue last fiscal year and maintain significant growth potential.
Our core values drive us: Do the Right Thing, Customer Success, Employee Success, and Speed. Not just a public company, we made history in 2021 by becoming a public benefit corporation (PBC), committed to balancing the interests of our customers, employees, society, and investors.
As a Work Anywhere organization, we embrace flexible work arrangements, allowing you to excel in an environment that suits you best, whether that’s from home or in the office.
Join us in transforming the life sciences industry, and make a meaningful impact on our customers, employees, and communities.
Full-time|CA$243K/yr - CA$297K/yr|On-site|Toronto, ON
At Relay, we empower self-made business owners with a digital banking platform that transforms financial management into a source of clarity, confidence, and control. Our mission is to replace financial uncertainty with genuine visibility, enabling entrepreneurs to convert their hard work into enduring success. By alleviating the stress of cash flow management, we provide the tools necessary for owners to operate robust and resilient businesses.
As Relay continues its growth trajectory, the reliability, performance, and resilience of our platform have become integral to both our customer experience and overall business success.
This senior leadership position is crucial in steering a team of Site Reliability Engineers while shaping how reliability strategies influence engineering and product decisions throughout the organization. You will determine the future direction of the SRE function, promote operational excellence, and assist the company in anticipating and managing scale challenges before they pose risks.
If you thrive on tackling complex systems, leading organizations, and building resilient platforms that customers depend on daily, we are eager to connect with you!
Key Responsibilities
Lead and enhance Relay’s Site Reliability Engineering function, establishing strategic direction as the company scales.
Define and implement a long-term reliability roadmap, making informed trade-offs under real business and capacity constraints.
Act as the senior reliability voice in discussions involving engineering and product leadership.
Influence the integration of reliability considerations into product planning, architectural decisions, and delivery processes.
Serve as a senior escalation point during critical production incidents, ensuring effective communication and thorough follow-up actions.
Enhance Relay’s observability, performance, and operational maturity practices across teams.
Establish and uphold standards concerning SLOs, operational readiness, incident management, and continuous improvement.
Collaborate with stakeholders in Engineering, Product, Data, and Finance to balance velocity, risk, performance, and cost.
Build and nurture a high-performing SRE organization capable of supporting future growth.
Welcome to Okta
At Okta, we are redefining identity management. We empower individuals to securely access any technology, from any device or application, fostering a transformative approach to business security and growth. Our innovative solutions, including the Okta Platform and Auth0 Platform, prioritize identity at the heart of operational success.
We value diverse perspectives and experiences, seeking lifelong learners who contribute to our dynamic culture.
Join us as we shape a future where identity is truly in your hands.
Are you driven to tackle complex data challenges and make a significant impact? Do you want to collaborate with a passionate team of cloud engineers and architects? If yes, we want to hear from you!
The Auth0 platform manages over 100 million logins daily for clients worldwide and is rapidly expanding. As part of the Data Platform team, you will be instrumental in developing and managing essential data services that enable scalability, reliability, efficiency, and operational excellence. In your role as Senior Manager, you will collaborate with engineers across departments, guide the platform roadmap, and establish the foundational infrastructure for Auth0's future growth.
As a leader, your passion for developing high-performing teams and your ability to coordinate across organizations will make you an ideal fit for this position!
Your Responsibilities Include:
Leading a diverse, agile software development team focused on delivering value with expertise in distributed systems, cloud infrastructure, and site reliability engineering.
Fostering a culture of discovery, learning, and experimentation within a geographically distributed team through continuous coaching and mentoring.
Collaborating closely with architects and engineers to design scalable, robust, and extensible services using modern technologies such as Go, Node.js, Kubernetes, Docker, AWS, and Azure.
Building and managing data streaming teams utilizing event-driven architecture and Kafka.
Partnering with product management and engineering leadership to define a platform roadmap that supports the next generation of identity products, overseeing planning, execution, and delivery of data platform services.
Implementing process improvements to drive operational excellence and efficiency during a period of significant growth.
Full-time|On-site|Vancouver, British Columbia, Canada
Role overview
Employer Direct Healthcare seeks a Senior Cloud Platform Engineer based in Vancouver, British Columbia. This position centers on building and refining cloud infrastructure that underpins the company’s healthcare services. The engineer will design, implement, and optimize cloud systems, working closely with colleagues across departments. Reliable and efficient cloud platforms are essential to the company’s mission, and this role plays a direct part in supporting the delivery and quality of care.
Key responsibilities
Design and implement cloud infrastructure solutions
Optimize existing cloud systems for reliability and efficiency
Collaborate with teams across the organization to maintain and improve cloud platforms
Impact
Work in this role directly affects the company’s ability to deliver high-quality healthcare services. The Senior Cloud Platform Engineer helps ensure that technology supports both care teams and patients effectively.
Pinterest is hiring a Senior Site Reliability Engineer in Toronto, ON, Canada. The focus of this role is to ensure that Pinterest’s services remain reliable, scalable, and perform well as the platform grows. Working closely with software engineers, this position involves designing and implementing solutions that strengthen system reliability and efficiency.
Key responsibilities
Partner with engineering teams to maintain and enhance the reliability of Pinterest’s services
Design and implement improvements to support scalability and performance
Troubleshoot and resolve service issues to reduce downtime
Requirements
Extensive experience in site reliability engineering or a closely related field
Strong technical background with proven problem-solving abilities
Comfort working alongside software engineers to improve systems
This position is located in Toronto, ON, Canada.
Apr 24, 2026
Rivian and Volkswagen Group Technologies are working together to set new standards for software-defined vehicles. This partnership focuses on electric vehicles, advanced operating systems, zonal controllers, and cloud connectivity. The goal is to create vehicles that are more connected, intelligent, and sustainable by drawing on deep expertise in connectivity, artificial intelligence, and security. Role overview The Senior Site Reliability Engineer for Developer Platform will design, build, and maintain infrastructure that powers build pipelines. This work ensures teams can efficiently produce and deliver firmware for next-generation vehicles. The role plays a key part in developing the technical backbone of the organization and fostering a DevOps mindset across teams. What you will do Architect and develop infrastructure to support software build pipelines Maintain and improve systems that enable firmware delivery Collaborate with engineering teams to implement architectural plans Promote DevOps practices within the organization Location This position is based in Vancouver, British Columbia.
About ClickHouse: Recognized on the 2025 Forbes Cloud 100 list, ClickHouse stands out as a leading innovator in the realm of private cloud technology. With a rapidly expanding customer base exceeding 3,000 and an astounding annual recurring revenue (ARR) growth of over 250% year-on-year, ClickHouse is at the forefront of real-time analytics, data warehousing, observability, and AI workloads. Our recent $400M Series D financing round validates our sustained momentum. Notable clients such as Capital One, Lovable, Decagon, Polymarket, and Airwallex have recently adopted or expanded their use of our platform, joining a prestigious roster of AI pioneers and global brands including Meta, Cursor, Sony, and Tesla. Join us in our mission to revolutionize the way companies leverage data! About the Role: As we enhance our commitment to delivering dependable and secure services, we are expanding our Site Reliability Engineering team. In this role, you will spearhead initiatives to maintain and improve the reliability, availability, scalability, and performance of our cloud infrastructure. Collaborate across various teams, including Control Plane, Data Plane, Core, Security, Support, and Operations, to design and implement robust, secure, and highly available distributed systems. You will take charge of incident management and response processes, conducting blameless postmortems and driving continuous improvements in our Cloud services. Your software engineering expertise will be vital in developing tools and platforms to enhance operational and engineering efficiencies within ClickHouse Cloud.
This is a unique opportunity to make a substantial impact on our high-performance, elastic ClickHouse Cloud. Your Responsibilities: Collaborate with diverse engineering teams at ClickHouse to architect and implement scalable, secure, and high-availability systems. Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud. Ensure all infrastructure components within ClickHouse Cloud, including Data Plane, Control Plane, and ClickHouse Core, have effective monitoring and alerting systems in place for timely incident detection and resolution. Refine incident response processes and post-mortem analyses for outages in ClickHouse Cloud, including communication with impacted customers through the support team. Continuously enhance the reliability and performance of ClickHouse services.
As a Senior Site Reliability Engineer at jobgether, the focus is on maintaining and improving the reliability and performance of cloud infrastructure and services. This position is based in Canada and works closely with multiple teams across the company. Role overview The Senior Site Reliability Engineer monitors systems, implements improvements, and automates key processes. The goal is to support a platform that scales smoothly as demands grow. What you will do Ensure the ongoing reliability and performance of cloud-based systems Collaborate with other teams to address infrastructure needs and challenges Automate operational processes to reduce manual work and improve efficiency Identify and implement ways to improve scalability across the platform
Data is the new gold! At Lightspeed, we empower our data teams to construct and manage cutting-edge data and AI infrastructure platforms, alongside a robust governance framework that ensures seamless data flow across the organization. Our focus lies in data security, reliability, and high availability. Please note: As a global organization serving clients beyond Quebec, proficiency in English is a prerequisite for this role. Role: Work collaboratively with cross-functional data teams to design and implement scalable, reliable cloud infrastructure solutions that prioritize security and cost-efficiency. Advocate for a comprehensive security approach that encompasses infrastructure, supply chain, and third-party integrations, ensuring the protection of our entire data ecosystem. Contribute significantly to the development of self-service workflows for data and infrastructure, enabling teams to access resources efficiently and boost productivity. Promote and uphold best practices in Infrastructure as Code (IaC), ensuring High Availability, Disaster Recovery, and Security are integral to all design and deployment efforts. Additionally, you will: Engage in daily support and troubleshooting activities. Collaborate with the wider team to achieve organizational goals, even if it involves tasks outside your primary responsibilities.
At Lightspeed, we believe that data is the new gold! Our mission is to empower our data teams to build and sustain a cutting-edge data and AI infrastructure platform alongside a robust governance framework, ensuring seamless data flow across the organization. Our commitment to data security, reliability, and high availability is our driving force. Important Note: As an international company with employees and clients beyond Quebec, proficiency in English is essential for this role. Key Responsibilities: Work collaboratively with cross-functional data teams to design and implement cloud infrastructure solutions that are scalable, reliable, secure, and cost-effective. Take a comprehensive approach to security, addressing all aspects from infrastructure and supply chain to integration with third-party systems, ensuring the integrity of the data ecosystem. Drive the development of self-service workflows for data and infrastructure, enabling teams to efficiently access and use resources, thus enhancing agility and productivity. Promote and uphold the highest standards of Infrastructure as Code (IaC), ensuring that High Availability, Disaster Recovery, and Security are prioritized in all infrastructure design and deployment efforts. Additionally, you will: Engage in daily support and troubleshooting tasks. Collaborate with a wider team to achieve organizational goals, even if it requires stepping outside your defined role.
Full-time|CA$183K/yr - CA$203K/yr|Remote|Canada - Remote (BC, ON, AB, or NS only)
Transforming the Grocery Industry: At Instacart, we believe in sharing love through food, ensuring everyone has access to their favorite groceries and quality time with loved ones. We don’t just see grocery delivery as a necessity; we recognize the exciting complexities and opportunities it presents to meet the diverse needs of our community. We provide an essential service that customers depend on for groceries and household goods, while also offering safe and flexible earning opportunities to our Personal Shoppers. Instacart has become a vital resource for millions, and we’re assembling a dynamic team to propel our shopping cart forward. If you are ready to deliver your best work, we invite you to join our team. Flex First Work Environment: We embrace a flexible approach in how we perform our best work. Our team members can choose their work location—whether from home, an office, or their favorite coffee shop—while fostering connections and community through regular in-person events. Discover more about our flexible work approach. Overview. About the Role: As a Senior Site Reliability Engineer II, you will be instrumental in ensuring the stability and performance of our platform. You will tackle challenges head-on, ensuring optimal performance and fostering a culture that emphasizes reliable and effective practices. We are seeking a proactive individual who is adept at solving complex problems and is enthusiastic about exploring innovative solutions to support our teams and services. About the Team: The Site Reliability Engineering (SRE) team merges software and systems engineering to design and maintain large-scale, distributed, and fault-tolerant systems. Our mission is to guarantee high reliability, optimal performance, and continuous improvement for Instacart’s critical internal services and customer-facing systems. The SRE team focuses on enhancing existing systems, constructing robust infrastructure, and automating processes to reduce manual efforts.
Joining the SRE team means facing unique scaling challenges while applying your expertise in coding, algorithms, complexity analysis, and large-scale system design.
Full-time|CA$144K/yr - CA$200K/yr|Hybrid|Montreal; Toronto
The Storage Layer Services (SLS) team at MongoDB is embarking on an innovative journey to re-architect our cloud storage layer, forming the core of our next-generation cloud storage architecture. This newly established team is dedicated to creating high-performance, multi-tenant distributed storage services that not only enhance our current Atlas storage stack but also enable more efficient customer workloads. As a Senior Site Reliability Engineer, you will collaborate closely with teams responsible for these storage services to establish Service Level Objectives (SLOs), develop capacity plans, and guarantee the reliability, durability, and operational safety of the foundational storage layer supporting Atlas. By joining our small team of seasoned SREs, you will play an integral role in executing a multi-year roadmap for MongoDB’s cloud storage architecture. This position is open to candidates based in our Toronto or Montreal offices or those working remotely from anywhere in Canada, provided they are located in the Eastern or Central time zones.
Embracing the benefits of remote work, we at Tecsys promote a digital-first culture that enhances employee morale, boosts productivity, and reduces the environmental impact associated with commuting. Our commitment to remote work is complemented by our well-equipped offices and collaborative spaces, offering flexibility for our team to work in the most productive manner possible. About Us: Tecsys is a rapidly growing innovator in supply chain solutions, serving leading healthcare systems, hospitals, pharmacies, distributors, retailers, and 3PLs. We partner with industry leaders to revolutionize their supply chains through cutting-edge technology. If you enjoy overcoming challenges and are eager for continuous learning, Tecsys may be the perfect place for you! About the Role: We are seeking a Site Reliability Engineer to join our Network and Security Operations Center (NOC), which is integral to ensuring platform reliability for our mission-critical SaaS environments. In this role, you will be responsible for maintaining, optimizing, and ensuring the reliability and performance of our cloud infrastructure across AWS and Kubernetes. Your focus will be on automation, observability, and continuous improvement. This position combines reliability engineering with incident command, granting you significant ownership of uptime, performance, and innovation.
You will join a team of highly skilled professionals who value creative problem-solving, operational excellence, and continual enhancement through automation and resilience engineering. Your Responsibilities: Collaborate with Engineering teams to support services pre-launch through system design consulting, software platform development, capacity planning, and launch reviews. Drive innovation: Identify issues, propose creative solutions, and implement initiatives to simplify, scale, and strengthen the platform. Monitor and maintain live services by evaluating availability, latency, and overall system health. Enhance observability: Expand monitoring and alerting with Datadog; define SLOs/SLIs and create actionable dashboards to promote reliability. Automate processes: Develop and improve internal tools, IaC frameworks, and pipelines (e.g., Terraform, GitLab CI/CD) to minimize manual intervention and enable self-healing systems. Achieve sustainable system scaling through automation and advocate for changes that enhance reliability and velocity. Function as an orchestrator using Amazon Kiro: Execute multiple activities concurrently leveraging AI agents to expedite processes while personally validating outcomes.
Become a Force for Good with Axon. At Axon, we are driven by our mission to Protect Life. We are innovators, tackling society's most pressing safety and justice challenges through our advanced ecosystem of devices and cloud-based software. Just like our products, we thrive on collaboration, embracing diverse perspectives from our customers, communities, and each other. Working at Axon is dynamic, rewarding, and impactful. You will take initiative and drive substantial change, growing continually as you contribute to a mission that truly matters at a company where your contributions are valued. Your Impact: As a vital member of the Site Reliability Engineering (SRE) team, you are dedicated to providing solutions to the real-time challenges faced by our mission-critical cloud-native services. You are committed to ensuring the high quality and reliability that our customers expect. Collaborating closely not only within the SRE team, your technical contributions will empower the entire engineering organization, enabling product teams to consistently deliver cutting-edge features. Location: Remote in Canada. Your Responsibilities: Develop robust, user-friendly foundational platforms and tools that allow engineering teams to provision services quickly, consistently, securely, and cost-effectively. Implement best practices in cloud-native site reliability. Write clean, maintainable, and efficient code. Utilize strong problem-solving abilities to debug issues in cloud-native distributed systems. Guide and educate the engineering organization in adopting innovative architectural patterns. Create thorough documentation to facilitate self-service for engineers. Embrace calculated risks, advocate for new ideas, and enhance your craft.
Become a Force for Good with Axon. At Axon, our mission is to Protect Life. We tackle society's most pressing safety and justice challenges through our innovative ecosystem of devices and cloud software. We believe collaboration is key; we connect with honesty and empathy, valuing diverse perspectives from our customers, communities, and one another. Life at Axon is dynamic, demanding, and deeply rewarding. Here, you'll take the initiative and drive meaningful change while continuously growing in a mission-driven environment. Your Impact: As a senior member of the APX Site Reliability Engineering (SRE) team, your passion for delivering solutions to real-time challenges faced by our mission-critical cloud-native services will shine through. You will ensure the high standards of quality, reliability, and security that our customers expect. Your contributions will not only be pivotal within the APX SRE team but will also empower the entire engineering organization to deliver cutting-edge features consistently. In this role, you will significantly influence Identity and Security by assisting teams in building and managing systems that safeguard user identity, enhance authentication and authorization processes, and comply with regulatory standards. You will collaborate closely with engineering, security, and identity stakeholders to elevate secure-by-default reliability practices across the organization.
Join Us in Powering Global Connections! At Kong, we believe that technology should connect rather than divide. If you’re passionate about building robust systems that facilitate seamless API connectivity, we want to hear from you! About the Position: As a Senior Site Reliability Engineer, you will be an integral part of our global Platform SRE team, dedicated to developing, maintaining, and scaling Kong's multi-region SaaS platform that underpins the world's API connectivity. You will design and automate production systems that cater to thousands of customers across AWS, GCP, and Azure. Your work will encompass everything from multi-region Kubernetes clusters to service mesh and gateway architectures, ensuring the utmost reliability, scalability, and security of our SaaS offerings. This hands-on role is ideal for engineers who thrive in environments where they can optimize production SaaS systems at scale, automate operations, and enhance performance, resilience, and deployment pipelines. Your Responsibilities Will Include: Overseeing and scaling Kong's global SaaS platform (Konnect) to ensure reliability, availability, and performance across various regions and cloud environments. Building, automating, and maintaining a Kubernetes-based infrastructure along with deployment workflows utilizing Terraform/Terragrunt, Helm, and ArgoCD. Designing, maintaining, and optimizing multi-region data and caching layers, including PostgreSQL, Redis, ClickHouse, and Druid, for high availability and low latency. Operating and enhancing Kong Gateway and Kong Mesh environments that support hybrid and distributed architectures. Developing and maintaining CI/CD pipelines and GitOps workflows to automate service delivery and ensure consistent infrastructure modifications. Enhancing observability and incident response readiness through tools such as Datadog, Prometheus, Grafana, and Thanos, while defining and tracking SLOs. Collaborating effectively with development and security teams to ensure smooth operation of SaaS services adhering to reliability, security, and regulatory standards.
Role overview StackAdapt seeks a Staff Engineer in Vancouver to focus on FinOps and Cost Platform initiatives. This position aims to improve financial operations and drive efficiency for engineering teams. Collaboration with cross-functional groups is a key part of the role. The Staff Engineer will help enhance platform features, implement cost-saving strategies, and streamline financial workflows. What you will do Work with teams from various functions to strengthen platform capabilities Identify and introduce measures that reduce costs Simplify and improve processes related to financial operations Location This position is based in Vancouver.
Full-time|CA$144K/yr - CA$200K/yr|Hybrid|Toronto; Vancouver
The Team: At MongoDB, our Platform Engineering division within Site Reliability Engineering (SRE) is tasked with managing essential infrastructure and operational functions that empower our engineering teams. This includes our robust, multi-cloud Kubernetes infrastructure, deployment systems, and advanced observability and alerting mechanisms. The Fabric team is at the forefront of enabling secure communication across systems and from the public internet. Our responsibilities involve designing network architecture, implementing service mesh solutions, and optimizing edge load balancing to ensure the safety of customer data in transit. This team is vital in developing and maintaining a dependable and globally connected multi-cloud network that underpins MongoDB products. This position can be based in our Toronto or Vancouver offices, or you can work completely remotely from anywhere in North America. We provide flexible hybrid work arrangements for those in our offices.
About Syndio Syndio is a Series C technology company based in Calgary, Alberta, focused on helping organizations create smarter, fairer compensation strategies. Our platform uses advanced technology and ethical AI to support decision-making, simplify compliance, and provide insights that help companies maintain equitable pay practices worldwide. Syndio analyzes compensation data for more than 10 million employees across many countries, working with leading enterprises to ensure fair and defensible pay. Role Overview: Senior Site Reliability Engineer The Senior Site Reliability Engineer (SRE) will help design, implement, maintain, and evolve solutions that improve the reliability and availability of Syndio’s applications and systems. This role blends software engineering with systems engineering, focusing on eliminating single points of failure, maximizing observability, and responding quickly to incidents. The SRE will work closely with other engineers and teams, sharing ownership and promoting a culture of collaboration and continuous learning. What You Will Do Design and maintain systems that support high availability and reliability for Syndio’s cloud-based applications. Apply software engineering principles to infrastructure and operations challenges. Identify and resolve single points of failure in the stack. Maximize observability and monitoring across platforms. Respond to and resolve failures efficiently to minimize downtime. Explore and implement new tools and techniques to improve reliability and performance. Work across platform, data, security, and software engineering as needed. Manage Kubernetes applications and infrastructure, primarily using Kubernetes and Terraform in a fully cloud-based environment. What We’re Looking For Experience managing Kubernetes applications in an SRE or similar capacity. Comfort working with Terraform and cloud-native environments. Interest in SRE practices and methodologies, with a drive to learn and adapt. 
Ability to work in a startup environment and handle tasks that may extend beyond traditional SRE responsibilities. Collaborative mindset and willingness to share ownership of systems and solutions. Why Join Syndio as an SRE? Play a key role in a growing engineering organization. Work on meaningful challenges that impact fair pay for millions of employees worldwide. Grow your skills across platform, data, security, and software engineering. Be part of a team that values learning, innovation, and ethical technology. Location: Calgary, Alberta, Canada
Veeva Systems Inc. is a pioneering mission-driven company in the industry cloud, dedicated to accelerating the delivery of therapies to patients within the life sciences sector. As one of the fastest-growing SaaS companies ever, we achieved over $2 billion in revenue last fiscal year and maintain significant growth potential. Our core values drive us: Do the Right Thing, Customer Success, Employee Success, and Speed. Not just a public company, we made history in 2021 by becoming a public benefit corporation (PBC), committed to balancing the interests of our customers, employees, society, and investors. As a Work Anywhere organization, we embrace flexible work arrangements, allowing you to excel in an environment that suits you best, whether that’s from home or in the office. Join us in transforming the life sciences industry, and make a meaningful impact on our customers, employees, and communities.
Full-time|CA$243K/yr - CA$297K/yr|On-site|Toronto, ON
At Relay, we empower self-made business owners with a digital banking platform that transforms financial management into a source of clarity, confidence, and control. Our mission is to replace financial uncertainty with genuine visibility, enabling entrepreneurs to convert their hard work into enduring success. By alleviating the stress of cash flow management, we provide the tools necessary for owners to operate robust and resilient businesses. As Relay continues its growth trajectory, the reliability, performance, and resilience of our platform have become integral to both our customer experience and overall business success. This senior leadership position is crucial in steering a team of Site Reliability Engineers while shaping how reliability strategies influence engineering and product decisions throughout the organization. You will determine the future direction of the SRE function, promote operational excellence, and assist the company in anticipating and managing scale challenges before they pose risks. If you thrive on tackling complex systems, leading organizations, and building resilient platforms that customers depend on daily, we are eager to connect with you! Key Responsibilities: Lead and enhance Relay’s Site Reliability Engineering function, establishing strategic direction as the company scales. Define and implement a long-term reliability roadmap, making informed trade-offs under real business and capacity constraints. Act as the senior reliability voice in discussions involving engineering and product leadership. Influence the integration of reliability considerations into product planning, architectural decisions, and delivery processes. Serve as a senior escalation point during critical production incidents, ensuring effective communication and thorough follow-up actions. Enhance Relay’s observability, performance, and operational maturity practices across teams. Establish and uphold standards concerning SLOs, operational readiness, incident management, and continuous improvement. Collaborate with stakeholders in Engineering, Product, Data, and Finance to balance velocity, risk, performance, and cost. Build and nurture a high-performing SRE organization capable of supporting future growth.
Welcome to Okta: At Okta, we are redefining identity management. We empower individuals to securely access any technology, from any device or application, fostering a transformative approach to business security and growth. Our innovative solutions, including the Okta Platform and Auth0 Platform, prioritize identity at the heart of operational success. We value diverse perspectives and experiences, seeking lifelong learners who contribute to our dynamic culture. Join us as we shape a future where identity is truly in your hands. Are you driven to tackle complex data challenges and make a significant impact? Do you want to collaborate with a passionate team of cloud engineers and architects? If yes, we want to hear from you! The Auth0 platform manages over 100 million logins daily for clients worldwide and is rapidly expanding. As part of the Data Platform team, you will be instrumental in developing and managing essential data services that enable scalability, reliability, efficiency, and operational excellence.