Site Reliability Engineer at TextNow | Canada
Experience Level
Entry Level
Qualifications
About TextNow Inc.
TextNow is committed to making communication accessible to everyone. We are at the forefront of transforming phone service into a free and democratic resource for all. Our team is composed of innovative thinkers dedicated to creating a more connected world.
Similar jobs
Site Reliability Engineer at Tecsys | Remote
Embracing the benefits of remote work, we at Tecsys promote a digital-first culture that enhances employee morale, boosts productivity, and reduces the environmental impact associated with commuting. Our commitment to remote work is complemented by our well-equipped offices and collaborative spaces, offering flexibility for our team to work in the most produ…
Tecsys Inc.
At Tecsys, we recognize the transformative power of remote work on employee well-being and the environment. Our commitment to remote work fosters higher employee morale and productivity while reducing commuting time. We are proud to be a remote-first organization, supported by cutting-edge technologies and programs that create a fantastic foundation for our team. Our flexible remote environment, complemented by well-located offices and collaborative workspaces, empowers our staff to work in ways that maximize their productivity.
About Tecsys
Tecsys is a rapidly growing innovator in supply chain solutions for leading healthcare systems, hospitals, pharmacies, distributors, retailers, and 3PLs. We collaborate with industry leaders to transform their supply chains through technology. If you thrive on tackling challenges and seek continuous learning opportunities, we invite you to join our dynamic team!
Position Overview
We are in search of an Infrastructure Reliability Engineer to join our Network Operations and Security Center (NOC) team, which is pivotal to the reliability of our critical SaaS platforms. In this role, you will help maintain, optimize, and assure the reliability and performance of the systems that drive our cloud infrastructure on AWS and Kubernetes. A strong focus will be placed on automation, observability, and continuous improvement.
This position combines reliability engineering with incident management, placing you in a key role responsible for availability, performance, and innovation.
You will be part of a highly skilled team that values creative problem-solving, operational excellence, and the continuous enhancement of resilience through automation and engineering.
Your Responsibilities
Collaborate with engineering teams to support services prior to their launch through activities such as systems design consultation, platform and software framework development, capacity planning, and launch reviews.
Continuously innovate by identifying weaknesses, proposing creative solutions, and driving initiatives that simplify, scale, and strengthen the platform.
Maintain services post-launch by measuring and monitoring availability, latency, and overall system health.
Ensure optimized observability: enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that yield reliability outcomes.
Develop and enhance...
Join our innovative team at Tecsys as a Senior Instructional Designer, where your expertise will drive the development of cutting-edge learning solutions! While this role is remote-first, we encourage candidates from the Greater Montreal area to apply.
At Tecsys, we embrace the benefits of remote work, including enhanced productivity and employee well-being. Our digital-first approach allows our team to thrive in an environment that fosters collaboration and flexibility.
About Tecsys
Tecsys is a rapidly expanding leader in providing advanced supply chain solutions tailored for healthcare systems, hospitals, pharmacies, and distributors. Our commitment to transforming supply chains through technology sets us apart, and if you're passionate about tackling complex challenges with opportunities for continuous learning, Tecsys is the perfect fit for you!
Your Role
As a Senior Instructional Designer, you will play a crucial role in our organizational success by enhancing employee and client training experiences. Collaborating closely with business leaders and subject matter experts, you will create and implement modern learning strategies that propel both individual and organizational growth.
Key Responsibilities
Learning Strategy
Lead the design and execution of product training strategies and curriculum frameworks for complex software ecosystems.
Innovate with digital-first learning approaches to improve educational experiences.
Align learning strategies with organizational goals to enhance professional and leadership capabilities.
Co-design an organizational skills framework to promote upskilling and career development.
Program Design, Development & Measurement
Conduct data-driven needs assessments to identify skill gaps and learning priorities.
Utilize instructional design best practices to create engaging, learner-centered content (e-learning, workshops, simulations).
Design and develop targeted competency development programs for employees and managers.
About ClickHouse
Recognized on the 2025 Forbes Cloud 100 list of top private cloud companies, ClickHouse stands out as a leading innovator. With a rapidly expanding customer base exceeding 3,000 and annual recurring revenue (ARR) growth of over 250% year-over-year, ClickHouse is at the forefront of real-time analytics, data warehousing, observability, and AI workloads.
Our recent $400M Series D financing round validates our sustained momentum. Notable clients such as Capital One, Lovable, Decagon, Polymarket, and Airwallex have recently adopted or expanded their use of our platform, joining a prestigious roster of AI pioneers and global brands including Meta, Cursor, Sony, and Tesla.
Join us in our mission to revolutionize the way companies leverage data!
About the Role
As we deepen our commitment to delivering dependable and secure services, we are expanding our Site Reliability Engineering team. In this role, you will spearhead initiatives to maintain and improve the reliability, availability, scalability, and performance of our cloud infrastructure. You will collaborate across teams, including Control Plane, Data Plane, Core, Security, Support, and Operations, to design and implement robust, secure, and highly available distributed systems. You will take charge of incident management and response processes, conducting blameless postmortems and driving continuous improvements in our Cloud services. Your software engineering expertise will be vital in developing tools and platforms that improve operational and engineering efficiency within ClickHouse Cloud.
This is a unique opportunity to make a substantial impact on our high-performance, elastic ClickHouse Cloud.
Your Responsibilities
Collaborate with diverse engineering teams at ClickHouse to architect and implement scalable, secure, and highly available systems.
Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
Ensure all infrastructure components within ClickHouse Cloud, including Data Plane, Control Plane, and ClickHouse Core, have effective monitoring and alerting systems in place for timely incident detection and resolution.
Refine incident response processes and postmortem analyses for outages in ClickHouse Cloud, including communication with impacted customers through the support team.
Continuously enhance the reliability and performance of ClickHouse services.
Embracing the benefits of remote work, including enhanced employee satisfaction, increased productivity, and a positive impact on well-being and the environment, Tecsys proudly operates as a digital-first organization. Our digital-centric work culture, complemented by our strategically located offices and collaborative workspaces, gives our team the flexibility to work in whatever way makes them most productive.
About Us
Tecsys is a rapidly expanding innovator providing advanced supply chain solutions to leading healthcare systems, hospitals, pharmacy businesses, distributors, retailers, and third-party logistics providers. We collaborate with industry leaders to revolutionize their supply chains through pioneering technology. If you are passionate about overcoming intriguing challenges and seeking continuous learning opportunities, Tecsys may be the perfect place for you!
About the Role
We are on the lookout for a Security Governance, Risk and Compliance Specialist who will play a crucial role in defining how security can enable business initiatives and in ensuring adherence to security best practices and relevant contractual and regulatory requirements. The ideal candidate will help implement a robust security risk management framework. This position will also involve managing the vendor risk and business continuity programs.
As a security subject matter expert, you will propose enhancements to mitigate, contain, and reduce identified risks while participating in various business and security initiatives aimed at boosting Tecsys's security maturity.
Key Responsibilities
Support the ongoing development of a security risk management framework.
Work collaboratively with technical teams to develop, implement, and monitor corrective action plans for security compliance issues or audit deficiencies.
Engage with stakeholders to define processes, automate, and continuously monitor information security controls, exceptions, risks, testing, and evidence gathering.
Create reporting metrics and dashboards.
Assist in identifying cyber risks and addressing governance gaps and process inefficiencies.
Participate in internal and external security and compliance assessment initiatives such as SOC 2, PCI-DSS, NIST, and FedRAMP.
Review and enhance the vendor risk management program.
Monitor existing controls and conduct periodic audits and reviews to ensure their effectiveness and operating efficiency while identifying and addressing potential issues.
Collaborate with internal IT and business teams to identify cyber risks and prioritize security compliance improvements.
As a security expert, support IT and cyber teams in implementing controls to meet security and privacy compliance requirements and best practices.
Instacart
Transforming the Grocery Industry
At Instacart, we believe in sharing love through food, ensuring everyone has access to their favorite groceries and quality time with loved ones. We don’t just see grocery delivery as a necessity; we recognize the exciting complexities and opportunities it presents to meet the diverse needs of our community. We provide an essential service that customers depend on for groceries and household goods, while also offering safe and flexible earning opportunities to our Personal Shoppers.
Instacart has become a vital resource for millions, and we’re assembling a dynamic team to propel our shopping cart forward. If you are ready to deliver your best work, we invite you to join our team.
Flex First Work Environment
We embrace a flexible approach to how we do our best work. Team members can choose their work location, whether home, an office, or their favorite coffee shop, while building connections and community through regular in-person events.
Overview
About the Role
As a Senior Site Reliability Engineer II, you will be instrumental in ensuring the stability and performance of our platform. You will tackle challenges head-on, ensuring optimal performance and fostering a culture that emphasizes reliable and effective practices. We are seeking a proactive individual who is adept at solving complex problems and is enthusiastic about exploring innovative solutions to support our teams and services.
About the Team
The Site Reliability Engineering (SRE) team merges software and systems engineering to design and maintain large-scale, distributed, and fault-tolerant systems. Our mission is to guarantee high reliability, optimal performance, and continuous improvement for Instacart’s critical internal services and customer-facing systems.
The SRE team focuses on enhancing existing systems, constructing robust infrastructure, and automating processes to reduce manual effort.
Joining the SRE team means facing unique scaling challenges while applying your expertise in coding, algorithms, complexity analysis, and large-scale system design.
Axon Enterprise, Inc.
Become a Force for Good with Axon.
At Axon, we are driven by our mission to Protect Life. We are innovators, tackling society's most pressing safety and justice challenges through our advanced ecosystem of devices and cloud-based software. Just like our products, we thrive on collaboration, embracing diverse perspectives from our customers, communities, and each other.
Working at Axon is dynamic, rewarding, and impactful. You will take initiative and drive substantial change, growing continually as you contribute to a mission that truly matters at a company where your contributions are valued.
Your Impact
As a vital member of the Site Reliability Engineering (SRE) team, you are dedicated to solving the real-time challenges faced by our mission-critical cloud-native services. You are committed to ensuring the high quality and reliability that our customers expect. Beyond collaborating closely within the SRE team, you will make technical contributions that empower the entire engineering organization, enabling product teams to consistently deliver cutting-edge features.
Location: Remote in Canada
Your Responsibilities
Develop robust, user-friendly foundational platforms and tools that allow engineering teams to provision services quickly, consistently, securely, and cost-effectively.
Implement best practices in cloud-native site reliability.
Write clean, maintainable, and efficient code.
Utilize strong problem-solving abilities to debug issues in cloud-native distributed systems.
Guide and educate the engineering organization in adopting innovative architectural patterns.
Create thorough documentation to facilitate self-service for engineers.
Embrace calculated risks, advocate for new ideas, and hone your craft.
Location Details:
This role is fully remote, allowing you to work from home. Occasional visits to a GoDaddy office may be required for team events or meetings.
Join Our Team
At GoDaddy, we believe in a future of work that adapts to the needs of each team. We are looking for a talented and driven Site Reliability Engineer (SRE) to join our evolving team. This position focuses on automating and maintaining our storage infrastructure, with an emphasis on Ceph, to ensure the resilience, scalability, and performance of our systems.
Your Responsibilities Include:
Automating and overseeing daily operations of storage systems to meet application requirements.
Creating and maintaining tools and automation scripts to improve the efficiency of storage operations.
Monitoring system performance, diagnosing issues, and implementing solutions to ensure high availability and reliability.
Participating in agile practices such as daily stand-ups, task tracking, code reviews, automated testing, and continuous integration/deployment.
Proactively improving system reliability, performance, and capacity through effective monitoring, automation, and optimization.
Coalition
Join Coalition as a Staff Site Reliability Engineer, where you will play a pivotal role in enhancing the reliability and performance of our systems. We are seeking a passionate engineer who thrives in a collaborative environment and is dedicated to ensuring seamless operations across our platforms.
Join our innovative team at Newton as a Site Reliability Engineer, where you'll play a crucial role in ensuring the reliability and performance of our systems. In this fully remote position, you will collaborate with engineering and operations teams to develop solutions that enhance system uptime and efficiency.
Your expertise will help us transition and maintain our infrastructure, ensuring our services are resilient and scalable. This is an exciting opportunity to contribute to a company that values innovation and teamwork.
Tyk Technologies
About Tyk Technologies
Tyk Technologies is at the forefront of API Management, paving the way for a connected world and enabling innovative products and services. Our platform transforms how organizations connect their systems and services, whether they are internal, external, public, or highly encrypted. We empower businesses in various sectors including retail, finance, telecommunications, healthcare, and media.
Founded in 2015, Tyk has expanded globally with offices in London, Ontario, Atlanta, and Singapore, serving thousands of users worldwide. Our platform is trusted by renowned brands such as Lotte, Bell, T-Mobile, RBS, Capital One, and Vinci, with a diverse user base that spans every continent.
Our Vision
At Tyk, we are committed to connecting every system in the world through our comprehensive API Management platform.
Work Culture: Flexibility and Responsibility
We believe in providing unlimited paid holidays and the flexibility to work from anywhere. Our remote-first philosophy is built on the principles of autonomy and flexibility, enabling our employees to perform at their best and fostering a diverse team without barriers to location or working hours.
The Role: Site Reliability Engineer
We are seeking a proactive Site Reliability Engineer to oversee, enhance, and provide support for our platform. Your curiosity and problem-solving skills will drive improvements, as you will be responsible for identifying reliability issues and collaborating with your team to address them. As the first line of incident management for our clients, you will define our response strategies.
This position offers a unique opportunity to collaborate with a leading distributed team and shape the future of Tyk as we continue to expand our Cloud platform.
TextNow Inc.
At TextNow, we believe that communication should be accessible to everyone. Our mission is to democratize phone service, and we are reshaping the way the world connects. As the largest provider of free phone service in Canada, we are just getting started. Join our team and help us break down communication barriers, enabling conversations to flow freely for individuals everywhere.
We are on the lookout for a dedicated Site Reliability Engineer to take charge of our infrastructure, monitoring systems, logging processes, CI/CD pipelines, and overall reliability. This position is critical in driving impact at scale.
In this role, you will play a key part in shaping how TextNow designs and operates its systems within an AI-first environment, where intelligent tooling is a fundamental aspect of our engineering practices. Utilizing AI is not just encouraged; it's expected. From system design and architecture to implementation, testing, debugging, documentation, and operational analysis, you will leverage AI tools to boost development speed, enhance code quality, and make informed technical decisions. We provide a comprehensive suite of AI-powered development tools, and we expect you to continuously innovate in their application to elevate efficiency, clarity, and excellence in our products.
At Confluent, we are not just advancing technology; we are revolutionizing the flow of data and its potential applications. Our platform empowers businesses to utilize data in real time, enabling them to adapt swiftly, innovate intelligently, and offer experiences that resonate with the fast-paced world around them.
We seek individuals who thrive in collaborative environments, are unafraid to ask challenging questions, provide constructive feedback, and support one another. Our team is built on a foundation of curiosity and collective ambition, where egos take a backseat to team efforts.
Join us at Confluent as we unite as one team on our journey to enhance data streaming.
About the Role:
As a Staff Site Reliability Engineer specializing in Incident Management, you will play a crucial role in maintaining the reliability of Confluent Cloud, which processes millions of events per second across multiple cloud platforms such as AWS, GCP, and Azure. You will apply deep systems thinking to anticipate and prevent incidents that could disrupt our multi-cloud streaming services.
Your work will blend technical expertise with strategic program ownership, dedicating about 75% of your time to engineering tasks such as automating processes, refining tools, analyzing failure patterns, and enhancing reliability. The remaining 25% will focus on coaching and collaboration, guiding teams through post-incident reviews and refining our incident response methodologies.
You will be part of a global team that ensures continuous support, maintaining a sustainable workload through seamless handoffs. This position falls within the Cloud Architecture and Reliability - Supportability division, a team committed to establishing and upholding reliability standards across our engineering efforts.
About Fable
Fable collaborates with global enterprises to enhance accessibility for over one billion individuals with disabilities. Our esteemed clients include industry leaders such as Walmart, Slack, and Shopify. Recognized on the Forbes Accessibility 100 list in 2025, we have also been honored as one of Fast Company’s Most Innovative Companies in Design, receiving accolades from prestigious organizations including the World Summit Awards and the UN-endorsed Zero Project.
About the Role
As a Senior Site Reliability Engineer at Fable, you will be instrumental in ensuring the reliability, scalability, and efficiency of our platform during our growth phase. Our products empower organizations to create more accessible digital experiences, and the robustness of our infrastructure is key to achieving this mission. You will engage with various platform and product systems to ensure stability, performance, and cost-effectiveness, enabling teams to operate swiftly and securely.
With the integration of AI capabilities in contemporary product experiences, you will also help prepare Fable’s infrastructure to handle AI workloads, balancing reliability, performance, and cost while enabling teams to innovate and scale new features safely.
Reporting to the Director of Technical Operations, you will collaborate closely with teams across Engineering and Product.
This role is perfect for those who thrive on hands-on technical work, take pride in system health, tooling, and operational excellence, and are eager to influence Fable’s infrastructure and reliability strategy moving forward.
Key Responsibilities
Reliability, Infrastructure & Platform
Design, build, and maintain reliable, scalable, and secure infrastructure for Fable’s product services.
Enhance system observability, monitoring, and alerting to ensure high availability and rapid incident response.
Contribute to and refine SRE practices, including SLIs/SLOs, incident management, and postmortems.
Support and optimize CI/CD pipelines and deployment processes.
Identify and minimize operational complexity across systems and tooling.
Collaborate across infrastructure and application layers to diagnose and resolve reliability and performance issues, making targeted improvements to application code when necessary.
Support infrastructure and platform capabilities required for AI/ML-powered features, including considerations for scaling, performance, and reliability.
Cost Efficiency & Performance
Monitor and optimize infrastructure costs across cloud environments.
Parallel Domain
Parallel Domain is looking for a Senior Site Reliability Engineer to help build and maintain the infrastructure behind its high-fidelity simulation platform. This technology supports the development and validation of autonomous vehicles and robotic systems in safe, virtual environments.
Role overview
This is a hands-on engineering position focused on ensuring the smooth operation of large-scale, distributed simulation workloads. The role involves close collaboration with teams working on platforms, simulations, and machine learning projects. Day-to-day work centers on managing and scaling multi-region AWS infrastructure, deploying and maintaining Kubernetes clusters, and improving the reliability and security of deployment pipelines used by engineering teams.
Key responsibilities
Manage and scale AWS infrastructure across multiple regions
Deploy, monitor, and optimize Kubernetes workloads
Enhance reliability and security of deployment systems
Support large-scale batch simulation and distributed workloads
Collaborate with engineering teams across platforms, simulations, and machine learning
Challenges and focus areas
Multi-region GPU scheduling
Running Windows workloads on Kubernetes
Scaling batch simulation infrastructure
This remote role is open to candidates based in Canada. The team values innovative thinkers who are eager to solve complex infrastructure problems and contribute directly to the evolution of autonomous system technology.
At Affirm, we're transforming the landscape of credit to create a more transparent and user-friendly experience, empowering consumers with the option to buy now and pay later, free from hidden fees and compounding interest.
As the Director of Software Engineering focusing on Site Reliability Engineering, you will be responsible for establishing and enhancing world-class reliability practices. This includes managing incident responses, risk assessments, and operational lifecycle programs that ensure resilience is integrated into every phase of development. You will collaborate closely with leaders across Product, Security, Enterprise Risk, Legal, Compliance, and Engineering to proactively identify and address systemic risks throughout the organization.
Your role will involve building, nurturing, and leading a diverse global team of Site Reliability Engineers (SREs), systems engineers, and full-stack engineers, while promoting a culture that emphasizes learning, innovation, and accountability. As a senior technical leader, you will combine a hands-on technical approach with strategic direction, enabling Affirm to innovate rapidly while preserving the trust of our users and partners on a large scale.
jobgether
As a Senior Site Reliability Engineer at jobgether, the focus is on maintaining and improving the reliability and performance of cloud infrastructure and services. This position is based in Canada and works closely with multiple teams across the company.
Role overview
The Senior Site Reliability Engineer monitors systems, implements improvements, and automates key processes. The goal is to support a platform that scales smoothly as demands grow.
What you will do
Ensure the ongoing reliability and performance of cloud-based systems
Collaborate with other teams to address infrastructure needs and challenges
Automate operational processes to reduce manual work and improve efficiency
Identify and implement ways to improve scalability across the platform
Pinterest, Inc.
Pinterest is hiring a Senior Site Reliability Engineer in Toronto, ON, Canada. The focus of this role is ensuring that Pinterest’s services remain reliable, scalable, and performant as the platform grows. Working closely with software engineers, this position involves designing and implementing solutions that strengthen system reliability and efficiency.
Key responsibilities
Partner with engineering teams to maintain and enhance the reliability of Pinterest’s services
Design and implement improvements to support scalability and performance
Troubleshoot and resolve service issues to reduce downtime
Requirements
Extensive experience in site reliability engineering or a closely related field
Strong technical background with proven problem-solving abilities
Comfort working alongside software engineers to improve systems
MongoDB, Inc.
The Team
At MongoDB, the Platform Engineering division within Site Reliability Engineering (SRE) manages essential infrastructure and operational functions that empower our engineering teams. This includes our robust, multi-cloud Kubernetes infrastructure, deployment systems, and advanced observability and alerting mechanisms.
The Fabric team is at the forefront of enabling secure communication across systems and from the public internet. Our responsibilities involve designing network architecture, implementing service mesh solutions, and optimizing edge load balancing to ensure the safety of customer data in transit. This team is vital in developing and maintaining a dependable and globally connected multi-cloud network that underpins MongoDB products.
This position can be based in our Toronto or Vancouver offices, or fully remote from anywhere in North America. We offer flexible hybrid work arrangements for those in our offices.
About Syndio
Syndio is a Series C technology company based in Calgary, Alberta, focused on helping organizations create smarter, fairer compensation strategies. Our platform uses advanced technology and ethical AI to support decision-making, simplify compliance, and provide insights that help companies maintain equitable pay practices worldwide. Syndio analyzes compensation data for more than 10 million employees across many countries, working with leading enterprises to ensure fair and defensible pay.
Role Overview: Senior Site Reliability Engineer
The Senior Site Reliability Engineer (SRE) will help design, implement, maintain, and evolve solutions that improve the reliability and availability of Syndio’s applications and systems. This role blends software engineering with systems engineering, focusing on eliminating single points of failure, maximizing observability, and responding quickly to incidents. The SRE will work closely with other engineers and teams, sharing ownership and promoting a culture of collaboration and continuous learning.
What You Will Do
Design and maintain systems that support high availability and reliability for Syndio’s cloud-based applications.
Apply software engineering principles to infrastructure and operations challenges.
Identify and resolve single points of failure in the stack.
Maximize observability and monitoring across platforms.
Respond to and resolve failures efficiently to minimize downtime.
Explore and implement new tools and techniques to improve reliability and performance.
Work across platform, data, security, and software engineering as needed.
Manage Kubernetes applications and infrastructure, primarily with Terraform, in a fully cloud-based environment.
What We’re Looking For
Experience managing Kubernetes applications in an SRE or similar capacity.
Comfort working with Terraform and cloud-native environments.
Interest in SRE practices and methodologies, with a drive to learn and adapt.
Ability to work in a startup environment and handle tasks that may extend beyond traditional SRE responsibilities.
Collaborative mindset and willingness to share ownership of systems and solutions.
Why Join Syndio as an SRE?
Play a key role in a growing engineering organization.
Work on meaningful challenges that impact fair pay for millions of employees worldwide.
Grow your skills across platform, data, security, and software engineering.
Be part of a team that values learning, innovation, and ethical technology.
Location: Calgary, Alberta, Canada
