Site Reliability Engineer at Tecsys | Remote

Tecsys | Remote | Montreal, Quebec, Canada

Remote | Full-time

Qualifications

To succeed in this role, candidates should have a solid understanding of cloud infrastructure, experience with AWS and Kubernetes, and a passion for automation and system reliability. Familiarity with monitoring tools such as Datadog is a plus. Candidates should also be adept at problem-solving and possess excellent communication skills.

About the job

Embracing the benefits of remote work, we at Tecsys promote a digital-first culture that enhances employee morale, boosts productivity, and reduces the environmental impact associated with commuting. Our commitment to remote work is complemented by our well-equipped offices and collaborative spaces, offering flexibility for our team to work in the most productive manner possible.

About Us

Tecsys is a rapidly growing innovator in supply chain solutions, serving leading healthcare systems, hospitals, pharmacies, distributors, retailers, and 3PLs. We partner with industry leaders to revolutionize their supply chains through cutting-edge technology. If you enjoy overcoming challenges and are eager for continuous learning, Tecsys may be the perfect place for you!

About the Role

We are seeking a Site Reliability Engineer to join our Network and Security Operations Center (NOC), which is integral to ensuring platform reliability for our mission-critical SaaS environments. In this role, you will be responsible for maintaining, optimizing, and ensuring the reliability and performance of our cloud infrastructure across AWS and Kubernetes. Your focus will be on automation, observability, and continuous improvement. This position combines reliability engineering with incident command, granting you significant ownership of uptime, performance, and innovation. You will join a team of highly skilled professionals who value creative problem-solving, operational excellence, and continual enhancement through automation and resilience engineering.

Your Responsibilities

Collaborate with Engineering teams to support services pre-launch through system design consulting, software platform development, capacity planning, and launch reviews.
Drive innovation: Identify issues, propose creative solutions, and implement initiatives to simplify, scale, and strengthen the platform.
Monitor and maintain live services by evaluating availability, latency, and overall system health.
Enhance observability: Expand monitoring and alerting with Datadog; define SLOs/SLIs and create actionable dashboards to promote reliability.
Automate processes: Develop and improve internal tools, IaC frameworks, and pipelines (e.g., Terraform, GitLab CI/CD) to minimize manual intervention and enable self-healing systems.
Achieve sustainable system scaling through automation and advocate for changes that enhance reliability and velocity.
Orchestrate work with Amazon Kiro: Run multiple activities concurrently using AI agents to speed up delivery, while personally validating the outcomes.

About Tecsys

Tecsys is a forward-thinking company revolutionizing supply chain management through innovative solutions tailored for healthcare and other industries. We pride ourselves on fostering a collaborative environment that encourages professional growth and continuous learning.