Site Reliability Engineer at SiteTracker | Canada

On-site Full-time


Experience Level

Mid to Senior

Qualifications

A strong background in Site Reliability Engineering, proficiency in AWS services, and a passion for building reliable systems are essential. Demonstrated experience in defining SLIs and SLOs and leading incident responses is highly valued.

About the job

Join Our Team as a Site Reliability Engineer

Seize the opportunity to create a robust reliability practice from the ground up at SiteTracker. You will be instrumental in establishing engineering standards, including Service Level Objectives (SLOs), error budgets, and observability, designed to safeguard our platform as we scale for enterprise clients and enhance our AI capabilities. With full autonomy to define strategy and the trust to implement it, your role will ensure our AI workloads (Evaluations, RAG, and LLM processing) consistently meet premier reliability benchmarks. If you thrive on solving challenges proactively and view toil as an engineering problem, this position promises to be a pivotal step in your career.

Your Responsibilities

As a Staff or Senior Staff Site Reliability Engineer, you will collaborate with existing engineers to shift our organization from a reactive approach to a proactive, methodical reliability practice. You will spearhead the intentional advancement of our infrastructure, recognizing the right moments to adopt new tools and transitioning from manual scripts and templates only when necessary. Whether architecting incident response frameworks or tackling unique reliability issues for AI agents, your contributions will amplify the effectiveness of the entire engineering team.

By approaching every challenge with a consultative perspective, you will inform technical decisions grounded in data rather than instinct, ensuring our multi-region or service mesh adoption roadmap is future-ready. You will not merely receive tasks; rather, you will take ownership of strategies for production-readiness and deployment safety, fostering the organizational trust essential for making reliability a key differentiator for our product.

Required Skills and Qualifications

Extensive SRE Expertise

  • Define SLIs and SLOs for critical user journeys to drive proactive engineering choices.
  • Lead live production incident response as an Incident Commander and conduct blameless postmortems that inspire actionable outcomes.
  • Develop observability tools that narrate a system's behavior, creating intuitive dashboards and actionable alerts.
  • Transform an organization from reactive incident management to a structured reliability practice, significantly reducing paging volume.
  • Establish error-budget policies to inform data-driven decisions between feature deployment and reliability maintenance.

Advanced Technical Proficiency in AWS

  • Competently design and operate AWS services, including VPC, IAM, compute (ECS/EC2/Lambda), managed data services, and load balancing.
  • Effectively manage our existing CloudFormation and Bash scripts through GitHub Actions without defaulting to Terraform.

About SiteTracker

At SiteTracker, we are committed to redefining reliability in the tech world. Our innovative platform supports enterprise customers as we expand our AI capabilities, ensuring they receive the highest quality service. Join us in shaping the future of reliability engineering.
