About the job
Join Our Team as a Site Reliability Engineer
Seize the opportunity to create a robust reliability practice from the ground up at SiteTracker. You will be instrumental in establishing engineering standards, including Service Level Objectives (SLOs), error budgets, and observability, designed to safeguard our platform as we scale for enterprise clients and enhance our AI capabilities. With full autonomy to define strategy and the trust to implement it, your role will ensure our AI workloads (Evaluations, RAG, and LLM processing) consistently meet premier reliability benchmarks. If you thrive on solving challenges proactively and view toil as an engineering problem, this position promises to be a pivotal step in your career.
Your Responsibilities
As a Staff or Senior Staff Site Reliability Engineer, you will collaborate with existing engineers to shift our organization from a reactive approach to a proactive, methodical reliability practice. You will spearhead the intentional advancement of our infrastructure, recognizing the right moments to adopt new tools and transitioning from manual scripts and templates only when necessary. Whether architecting incident response frameworks or tackling unique reliability issues for AI agents, your contributions will amplify the effectiveness of the entire engineering team.
By approaching every challenge with a consultative perspective, you will inform technical decisions grounded in data rather than instinct, ensuring that our roadmap, whether for multi-region or service mesh adoption, is future-ready. You will not merely receive tasks; rather, you will take ownership of strategies for production-readiness and deployment safety, fostering the organizational trust essential for making reliability a key differentiator for our product.
Required Skills and Qualifications
Extensive SRE Expertise
- Define SLIs and SLOs for critical user journeys to drive proactive engineering choices.
- Lead live production incident response as an Incident Commander and conduct blameless postmortems that inspire actionable outcomes.
- Develop observability tools that narrate a system's behavior, creating intuitive dashboards and actionable alerts.
- Transform an organization from reactive incident management to a structured reliability practice, significantly reducing paging volume.
- Establish error-budget policies to inform data-driven decisions between feature deployment and reliability maintenance.
Advanced Technical Proficiency in AWS
- Design and operate core AWS services, including VPC, IAM, compute (ECS/EC2/Lambda), managed data services, and load balancing.
- Manage our existing CloudFormation templates and bash scripts through GitHub Actions without defaulting to Terraform.
