About the job
Your Responsibilities:
Your Objectives:
- Ensure the reliability, performance, and scalability of production systems.
- Work closely with development and operations teams to implement SRE best practices and minimize manual, repetitive tasks.
System Reliability:
- Monitor and manage the reliability of services and applications in production.
- Define and uphold key performance indicators (KPIs) and service level objectives (SLOs).
- Respond to incidents, conduct root cause analyses (RCA), and draft post-mortems.
Automation and Infrastructure:
- Automate manual and repetitive tasks to reduce toil.
- Develop and maintain infrastructure as code (IaC) using tools such as Terraform and Ansible.
- Implement and manage CI/CD pipelines and GitOps practices.
Observability and Monitoring:
- Set up monitoring and observability tools (e.g., Prometheus, Grafana, ELK, Datadog).
- Monitor the “four golden signals”: latency, traffic, errors, and saturation.
- Configure alerts and notifications for potential incidents.
Collaboration and Communication:
- Collaborate closely with development and operations teams.
- Foster a blameless culture for incident analysis and learning.
- Communicate effectively with stakeholders regarding reliability and performance issues.

