About the job
What You'll Do:
Join our Data Infrastructure team as a Senior Site Reliability Engineer (SRE). In this role, you'll ensure the reliability, availability, and performance of our essential data systems hosted on AWS and GCP. Your expertise in cloud infrastructure, automation, and operational excellence will be key to supporting our product for a global customer base.
As a Senior Site Reliability Engineer, your responsibilities will include:
- Designing, implementing, and maintaining reliable data infrastructure services, encompassing SQL, NoSQL, Kafka, and Spark-based data layers, and defining and monitoring Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for them.
- Participating in an on-call rotation to respond to incidents and resolve production issues quickly, and conducting thorough post-incident reviews to pinpoint root causes and implement preventative measures.
- Managing and automating cloud infrastructure using Terraform and Helm, following GitOps principles.
- Implementing and sustaining comprehensive monitoring, logging, and tracing solutions to proactively identify and resolve performance and reliability issues.
- Monitoring and managing data infrastructure capacity, planning for future growth, and optimizing performance through tuning and automation.
- Developing and maintaining automation scripts and tools to streamline operational tasks, enhance efficiency, and minimize manual effort.
- Ensuring the security and compliance of data infrastructure services by implementing best practices for access control, data protection, and vulnerability management.
- Collaborating with development and data engineering teams to facilitate smooth deployments and operational support while maintaining thorough documentation of infrastructure configurations, processes, and procedures.
- Managing and maintaining distributed databases within a Kubernetes environment.
Our Tech Stack:
- Cloud-Based Infrastructure: Fully cloud-based with a Kubernetes-focused tech stack. Compute workloads operate in Kubernetes clusters across multiple regions.
- Infrastructure Management: Extensive use of Terraform and Helm, adhering to GitOps paradigms for managing cloud infrastructure and Kubernetes applications.
- Core Technologies: Heavy use of Kafka, distributed PostgreSQL and Cassandra (CQL) databases, Elasticsearch, and Databricks/Spark. Development of inter-cloud failover options to support multi-cloud strategies.
- Diverse Applications: Teams develop and deploy containerized applications for low-latency APIs, machine learning models, and data processing pipelines.

