Qualifications
Roles and Responsibilities
Develop automation platforms to manage infrastructure rollouts across multiple cloud providers.
Optimize the telemetry platform to pinpoint customer-impacting events and provide relevant data for debugging.
Collaborate with engineering teams to enhance service performance within the cloud architecture.
Debug live site incidents and conduct postmortem analyses and root cause investigations.
Participate in an SLA-driven on-call rotation, including after-hours, weekends, and rotating holidays.
Required Skills and Experience
At least 5 years of proven experience as a Site Reliability Engineer.
Experience in infrastructure automation; scripting skills in Python or Bash are advantageous.
Familiarity with the Prometheus monitoring stack; experience with Grafana, Mimir, and Loki is a plus.
Solid understanding of Kubernetes and the container ecosystem.
Excellent collaboration and communication skills across teams.
Proficiency in at least one of the major cloud platforms: AWS, Azure, or Google Cloud.
Experience in debugging, diagnosing, and troubleshooting complex production software.
Bachelor's Degree in Computer Science or a related field.
About the job
At SingleStore, we are looking for a passionate Site Reliability Engineer to play a critical role in optimizing and scaling our managed service offerings across leading cloud platforms. This position places you at the forefront of current technology trends, working with a high-performance distributed database managed by Kubernetes and deployed in the cloud. It’s a chance to help redefine what a cloud-focused SRE role looks like.
This is a development-centric position that requires an engineering mindset to tackle operational challenges. As part of our globally distributed team, you will drive SRE practices throughout the organization. Through infrastructure automation, you will help our service grow across multiple cloud environments, with an emphasis on eliminating manual processes. You will also use our monitoring platform to improve the customer experience by systematically identifying and resolving issues that affect our users. As an SRE, you will play a vital part in diagnosing platform issues, drawing on in-depth knowledge of the SingleStore query engine and the underlying infrastructure.