About the job
Join us at the forefront of technology as we revolutionize the data storage industry. At Pure Storage, you will engage in innovative thinking, collaborate with the brightest minds in the field, and grow alongside us.
Our mission is to create transformative solutions that empower businesses to harness the full potential of their data. If you are ready to embrace limitless opportunities and make a significant impact, we invite you to be part of our journey.
Data is often called the new oil. For businesses that want to get the most from their data, a robust infrastructure for storing and querying it is essential. This is where Everpure's advanced hardware and software solutions come in, helping customers unlock the full value of their data.
THE ROLE
As a Site Reliability Engineer at Everpure, you will oversee the infrastructure, internal tools, and production services critical to our operations. You will collaborate with all internal engineering teams to ensure the reliability of services that support the development of innovative products and features across various environments, from data centers to public cloud platforms.
Your mission as a Site Reliability Engineer will be to redefine the resiliency of Everpure's vital infrastructure applications. You will lead the design and implementation of advanced observability solutions and shape the future of application data management and incident response at Pure, using cutting-edge technologies and AI to streamline engineering processes.
We are seeking engineers with strong skills in both software and systems who are passionate about reliability, performance, and efficiency, and who have experience developing tools, services, and automation that improve production services.
WHAT YOU'LL DO
- Design, operate, maintain, and troubleshoot enterprise systems, including databases, message queues, APIs, and distributed applications, guided by reliability measures such as SLOs and error budgets.
- Establish sustainable incident response practices and conduct blameless postmortems to prevent issues from recurring.
- Support services before launch through system design, software platform development, capacity planning, and launch reviews.
- Sustainably scale systems through scripting and automation, continuously improving operational reliability and efficiency.
- Collaborate across teams and time zones to deliver high-quality outcomes for our customers.

