About the job
Veeam Software is recognized as a leader in data management, ensuring that organizations harness the full potential of their data and AI solutions while maintaining security and resilience. Present in 30 countries and protecting over 550,000 customers, Veeam empowers businesses to navigate the complexities of data security and AI risk management. Our mission is to drive innovation and impact for some of the world's top brands as we advance together.
About the Role
We are on the lookout for a Staff Site Reliability Engineer to take a pivotal role in our SRE team. In this position, you will be a hands-on technical leader, mentoring senior engineers, influencing product development, and ensuring our systems are designed for reliability, scalability, and observability from the ground up.
Your leadership will be crucial in driving strategic initiatives, spreading SRE practices, and establishing architectural best practices across our platform. This role is essential for aligning teams, enforcing high standards, and scaling SRE principles throughout Veeam.
What You’ll Do
Reliability Engineering & Resilience:
- Serve as a technical authority in your field, mentoring senior engineers and guiding design choices that enhance service reliability and resilience.
- Lead the definition and enforcement of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets; ensure adherence across engineering teams.
- Collaborate with Staff peers across teams to align strategies and advocate for shared reliability standards and objectives.
- Work closely with development and product teams to proactively design for failure, build resilient architectures, and operationalize reliability from the outset.
Observability & Operational Excellence:
- Champion the company-wide adoption of observability best practices and tools.
- Ensure that metrics, logs, and traces deliver deep, actionable insights across systems.
- Lead complex incident responses, conduct postmortems, and drive systemic reliability improvements.
- Promote a culture of learning and continuous improvement through a blameless approach.

