About the job
Veeam is a leading provider of data and AI solutions, dedicated to helping organizations protect and manage their data effectively. Recognized as a pioneer in data resilience and security posture management, we empower businesses to navigate the complexities of identity, data, security, and AI risk. With our headquarters in Seattle and operations in over 30 countries, Veeam proudly safeguards the operations of more than 550,000 customers globally. Join our dynamic team and be part of a transformative journey as we advance together, fostering growth and learning while making a significant impact for renowned brands around the world.
About the Role
As a Staff Site Reliability Engineer, you will take on a pivotal role as a hands-on technical leader within our Site Reliability Engineering (SRE) team. Your expertise will guide senior engineers, influence product development efforts, and ensure our systems are built to be reliable, scalable, and observable from the ground up.
You will spearhead strategic initiatives, mentor peers in SRE practices, and help define architectural best practices across our platform. This role is crucial for aligning teams, enforcing high standards, and scaling SRE principles globally at Veeam.
What You’ll Do
Reliability Engineering & Resilience:
- Serve as a technical authority, mentoring senior engineers and guiding design decisions to enhance service reliability and resilience.
- Lead the establishment and enforcement of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets; ensure adherence across engineering teams.
- Collaborate with fellow staff members across teams to unify strategy and promote shared reliability standards and objectives.
- Engage with development and product teams to proactively design for failure, construct resilient architectures, and operationalize reliability from inception.
Observability & Operational Excellence:
- Promote the organization-wide adoption of observability best practices and tools.
- Ensure that metrics, logs, and traces yield deep, actionable insights across our systems.
- Lead complex incident responses, conduct postmortems, and drive systemic reliability enhancements.
- Encourage and uphold a blameless culture of learning and continuous improvement.

