About the job
At Veeam, we pride ourselves on being the Data and AI Trust Company, dedicated to empowering organizations in effectively managing their data and AI. Our mission is to ensure that data is fully comprehended, secured, and resilient, enabling the acceleration of safe AI at scale. As a recognized leader in data resilience and security posture management, Veeam stands at the forefront of the integration of identity, data, security, and AI risk management. With our headquarters in Seattle and a presence in over 30 countries, we safeguard the operations of over 550,000 customers globally. We invite you to join us on this journey of fearless innovation, growth, and making a significant impact for some of the world's most notable brands.
About the Role
We are seeking a highly skilled Senior Software Engineer - Reliability to take a hands-on leadership role within our Site Reliability Engineering (SRE) team. In this position, you will guide senior engineers, influence product development initiatives, and ensure our systems are designed for reliability, scalability, and observability from the ground up.
You will lead strategic projects, mentor peers in SRE methodologies, and contribute to defining architectural best practices across our platform. This role is crucial in aligning teams and fostering high standards while scaling SRE principles globally within Veeam.
What You’ll Do
Reliability Engineering & Resilience
- Design and enhance infrastructure for high availability, fault tolerance, and scalability across public cloud platforms, starting with Azure and preparing for future expansion.
- Establish and uphold SLIs, SLOs, and error budgets to define and enforce reliability objectives.
- Lead incident response, conduct blameless postmortems, and facilitate learning sessions that share knowledge across the engineering team and drive systemic change.
Observability & Operational Excellence
- Promote the adoption of comprehensive observability practices, ensuring that telemetry (logs, metrics, and traces) is actionable and thorough.
- Develop automated systems for monitoring and alerting to enhance operational excellence.

