About the job
Join the UiPath Team
The team at UiPath is passionate about harnessing the transformative potential of automation to redefine the way the world operates. We are dedicated to developing industry-leading enterprise software that empowers organizations.
To realize this vision, we seek individuals who are inquisitive, motivated, generous, and authentic. We value those who thrive in a dynamic, fast-paced environment and who genuinely care about their colleagues, the mission of UiPath, and the broader impact of our work.
Are you ready to make a difference?
Your Role
As a Principal Site Reliability Engineer at UiPath, you will play a pivotal role in enhancing the reliability of our expansive, cloud-native systems. This position requires a comprehensive understanding of the full reliability spectrum, going beyond any single domain. You will define and drive the architecture, scalability, measurement, and automation of reliability across our systems.
This role is about shaping reliability practices at UiPath, not merely reacting to outages or writing code. You will collaborate with engineering and platform teams to integrate reliability into our systems, workflows, and organizational culture. Your contributions will raise our standards for monitoring and automation and help ensure that our systems withstand real-world loads and failures.
You will take ownership of service reliability, observability, automation, and continuous improvement initiatives, partnering with teams in Romania and India as necessary.
Your Responsibilities at UiPath
Comprehensive Reliability Ownership: Develop and refine the reliability strategy for our distributed systems, ensuring a balance of availability, performance, velocity, and cost through well-defined SLIs/SLOs and error budgets.
Incident Management & Operational Excellence: Lead and actively participate in the response to high-severity incidents, driving structured troubleshooting under uncertainty and following through with lasting, systemic fixes.
Observability & Operational Insights: Advocate for robust observability practices to make service health and performance risks visible and actionable.
Automation, Tooling & Engineering Discipline: Automate manual operational tasks through effective tooling and self-service options while applying disciplined engineering methodologies.
Infrastructure, Cloud & IaC: Champion reliable and scalable cloud infrastructure utilizing Infrastructure as Code, collaborating with platform teams to establish best practices.
Technical Leadership & Organizational Impact: Influence strategic decisions to improve reliability outcomes and mentor team members to foster a culture of excellence.
