About the job
Why Join Nebius
Nebius is pioneering a transformative era in cloud computing, tailored to meet the demands of the global AI economy. We provide the essential tools and resources that empower our clients to address real-world challenges and revolutionize their industries without incurring substantial infrastructure costs or assembling large in-house AI/ML teams. Our team works at the forefront of AI cloud infrastructure, alongside some of the most talented and innovative leaders and engineers in the industry.
Our Work Environment
Headquartered in Amsterdam and publicly traded on Nasdaq, Nebius has a worldwide presence with R&D centers across Europe, North America, and Israel. Our diverse team of over 1,400 professionals includes more than 400 highly skilled engineers, well-versed in both hardware and software engineering, complemented by an in-house AI R&D team.
The Role
We are seeking a Network Site Reliability Engineer (NetSRE) to play a critical role in developing and maintaining the network, the foundational infrastructure of Nebius on which all other services depend. This engineering-centric SRE position involves defining clear reliability objectives, implementing the tooling and automation needed to achieve them, and enhancing the operational safety of the network as we scale rapidly.
Your Responsibilities Will Include:
Establish and oversee reliability benchmarks for network services and critical pathways (including SLIs/SLOs, availability targets, and error budgets as applicable).
Enhance reliability across the entire network, focusing not just on services, but also on site readiness, inter-site connectivity (DCI), and operational protocols.
Lead incident response in your areas, directing investigations and postmortems and turning failures into lasting fixes rather than recurring issues.
Develop and refine observability tooling, including actionable metrics, logs, traces, alerting systems, and faster debugging workflows.
