About the job
Why Choose Nebius?
Nebius is at the forefront of revolutionizing cloud computing, catering specifically to the global AI economy. Our mission is to provide our clients with the essential tools and resources needed to tackle real-world challenges and innovate industries, all without incurring hefty infrastructure expenses or the necessity of assembling large in-house AI/ML teams. Join us and collaborate with some of the brightest minds in AI cloud infrastructure, alongside seasoned leaders and engineers.
Where We Operate
Founded in Amsterdam and publicly traded on Nasdaq, Nebius boasts a worldwide presence with R&D centers located throughout Europe, North America, and Israel. Our workforce comprises over 1,400 dedicated professionals, including more than 400 highly skilled engineers proficient in both hardware and software engineering, complemented by a dedicated in-house AI R&D team.
Your Role
As a Senior Site Reliability Engineer (SRE) within the Compute Node team at Nebius AI Cloud, you will play a pivotal role in constructing and managing the cluster scheduler and node-level services that oversee and maintain virtual machines across our cloud regions. The focus of this role is on Linux systems engineering, virtualization, and operational reliability. You will work closely with the operating system and hypervisor, influencing the integration of reliability and observability within the Compute platform.
Your Key Responsibilities:
- Guarantee the reliability, availability, and performance of compute nodes hosting virtual machines.
- Analyze and troubleshoot Linux systems in both user space and kernel space, recognizing their capabilities, limitations, and trade-offs.
- Resolve intricate production issues involving CPU, memory, NUMA, cgroups, and scheduling.
- Engage hands-on with virtualization and containerization using QEMU/KVM and Linux-based technologies.
- Develop and enhance observability as a core capability of the node layer, including metrics, logs, traces, alerts, SLIs, and SLOs.
- Lead incident response efforts, conduct root-cause analyses, and perform postmortems, driving long-term enhancements in reliability.
- Work in close partnership with platform, kernel/hypervisor, GPU, and infrastructure teams to refine system design and operability.
