
Senior Site Reliability Engineer (SRE) - Compute Node Team

Nebius
Amsterdam, Netherlands; Remote - Europe
Remote, Full-time


Experience Level

Senior

Qualifications

We Expect You to Have: In-depth Linux expertise, including a comprehensive understanding of both user and kernel space, and knowledge of kernel subsystems and their intricacies.

About the job

Why Choose Nebius?
Nebius is at the forefront of revolutionizing cloud computing, catering specifically to the global AI economy. Our mission is to provide our clients with the essential tools and resources needed to tackle real-world challenges and innovate industries, all without incurring hefty infrastructure expenses or the necessity of assembling large in-house AI/ML teams. Join us and collaborate with some of the brightest minds in AI cloud infrastructure, alongside seasoned leaders and engineers.

Where We Operate
Founded in Amsterdam and publicly traded on Nasdaq, Nebius boasts a worldwide presence with R&D centers located throughout Europe, North America, and Israel. Our workforce comprises over 1,400 dedicated professionals, including more than 400 highly skilled engineers proficient in both hardware and software engineering, complemented by a dedicated in-house AI R&D team.

Your Role

As a Senior Site Reliability Engineer (SRE) within the Compute Node team at Nebius AI Cloud, you will play a pivotal role in constructing and managing the cluster scheduler and node-level services that oversee and maintain virtual machines across our cloud regions. The focus of this role is on Linux systems engineering, virtualization, and operational reliability. You will work closely with the operating system and hypervisor, influencing the integration of reliability and observability within the Compute platform.

Your Key Responsibilities:
  • Guarantee the reliability, availability, and performance of compute nodes hosting virtual machines.
  • Analyze and troubleshoot Linux systems in both user and kernel space, recognizing their capabilities, limitations, and trade-offs.
  • Resolve intricate production issues involving CPU, memory, NUMA, cgroups, and scheduling.
  • Engage hands-on with virtualization and containerization using QEMU/KVM and Linux-based technologies.
  • Develop and enhance observability as a core capability of the node layer, including metrics, logs, traces, alerts, SLIs, and SLOs.
  • Lead incident response efforts, conduct root-cause analyses, and perform postmortems, driving long-term enhancements in reliability.
  • Work in close partnership with platform, kernel/hypervisor, GPU, and infrastructure teams to refine system design and operability.

About Nebius

Nebius is a pioneering company in the realm of cloud computing, dedicated to addressing the needs of the global AI economy. Our innovative solutions enable businesses to overcome real-world challenges and innovate, all while keeping infrastructure costs manageable. With a strong team of experienced engineers and leaders, we are reshaping the future of AI cloud infrastructure.
