About the job
Why join Nebius?
Nebius is at the forefront of a transformative era in cloud computing, dedicated to empowering the global AI economy. We provide our clients with the essential tools and resources to tackle real-world challenges and revolutionize industries, all while avoiding hefty infrastructure expenditures and the necessity of establishing extensive in-house AI/ML teams. Our talented workforce operates at the cutting edge of AI cloud infrastructure, collaborating with some of the most experienced and inventive leaders and engineers in the field.
Our Work Environment
Headquartered in Amsterdam and publicly listed on Nasdaq, Nebius boasts a global presence with R&D hubs across Europe, North America, and Israel. Our team comprises over 1,400 professionals, including more than 400 highly skilled engineers with substantial expertise in both hardware and software engineering, supplemented by an in-house AI R&D team.
We are seeking a Senior HPC Systems Engineer to play a key role in developing our hyperscaler platform. The role bridges hardware and software: you will work with the platform's core components while analyzing and optimizing the performance of large-scale GPU clusters.
You will engage with the entire stack, from hardware and system software to networking (InfiniBand/RoCE), virtualization (KVM/QEMU), and distributed communication layers (such as MPI and NCCL).
Key Responsibilities:
- Analyze system behavior across multiple layers, identify performance bottlenecks, and drive improvements to how our clusters are built, operated, tuned, and validated.
- Investigate and resolve performance issues in GPU clusters under real workloads, including training and inference scenarios.
- Evaluate new hardware, system configurations, and tuning methods, and integrate them across the software stack.
- Support complex performance-related escalations from internal teams and clients.
- Collaborate closely with infrastructure, software engineering, and hardware vendor teams (such as NVIDIA, Mellanox, and Intel).
- Contribute to hardware and cluster qualification (acceptance), ensuring systems meet performance standards.
