Senior HPC Cluster Engineer

Nebius
Amsterdam, Netherlands; Remote - Europe
Remote Full-time


Experience Level

Senior

Qualifications

In this position, your responsibilities will include:
Tuning Performance: Optimize the performance of GPU clusters and InfiniBand networks to ensure peak operation in HPC and GPU-centric environments.
Issue Analysis and Troubleshooting: Identify and resolve root causes of GPU and InfiniBand network issues, proposing effective corrective actions.
Hardware Integration: Incorporate new hardware into the existing infrastructure, providing support for new GPU technologies through software stacks like Kubernetes, QEMU, and KVM.
Automation Enhancement: Upgrade automation systems for proactive monitoring, detection, and resolution of issues in GPU and InfiniBand environments.
Device Management: Configure and manage GPU devices and InfiniBand fabrics to ensure efficient and reliable operation.

About the job

Why Choose Nebius
Nebius is pioneering the future of cloud computing to empower the global AI economy. We provide innovative tools and resources that enable our clients to tackle real-world challenges and revolutionize industries without incurring exorbitant infrastructure costs or the need to assemble large in-house AI/ML teams. Our workforce is at the forefront of AI cloud infrastructure, collaborating with some of the most talented and visionary leaders and engineers in the sector.

Our Work Environment
With our headquarters in Amsterdam and a presence on Nasdaq, Nebius boasts a global reach, with R&D hubs located throughout Europe, North America, and Israel. Our diverse team of over 1,400 employees includes more than 400 skilled engineers with extensive expertise in both hardware and software engineering, complemented by an in-house AI R&D team.

The Role

We are seeking a Senior HPC Cluster Engineer to become an integral part of our team, contributing to the advancement of our state-of-the-art hyperscaler platform. The GPU & InfiniBand team is tasked with enhancing and optimizing the core components of our Cloud platform, specifically focusing on GPU computing, InfiniBand networking, and the KVM/QEMU stack. In this role, you will work closely with hardware virtualization and device emulation technologies, ensuring high performance and security in multi-GPU HPC environments. Your responsibilities will involve analyzing, troubleshooting, and improving the infrastructure to accommodate new hardware, optimizing system performance, and automating fault detection and resolution in a complex system.

About Nebius

Nebius is at the forefront of the cloud computing revolution, providing essential tools for the AI economy while maintaining a lean operational model. Our global team collaborates in a dynamic environment, delivering innovative solutions that drive industry transformation.
