Senior HPC Cluster Engineer

Nebius
Prague, Czech Republic; Remote - Europe
Remote Full-time

Experience Level

Senior

Qualifications

Key Responsibilities:
Optimize Performance: Fine-tune GPU clusters and InfiniBand networks to guarantee peak performance in HPC and GPU-centric environments.
Issue Analysis and Resolution: Identify and troubleshoot root causes of GPU and InfiniBand network issues, proposing effective corrective measures.
Hardware Integration: Seamlessly incorporate new GPU hardware into the existing infrastructure, utilizing software stacks such as Kubernetes, QEMU, and KVM.
Enhance Automation: Develop and improve automation systems for proactive monitoring, detection, and resolution of issues in GPU and InfiniBand environments.
Device Management: Configure and manage GPU devices and InfiniBand fabrics, ensuring reliable and efficient operations.

About the job

Why Choose Nebius?
Nebius is at the forefront of a transformative era in cloud computing, dedicated to empowering the global AI economy. We provide our clients with innovative tools and resources designed to address real-world challenges and revolutionize industries, all while minimizing infrastructure expenses and eliminating the necessity for large in-house AI/ML teams. Our talented workforce operates at the cutting edge of AI cloud infrastructure, collaborating with some of the most skilled and pioneering leaders and engineers in the industry.

Our Work Environment
Based in Amsterdam and publicly traded on Nasdaq, Nebius boasts a global presence with R&D hubs located across Europe, North America, and Israel. Our team comprises over 1,400 professionals, including more than 400 highly specialized engineers with extensive expertise in both hardware and software engineering, complemented by a dedicated in-house AI R&D team.

The Role

We are seeking a Senior HPC Cluster Engineer to become an integral part of our team, contributing to the advancement of our cutting-edge hyperscaler platform. As a member of the GPU & InfiniBand team, you will focus on enhancing and optimizing the core components of our cloud platform, with a specific emphasis on GPU computing, InfiniBand networks, and the KVM/QEMU stack. Your role will involve working closely with hardware virtualization and device emulation technologies to ensure high performance and security in multi-GPU, HPC environments. You will analyze, troubleshoot, and refine our infrastructure to support new hardware, optimize system performance, and automate fault detection and resolution within complex systems.

About Nebius

Nebius is a pioneering cloud computing firm that is reshaping the AI landscape by providing scalable solutions that reduce costs and optimize resources for our clients. With a diverse and talented workforce, we are committed to innovation and excellence in AI infrastructure.
