
HPC Solutions Architect - Remote Opportunity

Lavendo, San Francisco
Remote Full-time $225K/yr - $315K/yr

Experience Level

Mid to Senior

Qualifications

We are looking for candidates with a strong background in HPC architecture, cloud computing, and GPU technologies. Familiarity with NVIDIA hardware and experience designing scalable systems are essential. You should have excellent problem-solving skills and a passion for tackling challenging infrastructure issues.

About the job

About Us

At Lavendo, we are pioneering an infrastructure that most engineers only dream of. We operate an AI-centric cloud platform that integrates expansive GPU clusters, high-speed networking, and cloud-native tools, catering to enterprises, innovative startups, and leading research teams. Our mission is straightforward: empower our clients to efficiently train and execute complex AI and simulation workloads without the need to construct their own supercomputers.

As a publicly traded company, we are rapidly expanding, with R&D centers across North America, Europe, and the Middle East. Our culture emphasizes engineering excellence: minimal bureaucracy, significant ownership, and a focus on tackling challenging infrastructure problems while witnessing the impact of our work on real customer operations.

Your Role as HPC Specialist Solutions Architect

In this pivotal role, you will be the go-to expert for customers looking to establish or enhance advanced GPU and HPC environments in the cloud. This includes multi-rack clusters, high-speed interconnects, intricate scheduling, and strict SLAs regarding throughput and latency.

As an HPC Specialist Solutions Architect, you will design and optimize cutting-edge platforms for AI training, extensive simulations, and data-intensive workloads. You will work closely with NVIDIA's latest hardware (Hopper, Blackwell, and future generations), NVLink/NVSwitch topologies, and InfiniBand/RoCE fabrics, and you will have substantial influence on the evolution of our platform and reference architectures. If you thrive on translating workloads into optimized clusters and maximizing performance, this is the ideal position for you.

Your Responsibilities

  • Cluster Design: Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm. Your considerations will include node types, GPU topology, queues, partitions, and failure scenarios.

  • Infrastructure Optimization: Integrate NVIDIA Hopper and Blackwell-class GPUs with NVLink/NVSwitch and InfiniBand/RoCE, ensuring the hardware layout aligns with the communication patterns of the workloads.

  • Automation: Deploy and manage GPU and Network Operators to standardize drivers, CUDA, firmware, and high-speed networking across extensive fleets, rather than managing each machine individually.

  • Supercomputer Cloud Functionality: Design and validate cloud-native HPC environments that emulate supercomputer capabilities.

About Lavendo

Lavendo is at the forefront of building advanced infrastructure for AI and simulation workloads. Our innovative cloud platform supports a diverse range of clients, from enterprises to startups and research teams, enabling them to leverage powerful computing resources without the need for custom supercomputers.
