Senior Distributed Systems Engineer

On-site Full-time

Experience Level

Senior

Qualifications

The ideal candidate has a strong foundation in distributed systems and high-performance computing, with proven experience in communication stack optimization and systems-level debugging. Familiarity with large-scale GPU environments and a deep understanding of performance metrics are essential.

About the job

About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates large-scale GPU supercomputing systems for training cutting-edge foundation models. We treat performance, fault tolerance, and scalability as interdependent across model architecture, communication systems, runtime, and hardware topology.
This position is pivotal to our mission: improving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.

The Mission
We seek a highly skilled engineer to collaboratively design and optimize the communication stack for large-scale distributed training, with a focus on hybrid parallelism and Mixture-of-Experts (MoE) workloads. This is a systems-level engineering role centered on performance enhancement, distributed debugging, and communication-runtime co-design.
· Design and optimize expert-parallel and hybrid-parallel communication patterns
· Drive high-performance hierarchical collectives for MoE workloads
· Co-design runtime orchestration with communication topology awareness
· Mitigate tail latency and enhance determinism across thousands of GPUs
· Architect fault-tolerant distributed execution that withstands real-world cluster failures

Core Technical Scope
· Communication-compute overlap and topology-aware collective optimization
· In-depth debugging of NCCL, RDMA, and custom communication layers
· Implementing hybrid expert parallel strategies in modern large-scale MoE systems
· Developing elastic and resilient distributed job orchestration concepts
· Conducting congestion analysis and routing optimization across InfiniBand/RoCE fabrics
· Executing microbenchmarking and performance modeling for communication-intensive workloads

Expected Technical Depth
· Expertise in hybrid expert parallel communication strategies

About Institute of Foundation Models

The Institute of Foundation Models (IFM) is at the forefront of GPU supercomputing, focusing on the development of foundation models that revolutionize machine learning capabilities. We value innovation, collaboration, and the relentless pursuit of efficiency in computational processes.
