
Research Scientist - Distributed Machine Learning

ifm-us, Sunnyvale, CA
On-site, Full-time

Experience Level

Mid to Senior

Qualifications

·       Strong foundation in distributed machine learning frameworks.
·       Experience with GPU cluster management and optimization.
·       Proficiency in programming languages such as Python and C++.
·       Familiarity with machine learning libraries like PyTorch or TensorFlow.
·       Excellent problem-solving skills and ability to work in a collaborative environment.

About the job

About the Institute of Foundation Models
We are a pioneering research lab focused on the development, understanding, application, and risk management of foundation models. Our mission is to propel research forward, cultivate the next generation of AI innovators, and make significant contributions to a knowledge-driven economy.

Join our dynamic team and work at the heart of foundation model training, collaborating with top-tier researchers, data scientists, and engineers. Tackle groundbreaking challenges in AI development and contribute to transformative AI solutions with the potential to revolutionize industries. Your strategic and innovative problem-solving skills will be vital in establishing MBZUAI as a global center for high-performance computing in deep learning, enabling impactful discoveries that inspire the future of AI innovation.

Role Overview
Develop and Enhance Distributed Pre-Training Frameworks
·       Implement DeepSpeed / FSDP / Megatron-LM on multi-node GPU clusters.
·       Design robust launch scripts, resilient checkpointing, and job monitoring across communication backends (e.g., NCCL, Gloo) and GPU health.
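To illustrate the resilient-checkpointing responsibility above: long multi-node runs must survive preemption without ever resuming from a half-written file. A minimal pure-Python sketch of atomic checkpoint save/resume follows (function names are illustrative; real frameworks checkpoint tensors and optimizer state via their own APIs, not JSON):

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write a checkpoint atomically: write to a temp file, fsync, then
    rename. A crash mid-write never leaves a corrupt checkpoint behind."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)

def load_checkpoint(path, default):
    """Resume from the last completed checkpoint, or start fresh."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)
```

The atomic-rename pattern is what makes the checkpoint "resilient": readers only ever see the previous complete file or the new complete file, never a partial one.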
Transform Mathematical Concepts into High-Performance Production Code
·       Prototype novel optimizers or attention mechanisms using PyTorch/NumPy/JAX or similar frameworks.
·       Convert prototypes into efficient CUDA/Triton kernels with custom gradients and performance tests.
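A core part of "custom gradients with performance tests" is verifying a hand-derived backward pass against finite differences before porting it to a CUDA/Triton kernel. A minimal pure-Python sketch (the SiLU activation and all helper names here are illustrative choices, not any framework's API):

```python
import math

def silu_forward(x):
    """Fused SiLU: f(x) = x * sigmoid(x). Returns the output plus the
    cached sigmoid value needed by the backward pass."""
    s = 1.0 / (1.0 + math.exp(-x))
    return x * s, s

def silu_backward(x, s, grad_out):
    """Hand-derived gradient: f'(x) = s * (1 + x * (1 - s))."""
    return grad_out * s * (1.0 + x * (1.0 - s))

def check_gradient(x, eps=1e-6, tol=1e-5):
    """Compare the analytic gradient against central finite differences,
    the same correctness check run on a custom kernel's backward."""
    _, s = silu_forward(x)
    analytic = silu_backward(x, s, 1.0)
    numeric = (silu_forward(x + eps)[0] - silu_forward(x - eps)[0]) / (2 * eps)
    return abs(analytic - numeric) < tol
```

In practice the same comparison is done tensor-wise (e.g., against an eager-mode reference) before any performance tuning begins.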
Enhance Training Efficiency and Stability
·       Lead mixed-precision training efforts, integrating bf16, fp8, and similar formats into standard workflows while weighing accuracy against speed gains and analyzing numerical stability.
·       Utilize kernel fusion, communication tuning, and memory optimization to achieve state-of-the-art throughput.
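The numerical-stability concern behind mixed-precision work can be shown with a toy experiment: accumulating many small updates in half precision silently drops them once the running sum grows, which is why master weights and accumulators are kept in higher precision. This sketch emulates fp16 storage via the `struct` module's half-precision format (a pure-Python stand-in, not how real training code handles dtypes):

```python
import struct

def to_fp16(x):
    """Round a Python float through IEEE half precision (fp16 storage)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def accumulate(updates, fp16_accumulator):
    """Sum many small updates either in emulated fp16 or in full
    precision. In fp16, once the sum exceeds ~0.25 the gap between
    representable values (2**-12) is larger than a 1e-4 update, so
    each add rounds back to the old value and the sum stalls."""
    total = 0.0
    for u in updates:
        if fp16_accumulator:
            total = to_fp16(total + to_fp16(u))
        else:
            total += u
    return total
```

The same effect, at scale, is what motivates loss scaling and fp32 master copies in bf16/fp8 training recipes.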
Accelerate Research Progress
·       Develop logging and metrics systems, along with experiment-tracking tools, to facilitate rapid iteration.
·       Design ablation studies and statistical tests that validate or challenge new concepts.
·       Guide interns and junior engineers through clear asynchronous design documentation and code reviews.
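On the statistical-testing bullet above: a common way to decide whether an ablation's gain is real rather than run-to-run noise is a permutation test on the difference of mean scores. A minimal pure-stdlib sketch (function and parameter names are illustrative):

```python
import random

def permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided permutation test on the difference of means. Returns a
    p-value: the fraction of random relabelings of the pooled scores
    whose mean gap is at least as large as the observed gap."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a)
                   - sum(scores_b) / len(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        a, b = pooled[:n_a], pooled[n_a:]
        if abs(sum(a) / n_a - sum(b) / len(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)  # add-one smoothing
```

A small p-value says the ablation's effect is unlikely under the null of "both variants draw from the same score distribution"; with only a handful of seeds per variant, this nonparametric check is often more honest than assuming normality.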
You will collaborate closely with researchers, deliver production code, and shape the landscape of large language models.

About ifm-us

ifm-us is at the forefront of AI research, dedicated to developing foundation models that transform the landscape of technology and industry. By fostering collaboration and innovation, we aim to prepare the next generation of AI leaders and contribute to a future where AI drives sustainable growth and knowledge sharing.
