Machine Learning Infrastructure Engineer

ifm-us, Sunnyvale, CA
On-site, Full-time

Qualifications

Ideal candidates will possess extensive experience in distributed systems, strong programming skills, and a deep understanding of machine learning principles. Familiarity with frameworks such as PyTorch or TensorFlow is preferred but not required.

About the job

About the Institute of Foundation Models
We are a pioneering research laboratory focused on the development, understanding, application, and risk management of foundational models. Our mission is to propel research forward, cultivate the next generation of AI innovators, and make substantial contributions to a knowledge-driven economy.

Join us and collaborate with top-tier researchers, data scientists, and engineers on the forefront of foundational model training. Engage in solving critical challenges that can redefine entire sectors through advanced AI solutions. Your strategic and innovative problem-solving skills will play a vital role in positioning MBZUAI as an international leader in high-performance computing for deep learning, facilitating discoveries that will inspire future AI trailblazers.

The Role 

We are seeking a skilled distributed ML infrastructure engineer to enhance and expand our training systems. You will collaborate closely with distinguished researchers and engineers to:
• Develop and scale distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
• Implement distributed optimizers based on mathematical specifications
• Create robust configuration and launching systems across multi-node, multi-GPU clusters
• Manage experiment tracking, metrics logging, and job monitoring for enhanced external visibility
• Enhance the reliability, maintainability, and performance of training systems

While much of your work will support large-scale pre-training, prior pre-training experience is not mandatory; strong infrastructure and systems expertise is our primary focus.

Key Responsibilities 

• Distributed Framework Ownership – Extend or adapt training frameworks (e.g., DeepSpeed, FSDP) to accommodate new applications and architectures.
• Optimizer Implementation – Convert mathematical optimizer specifications into distributed implementations.
• Launch Config & Debugging – Develop and troubleshoot multi-node launch scripts with adaptable batch sizes and parallelism strategies.

About ifm-us

At ifm-us, we are dedicated to advancing AI through innovative research and development. Our team is committed to empowering the next generation of AI leaders and fostering a transformative knowledge economy.