About the job
About Our Team
The Training Runtime team is at the forefront of developing a sophisticated distributed machine-learning training runtime that supports everything from initial research prototypes to cutting-edge model deployments. Our mission is twofold: to enhance the capabilities of researchers and to facilitate large-scale model training. We are creating a cohesive and flexible runtime environment that evolves with researchers as they scale their projects.
Our initiatives revolve around three key pillars: optimizing high-performance, asynchronous, zero-copy data movement that is aware of tensor and optimizer state; constructing resilient, fault-tolerant training frameworks (including robust training loops, effective state management, resilient checkpointing, and comprehensive observability); and managing distributed processes for long-running, job-scoped workloads. By embedding proven large-scale capabilities into a user-friendly runtime, we empower teams to iterate rapidly and operate reliably at any scale, working closely with model-stack, research, and platform teams. Our success is measured by both training throughput (how fast models train) and researcher efficiency (how quickly ideas become experiments and products).
About the Position
As a Machine Learning Framework Engineer on our Training team, you will be pivotal in improving the training throughput of our internal framework while empowering researchers to explore innovative ideas. This role demands exceptional engineering skill: designing, implementing, and optimizing state-of-the-art AI models, and writing clean, efficient machine learning code, which is often harder than it appears. A deep understanding of supercomputer performance characteristics will also be critical. Ultimately, every project you undertake will aim to advance the field of machine learning.
We seek individuals who are passionate about performance optimization, have a solid grasp of distributed systems, and hold their code to a high standard of correctness. Because our training framework powers long runs across many GPUs, even modest performance improvements have an outsized impact on our operations.
This position is based in San Francisco, CA, and adheres to a hybrid work model requiring three days in the office each week. We also provide relocation assistance for new hires.
Key Responsibilities:
Implement advanced techniques within our internal training framework to maximize hardware efficiency during training sessions.
Conduct profiling and optimization of our training framework to enhance performance.
Collaborate with researchers to facilitate the development of next-generation machine learning models.
You Will Excel in This Role If You:
Possess a strong passion for optimizing system performance.
Have a profound understanding of distributed systems and their complexities.
Demonstrate meticulous attention to detail, especially in code quality and debugging.

