About the job
About the Team
The Training Runtime team builds the distributed machine-learning training framework that underpins everything from exploratory research experiments to large-scale training runs. Our mission is twofold: to make researchers more productive, and to enable the development of cutting-edge AI models by building a cohesive, modular runtime that evolves with researchers as they move through successive scaling challenges.
Our work centers on three pillars:
- High-performance, asynchronous, zero-copy data movement that is aware of tensor and optimizer state
- Robust, highly available training frameworks with state management, resilient checkpointing, deterministic orchestration, and comprehensive observability
- Efficient management of distributed processes for long-running, job-specific, and user-defined tasks
Working closely with model-stack, research, and platform teams, we integrate industry-leading capabilities into a flexible, developer-centric runtime that lets teams iterate quickly and operate reliably at any scale. Our success is measured by improvements in both training throughput (how fast models train) and researcher throughput (how quickly ideas become experiments and products).
About the Role
As a Machine Learning Framework Engineer focused on training, you will improve the training efficiency of our internal framework while enabling researchers to explore new ideas. The role demands strong engineering skills spanning the design, implementation, and optimization of state-of-the-art AI models, a commitment to writing reliable machine-learning code, and a deep understanding of supercomputer performance. The projects you take on will aim to push the field of machine learning forward.
We are looking for people who are passionate about performance optimization, understand distributed systems deeply, and hold their code to a high standard. Because our training framework supports runs spanning very large numbers of GPUs, even small performance improvements translate into significant gains.
This position is based in London, UK, on a hybrid schedule with three days a week in the office. Relocation assistance is available for new hires.
In this role, you will:
- Apply cutting-edge techniques in our internal training framework to achieve exceptional hardware efficiency for training runs
- Analyze and optimize the performance of our training framework
- Collaborate with researchers to facilitate the development of next-generation models

