About the job
About the Team
The Training Runtime team builds the distributed machine-learning training framework that underpins everything from exploratory research experiments to large-scale training runs. Our mission is twofold: to make researchers more productive, and to enable the development of cutting-edge AI models by building a cohesive, modular runtime that evolves with researchers as they move through successive scaling challenges.
Our work centers on three pillars:
- High-performance, asynchronous, zero-copy data movement that is aware of tensor and optimizer state
- Robust, highly available training frameworks with state management, resilient checkpointing, deterministic orchestration, and comprehensive observability
- Efficient management of distributed processes for long-running, job-specific, and user-defined tasks
Working closely with model-stack, research, and platform teams, we integrate industry-leading capabilities into a flexible, developer-centric runtime that lets teams iterate quickly and operate reliably at any scale. Our success is measured by improvements in both training throughput (how fast models train) and researcher throughput (how quickly ideas become experiments and products).
About the Role
As a Machine Learning Framework Engineer focused on training, you will improve the training efficiency of our internal framework while enabling researchers to explore new ideas. The role demands strong engineering skills spanning the design, implementation, and optimization of state-of-the-art AI models, a commitment to writing reliable machine-learning code, and a deep understanding of supercomputer performance. The projects you take on will aim to push the field of machine learning forward.
We are looking for people who are passionate about performance optimization, understand distributed systems deeply, and hold their code to a high standard. Because our training framework supports runs spanning very large numbers of GPUs, even small performance improvements translate into significant gains.
This position is based in London, UK, on a hybrid schedule with three days a week in the office. Relocation assistance is available for new hires.
In this role, you will:
- Apply cutting-edge techniques in our internal training framework to achieve exceptional hardware efficiency for training runs
- Analyze and optimize the performance of our training framework
- Collaborate with researchers to facilitate the development of next-generation models

