
Technical Staff Member - Pre-Training Infrastructure

Reflection AI · San Francisco
On-site · Full-time





Desired Qualifications

  • Proven experience building or managing distributed training systems for large-scale machine learning models.

  • Strong familiarity with modern distributed training frameworks such as Megatron, DeepSpeed, or comparable large-scale training systems.

  • Understanding of large-scale model parallelism techniques (data, tensor, pipeline, or expert parallelism).

  • Demonstrated experience optimizing training throughput and GPU utilization in large distributed settings.

  • Familiarity with GPU communication libraries such as NCCL, and with performance tuning for distributed workloads.

  • Experience collaborating closely with ML researchers to bring experimental training workflows to production.

  • Exceptional debugging skills across GPU compute, distributed training systems, and large-scale workloads.

About the job

Our Mission

At Reflection AI, our goal is to develop open superintelligence and make it universally accessible.

We are pioneering open-weight models tailored for individuals, agents, enterprises, and even entire nations. Our diverse team comprises talented AI researchers and industry veterans from organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, Anthropic, and many more.

Role Overview

  • Construct and enhance distributed training systems that drive the pre-training of cutting-edge models.

  • Collaborate with research teams to design and execute extensive training runs for foundational models.

  • Create infrastructure that facilitates efficient training across thousands of GPUs leveraging contemporary distributed training frameworks.

  • Enhance training throughput, stability, and efficiency for extensive model training tasks.

  • Work closely with pre-training researchers to convert experimental concepts into scalable, production-ready training systems.

  • Boost performance of distributed training tasks through optimization of communication, memory management, and GPU utilization.

  • Develop and maintain training pipelines that accommodate large-scale datasets, checkpointing, and iterative experiments.

  • Identify and resolve performance bottlenecks within distributed training systems, including model parallelism, GPU communication, and training runtime environments.

  • Contribute to the creation of systems that promote swift experimentation and iteration on novel training methods.

About Reflection AI

Reflection AI is at the forefront of artificial intelligence, dedicated to creating open superintelligence that is accessible to everyone. Our innovative approach leverages the expertise of a diverse team from top-tier AI organizations, ensuring that we are equipped to tackle the challenges of tomorrow.
