
Software Engineer, Pre-training Systems at Magic | San Francisco

Magic.dev | San Francisco
On-site | Full-time





Qualifications

Candidates should have a strong foundation in software engineering and distributed systems, with hands-on experience training large models on multi-node GPU setups and a thorough understanding of parallelism strategies and their performance trade-offs. Ideal applicants have debugged complex issues in production machine learning systems, take a proactive approach to owning critical infrastructure, and can point to concrete improvements they have made to system performance or reliability.

About the job

At Magic, we are dedicated to creating safe artificial general intelligence (AGI) that propels humanity forward in tackling the most pressing global challenges. We believe the most effective route to safe AGI is automating research and code generation so that models can be improved, and alignment issues resolved, more reliably than humans could achieve alone. Our methodology combines cutting-edge pre-training at scale, domain-specific reinforcement learning (RL), ultra-long context capabilities, and optimized inference-time computation.

Role Overview

In your role as a Software Engineer on the Pre-training Systems team, you will be responsible for designing and managing the distributed infrastructure necessary for training Magic’s long-context models at scale.

This position emphasizes large-scale model training utilizing extensive GPU clusters. You will operate at the intersection of deep learning and distributed systems, ensuring that training processes are efficient, reliable, and reproducible under extreme conditions.

Magic’s long-context models present complex systems challenges, such as sustained memory usage, communication overhead across thousands of devices, long-duration jobs requiring fault tolerance, and efficient sequence packing within hardware limitations. You will take ownership of the systems that ensure large-scale pre-training is both stable and rapid.
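One of the challenges named above, efficient sequence packing within a fixed context window, can be illustrated with a toy sketch. The function below greedily packs variable-length sequences into fixed-size context windows using first-fit decreasing. This is a hedged illustration only, not Magic's actual pipeline: the `pack_sequences` name and the greedy strategy are assumptions for demonstration, and real pre-training systems also handle attention masking and cross-document boundaries.

```python
def pack_sequences(lengths, context_len):
    """Greedily pack variable-length sequences into fixed-size context
    windows (first-fit decreasing), reducing padding waste.

    Toy illustration only -- not a production packing pipeline.
    """
    # Each bin is [remaining_capacity, list_of_sequence_lengths].
    bins = []
    for length in sorted(lengths, reverse=True):
        if length > context_len:
            raise ValueError(
                f"sequence of length {length} exceeds context {context_len}"
            )
        for b in bins:
            if b[0] >= length:
                # Fits in an existing window: place it there.
                b[0] -= length
                b[1].append(length)
                break
        else:
            # No window has room: open a new one.
            bins.append([context_len - length, [length]])
    return [b[1] for b in bins]
```

For example, packing sequences of lengths 700, 300, 512, 512, and 200 into 1024-token windows yields three windows rather than five, cutting wasted padding roughly in half.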

Your Contributions

  • Scale distributed training across large GPU clusters, implementing data, tensor, and pipeline parallelism.

  • Optimize communication strategies and gradient synchronization.

  • Enhance checkpointing, fault tolerance, and job recovery mechanisms.

  • Profile and resolve performance bottlenecks across computing, networking, and storage.

  • Advance experiment reproducibility and orchestration workflows.

  • Boost hardware utilization and overall training throughput.

  • Collaborate with Kernel and Research teams to align model architecture with system capabilities.
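The data-parallel and gradient-synchronization items above can be sketched in miniature: each worker computes gradients on its own data shard, then an all-reduce averages them so every replica applies the identical update. The pure-Python `allreduce_mean` helper below is a hypothetical stand-in for illustration; production systems use NCCL collectives over high-bandwidth interconnects.

```python
def allreduce_mean(worker_grads):
    """Element-wise average of per-worker gradients, as an all-reduce
    with a mean op would compute. Each worker's gradient is a flat
    list of floats; every worker receives the same averaged result,
    keeping model replicas in sync.

    Toy stand-in for a real collective (e.g. NCCL ring all-reduce).
    """
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    averaged = [
        sum(g[i] for g in worker_grads) / n_workers
        for i in range(n_params)
    ]
    # All-reduce delivers an identical copy to every replica.
    return [list(averaged) for _ in range(n_workers)]
```

With two workers holding gradients `[1.0, 2.0]` and `[3.0, 4.0]`, both come away with `[2.0, 3.0]`; the communication cost of doing this efficiently at thousand-GPU scale is exactly the optimization surface the bullets above describe.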

Qualifications We Seek

  • Solid foundation in software engineering and distributed systems.

  • Experience with training large models in multi-node GPU environments.

  • In-depth understanding of parallelism techniques and performance trade-offs.

  • Experience in debugging cross-layer issues within production ML systems.

  • Demonstrated ownership mentality and capability to manage critical infrastructure.

  • Proven track record in enhancing the performance or reliability of large-scale systems.

About Magic.dev

Magic is at the forefront of developing safe AGI, committed to driving progress in addressing humanity's most significant challenges. Our innovative approach integrates advanced machine learning techniques with a vision for a future where technology empowers humanity.
