
Machine Learning Engineer - Training Optimization

Featherless AI · Remote (worldwide) · Full-time




About the Role

We are seeking a dedicated Machine Learning Engineer specializing in training optimization to join our team at Featherless AI. In this role, you will play a pivotal part in enhancing and scaling large-scale model training processes. Your responsibilities will bridge the gap between research and production, focusing on optimizing training pipelines for efficiency, speed, and cost-effectiveness, while working closely with our research team to advance model architecture and capabilities.

This position offers significant impact and ownership; your contributions will directly influence our iteration speed, scalability, and the efficiency of our model deployments.

What You’ll Do

  • Enhance large-scale model training pipelines, focusing on throughput, convergence, stability, and cost.
  • Refine distributed training strategies, including data, model, and pipeline parallelism.
  • Tune optimizers, schedulers, batch sizing, and precision settings (bf16 / fp16 / fp8).
  • Minimize training duration and computational costs through profiling, bottleneck analysis, and system-level enhancements.
  • Collaborate with researchers to implement architecture-aware training methods.
  • Develop and maintain robust training infrastructure, ensuring checkpointing, fault tolerance, and reproducibility.
  • Assess and incorporate new training methodologies, such as gradient checkpointing, ZeRO, FSDP, and custom kernels.
  • Track training performance metrics and drive continuous improvement.
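To give a flavor of the optimization work described above, here is a minimal PyTorch sketch combining bf16 autocast with gradient checkpointing. The model, hyperparameters, and segment count are illustrative placeholders, not Featherless infrastructure:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy stand-in for a deep transformer stack (illustrative only).
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(8)]
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(16, 64)
target = torch.randn(16, 64)

# bf16 autocast; on GPU you would use device_type="cuda"
# (and a GradScaler if running fp16 instead of bf16).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Gradient checkpointing: split the stack into 4 segments and
    # recompute their activations during backward, trading compute
    # for activation memory.
    out = checkpoint_sequential(model, 4, x, use_reentrant=False)
    loss = nn.functional.mse_loss(out.float(), target)

loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```

In production this pattern is typically combined with data/model parallelism (e.g. FSDP or ZeRO sharding) and careful profiling to decide which layers are worth recomputing.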

What We’re Looking For

  • Extensive experience training large neural networks, particularly LLMs or similarly large models.
  • Hands-on expertise in training optimization, not just model usage.
  • A solid foundation in backpropagation, optimization algorithms, and training dynamics.
  • Knowledge of distributed systems relevant to ML training.
  • Proficiency with PyTorch is essential.
  • Comfort in working closely with hardware constraints, including GPUs, memory, and networking.
  • The ability to seamlessly transition between research concepts and production-ready implementations.

Nice to Have

  • Experience with large-scale distributed training setups, including multi-node and multi-GPU configurations.
  • Familiarity with tools like DeepSpeed, FSDP, Megatron, or bespoke training stacks.
  • Background in optimizing training processes for high-performance computing environments.

About Featherless AI

At Featherless AI, we are dedicated to advancing artificial intelligence through innovative solutions and cutting-edge technology. Our team is composed of experts who are passionate about pushing the boundaries of machine learning and AI applications. We foster a collaborative environment where creativity and technological advancement thrive.
