1 - 20 of 47,289 Jobs

Search for Machine Learning Engineer - Training Optimization


Featherless AI
Full-time|Remote|Remote (world)

About the Role
We are seeking a dedicated Machine Learning Engineer specializing in training optimization to join our team at Featherless AI. In this role, you will play a pivotal part in enhancing and scaling large-scale model training processes. Your responsibilities will bridge the gap between research and production, focusing on optimizing training pipeli…

Jan 22, 2026
Waymo LLC
Full-time|$204K/yr - $259K/yr|Hybrid|Mountain View, California, USA

Waymo is at the forefront of autonomous driving technology, dedicated to becoming the world’s most trusted driver. Originating from the Google Self-Driving Car Project in 2009, Waymo has been relentlessly focused on creating the Waymo Driver—The World’s Most Experienced Driver™—to enhance mobility access and prevent traffic-related fatalities. The Waymo Driver is the driving force behind our fully autonomous ride-hail service, optimized for various vehicle platforms and applications. With over ten million rider-only trips and more than 100 million miles driven autonomously on public roads, alongside tens of billions of simulated miles across 15+ U.S. states, Waymo is redefining transportation.

The ML Platform team at Waymo plays a critical role by offering a suite of tools that facilitate and automate the entire machine learning workflow lifecycle, including feature and experiment management, model development, optimization, and monitoring. Our initiatives have made machine learning more accessible across diverse teams at Waymo, including Perception, Planner, Research, and Simulation.

We are seeking talented engineers with expertise in machine learning software or systems to enhance compute performance both in the cloud and on vehicles. You'll engage with the entire ML stack from a systems perspective, tackling challenges such as efficient deep learning models, model compression, and improving ML software (e.g., JAX, XLA, Triton, and CUDA). This hybrid position reports directly to the Senior Manager of Runtime and Optimization.

Feb 10, 2026
Applied Intuition, Inc.
Full-time|$159.1K/yr - $199.3K/yr|On-site|Sunnyvale, California, United States

About Applied Intuition
Applied Intuition, Inc. is at the forefront of advancing physical AI technology. Established in 2017 and currently valued at $15 billion, this Silicon Valley-based company is building the essential digital infrastructure to infuse intelligence into every moving machine worldwide. We cater to industries such as automotive, defense, trucking, construction, mining, and agriculture through three primary sectors: tools and infrastructure, operating systems, and autonomy. Our solutions are trusted by 18 of the top 20 global automakers, along with the United States military and its allies, to deliver exceptional physical intelligence. Our headquarters is located in Sunnyvale, California, with additional offices across Washington, D.C.; San Diego; Ft. Walton Beach, Florida; Ann Arbor, Michigan; London; Stuttgart; Munich; Stockholm; Bangalore; Seoul; and Tokyo. Discover more at applied.co.

We are an in-office company, expecting our employees to primarily work from their Applied Intuition office five days a week. We understand the importance of flexibility and trust our employees to manage their schedules responsibly. This may include occasional remote work, starting the day with morning meetings from home before heading to the office, or leaving earlier when needed to accommodate family commitments.

About the Role
We are in search of a skilled software engineer with extensive experience in optimizing machine learning models and deploying them in production-grade embedded runtime environments. Your expertise will span the entire ML framework stack, including PyTorch, JAX, ONNX, TensorRT, CUDA, XLA, and Triton.

At Applied Intuition, You Will:
- Lead ML performance optimization across various technologies for both on-road and off-road ADAS/AD stacks aimed at deployment on a range of embedded computing platforms.
- Devise compute usage strategies to enhance efficiency and minimize latency of model inference for compute boards chosen by our customers.
- Engage in model pruning and quantization, ensuring successful deployment on memory-constrained platforms.
- Collaborate closely with ML engineers and software developers to identify and optimize efficient model architecture solutions.
- Establish methodologies to...

Feb 2, 2026
Waymo LLC
Full-time|$170K/yr - $216K/yr|Hybrid|Mountain View, California, United States; San Francisco, California, United States

Waymo is at the forefront of autonomous driving technology, dedicated to becoming the world's most trusted driver. Established in 2009 as the Google Self-Driving Car Project, Waymo has developed the Waymo Driver—The World’s Most Experienced Driver™—aimed at enhancing mobility access and saving countless lives from traffic-related accidents. Our technology empowers a fully autonomous ride-hail service and is adaptable across various vehicle platforms and product applications. With over ten million rider-only trips facilitated by the Waymo Driver and more than 100 million miles driven autonomously on public roads, we are paving the way for safer transportation.

The Perception team is responsible for developing systems that interpret the spatial-temporal representations and semantic meanings of the environment surrounding our autonomous vehicles. Our collaborative efforts with downstream teams focus on optimizing and integrating these systems within the Waymo Driver. We engage in innovative research to tackle real-world challenges and work closely with research teams at Alphabet. Our engineers have access to extensive driving data from diverse sensors, allowing us to (1) create efficient learning methods from vast real-world datasets, (2) build and train models at scale, (3) analyze real-world behaviors, and (4) optimize models for both onboard and offboard hardware.

In this hybrid role, you will report to a Technical Lead Manager.

Feb 20, 2026
DoorDash
Full-time|$137.1K/yr - $246.8K/yr|On-site|San Francisco, CA; Sunnyvale, CA

Join DoorDash as a Machine Learning Engineer, where you'll be pivotal in designing, building, and optimizing large-scale machine learning systems within our Ads Delivery funnel. Your expertise will contribute to enhancing our Ads Marketplace, ensuring a balanced and efficient ecosystem for advertisers and consumers alike. Work with advanced AI and machine learning techniques to dynamically optimize bidding, auction design, budget pacing, forecasting, and ad experimentation. This role offers a unique opportunity to influence our innovative advertising products as we expand into new verticals such as Grocery and Retail.

Feb 5, 2026
Waabi
Full-time|$141K/yr - $249K/yr|Remote|Remote US & Canada

Join Waabi, a trailblazer in Physical AI, founded by the renowned AI expert Raquel Urtasun. We are at the forefront of revolutionizing autonomous transportation, driving the advancement of commercial autonomous trucks and robotaxis. Our innovative technology is backed by global leaders across the AI, automotive, logistics, and deep tech industries.

With a rapidly expanding presence in Toronto, San Francisco, Dallas, and Pittsburgh, Waabi is on the lookout for diverse, innovative, and collaborative individuals eager to make a positive impact on the world. For more information, please visit: www.waabi.ai

Your Responsibilities:
- Collaborate intensively with autonomy and algorithm engineers to enhance safe self-driving systems utilizing an AI-first strategy.
- Develop standardized distributed training frameworks for both research and production, elevating our training systems to new heights of stability and efficiency.
- Conduct comprehensive profiling of model runtime and memory to identify performance bottlenecks effectively.
- Investigate and assess emerging technologies for integration into Waabi’s training and inference frameworks, focusing on efficient CUDA kernels for training, quantization, model exporting, and compilation for inference.

Qualifications:
- MS/PhD or Bachelor's degree with a minimum of 6 years of experience in Computer Science, Robotics, or a related technical field.
- Proficiency in programming languages such as Python, C++, or Rust.
- Experience with deep learning frameworks such as PyTorch.
- Familiarity with different stages of the development lifecycle: data processing, distributed training, and model deployment.
- Expertise in profiling CPU and GPU code using tools such as PyTorch Profiler and NVIDIA Nsight.
- A collaborative team player who is open-minded and willing to support others.
- A genuine passion for self-driving technologies, tackling complex challenges, and crafting innovative solutions.

Bonus Skills:
- Experience in model compilation and exporting, particularly with lower-level tools like TensorRT.
- Familiarity with identifying and integrating new technologies into existing frameworks.

Apr 24, 2025
Physical Intelligence
Full-time|On-site|San Francisco

Join our team as a Machine Learning Infrastructure Engineer where you will play a pivotal role in enhancing and scaling our training systems and core model code. You will be responsible for managing critical infrastructure that supports large-scale training processes, including GPU/TPU compute management and job orchestration, while developing reusable and efficient JAX training pipelines. Collaborating closely with researchers and model engineers, you'll be instrumental in translating innovative ideas into practical experiments, and from there, into production training runs. This hands-on position merges the realms of machine learning, software engineering, and scalable infrastructure to deliver impactful results.

The Team
Our ML Infrastructure team is dedicated to bolstering and accelerating core modeling efforts at Physical Intelligence by creating reliable, reproducible, and fast systems for large-scale training. We collaborate with research, data, and platform engineers to ensure seamless scaling from prototypes to production-grade training runs.

Your Responsibilities
- Infrastructure Ownership: Design, implement, and maintain systems for large-scale model training, focusing on scheduling, job management, checkpointing, and metrics/logging.
- Distributed Training Scaling: Collaborate with researchers to facilitate JAX-based training across TPU and GPU clusters with ease.
- Performance Optimization: Profile and enhance memory utilization, device usage, throughput, and distributed synchronization.
- Rapid Iteration Enablement: Develop abstractions for launching, monitoring, debugging, and reproducing experiments efficiently.
- Compute Resource Management: Ensure effective allocation and use of cloud-based GPU/TPU resources while managing costs.
- Research Collaboration: Convert research requirements into infrastructure capabilities and advocate for best practices in large-scale training.
- Core Training Code Contribution: Evolve JAX model and training code to accommodate new architectures, modalities, and evaluation metrics.
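Checkpoint management of the kind described here is usually paired with a retention policy so old checkpoints don't fill the disk. A minimal sketch in plain Python, assuming a hypothetical step_N.ckpt naming convention (not Physical Intelligence's actual layout):

```python
import os
import tempfile

def rotate_checkpoints(ckpt_dir, keep_last=3):
    """Delete all but the newest `keep_last` checkpoints, ordered by step number."""
    ckpts = sorted(
        (f for f in os.listdir(ckpt_dir) if f.startswith("step_")),
        key=lambda name: int(name.split("_")[1].split(".")[0]),
    )
    for stale in ckpts[:-keep_last]:
        os.remove(os.path.join(ckpt_dir, stale))
    return ckpts[-keep_last:]

# Demo: write five empty checkpoint files, then keep only the newest three.
with tempfile.TemporaryDirectory() as d:
    for step in (100, 200, 300, 400, 500):
        open(os.path.join(d, f"step_{step}.ckpt"), "w").close()
    kept = rotate_checkpoints(d, keep_last=3)
# kept == ["step_300.ckpt", "step_400.ckpt", "step_500.ckpt"]
```

Real systems also verify a checkpoint is fully written (e.g. via an atomic rename) before deleting its predecessors.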

Jan 23, 2026
Roblox
Full-time|$195.8K/yr - $242.1K/yr|On-site|San Mateo, CA, United States

Roblox is a vibrant platform where millions of users come together to explore, create, play, learn, and connect in immersive 3D experiences crafted by a diverse global community of developers.

At Roblox, we are dedicated to building innovative tools and a robust platform that empower our community to bring their imaginative experiences to life. Our vision is to transform how people unite, no matter where they are in the world or what device they use. We are on a mission to connect a billion individuals with optimism and civility, and we seek exceptional talent to help us achieve this goal. Joining Roblox means you will be at the forefront of shaping the future of human interaction, tackling unique technical challenges at scale, and creating safer, more respectful shared experiences for all.

Our engine's resource management and streaming systems are crucial for providing a seamless, stable, and responsive experience for Roblox users across a vast array of devices and network conditions. These systems collaboratively manage compute, memory, bandwidth, and rendering quality while delivering dynamic world content in real time as players interact with their environments. The challenges we face include highly dynamic environments, unpredictable user behaviors, and opaque signals stemming from device and OS limitations.

This position offers a unique chance to lead the integration of machine learning into real-time engine optimization. You will develop the ML framework for predictive resource allocation and content fetching, transitioning from heuristic-based logic to adaptive, data-driven decision-making. Your contributions will directly influence stability, visual quality, responsiveness, and content delivery across billions of global play sessions.

Feb 10, 2026
Waymo LLC
Full-time|On-site|Mountain View, California

Join Waymo as a Tech Lead Manager for Machine Learning Optimization, where you will spearhead innovative projects to enhance our self-driving technology. In this role, you will lead a team of talented engineers and data scientists, guiding the development of advanced algorithms and optimization techniques that drive performance and reliability. You will collaborate with cross-functional teams to ensure the successful integration of ML models into our systems, pushing the boundaries of autonomous vehicle technology.

Mar 13, 2026
Waymo LLC
Full-time|$238K/yr - $302K/yr|Hybrid|Mountain View, California, USA

At Waymo, we are pioneering the future of autonomous driving technology with an unwavering commitment to becoming the world's most trusted driver. Originating from the Google Self-Driving Car Project in 2009, our focus has always been on engineering the Waymo Driver—The World’s Most Experienced Driver™—to enhance mobility access and drastically reduce traffic-related fatalities. The Waymo Driver currently powers our fully autonomous ride-hail service, having completed over ten million rider-only trips and covering more than 100 million miles on public roads and tens of billions in simulation across 15+ U.S. states.

This role in Software Engineering is crucial as we develop the intelligent systems that allow the Waymo Driver to navigate complex environments, make sound decisions, and ensure safe transportation for our users. We tackle intricate technical challenges in robotics, perception, decision-making, and deep learning, collaborating closely with hardware and systems engineers. If you are a dedicated software engineer or researcher with a passion for Level 4 autonomous driving, we are eager to meet you.

In this hybrid role, you will report directly to a Technical Lead Manager.

Feb 10, 2026
Unity Technologies
Full-time|$278.1K/yr - $347.6K/yr|On-site|Mountain View, CA, USA

Role Overview
Unity Technologies is advancing mobile gaming with AI-driven features. The Principal Machine Learning Engineer will focus on deploying advanced AI models, such as transformers and diffusion networks, directly onto mobile devices. This position shapes how Unity brings state-of-the-art multi-modal models from research into real-world mobile applications.

What You Will Do
Technical Leadership:
- Set the vision for deploying multi-modal AI models on iOS and Android, drawing on deep experience with transformers, diffusion models, and generative architectures.
- Make key decisions on model optimization strategies, including compression, quantization, and knowledge distillation, to address mobile device constraints.
- Assess and select inference runtimes (such as CoreML, ONNX Runtime Mobile, TFLite) to improve team capabilities and deployment outcomes.
- Oversee the entire optimization pipeline, from model export through hardware-specific kernel tuning across different processing units.

Architecture & Research Translation:
- Work closely with research scientists to convert innovative model architectures into operational, mobile-optimized systems.
- Design scalable systems capable of processing varied inputs (images, text, metadata) while ensuring real-time output performance.
- Develop new approaches for dynamic resolution and token reduction tailored for mobile environments.
- Monitor and incorporate advancements in efficient AI technologies to keep Unity’s mobile AI stack current.

Team Leadership & Mentorship:
- Guide and mentor machine learning engineers, establishing best practices for on-device performance evaluation.
- Collaborate with cross-functional teams to ensure AI capabilities align with product roadmaps and device requirements.
- Promote a culture centered on performance measurement, defining and tracking key metrics for efficiency and accuracy.

Location
Mountain View, CA, USA

Apr 15, 2026
Moonlake
Full-time|On-site|San Mateo

Join Moonlake, a pioneering company harnessing AI to develop immersive world simulations.

Role Overview
Enhancing Training Efficiency
- Implement data loaders, fusion techniques, activation rematerialization, and gradient checkpointing.
- Optimize training with FSDP/ZeRO/tensor+pipeline parallelism and NCCL tuning.

Improving GPU and Kernel Performance
- Conduct Nsight profiling, develop Triton/CUDA kernels, and create fused operations.
- Implement flash-attention style accelerations, sequence packing, and KV-cache optimizations.

Optimizing Inference
- Focus on low-latency serving, continuous batching, and speculative decoding strategies.
- Apply quantization methods (GPTQ/AWQ), distillation, and pruning techniques.

Infrastructure and Reliability
- Manage SLURM/Kubernetes multi-node jobs and ensure checkpoint hygiene.
- Maintain determinism, environment pinning, and effectively handle GPU failures.

Our dedicated team thrives on collaboration in our San Mateo office.
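The quantization methods named here (GPTQ/AWQ) refine a basic primitive: mapping float weights to low-bit integers via a scale factor. A minimal symmetric int8 sketch of that primitive in plain Python (the naive baseline, not GPTQ or AWQ themselves):

```python
def quantize_int8(xs):
    """Symmetric per-tensor int8 quantization: map floats onto [-127, 127]
    using one scale derived from the largest absolute value."""
    scale = max(abs(x) for x in xs) / 127.0
    return [round(x / scale) for x in xs], scale

def dequantize(qs, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in qs]

weights = [0.5, -1.27, 0.02, 1.0]
qs, scale = quantize_int8(weights)
restored = dequantize(qs, scale)
# qs == [50, -127, 2, 100]; restored matches weights up to float rounding
```

GPTQ and AWQ improve on this baseline by choosing scales (and rounding) to minimize the error the quantized weights induce on actual layer outputs, rather than on the weights in isolation.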

Nov 25, 2025
Rhoda AI
Full-time|On-site|Palo Alto

At Rhoda AI, we are pioneering the development of a comprehensive foundation for the next generation of humanoid robots. Our focus spans high-performance, software-defined hardware to advanced foundational models and video world models that govern robot functionality. Our robots are engineered to be versatile, capable of navigating intricate, real-world environments and tackling scenarios not previously encountered in training. We stand at the crossroads of large-scale learning, robotics, and systems, bolstered by a research team comprising experts from prestigious institutions such as Stanford, Berkeley, and Harvard. Our ambition is not merely to add features; we are crafting a revolutionary computing platform for physical tasks, underpinned by over $400 million in funding, driving aggressive investments in research & development, hardware innovation, and scaling up manufacturing to bring our vision to fruition.

Role Overview
We are in search of a Principal Machine Learning Systems Engineer to take charge of our training systems' performance from start to finish. You will be instrumental in defining the scaling of our model training, enhancing efficiency, scalability, and accuracy across extensive multimodal training environments. This is a pivotal systems role, not merely focused on infrastructure support. Your contributions will significantly influence our compute utilization efficiency, scalability of models across thousands of GPUs, and the speed of research iterations.

Your Responsibilities
Oversee training performance from start to finish
- Analyze and enhance the performance of large-scale multimodal training encompassing vision, video, proprioception, actions, and language.
- Create systematic performance attributions by breaking down step time into compute, communication, and input pipeline, along with scaling curves for various cluster sizes, and identify key bottlenecks.
- Drive quantifiable improvements across:
  - Distributed efficiency (e.g., communication and compute overlap, bucketization, topology-aware mapping, and parallelism strategies).
  - Compute efficiency (e.g., identifying kernel hotspots, operator fusion, attention optimization, and minimizing framework/runtime overhead).
  - Memory efficiency (e.g., activation checkpointing, sequence packing, and reducing fragmentation).

Design training systems rather than just tuning them
- Define and refine parallelism strategies including data, tensor, pipeline, sharding, and hybrid approaches.
- Enhance execution efficiency through communication scheduling, graph capture, execution optimization, and runtime enhancements.
- Contribute to the overall system architecture with innovative solutions.
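The performance attribution described here amounts to decomposing step time into exposed components and reporting each one's share. A toy sketch of that bookkeeping, with made-up timings:

```python
def attribute_step_time(compute_s, exposed_comm_s, input_wait_s):
    """Split one training step into compute, exposed (non-overlapped)
    communication, and input-pipeline stall, as fractions of step time."""
    total = compute_s + exposed_comm_s + input_wait_s
    return {
        "compute": compute_s / total,
        "communication": exposed_comm_s / total,
        "input_pipeline": input_wait_s / total,
    }

# Made-up step: 0.8 s of compute, 0.15 s of communication that is not
# hidden behind compute, and 0.05 s spent waiting on the data loader.
shares = attribute_step_time(0.8, 0.15, 0.05)
# shares["compute"] is ~0.8, flagging compute (not comm) as the bottleneck
```

In practice the inputs come from profiler traces (e.g. Nsight or framework timelines), and the key subtlety is counting only communication that is not overlapped with compute.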

Mar 10, 2026
Generalist
Full-time|On-site|San Francisco Bay Area (San Mateo) or Boston (Somerville)

About the Role
At Generalist, we are at the forefront of training expansive robot foundation models, leveraging cutting-edge GPU hardware, primarily from Nvidia, to execute distributed training tasks and experimental research. Our operations demand exceptional storage solutions and optimized data loading processes, necessitating the full utilization of cloud infrastructure alongside custom-built solutions.

In this role, you will take charge of our inference infrastructure. Our robotic systems rely on a dedicated fleet of on-premises GPUs designed for demanding real-time computations and latency-sensitive applications within resource-constrained environments.

Your Responsibilities:
- Manage and optimize our GPU compute fleets.
- Facilitate user-friendly access to GPUs for researchers, ensuring optimal utilization.
- Enhance ML data loading, transport, and storage systems in extensively utilized distributed environments.
- Oversee the orchestration of our robot inference fleets.

You May Excel in This Position If You:
- Have experience managing large GPU fleets for large-scale, distributed training or inference.
- Possess significant expertise in using Slurm or Kubernetes for ML workload orchestration.
- Have developed high-scale ML data loaders and preparation systems.
- Understand the intricacies of ML hardware, storage, and networking systems.
- Are familiar with the Nvidia GPU ecosystem.

Feb 12, 2026
Achira
Full-time|On-site|San Francisco Office

Why Join Achira?
- Become part of an elite team comprising scientists, machine learning researchers, and engineers dedicated to transforming the predictability of the physical microcosm and revolutionizing drug discovery.
- Explore uncharted territories: we are on a mission to innovate next-generation model architectures that merge AI with chemistry.
- Engage in large-scale operations: harness massive computational resources, extensive datasets, and ambitious objectives.
- Take ownership of significant projects from inception to deployment on large-scale infrastructures.
- Thrive in a culture that values precision, speed, execution, and a proactive mindset.

About the Position
At Achira, we are committed to developing state-of-the-art foundation models that tackle the most complex challenges in simulation for drug discovery and beyond. Our atomistic foundation simulation models (FSMs) serve as world models of the physical microcosm, incorporating machine learning interaction potentials (MLIPs), neural network potentials (NNPs), and various generative models.

We are seeking a Machine Learning Research Engineer (MLRE) who excels at the intersection of advanced machine learning and rigorous research methodologies. You will collaborate closely with our research scientists to design and enhance intelligent training systems that propel us beyond contemporary architectures into a new era of ML-driven molecular modeling.

Your mission is clear yet ambitious: to establish the foundational frameworks for training atomistic simulation models at scale. This entails a deep dive into architecture, data, optimizers, losses, training metrics, and representation learning, all while constructing high-performance systems that maximize the potential of our models. In this role, you will be instrumental in creating a blueprint for pretraining FSMs similar to today’s large-scale generative AI systems, making a significant impact on drug discovery.

At Achira, you will have the chance to pioneer models that comprehend and simulate the physical world at an atomic level, achieving unprecedented speed and accuracy.

Sep 26, 2025
Pluralis Research
Full-time|On-site|San Francisco

Overview
Pluralis Research is at the forefront of Protocol Learning, innovating a decentralized approach to train and deploy AI models that democratizes access beyond just well-funded corporations. By aggregating computational resources from diverse participants, we incentivize collaboration while safeguarding against centralized control of model weights, paving the way for a truly open and cooperative environment for advanced AI.

We are seeking a talented Machine Learning Training Platform Engineer to design, develop, and scale the core infrastructure that powers our decentralized ML training platform. In this role, you will have ownership over essential systems including infrastructure orchestration, distributed computing, and service integration, facilitating ongoing experimentation and large-scale model training.

Responsibilities
- Multi-Cloud Infrastructure: Create resource management systems that provision and orchestrate computing resources across AWS, GCP, and Azure using infrastructure-as-code tools like Pulumi or Terraform. Manage dynamic scaling, state synchronization, and concurrent operations across hundreds of diverse nodes.
- Distributed Training Systems: Design fault-tolerant infrastructure for distributed machine learning, including GPU clusters, the NVIDIA runtime, S3 checkpointing, large-dataset management and streaming, health monitoring, and resilient retry strategies.
- Real-World Networking: Develop systems that simulate and manage real-world network conditions, such as bandwidth shaping, latency injection, and packet loss, while accommodating dynamic node churn and ensuring efficient data flow across workers with varying connectivity, as our training occurs on consumer nodes and non-co-located infrastructure.
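Bandwidth shaping, one of the network conditions mentioned here, is commonly modeled with a token bucket: traffic draws from a budget that refills at a fixed rate, with a capacity that bounds bursts. A minimal sketch using simulated ticks rather than wall-clock time:

```python
class TokenBucket:
    """Bandwidth shaper: refill `rate` bytes of budget per tick,
    allowing bursts of up to `capacity` bytes."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full burst allowance

    def tick(self):
        """Advance simulated time by one tick, refilling the budget."""
        self.tokens = min(self.capacity, self.tokens + self.rate)

    def try_send(self, nbytes):
        """Admit the packet if budget remains; otherwise it must wait."""
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

# Shape traffic to 100 bytes/tick with a 200-byte burst allowance.
bucket = TokenBucket(rate=100, capacity=200)
sent = [bucket.try_send(150), bucket.try_send(150)]  # burst, then blocked
bucket.tick()
sent.append(bucket.try_send(150))  # the refill admits the waiting packet
# sent == [True, False, True]
```

A real shaper would run this against wall-clock time and queue blocked packets; tools like Linux tc implement the same idea at the kernel level.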

Apr 1, 2026
Jane Street
Full-time|On-site|New York, New York, United States

Join our innovative Machine Learning team at Jane Street as a Performance Engineer, where your expertise in low-level systems programming and optimization will play a critical role in enhancing our machine learning capabilities. Machine learning is a vital component of Jane Street's global operations. Our dynamic trading environment acts as a unique, rapid-feedback platform for ML experimentation, allowing us to seamlessly integrate new concepts and methodologies.

Your primary responsibility will be to optimize the performance of our models during both the training and inference phases. We prioritize efficient large-scale training, low-latency inference in real-time systems, and high-throughput inference in research scenarios. This involves not only refining CUDA implementations but also taking a holistic approach that encompasses storage systems, networking, and host- and GPU-level considerations. We aim to ensure that our platform operates efficiently at the lowest levels: questioning whether high throughput translates into effective goodput, and analyzing the actual time taken to load vectors from the L2 cache. If you're curious and passionate about tackling complex problems, you’ll find a welcoming environment here, even if you haven't previously considered a career in finance.

Feb 5, 2026
OKX
Full-time|On-site|San Jose, California, United States

Join OKX as a Staff AI Engineer specializing in Model Post-Training and Alignment. In this pivotal role, you will lead initiatives that enhance the performance and alignment of AI models. You will work with cutting-edge technologies and collaborate with cross-functional teams to drive innovation in AI solutions.

Mar 18, 2026
Toloka AI
Contract|Remote|Remote — Wisconsin, United States

Toloka AI is hiring a Freelance Machine Learning Engineer for a remote contract role based in Wisconsin, United States. This position centers on building and improving machine learning models that directly support product development and help shape the user experience.

Responsibilities
- Create and fine-tune machine learning models for practical, real-world use
- Use data science techniques to enhance product features
- Work with other team members to solve technical challenges

Requirements
- Solid background in machine learning and data science
- Proven ability to tackle complex problems using technical approaches
- Comfortable working independently as well as collaborating with a team

Remote Work
This contract role is fully remote but requires residence in Wisconsin, United States.

Apr 27, 2026
Vectara
Full-time|Remote|US Remote

At Vectara, we are revolutionizing the deployment of Enterprise AI Agents and AI Assistants, emphasizing Accuracy, Security, and Explainability like never before. Our enterprise RAG Platform stands out by utilizing advanced models for retrieval, embedding, and reranking, alongside a meticulously optimized LLM trained for quality and cutting-edge Hallucination Mitigation techniques. Our innovative approach has garnered recognition in esteemed publications such as the New York Times and Visual Capitalist, solidifying our reputation as a leader in responsible, production-ready AI solutions. With a diverse clientele of over 100 enterprises, including prominent US military organizations, financial institutions, healthcare providers, and manufacturers, we are committed to delivering exceptional results.

Our founding team comprises seasoned professionals from Google, specializing in neural information retrieval and distributed systems. We invite you to join us in our mission to empower the world to discover meaningful insights. At Vectara, our team is built on passion and expertise, featuring top talents from companies like Google, Cloudera, Splunk, MongoDB, and Elastic.

Mar 16, 2026
