Senior Machine Learning Engineer - Distributed ML Training Platform

On-site Full-time

Experience Level

Senior

Qualifications

Key Responsibilities

Distributed Training Architecture & Optimization

Design and implement large-scale distributed training systems optimized for heterogeneous hardware in low-bandwidth, high-latency environments.
Develop model-parallel training strategies (data, tensor, pipeline parallelism) utilizing custom sharding techniques to minimize communication overhead.
Enhance GPU utilization, memory efficiency, and computational performance across distributed nodes.
Implement robust checkpointing, state synchronization, and recovery methods for long-duration, fault-prone training tasks.
Create monitoring and metrics systems to assess training progress, model quality, and system performance bottlenecks.

Decentralized Networking & Resilience

Architect resilient training systems that can withstand node failures, network partitions, and dynamic participant adjustments.
Design and optimize peer-to-peer networks for decentralized coordination across non-co-located nodes.
Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
Analyze and optimize communication patterns to lower latency and bandwidth usage in multi-participant scenarios.

About the job

Join Us as a Senior Machine Learning Engineer

Pluralis Research is at the forefront of innovative research in Protocol Learning. Our mission is to enhance the training of foundation models through collaborative, decentralized methods, allowing multiple participants to contribute without needing access to a complete model. We aim to create community-owned models with sustainable economic frameworks.

We are seeking experienced Senior and Staff Engineers with over 5 years of expertise in distributed systems and large-scale machine learning training. You will play a pivotal role in developing a groundbreaking substrate for training distributed ML models that function effectively over consumer-grade internet connections.

About Pluralis Research

At Pluralis Research, we are dedicated to pioneering advancements in machine learning through collaborative and decentralized approaches. Our innovative research on Protocol Learning is transforming how foundation models are trained and owned by communities, ensuring equitable access and sustainability in the AI ecosystem.
