Machine Learning Engineer - Decentralized ML Training Platform

On-site, Full-time

Experience Level

Mid to Senior

Qualifications

Responsibilities

Multi-Cloud Infrastructure: Create resource management systems for provisioning and orchestrating compute resources across AWS, GCP, and Azure utilizing infrastructure-as-code (Pulumi/Terraform). Manage dynamic scaling, state synchronization, and concurrent operations across numerous heterogeneous nodes.

Distributed Training Systems: Design robust infrastructures for fault-tolerant distributed machine learning, including GPU clusters, NVIDIA runtime, S3 checkpointing, large dataset management, health monitoring, and resilient retry strategies.

Real-World Networking: Develop systems that simulate and manage real-world network conditions such as bandwidth shaping, latency injection, and packet loss while optimizing data flow across workers with varying connectivity, as our training occurs on consumer nodes rather than in traditional data centers.

What You Will Bring

You should ideally possess over 5 years of experience with a strong focus on:

Infrastructure & Platform Engineering: Proven experience with infrastructure-as-code (Pulumi/Terraform/CloudFormation) managing multi-cloud deployments, lifecycle orchestration, self-healing systems, Docker/Kubernetes (EKS), GPU workloads, and heterogeneous clusters at scale.

Distributed Systems & ML Infrastructure: Comprehensive understanding of distributed training workflows, checkpointing, data sharding, model versioning, long-running job orchestration, and decentralized networking (P2P, NAT traversal, traffic shaping).

About the job

At Pluralis Research, we are at the forefront of Protocol Learning, innovating a decentralized approach to training and deploying AI models. This revolutionary method democratizes access to AI, allowing individuals to contribute to model training rather than relying solely on large corporations. By aggregating computing power from a diverse range of participants and incentivizing their contributions, we are paving the way for a truly collaborative and open AI ecosystem.

We are seeking a skilled ML Training Platform Engineer to design, develop, and enhance the foundational infrastructure that supports our decentralized machine learning training platform. You will be responsible for key systems that encompass infrastructure orchestration, distributed computing, and services integration, facilitating ongoing experimentation and large-scale model training.

About Pluralis Research

Pluralis Research is committed to revolutionizing the way AI models are trained and deployed through innovative decentralized methods. Our mission is to provide individuals with the opportunity to participate in AI development, breaking down barriers traditionally held by large corporations.
