Machine Learning Infrastructure Engineer

Flexion Robotics
Zürich, Zurich, Switzerland
On-site Full-time

Experience Level

Senior

Qualifications

Requirements

  • Minimum of 3 years of professional experience in developing and managing infrastructure for large-scale deep learning systems.
  • Hands-on experience in training or supporting the training of large models (billions of parameters) in distributed multi-node GPU environments, along with a deep understanding of foundational concepts (DDP, FSDP, NCCL).
  • In-depth experience with at least one major cloud platform (such as AWS or GCP), including compute provisioning and management.
  • Proficiency in container orchestration platforms (e.g., Kubernetes) and job scheduling tools (e.g., Slurm).
  • Solid programming skills in languages such as Python or Go.
  • Excellent problem-solving skills and the ability to work collaboratively in a fast-paced environment.

About the job

About Flexion Robotics

Flexion Robotics is building the intelligence framework for tomorrow’s humanoid robots. The company’s mission is to move from early prototypes to fully functional humanoid systems. Founded by leading scientists in robot reinforcement learning (with backgrounds at Nvidia and ETH Zürich) and backed by top international venture capital, Flexion Robotics has quickly progressed from first lines of code to deploying real humanoid capabilities.

Role Overview

The Machine Learning Infrastructure Engineer will help shape the core computing and data systems that support cognitive development for humanoid robots. This position focuses on building and maintaining the platforms needed to train large foundational models on substantial datasets. The work involves designing training clusters, architecting data pipelines to move information from simulators and robots into model training, and creating tools that enable AI engineers to train, evaluate, and iterate efficiently.

This is a senior, on-site position based in Zürich. The Infrastructure team includes engineers with experience at Google, Meta, and Amazon. The role offers broad responsibility for systems supporting data collection, training, and experimentation workflows, including infrastructure strategy, cluster orchestration, distributed training, data platforms, CI, and experimentation tools.

What You Will Do

  • Design, deploy, and maintain GPU compute clusters for large-scale model training across multiple cloud providers, including job scheduling with Slurm and Kubernetes.
  • Build data platforms and pipelines: set up storage, processing, and serving layers to manage data from simulator outputs and robot telemetry to training datasets. This includes infrastructure using object storage (S3), parallel filesystems (Lustre), and data formats such as Parquet, WebDataset, and LeRobot. Use distributed processing tools like Ray and Spark to transform and validate data at scale.
  • Work with AI engineers to optimize distributed training on multi-node GPU clusters, focusing on throughput, device utilization, and communication efficiency. Improve distributed IsaacLab-based sim-to-real training workflows.
  • Evaluate and select new platforms: assess cloud providers, GPU-as-a-Service options, and new tools, taking ownership of decisions as computing needs grow.

Location

This role is on-site at Flexion Robotics’ Zürich office.

About Flexion Robotics

Flexion Robotics is at the forefront of robotics innovation, developing advanced humanoid systems that integrate cutting-edge AI technologies. With a focus on transforming theoretical concepts into practical applications, we aim to reshape the future of robotics through our dedicated team of experts and robust technological foundation.
