About the job
Join us in creating the backbone of data infrastructure for real-world robotic operations.
As robotics transitions from research labs to real-world applications across factories, warehouses, vehicles, and field deployments, understanding robotic performance becomes critical. When robots encounter failures or unexpected behaviors, data analysis is key to diagnosing the underlying issues.
At Foxglove, we are at the forefront of building tools for observability, visualization, and data infrastructure that empower robotics and autonomous systems teams to manage, analyze, and derive insights from vast amounts of multimodal sensor data collected from operational systems and production fleets.
Role Overview
We are seeking a passionate ML Platform Engineer with strong infrastructure expertise to design, deploy, and scale our data platform systems. In this platform-centric role, you will own the infrastructure layer that powers machine learning in production, extending beyond the models themselves.
You will be responsible for the reliability, scalability, and performance of the ML platform, including inference serving, pipeline orchestration, training infrastructure, and evaluation frameworks. In a hands-on infrastructure capacity, you will tackle substantial challenges such as managing petabyte-scale multimodal robotics data and optimizing high-throughput retrieval and embedding pipelines.
Key Responsibilities
Design and operationalize production inference infrastructure, focusing on model serving, autoscaling, load balancing, and cost efficiency across cloud environments.
Own the platform architecture for embedding and retrieval pipelines that enable semantic search across multimodal robotics data (image, video, point cloud, and time series).
Develop and maintain the training and evaluation infrastructure that supports rapid iteration on model performance, including job orchestration, experiment tracking, and dataset versioning.
Lead decisions on cloud infrastructure (AWS/GCP) that affect latency, throughput, reliability, and scalability.
Establish platform abstractions and internal tools that empower product engineers to deliver ML-enhanced features without managing infrastructure directly.
Assess, integrate, and operationalize third-party ML infrastructure components while establishing clear build vs. buy frameworks for the team.

