About the job
Periodic Labs is an AI and physical sciences company based in Menlo Park. The team builds models that advance scientific discovery in materials, energy, and related fields. The company is backed by leading investors, is growing rapidly, and operates with a strong sense of ownership and a drive to push scientific boundaries.
Role overview
The Machine Learning Systems Engineer will own the systems layer that powers model training and inference. This work is closely tied to the reinforcement learning (RL) feedback loop at the heart of Periodic Labs' research process, where models propose experiments, experiments generate data, and that data improves future models. The role blends deep infrastructure work with research collaboration, focusing on both performance and integration with the scientific workflow.
What you will do
- Develop scheduling solutions for GB-series GPUs using platforms like Ray, Slurm, and Kubernetes, aiming to minimize latency and maximize resource utilization across different cluster setups.
- Create profiling tools, both online and offline, to identify and resolve bottlenecks in the training and inference stack.
- Implement direct S3 checkpoint streaming to remove I/O bottlenecks during large-scale training runs.
- Benchmark RL training configurations across model sizes, batch strategies, and hardware architectures to find optimal setups.
- Write and optimize communication and GPU kernels to increase hardware throughput.
- Design and implement zero-copy RDMA weight synchronization between training and inference systems, keeping the RL loop fast and efficient.
- Develop sandbox execution environments for rapid algorithm testing and iteration.
Key focus areas
- Scheduling, kernels, RDMA, weight synchronization, and communication primitives
- Collaboration with researchers to co-design algorithms and infrastructure
- Accelerating the RL feedback loop that drives scientific discovery at Periodic Labs
