About the job
Parallel Domain is looking for a Senior Site Reliability Engineer to help build and maintain the infrastructure behind its high-fidelity simulation platform. This technology supports the development and validation of autonomous vehicles and robotic systems in safe, virtual environments.
Role overview
This is a hands-on engineering position focused on keeping large-scale, distributed simulation workloads running reliably. The role involves close collaboration with the platform, simulation, and machine learning teams. Day-to-day work centers on managing and scaling multi-region AWS infrastructure, deploying and maintaining Kubernetes clusters, and improving the reliability and security of the deployment pipelines that engineering teams use.
Key responsibilities
- Manage and scale AWS infrastructure across multiple regions
- Deploy, monitor, and optimize Kubernetes workloads
- Enhance reliability and security of deployment systems
- Support large-scale batch simulation and distributed workloads
- Collaborate with the platform, simulation, and machine learning engineering teams
Challenges and focus areas
- Multi-region GPU scheduling
- Running Windows workloads on Kubernetes
- Scaling batch simulation infrastructure
This remote role is open to candidates based in Canada. The team values innovative thinkers who are eager to solve complex infrastructure problems and contribute directly to the evolution of autonomous system technology.
