About the job
About Pathway
Pathway is advancing artificial intelligence with a pioneering post-transformer model designed to mimic human thought processes.
Our architecture (BDH) goes beyond traditional Transformer models, giving enterprises clear insight into how the model actually works. By pairing this foundational model with the fastest data processing engine available, Pathway helps organizations move past incremental improvements toward genuinely contextualized, experience-driven intelligence. We are trusted by organizations such as NATO, La Poste, and Formula 1 teams.
Founded by complexity scientist Zuzanna Stamirowska, our leadership team includes AI visionaries such as CTO Jan Chorowski, who pioneered Attention in speech processing and collaborated with Nobel laureate Geoff Hinton at Google Brain, and CSO Adrian Kosowski, a distinguished computer scientist and quantum physicist who earned his PhD at 20.
With backing from esteemed investors and advisors, including TQ Ventures and Lukasz Kaiser, co-author of the Transformer model behind ChatGPT and a key figure at OpenAI, Pathway operates out of Palo Alto, California.
The Opportunity
We are seeking a passionate Senior ML Infrastructure / DevOps Engineer: someone who would rather optimize Linux environments, distributed systems, and GPU cluster scalability than work in notebooks. You will own the infrastructure that powers our machine learning training and inference workloads across multiple cloud platforms, from base Linux configuration through container orchestration and CI/CD pipelines.
You will be an integral part of the R&D team, focused on production infrastructure: clusters, networking, storage, observability, and automation. Your work will directly determine how quickly we can train, deploy, and iterate on models.
Why This Role is Unique
- Manage and scale GPU-intensive clusters utilized daily by the R&D team for high-scale training and rapid inference.
- Design, build, and automate our ML platform from the ground up, rather than executing predefined playbooks.
- Collaborate across multiple major cloud providers to tackle intriguing challenges in networking, scheduling, and cost/performance optimization at scale.
Your Responsibilities
- Architect, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queue management, quota management).
- Automate infrastructure provisioning and configuration using infrastructure-as-code (Terraform, CloudFormation, cluster tooling) and configuration management techniques.
- Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback.
- Implement and maintain monitoring and observability solutions to ensure maximum uptime and performance.