Key Responsibilities
Architect, deploy, and sustain scalable Kubernetes clusters tailored for AI model inference and training tasks.
Oversee and improve Slurm-based HPC environments for effective distributed training of large language models.
Create resilient APIs and orchestration frameworks for both training workflows and inference services.
Execute resource scheduling and job management across diverse computational environments.
Analyze system performance, identify bottlenecks, and implement enhancements across training and inference infrastructure.
Develop monitoring, alerting, and observability solutions tailored for ML workloads operating on Kubernetes and Slurm.
Quickly address system outages and work collaboratively with teams to ensure high uptime for critical training and inference services.
Optimize cluster utilization and establish autoscaling strategies to meet dynamic workload demands.
Qualifications
Proficient in Kubernetes administration, including custom resource definitions, operators, and cluster management.
Hands-on experience with Slurm workload management, encompassing job scheduling, resource allocation, and cluster optimization.
Demonstrated experience in deploying and managing distributed training systems at scale.
Solid understanding of container orchestration and distributed systems architecture.
Knowledge of LLM architecture and training processes, including Multi-Head Attention and distributed training strategies.
Experience in managing GPU clusters and optimizing compute resource usage.
About the job
Join the innovative team at Perplexity AI as an AI Infrastructure Engineer. We harness cutting-edge technologies, including Kubernetes, Slurm, Python, C++, and PyTorch, primarily within the AWS ecosystem. In this role, you will collaborate closely with our Inference and Research teams to design, deploy, and enhance our extensive AI training and inference clusters.
About Perplexity AI
Perplexity AI is at the forefront of artificial intelligence technology, dedicated to building scalable infrastructure solutions that empower groundbreaking research and applications in machine learning and data science.