Perplexity AI

AI Infrastructure Engineer

On-site Full-time

Experience Level

Mid to Senior

Key Responsibilities

Architect, deploy, and sustain scalable Kubernetes clusters tailored for AI model inference and training tasks.
Oversee and improve Slurm-based HPC environments for effective distributed training of large language models.
Create resilient APIs and orchestration frameworks for both training workflows and inference services.
Execute resource scheduling and job management across diverse computational environments.
Analyze system performance, identify bottlenecks, and implement enhancements across training and inference infrastructure.
Develop monitoring, alerting, and observability solutions tailored for ML workloads operating on Kubernetes and Slurm.
Quickly address system outages and work collaboratively with teams to ensure high uptime for critical training and inference services.
Optimize cluster utilization and establish autoscaling strategies to meet dynamic workload demands.

Qualifications

Proficient in Kubernetes administration, including custom resource definitions, operators, and cluster management.
Hands-on experience with Slurm workload management, encompassing job scheduling, resource allocation, and cluster optimization.
Demonstrated experience in deploying and managing distributed training systems at scale.
Solid understanding of container orchestration and distributed systems architecture.
Knowledge of LLM architecture and training processes, including Multi-Head Attention and distributed training strategies.
Experience in managing GPU clusters and optimizing compute resource usage.

About the job

Join the innovative team at Perplexity AI as an AI Infrastructure Engineer. We harness cutting-edge technologies, including Kubernetes, Slurm, Python, C++, and PyTorch, primarily within the AWS ecosystem. In this role, you will collaborate closely with our Inference and Research teams to design, deploy, and enhance our extensive AI training and inference clusters.

About Perplexity AI

Perplexity AI is at the forefront of artificial intelligence technology, dedicated to building scalable infrastructure solutions that empower groundbreaking research and applications in machine learning and data science.
