AI Infrastructure Engineer at Perplexity | San Francisco

Perplexity | San Francisco
On-site | Full-time


About the job

Join the team at Perplexity as an AI Infrastructure Engineer. In this role, you will apply your expertise in Kubernetes, Slurm, Python, C++, and PyTorch, working primarily on AWS. You will collaborate closely with our Inference and Research teams to design, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

  • Architect, deploy, and manage scalable Kubernetes clusters tailored for AI model inference and training workloads.

  • Oversee and enhance Slurm-based HPC environments for distributed training of large language models.

  • Create robust APIs and orchestration systems for training pipelines and inference services.

  • Implement effective resource scheduling and job management systems across diverse compute environments.

  • Evaluate system performance, identify bottlenecks, and implement enhancements across both training and inference infrastructures.

  • Develop monitoring, alerting, and observability solutions specifically designed for ML workloads running on Kubernetes and Slurm.

  • Quickly respond to system outages and collaborate with multiple teams to ensure high uptime for critical training runs and inference services.

  • Optimize cluster utilization and execute autoscaling strategies to meet dynamic workload demands.

Qualifications

  • Extensive experience in Kubernetes administration, including custom resource definitions, operators, and cluster management.

  • Proficient in Slurm workload management, encompassing job scheduling, resource allocation, and cluster optimization.

  • Demonstrated experience in deploying and managing distributed training systems at scale.

  • In-depth knowledge of container orchestration and the architecture of distributed systems.

  • Solid familiarity with LLM architecture and training processes, including Multi-Head Attention, Multi/Grouped-Query, and distributed training strategies.

  • Experience in managing GPU clusters and optimizing compute resource utilization.
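As a hedged illustration of why the Multi-Head vs. Grouped-Query Attention distinction matters for inference infrastructure: fewer KV heads shrink the per-request KV cache, which directly drives how many concurrent requests a GPU can serve. The model dimensions below are hypothetical, chosen only to make the arithmetic round.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """Per-batch KV-cache size: keys + values (factor of 2) across all layers,
    assuming fp16/bf16 storage (2 bytes per element)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical 32-layer model, 32 query heads, head_dim 128, 8k context:
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192, batch=1)
gqa = kv_cache_bytes(layers=32, kv_heads=8,  head_dim=128, seq_len=8192, batch=1)
print(f"MHA cache: {mha / 2**30:.1f} GiB")  # 4.0 GiB
print(f"GQA cache: {gqa / 2**30:.1f} GiB")  # 1.0 GiB with 4x fewer KV heads
```

With Multi-Head Attention every query head keeps its own KV cache; Grouped-Query Attention shares KV heads across groups of query heads, cutting cache memory proportionally.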

Required Skills

  • Advanced Kubernetes administration and YAML configuration management skills.

  • Expertise in Slurm job scheduling, resource management, and cluster configuration.

  • Proficiency in Python and C++ programming with a focus on systems and infrastructure automation.
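The skills above often combine in practice, e.g. Python tooling that generates Slurm batch scripts for multi-node training jobs. The sketch below uses standard Slurm `#SBATCH` directives (`--nodes`, `--gres`, `--ntasks-per-node`, `--time`); the job name and launch command are placeholders, not anything specific to this role.

```python
def render_sbatch(job_name: str, nodes: int, gpus_per_node: int,
                  time_limit: str = "24:00:00") -> str:
    """Render a minimal Slurm batch script for a multi-node training job.

    Directives are standard Slurm; the srun line is a placeholder for
    whatever distributed launcher a team actually uses.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        "srun python train.py",  # placeholder launch command
    ])

print(render_sbatch("llm-pretrain", nodes=4, gpus_per_node=8))
```

Templating scripts this way keeps resource requests reviewable and consistent across experiments instead of hand-editing per-job files.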

About Perplexity

Perplexity is at the forefront of AI innovation, dedicated to building cutting-edge solutions that enhance understanding and interaction with technology. Our team is passionate about leveraging advanced technologies to solve complex problems and create impactful AI applications.
