
Infrastructure Engineer for Large-Scale AI Training

Hark · San Jose
On-site · Full-time




Requirements

  • 5+ years of experience in infrastructure, systems, or platform engineering, with a minimum of 2 years in machine learning (ML) or high-performance computing (HPC) environments.
  • Proven track record in managing GPU clusters or large-scale distributed compute infrastructure.
  • Strong knowledge of Infrastructure as Code (IaC) tools and practices.
  • Experience with CI/CD pipelines and monitoring systems.
  • Excellent problem-solving skills and the ability to work collaboratively in a fast-paced environment.
  • Familiarity with machine learning workloads and their infrastructure needs is a plus.

About the job

About Hark

Hark is at the forefront of artificial intelligence, dedicated to developing advanced, personalized systems that are proactive and multimodal, capable of seamless interaction through speech, text, and vision, and equipped with persistent memory.

We aim to revolutionize the interface between humans and machines by integrating advanced intelligence with next-generation hardware. While existing AI mainly relies on outdated devices and chat interfaces, Hark is pioneering the next generation of agentic systems designed for natural interaction with people and the real world.

To achieve our vision, we are creating multimodal models and state-of-the-art AI hardware from the ground up, establishing a unified interface for a new era of intelligent systems.

About the Role

We are seeking a skilled Member of Technical Staff to lead our Infrastructure Compute team, focusing on large-scale GPU computing clusters that power our AI training and deployment workloads. This role sits at the intersection of systems engineering and machine learning infrastructure, ensuring the reliability, scalability, and efficiency of the compute platform critical to our research and engineering teams. This is a high-impact position ideal for individuals passionate about infrastructure as a product and adept in complex distributed systems environments.

Responsibilities

  • Design, implement, and maintain Infrastructure as Code (IaC) best practices to facilitate repeatable, auditable, and scalable cluster provisioning.
  • Enhance and secure CI/CD deployment pipelines for robust, efficient, and low-latency model service delivery across production environments.
  • Manage and optimize stable training infrastructure spanning more than 10,000 GPUs, focusing on job scheduling, fault tolerance, and network fabric efficiency.
  • Collaborate closely with ML researchers and engineers to identify and resolve compute bottlenecks through infrastructure enhancements.
  • Monitor system health, establish Service Level Objectives (SLOs), and lead incident response for critical training and inference workloads.
  • Drive capacity planning, implement cost-efficiency initiatives, and manage the hardware lifecycle across the GPU fleet.
  • Contribute to the development of internal tools and platform abstractions to enhance the developer experience for teams utilizing compute resources.

