About the job

Lead Server Engineer (AI / GPU Infrastructure)

Join a pioneering technology company specializing in groundbreaking 3D and imaging solutions powered by artificial intelligence and high-performance computing. Our ambition is to build a next-generation platform for training and running complex ML/AI models on scalable, GPU-based infrastructure.

In this role, you will be instrumental in designing and managing our entire compute infrastructure, from GPU clusters to distributed systems and hybrid cloud architectures. Your expertise will shape not only day-to-day operations but also system architecture, scalability, and performance.

Core Responsibilities
- Designing GPU-based compute systems and distributed architectures
- Establishing and optimizing ML/AI training environments
- Developing cluster scheduling and resource management
- Integrating hybrid cloud (AWS / Azure / GCP) and on-premise systems
- Implementing Kubernetes-based orchestration and infrastructure automation
- Creating system-level tooling and internal services (Python / Go / Rust)
- Collaborating closely with ML engineers and researchers
- Providing technical guidance and mentoring to the engineering team

Qualifications
- 8+ years of experience in distributed systems, infrastructure, or HPC environments
- Deep expertise in GPU compute (CUDA, multi-GPU systems)
- Proficiency with Kubernetes and cloud platforms (AWS / Azure / GCP)
- Experience with infrastructure automation (Terraform, Ansible)
- Strong programming background (Python, Go, Rust, or similar)
- System-level thinking and performance-optimization experience
- Leadership experience (technical lead, mentoring, design reviews)

Preferred Qualifications
- Familiarity with ML platforms or large-scale model training infrastructure
- Experience with service mesh or zero-trust architectures
- Background in research or AI-related environments