About the job
Join Our Team at ai&
ai& is a global AI technology firm addressing the surging demand for artificial intelligence solutions. Our dual mission is to build a leading AI research lab focused on localization and to provide comprehensive global infrastructure and computing services. We are developing a cohesive, state-of-the-art platform that combines next-generation data centers, diverse computing resources, and advanced model services. We believe that owning the entire technology stack is essential to building and scaling AI solutions effectively.
At ai&, we empower small, agile teams with the autonomy to take on significant challenges. We break complex problems into manageable components and solve hard issues together. We are looking for driven, mission-oriented individuals with strong personal initiative. Curiosity is the cornerstone of our team, and we want colleagues who are excited to grow alongside our technology and expanding business.
We are actively recruiting talented individuals globally, with offices in Tokyo, San Francisco, Austin, and Toronto. We are eager to connect with exceptional people wherever they are located.
As an Inference & Serving Engineer, your mission is to build a high-performance, multi-tenant serving architecture that maximizes utilization across heterogeneous hardware. You will work across state-of-the-art inference frameworks and engines, optimizing the runtime for specific workloads. Your responsibilities extend beyond Large Language Models to emerging Generative AI applications, including high-throughput video generation and multimodal systems with demanding memory and compute requirements.
Your role goes beyond deploying models at scale; you will build a robust system that unites specialized, high-performance clusters with large multi-node deployments as the company grows. A deep understanding of the 'Inference Triangle' is essential: continually tuning the stack to strike the right balance between low latency (TTFT/ITL), high throughput, and inference quality (precision/quantization). The ideal candidate is a hands-on engineer who treats the entire GPU fleet as a single, programmable compute fabric and is eager to work at every level of the stack.

