About the job
About ai&
ai& is a global AI technology company built to meet the fast-growing worldwide demand for artificial intelligence. Our vision is twofold: to be a leading AI lab focused on localization, and to be a full-scale global infrastructure and compute provider. We are building a single, optimized global platform that combines cutting-edge data centers, diverse compute resources, and sophisticated model services. We believe the best way to build and scale AI is to own the entire stack, top to bottom.
At ai&, we give small teams the freedom they need to take on hard problems. We break large problems into manageable pieces and solve them together. We are looking for highly motivated, mission-driven people with strong personal initiative. We value curiosity as the bedrock of talent, and we want people who are eager to grow alongside our advancing technology and expanding business.
We are actively recruiting talent worldwide, with offices in Tokyo, San Francisco, Austin, and Toronto. We are enthusiastic about meeting exceptional candidates wherever they are located.
Role Overview
As a Kernel Optimization Engineer, your goal is to get the most out of heterogeneous GPU hardware. You will work below the framework layer, writing, profiling, and tuning the custom CUDA and ROCm/HIP kernels that form the backbone of our inference and training stack. You will work across both NVIDIA and AMD architectures, learn their architectural differences, and optimize your code for each.
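For a flavor of what cross-vendor kernel work looks like, here is a minimal, generic sketch (illustrative only, not code from our stack): device code in this style typically compiles unchanged under both nvcc and hipcc, while the runtime header and host-side API calls (cudaMalloc vs. hipMalloc, and so on) differ.

```
// Illustrative sketch: a grid-stride SAXPY kernel, portable across CUDA and HIP.
#if defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_runtime.h>
#else
#include <cuda_runtime.h>
#endif

__global__ void saxpy(int n, float a, const float* __restrict__ x,
                      float* __restrict__ y) {
    // Grid-stride loop: each thread handles several elements, so one launch
    // configuration scales across GPUs with different SM/CU counts.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}
```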
This role is about building kernels, not deploying them. You will identify bottlenecks in the execution loop, such as memory bandwidth saturation, warp divergence, occupancy limits, and cache thrashing, and devise solutions from first principles. You will work closely with our inference and serving team to make sure your kernels translate into real performance gains; your domain is the kernel layer and everything beneath it.
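As one small, hedged example of that kind of bottleneck analysis (generic code, not from our stack): within a warp, strided loads can scatter across cache lines and waste DRAM bandwidth, while unit-stride loads coalesce into a handful of full-width transactions.

```
#include <cuda_runtime.h>

// Strided gather: thread i reads in[i * stride]. For large strides the 32
// loads issued by a warp can each touch a different cache line, so the warp
// generates many memory transactions and achieved bandwidth collapses.
__global__ void gather_strided(const float* __restrict__ in,
                               float* __restrict__ out,
                               int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride];
}

// Coalesced copy: consecutive threads read consecutive floats, so each warp's
// loads merge into a few full-width transactions and the kernel can approach
// peak memory bandwidth.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
```

Profilers such as Nsight Compute (or rocprof on AMD) make the gap between the two visible in their memory-throughput counters, which is where much of this role's day-to-day analysis happens.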
Your responsibilities will span attention mechanisms, quantization primitives, custom activation functions, fused operators, and the communication kernels that tie multi-GPU systems together. The ideal candidate thinks hardware-first, reasoning in warps, tiles, and memory hierarchies before reaching for a framework, is comfortable reading PTX and roofline charts, and is always hunting for the next optimization.
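To make "fused operators" concrete, here is a minimal, generic sketch (illustrative only, not code from our stack): fusing a bias add and a GELU activation into one kernel means each element crosses global memory once in each direction, rather than twice when the two element-wise ops run as separate kernels.

```
#include <cuda_runtime.h>

// Unfused, this would be two launches (add_bias, then gelu), with the
// intermediate tensor written to and re-read from global memory. Fused,
// there is one load of the input and one store of the output per element.
__device__ __forceinline__ float gelu_approx(float x) {
    // tanh approximation of GELU; constants are the standard published ones.
    const float c = 0.7978845608f;  // sqrt(2/pi)
    return 0.5f * x * (1.0f + tanhf(c * (x + 0.044715f * x * x * x)));
}

__global__ void fused_bias_gelu(const float* __restrict__ in,
                                const float* __restrict__ bias,
                                float* __restrict__ out,
                                int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols) {
        // Row-major layout; bias is broadcast across rows (one value per column).
        float v = in[i] + bias[i % cols];
        out[i] = gelu_approx(v);
    }
}
```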
