About the job
Join Huawei Canada as a Distinguished Engineer in AI Computing Systems.
About the Team:
The Advanced Computing and Storage Lab, part of the Vancouver Research Centre, is dedicated to pioneering adaptive computing system architectures. We tackle the complexities introduced by flexible and variable application loads to improve the stability and quality of training clusters. Our focus includes developing dynamic cluster configuration strategies and precision control systems that keep compute clusters running efficiently. Our lab is actively engaged in key industry AI applications, particularly large model training and inference, using technologies such as low-precision training, multi-modal training, and reinforcement learning. We conduct bottleneck analysis and create optimization solutions that improve training and inference performance as well as overall usability.
About the Job:
- As an industry leader in training cluster software frameworks, you will develop insight into the evolution of large model AI training frameworks. You will plan and design AI frameworks and software features for scenarios such as large model pre-training, post-training, and integrated training and inference, establishing critical capabilities for our training cluster software framework.
- Lead the team in optimizing large model training by developing key technologies such as low-precision training, parallel strategy tuning, and training resource optimization, and drive the commercial implementation of these technologies.
- Focus on our training servers, super nodes, and other products, leading the development of large model AI training frameworks, operator libraries, and acceleration libraries. Leverage system engineering and software-hardware collaboration to maximize AI cluster computing efficiency.
- Identify leading academic partners in large model training and collaborate with them, working alongside domain experts to advance our technological capabilities.

