Join Hyperconnect AI Lab
At Hyperconnect AI Lab, we are dedicated to transforming user experiences by tackling challenges that traditional technologies struggle to solve. Leveraging the power of machine learning, we innovate in areas such as video, voice, natural language, and recommendation systems. Our mission is to develop a multitude of models across diverse domains and to deliver these solutions reliably via mobile and cloud infrastructure, ultimately driving the growth of our services, including Azar.
About the ML Platform Team
The ML Platform team within the AI Lab is focused on automating and stabilizing the entire ML production process, ensuring rapid business impact from AI technologies. We aim to maximize the productivity of research and development across the organization by establishing a sustainable platform.
Our team currently manages more than 50 models in production. While addressing the complex technical challenges this scale brings, we pursue the following core responsibilities:
Building and Developing Cloud-Based MLOps Infrastructure
We develop and operate MLOps components that form an automated feedback loop (AI Flywheel): product data feeds model retraining, evaluation, and deployment, which in turn continuously improves the product. Key components include:
- A unified serving platform built on ArgoCD and NVIDIA Triton Inference Server for rapid deployment of models trained in various deep learning frameworks (TensorFlow, PyTorch).
- A workflow platform based on Argo Workflows, enabling users to easily create and execute their required workflows.
- A robust data pipeline that streamlines the processing of raw data into usable formats for training.
Additionally, we provide developer portals, SDKs, and CLI tools for managing and using these MLOps components, making it easy to build continuous learning pipelines. We also run PoCs of rapidly evolving MLOps technologies and bring the necessary improvements into production.
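The retrain-evaluate-deploy loop described above can be sketched in a few lines. This is a minimal illustration only; every function name here (collect_feedback, retrain, evaluate, flywheel_step) is a hypothetical placeholder, not part of Hyperconnect's actual platform:

```python
def collect_feedback(events):
    """Turn raw product events into labeled training examples,
    dropping events that never received a label."""
    return [(e["input"], e["label"]) for e in events if e.get("label") is not None]

def retrain(dataset):
    """Placeholder trainer: a trivial majority-class 'model'."""
    labels = [label for _, label in dataset]
    majority = max(set(labels), key=labels.count)
    return {"predict": lambda _x, m=majority: m}

def evaluate(model, holdout):
    """Accuracy of the candidate model on a held-out set."""
    correct = sum(model["predict"](x) == y for x, y in holdout)
    return correct / len(holdout)

def flywheel_step(events, holdout, current_score):
    """One loop iteration: retrain on fresh feedback and 'deploy'
    (return) the candidate only if it beats the current model."""
    candidate = retrain(collect_feedback(events))
    score = evaluate(candidate, holdout)
    return (candidate if score >= current_score else None), score
```

The point of the loop is the gate at the end: a candidate trained on new product data replaces the serving model only after automated evaluation, so the flywheel can run unattended without regressing quality.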
Establishing and Operating High-Performance GPU Clusters
To support seamless ML research and large-scale model training, we design and build an HPC (high-performance computing) GPU cluster optimized for business needs. This includes state-of-the-art resources such as A100/H100 GPUs and high-speed interconnects like InfiniBand to minimize inter-node communication bottlenecks.
We meticulously tune scheduling policies so that the research organization can share limited compute cost-effectively, partitioning resources efficiently according to workload characteristics.
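To make the scheduling idea concrete, here is a toy weighted fair-share allocator, not Hyperconnect's actual scheduler: GPUs are handed out one at a time to whichever team has the largest weighted unmet demand, so higher-priority workloads get proportionally more of a scarce cluster without starving anyone.

```python
def fair_share(total_gpus, requests, weights):
    """Allocate GPUs by weighted deficit, capped at each team's request.

    requests: {team: gpus_requested}, weights: {team: priority_weight}.
    Illustrative sketch only; real schedulers (e.g. Slurm's fair-share
    policy) also account for historical usage, preemption, and topology.
    """
    alloc = {team: 0 for team in requests}
    for _ in range(total_gpus):
        unmet = [t for t in requests if alloc[t] < requests[t]]
        if not unmet:
            break  # every request satisfied; remaining GPUs stay idle
        # The team with the largest weighted deficit wins the next GPU.
        team = max(unmet, key=lambda t: (requests[t] - alloc[t]) * weights[t])
        alloc[team] += 1
    return alloc
```

The invariants worth noting: no team ever receives more than it asked for, and when demand exceeds supply, allocations skew toward higher-weight teams in proportion to how far short of their request they are.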