About the job
ABOUT HUT 8
Hut 8 is at the forefront of technological innovation, striving to construct and manage some of the largest data centers globally for advanced computing tasks, including AI, colocation, cloud services, and Bitcoin mining. We are dedicated to offering stimulating and challenging opportunities for those eager to lead teams, tackle complex issues, and create a significant impact from the very first day. If you are a driven individual searching for a career that combines both reward and challenge, look no further!
ABOUT THE ROLE
We operate GPU clusters that are pivotal in powering the future of AI, and we are seeking a detail-oriented, hands-on Data Center Technician to ensure they maintain peak operational performance.
As an AI Data Center Technician II, you will engage directly with NVIDIA H100, H200, and B300 GPU clusters, in addition to state-of-the-art high-speed networking infrastructure. This role directly supports the AI organizations that are shaping the technologies of tomorrow, so your ability to maintain uptime is crucial. Responsibilities include diagnosing, repairing, and maintaining our systems, with a dedicated focus on GPU cluster integrity, InfiniBand, high-speed Ethernet fabric, and hardware dependability.
This position operates under general supervision and reports directly to the Director of AI Infrastructure. Key responsibilities include:
GPU Cluster Operations & Diagnostics
- Conduct advanced diagnostics, repairs, and maintenance on NVIDIA H100, H200, and B300 GPUs and multi-node GPU clusters.
- Monitor cluster health and promptly respond to hardware failures to ensure maximum uptime.
- Perform firmware flashing and BIOS-level configurations on GPU nodes.
Networking
- Implement, maintain, and troubleshoot high-speed InfiniBand and Ethernet fabric (QSFP & OSFP).
- Apply knowledge of InfiniBand topology across multi-rack GPU cluster environments.
- Assist with IP addressing, subnetting, and network diagnostics.
Systems & Linux
- Operate within Linux environments for system monitoring, diagnostics, and hardware management.
- Analyze data to detect failure patterns and drive proactive maintenance.
- Support firmware updates, driver management, and hardware configuration.

