About the job
About Us
Lightning AI, the innovative force behind PyTorch Lightning, is revolutionizing the AI landscape since 2019. We provide an all-encompassing platform designed to streamline the development, training, and deployment of AI systems, facilitating the transition from research to production effortlessly.
Following our merger with Voltage Park, a cutting-edge neocloud and AI Factory, we unite developer-centric software with cost-effective, large-scale computing solutions. Our tools are tailored for experimentation, training, and production inference, incorporating built-in security, observability, and control.
We cater to various clients, from individual researchers to startups and large enterprises, operating globally with offices in key cities including New York, San Francisco, Seattle, and London. We're proud to be backed by prestigious investors like Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.
Our Core Values
Move Fast: We prioritize speed and accuracy, breaking down complex challenges into manageable tasks.
Focus: We aim to achieve one goal at a time, working collaboratively to deliver precise features.
Balance: We believe sustained performance comes from adequate rest and recovery, ensuring a healthy work-life balance.
Craftsmanship: We strive for excellence in every detail, taking pride in our work and its impact.
Minimal: We embrace simplicity to drive innovation, eliminating unnecessary complexity and focusing on what truly matters.
Role Overview
We are on the lookout for a GPU & Compute Infrastructure Engineer to become a vital member of our Infrastructure Engineering team. In this pivotal role, you will manage image systems, diagnostics, and validation across expansive bare-metal computing infrastructure, particularly for GPU-optimized systems. You will work at the crossroads of hardware, systems, and software, developing automation, enhancing reliability, and facilitating efficient cluster setups for AI/ML and HPC workloads.
Your responsibilities will include overseeing our image pipeline, running validation environments and test clusters, and supporting GPU hardware qualification. This role is essential for maintaining the integrity of our infrastructure, ensuring consistency, performance, and reliability.
