Location: Paris

About H

H is focused on advancing superintelligence through agentic AI. The team builds AI agents that automate complex, multi-step tasks typically handled by people, aiming to expand what humans can achieve. The company values safety, responsibility, and technical ambition. Openness, continuous learning, and collaboration shape the culture, and every team member's perspective matters.

About the Inference Team

The Inference team designs and improves the infrastructure behind H-models, which power the company's agent technology. The group's main goal is to optimize hardware utilization for high throughput, low latency, and cost efficiency, so users get a seamless experience.

What You Will Do

- Design and build scalable inference pipelines that deliver low latency and keep costs under control.
- Improve model performance by optimizing memory use, throughput, and latency, using methods such as distributed computing, model compression, quantization, and caching.
- Develop custom GPU kernels for speed-critical operations such as attention mechanisms and matrix multiplications.
- Work closely with H's research teams to refine model architectures and improve inference efficiency.
- Stay current with research in the field, reviewing recent papers and techniques that improve memory use, throughput, and latency (for example, FlashAttention, PagedAttention, and continuous batching).
- Identify, prioritize, and implement the latest inference techniques.

Requirements

Technical Skills:
- Master's or PhD in Computer Science, Machine Learning, or a related field.
- Skilled in at least one of Python, Rust, or C/C++.
- Experience with GPU programming frameworks such as CUDA, OpenAI Triton, or Metal.
- Knowledge of model compression and quantization methods.

Soft Skills:
- Works well in collaborative, multidisciplinary teams.
- Strong communicator, comfortable presenting ideas.
- Eager to take on new challenges.
Apr 14, 2026