About the job
Join the innovative team at Moonlake, where we harness the power of AI to create real-time interactive content.
Mission: Improve throughput, reduce latency, and cut serving costs — shipping our models 2–10× faster and cheaper without compromising quality.
Scope of Work:
- GPU Performance: Expertise in CUDA/Triton kernels, FlashAttention family, paged attention, and CUDA Graphs.
- Serving Stack: Proficiency with TensorRT-LLM/Triton Inference Server, vLLM/TGI; continuous batching; on-GPU KV-cache reuse; speculative decoding (e.g., Medusa); and mixture-of-agents routing.
- Parallelism: Experience with FSDP/ZeRO, tensor/pipeline/expert parallelism; NCCL tuning.
- Quantization/PEFT: Familiarity with AWQ/GPTQ/FP8; LoRA/DoRA serving.
- Systems: Knowledge of Ray/Kubernetes/Argo; observability tooling (Prometheus/Grafana/OpenTelemetry); autoscaling; A/B infrastructure; canary deployments with rollback.
Tech Signals:
Ideal candidates will have previous experience at infrastructure-heavy startups such as Databricks or Roblox.
This is an on-site, in-person role; the team is based in San Mateo.

