About the job
Cerebras Systems is at the forefront of AI technology, building the world's largest AI chip, 56 times the size of a traditional GPU. Our wafer-scale architecture combines the compute power of dozens of GPUs into a single chip, simplifying the programming experience. This design delivers unparalleled training and inference speeds, allowing machine learning practitioners to run large-scale ML applications without the complexity of managing numerous GPUs or TPUs.
Our clientele includes premier model laboratories, multinational corporations, and pioneering AI-driven startups. Notably, OpenAI has recently formed a multi-year collaboration with Cerebras, aiming to harness 750 megawatts of computational scale to revolutionize key workloads through ultra-high-speed inference.
Thanks to our wafer-scale architecture, Cerebras Inference delivers the fastest generative AI inference solution globally, running more than ten times faster than GPU-based hyperscale cloud inference services. This speed transforms the user experience of AI applications, enabling real-time iteration and greater intelligence through additional agentic computation.
Responsibilities:
- Lead the design and implementation of advanced system-level debugging, validation, and observability platforms.
- Develop automated systems for collecting and analyzing numerical data and execution anomalies.
- Create visualization and analysis tools to facilitate efficient root-cause investigations.
- Build frameworks for failure classification, regression detection, and anomaly monitoring.
- Enhance compilers, runtimes, and programming interfaces to support sophisticated profiling and instrumentation.
- Improve workflows related to system bring-up, low-level debugging, and validation.
- Collaborate cross-functionally with teams in compiler, hardware, firmware, runtime, and infrastructure domains.
- Establish best practices to ensure debuggability, reliability, and operational excellence.
- Lead high-impact initiatives, support incident response, and drive long-term corrective solutions.

