Technical Staff Member - AI Training Infrastructure

Fireworks AI, San Mateo, CA
On-site Full-time $175K/yr - $220K/yr


Qualifications

Key Responsibilities:

Design and implement scalable infrastructure for extensive model training workloads.
Develop and maintain distributed training pipelines for LLMs and multimodal models.
Enhance training performance across multiple GPUs, nodes, and data centers.
Implement monitoring, logging, and debugging tools to support training operations.
Architect and maintain data storage solutions for large-scale training datasets.
Automate infrastructure provisioning, scaling, and orchestration for model training.
Collaborate with researchers to refine and optimize training methodologies.
Analyze and enhance the efficiency, scalability, and cost-effectiveness of training systems.
Troubleshoot complex performance issues in distributed training environments.

Minimum Qualifications:

Bachelor's degree in Computer Science, Computer Engineering, or a related field, or equivalent practical experience.
A minimum of 3 years of experience with distributed systems and machine learning infrastructure.
Proficient in PyTorch.
Expertise in cloud platforms such as AWS, GCP, and Azure.
Familiarity with containerization and orchestration technologies (Kubernetes, Docker).
Understanding of distributed training techniques (data parallelism, model parallelism, FSDP).

About the job

About Us:

At Fireworks AI, we are at the forefront of developing innovative generative AI infrastructure. Our platform is recognized for delivering top-tier models and the industry's fastest, most scalable inference capabilities. As an industry leader in LLM inference speed, we are pushing boundaries with groundbreaking projects, including our own function calling and multimodal models. Fireworks is a Series C startup valued at $4 billion, supported by premier investors such as Benchmark, Sequoia, Lightspeed, Index, and Evantic. Our passionate and collaborative team is composed of seasoned professionals from Meta PyTorch and Google Vertex AI.

The Role: 

We are seeking a Training Infrastructure Engineer to design, build, and optimize the infrastructure that underpins our large-scale model training operations. Your contributions will be pivotal in establishing high-performance AI training infrastructure. You'll work closely with AI researchers and engineers to develop robust training pipelines, optimize distributed training workloads, and guarantee the reliability of model development.

About Fireworks AI

Fireworks AI is revolutionizing the generative AI landscape with a strong focus on quality and performance. As a leader in the field, we are dedicated to innovation and excellence, backed by a team of industry veterans and significant funding.
