About the job
About Us
At Cohere, our mission is to harness the power of intelligence for the benefit of humanity. We specialize in training and deploying pioneering models that empower developers and enterprises to create transformative AI solutions. From content generation to advanced semantic search, we are at the forefront of AI innovation, driving its widespread adoption.
We take immense pride in our work and believe that every team member plays a crucial role in enhancing our models and delivering exceptional value to our clients. Our culture encourages hard work, agility, and a relentless focus on customer satisfaction.
Cohere is composed of leading researchers, engineers, designers, and more, all dedicated to their craft. We value diverse perspectives, knowing they are essential for creating outstanding products.
Join us in shaping the future of AI!
Why Join Our Team?
Our internal infrastructure team builds the top-tier infrastructure and tools that power the training, evaluation, and deployment of Cohere's foundation models. As part of our team, you'll collaborate closely with AI researchers to meet their cutting-edge workload requirements, with an emphasis on stability, scalability, and observability. Your role will involve building and managing Kubernetes GPU superclusters across multiple cloud environments, directly supporting the development of the industry-leading AI models that power Cohere's platform.
We are looking for software engineers at various career stages. Whether you are just starting your professional journey or are an experienced staff engineer, you will find ample opportunities for growth and impact here.
Note: All infrastructure roles require participation in a 24/7 on-call rotation, and you will be compensated for your on-call duties.
Your Responsibilities as a Software Engineer:
Design and manage Kubernetes compute superclusters across diverse cloud platforms.
Collaborate with cloud service providers to optimize infrastructure costs, performance, and reliability for AI workloads.
Partner with research teams to assess their infrastructure needs and identify improvements in stability, performance, and efficiency for novel model training techniques.
Design and implement resilient, scalable systems for training AI models, with a focus on user-friendly interfaces that empower researchers to manage their workflows independently.
