ABOUT US:
Become a member of our dynamic AI DevOps team, dedicated to constructing and managing the infrastructure, frameworks, and tools that empower AI and data engineers to design, deploy, and operate AI applications at scale.
Our goal is to streamline the complexities of executing AI workloads in production through a robust and scalable platform built on Kubernetes and cloud infrastructure. We prioritize automation, observability, developer-friendly frameworks, and cost-effective infrastructure, enabling teams to transition swiftly from experimentation to production.
In this role, you will significantly contribute to the development and enhancement of the internal platforms that fuel our AI capabilities.
THE CHALLENGE:
- Collaborate closely with AI and data engineers to design and sustain the infrastructure necessary for operating AI and data workloads in production.
- Enhance and refine frameworks, tooling, and platform capabilities to streamline the development, deployment, and operation of AI applications.
- Build and manage Kubernetes-based platforms optimized for scalable and reliable AI workloads.
- Implement GitOps deployment workflows using ArgoCD and Helm to standardize and automate application deployments.
- Enhance observability and monitoring across AI workloads using Grafana and Prometheus.
- Lead cost optimization initiatives, ensuring efficient utilization of computing resources, storage, and cluster capacity.
- Automate infrastructure provisioning and platform configuration through Infrastructure as Code practices.
- Engage with engineering teams to solicit feedback and continually enhance the AI platform and developer experience.
- Advocate for modern engineering practices such as automation, reproducibility, observability, and secure infrastructure design.
ABOUT YOU:
- You have experience working with production-grade systems and understand how applications are developed, deployed, and managed in modern environments.
- You enjoy improving developer workflows and building platforms that make applications and data workloads easier to deploy, operate, and scale.
- You have hands-on experience with Kubernetes and at least one major cloud platform, with AWS experience being a significant advantage.
- You are adept at using Infrastructure as Code tools such as Terraform or similar.
- You have experience with CI/CD or GitOps workflows, ideally with ArgoCD and Helm.
- You are familiar with monitoring and observability tools like Grafana and Prometheus.
- You are committed to continuous learning and improvement.

