About the job
At Crusoe, we are on a mission to accelerate the availability of energy and intelligence. We are building the foundational technology that empowers individuals to innovate boldly with AI while maintaining speed, scale, and sustainability.
Join us in the AI revolution with sustainable technology at Crusoe, where you will lead significant innovations, make a real impact, and collaborate with a team that is pioneering responsible and transformative cloud infrastructure.
About the Role:
We are seeking a highly proficient engineer with extensive experience in designing and managing observability platforms at scale. You will be responsible for architecting, developing, and operating Crusoe’s next-generation observability stack, which will allow engineers to gain insights into the internal state of distributed systems through metrics, logs, and traces. Your work will help ensure reliability, performance, and actionable insights across Crusoe’s global infrastructure and cloud platform.
Key Responsibilities:
Design and manage scalable observability systems (metrics, logging, tracing) in multi-datacenter Kubernetes environments.
Architect comprehensive telemetry pipelines, covering ingestion, storage, querying, and visualization.
Enhance monitoring and alerting mechanisms with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry.
Develop scalable log collection and processing pipelines utilizing Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks.
Implement distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrate them with service meshes, load balancers, and APIs.
Establish and promote the adoption of SLOs, SLIs, and error budgets across various services and teams.
Automate the provisioning and scaling of observability infrastructure using Kubernetes, Terraform, and custom tools (Go, Python).
Ensure the reliability and cost-effectiveness of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure).
Integrate security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls.
Collaborate with engineering teams to embed observability into applications, services, and infrastructure.
Mentor engineers and influence Crusoe’s observability strategy and technical roadmap.

