Senior Software Engineer - Cloud Availability Platform Engineering (Observability)

Crusoe · San Francisco, CA - US
On-site · Full-time · $166K/yr - $201K/yr

Experience Level

Senior

Qualifications

  • Proven experience in software engineering and observability platform development.

  • Strong proficiency in Kubernetes and cloud-native architectures.

  • Expertise in telemetry systems, including metrics, logging, and tracing.

  • Familiarity with tools such as Prometheus, Grafana, ELK stack, and OpenTelemetry.

  • Experience in programming with Go, Python, or similar languages.

  • Understanding of security practices in observability platforms.

  • Strong problem-solving skills and ability to work collaboratively in a team environment.

About the job

At Crusoe, we are on a mission to accelerate the availability of energy and intelligence. We are building the foundational technology that empowers individuals to innovate boldly with AI while maintaining speed, scale, and sustainability.

Join us in the AI revolution with sustainable technology at Crusoe, where you will lead significant innovations, make a real impact, and collaborate with a team that is pioneering responsible and transformative cloud infrastructure.

About the Role:
We are seeking a highly proficient engineer with extensive experience in designing and managing observability platforms at scale. You will be responsible for architecting, developing, and operating Crusoe’s next-generation observability stack, which will allow engineers to gain insights into the internal state of distributed systems through metrics, logs, and traces. Your contributions will underpin reliability, performance, and actionable insights across Crusoe’s global infrastructure and cloud platform.

Key Responsibilities:

  • Design and manage scalable observability systems (metrics, logging, tracing) in multi-datacenter Kubernetes environments.

  • Architect comprehensive telemetry pipelines, covering ingestion, storage, querying, and visualization.

  • Enhance monitoring and alerting mechanisms with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry.

  • Develop scalable log collection and processing pipelines using Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks.

  • Implement distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrate with service meshes, load balancers, and APIs.

  • Establish and promote the adoption of SLOs, SLIs, and error budgets across various services and teams.

  • Automate the provisioning and scaling of observability infrastructure using Kubernetes, Terraform, and custom tools (Go, Python).

  • Ensure the reliability and cost-effectiveness of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure).

  • Integrate security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls.

  • Collaborate with engineering teams to embed observability into applications, services, and infrastructure.

  • Mentor engineers and influence Crusoe’s observability strategy and technical roadmap.

About Crusoe

Crusoe is at the forefront of integrating sustainable technology with cloud infrastructure, driving innovation in the AI sector while prioritizing environmental responsibility. Our team is dedicated to developing solutions that enhance the capabilities of AI without compromising on sustainability.
