Site Reliability Engineer at Crusoe | Dublin, IE

Crusoe | Dublin, IE
On-site | Full-time

Qualifications

Candidates should possess experience in Site Reliability Engineering, with a comprehensive understanding of distributed systems, networking, and Linux. A strong affinity for automation, combined with excellent problem-solving skills, is essential. Experience with cloud infrastructure and a proactive mindset towards continuous improvement will be highly advantageous.

About the job

Crusoe is on a mission to revolutionize the way we access and utilize energy and intelligence. We are building the infrastructure that empowers a future where ambitious AI-driven projects can thrive without compromising on scale, speed, or sustainability.

Join us at Crusoe and be part of the AI revolution through sustainable technology. Here, you will spearhead significant innovations, create a lasting impact, and collaborate with a team committed to delivering responsible and transformative cloud infrastructure.

About This Role:

As a Site Reliability Engineer (SRE) at Crusoe, you will be integral to maintaining the reliability and performance of our cutting-edge infrastructure. Our SRE team identifies, analyzes, and mitigates issues so that we meet our Service Level Agreements (SLAs), tracking Service Level Indicators (SLIs) against defined Service Level Objectives (SLOs). By automating processes and proactively addressing potential problems, you will help ensure that our systems run seamlessly, and you will advise engineering teams on best practices for resilient coding. Your role will involve anticipating issues before they affect our customers, conducting comprehensive post-mortems, and promoting continuous improvement to uphold the highest reliability standards for Crusoe's AI platform. The ideal candidate possesses a solid foundation in SRE practices, distributed systems, networking, and Linux, along with a passion for automation and problem-solving. This is a full-time position.

What You’ll Be Working On:

  • Automation and Tool Development: Streamline routine processes and enhance Crusoe’s internal infrastructure platform, allowing software teams to operate effectively without needing in-depth knowledge of the operating system, hardware, or network.

  • Collaboration and Planning: Engage in daily stand-up meetings with the team to review projects, recent incidents, and daily priorities. Collaborate on strategies for launching new data centers or upgrading existing ones. Work closely with software engineers to ensure the adoption of resilient coding practices and review modifications prior to deployment.

  • System Monitoring and Alerting: Analyze overnight alerts and performance metrics to confirm that systems are operating as expected. Evaluate system logs and develop new tools to enhance our monitoring capabilities.

  • Incident Response and Problem Solving: Participate in incident response simulations, post-mortems, and root cause analysis sessions to extract valuable lessons from past issues.

About Crusoe

Crusoe is pioneering advancements in energy and intelligence, creating sustainable solutions that empower ambitious AI projects. Our commitment to innovation and responsibility positions us at the forefront of cloud infrastructure development.
