Operations Engineering Manager Fleet Reliability jobs in Dublin – Browse 1,063 openings on RoboApply Jobs

Operations Engineering Manager - Fleet Reliability

CoreWeave, Dublin, Ireland

On-site Full-time

Experience Level

Manager

Qualifications

Proven experience in operations management within a technology or engineering environment.
Strong leadership skills with a focus on team development and performance management.
Experience with server infrastructure, cloud technologies, and automation tools.
Excellent problem-solving skills and ability to drive process improvements.
Strong communication abilities, both verbal and written.

About the job

We take pride in being a Living Wage accredited Employer.

Your Role

The Fleet Reliability Operations Team serves as the core of CoreWeave’s capacity delivery and maintenance initiatives. This team is tasked with provisioning, updating, and managing server nodes, along with executing the processes and tools that configure and validate our server fleet. As the first responders to hardware issues in production, this team is empowered to drive automation and observability design throughout our server fleet lifecycle.

We are on the lookout for an Operations Engineering Manager to join the Fleet Reliability Operations team. This role will be pivotal in maintaining and enhancing our delivery volume as we expand our fleet tenfold. You will cultivate a robust talent pipeline, oversee onboarding and training, provide leadership in processes, and advocate for reliability and customer satisfaction. As the manager of this team, you will have the chance to:

Establish and lead a 24/7 team of process-oriented engineers focused on reliability and observability.
Facilitate the development and documentation of clear, consistent processes for provisioning, validating, and troubleshooting nodes in our server fleet.
Critically assess and champion process and automation improvements, prioritizing event-driven automated remediation.
Provide a 24/7 engineering support function for critical, time-sensitive node delivery and maintenance.
Enhance our onboarding, documentation, enablement, and performance management programs to elevate team members' growth and capabilities.
Foster a culture of accountability and performance measurement within your team.

About CoreWeave

CoreWeave is the essential cloud platform for AI, empowering innovators with cutting-edge technology and expert support. We are committed to delivering exceptional performance and driving the future of AI infrastructure.

Similar jobs

1 - 20 of 1,063 Jobs

Operations Engineering Manager - Fleet Reliability

CoreWeave

Full-time|On-site|Dublin, Ireland

CoreWeave is at the forefront of AI infrastructure, providing the essential cloud computing services tailored for innovators. Our platform equips AI pioneers with the necessary technology, tools, and expert teams to confidently build and scale their AI solutions. Trusted by top AI labs, startups, and global enterprises, CoreWeave combines unparalleled infrastructure performance with extensive technical expertise to drive breakthroughs and transform compute capabilities. Established in 2017, CoreWeave made its public debut on Nasdaq (CRWV) in March 2025. Discover more at www.coreweave.com. We take pride in being a Living Wage accredited Employer.
Your Role
The Fleet Reliability Operations Team serves as the core of CoreWeave’s capacity delivery and maintenance initiatives. This team is tasked with provisioning, updating, and managing server nodes, along with executing the processes and tools that configure and validate our server fleet. As the first responders to hardware issues in production, this team is empowered to drive automation and observability design throughout our server fleet lifecycle.
We are on the lookout for an Operations Engineering Manager to join the Fleet Reliability Operations team. This role will be pivotal in maintaining and enhancing our delivery volume as we expand our fleet tenfold. You will cultivate a robust talent pipeline, oversee onboarding and training, provide leadership in processes, and advocate for reliability and customer satisfaction. As the manager of this team, you will have the chance to:
Establish and lead a 24/7 team of process-oriented engineers focused on reliability and observability.
Facilitate the development and documentation of clear, consistent processes for provisioning, validating, and troubleshooting nodes in our server fleet.
Critically assess and champion process and automation improvements, prioritizing event-driven automated remediation.
Provide a 24/7 engineering support function for critical, time-sensitive node delivery and maintenance.
Enhance our onboarding, documentation, enablement, and performance management programs to elevate team members' growth and capabilities.
Foster a culture of accountability and performance measurement within your team.

Apr 3, 2026

SRE, Site Reliability Engineering

Klaviyo

On-site|On-site|Dublin, IE

Join Klaviyo as a Site Reliability Engineer II in Dublin, where you'll play a pivotal role in ensuring the reliability, scalability, and sustainability of our critical platforms. Our approach treats reliability as a core product feature, leveraging your engineering skills to tackle complex operational challenges. You'll collaborate with a dynamic team to enhance our infrastructure, security, and software engineering practices, ensuring our systems perform optimally at scale. Your contributions will directly influence how our engineering teams build software and how our customers engage with our platform daily.

Jan 31, 2026

Staff Software Engineer, AI Reliability Engineering

Anthropic

On-site|On-site|Dublin, IE

About Anthropic
At Anthropic, we are on a mission to develop AI systems that are not only reliable and interpretable but also steerable. Our primary goal is to ensure that AI technology is safe and advantageous for all users and society at large. Our rapidly expanding team consists of dedicated researchers, engineers, policy experts, and business leaders, all working collaboratively to create beneficial AI solutions.
Role Overview
At Anthropic, we believe in the strength of collaboration. Our AI Reliability Engineering (AIRE) team plays a crucial role in maintaining the robustness of Claude, our flagship AI, ensuring it remains reliable for everyone who relies on it. We work closely with various teams within Anthropic to enhance reliability across our essential service paths, from the SDK, through our network, API layers, serving infrastructure, and accelerators, and back again. Our hands-on approach allows us to make impactful improvements during incidents and in collaborative projects.
Reliability is an emergent quality that extends beyond individual teams. Our role involves taking a comprehensive view of the systems, offering a unique opportunity for dynamic, cross-functional engagement with the most critical aspects of our operations.

Feb 9, 2026

Site Reliability Engineer III

MongoDB, Inc.

Full-time|Hybrid|Dublin

MongoDB, Inc. supports organizations as they build and operate modern applications. The company’s flagship product, MongoDB Atlas, is a multi-cloud database platform available across AWS, Google Cloud, and Microsoft Azure in more than 115 regions. Atlas enables customers to run applications both on-premises and in the cloud. Each month, over 175,000 new developers join the MongoDB community. Companies such as Samsung and Toyota rely on MongoDB for next-generation, AI-driven applications.
Role overview
The Site Reliability Engineer III joins a team responsible for designing and maintaining the infrastructure that powers MongoDB services, with a particular focus on the Atlas platform. As customer requirements and regulations change, the SRE team works to deliver low-latency responses and address data sovereignty needs. The goal is to build complex systems that are reliable, straightforward to operate, and easy to monitor. Infrastructure-as-code and self-healing systems are core values for the team. Collaboration with other engineering groups is a regular part of the role, ensuring shared knowledge and responsibility for system health.
Location
This position is based in Dublin and follows a hybrid work model.

Apr 21, 2026

Database Reliability Engineer

Starling Bank

Full-time|Hybrid|Dublin, County Dublin, Ireland

At Starling Bank, we are on a transformative mission to redefine the banking experience. As the UK’s first digital bank, our vision centers around leveraging cutting-edge technology to deliver fast, fair, and transparent banking services that empower our customers to manage their finances effortlessly.
Our organization marries the core principles of being a fully licensed bank with the dynamic pace of a tech innovator. With a workforce of over 3,000 professionals across our offices in London, Southampton, Cardiff, and Manchester, we emphasize a culture that fosters innovation, collaboration, and ownership.
As a Database Reliability Engineer, you will be integral to our tech team, contributing to a work environment that encourages creativity and the use of advanced technologies. Your role will encompass building, optimizing, and maintaining reliable database systems that are crucial for our banking operations.
We believe in a flat organizational structure that empowers every team member to make impactful decisions. Our core values (Listen, Keep It Simple, Do The Right Thing, Own It, and Aim For Greatness) guide our mission to create a better banking experience.
Hybrid Working
Our hybrid working model encourages collaboration while allowing flexibility, requiring attendance at the office at least once a week.
Data Environment
Our Data teams work across various divisions, focusing on delivering insights that positively impact our business and customers. We invite talented data professionals at all levels to be part of our journey.

Apr 8, 2026

Staff Site Reliability Engineer

MongoDB, Inc.

Full-time|Hybrid|Dublin

The Team The Storage Layer Services (SLS) team at MongoDB is pioneering the re-architecture of our cloud storage layer, fundamentally enhancing the core of our next-generation cloud storage architecture. This innovative team is dedicated to developing high-performance, multi-tenant distributed storage services that elevate the current Atlas storage stack and facilitate the efficient execution of diverse customer workloads. As a member of this team, you will collaborate closely with engineers responsible for building these storage services. Your role will involve defining Service Level Objectives (SLOs), shaping capacity plans, and ensuring the reliability, durability, and operational safety of the storage layer that supports Atlas. You will be part of a select group of senior Site Reliability Engineers (SREs), playing a vital role in the execution of a strategic multi-year roadmap for MongoDB's cloud storage architecture. We are particularly eager to connect with candidates located in Dublin, as this role follows a hybrid working model.

Apr 10, 2026

Senior Site Reliability Engineer - Ireland

Arista Networks

Full-time|On-site|Dublin

Join Arista Networks as a Senior Site Reliability Engineer, where you will play a crucial role in ensuring the reliability, performance, and scalability of our systems. You will collaborate with cross-functional teams to implement best practices in software development and operational excellence.

Apr 1, 2026

Site Reliability Engineer (SRE/DevOps) - Engineering Productivity

Arista Networks

Full-time|On-site|Dublin

Collaboration and Innovation Await You
Join Arista Networks as a talented Site Reliability Engineer within our Engineering Productivity (EngProd) team, where you will play a crucial role in maintaining and enhancing our rapidly expanding infrastructure. We seek a versatile and adaptable professional who is eager to explore new technologies. As part of our software engineering team, you will collaborate with peers to design, build, and manage secure, scalable, and fault-tolerant tools and infrastructure in a hybrid cloud environment.
In the EngProd group, you will engage with fellow engineers to architect, scale, and operate the systems that support Arista’s product development teams. Our technology stack includes industry standards such as Ansible, Artifactory, Gerrit, Jenkins, Kubernetes, Grafana, Spinnaker, MySQL, ElasticSearch, Google Cloud, Varnish, and Perforce, alongside custom-built internal systems designed to automate CI/CD, testing, analysis, and visualization.
Your Responsibilities
Safely and incrementally build, deploy, and manage critical production systems with an emphasis on scalability, reliability, observability, performance, and security.
Enhance and monitor the developer experience across various services.
Automate processes to eliminate toil and enhance operational efficiency of production systems.
Proactively monitor and respond to alerts while setting up automated alert handling mechanisms.
Develop and maintain incident response runbooks.
Triage platform and infrastructural issues, assisting Arista software engineers and collaborating with third-party vendor support.
Document postmortems and create solutions to prevent recurring incidents.
Communicate and plan maintenance windows for production systems.
Work closely with Arista’s product development teams to identify and resolve infrastructural bottlenecks affecting their workflows.
Research and implement best practices around infrastructure and platforms to ensure secure, scalable, and fault-tolerant systems.
Analyze and understand the design and implementation details of open-source systems to improve triage and resolution processes.

Mar 12, 2026

Site Reliability Engineer at StepStone | Dublin

StepStone

Full-time|On-site|Dublin

Join StepStone as a Site Reliability Engineer and play a critical role in ensuring the stability and performance of our innovative platforms. In this position, you will collaborate with cross-functional teams to enhance system reliability, improve the scalability of our applications, and automate operations processes. Your expertise in monitoring, incident response, and cloud technologies will be invaluable as you work on enhancing our infrastructure and delivering top-notch solutions.

Apr 10, 2026

Site Reliability Engineer at airapps | Dublin

airapps

Full-time|On-site|Dublin

airapps is looking for a Site Reliability Engineer (SRE) in Dublin. This role centers on keeping services reliable, available, and performing well. Working side by side with software development teams, the SRE will help strengthen system architecture and support ongoing improvements.
Role overview
The Site Reliability Engineer focuses on supporting the stability and efficiency of airapps’ systems. The position involves regular collaboration with developers to address system challenges and refine processes.
Key responsibilities
Monitor and maintain the reliability and uptime of core services
Work with development teams to improve system design and architecture
Apply new technologies and methods to boost operational efficiency
Location
This position is based in Dublin.

Apr 28, 2026

Site Reliability Engineer at Crusoe | Dublin, IE

Crusoe

Full-time|On-site|Dublin - IE

Crusoe is on a mission to revolutionize the way we access and utilize energy and intelligence. We are building the infrastructure that empowers a future where ambitious AI-driven projects can thrive without compromising on scale, speed, or sustainability.
Join us at Crusoe and be part of the AI revolution through sustainable technology. Here, you will spearhead significant innovations, create a lasting impact, and collaborate with a team committed to delivering responsible and transformative cloud infrastructure.
About This Role:
As a Site Reliability Engineer (SRE) at Crusoe, you will be integral in maintaining the reliability and performance of our cutting-edge infrastructure. Our SRE team focuses on identifying, analyzing, and mitigating issues to uphold high Service Level Agreements (SLAs) through effective Service Level Indicators (SLIs) and Service Level Objectives (SLOs). By automating processes and proactively addressing potential problems, you will help ensure that our systems run seamlessly, advising engineering teams on best practices for resilient coding. Your role will involve anticipating issues before they affect our customers, conducting comprehensive post-mortems, and promoting continuous improvement to uphold the highest reliability standards for Crusoe's AI platform. The ideal candidate possesses a solid foundation in SRE practices, distributed systems, networking, and Linux, along with a passion for automation and problem-solving. This is a full-time position.
What You’ll Be Working On:
Automation and Tool Development: Streamline routine processes and enhance Crusoe’s internal infrastructure platform, allowing software teams to operate effectively without needing in-depth knowledge of the operating system, hardware, or network.
Collaboration and Planning: Engage in daily stand-up meetings with the team to review projects, recent incidents, and daily priorities. Collaborate on strategies for launching new data centers or upgrading existing ones. Work closely with software engineers to ensure the adoption of resilient coding practices and review modifications prior to deployment.
System Monitoring and Alerting: Analyze overnight alerts and performance metrics to guarantee optimal system operation. Evaluate system logs and develop innovative tools to enhance our monitoring capabilities.
Incident Response and Problem Solving: Participate in incident response simulations, post-mortems, and root cause analysis sessions to extract valuable lessons from past issues.

Jan 14, 2026

Senior Site Reliability Engineer at Tenable | Dublin, Ireland

Tenable, Inc.

Full-time|On-site|Ireland - Office - Dublin

About Tenable
Tenable is a global leader in Exposure Management, trusted by over 44,000 organizations to help understand and reduce cyber risk. The company supports 65% of the Fortune 500, 45% of the Global 2000, and many government agencies.
Team and Culture
Tenable’s people are at the heart of its success. Teams work together to build cybersecurity solutions and maintain a culture rooted in respect and excellence. Employees collaborate with industry experts and have the tools and support to make a measurable difference.
Role Overview: Senior Site Reliability Engineer
This Dublin-based role sits within the SRE Infrastructure Management team. The team’s mission is to keep Tenable’s cloud-centric exposure management platform reliable, scalable, and secure. The focus is on reducing manual operational work by building advanced automation, especially using AI.
What You Will Do
Design and build AI-powered agentic workflows to automate complex SRE tasks, including incident investigation and deployment reliability.
Develop evaluation frameworks, prompt engineering methods, retrieval strategies, and structured output validation to improve the accuracy and observability of agent pipelines.
Write production code, create agentic workflows, and integrate observability and infrastructure platforms.
Analyze the impact of automation efforts using real toil data.
What Sets This Role Apart
This position is not limited to operations with minor automation. Most of the work involves hands-on development: designing, coding, and deploying intelligent systems that replace manual SRE workflows. The team uses large language models, agentic architectures, and deep SRE knowledge to drive results.
Location
Office-based in Dublin, Ireland.

Apr 20, 2026

Core Operations Engineer - Join Our Innovative Team

Virtu Financial

Full-time|On-site|Dublin, Ireland

Virtu Financial is a premier financial services firm that harnesses advanced technology to provide liquidity in global markets and deliver innovative, transparent trading solutions to our clientele. As a market maker, Virtu enhances market efficiency by offering deep liquidity across a vast array of over 19,000 securities, spanning 235 venues in 36 countries worldwide. THE ROLE As part of Virtu's dynamic global team, our Site Reliability/Core Operations Engineers are crucial in managing the deployment, maintenance, and continuous improvement of a complex electronic trading system operating across numerous venues globally. This role places you at the forefront of our technology's interaction with financial markets, requiring quick decision-making and composure in high-pressure situations. As the first point of contact for all external trading connections, you will engage in a variety of functions, including counterparty support, risk management, and system optimization. Our engineers thrive in a Linux environment, tackling intricate technical challenges while collaborating with traders and exchanges to grasp the intricacies of micro-market structures. A fervent interest in both markets and technology is essential for success in this unique opportunity within a fast-paced electronic trading landscape.

Mar 6, 2026

Senior Product Manager - Star Trek Fleet Command

Scopely

Full-time|Hybrid|ES - Barcelona, Spain; GB - London, United Kingdom; IE - Dublin, Ireland

Scopely is seeking a Senior Product Manager to join our dynamic Star Trek Fleet Command team located in Dublin, Barcelona, or the UK, embracing a hybrid/remote-first work model. Since its launch in 2018, Star Trek Fleet Command has consistently ranked among the top 10 grossing Massive Multiplayer Strategy games on the market. Our commitment to innovation and player engagement has driven its ongoing evolution and success. At Scopely, we are passionate about our mission to inspire play every day, whether through our collaborative work environments or our strong connections with our player communities. As a global team of gaming enthusiasts, we are dedicated to developing, publishing, and innovating within the mobile games industry, connecting millions of players worldwide.

Feb 17, 2026

Senior QA Manager - Star Trek Fleet Command

Scopely

Full-time|On-site|ES - Barcelona, Spain; GB - United Kingdom; IE - Dublin, Ireland

Scopely is searching for a Senior QA Manager to oversee quality assurance for Star Trek Fleet Command. The position can be based in Barcelona, Dublin, or the United Kingdom.
Key responsibilities
Create and maintain testing strategies tailored to Star Trek Fleet Command
Lead, guide, and support a team of QA specialists
Collaborate with development teams to ensure high quality and strong player satisfaction
Locations
Barcelona, Spain
Dublin, Ireland
United Kingdom

Apr 27, 2026

Site Reliability Engineering Internship - Summer 2026 at Crusoe | Dublin, Ireland

Crusoe

Full-time|On-site|Dublin - IE

At Crusoe, we are on a mission to drive the future of energy and intelligence. Our innovative platform empowers individuals to harness the full potential of artificial intelligence without compromising on scalability, speed, or sustainability.
Join the forefront of the AI revolution with Crusoe's sustainable technology. Here, you'll be instrumental in pioneering transformative innovations, making a significant impact, and collaborating with a team that is redefining responsible cloud infrastructure.
About the Role:
As a Software Engineering Intern, you will be part of a dedicated team shaping the future of distributed systems technology. This 12-week, full-time internship in our Dublin office offers a unique opportunity to contribute to the development of a robust cloud infrastructure that supports groundbreaking advancements in fields such as artificial intelligence, graphics rendering, and computational biology. You won't just observe; you'll take on real responsibilities, tackle production-level challenges, and play a key role in Crusoe's vision for sustainable and ethical high-performance computing.
Throughout your internship, you will engage in impactful projects that extend beyond traditional classroom learning. Benefit from one-on-one mentorship from industry veterans and collaborate with a diverse group of engineers to construct fault-tolerant systems utilized by customers across the globe. We are looking for motivated, inquisitive, and proactive students ready to forge valuable connections and launch their careers by addressing today's most challenging computational problems.
Your Responsibilities
System Development: Design, implement, and maintain scalable, highly available, and fault-tolerant distributed systems to support demanding computational workloads.
Product Development: Innovate and create cutting-edge products and tools from inception that will be leveraged by a global user base.
Production Support: Identify, troubleshoot, and resolve complex issues in production environments to maintain platform reliability.
Feature Development: Collaborate with product owners and stakeholders to design, test, and iterate on new features that enhance platform capabilities.
Team Collaboration: Work closely with senior engineers and peers to ensure technical tasks align with broader organizational objectives.
Mentorship Opportunities: Engage in dedicated mentorship sessions to accelerate your growth and deepen your technical expertise.

Jan 29, 2026

Senior Site Reliability Engineer at Veeva | Dublin, Ireland

Veeva Systems Inc.

Full-time|Hybrid|Ireland - Dublin

Veeva Systems is a purpose-driven leader in cloud solutions for the life sciences industry, dedicated to accelerating the delivery of therapies to patients. As one of the fastest-growing SaaS companies globally, we achieved over $2 billion in revenue last year and are poised for continued growth.
Our core values (Do the Right Thing, Customer Success, Employee Success, and Speed) guide our operations. We made history in 2021 by becoming a public benefit corporation (PBC), committed to balancing the interests of our customers, employees, society, and investors.
At Veeva, we embrace flexibility through our Work Anywhere philosophy, enabling you to thrive in your preferred work environment, whether from home or in the office.
Be a part of our mission to transform the life sciences sector, making a meaningful impact on our customers, employees, and communities.
The Role
We are looking for a Senior Site Reliability Engineer to join our Vault Platform team. In this role, you will be responsible for maintaining the scalability and reliability of our enterprise applications, addressing complex challenges on a global scale. Your expertise in Java and modern open-source technologies will be critical in enhancing our production systems.
The ideal candidate will possess a wealth of experience with Java applications and the latest open-source technologies, ideally gained from enterprise software development or a rapidly growing tech environment. As a Senior SRE, you should be innately curious and proficient in problem-solving. You will also offer a unique engineering perspective, understanding how systems integrate to function effectively for hundreds of customers across North America, Europe, and Asia.

Aug 10, 2021

Team Lead, Site Reliability Engineering - Storage Layer Service

MongoDB, Inc.

Full-time|On-site|Dublin

Role Overview
MongoDB is hiring a Team Lead for Site Reliability Engineering, with a focus on the Storage Layer Service. This position is based in Dublin.
What You Will Do
Lead efforts to improve the reliability and performance of the Storage Layer Service.
Work closely with teams across the company to deliver solutions that support both user experience and operational goals.
Guide and support engineers as they address technical challenges in the storage layer.
Collaboration
This role involves regular collaboration with other engineering groups and stakeholders to identify opportunities for improvement and implement changes that make a measurable impact.

Apr 15, 2026

Major Incident Lead - Site Reliability

InterSystems

Full-time|Remote|Dublin (Remote)

Overview Join our dynamic Managed Services team as a Major Incident Lead specializing in Site Reliability. In this critical role, you will spearhead the response to significant, customer-impacting incidents across InterSystems’ managed services platforms. As the Incident Commander, you will ensure swift service restoration, maintain clear and confident communication with stakeholders, and coordinate effectively across SRE, engineering, support, cloud, and service delivery teams. Operating within a service model aligned with SRE principles, you will prioritize service reliability by leveraging service level indicators and objectives, focusing on reducing customer impact during live incidents over root cause analysis. Beyond immediate incident management, you will lead post-incident reviews to transform operational failures into actionable reliability enhancements and minimize repeat incidents. This position is vital for preserving customer trust, ensuring platform resilience, and achieving operational excellence in a 24x7, mission-critical, and highly regulated environment.

Mar 26, 2026

Lead Engineer for Atlas Clusters Fleet Signal Management

MongoDB, Inc.

Full-time|On-site|Dublin

MongoDB is looking for a Lead Engineer to join the Atlas Clusters Fleet Signal Management team in Dublin. This position focuses on developing and enhancing the core systems that power MongoDB Atlas, the company’s cloud database management platform.
Key responsibilities
Lead the creation of new features and solutions aimed at boosting Atlas performance and reliability
Collaborate with engineers, product managers, and cross-functional teams to design, implement, and maintain essential systems
Ensure a stable, high-quality experience for Atlas users
Location
This role is based in Dublin.

Apr 23, 2026
