1 - 20 of 7,100 Jobs

Search for Senior Site Reliability Engineer at mlabs | Remote

mlabs
Full-time|Remote|Remote — Germany

Join our dynamic team at mlabs as a Senior Site Reliability Engineer. In this pivotal role, you will leverage your expertise to enhance the reliability, performance, and scalability of our systems. Your contributions will play a crucial role in ensuring we deliver exceptional service to our clients. As a Senior Site Reliability Engineer, you will collaborate with various teams to design and implement robust infrastructure solutions. Your ability to troubleshoot and solve complex problems will be vital in maintaining our high availability standards.

Mar 18, 2026
Exaring AG
Full-time|Remote|Homeoffice

Exaring AG operates waipu.tv, a streaming platform serving over a million customers in Germany. The service combines Free TV, Pay TV, NewTV, and Video-on-Demand, with features like recording, restart, and timeshift. Users can watch waipu.tv on smartphones, tablets, smart TVs, FireTV, Apple TV, and the waipu.tv stick. Exaring AG handles the entire platform, from video encoding to delivering smooth streaming experiences. Role overview The Senior Site Reliability Engineer ensures waipu.tv remains stable and reliable as its audience grows. This role focuses on strengthening the infrastructure, refining existing systems, and supporting new technical solutions for the streaming service. What you will do Design, build, and deploy software to improve the stability, scalability, availability, and performance of waipu.tv. Collaborate with the team to resolve production issues and develop automated solutions to prevent future incidents. Lead the architecture and ongoing management of the central Kubernetes platform. Monitor system performance and respond to outages when necessary. Support teams developing microservices for production infrastructure, using CNCF projects such as Kubernetes, Prometheus, and OpenTelemetry. Location This is a remote role (Homeoffice).

Apr 24, 2026
Orcrist Technologies
Full-time|Remote|Remote / Berlin

Site Reliability Engineer Company Overview At Orcrist Technologies, we are pioneers in developing the Orcrist Intelligence Platform (OIP), a robust and secure Kubernetes-native system designed for flexibility across cloud, on-prem, and air-gapped environments. Our dedicated Innovation team spearheads new initiatives, working independently from delivery teams to prototype and validate comprehensive solutions before they become fully-fledged products. Role Overview As a Site Reliability Engineer, you will play a crucial role in accelerating Innovation's rapid prototyping cycles. Your responsibilities will include automating prototype environments, overseeing demo deployments, and validating platform constraints early in the development process. You will swiftly provision sandboxes, ensure prototypes are deployable, and create infrastructure handoff packages for seamless productization. Key Responsibilities Automate the provisioning and teardown of prototype environments using Kubernetes, GitOps, and Pulumi/Terraform to facilitate rapid discovery cycles. Manage reliable demo deployments that showcase new initiatives while maintaining a quick iteration pace. Early validation of air-gapped, on-premise, and cloud constraints to ensure prototypes are ready for adoption by productization teams. Implement observability for prototypes using OpenTelemetry, Prometheus, and Grafana, along with lightweight Service Level Objectives (SLOs) to support validation efforts. Create comprehensive infrastructure handoff packages, including deployment specifications, Helm charts, and runbooks to ensure a smooth transition to productization. Collaborate with the Platform team to ensure prototypes align with standards while meeting the rapid iteration demands of the Innovation team. Qualifications 4+ years of experience in Site Reliability Engineering or Platform engineering with practical knowledge of Kubernetes and GitOps. 
Proven track record in rapid environment provisioning, ephemeral deployments, and infrastructure automation. Comfortable validating deployability constraints (air-gapped, on-prem) early in the development process. Exceptional technical writing skills, capable of producing clear deployment specifications and handoff artifacts that minimize adoption friction. Eligibility to work in Germany; EU or NATO citizenship is preferred. Preferred Qualifications Proficiency in the German language (B1+), experience with BSI/ISO compliance, and familiarity with supply chain security tools (e.g., Cosign, Kyverno). Experience working in prototype or R&D environments, demo automation, or infrastructure handoff processes. What We Offer Access to a modern tech stack including Kubernetes, Argo CD, Terraform, Prometheus, Vault, Kyverno, GitOps, and OpenTelemetry. A remote-first work environment in Germany, complemented by regular meetups in Berlin and 30 days of vacation. Engagement in mission-driven projects that have a meaningful impact on public safety and defense. Provision of equipment and a home office budget, along with support for professional development.

Jan 23, 2026
Site Reliability Engineer

Orcrist Technologies

Full-time|Remote|Remote / Berlin

Site Reliability Engineer Company Overview At Orcrist Technologies, we are pioneering a next-generation data intelligence platform designed to manage petabyte-scale data with lightning-fast query responses. Our innovative solution is based on Kubernetes and is offered as both a B2B SaaS and an on-premise self-hosted option, including air-gapped deployments. We empower clients in defense, law enforcement, and enterprise sectors to translate mission-critical data into actionable insights. Your Role As a Site Reliability Engineer, you will be integral in deploying and managing our data intelligence platform within agency-controlled environments. You will construct and operate secure, highly available Kubernetes clusters, both on-premises and in hybrid architectures. In this role, you will also respond as a forward-deployed SRE during incidents and upgrades, ensuring our systems adhere to strict privacy, audit, and legal evidence standards tailored for law enforcement applications. Key Responsibilities Deploy, install, and manage Kubernetes clusters for our platform in on-prem and hybrid settings. Configure and maintain GitOps workflows, Helm/Kustomize, and artifact registries within restricted networks. Design and lead incident response initiatives for the observability stack (Prometheus, Grafana) and enforce disaster recovery protocols. Enhance system security through network segmentation, mTLS, IAM, and vulnerability remediation. Create compliance documentation, operational runbooks, and train both agency and Orcrist teams on best practices. About You 5+ years of experience in SRE/DevOps, with a focus on on-call ownership and managing production systems. Extensive hands-on experience with Kubernetes (on-prem/hybrid), GitOps (Argo CD/Flux), and infrastructure automation tools (Ansible, Terraform). Strong expertise in observability tools (Prometheus, Grafana, Loki) and complex incident response methodologies. 
Fluency in both German and English (C1+), authorized to work in Germany, with a willingness to travel (20–30%). Preferred Qualifications In-depth understanding of IT and governance frameworks within law enforcement or the public sector. Relevant certifications such as CKA/CKAD, ISO 27001 Lead Implementer, CISSP, or GDPR Practitioner. Demonstrated experience integrating with essential enterprise systems, including Identity and Access Management (SAML, LDAP), and Security Information and Event Management (SIEM) platforms. Familiarity with digital evidence workflows and contributions to judicial processes. Previous exposure to managing sensitive environments, including air-gapped systems and investigative tools for public safety.

Jan 9, 2026
Scout24 AG
Full-time|Hybrid|Berlin

Why Join Scout24? Scout24 is the proud home of ImmoScout24, Germany's premier platform for real estate. For over 25 years, we have been at the forefront of transforming the real estate market in Germany and Austria. Our mission is to create a digital ecosystem that unites homeowners, seekers, and agents, making the journey to find the perfect home a seamless experience. Your career is as vital as finding the right property; hence, #WorkingatScout24 means you will be part of a vibrant, diverse team of around 1,100 colleagues from 58 nationalities. We celebrate individuality and foster a culture of open-mindedness and authenticity, enabling true learning and personal growth. Mistakes are viewed as opportunities for growth and innovation. Together, we proactively strive for improvement and take responsibility, discussing both successes and challenges with mutual respect because we are #oneteam. If this resonates with you, we would love to welcome you on board! Even if you don't meet every requirement, we encourage you to share how you can contribute to our team. Grow with us! Welcome home! Beyond our outstanding company culture, we offer exceptional benefits that make Scout24 a fantastic workplace!

Dec 10, 2025
Senior Application Manager, Site Reliability Engineer

Robert Bosch Semiconductor Manufacturing Dresden GmbH

Full-time|On-site|Dresden

Take charge of commissioning and operating, as well as evolving, an application landscape for the semiconductor factory of the future. Define and implement operational processes and deployment strategies independently, adhering to modern principles such as DevOps and Site Reliability Engineering (SRE). Oversee change management, reliably implement requirements, assess risks, and produce comprehensive documentation. Proactively work towards achieving SLA targets for availability while managing IT incident management and disaster recovery. Work in an agile environment, participate in retrospectives, and continuously enhance systems and processes. Support the assurance of cost-effectiveness, quality, reliability, and innovation in the IT operations field.

Feb 27, 2026
Helsing
Full-time|On-site|Berlin; London; Munich

Who We Are Helsing is a pioneering defense AI company dedicated to safeguarding democracies. Our mission is to attain technological leadership, enabling open societies to make sovereign decisions and uphold their ethical standards. As a company, we recognize the profound responsibility that comes with developing and deploying powerful technologies like AI, and we are committed to addressing this responsibility with integrity. Our team consists of driven engineers, AI specialists, and customer-facing program managers who are passionate about solving the most complex and impactful challenges. We embrace a culture of openness and transparency, encouraging healthy debates about the role of technology in defense, its benefits, and its ethical implications. The Role We operate primarily in high-security, on-premise environments, and we are seeking a Site Reliability Engineer to support these critical infrastructures. In this role, you will be responsible for the design, implementation, and management of our on-premise Kubernetes infrastructure. We value engineers who exhibit a strong work ethic, prioritize effectively, and excel in teamwork. Clear communication, knowledge sharing, and collaboration are essential to advancing both our team and our mission. The Day-to-Day As a Site Reliability Engineer, you will design and build cloud-native infrastructure platforms on-premises, focusing on Kubernetes-based solutions that empower our development teams to operate services at scale. You will create robust observability frameworks using tools like Grafana, Prometheus, and distributed tracing to ensure system reliability and performance. You will architect and implement secure, multi-tenant Kubernetes clusters to support our high-security environments.

Feb 18, 2026
N26
Full-time|On-site|Berlin

N26 is looking for a Site Reliability Engineer to join the Platform Engineering team in Berlin. This role centers on maintaining and improving the reliability, performance, and scalability of core systems. Role overview: Work closely with cross-functional teams to support and enhance the platform. The focus is on building solutions that keep systems stable and responsive as the company grows. What you will do: Monitor and improve system reliability and uptime. Collaborate with other teams to address performance and scalability challenges. Contribute to solutions that strengthen the platform’s technical foundation. Location: This position is based in Berlin.

Apr 29, 2026
Veeva Systems Inc.
Full-time|Hybrid|Germany - Frankfurt

Veeva Systems is a pioneering organization focused on transforming the life sciences industry through innovative cloud solutions, enabling companies to accelerate the delivery of therapies to patients. As one of the fastest-growing SaaS enterprises, we achieved over $2 billion in revenue last fiscal year, with significant growth opportunities on the horizon. Our core values (Do the Right Thing, Customer Success, Employee Success, and Speed) define our culture. In 2021, we made history by becoming a public benefit corporation (PBC), committed to balancing the interests of our customers, employees, society, and investors. As a Work Anywhere company, we provide you with the flexibility to work from home or in the office, ensuring you thrive in your ideal work environment. Join us in transforming the life sciences landscape, making a meaningful impact on our customers, employees, and the broader community.

Aug 10, 2021
jobgether
Full-time|On-site|Germany

Role overview Jobgether seeks an Engineering Manager specializing in Site Reliability Engineering to lead a team dedicated to the development and maintenance of essential systems. This role is based in Germany and centers on ensuring the reliability and performance of the company's core infrastructure. What you will do Guide and mentor a team of site reliability engineers. Supervise the stability, scalability, and efficiency of production environments. Champion reliability and performance best practices within the team. Collaborate with other departments to align infrastructure with business objectives and sustain high system availability. Requirements Demonstrated experience managing engineering teams, particularly in site reliability or similar areas. Solid background in designing and supporting reliable, scalable systems. Strong ability to work with both technical and non-technical groups to advance business goals.

Apr 27, 2026
natuvion
Full-time|Remote|Remote job

As a Senior Site Reliability Engineer, you will take on the responsibility for the stable, secure, and scalable operation of our Kubernetes and Cloud infrastructure – hands-on, independently, and with genuine ownership. Your Responsibilities: Operate and optimize Kubernetes clusters (EKS) and AWS infrastructure. Debug complex issues (Performance, Scheduling, OOM, CrashLoops). Set up and manage self-hosted services (e.g., Istio, OpenSearch, RabbitMQ). Implement GitOps (ArgoCD/Flux) and observability (Logging, Metrics, Tracing). Define SLIs/SLOs and alerting strategies. Develop backup and disaster recovery concepts (including RTO/RPO). Analyze and enhance system architectures (scalability, security, SPOFs).

Apr 2, 2026
GetYourGuide
Full-time|On-site|Berlin

GetYourGuide connects travelers with memorable experiences in over 12,000 cities. Since 2009, the company has helped millions discover new destinations. The Berlin headquarters leads a global team, with offices in cities such as New York and Bangkok. More than 850 employees collaborate to reshape how people find and book travel adventures. The Staff Site Reliability Engineer joins the Operational Excellence team, which works to minimize disruptions, boost productivity, and build user trust. As GetYourGuide expands its AI-powered travel solutions, this role ensures engineering speed and reliability remain strong so customers enjoy seamless experiences. What you will do Collaborate with product teams to improve system reliability, performance, and trust across the platform. Incident management and reliability Reduce the number of incidents, as well as Mean Time to Detect (MTTD) and Mean Time to Recovery (MTTR). Lead post-incident reviews and turn findings into lasting improvements. Create tools and runbooks that speed up diagnosis and resolution of production issues. Foster a culture that treats incidents as learning opportunities, not blame assignments. Take part in the infrastructure on-call rotation. Observability and production confidence Advance the Datadog-based observability stack, including metrics, logs, traces, dashboards, and alerts. Help teams define meaningful Service Level Objectives (SLOs) and prevent alert fatigue. Strengthen production debugging tools so engineers can solve issues independently. Change confidence and release quality Lower change failure rates by guiding teams on effective testing and deployment practices. Learn more about GetYourGuide’s team and mission at getyourguide.careers.

Apr 27, 2026
scalablegmbh
Full-time|On-site|Berlin

Role Overview scalablegmbh is looking for a Senior Cloud Site Reliability Engineer with a focus on network systems. This position is based in Berlin. What You Will Do Maintain and improve the reliability, performance, and scalability of cloud infrastructure. Work closely with engineering teams to optimize network services and resolve technical challenges. Contribute to developing solutions that strengthen network systems. Support a culture of ongoing improvement across the organization. About You Bring expertise in cloud technologies and network systems. Enjoy solving complex problems and collaborating with others. Ready to make an impact in a growing company.

Apr 14, 2026
Full-time|On-site|Berlin

Join redcare-pharmacy as a Senior Site Reliability Engineer in Berlin. We are seeking a talented and experienced individual who can enhance our infrastructure and ensure the reliability and performance of our systems. This role will involve collaboration with development teams to build scalable systems and improve our operational practices.

Jan 29, 2026
flipapp
Full-time|Hybrid|Berlin, Berlin, Germany; Remote (Europe); Stuttgart, Baden-Württemberg, Germany

Flip develops an AI-powered employee experience platform designed for frontline workers. The company’s mission is to make internal information easily accessible for every employee, wherever they work. Flip is expanding quickly and aims to change how millions of frontline employees stay connected with their organizations. Role overview The Site Reliability Engineer (m/w/d) joins the Platform Squad to keep Flip’s infrastructure fast, resilient, and ready for growth. This role focuses on shaping reliability practices, building internal tools, and fostering a culture where engineering teams can deploy confidently at scale while maintaining high uptime. The position is well-suited for those who enjoy designing high-throughput, highly available systems and want to influence the production operations of a growing SaaS platform. Key responsibilities Enable scaling: Expand and optimize Azure cloud infrastructure and Kubernetes clusters to support Flip’s global growth, prioritizing high throughput and availability. Ensure resilience & security: Design and implement zero-downtime deployments, effective rollback mechanisms, and disaster recovery strategies to keep the platform available at all times. Create observability: Improve the LGTM stack (Loki, Grafana, Tempo, Mimir) so teams have clear insight into system health and performance. Location This position can be based in Berlin or Stuttgart, Germany, or performed remotely from anywhere in Europe.

Apr 23, 2026
scalablegmbh
Full-time|On-site|München

Role Overview: scalablegmbh is looking for a Senior Cloud Site Reliability Engineer (Network) in München. This position focuses on maintaining the reliability, availability, and performance of cloud-based network systems. The role works closely with teams across the company to design, implement, and refine infrastructure that supports a growing client base. What You Will Do: Ensure cloud network systems run reliably and meet performance targets. Collaborate with cross-functional teams to design and optimize infrastructure solutions. Guide cloud strategy decisions with technical expertise. Troubleshoot complex network issues. Apply best practices to improve network reliability and operational efficiency. Location: This role is based in München.

Apr 14, 2026
Almedia
Full-time|Remote|Berlin

Join Almedia, a pioneering company on a mission to revolutionize marketing by rewarding a community of over 60 million users for their engagement with global brands. Here, you can accelerate your career in an exciting environment aiming to become Germany's next bootstrapped unicorn, recognized as Europe's #3 fastest-growing company in 2025 (FT1000). We are seeking a passionate and skilled Site Reliability Engineer / DevOps to help us maintain the performance and reliability of our high-traffic platform.

Feb 3, 2026
Tipico Co. Ltd.
Full-time|On-site|Karlsruhe

Join Tipico as a Site Reliability Engineer and become a key player in enhancing the excitement of sports betting for our customers. You will be part of a dynamic and agile team that thrives on collaboration and innovation. Each day will present new challenges as you develop technical solutions and products that elevate our offerings. Your Responsibilities: Manage production environments by monitoring system availability and overall health. Develop software and systems to optimize platform infrastructure and applications. Enhance the reliability, quality, and speed of our software solutions. Measure and optimize system performance to stay ahead of customer needs. Provide operational support for large-scale distributed applications. Analyze metrics from operating systems and applications for performance tuning. Collaborate with development teams to enhance service delivery. Engage in system design, platform management, and capacity planning. Create sustainable systems through automation. Balance rapid feature development with reliability, adhering to service-level objectives.

Feb 23, 2026
Doctolib
Full-time|On-site|Berlin, Berlin, Germany; Paris, Paris, France

At Doctolib, we pride ourselves on fostering a dynamic engineering environment where innovation thrives. Our mission is to enhance the lives of healthcare professionals and patients alike. We are seeking a Senior Site Reliability Engineer to ensure our production systems operate seamlessly, playing a crucial role in supporting the rapid expansion of Doctolib's services. Your Responsibilities As a Senior Site Reliability Engineer within the Core Reliability & Observability team, you will be instrumental in defining the company's observability strategy and maintaining the reliability, debuggability, and scalability of our platform. This position bridges infrastructure, developer experience, and product engineering, focusing on developing and enhancing the core elements of logging, metrics, tracing, and alerting across our organization. Lead the implementation of an observability strategy across the platform, emphasizing scalable, developer-friendly logging and tracing solutions. Identify and spearhead cross-functional reliability initiatives to enhance incident detection, response, and postmortem analysis capabilities. Participate in the on-call rotation and actively work on improving our on-call experience by optimizing alerting, minimizing noise, and providing actionable telemetry. Who You Are You could be our next teammate if you possess: A minimum of 3 years of hands-on experience with large-scale production platforms. Demonstrated proficiency with cloud platforms such as AWS, Azure, or Google Cloud. A strong understanding of containerization and orchestration technologies (Docker and Kubernetes). A deep knowledge of Helm for managing Kubernetes manifests and ArgoCD for GitOps workflows. Extensive expertise in observability tooling and architecture, including: Logging: Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Logstash, Vector. Tracing: OpenTelemetry or proprietary APMs. Metrics: Prometheus, Thanos, Datadog, or equivalent. 
Proficiency in at least one programming language (e.g., Ruby, Python, Go, Java) and a strong grasp of infrastructure as code principles. Experience with monitoring and observability tools.

Mar 19, 2026
ClickHouse
Full-time|Remote|Germany (remote)

About ClickHouse Ranked among the most innovative and rapidly growing private cloud companies, ClickHouse is proud to be featured on the 2025 Forbes Cloud 100 list. With a robust clientele exceeding 3,000 and an impressive annual recurring revenue (ARR) growth of over 250% year on year, ClickHouse is a leader in real-time analytics, data warehousing, observability, and AI workloads. Our recent $400 million Series D funding round has further validated our continuous momentum. In just the past three months, notable clients such as Capital One, Lovable, Decagon, Polymarket, and Airwallex have either adopted or expanded their use of our platform, joining esteemed brands like Meta, Cursor, Sony, and Tesla. Join us on our mission to revolutionize the way companies leverage data! Note: This position can be based remotely in the Netherlands, UK, or Germany. At ClickHouse, we are dedicated to providing our customers with reliable and secure services. To further this commitment, we are expanding our Site Reliability Engineering team within ClickHouse Core. As one of the pioneering members of our Reliability Engineering Team, you will play a crucial role in developing and enhancing processes that ensure the reliability, availability, scalability, and performance of ClickHouse. You will work collaboratively with various teams (such as Control Plane, Dataplane, Security, Support, and Operations) to guide them in deploying ClickHouse optimally for our customers. Additionally, you will manage engineering escalation processes, lead investigations, conduct blameless post-mortem analyses, and drive continuous improvements in how ClickHouse operates and optimizes in the cloud.
This role presents a unique opportunity to make a meaningful impact on our elastic, limitless scale, high-performance ClickHouse in ClickHouse Cloud. What will you do? Continuously enhance the reliability and performance of ClickHouse core. Develop and refine metrics and alerts to proactively identify and prevent production issues before they impact customers. Investigate common customer issues to uncover root causes, submit bug fixes, report issues, and propose enhancements. Enhance incident response processes and conduct post-mortem analyses for outages, collaborating with support and Cloud teams to communicate effectively with affected customers. Plan and implement Chaos initiatives across Engineering teams based on internal priorities. Manage on-call processes to ensure swift and effective incident handling.

Apr 2, 2026
