About the job
About ClickHouse
Ranked among the most innovative and rapidly growing private cloud companies, ClickHouse is proud to be featured on the 2025 Forbes Cloud 100 list. With more than 3,000 customers and annual recurring revenue (ARR) growth of over 250% year over year, ClickHouse is a leader in real-time analytics, data warehousing, observability, and AI workloads.
Our recent $400 million Series D funding round further validates that momentum. In just the past three months, notable clients such as Capital One, Lovable, Decagon, Polymarket, and Airwallex have either adopted or expanded their use of our platform, joining established brands like Meta, Cursor, Sony, and Tesla.
Join us on our mission to revolutionize the way companies leverage data!
Note: This position can be based remotely in the Netherlands, UK, or Germany.
At ClickHouse, we are dedicated to providing our customers with reliable and secure services. To further this commitment, we are expanding our Site Reliability Engineering team within ClickHouse Core. As one of the first members of our Reliability Engineering Team, you will play a crucial role in developing and improving the processes that ensure the reliability, availability, scalability, and performance of ClickHouse. You will work with teams such as Control Plane, Dataplane, Security, Support, and Operations to guide them in deploying ClickHouse optimally for our customers. You will also manage engineering escalation processes, lead investigations, conduct blameless post-mortem analyses, and drive continuous improvements in how ClickHouse runs and is optimized in the cloud. This role is a unique opportunity to make a meaningful impact on the elastic, high-performance ClickHouse that runs at limitless scale in ClickHouse Cloud.
What will you do?
- Continuously enhance the reliability and performance of ClickHouse core.
- Develop and refine metrics and alerts to proactively identify and prevent production issues before they impact customers.
- Investigate common customer issues to uncover root causes, submit bug fixes, report issues, and propose enhancements.
- Improve incident response processes and conduct post-mortem analyses for outages, collaborating with the Support and Cloud teams to communicate effectively with affected customers.
- Plan and implement chaos engineering initiatives across Engineering teams based on internal priorities.
- Manage on-call processes to ensure swift and effective incident handling.

