About the job
At Braze, we pride ourselves on our exceptional team dynamics. Our employees are not only highly skilled but also approachable and genuinely kind, which fosters a positive work environment.
We channel this into our work by setting high standards, encouraging collaboration, and promoting a healthy work-life balance. As we navigate our rapid global expansion, we advocate for equity and opportunity both within and outside our organization.
To succeed in this environment, you should be ready to set ambitious goals for yourself and your colleagues. We believe in taking initiative, embracing responsibility, and welcoming diverse viewpoints, all of which are vital to our ongoing success.
Our insatiable curiosity and willingness to share our unique passions contribute to a vibrant company culture that thrives on balance.
If you’re motivated to tackle exhilarating challenges and possess a proactive mindset in times of change, you’ll find the support to make a meaningful impact here, backed by a dedicated and passionate team. If Braze resonates with your aspirations, we look forward to meeting you.
WHAT YOU'LL DO
We are seeking a Senior Site Reliability Engineer for our Currents team, which develops, maintains, and enhances Currents, our scalable data export system. This Kafka-based event pipeline processes tens of billions of messages daily, enabling our clients to analyze user behavior in near real time.
You will play a vital role on a collaborative and skilled team, guiding projects from inception to production while optimizing our existing high-scale systems. Your expertise and teamwork will be crucial in addressing the significant engineering challenges associated with managing a critical data streaming system. As a Senior Site Reliability Engineer, you will primarily focus on observability, scalability, and reliability strategies for every project.
Key responsibilities include:
- Troubleshooting and resolving live performance and reliability issues while implementing strategies to prevent recurrence.
- Writing and reviewing code and fostering a culture of reliability.
- Implementing sustainable incident response practices and conducting blameless postmortems.
- Establishing and promoting standards for monitoring, reliability, and performance.
- Facilitating collaboration between infrastructure and platform engineering teams.
- Enhancing services by planning for scalability and reliability.
- Mentoring junior engineers in SRE best practices and agile project management.

