About the job
At Braze, we pride ourselves on our approachable and passionate team. We foster an environment where high standards, teamwork, and work-life balance are paramount as we navigate rapid global growth while striving for equity and opportunity both within and outside our organization.
To thrive at Braze, you must hold yourself and your colleagues to high standards. Autonomy, accountability, and openness to new perspectives are crucial for our continued success.
Our culture is vibrant, driven by a deep curiosity to learn and a willingness to share diverse passions. If you enjoy tackling exciting challenges and are proactive in the face of change, you will have the opportunity to make a significant impact with our dedicated team behind you. If this resonates with you, we look forward to meeting you!
WHAT YOU'LL DO
As a Senior Site Reliability Engineer (SRE), your primary responsibility will be ensuring the seamless operation of all internal-facing services and platforms, essentially maintaining site uptime. SREs are a unique blend of system administrators and software engineers who apply sound engineering principles and operational discipline, along with sophisticated automation, to our environments and infrastructure services.
You will play a vital role in enhancing automation, improving infrastructure reliability, and empowering Braze’s engineering teams to effectively leverage our infrastructure products and platforms. Braze operates at an impressive scale: over 3.3 billion monthly active users, hundreds of billions of data points processed each month, and billions of messages sent daily. Our technology stack includes Ruby on Rails, MongoDB, Redis, Kafka, Kubernetes, and more. In this role, you will collaborate with both your team and consumer engineering teams to continuously enhance our infrastructure, automation, and tooling for internal products.
Main responsibilities include:
- Collaborating with Braze’s engineering teams to:
  - Architect products that effectively utilize infrastructure platforms in a scalable and reliable manner.
  - Debug reliability and scalability issues across all stack layers, including the products built using our infrastructure.
  - Enhance monitoring, and...

