About the job
Rithum™ stands as the premier global commerce network, enhancing the collaboration between brands, suppliers, and retailers to create unparalleled e-commerce experiences. Our innovative platform equips brands and retailers with the tools to accelerate growth, streamline operations, expand product offerings, and boost profit margins.
More than 40,000 companies, representing over $50 billion in annual GMV, place their trust in Rithum to drive their business across diverse channels. Our comprehensive commerce, marketing, and delivery solutions empower clients to craft optimized consumer shopping journeys from start to finish.
Overview
The Database Reliability Engineering (DBRE) team at Rithum is dedicated to ensuring the availability, reliability, and observability of our extensive database systems. We emphasize automation to minimize manual tasks, and we continuously look for ways to improve our processes. We manage a large-scale SQL Server environment of hundreds of instances across hybrid infrastructure (on-prem VMware and AWS), along with other relational and NoSQL database platforms such as MongoDB, DynamoDB, Elasticsearch, MySQL, Postgres, and Redis. These systems are integral to all business operations. The DBRE team fosters a culture of curiosity, transparency, collaboration, and lifelong learning.
As a Senior Database Reliability Engineer, you will embody and promote these values among your peers. Your role will involve managing diverse database systems while leading your own technically focused projects.
Responsibilities
- Ensure the highest levels of availability and reliability for mission-critical database systems across hybrid infrastructures.
- Design, implement, and maintain SQL Server Always On availability groups, clustering, and replication topologies while continually enhancing observability across all database systems.
- Lead significant database upgrade initiatives and modernization projects, and help fellow engineers and teams use database systems effectively.
- Enhance observability through telemetry, performance analysis, and proactive monitoring techniques.
- Drive process improvements via automation, implementing operational workflows using PowerShell, Python, and CI/CD tools.
- Safeguard and secure all data effectively.
- Participate in our on-call rotation.
- Troubleshoot and optimize high-load production systems, addressing complex performance and replication challenges.
- Lead technical responses during high-severity incidents.

