About the job
About the Team You'll Join
- The Server Developer (SRE) is an integral part of the server platform team.
- Toss typically operates with small, feature-based silo teams comprising planners, designers, and developers. The server platform team is responsible for creating common functionalities and frameworks needed across these silo teams.
- This year, the server platform team aims to enable quicker and smoother launches of securities services while ensuring their stable operation, with plans to expand to many more services in the future.
- Unlike silo teams, the server platform team has no separate planners. Engineers on the team act as both planners and developers: they work out which features are needed, gather feedback, and set the direction for development.
Responsibilities You Will Have
Establishing preventive measures and responding to incidents for stable services.
- Design and operate incident response processes that enable rapid response and minimize service impact.
- Perform root cause analysis (RCA) of incidents, improve detection both before and after incidents occur, and build systems to shorten recovery time and prevent recurrence.
Ensuring service visibility and availability.
- Build visibility across infrastructure, networks, and Kubernetes environments, and link it to actual service metrics.
- Establish SLOs that clearly define what counts as a problem, and continuously improve the alerting system with the metrics it needs.
- Proactively identify potential component bottlenecks ahead of traffic increases and improve the architecture.
Conducting in-depth analysis of issues and identifying root causes.
- When service disruptions occur, go beyond logs and use eBPF along with memory, network, and kernel-level analysis to find the root cause.
- Precisely analyze internal application behavior to provide optimal resource configuration guidelines to developers.
Automating operations and developing internal tools.
- Automate repetitive analysis tasks, identify what hinders operational efficiency, and develop tools to address it.
- Automate test environments to improve service reliability and make them easy for colleagues to use.

