Server Developer (Site Reliability Engineer) at Toss

Toss, Seoul
On-site, Full-time

Qualifications

Who We Are Looking For

  • Experience in transitioning to scalable architectures to handle growing services or improving structures to manage large-scale traffic reliably.
  • A deep understanding of Java, Kotlin, and Spring Boot frameworks, with the ability to analyze and optimize code from a performance perspective.
  • Persistence in logically approaching issues arising in complex distributed environments and tracing root causes to completion.
  • Solid foundation in Linux OS, kernel, and network protocols (TCP/IP), enabling low-level analysis.
  • Experience in building testing platforms or environments that facilitate automated testing for colleagues rather than just performing tests.
  • Able to recognize inefficient repetitive tasks as problems and systematize solutions through code.

About the job

About the Team You'll Join

  • The Server Developer (SRE) is an integral part of the server platform team.
  • Toss typically operates with small, feature-based silo teams comprising planners, designers, and developers. The server platform team is responsible for creating common functionalities and frameworks needed across these silo teams.
  • This year, the server platform team aims to make launches of securities services quicker and smoother while keeping them running stably, with plans to extend this to many more services in the future.
  • Unlike silo teams, the server platform team has no separate planners. Engineers on the team act as both planners and developers: they decide which features are needed, gather feedback, and set the direction of development.

 

 

Responsibilities You Will Have

Establishing preventive measures and responding to incidents for stable services.

  • Design and operate incident response processes that enable rapid response and minimize service impact.
  • Perform root cause analysis (RCA) of incidents, strengthen detection both before and after incidents occur, and build systems that shorten recovery time and prevent recurrence.

Ensuring service visibility and availability.

  • Build visibility across infrastructure, networks, and Kubernetes environments, and link it to actual service metrics.
  • Establish SLOs that clearly define what counts as a problem, and continuously improve the alert system with the metrics it needs.
  • Identify potential component bottlenecks ahead of traffic increases and improve the architecture accordingly.

Conducting in-depth analysis of issues and identifying root causes.

  • When service disruptions occur, go beyond logs: use eBPF and examine memory, network, and kernel behavior to reach the root cause.
  • Analyze internal application behavior precisely and give developers guidelines for optimal resource configuration.

Automating operations and developing internal tools.

  • Automate repetitive analysis tasks, identify what hinders operational efficiency, and build tools to address it.
  • Automate test environments to improve service reliability and make them easy for colleagues to use.

About Toss

Toss revolutionizes financial services, focusing on innovative solutions that enhance user experience. As part of our collaborative and dynamic environment, you will join a team dedicated to creating impactful technology solutions.
