About the job
About the Team You'll Join
- The Infra Engineering Tribe is an engineering organization at Toss that designs and operates the network, systems, and infrastructure to ensure the stable operation of various Toss services.
- The Systems Engineer team goes beyond routine maintenance: we improve the infrastructure architecture at a fundamental level, eliminate the root causes of failures, and set infrastructure strategies that support the introduction of new services and technologies.
- Our goal is to ensure that every Toss service is scalable and stable.
The Challenges We're Tackling Together
- We operate a large-scale on-premises infrastructure that reliably processes millions of financial transactions.
- We design computing environments for diverse workloads (GPU, analytics, ML, etc.).
- We analyze root causes during incidents and prevent recurrence through structural improvements.
- We design and standardize service architectures that guarantee high availability and scalability.
- We operate and optimize data infrastructures based on DW, Data Mart, and Data Lake.
- We plan and build in-house operational tools, automation, and monitoring systems.
Your Responsibilities Upon Joining Us
- You will design, build, and reliably operate on-premises infrastructure.
- You will define issues in complex infrastructure environments and derive optimal solutions.
- You will lead system improvements while collaborating with various teams such as data, platform, and security.
We're Looking for Someone Who
- Has experience operating large-scale Linux servers and network infrastructure.
- Is proficient in quickly identifying issues and designing structural solutions.
- Has experience automating operations with scripting languages such as Python and Bash.
- Has experience responding to incidents using open-source monitoring and logging tools.
- Can effectively communicate and collaborate with diverse stakeholders.
GPU and ML Infrastructure Experience
- Experience operating and enhancing GPU clusters (Slurm, Kubernetes, etc.) is a plus.
- Experience supporting MLOps environments with tools such as Kubeflow, MLflow, and Airflow is a plus.
- Experience with scheduling, monitoring, and resource optimization for AI/ML workloads is a plus.
Data Infrastructure Experience
- Experience operating Data Warehouses, Data Marts, and Data Lakes is a plus.
- Experience managing distributed data processing infrastructure based on Hadoop and Spark is a plus.
- Experience designing and enhancing hardware for large-scale data processing systems is a plus.
- Experience operating Kafka-based data pipeline infrastructure and responding to its incidents is a plus.
# Resume Recommendations
- Please provide detailed examples of at least two complex problems you defined and solved, focusing on root cause analysis, solution approach, results, and the infrastructure changes made.
- Describe the projects you contributed to, including project duration, your role, technologies used, infrastructure architecture, and improvements made.
# Journey to Joining Toss
- Application Submission > Job Interview > Cultural Fit Interview > Reference Check > Compensation Discussion > Final Acceptance and Onboarding.
# A Message for Future Colleagues
> "You can experience every aspect of being a Systems Engineer."
- We are looking for people who can face complex problems, define them clearly, and find the best solutions. If you want to help innovate infrastructure at Toss, please apply now!

