Machine Learning Engineer - Infrastructure

On-site Full-time


Experience Level

Experience

Qualifications

Experience building and operating Kubernetes-based ML infrastructure for large-scale traffic is required. A sense of responsibility for the stable operation of live services, experience analyzing and debugging the root causes of issues, and a strong understanding of system resources are essential. Experience strengthening systems by solving problems that arise during service operation is highly valued.

About the job

Join Our Innovative Team

  • The Machine Learning Engineer (Infra) will be part of the ML Platform Team within the Product Division at Toss Securities.
  • The primary goal of the ML Platform Team is to create an optimal machine learning platform that enables the efficient and stable development and operation of various AI/ML services at Toss Securities.
  • The ML Engineer (Infra) will focus on maximizing the efficiency of large-scale AI infrastructure, controlling resource usage at a fine grain, and pushing infrastructure performance to its limit.

Your Responsibilities

  • Design and operate high-performance AI computing environments reliably.
    • Design and operate clusters of top-tier GPUs (H100, B300 series) connected via InfiniBand, together with high-performance storage (400 Gbps), in a Kubernetes environment.
    • Go beyond building infrastructure: tune networks and storage to get the full performance out of the hardware.
  • Develop a comprehensive control system for the entire AI infrastructure.
    • Create an observability system to integrate and monitor AI resources distributed across internal infrastructure and external cloud.
    • Implement management features to prevent resource monopolization by specific services and allocate resources precisely based on importance.
  • Create automation tools for the most efficient resource usage.
    • Analyze actual usage patterns and build tools that recommend right-sized resource allocations to avoid waste.
    • Implement features that automatically scale up or down based on real-time model performance or error rates, and reallocate GPUs where necessary.
  • Establish an environment for identifying and resolving model performance bottlenecks.
    • Build profiling environments to accurately pinpoint slowdowns during model training or serving.
    • Support analyzing and resolving the causes of performance degradation across hardware and software.

Who We Are Looking For

  • You have experience building and operating Kubernetes-based ML infrastructure that handles large-scale traffic.
  • You take responsibility for the reliable operation of live services, not just for development.
  • When issues arise, you persist in analyzing and debugging them until the root cause is resolved.
  • You possess a solid understanding of system resources (GPU/CPU/Memory/Network/Storage) and have experience building monitoring systems for them.
  • You value solving the problems that arise during service operation and using them to strengthen the system.

Preferred Qualifications

  • Experience in unified monitoring of resource usage in large-scale clusters.
  • Experience building systems that control resources systematically through quotas and rate limits.
  • Experience with open-source platforms such as Kubeflow or Kubernetes, including modifying their internals when needed.
  • Experience analyzing and optimizing bottlenecks at the kernel level using tools such as Nsight Systems, Nsight Compute, or PyTorch Profiler.
  • Experience designing initiatives that reduce cost or improve performance according to workload characteristics (rightsizing, cost optimization).
  • Experience using GPU partitioning and sharing technologies such as MIG and MPS to maximize resource utilization.

About Toss Securities

Toss Securities is a leading company in the financial technology sector, dedicated to leveraging artificial intelligence and machine learning to enhance our services. We are committed to innovation and excellence in providing top-tier financial solutions.
