About the job
Why You Will Enjoy Being Part of Our Team:
Blameless Culture: We address incidents collaboratively. Our policy is clear: resolve the issue first, analyze the root cause later. No blame, only solutions.
Fully Cloud-Based & Large-Scale Operations: Operating entirely within the Google Cloud Platform (GCP) ecosystem and Google Kubernetes Engine (GKE), we manage seamless auto-scaling during peak traffic events.
AI-Driven Processes: We utilize AI to enhance daily operations, automate log analysis and troubleshooting, and expedite software releases.
Empowerment Through Trust: Your access rights start minimal but expand as you demonstrate your skills. Master our systems, and you’ll earn the highest access privileges.
Key Responsibilities (50% Automation / 50% Operations):
This pivotal role demands robust engineering expertise, practical experience, and hands-on implementation skills. You will:
Serve as the primary point of contact for incident management, swiftly addressing issues as they arise.
Guarantee optimal performance, availability, and scalability of production systems.
Automate infrastructure provisioning in the cloud, including systems and software setups.
Design and manage build & release pipelines, configuration management, and code deployments across various environments.
Collaborate closely with the development team to refine deployment processes and strategies.
Identify and act on challenges and opportunities in critical, high-impact areas.
Your First 6 Months:
Months 1-2 (Learning Phase): Focus on understanding Chợ Tốt's core infrastructure. We will sponsor your learning through Coursera so you can obtain the necessary Google Cloud / K8s certifications, and we will familiarize you with our infrastructure across all three environments.
Months 3-6 (Execution Phase): Achieve mastery of the infrastructure, particularly in Production. You will handle support requests from Engineers, take on Group-level assignments, and take part in on-call responsibilities.

