About the job
Aethir stands at the forefront of DePIN technology as a premier provider of enterprise-grade, AI-driven GPU-as-a-Service. Our innovative approach harnesses a distributed cloud computing infrastructure, enabling GPU providers to deliver scalable solutions for AI and gaming customers. Our vision is to empower enterprise clients with advanced AI chips while facilitating cloud gaming experiences for hundreds of thousands of users globally, all within a decentralized cloud framework that directly benefits the community.
We are seeking a dynamic and experienced Senior DevOps Engineer (Site Reliability Engineer) to join our expanding team at our new headquarters in Kuala Lumpur, Malaysia. In this pivotal role, you will maintain, optimize, and scale our production systems, ensuring high availability, reliability, and performance across our decentralized compute network. Your contributions will be vital in supporting mission-critical infrastructure for our global AI and cloud gaming clientele.
Key Responsibilities:
- Monitoring and Fault Response: Proactively monitor, assess, and respond to system faults, ensuring effective troubleshooting and optimization of our production systems.
- System Architecture Oversight: Continuously evaluate and enhance system architecture, process logic, and performance metrics to ensure optimal stability and efficiency.
- Cross-departmental Collaboration: Partner with business teams to resolve operational and maintenance-related challenges.
- Production Fault Coordination: Act swiftly to address production failures, leading the resolution efforts.
- Collaborative Problem Resolution: Facilitate teamwork among R&D, operations, and product teams to effectively investigate and remedy issues.
- Response Time Management: Take ownership of response and resolution times so that production issues are resolved promptly.
- Case Studies and System Optimization: Conduct thorough analyses of production incidents and implement improvements to enhance system performance and stability.
- Comprehensive Documentation: Develop and maintain detailed documentation of system architecture, processes, and troubleshooting protocols.
- Continuous Process Improvement: Identify operational inefficiencies and implement necessary changes to enhance maintenance processes.
