Qualifications

Requirements:

Bachelor's degree in Computer Science, Engineering, or a related discipline.
Proven experience in operations and maintenance development, ideally within a cloud computing or AI-centric environment.
Deep understanding of system architecture and performance monitoring.
Strong troubleshooting skills and ability to optimize complex systems.
Excellent interpersonal and communication skills for effective collaboration.
Ability to manage multiple tasks and priorities in a fast-paced environment.

About the job

Aethir stands at the forefront of DePIN technology as a premier provider of enterprise-grade, AI-driven GPU-as-a-Service. Our innovative approach harnesses a distributed cloud computing infrastructure, enabling GPU providers to deliver scalable solutions for AI and gaming customers. Our vision is to empower enterprise clients with advanced AI chips while facilitating cloud gaming experiences for hundreds of thousands of users globally, all within a decentralized cloud framework that directly benefits the community.

We are seeking an experienced Senior DevOps Engineer (Site Reliability Engineer) to join our expanding team at our new headquarters in Kuala Lumpur, Malaysia. In this role, you will be responsible for maintaining, optimizing, and scaling our production systems, ensuring high availability, reliability, and performance across our decentralized compute network. Your work will support mission-critical infrastructure for our global AI and cloud gaming clientele.

Key Responsibilities:

Monitoring and Fault Response: Proactively monitor, assess, and respond to system faults, ensuring effective troubleshooting and optimization of our production systems.
System Architecture Oversight: Continuously evaluate and enhance system architecture, process logic, and performance metrics to ensure optimal stability and efficiency.
Cross-departmental Collaboration: Partner with business teams to resolve operational and maintenance-related challenges.
Production Fault Coordination: Act swiftly to address production failures, leading the resolution efforts.
Collaborative Problem Resolution: Facilitate teamwork among R&D, operations, and product teams to effectively investigate and remedy issues.
Response Time Management: Ensure timely resolution of production issues by maintaining accountability for response and resolution times.
Case Studies and System Optimization: Conduct thorough analyses of production incidents and implement improvements to enhance system performance and stability.
Comprehensive Documentation: Develop and maintain detailed documentation of system architecture, processes, and troubleshooting protocols.
Continuous Process Improvement: Identify operational inefficiencies and implement necessary changes to enhance maintenance processes.

About Aethir

Aethir is a pioneering company in the DePIN sector, specializing in enterprise-grade, AI-focused GPU-as-a-Service solutions. Our mission is to provide powerful AI capabilities and cloud gaming services to customers worldwide through a decentralized cloud architecture, enhancing accessibility and performance for our clients.
