Qualifications
Key Responsibilities:
Establish and uphold standards for runbook quality, incorporating clear escalation protocols, rollback strategies, and assessments of customer impact.
Validate runbooks through tabletop exercises and live testing prior to deployment to shift teams.
Proactively identify deficiencies in GOC processes, tools, and service coverage.
Develop and implement new processes to rectify operational blind spots.
Standardize incident triage, escalation, and resolution workflows across all shifts.
Create and maintain the GOC’s operational knowledge base.
Manage the complete lifecycle of GOC runbooks, including creation, peer review, validation, and retirement.
Draft runbooks for common operational tasks that shift teams can independently execute.
Lead post-incident reviews for major incidents, providing actionable insights.
Monitor recurring issue patterns and facilitate root cause resolutions in collaboration with engineering teams.
Track and report on incident metrics, identifying trends that necessitate systemic improvements.
Ensure that lessons learned are incorporated into future operations.
About the job
About TensorWave
At TensorWave, our mission is straightforward: to provide seamless, secure, reliable, and resilient AI compute at scale. Our cloud platform removes infrastructure barriers, allowing creators to concentrate on innovation instead of battling technical obstacles. We believe that transformative AI should progress at the speed of ideas, unhindered by infrastructure.
About the Role
As we develop the next generation of GPU cloud infrastructure, our Global Operations Center (GOC) serves as the backbone of 24/7 operations across multiple data centers. As Lead Operations Engineer, you will be the technical anchor of the GOC, acting as a liaison between our frontline operations engineers and the engineering teams responsible for building and maintaining our platform.
Your contributions will enhance the effectiveness of shift teams: refining and validating operational runbooks, analyzing significant incidents to promote systemic enhancements, and collaborating with engineering leads to improve alert systems and identify tasks suitable for delegation to the operations floor. Working alongside the Head of Global Operations, you will be pivotal in elevating the operational maturity of the GOC and shifting from reactive measures to proactive, standardized operations.
About TensorWave
TensorWave is at the forefront of revolutionizing AI compute through our cutting-edge cloud platform. Our dedication to quality and innovation empowers teams to focus on their creative solutions rather than infrastructure challenges, fostering an environment where ideas can flourish at unprecedented speeds.