About the job
At Mysten Labs, we are pioneers in creating decentralized and open protocols that form the backbone of the future internet of value. Our mission is to develop foundational infrastructure that accelerates the adoption of blockchain technologies and decentralized protocols.
Team Overview
As a Production Engineer at Mysten Labs, you will be the guardian of our decentralized infrastructure, ensuring the seamless operation of the Sui blockchain, the Walrus storage network, and other critical protocols under high-traffic conditions, adversarial attacks, and large-scale global events. With a focus on automation, observability, and system resilience, you will work alongside experts in distributed systems to deploy, monitor, and enhance software that handles the most demanding workloads in Web3. The role is hands-on in a rapidly scaling decentralized environment that runs on hybrid cloud and bare-metal infrastructure, with a comprehensive Grafana/Mimir/Loki observability stack and other state-of-the-art integrations.
Role Responsibilities
You will take ownership of four key pillars within the Production Engineering team: Infrastructure, Observability, Release Engineering, and Reliability. Your responsibilities will include establishing operational rigor across internal services and the decentralized Sui Stack (including protocols like Walrus and SEAL), proactively identifying bottlenecks, automating repetitive tasks, and shipping fixes that mitigate outages at scale. You will also collaborate with core engineers to productionize Rust-based systems and assist internal teams in developing new products while addressing unique Web3 challenges. If you excel at transforming chaos into uptime and have a passion for automation, this is your opportunity to help build the infrastructure for the next billion users on Sui.
Key Responsibilities:
Scale hybrid cloud and bare-metal infrastructure for Sui validators and Walrus storage nodes, focusing on optimizing costs, latency, and resilience against threats.
Enhance observability using our Grafana stack (Mimir, Loki) by creating dashboards, alerts, and tracing that identify issues in real time during peak loads.
Evolve deployment pipelines beyond GitHub Actions: design next-generation CI/CD for rapid releases, integrating Infrastructure as Code (IaC) tools such as Pulumi with Kubernetes orchestration.
Improve reliability by measuring and optimizing performance, authoring runbooks, automating incident response, and hardening systems against failure modes specific to decentralized networks, such as DDoS attacks or blockchain halts.
Collaborate with Engineering partners across Greece and Europe to enhance product development velocity.

