About the job
Platform Site Reliability Engineer II
About The Role
Join Todyl's Application Platform Engineering team, where we focus on creating robust infrastructure, services, and frameworks that allow our application development teams to swiftly and securely deploy the services at the heart of our security solutions. As a key member of this forward-thinking team, you will be instrumental in designing and engineering innovative solutions that are not only high-performing and resilient but also require minimal maintenance. Your contributions will significantly enhance the reliability and security of our platform, while also enabling our engineering teams to explore new frontiers in the security domain.
Responsibilities:
Develop tools and services to support Todyl's application hosting infrastructure, particularly in Kubernetes environments.
Create automation to enhance reliability and minimize manual intervention for Day 2 Operations, emphasizing infrastructure-as-code methodologies.
Implement and uphold security policies, access controls, and system patching, considering security hygiene as a paramount operational duty.
Manage the attack surface of production infrastructure: identify vulnerabilities, prioritize remediation efforts, and drive CVE resolutions to completion.
Operationalize security tools by establishing integrations, creating remediation workflows, and ensuring consistent follow-up on identified issues.
Oversee features and services through deployment and stabilization; consider work complete only when it is stable in production and adequately documented.
Collaborate with product and engineering teams to deliver solutions that align with stakeholder and business requirements.
Enhance application monitoring and alerting to reduce detection and restoration times; analyze dashboards and logs to confirm successful deployments.
Identify and pursue cost-optimization opportunities, including resource labeling, right-sizing, and efficiency enhancements to lower COGS.
Participate in a weekly on-call rotation, resolve most issues independently, and update runbooks and documentation post-incident.
Requirements:
MUST HAVE: Experience managing Kubernetes and application hosting infrastructure.

