About the job
Aghanim is hiring a Mid-Level/High-Level DevOps/SRE Engineer in Lisbon. This role focuses on managing and improving our production platform, which runs on Google Cloud Platform (GCP) and Google Kubernetes Engine (GKE). Cloudflare sits in front of the platform, Datadog provides observability, and CI/CD pipelines run on GitHub Actions.
You will work closely with Senior and Principal engineers to strengthen reliability, expand monitoring, and reduce manual operational work. The systems you support handle high load and must be ready for sudden traffic spikes.
What You Will Do
Platform Operations (GCP/GKE)
- Manage and support production systems on GCP, with a focus on GKE and other managed services.
- Carry out platform enhancements and operational tasks as directed by more senior engineers.
Infrastructure as Code & Delivery Enablement
- Apply infrastructure changes using Terraform and, where needed, Terragrunt.
- Develop and maintain Helm charts and Kubernetes manifests.
- Improve reliability of GitHub Actions and CI/CD workflows, including deployment automation.
Monitoring & Observability (Datadog)
- Create and manage Datadog dashboards and monitors to ensure effective alerting.
- Find and address monitoring gaps in key system components. Refine alerts to cut noise and improve signal quality.
Incident Management
- Participate in incident response and operational support: triage, mitigation using runbooks, escalation, and follow-up remediation.
- Contribute to postmortem reviews with clear facts, timelines, and actionable remediation steps.
Security Fundamentals (DevSecOps)
- Set up and operate security tools and monitoring systems. Help triage findings and implement solutions under supervision.
- Promote secure-by-default practices such as secrets management, access control, and baseline hardening.
Cost Awareness
- Understand and manage operational costs for the platform.

