About the job
Aghanim is hiring a Senior/Principal DevOps Engineer in Lisbon. This position centers on owning and improving a fully cloud-native platform, built on Google Cloud Platform (GCP) and Cloudflare, and monitored through Datadog. The infrastructure is managed with Infrastructure as Code and automated CI/CD pipelines via GitHub Actions.
Role Overview
This is a hands-on role with significant responsibility. The Senior/Principal DevOps Engineer ensures the platform stays reliable during heavy traffic and rapid growth. The work includes meeting strict SLA/SLO targets, supporting growth to 10–50 times the current load, and optimizing for both efficiency and cost as the company and its microservices expand.
Main Responsibilities
Cloud Infrastructure Management
- Oversee and improve production infrastructure on GCP and Cloudflare (cloud-only, no on-premises systems).
- Maintain high availability and performance for a SaaS platform serving both B2B and B2C customers.
Scalability and High-Load Management
- Design and operate systems that handle sudden traffic spikes, where load can rise 10–20 times within seconds.
- Develop strategies for scaling compute, network, and data layers: autoscaling, capacity planning, and safe degradation.
SLA/SLO and Incident Management
- Monitor and take responsibility for reliability metrics: availability, latency, and error rates as defined by SLA/SLO.
- Lead incident response end to end: detection, mitigation, postmortem analysis, and implementation of permanent fixes.
Infrastructure as Code and Kubernetes Operations
- Build and maintain Infrastructure as Code using Terraform, with Terragrunt where needed.
- Manage Kubernetes clusters on GKE, including upgrades, scaling, and security improvements.
- Create and maintain Helm charts and Kubernetes manifests.
Observability with Datadog
- Implement and maintain observability systems in Datadog: metrics, logs, APM, dashboards, monitoring, and alerting.

