Senior/Principal DevOps Engineer

Aghanim, Lisbon
On-site, Full-time

Experience Level

Senior

Qualifications

We are seeking candidates with the following qualifications:

  • Strong experience with cloud platforms, particularly Google Cloud Platform (GCP).
  • Proficiency in Infrastructure as Code tools, particularly Terraform.
  • Experience with container orchestration, specifically Kubernetes.
  • Knowledge of observability tools like Datadog.
  • Excellent problem-solving skills and a proactive approach to incident management.
  • Strong communication skills to collaborate effectively across teams.

About the job

Aghanim is hiring a Senior/Principal DevOps Engineer in Lisbon. This position centers on owning and improving a fully cloud-native platform, built on Google Cloud Platform (GCP) and Cloudflare, and monitored through Datadog. The infrastructure is managed with Infrastructure as Code and automated CI/CD pipelines via GitHub Actions.

Role Overview

This is a hands-on role with significant responsibility. The Senior/Principal DevOps Engineer ensures the platform stays reliable during heavy traffic and rapid growth. The work includes meeting strict SLA/SLO targets, supporting growth to 10–50 times the current load, and optimizing for both efficiency and cost as the company and its microservices expand.

Main Responsibilities

Cloud Infrastructure Management

  • Oversee and improve production infrastructure on GCP and Cloudflare (cloud-only, no on-premises systems).
  • Maintain high availability and performance for a SaaS platform serving both B2B and B2C customers.

Scalability and Highload Management

  • Design and operate systems that handle sudden traffic spikes, with load increasing 10–20 times within seconds.
  • Develop strategies for scaling compute, network, and data layers: autoscaling, capacity planning, and safe degradation.

SLA/SLO and Incident Management

  • Monitor and take responsibility for reliability metrics: availability, latency, and error rates as defined by SLA/SLO.
  • Lead incident response, from detection through mitigation, postmortem analysis, and implementing permanent solutions.

Infrastructure as Code and Kubernetes Operations

  • Build and maintain Infrastructure as Code using Terraform and, where needed, Terragrunt.
  • Manage Kubernetes clusters on GKE, including upgrades, scaling, and security improvements.
  • Create and maintain Helm charts and Kubernetes manifests.

Observability with Datadog

  • Implement and maintain observability systems in Datadog: metrics, logs, APM, dashboards, monitoring, and alerting.

About Aghanim

Aghanim is a forward-thinking technology company focused on delivering innovative SaaS solutions. We prioritize reliability and scalability, providing a robust platform that empowers both businesses and consumers. Join us to be part of a dynamic team that values creativity and technical excellence.
