About the job
About CI&T
CI&T brings together human expertise and AI to build scalable technology solutions. With a team of over 8,000 professionals worldwide and more than 1,000 client partnerships over the past 30 years, CI&T focuses on real-world artificial intelligence and digital transformation.
Location Requirement
Important: Candidates living in the Metropolitan Region of Campinas must work onsite at our city offices, following our current attendance policy.
Role Overview
We are hiring a Senior Site Reliability Engineer (SRE) based in Brazil to join CI&T and support one of our projects. This role calls for someone who takes ownership of applications, manages their own backlog, and collaborates closely with cross-functional teams. Strong communication and analytical skills are essential.
What You Will Do
- Analyze reliability, performance, and availability of applications.
- Monitor deployments, address performance and security issues, and apply lessons learned to prevent future incidents.
- Proactively manage and prioritize the task backlog, identify improvement areas, and suggest collaborative solutions.
- Communicate effectively with teams across the application lifecycle to clarify needs and priorities.
- Stay informed about industry trends, best practices, and new technologies in cloud computing and DevOps/SRE.
Technical Requirements
- Previous experience as a Site Reliability Engineer (SRE) and understanding of key reliability metrics.
- Background in monitoring Java backend applications.
- Strong experience with FinOps practices and cloud cost management.
- Hands-on experience with observability tools such as Datadog, Grafana, Prometheus, and Thanos.
- Experience working with AWS container services (ECS, EKS), Kubernetes, and Docker.
- Proficient in Linux environments.
- Familiarity with GitHub, Jenkins, and Splunk (desirable but not strictly required).
- Experience building and maintaining CI/CD pipelines (GitHub Actions, AWS CodeBuild, AWS CodePipeline).
- Knowledge of Infrastructure as Code using Terraform.
- Strong analytical and problem-solving skills, with adaptability and willingness to learn.
- Experience with performance and stress testing.
- Understanding of chaos engineering, including what to test, which failures to simulate, how to validate results, and how to analyze the impact on applications.
