Senior Site Reliability Engineer - Unified Observability & AIOps

qodeworld
South Carolina, United States
Hybrid Full-time

Experience Level

Senior

Qualifications

15+ years of experience in site reliability engineering, software engineering, or a related field. Strong knowledge of observability best practices, including monitoring, alerting, and incident response. Proficiency with AI/ML technologies and their application in operational contexts. Experience working with cloud platforms and modern infrastructure tools.

About the job

qodeworld is seeking a Senior Site Reliability Engineer with a focus on Unified Observability and AIOps. This full-time, hybrid role is based in either Austin, TX or Fort Mill, SC. The position supports distributed financial services platforms, emphasizing modern observability practices, proactive detection, and AI/ML-driven operations.

Key Responsibilities

  • Design and implement unified observability dashboards that display metrics, logs, traces, events, and system topology for complex systems.
  • Define and manage service level indicators (SLIs), service level objectives (SLOs), and error budgets to align with business goals.
  • Develop actionable dashboards tailored for operations, engineering, and leadership teams.
  • Create alerting strategies using both static and dynamic thresholds to improve reliability.
  • Apply AI/ML and AIOps techniques to detect anomalies, forecast incidents, and reduce mean time to resolution (MTTR).
  • Shift monitoring from reactive alerts to proactive operational insights, implementing noise reduction and alert correlation.
  • Use baseline modeling, seasonality detection, and anomaly scoring to enhance system reliability.
  • Monitor and resolve issues in distributed environments, including microservices, downstream APIs, Kafka and other streaming platforms, and cloud infrastructure managed as Infrastructure as Code with Terraform.
  • Identify root causes across dependencies, streaming platforms, infrastructure, and application code.
  • Work hands-on with Dynatrace (required), drawing on experience with OpenTelemetry, Prometheus, Grafana, ELK/EFK stacks, and cloud-native monitoring tools on AWS, Azure, or GCP.
  • Manipulate and enhance JSON-based telemetry data.
  • Leverage GenAI and large language models for incident summarization, root cause explanation, runbook improvements, and auto-remediation guidance.
  • Collaborate with platform teams to securely integrate GenAI solutions into operational workflows.

Requirements

  • Hands-on experience with Dynatrace is required.
  • Background in observability, reliability engineering, and distributed systems monitoring.
  • Experience with cloud-native monitoring tools and infrastructure as code practices.
  • Familiarity with AI/ML and AIOps concepts applied to operational monitoring and incident response.

About qodeworld

qodeworld is a leader in providing innovative solutions for complex financial services platforms. We foster a collaborative environment that values creativity and continuous improvement, making us a prime choice for professionals eager to make an impact.
