About the job
qodeworld is seeking a Senior Site Reliability Engineer with a focus on Unified Observability and AIOps. This full-time, hybrid role is based in either Austin, TX or Fort Mill, SC. The position supports distributed financial services platforms, emphasizing modern observability practices, proactive detection, and AI/ML-driven operations.
Key Responsibilities
- Design and implement unified observability dashboards that display metrics, logs, traces, events, and system topology for complex systems.
- Define and manage service level indicators (SLIs), service level objectives (SLOs), and error budgets to align with business goals.
- Develop actionable dashboards tailored for operations, engineering, and leadership teams.
- Create alerting strategies using both static and dynamic thresholds to improve reliability.
- Apply AI/ML and AIOps techniques to detect anomalies, forecast incidents, and reduce mean time to resolution (MTTR).
- Shift monitoring from reactive alerts to proactive operational insights, implementing noise reduction and alert correlation.
- Use baseline modeling, seasonality detection, and anomaly scoring to enhance system reliability.
- Monitor and resolve issues in distributed environments, including microservices, downstream APIs, Kafka and other streaming platforms, and cloud infrastructure managed as Infrastructure as Code with Terraform.
- Identify root causes across dependencies, streaming platforms, infrastructure, and application code.
- Work hands-on with Dynatrace (required) and draw on experience with OpenTelemetry, Prometheus, Grafana, ELK/EFK stacks, and cloud-native monitoring tools on AWS, Azure, or GCP.
- Transform and enrich JSON-based telemetry data.
- Leverage GenAI and large language models for incident summarization, root cause explanation, runbook improvements, and auto-remediation guidance.
- Collaborate with platform teams to securely integrate GenAI solutions into operational workflows.
Requirements
- Hands-on experience with Dynatrace is required.
- Background in observability, reliability engineering, and distributed systems monitoring.
- Experience with cloud-native monitoring tools and infrastructure as code practices.
- Familiarity with AI/ML and AIOps concepts applied to operational monitoring and incident response.
