Key Responsibilities
- Manage and sustain container orchestration platforms and containerized workloads.
- Monitor and troubleshoot production systems, partaking in on-call rotations to ensure system reliability.
- Enhance observability by improving monitoring, logging, and alerting capabilities across systems and data platforms.
- Administer and optimize cloud-based environments across various providers.
- Support and manage distributed data platforms and real-time processing systems.
- Develop and maintain continuous integration and delivery pipelines for seamless and reliable deployments.
- Implement Infrastructure as Code (IaC) practices for uniformity and scalability.
- Automate and orchestrate infrastructure using various programming and scripting languages.
- Conduct system administration and networking tasks to support both internal and external environments.
- Collaborate effectively with engineers and stakeholders across diverse time zones.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.
- Proven experience leading large-scale production systems in cloud environments (AWS, GCP, Azure, or OCI).
- Demonstrated leadership in incident response, on-call best practices, and fostering a reliability-focused culture.
- Extensive experience with production on-call operations and incident management.
- Advanced skills in Kubernetes administration and troubleshooting.
- Hands-on experience with observability tools such as Prometheus, Grafana, Loki, and Alertmanager.
- Familiarity with chat-based operations interfaces and/or automated remediation strategies.
About the job
Embark on a rewarding career with Stellar Cyber, a rapidly expanding global leader in cybersecurity, trusted by top enterprises and government agencies. Almost 30% of the world's leading Managed Security Service Providers (MSSPs) utilize our innovative platform, and our influence continues to grow as more organizations acknowledge the significance of next-generation security solutions. At the forefront of defending against advanced cyber threats, we employ state-of-the-art AI and automation technologies. Our culture promotes diversity, transparency, and teamwork, fostering creativity and innovation that make a tangible impact in the cybersecurity landscape.
We are in search of an exceptionally talented Senior Staff Site Reliability Engineer (SRE) to enhance our team and advance the reliability, scalability, and efficiency of our production systems. The ideal candidate will possess in-depth knowledge of cloud infrastructure, Kubernetes management, observability, and incident response, with a solid history of building and sustaining highly available and resilient platforms. As a pivotal member of the SRE team, you will handle intricate distributed systems while also shaping architecture, tools, and best practices to guarantee operational excellence.
About Stellar Cyber
Stellar Cyber is a pioneering cybersecurity firm dedicated to providing advanced solutions that protect organizations from sophisticated cyber threats. Our rapidly growing client base includes some of the largest enterprises and government agencies worldwide, making us a trusted leader in the field.