About the job
Embracing the benefits of remote work, we at Tecsys promote a digital-first culture that enhances employee morale, boosts productivity, and reduces the environmental impact associated with commuting. Our commitment to remote work is complemented by our well-equipped offices and collaborative spaces, giving our team the flexibility to work in whatever way is most productive.
About Us
Tecsys is a rapidly growing innovator in supply chain solutions, serving leading healthcare systems, hospitals, pharmacies, distributors, retailers, and 3PLs. We partner with industry leaders to revolutionize their supply chains through cutting-edge technology. If you enjoy overcoming challenges and are eager for continuous learning, Tecsys may be the perfect place for you!
About the Role
We are seeking a Site Reliability Engineer to join our Network and Security Operations Center (NOC), which is integral to ensuring platform reliability for our mission-critical SaaS environments. In this role, you will maintain and optimize our cloud infrastructure across AWS and Kubernetes and ensure its reliability and performance. Your focus will be on automation, observability, and continuous improvement. This position combines reliability engineering with incident command, giving you significant ownership of uptime, performance, and innovation. You will join a team of highly skilled professionals who value creative problem-solving, operational excellence, and continual enhancement through automation and resilience engineering.
Your Responsibilities
- Collaborate with Engineering teams to support services pre-launch through system design consulting, software platform development, capacity planning, and launch reviews.
- Drive innovation: Identify issues, propose creative solutions, and implement initiatives to simplify, scale, and strengthen the platform.
- Monitor and maintain live services by evaluating availability, latency, and overall system health.
- Enhance observability: Expand monitoring and alerting with Datadog; define SLOs/SLIs and create actionable dashboards to promote reliability.
- Automate processes: Develop and improve internal tools, IaC frameworks, and pipelines (e.g., Terraform, GitLab CI/CD) to minimize manual intervention and enable self-healing systems.
- Achieve sustainable system scaling through automation and advocate for changes that enhance reliability and velocity.
- Act as an orchestrator using Amazon Kiro: Run multiple activities concurrently with AI agents to speed up work, while personally validating the outcomes.
