About the job
Airbyte stands at the forefront of open-source data movement, enabling data teams to seamlessly transfer information from diverse applications, APIs, unstructured sources, and databases to data warehouses, lakes, and AI applications. With tens of thousands of connectors and widespread adoption across hundreds of thousands of companies, we have demonstrated the viability of large-scale data integration. Our ongoing mission is to construct an advanced agentic data infrastructure, meticulously designed for AI agents requiring swift and accurate access to data across numerous sources. We aim to make data universally accessible and actionable.
Having secured $181M from leading investors such as Benchmark, Accel, Altimeter, Coatue, and Y Combinator, we are committed to a product-led growth strategy where we create exceptional solutions that resonate with our users. This funding empowers us to explore innovative avenues while maintaining a nimble and experimental approach in an AI-driven landscape.
The Role:
As a critical member of the Data Replication team, you will serve as an infrastructure and reliability engineer within a full-stack product team that executes over 3 million sync jobs weekly, facilitating thousands of data use cases across various regions and cloud environments. You will be responsible for building and maintaining the infrastructure, establishing reliability standards, reducing incidents, and streamlining the shipping process for engineers through enhanced tooling. You should feel equally at home working with Terraform files, Kubernetes clusters, and postmortem documentation.
We encourage our engineers to actively leverage AI as a force multiplier, using agentic tools to automate repetitive tasks, enhance incident response, and develop smarter internal tooling. If you haven't yet embraced this approach, we hope you're eager to start. We value how you work just as much as what you produce. Trust, transparency, and craftsmanship are paramount here.
What You’ll Do:
Take ownership of the infrastructure that supports the Data Replication platform, including Kubernetes clusters, CI/CD pipelines, secrets management, networking, and cloud resource configuration across AWS and GCP.
Collaborate with product engineers to ensure reliable integration of product features with infrastructure.
Enhance observability, alerting, and anomaly detection systems with a focus on LLM automation.
Develop and improve AI-augmented release and internal tooling, including canary deployments, progressive rollouts, automated release qualification, and rollback automation, all with a focus on LLM automation.
Establish high standards for infrastructure within the team by creating self-serve tools, writing runbooks, and mentoring engineers.

