About the job
About LangChain
LangChain is dedicated to revolutionizing the landscape of intelligent agents. Our agent engineering platform and open-source frameworks help developers build reliable agents quickly.
With over 90 million downloads monthly, our frameworks, LangChain and LangGraph, are instrumental for developers looking to build sophisticated agents. LangSmith enhances this experience by providing observability, evaluation, and deployment capabilities that facilitate rapid iteration, ensuring teams can seamlessly transition LLM systems into dependable production environments.
Trusted by millions globally, LangChain supports AI teams at leading organizations such as Replit, Clay, Cloudflare, Harvey, Rippling, Vanta, Workday, and many others.
About the Role
This position requires in-person attendance five days a week in either San Francisco, CA or New York, NY.
We are building infrastructure designed specifically for running AI agents. Unlike conventional web applications, these agents run for extended periods, interact asynchronously with both humans and other agents, and must survive failures during execution. The LangSmith Deployments runtime supports this through durable checkpointing, fault-tolerant orchestration, and horizontal scaling across both cloud and self-hosted environments.
We are seeking a Senior Backend Engineer to contribute to this vital system. While the primary focus will be on backend development, a strong understanding of Kubernetes, Terraform, and other DevOps tools is highly desirable.
Responsibilities
Design and implement distributed queue and worker systems to manage concurrent agent execution, background tasks, and multi-agent coordination within horizontally scalable infrastructure.
Take ownership of core data infrastructure, including state persistence, atomic job claiming, connection management, and schema evolution.
Collaborate on architectural decisions to ensure scalability and robustness of solutions.
Develop resumable streaming infrastructure, allowing clients to disconnect and reconnect during execution without loss of state.
Monitor and instrument production systems, including tracing, metrics, and alerting to maintain platform health.
Participate in on-call rotations and manage incident response for the runtime.
Create and maintain technical documentation, encompassing system design and operational runbooks.
Contribute to and improve the open-source LangGraph framework, used by thousands of developers to build agent applications.

