About the job
- Internal tools for running and analyzing evaluations efficiently, such as a system that investigates thousands of agentic evaluation runs in parallel and automatically surfaces insights.
- Automated evaluation pipelines that streamline the path from gaining access to a new model for pre-deployment testing to analyzing key results and disseminating them.
- Orchestration tools enabling researchers to execute thousands of agentic evaluations in parallel on secure remote machines.
- LLM proxy services that allow real-time monitoring of our coding agent traffic, supporting automatic identification of undesired behaviors.
- Development of LLM agents and MCP tools that automate internal software engineering and research tasks, with safeguards to prevent significant failures.
- Continuous Integration (CI) pipeline optimizations to enhance execution speed and eliminate unreliable tests.
- Telemetry API and enhanced instrumentation of existing tools to monitor usage and improve reliability.
- A data warehousing pipeline to store thousands of evaluation transcripts for research and dataset building.
- Improvements to the Inspect framework and ecosystem, including support for modern agentic scaffolds.
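As a rough illustration of the orchestration pattern in the list above, here is a minimal sketch of fanning out many evaluation runs in parallel with a concurrency cap. All names (`run_eval`, `run_all`) are hypothetical stand-ins, not the team's actual code; a real system would dispatch to secure remote machines rather than sleep locally.

```python
import asyncio

async def run_eval(run_id: int) -> dict:
    # Hypothetical stand-in for dispatching one agentic evaluation
    # to a remote machine and awaiting its result.
    await asyncio.sleep(0)  # placeholder for real network I/O
    return {"run_id": run_id, "passed": run_id % 2 == 0}

async def run_all(n_runs: int, max_concurrency: int = 8) -> list[dict]:
    # Cap in-flight evaluations so thousands of runs don't overwhelm
    # the backend; gather preserves result order by run_id.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(run_id: int) -> dict:
        async with sem:
            return await run_eval(run_id)

    return await asyncio.gather(*(bounded(i) for i in range(n_runs)))

results = asyncio.run(run_all(100))
pass_rate = sum(r["passed"] for r in results) / len(results)
```

The semaphore-plus-gather shape is a common way to get bounded parallelism; in practice the per-run coroutine would also handle retries, timeouts, and transcript logging for the data warehouse.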

