About the job
Please submit your CV in English and include your English proficiency level.
This freelance, project-based contract with toloka-ai is remote and open to candidates based in Uruguay. Mindrift connects skilled professionals with project-based AI roles at leading tech companies, with a focus on evaluating and improving AI systems. This is not a permanent position.
Role overview
The Freelance AI Agent Evaluation Engineer builds datasets to measure how well AI coding agents perform real-world software development tasks. The work centers on designing complex tasks and evaluation criteria inside detailed simulated environments.
What you will do
- Create virtual companies from high-level blueprints, including realistic codebases, infrastructure, and context such as conversations, documentation, and tickets, to simulate authentic development environments with history.
- Curate and adjust tasks at different stages of the virtual company's lifecycle. This includes developing prompts, defining evaluation criteria, and ensuring tasks are solvable and fairly assessed.
- Design challenges in isolated settings that mimic a developer's workstation: a Linux environment with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a real web application codebase.
- Develop tests that reliably accept all correct solutions and reject incorrect ones, aiming for a balance between strictness and fairness.
- Work alongside an AI agent on these tests, ensuring they catch real issues, reject poor solutions, and pass valid ones.
- Review code generated by AI agents, analyze both successes and failures, and design edge cases and adversarial scenarios that further challenge the models.
- Iterate on your approach based on feedback from expert QA reviewers who assess your work for quality.
What this role does not include
- Data labeling
- Prompt engineering
- Writing code from scratch (the AI agent will handle most coding; your focus is on guiding and evaluating)
This role involves close collaboration with advanced AI models, crafting tasks that push their capabilities and evaluating their performance in realistic scenarios.