About the job
Please submit your CV in English and include your English proficiency level.
Project Overview
Toloka AI, in partnership with Mindrift, offers project-based freelance roles focused on testing and evaluating AI systems. As a Freelance AI Evaluation Engineer, you will build and assess datasets that measure how well AI coding agents perform on tasks similar to those real-world developers face. This is a contract-based position and does not lead to permanent employment.
Key Responsibilities
- Design virtual companies from broad outlines, creating codebases, infrastructure, and realistic supporting materials such as documentation, tickets, and internal communications to simulate development history.
- Develop and refine tasks that represent intermediate milestones within these virtual companies. Define prompts, set evaluation standards, and ensure tasks are both solvable and fairly judged.
- Create assignments in isolated environments that mirror a developer's workstation. This setup includes a Linux machine with development tools; MCP servers for repositories, task tracking, messaging, and documentation; and a working web application codebase.
- Write tests that accept all valid solutions and reject incorrect ones, maintaining a fair but rigorous standard.
- Collaborate with AI agents to ensure tests surface real problems, filter out poor solutions, and confirm strong results.
- Review code generated by AI agents, analyze why it succeeded or failed, and design edge cases or adversarial scenarios to challenge model capabilities.
- Apply feedback from expert QA reviewers who check your work against established quality benchmarks.
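As an illustration of the testing responsibility above, here is a minimal Python sketch of a test that checks behavior rather than implementation. It assumes a hypothetical task in which the agent must implement a `slugify` function; the function name, the test cases, and the three candidate solutions are all invented for this example and are not taken from the actual project.

```python
import re
from typing import Callable


def check_slugify(slugify: Callable[[str], str]) -> bool:
    """Return True if the candidate meets the (hypothetical) behavioral contract."""
    cases = {
        "Hello World": "hello-world",
        "  Trim  me  ": "trim-me",  # repeated and surrounding whitespace
        "Already-slugged": "already-slugged",
        "": "",
    }
    try:
        return all(slugify(raw) == expected for raw, expected in cases.items())
    except Exception:
        # A candidate that crashes on any input is treated as a failure.
        return False


# Valid solution 1: regex-based.
def slugify_regex(text: str) -> str:
    return re.sub(r"\s+", "-", text.strip()).lower()


# Valid solution 2: split/join, a different implementation of the same behavior.
def slugify_split(text: str) -> str:
    return "-".join(text.lower().split())


# Incorrect solution: does not collapse repeated whitespace.
def slugify_buggy(text: str) -> str:
    return text.strip().lower().replace(" ", "-")


if __name__ == "__main__":
    for candidate in (slugify_regex, slugify_split, slugify_buggy):
        print(candidate.__name__, check_slugify(candidate))
```

Because the check looks only at inputs and outputs, both valid implementations pass even though they are written differently, while the buggy one is rejected on the repeated-whitespace case.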
What This Role Does Not Include
- Data labeling
- Prompt engineering
- Writing code from scratch (the AI agent handles most coding; your focus is on guidance and evaluation)
This work relies heavily on collaboration with AI systems. Building tasks that truly test advanced models requires direct interaction with these agents and thoughtful evaluation of their outputs.

