About the job
Please submit your CV in English and specify your English proficiency level.
Toloka AI, in partnership with Mindrift, offers contract-based project work for professionals interested in evaluating, testing, and improving AI systems. This is a freelance role, not a permanent staff position.
Role overview
The Freelance AI Evaluation Engineer will contribute to building a dataset focused on assessing AI coding agents. The main goal is to evaluate how these agents perform on real-world developer tasks.
Main responsibilities
- Set up virtual companies based on a strategic plan, including codebases, infrastructure, and supporting context such as conversations, documentation, and tickets, to create realistic development scenarios.
- Design and calibrate tasks representing different stages within the virtual company. This involves writing prompts, defining evaluation metrics, and ensuring tasks are solvable and fairly assessed.
- Create tasks in isolated environments that mimic a developer’s workstation. Use a Linux setup with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a real web application codebase.
- Develop tests that accept all valid solutions and reject incorrect ones. Balance strictness and flexibility so that good approaches are not penalized and flawed solutions do not slip through (see the sketch after this list).
- Work with an AI agent on test cases to ensure the tests catch real issues and do not miss poor solutions or unfairly block correct ones.
- Review code generated by AI agents, analyze reasons for success or failure, and create edge cases and adversarial scenarios to deepen evaluation.
- Refine your work based on feedback from expert QA reviewers to meet established quality standards.
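
As a rough illustration of what "balancing strictness and flexibility" can look like in practice, here is a minimal pytest sketch. The module, function name, and expected behavior are all hypothetical; the point is that assertions target observable behavior rather than one specific implementation, so any valid approach passes while flawed ones fail.

```python
# Hypothetical example: behavior-focused tests for a made-up
# normalize_username helper. The assertions describe required behavior,
# not a particular implementation (str methods, regex, etc. all pass).
import pytest

from app.users import normalize_username  # assumed module and function name


@pytest.mark.parametrize(
    "raw, expected",
    [
        ("  Alice ", "alice"),     # trims surrounding whitespace and lowercases
        ("BOB", "bob"),            # lowercases
        ("carol_01", "carol_01"),  # already-normalized input is unchanged
    ],
)
def test_normalization_behavior(raw, expected):
    # Any correct approach passes; a solution that skips trimming or
    # lowercasing fails.
    assert normalize_username(raw) == expected


def test_rejects_whitespace_only_input():
    # Guard against flawed solutions that silently return an empty string.
    with pytest.raises(ValueError):
        normalize_username("   ")
```

The design choice here is to pin down only the contract the task specifies, which keeps the test strict about outcomes but flexible about how the agent achieves them.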
Important notes
- This is not a data labeling position.
- This is not a prompt engineering role.
- Writing code from scratch is not required. The AI agent handles most coding; the focus is on guiding and evaluating its output.
Much of the work involves collaborating closely with AI systems. Crafting tasks that challenge advanced models means working alongside those same models throughout the process.

