About the role
Please submit your CV in English and include your English proficiency level.
This freelance, project-based role with toloka-ai (via Mindrift) focuses on evaluating AI coding agents for leading technology companies. The position is remote and open to candidates based in Glasgow, Scotland, United Kingdom.
Role overview
The Freelance AI Evaluation Engineer will contribute to building a dataset that measures how well AI models perform realistic developer tasks. The work includes designing technical challenges, setting up evaluation frameworks, and operating within simulated development environments.
Main responsibilities
- Create virtual companies based on high-level plans, building out codebases, infrastructure, and realistic context such as documentation, tickets, and team conversations.
- Develop and calibrate tasks reflecting various stages of a virtual company’s lifecycle: write prompts, define evaluation methods, and ensure tasks are fair and solvable.
- Design challenges in isolated environments that mirror a developer’s daily setup, including a Linux machine with development tools, MCP servers (repository, task tracker, messenger, documentation), and a working web application codebase.
- Write tests that reliably accept all valid solutions and reject incorrect ones, balancing strictness and flexibility to avoid false positives or negatives.
- Work with AI agents to confirm that tests detect real issues, do not overlook flawed solutions, and do not fail on correct answers.
- Review AI-generated code, analyze results, and design edge cases or adversarial scenarios to identify model weaknesses.
- Refine tasks and evaluation criteria in response to expert QA feedback to meet quality standards.
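To make the test-design responsibility concrete, here is a minimal illustrative sketch (the `slugify` task and all function names are hypothetical, not part of the project): a behavior-based check accepts two different but equally valid solutions, where a brittle exact-match assertion would wrongly reject one of them.

```python
import re

# Hypothetical task (for illustration only): the agent must implement
# slugify(title). Two different but equally valid solutions:
def slug_keep_letters(title: str) -> str:
    # Replaces every non-alphanumeric run with a hyphen: "Don't" -> "don-t".
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def slug_drop_apostrophes(title: str) -> str:
    # Removes apostrophes first: "Don't" -> "dont".
    return re.sub(r"[^a-z0-9]+", "-", title.lower().replace("'", "")).strip("-")

def is_valid_slug(slug: str) -> bool:
    # Behavior-based check: lowercase alphanumeric words joined by single
    # hyphens, no leading or trailing hyphen. Accepts any solution with
    # this shape instead of pinning one exact output string.
    return re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", slug) is not None

title = "Don't Panic"
# A brittle test (assert slug == "don-t-panic") would wrongly reject the
# second solution; the behavioral check accepts both valid outputs while
# still rejecting malformed ones.
print(is_valid_slug(slug_keep_letters(title)))      # True
print(is_valid_slug(slug_drop_apostrophes(title)))  # True
print(is_valid_slug("Don't-Panic"))                 # False
```

The design choice this illustrates: validating observable properties of the output rather than one reference string is what keeps a test strict enough to catch flawed solutions yet flexible enough to avoid false negatives on correct ones.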
What this role does not involve
- No data labeling.
- No prompt engineering.
- No writing code from scratch; the AI agent generates most of the code, while your focus is on guidance and evaluation.
This project requires close collaboration with AI systems. The main challenge is to design tasks that push advanced models, using creativity and technical insight, without relying on those same models to generate the tasks themselves.
