About the role
Please submit your CV in English and indicate your English proficiency level.
Toloka AI, working with Mindrift, offers freelance, project-based roles that connect experienced professionals to AI-driven projects for major technology companies. This contract position is not permanent employment.
Role overview
This freelance AI Evaluation Engineer role centers on building datasets to assess AI-powered coding agents in realistic software development scenarios. The work involves designing complex tasks and assessment criteria that reflect actual development workflows, all within simulated environments.
Main responsibilities
- Create virtual companies using a defined strategy. Develop a codebase, infrastructure, and contextual materials, such as documentation, conversations, and tickets, that mirror real development histories.
- Design and configure tasks at various stages of the virtual company. Write prompts, set evaluation standards, and ensure tasks are solvable and fairly assessed.
- Set up tasks in isolated environments that simulate a developer's workstation. These environments include a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a real web application codebase.
- Develop tests that reliably accept all correct solutions and reject incorrect ones, balancing strictness with flexibility.
- Collaborate with an AI agent to test and validate assessments, confirming the agent can spot genuine issues and does not overlook valid solutions.
- Review code generated by AI agents, analyze both successes and failures, and create edge cases or challenging scenarios to further evaluate capabilities.
- Incorporate feedback from expert QA reviewers to refine tasks and assessments, aligning with quality benchmarks.
What this role is not
- This is not a data labeling job.
- This is not a prompt engineering position.
- This does not require writing code from scratch. The AI agent handles most coding tasks; your focus is on guidance and evaluation.
Much of the work involves direct collaboration with advanced AI systems. Designing meaningful challenges for these models requires hands-on interaction with them.

