About the job
Please submit your CV in English and include your English proficiency level.
This contract role supports toloka-ai through Mindrift, focusing on building and evaluating datasets that measure how well AI coding agents perform real-world developer tasks. The position is project-based and not a permanent employment opportunity.
Role overview
The Freelance AI Agent Evaluation Engineer creates and refines simulated development environments to test advanced AI models. The work centers on designing authentic scenarios that mirror the challenges developers face, ensuring that AI agents are thoroughly evaluated in realistic contexts.
What you will do
- Build virtual companies from scratch, including codebases, infrastructure, and supporting materials like documentation, conversations, and tickets to establish convincing development histories.
- Curate and refine tasks drawn from different stages of the virtual company, draft prompts, set evaluation criteria, and ensure tasks are solvable and assessed fairly.
- Design assignments within isolated environments that resemble a developer's workstation: a Linux machine equipped with development tools, MCP servers (repository, task tracker, messenger, documentation), and a working web application codebase.
- Write tests that accept all correct solutions while rejecting incorrect ones, carefully balancing strictness to avoid excluding valid approaches or allowing flawed solutions.
- Collaborate with AI agents to verify that tests catch real issues, accept valid solutions, and reject poor ones.
- Review code generated by AI agents, analyze factors behind success or failure, and design edge cases and adversarial scenarios to challenge the models.
- Iterate on your work based on feedback from expert QA reviewers who assess your output against quality standards.
What this role does not involve
- Data labeling
- Prompt engineering
- Writing code from scratch (the AI agent handles most coding; the focus here is on guidance and evaluation)
Collaboration with AI models is central to this work. Crafting tasks that truly test advanced models requires both technical expertise and creativity in using these systems as part of the process.
Location
Remote, Belo Horizonte, State of Minas Gerais, Brazil

