About the job
Please submit your CV in English and indicate your language proficiency.
Mindrift connects skilled professionals with project-based AI roles at leading technology companies. This freelance position is remote, open to candidates based in Pretoria, Gauteng, South Africa, and does not constitute permanent employment.
Role overview
The Freelance AI Agent Evaluation Engineer will help build a dataset to assess AI coding agents. The main focus is evaluating how these agents perform on practical developer tasks. This involves designing complex assignments and creating fair evaluation criteria within simulated environments that reflect real-life development settings.
Main responsibilities
- Create virtual companies according to a strategic plan, including setting up codebases, infrastructure, and realistic context such as conversations, documentation, and tickets to simulate a development history.
- Develop and refine tasks based on the evolving state of these virtual companies. Draft prompts, define evaluation criteria, and ensure tasks are solvable and fairly assessed.
- Design assignments for isolated environments that mimic a developer’s workstation: a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a real web application codebase.
- Write tests that accept all valid solutions and reject incorrect ones. Find the right balance between strictness and leniency to ensure good approaches are not penalized and weak solutions do not pass.
- Work with AI agents on test cases, making sure tests uncover genuine issues, do not miss faulty solutions, and properly validate successful ones.
- Review code produced by AI agents, analyze reasons for success or failure, and design edge cases and adversarial scenarios.
- Iterate on your work based on feedback from expert QA reviewers who check your output against quality standards.
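To illustrate the test-balance point in the responsibilities above, here is a minimal, hypothetical sketch in Python. The task name (`slugify`), the reference solution, and the checks are all invented for illustration; the idea is that property-style assertions accept any correct implementation while still rejecting weak ones that miss edge cases.

```python
# Hypothetical example: behavioral tests for a "slugify" task.
# Property-style checks are lenient on implementation style but
# strict on correctness, so good approaches are not penalized
# and weak solutions do not pass.

import re

def slugify_reference(title: str) -> str:
    """One valid solution among many; the tests must not assume it."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def check_solution(slugify):
    # Core behavior every valid solution must satisfy.
    assert slugify("Hello World") == "hello-world"
    # Properties rather than exact strings, so stylistic variation passes.
    out = slugify("  Multiple   Spaces & Symbols!  ")
    assert out == out.lower()
    assert " " not in out
    assert not out.startswith("-") and not out.endswith("-")
    # Idempotence: slugifying a slug changes nothing.
    assert slugify(out) == out

# A valid implementation passes all checks.
check_solution(slugify_reference)

def weak_solution(title: str) -> str:
    # Naive replacement that mishandles leading whitespace and symbols.
    return title.lower().replace(" ", "-")

# The same checks reject the weak solution.
try:
    check_solution(weak_solution)
    rejected = False
except AssertionError:
    rejected = True
```

In practice the evaluation harness would run such checks against agent-generated code inside the isolated environment, but the balance shown here, exact assertions for required behavior plus property checks for everything else, is the core design question.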
What this role does not cover
- Data labeling
- Prompt engineering
- Writing code from scratch (the AI agent generates most code; your focus is on guidance and evaluation)
Much of the work involves collaborating with AI systems: creating tasks that challenge advanced models requires working closely with these agents throughout the process.

