About the role
Please submit your CV in English and indicate your English proficiency level.
Mindrift matches experienced professionals with project-based AI assignments for leading technology companies. This Freelance AI Evaluation Engineer position is remote, based in Porto (Portugal), and offered on a project basis rather than as a permanent job.
Role overview
This role focuses on building a dataset to evaluate AI coding agents using real-world developer tasks. The main goal is to design tasks and evaluation standards that reflect actual software development work.
What you will do
- Create simulated companies from high-level plans, including codebases, infrastructure, and realistic context such as documentation, tickets, and conversations to mimic real development histories.
- Develop and refine tasks for different phases of these virtual companies: draft prompts, set evaluation standards, and confirm that tasks are both achievable and fairly assessed.
- Design assignments inside isolated environments that resemble a developer’s workstation, including a Linux setup with development tools, MCP servers (for repositories, task tracking, messaging, and documentation), and a functioning web application codebase.
- Write tests that accept every valid solution and reject incorrect ones, so that they are neither too strict nor too lenient.
- Collaborate with an AI agent to confirm that tests catch real defects and do not penalize correct submissions.
- Review code generated by AI agents, analyze the causes of their successes or failures, and create edge cases or challenging examples.
- Revise your work based on feedback from expert QA reviewers who evaluate your output against quality standards.
What this role does not include
- Data labeling
- Prompt engineering
- Writing code from scratch (the AI agent handles most of the coding; your focus is on guidance and evaluation)
Direct collaboration with AI models is a key part of this work, since developing challenging tasks for advanced systems means working closely with those same models.
