About the job
Please submit your CV in English and specify your English proficiency level.
Mindrift connects skilled professionals with contract-based AI projects from leading technology companies. This freelance role centers on evaluating, testing, and refining AI systems. It is a contract position only, not a path to permanent employment.
Role overview
This project focuses on building a dataset to assess how well AI coding agents perform real-world software development tasks. The work requires designing realistic challenges and evaluation methods within simulated developer environments.
Key responsibilities
- Create virtual companies from high-level plans, building out codebases, infrastructure, and context such as conversations, documentation, and tickets to simulate authentic development histories.
- Develop and refine tasks for different stages of the virtual company, including writing prompts, setting evaluation criteria, and ensuring tasks are solvable and fairly assessed.
- Design assignments in isolated environments that closely resemble a developer’s workstation, including a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a live web application codebase.
- Implement tests that accept all correct solutions and reject incorrect ones, carefully balancing strictness and leniency.
- Collaborate with an AI agent during test runs to confirm that your tests catch real issues: they should flag faulty solutions without penalizing correct ones.
- Review code generated by AI agents, analyze reasons for success or failure, and design edge cases and adversarial scenarios to expose weaknesses.
- Iterate on your work based on feedback from expert QA reviewers who will check your outputs against quality standards.
What this role does not involve
- Data labeling
- Prompt engineering
- Writing code from scratch (the AI agent generates most of the code; your focus is on guiding and evaluating)
Much of the work involves close collaboration with AI models, because designing tasks that challenge advanced systems requires testing those tasks against the systems themselves.

