About the job
Please submit your CV in English and indicate your English proficiency level.
Mindrift connects experienced specialists with project-based AI work for technology companies. Assignments focus on testing, evaluating, and improving AI systems. This freelance, project-based position does not offer permanent employment.
Role overview
As a Freelance AI Evaluation Engineer, you will focus primarily on building a dataset that assesses AI coding agents on real-world developer tasks. The work involves designing detailed tasks and evaluation methods in realistic simulated environments.
Main responsibilities
- Create virtual companies from high-level plans, including codebases, infrastructure, and realistic context such as conversations, documentation, and tickets that reflect authentic development history.
- Develop and refine tasks for different stages of the virtual company. This includes writing prompts, setting evaluation criteria, and ensuring tasks are solvable and assessments are fair.
- Design assignments for isolated environments that mimic a developer's workstation: a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a real web application codebase.
- Build tests that accept all valid solutions and reject incorrect ones, so that they are neither too strict nor too lenient.
- Work with an AI agent to confirm that tests detect real issues, do not overlook errors, and validate correct solutions.
- Review code generated by agents, analyze why solutions succeed or fail, and invent edge cases and adversarial scenarios.
- Incorporate feedback from expert QA reviewers to improve your work and meet quality standards.
Scope clarifications
- This position does not include data labeling.
- This position does not cover prompt engineering.
- Writing code from scratch is not required. The AI agent handles most coding; your focus is on guidance and evaluation.
Much of the work involves collaborating directly with AI systems, as designing challenges for advanced models requires hands-on interaction with those models.

