About the job
Please submit your CV in English and include your English proficiency level.
Mindrift offers project-based freelance roles for specialists interested in AI system evaluation. This contract position focuses on assessing and improving AI coding agents for technology clients. The role is remote, with a preference for candidates based in Hyderabad, Telangana, India. Please note: this is a freelance contract, not a permanent position.
Role overview
The Freelance AI Agent Evaluation Engineer works on building datasets to measure how well AI coding agents handle realistic developer tasks. The position centers on creating and refining simulated development environments and evaluating model performance in those settings.
What you will do
- Set up virtual companies using detailed plans, including codebases, infrastructure, and supporting materials (documentation, tickets, conversations) to mirror real-world development environments.
- Design and adapt tasks as these virtual companies evolve: write prompts, define fair evaluation criteria, and ensure tasks are solvable and judged objectively.
- Create assignments within isolated environments that simulate a developer’s workstation, including a Linux machine with development tools, MCP servers (repository, task tracker, messenger, documentation), and a real web application codebase.
- Develop tests that accept all valid solutions and reject incorrect ones, ensuring the tests are neither too strict nor too lenient.
- Collaborate with an AI agent to verify that tests catch genuine defects without letting faulty solutions slip through or penalizing correct ones.
- Review agent-generated code, analyze agent performance, and design edge cases and adversarial scenarios to further challenge the models.
- Incorporate feedback from expert QA reviewers to refine your work until it meets quality standards.
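To illustrate the testing duties above: "accept all valid solutions" usually means checking the *property* a solution must satisfy rather than comparing against one expected answer. The sketch below is a hypothetical example (the task, function names, and data are invented for illustration, not taken from Mindrift's actual environments): a task asks an agent to return any pair of distinct indices whose values sum to a target, so the checker validates the returned pair directly and accepts every correct strategy.

```python
# Hypothetical grading check for an invented agent task:
# "find_pair(nums, target) returns indices (i, j), i != j,
# with nums[i] + nums[j] == target". Several index pairs may
# be valid, so the test checks the property, not one answer.

def check_solution(find_pair, nums, target):
    result = find_pair(nums, target)
    if result is None:                  # reject missing answers
        return False
    i, j = result
    if i == j:                          # reject degenerate pairs
        return False
    if not (0 <= i < len(nums) and 0 <= j < len(nums)):
        return False                    # reject out-of-range indices
    return nums[i] + nums[j] == target  # accept any valid pair

# Two different but equally correct agent solutions both pass:
def agent_brute_force(nums, target):
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None

def agent_hash_map(nums, target):
    seen = {}
    for j, x in enumerate(nums):
        if target - x in seen:
            return (seen[target - x], j)
        seen[x] = j
    return None

assert check_solution(agent_brute_force, [3, 1, 4, 1], 5)
assert check_solution(agent_hash_map, [3, 1, 4, 1], 5)
# A wrong solution (same index twice) is rejected:
assert not check_solution(lambda n, t: (0, 0), [3, 1, 4, 1], 5)
```

A checker like this is "neither too strict nor too lenient": it never rejects an unconventional but correct implementation, yet still fails solutions that violate the task's constraints.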
What this role is not
- This is not a data labeling position.
- This is not prompt engineering.
- You will not write code from scratch; the AI agent produces most of the code. The main focus is on guidance and evaluation.
Much of the work involves collaborating closely with AI systems. Creating tasks that challenge advanced models requires direct interaction with these agents.

