About the job
Please submit your CV in English and include your English proficiency level.
This project-based contract with toloka-ai connects engineers to AI evaluation work for leading technology companies. The focus is on testing, assessment, and improvement of AI systems. This is not a permanent position.
Role overview
The Freelance AI Evaluation Engineer builds realistic virtual companies, complete with codebases, infrastructure, and supporting context such as documentation, conversations, and tickets. These environments simulate authentic development settings for AI systems to operate within.
What you will do
- Design simulated companies from a strategic plan down to their codebases and infrastructure.
- Develop and refine tasks within these environments, setting clear prompts and evaluation metrics to ensure tasks are solvable and fairly assessed.
- Set up isolated developer workstations, configuring Linux machines with development tools, repositories, task trackers, messaging platforms, and real web application codebases.
- Create tests that accept all valid solutions and reject incorrect ones, maintaining a careful balance so that correct approaches are not blocked and flawed ones are not let through (see the sketch after this list).
- Iterate with AI agents during testing, ensuring evaluations catch real issues without overlooking mistakes or incorrectly flagging correct solutions.
- Review AI-generated code, analyze agent performance, and design edge cases and adversarial scenarios to strengthen evaluation processes.
- Incorporate feedback from expert QA reviewers, refining deliverables to meet quality standards.
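
To illustrate the test-design balance described above, here is a minimal, hypothetical Python sketch using pytest. The `UserStore` class and its `create_user`/`get_user` methods are invented stand-ins for a task's codebase, not anything from this posting. The idea: assert on observable behavior so any valid implementation passes, and include negative cases so flawed ones fail.

```python
# Hypothetical sketch only: UserStore is an invented stand-in for a
# task's real codebase. In practice the AI agent's solution, not this
# stub, would be the code under test.
import pytest


class UserStore:
    def __init__(self):
        self._users = {}

    def create_user(self, name: str, email: str) -> int:
        # Reject duplicate emails so the negative test below has teeth.
        if any(u["email"] == email for u in self._users.values()):
            raise ValueError("duplicate email")
        user_id = len(self._users) + 1
        self._users[user_id] = {"name": name, "email": email}
        return user_id

    def get_user(self, user_id: int) -> dict:
        return self._users[user_id]


def test_created_user_is_retrievable():
    # Behavior-level assertion: any correct implementation passes,
    # whatever data structures or framework it uses internally, so
    # valid-but-different approaches are not blocked.
    store = UserStore()
    user_id = store.create_user(name="Ada", email="ada@example.com")
    assert store.get_user(user_id)["email"] == "ada@example.com"


def test_duplicate_email_is_rejected():
    # Negative case: without this check, a permissive test suite would
    # accept a flawed solution that allows duplicate accounts.
    store = UserStore()
    store.create_user(name="Ada", email="ada@example.com")
    with pytest.raises(ValueError):
        store.create_user(name="Imposter", email="ada@example.com")
```

The positive test keeps valid solutions from being blocked; the negative test keeps flawed ones from slipping through.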
What this role does not include
- Data labeling
- Prompt engineering
- Writing code from scratch (the AI agent handles most code generation; your focus is on guidance and evaluation)
This role centers on collaborating with advanced AI systems. Much of the work involves designing and refining tasks that challenge these models, requiring close interaction with AI agents throughout the process.

