
Freelance AI Evaluation Engineer

toloka-ai, Remote (Israel)
Remote Contract, $40/hr


Qualifications

This position is ideal for experienced developers, software engineers, or test automation specialists looking for part-time, non-permanent projects. Preferred candidates will possess:

  • A degree in Computer Science, Software Engineering, or a related field.
  • A minimum of 5 years of experience in software development, with a strong emphasis on Python (FastAPI, pytest, async/await, subprocess, file operations).
  • A full-stack development background, including experience building React-based interfaces (JavaScript/TypeScript) and robust backend systems.
  • Proficiency in writing tests (functional and integration, not just executing them).
  • Familiarity with Docker containers and infrastructure tools (Postgres, Kafka, Redis).
  • An understanding of CI/CD practices (GitHub Actions: triggers, labels, interpreting results).
  • Fluency in English.

About the job

Please submit your CV in English and specify your English proficiency level.

toloka-ai, in partnership with Mindrift, offers contract-based project work for professionals interested in evaluating, testing, and improving AI systems. This is a freelance role, not a permanent staff position.

Role overview

The Freelance AI Evaluation Engineer will contribute to building a dataset focused on assessing AI coding agents. The main goal is to evaluate how these agents perform on real-world developer tasks.

Main responsibilities

  • Set up virtual companies using a strategic plan, including codebases, infrastructure, and supporting context such as conversations, documentation, and tickets to create realistic development scenarios.
  • Design and calibrate tasks representing different stages within the virtual company. This involves writing prompts, defining evaluation metrics, and ensuring tasks are solvable and fairly assessed.
  • Create tasks in isolated environments that mimic a developer’s workstation. Use a Linux setup with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a real web application codebase.
  • Develop tests that accept all valid solutions and reject incorrect ones. Balance strictness and flexibility to avoid penalizing good approaches or allowing flawed solutions.
  • Work with an AI agent on test cases to ensure the tests catch real issues and do not miss poor solutions or unfairly block correct ones.
  • Review code generated by AI agents, analyze reasons for success or failure, and create edge cases and adversarial scenarios to deepen evaluation.
  • Refine your work based on feedback from expert QA reviewers to meet established quality standards.
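To make the test-design responsibility above concrete, here is a minimal pytest-style sketch of the idea of accepting all valid solutions while rejecting incorrect ones: the checker validates properties of the output against the task specification rather than comparing to one canonical answer, so any correct implementation passes. The `slugify` task, `candidate_slugify`, and `check_slug` names are hypothetical illustrations, not part of the actual project tooling.

```python
import re

def candidate_slugify(title: str) -> str:
    # Stand-in for code produced by an AI agent under evaluation.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def check_slug(title: str, slug: str) -> bool:
    """Accept any slug meeting the spec, not one exact expected string."""
    # Format: lowercase alphanumeric words separated by single hyphens.
    if not re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", slug):
        return False
    # Behavior: the title's alphanumeric words must survive, in order.
    return slug.split("-") == re.findall(r"[a-z0-9]+", title.lower())

def test_candidate_slug_is_valid():
    assert check_slug("Hello, World!", candidate_slugify("Hello, World!"))

def test_checker_rejects_flawed_output():
    assert not check_slug("Hello, World!", "Hello-World")  # wrong case
```

Strictness and flexibility are balanced here by pinning down observable behavior (format and word order) while leaving the implementation strategy entirely open, which is the balance the role asks for.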

Important notes

  • This is not a data labeling position.
  • This is not a prompt engineering role.
  • Writing code from scratch is not required. The AI agent handles most coding; the focus is on guiding and evaluating its output.

Much of the work involves collaborating closely with AI systems. Crafting tasks that challenge advanced models means working alongside those same models throughout the process.

About toloka-ai

Mindrift is at the forefront of connecting talented specialists with innovative AI projects aimed at enhancing the capabilities of leading tech companies. Our focus is on fostering a collaborative environment where professionals can contribute to the advancement of AI technologies.
