
Freelance AI Evaluation Engineer

Toloka AI — Remote, Sweden

Remote Contract · $50/hr


Desired Qualifications

This opportunity is ideal for seasoned developers, software engineers, and test automation specialists interested in part-time, project-based engagements. The ideal candidate will possess:

  • A degree in Computer Science, Software Engineering, or a related field
  • 5+ years of experience in software development, primarily using Python (FastAPI, pytest, async/await, subprocess, file operations)
  • A background in full-stack development, including experience building React-based interfaces (JavaScript/TypeScript) and robust back-end systems
  • Proficiency in writing tests (functional and integration, not just executing them)
  • Familiarity with Docker containers and infrastructure tools (Postgres, Kafka, Redis)
  • An understanding of CI/CD (GitHub Actions as a user: triggers, labels, reading results)
  • Strong English proficiency

About the job

Please submit your CV in English and include your English proficiency level.

Project Overview

Toloka AI, in partnership with Mindrift, offers project-based freelance roles focused on testing and evaluating AI systems. As a Freelance AI Evaluation Engineer, you will build and assess datasets that measure how well AI coding agents perform on tasks similar to those faced by real-world developers. This is a contract-based position and does not lead to permanent employment.

Key Responsibilities

  • Design virtual companies from broad outlines, creating codebases, infrastructure, and realistic supporting materials such as documentation, tickets, and internal communications to simulate development history.
  • Develop and refine tasks that represent intermediate milestones within these virtual companies. Define prompts, set evaluation standards, and ensure tasks are both solvable and fairly judged.
  • Create assignments in isolated environments that mirror a developer's workstation. This setup includes a Linux machine with development tools, MCP servers for repositories, task tracking, messaging, documentation, and a working web application codebase.
  • Write tests that accept all valid solutions and reject incorrect ones, maintaining a fair but rigorous standard.
  • Collaborate with AI agents to ensure tests surface real problems, filter out poor solutions, and confirm strong results.
  • Review code generated by AI agents, analyze success or failure reasons, and design edge cases or adversarial scenarios to challenge model capabilities.
  • Apply feedback from expert QA reviewers who check your work against established quality benchmarks.
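To illustrate the test-writing responsibility above, here is a minimal, hypothetical pytest sketch. The `slugify` task and its spec are invented for illustration; the point is that the test asserts on required behavior, so it accepts any valid implementation an AI agent might submit while rejecting incorrect ones.

```python
import re

def slugify(title: str) -> str:
    # Stand-in for the AI agent's submission under evaluation.
    # Spec (hypothetical): lowercase, words joined by single hyphens,
    # no leading/trailing hyphens, non-alphanumerics dropped.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_slugify_accepts_any_valid_solution():
    slug = slugify("Hello, World!")
    # Check the spec, not one exact implementation detail:
    # output must be a well-formed slug containing both words.
    assert re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", slug)
    assert "hello" in slug and "world" in slug

def test_slugify_rejects_degenerate_input():
    # Pure punctuation carries no slug content; empty string is required.
    assert slugify("!!!") == ""
```

A test like this would pass for many distinct correct implementations (different regexes, manual loops) but fail for solutions that, for example, leave trailing hyphens or preserve case, which is the "fair but rigorous" standard the role describes.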

What This Role Does Not Include

  • Data labeling
  • Prompt engineering
  • Writing code from scratch (the AI agent handles most coding; your focus is on guidance and evaluation)

This work relies heavily on collaboration with AI systems. Building tasks that truly test advanced models requires direct interaction with these agents and thoughtful evaluation of their outputs.

About Toloka AI

Toloka AI is at the forefront of advancing AI technology, connecting experts with dynamic projects that push the boundaries of what AI can achieve. We specialize in enhancing AI systems through rigorous testing and evaluation, contributing to the development of smarter, more efficient technologies in the industry.
