Freelance AI Agent Evaluation Engineer

toloka-ai
Remote contract, Uruguay
$21/hr


Experience Level

Mid to Senior

Qualifications

What we are looking for

This role is ideal for seasoned developers, software engineers, or test automation specialists open to part-time, non-permanent roles. Preferred candidates will possess:

  • A degree in Computer Science, Software Engineering, or a related discipline.
  • Over 5 years of experience in software development, primarily in Python (including FastAPI, pytest, async/await, subprocess, and file operations).
  • A background in full-stack development, with experience building React-based interfaces (JavaScript/TypeScript) and robust backend systems.
  • Experience writing tests (functional and integration, not just executing them).
  • Familiarity with Docker containers and infrastructure tools (Postgres, Kafka, Redis).
  • Understanding of CI/CD processes (GitHub Actions: triggers, labels, reading results).
  • Proficiency in English.

About the job

Please submit your CV in English and include your English proficiency level.

This freelance, project-based contract with toloka-ai is remote and open to candidates based in Uruguay. Mindrift connects skilled professionals with project-based AI roles at leading tech companies, with a focus on evaluating and improving AI systems. This is not a permanent position.

Role overview

The Freelance AI Agent Evaluation Engineer builds datasets to measure how well AI coding agents perform real-world software development tasks. The work centers on designing complex tasks and evaluation criteria inside detailed simulated environments.

What you will do

  • Create virtual companies from high-level blueprints, including realistic codebases, infrastructure, and context like conversations, documentation, and tickets to simulate authentic development environments with history.
  • Curate and adjust tasks at different stages of the virtual company. This includes developing prompts, defining evaluation criteria, and ensuring tasks are solvable and fairly assessed.
  • Design challenges in isolated settings that mimic a developer's workstation: a Linux environment with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a real web application codebase.
  • Develop tests that reliably accept all correct solutions and reject incorrect ones, aiming for a balance between strictness and fairness (see the sketch after this list).
  • Work alongside an AI agent on these tests, ensuring the agent catches real issues, does not accept poor solutions, and passes valid ones.
  • Review code generated by AI agents, analyze both successes and failures, and design edge cases and adversarial scenarios to further challenge the models.
  • Iterate on your approach based on feedback from expert QA reviewers who assess your work for quality.
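
To give a concrete flavor of the test-design work described above, here is a minimal, hypothetical pytest sketch of a behavior-based acceptance check; the slugify function, its module path, and its expected outputs are illustrative assumptions, not part of any actual project materials.

    # Minimal, hypothetical sketch of a behavior-based acceptance test.
    # "slugify" and its module path are assumed for illustration; real tasks
    # would target an agent's changes to a full web application codebase.
    import re

    import pytest

    from myapp.text_utils import slugify  # assumed example module


    @pytest.mark.parametrize(
        "raw, expected",
        [
            ("Hello, World!", "hello-world"),    # punctuation stripped
            ("  spaced   out  ", "spaced-out"),  # whitespace collapsed
            ("", ""),                            # edge case: empty input
        ],
    )
    def test_slugify_known_inputs(raw, expected):
        # Accept any implementation that produces the agreed-upon output.
        assert slugify(raw) == expected


    def test_slugify_output_is_url_safe():
        # Property-style check: reject solutions that leak unsafe characters,
        # without constraining how the function is implemented internally.
        assert re.fullmatch(r"[a-z0-9\-]*", slugify("Árbol & Río #42"))

The point of checks like these is the balance mentioned above: strict enough to reject wrong or unsafe output, but loose enough that any correct implementation passes.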

What this role does not include

  • Data labeling
  • Prompt engineering
  • Writing code from scratch (the AI agent will handle most coding; your focus is on guiding and evaluating)

This role involves close collaboration with advanced AI models, crafting tasks that push their capabilities and evaluating their performance in realistic scenarios.

About toloka-ai

Mindrift is dedicated to bridging the gap between skilled professionals and innovative AI projects across top technology firms. Our focus lies in the evaluation and enhancement of AI systems through specialized project engagements.
