
Freelance AI Agent Evaluation Engineer

toloka-ai — Pretoria, Gauteng, South Africa
Remote Contract — $24/hr





Qualifications

This opportunity is ideal for experienced developers, software engineers, and test automation specialists seeking part-time, non-permanent projects. Ideal candidates will possess:

  • A degree in Computer Science, Software Engineering, or a related field.
  • A minimum of 5 years of experience in software development, particularly with Python (FastAPI, pytest, async/await, subprocess, file operations).
  • A background in full-stack development, with experience building React-based interfaces (JavaScript/TypeScript) and robust back-end systems.
  • Experience writing tests (functional, integration — not merely executing them).
  • Familiarity with Docker containers and infrastructure tools (Postgres, Kafka, Redis).
  • An understanding of CI/CD (GitHub Actions as a user: triggers, labels, result interpretation).
  • Proficiency in English.

About the job

Please submit your CV in English and indicate your language proficiency.

Mindrift connects skilled professionals with project-based AI roles at leading technology companies. This freelance position is remote and based in Pretoria, Gauteng, South Africa. The work is project-based and does not constitute permanent employment.

Role overview

The Freelance AI Agent Evaluation Engineer will help build a dataset to assess AI coding agents. The main focus is evaluating how these agents perform on practical developer tasks. This involves designing complex assignments and creating fair evaluation criteria within simulated environments that reflect real-life development settings.

Main responsibilities

  • Create virtual companies according to a strategic plan, including setting up codebases, infrastructure, and realistic context such as conversations, documentation, and tickets to simulate a development history.
  • Develop and refine tasks based on the evolving state of these virtual companies. Draft prompts, define evaluation criteria, and ensure tasks are solvable and fairly assessed.
  • Design assignments for isolated environments that mimic a developer’s workstation: a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a real web application codebase.
  • Write tests that accept all valid solutions and reject incorrect ones. Find the right balance between strictness and leniency to ensure good approaches are not penalized and weak solutions do not pass.
  • Work with AI agents on test cases, making sure tests uncover genuine issues, do not miss faulty solutions, and properly validate successful ones.
  • Review code produced by AI agents, analyze reasons for success or failure, and design edge cases and adversarial scenarios.
  • Iterate on your work based on feedback from expert QA reviewers who check your output against quality standards.
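As a rough illustration of the test-writing responsibility above, here is a minimal pytest-style sketch (the `slugify` helper and its cases are hypothetical, not taken from any actual project). The idea is to assert on properties that every correct solution must satisfy, rather than pinning tests to one specific implementation, so good approaches pass while weak ones are caught by edge cases:

```python
# Hypothetical example: behavior-based tests for a slugify() helper.
# In practice, the implementation under test would come from the AI agent;
# this stand-in is included only so the sketch is self-contained.

import re

def slugify(title: str) -> str:
    # Stand-in implementation of the function being evaluated.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_accepts_any_valid_slug():
    # Assert on properties every correct solution shares
    # (lowercase, hyphen-separated, keeps the words),
    # not on one hard-coded expected string.
    slug = slugify("Hello, World!")
    assert re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", slug)
    assert "hello" in slug and "world" in slug

def test_rejects_weak_solutions():
    # Edge cases that a naive approach (e.g. a bare
    # .replace(" ", "-")) would fail.
    assert slugify("  --  ") == ""                      # punctuation-only input
    assert slugify("Python 3 & pytest") == "python-3-pytest"
```

Tests like the second one are where the strictness/leniency balance is decided: each added edge case narrows what counts as a passing solution, so every case should correspond to a genuine defect rather than a stylistic preference.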

What this role does not cover

  • Data labeling
  • Prompt engineering
  • Writing code from scratch (the AI agent generates most code; your focus is on guidance and evaluation)

Much of the work involves collaborating with AI systems. Creating tasks that challenge advanced models means working closely with these agents.

About toloka-ai

toloka-ai is a forward-thinking company that connects skilled professionals with innovative AI projects, emphasizing collaboration to enhance AI systems for top tech firms.
