Position has been filled

Freelance AI Evaluation Engineer

toloka-ai
Remote (Porto, Porto District, Portugal)
Remote, part-time, $30/hr

Experience Level

Mid to Senior

Qualifications

Desired Qualifications

This position is ideal for seasoned developers, software engineers, or test automation specialists looking for part-time, non-permanent projects. Preferred candidates will possess:

  • A degree in Computer Science, Software Engineering, or a related field.
  • Over 5 years of experience in software development, particularly in Python (FastAPI, pytest, async/await, subprocess, file I/O).
  • A background in full-stack development, with experience building React-based interfaces (JavaScript/TypeScript) and robust back-end systems.
  • Experience writing functional and integration tests, beyond just executing them.
  • Familiarity with Docker containers and infrastructure tools (Postgres, Kafka, Redis).
  • Understanding of CI/CD practices (GitHub Actions: triggers, labels, interpreting results).
  • Proficiency in English.

About the role

Please submit your CV in English and indicate your English proficiency level.

Mindrift matches experienced professionals with project-based AI assignments for leading technology companies. This Freelance AI Evaluation Engineer position is remote, based in Porto (Portugal), and offered on a project basis rather than as a permanent job.

Role overview

This role focuses on building a dataset to evaluate AI coding agents using real-world developer tasks. The main goal is to design tasks and evaluation standards that reflect actual software development work.

What you will do

  • Create simulated companies from high-level plans, including codebases, infrastructure, and realistic context such as documentation, tickets, and conversations to mimic real development histories.
  • Develop and refine tasks for different phases of these virtual companies: draft prompts, set evaluation standards, and confirm that tasks are both achievable and fairly assessed.
  • Design assignments inside isolated environments that resemble a developer’s workstation, including a Linux setup with development tools, MCP servers (for repositories, task tracking, messaging, and documentation), and a functioning web application codebase.
  • Write tests that accept all valid solutions while rejecting incorrect ones, making sure tests are neither overly strict nor too lenient.
  • Collaborate with an AI agent to check that tests catch real problems, avoid missing errors, and do not penalize correct submissions.
  • Review code generated by AI agents, analyze the causes of their successes or failures, and create edge cases or challenging examples.
  • Revise your work based on feedback from expert QA reviewers who evaluate your output against quality standards.

What this role does not include

  • Data labeling
  • Prompt engineering
  • Writing code from scratch (the AI agent handles most coding; your focus is on guidance and evaluation)

Direct collaboration with AI models is a key part of this work, since developing challenging tasks for advanced systems means working closely with those same models.

About toloka-ai

toloka-ai connects skilled professionals with AI projects, where they help improve and evaluate AI systems in collaboration with top tech firms.
