Position has been filled

Freelance AI Evaluation Engineer

Mindrift · Remote — Stuttgart, Baden-Württemberg, Germany
Remote Contract · $50/hr

Experience Level

Mid to Senior

Qualifications

This role is well-suited for experienced developers, software engineers, or test automation specialists seeking part-time, non-permanent project engagements. Ideal candidates will possess:

  • A degree in Computer Science, Software Engineering, or a related discipline.
  • Over 5 years of experience in software development, predominantly in Python (FastAPI, pytest, async/await, subprocess, file operations).
  • A background in full-stack development, including building React-based interfaces (JavaScript/TypeScript) and robust back-end systems.
  • Proficiency in writing tests (functional and integration, not merely executing them).
  • Experience with Docker containers and familiarity with infrastructure tools (Postgres, Kafka, Redis).
  • An understanding of CI/CD processes (specifically GitHub Actions: triggers, labels, and result interpretation).
  • English proficiency at a professional level.

About the role

Please submit your CV in English and indicate your English proficiency level.

Mindrift connects experienced specialists with project-based AI work for technology companies. Assignments focus on testing, evaluating, and improving AI systems. This freelance, project-based position does not offer permanent employment.

Role overview

As a Freelance AI Evaluation Engineer, the primary focus is building a dataset to assess AI coding agents using real-world developer tasks. The work involves designing detailed tasks and evaluation methods in realistic simulated environments.

Main responsibilities

  • Create virtual companies from high-level plans, including codebases, infrastructure, and realistic context such as conversations, documentation, and tickets that reflect authentic development history.
  • Develop and refine tasks for different stages of the virtual company. This includes writing prompts, setting evaluation criteria, and ensuring tasks are solvable and assessments are fair.
  • Design assignments for isolated environments that mimic a developer's workstation, using a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation), and a real web application codebase.
  • Build tests that accept all valid solutions and reject incorrect ones, aiming for balanced strictness.
  • Work with an AI agent to confirm that tests detect real issues, do not overlook errors, and validate correct solutions.
  • Review code generated by agents, analyze why solutions succeed or fail, and invent edge cases and adversarial scenarios.
  • Incorporate feedback from expert QA reviewers to improve your work and meet quality standards.
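The "balanced strictness" goal above can be illustrated with a small property-style check. This is only a sketch with made-up names (`dedupe`, `check_dedupe`): rather than comparing an agent's output against one hard-coded answer, the check asserts properties that every valid solution must satisfy, so it accepts all correct implementations while still rejecting wrong ones.

```python
# Hypothetical sketch of a property-style evaluation test.
# "dedupe" stands in for an agent-written solution under review.

def dedupe(items):
    # One possible correct solution: order-preserving de-duplication.
    return list(dict.fromkeys(items))

def check_dedupe(fn):
    """Accept any correct de-duplication; reject incorrect ones."""
    data = [3, 1, 3, 2, 1]
    out = fn(data)
    # Property 1: output contains exactly the distinct input elements.
    assert sorted(out) == sorted(set(data)), "wrong element set"
    # Property 2: no element appears twice in the output.
    assert len(out) == len(set(out)), "duplicates remain"

check_dedupe(dedupe)
print("checks passed")
```

A test written this way would also pass for a solution that returns the elements in a different order, which is the point: the evaluation criteria encode what "correct" means, not one specific implementation.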

Scope clarifications

  • This position does not include data labeling.
  • This position does not cover prompt engineering.
  • Writing code from scratch is not required. The AI agent handles most coding; your focus is on guidance and evaluation.

Much of the work involves collaborating directly with AI systems, as designing challenges for advanced models requires hands-on interaction with those models.

About Mindrift

Mindrift specializes in connecting skilled professionals with cutting-edge AI projects, focusing on enhancing and evaluating AI systems for prominent technology firms. We prioritize collaboration and innovation, making us a valuable partner in the evolving landscape of artificial intelligence.
