About the role
Gramian Consultancy seeks an AI Evaluation Engineer with a strong background in software engineering and coding. This remote contract role is open to candidates based in Brazil, as well as Bangladesh, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, and Vietnam.
Role overview
This position centers on designing and implementing benchmark tasks that reflect real-world software engineering challenges. The work involves evaluating AI systems by constructing scenarios in which a codebase requires targeted changes, such as bug fixes, refactoring, or migrations, and then verifying the correctness of AI-generated solutions.
What you will do
- Create and implement multi-agent benchmark tasks that simulate practical code modifications, including bug fixes, migrations, and refactoring.
- Apply the Harbor evaluation framework to run and validate tasks in containerized environments.
- Draft clear task instructions, specifying file paths, function signatures, expected behaviors, and constraints.
- Write Python validation scripts to check the correctness of code changes.
- Decompose complex tasks into steps for specialized agents.
- Review large open-source codebases to identify realistic scenarios for tasks.
- Run, debug, and refine tasks within Docker to ensure they are reproducible.
- Iterate on task quality and complexity based on evaluation feedback.
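To illustrate the validation-script responsibility above, here is a minimal sketch of the kind of Python script that could check an AI-generated code change. All names in it (slugify, MAX_LEN, the test cases) are hypothetical examples invented for illustration, not details from the role description; a real script would import the patched module from the task's codebase rather than defining the function inline.

```python
import re

MAX_LEN = 16  # hypothetical constraint the task instructions would specify


def slugify(title: str) -> str:
    # Stand-in for the patched implementation under evaluation; in a real
    # task this would be imported from the modified codebase instead.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug[:MAX_LEN].rstrip("-")


def validate() -> list[str]:
    """Return a list of failure messages; an empty list means the change passes."""
    failures = []
    cases = {
        "Hello, World!": "hello-world",
        "A  B": "a-b",
        # Long input: must be truncated to MAX_LEN with no trailing dash.
        "Edge--case TITLE here": "edge-case-title",
    }
    for raw, expected in cases.items():
        got = slugify(raw)
        if got != expected:
            failures.append(f"slugify({raw!r}) -> {got!r}, expected {expected!r}")
        if len(got) > MAX_LEN:
            failures.append(f"slugify({raw!r}) exceeds {MAX_LEN} characters")
    return failures


if __name__ == "__main__":
    problems = validate()
    # Exit code signals pass/fail to the harness running the container.
    raise SystemExit(0 if not problems else "\n".join(problems))
```

Returning a list of specific failure messages, rather than a bare pass/fail flag, makes it easier to iterate on task quality: the evaluation feedback shows exactly which expected behavior the AI-generated change missed.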
Key details
- Contract type: Contractor assignment (no medical benefits or paid leave)
- Duration: 4 weeks or longer
- Schedule: 8 hours per day, with at least 4 hours overlapping Pacific Standard Time (PST)
- Interview process: Take-home assessment
Gramian Consultancy is a boutique firm specializing in IT professional services and engineering talent solutions. The team focuses on software engineering and leadership, helping organizations build effective teams by connecting them with professionals who match their needs.
