London AI School



Benchmarking AI Tutors:
From Answer Correctness to Pedagogical Guidance

London AI School

Large language models have achieved impressive performance on benchmarks that reward answer correctness and task completion. However, when these systems are deployed in educational settings, this evaluation paradigm quickly breaks down.

Education does not primarily require systems that solve problems. It requires systems that support learning.

An AI Tutor should help learners reason, diagnose misunderstandings, and decide what to do next—without providing the final answer. As a result, benchmarks designed for solvers are fundamentally misaligned with the goals of AI tutoring.

Why answer-based evaluation fails

Most widely used benchmarks assume that the best model is the one that produces the correct output as efficiently as possible. In educational contexts, this assumption fails for several reasons:

• Providing answers can undermine learning and exploration.

• Execution skill is not equivalent to pedagogical skill.

• Optimising for correctness encourages solution leakage.

• The benchmark reward signal conflicts with learning goals.

Systems optimised under this paradigm tend to behave like assistants or solvers, not tutors.
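The misalignment is easiest to see in the scoring function itself. As a minimal sketch, the leakage check below flags a tutor reply that contains the reference answer verbatim. The function name and the string-matching approach are our own illustrative assumptions; real benchmarks in this space rely on expert rubrics or judge models rather than exact matching.

```python
import re

def leaks_final_answer(tutor_reply: str, reference_answer: str) -> bool:
    """Naive leakage check: does the tutor's reply contain the
    reference answer verbatim (case-insensitive)?

    Illustrative only: rubric- or judge-based evaluation would also
    catch paraphrased solutions, which this check misses.
    """
    pattern = re.escape(reference_answer.strip().lower())
    return re.search(pattern, tutor_reply.lower()) is not None
```

A benchmark that rewards correctness pushes models toward replies where this check returns True; a tutoring benchmark should penalise exactly those replies.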

What recent tutoring benchmarks show

Recent work on AI tutoring benchmarks and evaluation frameworks demonstrates a clear shift away from answer correctness toward evaluating pedagogical behaviour.

Across independently developed efforts, several shared principles have emerged:

Human-curated pedagogical ground truth. Expert annotators define pedagogical intent through rubrics or structured labels (e.g. diagnosis quality, appropriateness of guidance, degree of answer revealing), rather than specifying a single correct solution [1,2].

Explicit separation between tutoring and solving. Evaluation frameworks increasingly penalise direct answer giving and reward scaffolding behaviours such as diagnosis, hinting, and next-step guidance [1,2].

Automated evaluation at test time. While pedagogical criteria are authored by humans offline, evaluation is designed to scale via automated scoring or rubric-based judges, without students or instructors in the loop [1,2].

Multi-dimensional assessment. Tutoring quality is evaluated along multiple pedagogical dimensions, reflecting the inherently multi-faceted nature of teaching [1,2].
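To make these principles concrete, here is a minimal sketch of how per-dimension judge scores might be aggregated into a single tutoring score. The dimension names, weights, and the treatment of answer revealing as a penalty dimension are assumptions of ours for illustration, not the scoring scheme of any specific benchmark cited above.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    name: str         # pedagogical dimension, e.g. "diagnosis_quality"
    weight: float     # relative importance, as set by expert annotators
    description: str  # what an automated judge is asked to assess

# Illustrative rubric; names and weights are invented for this sketch.
RUBRIC = [
    RubricCriterion("diagnosis_quality", 0.4,
                    "Does the tutor identify the learner's misconception?"),
    RubricCriterion("guidance_appropriateness", 0.4,
                    "Is the guidance scaffolded rather than a full solution?"),
    RubricCriterion("answer_revealing", 0.2,
                    "Penalty dimension: did the tutor reveal the answer?"),
]

def aggregate(scores: dict[str, float]) -> float:
    """Combine per-dimension judge scores (each in 0..1) into one score.

    'answer_revealing' is a penalty dimension, so it is inverted before
    weighting: revealing more of the answer lowers the total.
    """
    total = 0.0
    for c in RUBRIC:
        s = scores[c.name]
        if c.name == "answer_revealing":
            s = 1.0 - s
        total += c.weight * s
    return total
```

The key property is that the human-authored rubric is fixed offline, while scoring individual tutor turns requires no human in the loop at test time.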

Representative work

Two recent efforts exemplify this shift:

Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach argues for evaluation-first design of educational AI systems, emphasising human-defined pedagogical criteria and explicit separation between guidance and answer production [1].

TutorBench introduces a large-scale tutoring benchmark using instance-specific expert rubrics and automated evaluation to assess adaptive explanations, actionable feedback, and active learning support—explicitly penalising answer revealing [2].

What is still missing

Despite this progress, important gaps remain.

Most existing benchmarks focus on tutoring individual questions. However, much real learning, particularly at university or professional level, takes place through:

• Extended projects.

• Iterative experimentation.

• Debugging, refinement, and strategic decision-making.

In addition, fluent explanations can score well even when the guidance is strategically wrong. Benchmarks must therefore evaluate not just how guidance is expressed, but whether it proposes the right kind of pedagogical action.

Our research direction

Building on recent work, we argue that the next generation of benchmarks for AI Tutors should:

• Evaluate guidance, not execution.

• Use human pedagogical ground truth defined offline.

• Operate at the level of actions and decisions, not answers.

• Explicitly penalise solution leakage.

This framing aligns benchmarking with what tutoring is meant to support: learning rather than outsourcing cognition.
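Evaluating at the level of actions rather than answers can be sketched as follows: the tutor's chosen next move is compared against an expert-annotated label drawn from a small action taxonomy. The taxonomy and scoring rule here are hypothetical, chosen only to illustrate the framing; revealing the answer is hard-coded as solution leakage and scored zero regardless of the label.

```python
# Hypothetical taxonomy of pedagogical actions for action-level evaluation.
PEDAGOGICAL_ACTIONS = {
    "diagnose",       # probe the learner's current understanding
    "hint",           # give a scaffolded nudge toward the next step
    "ask_to_retry",   # prompt the learner to attempt the step again
    "reveal_answer",  # disallowed: solution leakage
}

def score_action(predicted: str, expert_label: str) -> float:
    """Score a tutor's chosen action against an expert-annotated label.

    Revealing the answer is always scored 0, even if it would match
    the label, encoding an explicit penalty for solution leakage.
    """
    if predicted not in PEDAGOGICAL_ACTIONS:
        raise ValueError(f"unknown action: {predicted}")
    if predicted == "reveal_answer":
        return 0.0
    return 1.0 if predicted == expert_label else 0.0
```

Under this scheme, a fluently worded reply still scores zero if it proposes the wrong pedagogical move, which is exactly the gap noted above between how guidance is expressed and what it does.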

Conclusion

As AI Tutors become integrated into higher education, professional training, and project-based learning environments, benchmarks must reflect what effective tutoring actually entails.

Designing benchmarks around pedagogical guidance is not merely an evaluation choice—it is a statement about what kind of educational AI we believe should exist.


References
[1] Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach, arXiv:2407.12687.
[2] TutorBench: Evaluating Tutoring Capabilities of Large Language Models, arXiv:2510.02663.