
Humanity’s Last Exam and education in AI

This op-ed is part of a series hosted by the AI Observatory, offering perspectives on key issues in AI. We hope you find this opinion piece thought-provoking and encourage further discussion on the topic.

Tests conjure up anxiety and dread in many, but beyond a certain age, we are rarely tested without the aid of the world’s collective knowledge accessible at our fingertips.

For some of us, the tough questions of a game show on TV or a pub quiz are the only opportunity we have for the frisson of excitement that comes from checking whether our neural pathways can still recall raw data without turning to a computer or phone.

Calculators are allowed in higher maths exams, and more emphasis is placed on writing and essays completed unsupervised. Increasingly, the possibility that AI is completing all these tasks for students becomes a real concern, and an arms race develops between teachers’ AI and plagiarism detectors and the generative AI used by the hapless (or harried) student who submits its output for a mark (grade). Does this undermine the value of tests altogether, or does it call into question only the type of high-stakes test that puts pressure on results and pits teacher against student?

The sort of adversarial relationship where students are trying to trick their teachers can run both ways – I’m sure many of us remember a teacher in our lives who made a point of setting impossible exams. Mine was in the first year of secondary school, when a grammar pedant of an English teacher required us to correct a page so full of errors that it was possible to get a negative score on the test. The idea that a test should be made to trip you up is, on the one hand, sadistic and cruel. On the other hand, it remains common practice to test the boundaries of our knowledge and capabilities, as well as to emphasise that failure is a part of learning. Mastery of a subject requires not only a grasp of the core concepts and main rules, but also of the exceptions and quirky edge cases that occur infrequently. A quick search of the term “adversarial testing” will underscore the role of using the most difficult and challenging examples to try to trip up machine learning algorithms, and thereby refine their accuracy. Within education, our understanding of the psychology of testing and the value of assessment for indicating learning has led to an emphasis on “keeping the main thing the main thing” in testing, rather than using outliers as a proxy test for whether the core concepts are understood.
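For readers curious what adversarial testing looks like in practice, here is a minimal sketch in the style of the fast gradient sign method (FGSM), one classic way of generating “difficult examples” for an image classifier. It assumes PyTorch; the function name, model, and epsilon value are illustrative, not drawn from any project mentioned here.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.03):
    """Nudge input x in the direction that most increases the model's
    loss on the true label y - a classic 'trick question' for a classifier."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # One signed-gradient step: a tiny perturbation that is often
    # enough to flip a brittle model's prediction.
    return (x + eps * x.grad.sign()).detach()
```

Examples that fool the model this way are then folded back into training, which is how “tripping up” an algorithm ends up refining its accuracy.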

With recent improvements, many tests, or “benchmarks”, are showing that GenAI gives pretty reasonable results almost all the time – so does this way of testing still work in the era of AI?

In response to the high performance of many GenAI large language models (LLMs) on key benchmarks, a group of researchers and global experts have come together with one goal: to devise a test so hard that today’s GenAI doesn’t stand a chance.

The project is called Humanity’s Last Exam, and they write on their site 👇🏾

“LLMs now achieve over 90% accuracy on popular benchmarks like MMLU [Massive Multi-task Language Understanding], limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity’s Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. The dataset consists of 3,000 challenging questions across over a hundred subjects”.

[Image: screenshot of Example 1 (Safe.ai, n.d.)]

On this new benchmark, leading LLMs get a paltry 3-13% of answers correct. What a great time to be human, right!?
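As an aside, “accuracy on a closed-ended benchmark” is a simple computation: pose each question, compare the model’s answer to the reference answer, and count the matches. Here is a minimal sketch; the ask_model function and toy questions are placeholders, not the actual Humanity’s Last Exam harness.

```python
# Minimal sketch of closed-ended benchmark scoring: exact-match accuracy.
def accuracy(questions, ask_model):
    correct = sum(
        ask_model(q["question"]).strip().lower() == q["answer"].strip().lower()
        for q in questions
    )
    return correct / len(questions)

# Toy stand-ins for a real dataset and a real model.
questions = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]
print(accuracy(questions, ask_model=lambda q: "4"))  # 0.5
```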

But what does this tell us about testing? We humans can collectively come up with questions obscure enough that AI can’t figure them out – is that it? Does that make us any different, then, from the cruel teacher who sets impossible exams, testing AI on things we haven’t covered in class? Are we engaging with AI in this way simply to reassure ourselves of our own humanity?

[Image: AI-generated image of a teacher holding a dummy, presenting to an empty classroom (DeepAI, n.d.)]

I would suggest that the purpose of educational tests in relation to AI is not fundamentally different from the tests we give each other as humans. We test students on their attentiveness – not only to the content and facts shared in a classroom, but also to ways of thinking, debating and articulating those facts. When we test AI, we aren’t just testing the ability of an LLM to ingest information and spurt it back in the appropriate way; we are also testing for “attentiveness” – the current explosion of generative AI use, built on LLMs, traces its origins largely to an academic breakthrough in AI architecture which posited that “Attention Is All You Need”. The ability of AI to pay attention – not only to rote memorisation of facts, but also to ways of thinking and articulating knowledge – is taking many by surprise. At the same time, the ability of children to pay attention in educational contexts is seemingly disrupted by technology as much as, or more than, it is encouraged. This leaves us in a rather messy situation, like a ventriloquist doing a mic check where the audience isn’t responding, only the ventriloquist’s dummy. Understanding the awkward position that AI can put teachers in is important for leveraging its benefits for education. And perhaps drawing on the good practices of teachers today – testing for attentiveness and understanding rather than testing adversarially – can likewise lead to AI models that reflect our educational values rather than a mirage of our collective memory.
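For the technically curious, the “attention” in that paper title is a concrete operation, not just a metaphor: each query softly retrieves from a set of values, weighted by how well it matches each key. A minimal NumPy sketch of scaled dot-product attention follows; the shapes and names are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each row of Q attends over the rows of K/V and returns a
    weighted mix of V - the core operation of the Transformer."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 2))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 2)
```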


