How might AI transform the way we assess and understand learning?

This summary captures the conversation from a Community of Interest event on Artificial Intelligence for Assessment and Evaluation, a collaboration between the World Bank and EdTech Hub’s AI Observatory and Action Lab, supported by FCDO.
AI is changing how we know what students are learning. It is already shaping assessments ranging from formative classroom-based evaluations to large-scale system diagnostics. To help ministries, practitioners, and researchers explore the opportunities and risks associated with AI-enabled assessment, the World Bank and EdTech Hub convened their AI in Education Community of Interest for a session on 22 October 2025.
Assessment has long been a cornerstone of effective teaching and policymaking, yet one of the most challenging parts of education systems to modernise. Can AI help make assessment more continuous, actionable, and cost-effective — without losing fairness or trust?
Here’s what we heard from three experts who are exploring AI-enabled assessment, and what questions came up from our community.
Watch the webinar here
Featured speakers 🔉
- Mike Trucano, Senior Advisor, EdTech Hub AI Observatory and Action Lab and Non-Resident Fellow, Brookings Institution
- Daniel Plaut, Innovation Learning Lead, EdTech Hub
- Anthony Udeh, Senior Technical Advisor, RTI International
- Diego Luna-Bazaldua, Senior Education Specialist, World Bank
- Nawaz Aslam, Pakistan Country Lead, EdTech Hub
Key takeaways 📝
Here are four key takeaways from the discussion regarding the opportunities and risks associated with AI-enabled assessment, including some important technical considerations:
1. The purpose and function of the assessment really matter when it comes to considering AI’s role.
Our session’s moderator introduced the different types of learning assessment at the top of the call by comparing their functions to the steps of testing a new soup recipe:
- System or diagnostic assessment: done to inform policymaking and system-wide decision-making, this is like checking what’s in the kitchen and what people are hungry for — it helps you understand resources and needs before you start cooking.
- Formative assessment: typically done continuously to inform the teaching process, this is like tasting the soup while you’re cooking — it helps you adjust ingredients and improve learning as you go.
- Summative assessment: done at important evaluation points, such as the end of term, this is like tasting the soup after it’s served — it judges the final outcome and tells you how well you did.
Our discussion spanned all three of these assessment types, but one important theme came up when considering AI integration into assessments: purpose matters. Diego noted that AI’s value varies by use case. He further argued that its use for helping teachers gather feedback within a classroom (where speed and adaptability are key) is much less risky than its use in national or diagnostic assessments (where reliability and cross-comparability are crucial).
2. AI could strengthen assessment-informed instruction — but it is crucial that teachers stay in the loop.
Speakers agreed that AI holds great promise for closing the feedback gap between assessment and teaching. Assessment-informed instruction refers to the practice of using continuous formative assessment to inform adaptations in classroom teaching. AI-enabled assessment tools currently being developed and tested aim to catalyse this feedback loop: collecting learning data and making it more easily available to teachers, including through direct recommendations for improved teaching practices.
Nawaz and Anthony shared examples of AI-enabled tools being used for this purpose in Pakistan and the Philippines. Teachers in both places found value in the ways AI tools have enhanced formative assessment questions and made implementation more efficient. However, both speakers also emphasised a common design principle: assessment tools should be designed and implemented alongside teachers. Nawaz noted the importance of enabling teachers to review assessment results to ensure validity, while Anthony stressed that teacher-centred design is key to effectiveness and uptake.
3. AI assessments face important limitations when it comes to linguistic diversity and infrastructure.
Panellists highlighted that linguistic diversity, connectivity, and device availability are all important barriers to address when considering AI’s usefulness in assessment. Anthony described a post-COVID reading assessment pilot in which AI significantly reduced the time teachers spent on assessment. However, he also shared an important limitation: the tool struggled to process local Filipino dialects due to limited training data.
Infrastructure is also a clear limitation, especially for AI-enabled large-scale, system-wide assessments. Nawaz highlighted the extensive challenges of providing enough devices to scale an AI assessment for foundational learning in Pakistan. To address these challenges, speakers called for modular language models built on more local data, which may be better able to handle linguistic diversity, as well as hybrid assessment tool designs that allow for both online and offline use.
4. Leveraging existing assessment frameworks is key to ensuring the quality of AI-enabled assessment tools.
Diego reminded participants that sound assessment design starts with a rigorous framework — not a prompt. While AI can generate assessment items and score responses quickly, psychometric rigour and curriculum alignment remain essential, especially in high-stakes assessments that shape policy decisions. He warned that AI-generated items have been shown to over-emphasise rote knowledge and to show lower reliability for high-stakes use. For governments and donors considering AI for system-wide assessments, he urged a focus on quality control and emphasised the importance of building on existing assessment frameworks, such as those developed by UNESCO’s Global Alliance to Monitor Learning and the World Bank’s Accelerating Learning Measurement for Action (ALMA) program.
Questions 🙋🏾‍♀️
The following questions were posed by community members. We’re sharing them to help stimulate further discussion and knowledge exchange. Please note that some questions may have been edited for spelling or clarity.
Equity, Context & Fairness →
- How can AI-driven assessments ensure fairness across diverse languages and cultures?
- How do we prevent bias in AI-generated test items, especially when training data comes from the Global North?
- Can AI be used responsibly in contexts where digital infrastructure is weak or uneven?
- How can local educators contribute to training and validating AI tools to make them more contextually relevant?
Teacher Agency & the Role of Humans →
- What safeguards are needed to keep teachers in the loop when AI generates or scores assessments?
- What kinds of professional development will teachers need to use AI-informed assessment tools effectively?
- How can we maintain student and teacher trust when AI is involved in evaluating performance?
Design, Data & Implementation →
- How do we validate AI-generated questions or scoring rubrics before using them at scale?
- How can ministries start experimenting safely with AI in diagnostic or national assessments?
- What governance or ethical protocols should guide data collection and sharing in AI-enabled testing?
Policy, Systems & Scale →
- Could AI help make national or regional assessments more adaptive without losing comparability across countries?
- What capacity is needed within ministries to interpret and act on AI-generated assessment data?
- How can countries collaborate to share open-source AI tools or frameworks for assessment design?
Resources 📚
🔎 The following resources were shared by community members and participants. These have not been reviewed by the World Bank or EdTech Hub, but are useful indicators of what conversations, evidence, and methods are being explored in the sector.
World Bank
ALMA resources: the World Bank’s Accelerating Learning Measurement for Action program includes an extensive list of free resources related to quality learning assessments.
World Bank’s Diego Luna-Bazaldua shared several key resources that he has been reading:
Mead & Zhou (2023) – Evaluating the Quality of AI-Generated Items for a Certification Exam: This study evaluates GPT-3 models’ ability to write multiple-choice exam questions, finding that while most AI-generated items are usable, many still require human revision to correct flaws in wording, keys, or distractors.
Burstein (2025) – The Duolingo English Test Responsible AI Standards: Duolingo introduces the first comprehensive Responsible AI framework for educational assessment, outlining principles of validity, fairness, privacy, and accountability to guide ethical AI use in the Duolingo English Test.
Baudin (2025) – Assessing the Psychometric Properties of AI-Generated Multiple-Choice Exams in a Psychology Subject: Analysing ChatGPT-4-generated questions for an undergraduate psychology course, this research finds that AI can produce reliable and moderately valid items but struggles to assess higher-order cognitive skills, supporting a hybrid human–AI approach.
Brennan (2024) – Current Psychometric Models and Some Uses of Technology in Educational Testing: Brennan reviews how classical test theory, generalizability theory, and item response theory align with emerging technology-based approaches to educational testing.
EdTech Hub
Learning Brief on AI Use Cases for Improved Service Delivery: EdTech Hub’s AI Observatory and Action Lab recently published a learning brief titled How is AI Being Used by Education Ministries to Improve Service Delivery in Low- and Middle-Income Countries? The brief synthesises existing evidence on how Ministries of Education are leveraging AI-enabled solutions to improve education service delivery — including assessment and evaluation. It is aligned with our Ministry of Education AI Challenge, where six Ministries of Education are currently prototyping and testing AI-enabled tools for this purpose.
RTI
Tangerine Open-Source Assessment Platform — cited by RTI’s Carmen Strigel as a detailed example of AI-enabled early-literacy scoring.
This is part of an ongoing series hosted by the World Bank and EdTech Hub’s AI Observatory and Action Lab. The AI Observatory is made possible by support from UK International Development. Please follow along and join the conversation on LinkedIn!
