Towards the implementation of automated scoring in international large-scale assessments: Scalability and quality control

Bibliographic Details
Main Authors: Ji Yoon Jung, Lillian Tyack, Matthias von Davier
Format: Article
Language: English
Published: Elsevier, 2025-06-01
Series: Computers and Education: Artificial Intelligence
Subjects: Automated scoring; Artificial intelligence; Machine learning; Natural language processing; Machine translation; TIMSS
Online Access: http://www.sciencedirect.com/science/article/pii/S2666920X25000153
author Ji Yoon Jung
Lillian Tyack
Matthias von Davier
collection DOAJ
description Even before the age of artificial intelligence, automated scoring received considerable attention in educational measurement. However, its application to constructed response (CR) items in international large-scale assessments (ILSAs) has remained a challenge, primarily due to the difficulty of handling multilingual responses spanning many languages. This study addresses that challenge by investigating two machine learning approaches, supervised and unsupervised learning, for scoring multilingual responses. We explored various scoring methods to assess three science CR items from TIMSS 2023 across all participating countries and 42 languages. The results showed that the supervised approach, particularly the combination of multiple machine translations with artificial neural networks (MMT_ANN), performed comparably to human scoring. The MMT_ANN model demonstrated high accuracy, correctly classifying up to 94.88% of responses across all languages and countries. This performance can be attributed to MMT_ANNs providing more suitable translations at both the individual-response and language levels. Furthermore, MMT_ANNs consistently generated accurate scores for identical or borderline responses within and across countries. These findings indicate the potential of automated scoring as an accurate and cost-effective quality control measure in ILSAs, reducing the need to hire additional human raters to ensure scoring reliability.
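The MMT_ANN approach named in the abstract, combining several machine translations of each response and feeding them to a neural network classifier trained on human-assigned scores, can be illustrated with a short sketch. The code below is not the authors' implementation; it is a minimal Python illustration that assumes hypothetical translator callables and uses scikit-learn's TfidfVectorizer and MLPClassifier as stand-ins for the feature extraction and neural network components.

# Minimal sketch of an MMT_ANN-style scoring pipeline (illustrative only).
# Assumption: `translators` is a list of hypothetical callables, each returning
# an English machine translation of a student response string.
from typing import Callable, List, Sequence

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def combine_translations(response: str, translators: Sequence[Callable[[str], str]]) -> str:
    """Concatenate multiple machine translations of one response into a single text."""
    return " ".join(t(response) for t in translators)

def train_scorer(responses: List[str], human_scores: List[int],
                 translators: Sequence[Callable[[str], str]]):
    """Fit a simple feed-forward ANN classifier on combined translations of human-scored responses."""
    texts = [combine_translations(r, translators) for r in responses]
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),           # bag-of-words features
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),   # small neural network
    )
    model.fit(texts, human_scores)
    return model

def score(model, response: str, translators: Sequence[Callable[[str], str]]) -> int:
    """Predict a score code for a new multilingual response."""
    return model.predict([combine_translations(response, translators)])[0]

In practice the translation step would call real machine translation services and the classifier would be tuned per item; the sketch only conveys the general pipeline shape suggested by the abstract.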
format Article
id doaj-art-91b00d6d9564417d8a17faa503efc7ce
institution Kabale University
issn 2666-920X
language English
publishDate 2025-06-01
publisher Elsevier
record_format Article
series Computers and Education: Artificial Intelligence
spelling doaj-art-91b00d6d9564417d8a17faa503efc7ce
2025-02-05T04:32:46Z; eng; Elsevier; Computers and Education: Artificial Intelligence; ISSN 2666-920X; 2025-06-01; volume 8; article 100375
Towards the implementation of automated scoring in international large-scale assessments: Scalability and quality control
Ji Yoon Jung (corresponding author; Boston College, TIMSS & PIRLS International Study Center, United States)
Lillian Tyack (Boston College, TIMSS & PIRLS International Study Center, United States)
Matthias von Davier (Boston College, TIMSS & PIRLS International Study Center, United States)
http://www.sciencedirect.com/science/article/pii/S2666920X25000153
Automated scoring; Artificial intelligence; Machine learning; Natural language processing; Machine translation; TIMSS
title Towards the implementation of automated scoring in international large-scale assessments: Scalability and quality control
topic Automated scoring
Artificial intelligence
Machine learning
Natural language processing
Machine translation
TIMSS
url http://www.sciencedirect.com/science/article/pii/S2666920X25000153