© British Council

Discussions and debates around artificial intelligence have often included the notion that ‘we’ (humans, that is) don’t know exactly what is going on inside the models and algorithms that increasingly manage our lives and society more broadly. The application of AI to the assessment of language and human communication is no exception: what exactly is the automated model measuring? Is pronunciation more of a factor than grammar? Can test-wise test takers ‘game’ the system by speaking faster, even if what they’re saying is gobbledygook? How do the individual features that are measured by the machine relate to a more holistic concept of comprehensibility? Are some features that are more easily quantified (such as syllables per second) prioritised over other features that defy objective evaluation (such as overall communicative effect)?
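
To make that contrast concrete, here is a deliberately simple sketch of how one easily quantified feature – syllables per second – might be estimated from a transcript and the length of the recording. The function names and the vowel-counting heuristic are invented for illustration only; they are not taken from any real scoring engine, which would typically work from acoustic analysis rather than a transcript.

```python
# Hypothetical sketch: estimating one 'easily quantified' fluency feature.
# The vowel-group heuristic is deliberately crude and for illustration only.
import re

def estimate_syllables(word: str) -> int:
    """Very rough syllable count: the number of vowel groups, minimum 1."""
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))

def syllables_per_second(transcript: str, duration_seconds: float) -> float:
    """Estimated syllables in the transcript divided by speaking time."""
    words = re.findall(r"[a-zA-Z']+", transcript)
    total_syllables = sum(estimate_syllables(w) for w in words)
    return total_syllables / duration_seconds

# Example: a short response lasting four seconds.
print(syllables_per_second("I usually go to the market on Saturdays", 4.0))
```

A number like this is trivially easy to compute and report; a judgement about ‘overall communicative effect’ is not, which is precisely why the balance between the two matters.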

These questions have led to a call for explainable artificial intelligence, or ‘XAI’ – machine ‘intelligence’ that humans can understand. At the very least, it has been recommended that developers of AI systems should be transparent about how their models are developed, including what data is used to train the algorithms, for example by providing model cards that explain what has been done and define the population that the model can be applied to (see Mitchell et al., 2019). Should testing organisations be doing the same? Do developers of automated assessment tools have an ethical responsibility around XAI? And even where AI is explainable through complex mathematical models, how can these explanations be presented in a way that key language assessment stakeholders can relate to? When a parent wants to know why ‘the machine’ has rated their child’s spoken language as a ‘fail’, how should developers of automated assessment be expected to respond?
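
For readers unfamiliar with model cards, the sketch below shows the kind of information they typically document, loosely following the fields proposed by Mitchell et al. (2019). Every value here is invented for a hypothetical speaking scorer; it illustrates the format, not any real testing product.

```python
# Illustrative, simplified model card for a hypothetical automated speaking
# scorer, loosely following Mitchell et al. (2019). All values are invented.
model_card = {
    "model_details": "Hypothetical spoken-response scoring model, v1.0",
    "intended_use": "Low-stakes practice feedback for adult learners of English",
    "out_of_scope_use": "High-stakes decisions made without human review",
    "training_data": "Description of the corpus of human-scored responses used",
    "evaluation_data": "Held-out responses, balanced across first languages",
    "metrics": "Agreement with trained human raters (e.g. correlation, kappa)",
    "factors": "Performance reported by age group, first language and accent",
    "limitations": "Not validated for young learners or atypical speech",
    "ethical_considerations": "Risk of penalising under-represented accents",
}

for field, value in model_card.items():
    print(f"{field}: {value}")
```

Even a brief document along these lines would give score users something concrete to interrogate when questions like the ones above arise.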

Many of these discussions seem to presuppose that the ‘ghost’ is in the machine and not in the heads of human raters. The suggestion is that we have a better understanding of what human examiners are doing when they assign a score to a linguistic performance. Perhaps this is because we feel more secure knowing they have been trained on clear performance descriptors that highlight certain ‘can-do’ outcomes and present linguistic benchmarks like ‘a wide range of vocabulary’. But do we really know whether humans are prioritising some features over others, or whether they might also be taking into account aspects of communication that are not described in those descriptors?

Of course, test developers – whether human, machine or ‘hybrid’ rating is used – monitor the performance of their raters or rating system and report reliability statistics. Responsible test developers also describe the underlying ability being measured in a way that test takers preparing for the test and score users (such as administration departments) can understand. Should we be expecting more from XAI than ‘XI’ (explainable intelligence)? Does the large-scale use of technology demand a more critical approach, or is the mistrust misplaced?
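
As a simple illustration of what ‘reliability statistics’ can mean in practice, the sketch below compares a set of human scores with machine scores for the same responses. The scores are invented and the sample is far smaller than any real monitoring exercise would use; operational programmes also report further statistics, such as weighted kappa.

```python
# Hypothetical sketch: two common reliability indicators for human vs machine
# scores on the same responses. All scores below are invented.
from math import sqrt

human_scores   = [4, 3, 5, 2, 4, 3, 5, 1]
machine_scores = [4, 3, 4, 2, 5, 3, 5, 2]

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Proportion of responses where the two scores match exactly.
exact_agreement = sum(h == m for h, m in zip(human_scores, machine_scores)) / len(human_scores)

print(f"Pearson correlation: {pearson(human_scores, machine_scores):.2f}")
print(f"Exact agreement:     {exact_agreement:.2f}")
```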

Many of the presentations and discussions at New Directions Viet Nam will focus on technology – please see the conference programme. Join us at New Directions, 27–29 October 2023.