New Delhi: Debates about AI in academia have moved from its use by students to its use by professors. How efficient is it at grading assessments? A new study led by researchers at the University of Cambridge has found that AI systems used for marking university essays tend to prioritise “style over substance.”
The report, AI in University Assessment: Evaluating the Opportunities and Risks of Automated Marking, analysed 761 undergraduate psychology essays across three UK universities: Cambridge, Manchester Metropolitan University, and the University of Nottingham. Researchers tested the latest large language models, including Gemini 3 Flash, GPT-5.4, and Claude Opus 4.6, both individually and in ensemble systems, against routine human marking styles.
Psychology, the researchers say, was chosen for its focus on critical judgement rather than a single correct answer.
The findings point to a consistent pattern across all tested systems: a strong sensitivity to surface-level linguistic cues. The models tended to assign higher marks to longer essays and to those with more complex vocabulary and sentence structures, even when these features were unrelated to academic standards. This, the report argues, suggests that current AI evaluators may be “overweighting style-related features” rather than accurately assessing conceptual understanding or argument strength.
The study acknowledges the institutional pressures universities face, including staff workload constraints and growing student populations. Under such conditions, the researchers note, AI tools could still play a supportive role in assessment workflows. They prefer to position such tools as a “second pair of eyes”, however, and maintain that relying on them entirely would result in “homogenised” grading that “underestimates brilliance”.
The central tendency bias of AI
According to the findings, essays awarded high marks (around 75 out of 100) by human examiners were, on average, marked lower by AI systems. Conversely, essays that received mid-range marks, around 50, were typically scored higher by AI.
This, the researchers argue, is because human marking is grounded in reasoning and conceptual merits while AI scoring is based on statistical prediction, leading to AI being bound by a “central tendency bias.”
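One way to picture this compression toward the middle is to regress AI marks on human marks: a slope below 1 means the AI marks span a narrower range than the human ones. The sketch below is only an illustration. The marks are invented, the function name `compression_slope` is hypothetical, and nothing here reproduces the report's data or its actual method.

```python
from statistics import mean


def compression_slope(human, ai):
    """Least-squares slope of AI marks regressed on human marks.

    A slope between 0 and 1 indicates that AI marks are pulled
    toward the middle of the scale relative to human marks.
    """
    h_bar, a_bar = mean(human), mean(ai)
    cov = sum((h - h_bar) * (a - a_bar) for h, a in zip(human, ai))
    var = sum((h - h_bar) ** 2 for h in human)
    return cov / var


# Hypothetical marks, invented for illustration (NOT data from the study).
human_marks = [45, 50, 55, 62, 68, 75, 78]
ai_marks = [54, 56, 58, 61, 64, 67, 68]

slope = compression_slope(human_marks, ai_marks)

# Positive gap: AI marked higher than the human; negative gap: lower.
gaps = [a - h for h, a in zip(human_marks, ai_marks)]

print(f"slope = {slope:.2f}")
print("AI minus human, per essay:", gaps)
```

In this made-up sample the weakest essays gain marks and the strongest lose them, which is the pattern the report describes.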
Agreement between AI models was notably strong, with pairwise comparisons suggesting consistently high alignment across all three systems. In effect, the models tended to converge on very similar judgements when grading the same essays. When viewed together, overall reliability was described as excellent. In more than 99 per cent of cases, at least two of the three AI systems assigned the same degree band, with most differences limited to adjacent categories.
Crucially, the AI models were more closely aligned with each other than with human markers. However, when a stricter test was applied, requiring all three models to agree exactly, the level of consistency dropped to 56 per cent of submissions, showing that while AI systems often move in sync with one another, full agreement is still limited.
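The gap between the two agreement figures comes from how strictly agreement is defined. A minimal sketch of the two measures follows, assuming each essay receives one degree band from each of three models. The bands are invented and the function name `agreement_rates` is hypothetical, so the output says nothing about the study's own numbers.

```python
from collections import Counter


def agreement_rates(bands_per_essay):
    """Return (majority_rate, unanimous_rate) across essays.

    majority_rate: share of essays where at least two models
    assigned the same band.
    unanimous_rate: share of essays where every model assigned
    the same band.
    """
    majority = 0
    unanimous = 0
    for bands in bands_per_essay:
        # Size of the largest group of models that chose the same band.
        top_count = Counter(bands).most_common(1)[0][1]
        if top_count >= 2:
            majority += 1
        if top_count == len(bands):
            unanimous += 1
    n = len(bands_per_essay)
    return majority / n, unanimous / n


# Hypothetical bands from three models, invented for illustration.
sample = [
    ("2:1", "2:1", "2:1"),
    ("2:1", "2:1", "2:2"),
    ("First", "2:1", "2:1"),
    ("2:2", "2:2", "2:2"),
    ("First", "2:1", "2:2"),
]

majority_rate, unanimous_rate = agreement_rates(sample)
print(f"at least two agree: {majority_rate:.0%}")
print(f"all three agree:    {unanimous_rate:.0%}")
```

Any essay counted under the unanimous measure is also counted under the majority measure, so the stricter figure can never exceed the looser one.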
The importance of the social contract
In the middle of these concerns, students raised questions about what education loses when AI becomes part of the grading process. A Nottingham student said it had become “really easy to complete assignments with AI—you just don’t really need to think that much,” stressing “I’m kind of worried, like, what did I actually learn from the essay?”
Assessments, the study suggests, are also tied to how students experience fairness and motivation, with some saying they would feel “cheated” if AI were used to mark their work.
Participants across institutions agreed on one core point: the importance of the social contract in education.
“Assessment is not simply a system for distributing marks. It is also part of how educational meaning is made: How students feel seen, how standards are enacted, how trust is maintained, and how institutions reproduce their own values,” said Dr Steve Watson, Advisory Board Member, from the University of Cambridge, in the report.
(Edited by Theres Sudeep)

