This week was A-Level results day. It was also the day that Ofqual published its long-awaited standardisation algorithm. Full details can be found in the 319-page report. In this blog post, I’ve set down my initial thoughts after reading the report.
I would like to begin by saying that Ofqual was not given an easy task: produce a system to devise A-level and GCSE grades without exams or coursework. Reading the report, it is clear that they worked hard to do the best they could within the confines in which they operate, and I respect that work. Nevertheless, I have several concerns to share.
1. Accounting for Prior Attainment
The model corrects for differences between historical prior attainment and prior attainment of the 2020 cohort in the following way (after first taking into account any learners without prior attainment measures). For any particular grade, the proportion to be awarded is equal to the historical proportion at that grade adjusted by a factor referred to in the report as . (See p.92-93 of the report, which incidentally has a typo here – should read .) As noted by the Fischer Family Trust, it appears that this factor is based solely on national differences in value added, and this could cause a problem. Illustrating this requires an artificial example. Imagine that Centre A has a historical transition matrix in which all of its 200 students have walked away with A*s in this subject in recent years, whether they were in the first or second GCSE decile (and half were in each). Well done Centre A!
Meanwhile, let’s say the national transition matrix looks more like this: 90% of first-decile students get an A*, but only 10% of second-decile students do.
Let’s now look at 2020 outcomes. Assume that this year, Centre A has an unusual cohort: all students were second decile in prior attainment. It seems natural to expect that it would still get mainly A*s, consistent with its prior performance, but this is not the outcome of the model. Instead, its historical distribution of 100% A*s is adjusted downwards because of the national transition matrix. The proportion of A*s at Centre A will be reduced by 40 percentage points – now only 60% of its students will get A*s! This happens because the national transition matrix expects a 50/50 split of Decile 1 and Decile 2 students to end up with 50% A* and a Decile 2-only cohort to end up with 10% A*, resulting in a downgrade of 40 percentage points.
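The arithmetic in this example can be checked with a short Python sketch. It is only an illustration under stated assumptions, not Ofqual's implementation: the function names are mine, the adjustment is treated as a simple additive shift, every student is assumed to have matched prior attainment data, and the national A* rates (90% for Decile 1, 10% for Decile 2) are chosen to be consistent with the figures quoted above.

```python
# Assumed national A* rates by GCSE prior-attainment decile (illustrative only,
# chosen so that a 50/50 cohort gives 50% A* and a Decile 2-only cohort gives 10%).
NATIONAL_A_STAR = {1: 0.9, 2: 0.1}


def expected_rate(rate_by_decile, decile_mix):
    """A* rate the national matrix predicts for a cohort with this decile mix."""
    return sum(share * rate_by_decile[decile] for decile, share in decile_mix.items())


def adjusted_centre_rate(historical_rate, historical_mix, current_mix,
                         national=NATIONAL_A_STAR):
    """Shift the centre's historical rate by the national change in expectation.

    Assumption: the shift is additive and uses only national rates, so the
    centre's own value added plays no part in the adjustment.
    """
    national_current = expected_rate(national, current_mix)
    national_historical = expected_rate(national, historical_mix)
    adjusted = historical_rate + (national_current - national_historical)
    return min(1.0, max(0.0, adjusted))  # keep the result a valid proportion


# Centre A: historically 100% A*, half Decile 1 and half Decile 2;
# this year every student is Decile 2.
rate = adjusted_centre_rate(1.0, {1: 0.5, 2: 0.5}, {2: 1.0})
print(f"{rate:.0%}")  # prints 60%
```

Note that the centre's own record (100% A* from second-decile students) never enters the shift, which is exactly the concern.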
2. Model accuracy
Amongst the various possible standardisation options, Ofqual evaluated accuracy by trying to predict 2019 exam grades and seeing how well the predictions matched the grades actually awarded. This immediately presents a problem: no rank orders were submitted for 2019 students, so how is this possible? The answer provided is “the actual rank order within the centre based on the marks achieved in 2019 were used as a replacement”, i.e. they back-fitted 2019 marks to rank orders. This only provides a reasonable idea of accuracy if we assume that teacher-submitted rank orders in 2020 would exactly correspond to mark orders of their pupils, as noted by Guy Nason. Of course this will not be the case, so the accuracy estimates in the Ofqual report are likely to be significant overestimates. And they’re already not great, even under a perfect-ranking assumption: Ofqual report that only 12 out of 22 GCSE subjects were accurate to within one grade, with some subjects having only 40% accuracy in terms of predicting the attained grade – so one is left wondering what the accuracy might actually be for 2020 once rank-order uncertainty is taken into account.
There may also be a systematic variation in the accuracy of the model across different grades, but this is obscured by using the probability of successful classification across any grade as the primary measure of accuracy. Graphs presented in the Ofqual report suggest, for example, that the models are far less accurate at Grade 4 than at Grade 7 in GCSE English.
3. When is a large cohort a large cohort?
A large cohort, and therefore one for which teacher-assessed grades are not used at all, is defined in the algorithm to be one with at least 15 students. But how do we count these 15 students? Is it the current cohort, the historic cohort, or something else? The answer is given in Ofqual’s report: the harmonic mean of the two. As an extreme example of this, centre cohorts can be considered “large” with only 8 pupils this year – so long as they had at least 120 in the recent past. It seems remarkable that a centre could have fewer pupils than GCSE grades and still be “large”!
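The extreme case is easy to verify. The sketch below is my own illustration, not code from the report: the function names are hypothetical, and the historic cohort is treated as a single number, since how it is counted across years is not covered here.

```python
LARGE_COHORT_THRESHOLD = 15  # threshold quoted above


def effective_cohort_size(current, historical):
    """Harmonic mean of the current and historical cohort sizes."""
    return 2 * current * historical / (current + historical)


def is_large(current, historical):
    """True if the cohort counts as 'large' under the harmonic-mean rule."""
    return effective_cohort_size(current, historical) >= LARGE_COHORT_THRESHOLD


print(effective_cohort_size(8, 120))  # 15.0
print(is_large(8, 120))               # True: 8 pupils this year is "large"
print(is_large(7, 120))               # False
```

Because the harmonic mean is pulled towards the smaller of the two numbers, a tiny current cohort needs a very large historical cohort to cross the threshold, but it can be done.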
4. Imputed marks fill grade ranges
As the penultimate step in the Ofqual algorithm, “imputed marks” are calculated for each student – a kind of proxy mark equally spaced between grade end-points. So, for example, if Centre B only has one student heading for a Grade C at this stage then – by definition – it’s a mid-C. If they had two Grade C students, they’d be equally spaced across the “C spectrum”. This means that in the next step of the algorithm, cut-score setting, these students are vulnerable to changing grades. For centres which tend to fill the full grade range anyway, this may not be an issue. But I worry that we may see some big changes at the edges of centre distributions as a result of this quirk.
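To make the spacing concrete, here is a minimal Python sketch. The exact spacing rule is not reproduced in this post, so placing student i of n at the fraction i/(n+1) across the band is my assumption, picked because it puts a lone student at the midpoint; the function name and the 0-to-1 band are also hypothetical.

```python
def imputed_marks(n_students, lower, upper):
    """Spread n_students evenly inside the band (lower, upper), best rank first.

    Assumption: positions are i / (n + 1) of the way down from the top of the
    band, so nobody sits exactly on a grade end-point.
    """
    width = upper - lower
    return [upper - width * (i + 1) / (n_students + 1) for i in range(n_students)]


print(imputed_marks(1, 0.0, 1.0))  # one student: the midpoint of the band
print(imputed_marks(2, 0.0, 1.0))  # two students: one third in from each end
```

Under this assumption, the more students a centre has in a grade, the closer the top and bottom ones sit to the band edges, and so the more exposed they are when the cut-scores are set in the next step.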
5. No uncertainty quantification
Underlying many of these concerns is, perhaps, a more fundamental one. Grades awarded this year come with different levels of uncertainty, depending on factors like how volatile attainment at the centre has been in the past, the size of the cohorts, known uncertainty in grading, etc. Yet none of this is visible in the awarded grade. In practice, this means that some Grade Cs are really “B/C”s while some are “A-E”, and we don’t know the difference. It is not beyond possibility to quantify the uncertainty – in fact I proposed awarding grade ranges in my original consultation response to Ofqual. This issue has been raised independently by the Royal Statistical Society and, even for normal exam years, given the inherent unreliability of exam grades, by Dennis Sherwood. For small centres, rather than taking the statistically reasonable approach of widening the grade range, Ofqual have had to revert to teacher-assessed grades, because only a single grade with unquantified uncertainty is awarded. This leads to an unfair “mix and match” system where some centres have had their teacher-assessed grades awarded while some haven’t.
What Must Happen Now?
I think everyone can agree that centres need to immediately receive all the intermediate steps in the calculations of their grades. Many examinations officers are currently scratching their heads, after having received only a small part of this information. The basic principle must be that centres are able to recalculate their grades from first principles if they want to. This additional information should include the proportion of pupils in both historical and current cohorts with matched prior attainment data for each subject and which decile each student falls into, the national transition matrices used for each subject, the values of and for each subject / grade combination, the imputed marks for each 2020 student, and the national imputed mark cut-points for each grade boundary in each subject.
At a political level, serious consideration should now be given to awarding teacher-assessed grades (CAGs) this year. While I was initially supportive of a standardisation approach – and I support the principles of Ofqual’s “meso-standardisation” – I fear that problems with the current standardisation algorithm are damaging rather than preserving public perception of A-Level grades. We may have now reached the point that the disadvantages of sticking to the current system are worse than the disadvantages of simply accepting CAGs for A-Levels.
Ofqual states in its report that “A key motivation for the design of the approach to standardisation [was] as far as possible [to] ensure that a grade represents the same standard, irrespective of the school or college they attended.” Unfortunately, my view is that this has not been achieved by the Ofqual algorithm. However, despite my concerns over Ofqual’s algorithm, it is also questionable whether any methodology meeting this objective could be implemented in time under a competitive education system culture driven by high-stakes accountability systems. Something to think about for our post-COVID world.