This week was A-Level results day. It was also the day that Ofqual published its long-awaited standardisation algorithm. Full details can be found in the 319-page report. In this blog post, I’ve set down my initial thoughts after reading the report.

## Prelude

I would like to begin by saying that Ofqual was not given an easy task: produce a system to devise A-level and GCSE grades without exams or coursework. Reading the report, it is clear that they worked hard to do the best they could within the confines in which they operate, and I respect that work. Nevertheless, I have several concerns to share.

## Concerns

**1. Accounting for Prior Attainment**

The model corrects for differences between the prior attainment of historical cohorts and that of the 2020 cohort in the following way (after first accounting for any learners without prior-attainment measures). For any particular grade, the proportion to be awarded is the historical proportion at that grade adjusted by a factor the report writes as q_kj − p_kj: the difference between the grade proportions predicted from the 2020 cohort’s prior attainment and those predicted from the historical cohorts’ prior attainment. (See pp.92–93 of the report, which incidentally contain a typo in this formula.) As noted by the Fischer Family Trust, it appears that this factor is based solely on national differences in value added, and this could cause a problem. Illustrating it requires an artificial example. Imagine that Centre A has a historical transition matrix looking like this – all 200 of its students have walked away with A*s in this subject in recent years, whether they were in the first or second GCSE decile (and half were in each). Well done Centre A!

| GCSE Decile | A* | A |
|-------------|----|---|
| 1 | 100 | 0 |
| 2 | 100 | 0 |

Meanwhile, let’s say the national transition matrix looks more like this:

| GCSE Decile | A* | A |
|-------------|-----|-----|
| 1 | 90% | 10% |
| 2 | 10% | 90% |

Let’s now look at 2020 outcomes. Assume that this year Centre A has an unusual cohort: all of its students have second-decile prior attainment. It seems natural to expect that it would still get mainly A*s, consistent with its prior performance, but this is not the outcome of the model. Instead, its historical distribution of 100% A*s is adjusted downwards because of the *national* transition matrix. The proportion of A*s at Centre A is reduced by 40 percentage points – now only 60% of its students will get A*s! This happens because the national transition matrix predicts 50% A* for a 50/50 split of Decile 1 and Decile 2 students (the historical mix) but only 10% A* for a Decile 2-only cohort (the 2020 mix), resulting in a downgrade of 40 percentage points.
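The arithmetic above can be sketched in a few lines of Python. This is my own toy reconstruction of the adjustment as described, not Ofqual’s code; the matrices and centre figures are the artificial ones from the example.

```python
# National transition matrix from the artificial example:
# P(grade | GCSE decile).
national = {1: {"A*": 0.90, "A": 0.10},
            2: {"A*": 0.10, "A": 0.90}}

def predicted_Astar(decile_shares):
    """A* proportion the national matrix predicts for a given cohort mix."""
    return sum(share * national[d]["A*"] for d, share in decile_shares.items())

historical_mix = {1: 0.5, 2: 0.5}   # Centre A's past cohorts: half each decile
current_mix = {2: 1.0}              # Centre A's 2020 cohort: all Decile 2

p = predicted_Astar(historical_mix)  # prediction from historical mix: 0.50
q = predicted_Astar(current_mix)     # prediction from 2020 mix: 0.10
adjustment = q - p                   # -0.40

centre_historical_Astar = 1.0        # Centre A historically awards 100% A*
adjusted = centre_historical_Astar + adjustment
print(adjusted)  # 0.6 - only 60% A*s, despite a 100% A* track record
```

The centre’s own (perfect) transition history never enters the calculation: only the national matrix and the change in cohort mix do, which is exactly the concern.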

**2. Model accuracy**

Amongst the various possible standardisation options, Ofqual evaluated accuracy by trying to predict 2019 exam grades and seeing how well the predictions matched the grades actually awarded. This immediately presents a problem: no rank orders were submitted for 2019 students, so how is this possible? The answer provided is that “*the actual rank order within the centre based on the marks achieved in 2019 were used as a replacement*”, i.e. 2019 marks were back-fitted to rank orders. This only provides a reasonable idea of accuracy if we assume that teacher-submitted rank orders in 2020 would correspond exactly to the mark orders of their pupils, as noted by Guy Nason. Of course this will not be the case, so the accuracy estimates in the Ofqual report are likely to be significant overestimates. And they are already not great, even under a perfect-ranking assumption: Ofqual report that only 12 out of 22 GCSE subjects were accurate to within one grade, with some subjects predicting the attained grade only 40% of the time – so one is left wondering what the accuracy might actually be for 2020 once rank-order uncertainty is taken into account.

There may also be a systematic variation in the accuracy of the model across different grades, but this is obscured by using the probability of successful classification across any grade as the primary measure of accuracy. Graphs presented in the Ofqual report suggest, for example, that the models are far less accurate at Grade 4 than at Grade 7 in GCSE English.

**3. When is a large cohort a large cohort?**

A large cohort – and therefore one for which the statistical model, rather than teacher-assessed grades, determines outcomes – is defined in the algorithm to be one with at least 15 students. But how do we count these 15 students? The current cohort, the historic cohort, or something else? The answer is given in Ofqual’s report: the harmonic mean of the two. As an extreme example, a centre’s cohort can be considered “large” with only 8 pupils this year – so long as it had at least 120 in the recent past. It seems remarkable that a centre could have fewer pupils than there are GCSE grades and still be “large”!
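As a quick check of the arithmetic, here is my own sketch of the “large cohort” test as described above; the threshold of 15 is from the report, while the function names are mine.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two cohort sizes."""
    return 2 * a * b / (a + b)

def is_large(current_n, historical_n, threshold=15):
    """Is this cohort 'large' under the harmonic-mean rule?"""
    return harmonic_mean(current_n, historical_n) >= threshold

# 8 pupils this year, 120 in the historical window:
print(harmonic_mean(8, 120))  # 15.0 - exactly at the threshold
print(is_large(8, 120))       # True: statistical grades, not CAGs
```

Because the harmonic mean is dominated by the smaller of the two numbers, 120 historical pupils is the minimum that lets a current cohort of 8 scrape over the threshold: 2·8·120 / (8+120) = 15.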

**4. Imputed marks fill grade ranges**

As the penultimate step in the Ofqual algorithm, “imputed marks” are calculated for each student – a kind of proxy mark, with students spaced equally between grade end-points. So, for example, if Centre B has only one student heading for a Grade C at this stage then – by definition – that student is a mid-C. If it had two Grade C students, they would be equally spaced across the “C spectrum”. This means that in the next step of the algorithm, cut-score setting, these students are vulnerable to changing grades. For centres which tend to fill the full grade range anyway, this may not be an issue. But I worry that we may see some big changes at the edges of centre distributions as a result of this quirk.
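One plausible reading of this spacing rule, sketched in Python. The grade end-points (60 and 70) are made up for illustration, and the exact spacing formula is my inference from the description above, not Ofqual’s published one.

```python
def imputed_marks(lo, hi, n):
    """Place n students' imputed marks evenly across the interval [lo, hi].

    A single student lands exactly at the midpoint; two students sit at the
    1/4 and 3/4 points; larger groups push students ever closer to the
    grade boundaries at lo and hi.
    """
    width = hi - lo
    return [lo + width * (2 * k + 1) / (2 * n) for k in range(n)]

print(imputed_marks(60, 70, 1))  # [65.0] - a lone Grade C student is mid-C
print(imputed_marks(60, 70, 2))  # [62.5, 67.5]
print(imputed_marks(60, 70, 5))  # [61.0, 63.0, 65.0, 67.0, 69.0]
```

Note how, with five students, the outermost imputed marks sit only one mark from each boundary – which is why students at the edges of a well-filled grade band are the ones most exposed when the national cut-scores move.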

**5. No uncertainty quantification**

Underlying many of these concerns is, perhaps, a more fundamental one. Grades awarded this year come with different levels of uncertainty, depending on factors like how volatile attainment at the centre has been in the past, the size of the cohorts, known uncertainty in grading, and so on. Yet none of this is visible in the awarded grade. In practice, this means that some Grade Cs are really “B/C”s while some are “A–E”s, and we don’t know the difference. It is not beyond possibility to quantify the uncertainty – in fact, I proposed awarding grade ranges in my original consultation response to Ofqual. This issue has been raised independently by the Royal Statistical Society and – even for normal exam years, given the inherent unreliability of exam grades – by Dennis Sherwood. For small centres, rather than the statistically reasonable approach of widening the grade range, the impact of only awarding a single grade with unquantified uncertainty is that Ofqual have had to revert to teacher-assessed grades, leading to an unfair “mix and match” system in which some centres have had their teacher-assessed grades awarded while others haven’t.

## What Must Happen Now?

I think everyone can agree that centres need to receive, immediately, all the intermediate steps in the calculation of their grades. Many examinations officers are currently scratching their heads, having received only a small part of this information. The basic principle must be that centres are able to recalculate their grades from first principles if they want to. This additional information should include:

- the proportion of pupils in both the historical and the current cohort with matched prior-attainment data for each subject, and the decile into which each student falls;
- the national transition matrices used for each subject;
- the values of q_kj and p_kj for each subject/grade combination;
- the imputed marks for each 2020 student; and
- the national imputed-mark cut-points for each grade boundary in each subject.

At a political level, serious consideration should now be given to awarding teacher-assessed grades (CAGs) this year. While I was initially supportive of a standardisation approach – and I support the principles of Ofqual’s “meso-standardisation” – I fear that problems with the current standardisation algorithm are damaging rather than preserving public confidence in A-Level grades. We may now have reached the point where the disadvantages of sticking to the current system are worse than the disadvantages of simply accepting CAGs for A-Levels.

Ofqual states in their report that “*A key motivation for the design of the approach to standardisation [was] as far as possible [to] ensure that a grade represents the same standard, irrespective of the school or college they attended*”. Unfortunately, my view is that this has not been achieved by the Ofqual algorithm. However, despite my concerns over Ofqual’s algorithm, it is also questionable whether any methodology meeting this objective could have been implemented in time within a competitive education culture driven by high-stakes accountability systems. Something to think about for our post-COVID world.

George,

A really measured and informative blog. Thank you.

I have had a nagging feeling since reading the tech docs on Thursday along the lines of your bullet point 1. I don’t want to prejudice your reply by telling you my concern, but could you run your mind past the same scenario with all students in the 2020 cohort being in Decile 1? What happens to the centre results with this fictional cohort that is stronger than the usual cohort?

Cheers

Steve

Thank you. If Centre A’s 2020 cohort were all in Decile 1, then the application of the formula would boost the A* proportion by 40 percentage points. Since this is now over 100%, it gets clamped to 100% (Step X7 (ii) in Annex E of https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/909042/Requirements_for_the_calculation_of_results_in_summer_2020_inc._Annex_E.pdf – see p.38). Hope that helps!

> As an extreme example of this, centre cohorts can be considered “large” with only 8 pupils this year – so long as they had at least 120 in the recent past. It seems remarkable that a centre could have fewer pupils than GCSE grades and still be “large”!

Does that mean that cohorts will also be considered “large” if they had 8 pupils last year and 120 pupils this year? and, if so, that the CAGs for this year’s 120 pupils will be discarded in favour of statistically determined grades based on the results that 8 pupils achieved last year?

It would need to be 8 pupils over the whole historical time window (3 years for A-levels, 2 years for some GCSEs, 1 year for others), rather than just last year. But apart from that, yes. See p.50 of the Ofqual report for analysis of time window for historical data.

Thanks. Although it seems like a rather far-fetched scenario, it’s possible that something very like this happened in some cases. For example, after Ofqual changed the rules about external candidates at the end of April this year, one exam centre took on an extra 500 candidates:

https://www.tutorsandexams.uk/tutors-exams-give-hope-to-over-500-displaced-candidates-amidst-the-covid-19-pandemic/

Does the mark interpolation mean that larger centres/subjects are more vulnerable to downgrading at the final stage? If you only have a few kids in each grade then by definition they won’t be able to get that close to the cut point, right?

That would be my suspicion. Unfortunately, they have not yet released the cut-points used to recalibrate grade boundaries. I asked JCQ and Ofqual on Twitter for these on Friday, without reply yet – though admittedly their social media must be inundated. We need to see these cut-points.

But they do state that the cut-points were set nationally, so I can’t see how this effect can be avoided: “a single prior-attainment matched student mark distribution is formed across exam boards with the cut-score being set against a single national prediction at each grade.”

Agreed, though we don’t yet have any idea how big this effect will be. We should know.

True. It seems a likely cause for anecdotal reports of large centres having their worst grade profile for years though…

Could also be down to my Concern 1, which I think could have major implications. I want the data from Ofqual to be able to estimate the relative impact of these two issues.

Makes sense, thanks. Look forward to hearing more.

Ofqual’s algorithmic calculation of a student’s A-level grade was the wrong solution to the problem in this pandemic, as the 40% downgrading of 280,000 students’ grades testifies. It has also made it very difficult and labour-intensive for a Centre to seek redress for each affected student. The use of CAG predicted grades was the right and pragmatic approach, and to prevent grade inflation the algorithm should have been run on the CAGs; where a Centre was identified as over-inflating grades, the Centre should have been informed of the fact and instructed to redo its CAGs or justify the inflated grades to Ofqual. This lays the responsibility at the Centre’s door, where it should be. It must have been shocking for a student to have all his or her grades downgraded with no explanation. Any self-respecting algorithm would award Ofqual a U grade for Creative Thinking and an A* for Incompetence.

Great piece, thanks. I’m no statistician, but wonder whether the following is a way to clarify the unfairness of the system as a whole, let alone the algorithm applied this year. Is there any way of showing how a ‘student’ would have received different grade outcomes from precisely the same input data simply by being on the roll at different centres? If you’re feeling cheeky, that might be a fun activity!

I welcome yesterday’s decision to award CAGs in line with my recommendation on this post.

Today there has been some discussion about when people became aware that there may be problems with standardisation. I first raised my concerns over standardisation in my post on this site in April (https://constantinides.net/2020/04/16/award-of-gcses-and-a-levels-in-2020), which I then shared with Ofqual via their consultation mechanism. I also expressed my concerns in May, following their consultation response, including highlighting the probable direction of travel over small centres and the inequity this could lead to: https://constantinides.net/2020/04/16/award-of-gcses-and-a-levels-in-2020/#comment-1757. Finally, I submitted comments, including my concerns over the standardisation methodology, to the Education Select Committee inquiry, also in May, which were published in June (https://committees.parliament.uk/work/202/the-impact-of-covid19-on-education-and-childrens-services/publications/written-evidence/).

I would be happy to work with anyone in the future to build more robust systems.

Since this post was published, I’ve pointed several people to the algorithmic detail provided to exam boards here https://www.gov.uk/government/publications/requirements-for-the-calculation-of-results-in-summer-2020 (Annex E). Anyone interested in digging further into the algorithm should supplement their understanding of the algorithm with this document, as some detail is missing from the 319-page interim report I link to in the post. In particular, some nonlinear clamping is performed on the adjusted cumulative grade distribution under certain circumstances. I’m posting this here too, so this is all in one place.
