Michael’s comments have made me reflect a little further on questions of ordering of attainment, i.e. when can we say that one pupil has attained more than another? I’d like to explain this issue from an abstract mathematical perspective.

One option is not to compare attainment at all, but generally this is the kind of summary information that is of great value to leaders, so let’s assume for the moment that we do want to devise a way to talk about better or worse attainment. Let’s further assume that individual curriculum statements are indivisible: a pupil either “gets / can do” or “doesn’t get / can’t do” a particular concept / activity. To make things really simple, let’s also only consider ordering of attainment within a year group. Each year’s programme of studies contains a number of statements $s_1, s_2, \ldots, s_n$; call the set of them $S$.

Each child’s attainment is actually a subset of these statements – those achieved. Clearly a property we would like to have between two attainments $a$ and $b$ is that $a \subseteq b \Rightarrow a \leq b$, that is, if child b can do everything in the curriculum that child a can do then child b’s attainment is at least as high as child a’s. Let’s call orders with this property “meaningful orders”.

We could actually define $a \leq b \Leftrightarrow a \subseteq b$ and be done. Yet this leads to many incomparabilities, which Michael alludes to. Is the child that can do $s_1$ and $s_2$ higher or lower attaining than the child that can do $s_3$, $s_4$ and $s_5$? One option is to live with this. These children are genuinely incomparable in attainment, and that’s all there is to it. I think the intuitive adoption of such an ordering by set inclusion, which is only a partial order (https://en.wikipedia.org/wiki/Partially_ordered_set), is probably partly behind the rejection of the idea that an average February attainment is distinguishable from an average October attainment.

But there are other meaningful orders that could be defined, such as ordering by cardinality: $a \leq b \Leftrightarrow |a| \leq |b|$, which is effectively what I’ve argued for above and is in wide use in primary education. There are many, many other choices of course. The advantage of this one is that it is a total order (https://en.wikipedia.org/wiki/Total_order) – strictly a total preorder, since two different attainments of the same size tie – with all the nice mathematical properties that brings, properties heavily used throughout the primary education system. The disadvantage, as Michael implies, is that this is somewhat artificial.

Take your pick: summary but artificial, or the inability to summarise but totally real. In practice, schools often choose the former for SLT and governors and the latter for classroom teachers. Perhaps with good reason!
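To make the two orders concrete, here is a minimal Python sketch. The statement labels (`s1` … `s5`) and the two example children are hypothetical, chosen only to illustrate the point: under set inclusion the two children cannot be compared, under cardinality they can, and both orders satisfy the “meaningful” property on a small set of statements.

```python
from itertools import chain, combinations


def subset_leq(a, b):
    """Order by set inclusion: a <= b iff b contains everything in a."""
    return a <= b


def cardinality_leq(a, b):
    """Order by cardinality: a <= b iff a has no more statements than b."""
    return len(a) <= len(b)


def comparable(leq, a, b):
    """Two attainments are comparable if the order relates them either way."""
    return leq(a, b) or leq(b, a)


def all_attainments(statements):
    """Every possible attainment, i.e. every subset of the statements."""
    items = list(statements)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def is_meaningful(leq, statements):
    """Check the property: a is a subset of b implies a <= b."""
    subsets = all_attainments(statements)
    return all(leq(a, b) for a in subsets for b in subsets if a <= b)


# Hypothetical attainments: one child has two statements, the other three.
child_a = frozenset({"s1", "s2"})
child_b = frozenset({"s3", "s4", "s5"})

print(comparable(subset_leq, child_a, child_b))       # incomparable by inclusion
print(comparable(cardinality_leq, child_a, child_b))  # comparable by count
print(is_meaningful(cardinality_leq, {"s1", "s2", "s3"}))
```

The exhaustive check in `is_meaningful` is only practical for a handful of statements, since the number of subsets doubles with each one added; it is there to illustrate the definition, not as something to run on a real curriculum.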


One issue you highlight is that these data are often interpreted by people without the statistical skills to understand the uncertainty and its implications. I agree. But I think there is no way to avoid uncertainty – it is inherent, even with testing. So really we should be upskilling our school leaders and our governors, in my view.

The other issue worth picking up relates to the statement “I don’t believe that any teacher could reasonably define the difference between an October average writer and a February average writer”. If we use the coverage metric, wouldn’t an October average writer be one with a small number of statements under his / her belt while the February writer would be one with about half the statements under his / her belt? This goes back to my initial point – I agree these positions are likely to be indistinguishable with respect to individual statements, but surely not with respect to the proportion of such statements achieved? (I will post separately below on notions of mathematical ordering of curriculum coverage, which touches on this point but is probably a little off on a tangent!)

I of course agree with you about a governing body hypothetically drawing conclusions about quality of teaching. But I would suggest that a good set of governors would not be using a lower proportion in one class than another to draw conclusions over quality of teaching but rather as a basis for discussion with the subject coordinator, as a prompt to ask “why is the proportion lower in Class A?” To which, I’m sure, any subject coordinator worth their salt will begin by answering in the way you have!

To answer the specific question you ask: “if we measured heights of two classes of children every 6 weeks, and one seemed to fall behind, at what point would we feel that there was enough statistical significance to warrant drawing any conclusions?” The standard statistical answer would be: when the probability of a difference at least that large arising purely by chance (due to a random allocation of children to classes combined with the inherent variation in children’s growth) is smaller than some significance level – typically 5% or 1%. I don’t know what that height difference is, but it can be calculated, because we know the distributions. So rigorous statistics can be applied. I don’t see the educational setting as any different, except – as you point out – the variations will be higher (including the measurement error).
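As a rough illustration of the “purely due to chance” calculation, here is a sketch of a permutation test in Python. This is one possible approach rather than the only one: it does not rely on knowing the distributions, but simply re-allocates children to classes at random many times and counts how often the gap between class averages is at least as large as the one observed. The function name, the number of resamples and the height-gain figures are all invented for illustration.

```python
import random


def permutation_p_value(class_a, class_b, n_resamples=5000, seed=0):
    """Estimate how often a random re-allocation of children to two classes
    gives a difference in means at least as large as the observed one."""
    rng = random.Random(seed)
    n_a = len(class_a)
    observed = abs(sum(class_a) / n_a - sum(class_b) / len(class_b))

    pooled = list(class_a) + list(class_b)
    n_b = len(pooled) - n_a
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random allocation of children to classes
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1

    # Add one to numerator and denominator so the estimate is never zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


# Made-up growth over one half-term, in millimetres, for two small classes.
growth_a = [7, 9, 6, 8, 10, 7, 8, 9, 6, 8]
growth_b = [6, 8, 7, 7, 9, 6, 8, 7, 5, 8]

p = permutation_p_value(growth_a, growth_b)
print(f"estimated p-value: {p:.3f}")
```

One would then compare the estimated p-value with the chosen significance level (5% or 1%) before reading anything into the gap between the classes.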


1. I think the issue about accuracy is key. If you fancy running a scorable test every term, then that seems fine to me. However, even with something so precise, as you say, at the individual level there are risks. At cohort level it then might become useful, although most primary cohorts are small – some very small – so such data again is hazardous (and bear in mind that we’re talking about largely untrained statistical eyes here!). More to the point, in fact most such judgements are a broad teacher assessment judgement (or worse, a calculated total of a number of teacher assessment judgements). I think it unwise to be trying to draw such regular inferences from such fuzzy information.

Perhaps more regular data on a specific task is useful, e.g. knowledge/recall of number bonds. Part of the problem here is that we’re often talking about broad judgements across a fairly large domain (e.g. Writing) and then trying to imagine a category system that shows a step of progress every 6 weeks. I don’t believe that any teacher could reasonably define the difference between an October average writer and a February average writer, let alone a December one.

That’s the bit I have trouble with. I collect near daily data on tables tests, but I don’t imagine that the child who scored 56 yesterday and 54 today has got worse, any more than the reverse would suggest improvement. Perhaps my biggest concern is the sense of precision which doesn’t really exist. Which links to point 2.

2. With height we can very accurately measure children, comfortably to the nearest 1/2cm, perhaps the nearest millimetre – but as you say, we would only be gravely concerned by a 15cm difference. Medical practitioners might choose to monitor someone who was 10cm shorter than expected more regularly. Even that height difference represents well over a year of typical growth.

We come back again to the cohort issue. Perhaps over large groups we might iron out some of the issues, but if we measured heights of two classes of children every 6 weeks, and one seemed to fall behind, at what point would we feel that there was enough statistical significance to warrant drawing any conclusions? We certainly wouldn’t base it on a single half-term’s data, I’m sure you’d agree?

3. Yes, this is the key. And perhaps my main point here is that by attempting to draw conclusions so regularly, we add to the likelihood of error. It would concern me if a governing body were drawing conclusions about the quality of teaching in any given class because in one class 45% of children had reached step 18, but in a parallel class only 40% had. That’s quite possibly 2 children, each lacking one concept. I would consider that to be well within the likely margin of error for such judgements. The level of precision we are able to attain at such small scale doesn’t merit the effort it might take, while the risks of collecting such data and sharing it with those who are not statistically-minded (such as the vast majority of those in education!) are far greater. Not only is it probably not useful, I’d argue that it could be positively damaging.
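To put a rough number on the 45% versus 40% example, here is a back-of-the-envelope sketch in Python using a standard two-proportion z-test. The class size of 40 is an assumption made purely so that the percentages correspond to whole children (18 versus 16, i.e. a gap of 2 children); the normal approximation behind the test is itself crude at this scale, so treat the result as indicative only.

```python
import math


def two_proportion_p_value(successes_1, n_1, successes_2, n_2):
    """Two-sided p-value for a difference between two proportions,
    using the pooled normal approximation (a standard z-test)."""
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / std_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Assumed class size of 40: 45% is 18 children, 40% is 16 children.
p_value = two_proportion_p_value(18, 40, 16, 40)
print(f"p-value for 45% vs 40% in classes of 40: {p_value:.2f}")
```

Under these assumptions the p-value comes out far above the usual 5% threshold, which is consistent with the view that a gap of this size sits well within the margin of error.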
