Thoughts on Comparative Judgement

This is just a quick post to collect some thoughts on (Adaptive) Comparative Judgement. Recently it seems to have gotten a lot of traction in the UK education blogosphere, to the extent that even the minister and experts for the Education Select Committee are already mentioning it, and the NAHT already mention it. I think the technique, originally Thurstone and in the adaptive form by Pollitt technically is great but I think the advantages might be exaggerated, certainly on a national, summative scale. Nevertheless, I hardly see any critical voices, except for this blog, which is excellent on this and being an arch-skeptic I’d thought I’d write down some of my thoughts.

Firstly it is important to take into account to the nature of the tasks being assessed. For example the subject. If we are talking about some subjects, mark-schemes are perfectly fine and pretty reliable (comment on what we mean by that later on); it is with more subjective tasks there are challenges. So perhaps maths less of a problem than say an English essay. This Ofqual review is sometimes referenced, and it gives a far more nuanced position on reliability of marking, although they base a lot on Meadows and Billington (2005).
But even then, as the Ofqual review of marking reliability shows, there are decades of procedures and actions you can take to get increased validity and reliability. It seems as if the idea has taken hold that we have marked unreliably for decades. The question is not whether it’s unreliable or not, but whether reliability was enough (note that this also means you need a good conception of what reliability and validity are, not always sure of that).
The ‘enough’ question partly depends on what you want to do with it, I guess. The more high stakes the nature of the assessment the more important it is. One could even see a ‘cut’ between whether you use the assessment for formative or summative purposes. I do not see enough discussion about these aspects.
In the discussion it also is important to say what ‘reliability’ is any way. Agreeing on a rank (“I think A is better than B”) is different to agreeing on a quantification of the mark (“A is 80, B=60”). To compare ‘like with like’ you need to use the same type of measure.
Some CJ literature has shown than some of the challenges of traditional marking of course still apply: for example the influence of the length of assessments, or multidimensional nature (also note the potential subject differences again). The assumption that you just ‘say which is the better work’, holistically, and it then will lead to a lot of agreement (statistically) seems tricky to me. Even if your conception of the “better piece of writing”, “OK to go with your gut instinct”, is clear (is it? Can’t groups engage in a groupthink?), it still is important to know what to look for e.g. originality, spelling, handwriting style. Certainly if at one point you do want to give students some feedback or exemplars of higher scoring work. This also touches the ‘summative/formative’ issue.
Which leads me to think that the ‘old’ situation is painted too much as ‘not good enough’ and the new one as improving many things, in the summative sense.
Certainly if we take into account claims like ‘more efficient’, ‘less time’ and ‘reduces workload’, I think it’s too facile to say we can get all that AND more reliable.

I think the developments around ‘comparative judgement’ at a potential policy level are going far too quickly, and are based on some misconceptions of validity and reliability, and -maybe most of all- purported time-savings and workload reduction. In my view ‘more efficient’ and ‘more reliable’ aren’t reasons why you would want CJ (or actually ACJ), as in my view these advantages over traditional assessments only might hold for only a small part of summative assessments, namely those that are relatively short, uni-dimensional and subjective (e.g. a short English essay). And even then it pays to check the costs involved, from scanning work, all the training courses, the marking time etc. Simply suggesting that at one point those costs will all go and you are left with a nice 30-second comparison, is not really painting the full picture. We always need to ask that age-old important social media question ‘is that worth the opportunity cost’. This does not mean we couldn’t continue trying out piloting and testing (producing evidence) of exactly the issues I’ve just mentioned. And if we’re doing that any way, we might actually look at some applications that look more promising to me than a limited, national, summative application:

As tool for Continuing Professional development with teams of teachers, to increase awareness of marking practices. We sometimes actually did that with our department when I was teaching in the Netherlands.
Meso-moderation: at the level of groups of schools.
Experimenting with assessment types not yet used, for example open-ended questions in maths.
Peer assessment. The evidence base could be linked to emerging research on peer assessment and formative practices.

Finally, what I always would like is that pilots and experiments (no policy changes please!) are facilitated, and do not involve promoting paid services at this point. Apart from No More Marking (whom I think have started to charge?), maybe the open source platform featured by this Belgium project D-PAC might be interesting.

From what I can gather, the idea is to use this to get data on writing at the end of KS2 because of the claim that ‘teacher assessment of writing at KS2 is unreliable’. This seems to ignore the value of the moderation part of the process, and it also seems to confuse agreement about what ‘looks good at a glance’ with the validity of what people are agreeing on. If you only glance at writing, you are bound to prioritise neatness and length at least to some extent. This would hold I guess unless you compare samples of the same size, which have all been typed up (which would seem to negate the idea that it is faster or cheaper to do). It’s good to see some voices questioning this as it appears to be rapidly turning into a policy without much exploration of whether it’s a good one.

6 replies on “Thoughts on Comparative Judgement”

I think many recent experiments have taken place in Primary Writing so probably, yes. I think the idea ‘what is better’ and accompanying ‘holistic’ approach is indeed what’s key here. Also how it is described: as moderation. But as it stands it suggests to do more, basically attribute marks as well, based on many pair-wise comparisons: marking and moderation in one. I think if groups of schools/teachers manage to moderate some of their writing teacher assessments this way, including feedback and with added CPD bonus, that might be useful, even if it takes a bit more time. But steps towards ‘the future of assessment’, nationally and summatively, when efficiency claims are arbitrary, and reliability depends so much on the assessment items, is unwise.

[…] this book in the first place: improving summative assessments through comparative judgement (CJ). This previous post, which I wrote right after reading this chapter, asks some questions about CJ. The chapter starts […]

[…] for the quality model was far too much geared towards comparative judgement, an approach that in my view has limited scope; descriptor-based assessments can still play a role, especially in relation to exemplars. What the […]

[…] Comparative Judgement is worth examining (critically), but (i) no silver bullet, (ii) probably only applicable for niche objectives, (iii) several pressing questions still to ask, (iv) maybe its strength lies even more in the formati…. […]

[…] constantly challenging myself with regard to Comparative Judgement. In a first blog I explained why I think there might be some better reasons to use it than ‘efficient’, […]

Share this:

By cbokhove

6 replies on “Thoughts on Comparative Judgement”

Leave a comment Cancel reply