Sometimes I just get carried away a bit. I managed to get an early copy of Daisy Christodoulou’s new book on assessment called Making Good Progress. I read it, and I made notes. It seems a bit of a shame to do nothing with them, so I decided to publish them as blogs (6 of them as it was about 6000 words). They are only mildly annotated. I think they are fair and balanced, but you will only think so if you aren’t expecting an incredulous ‘oh, it’s the most important book ever’ or ‘it is absolutely useless’. I’ve encountered both in Twitter discussions.
PART 5 – COMPARATIVE JUDGEMENT AND CURRICULUM-LINKED ASSESSMENTS
This part addresses chapter 8.
Chapter 8 addresses the topic I started out reading this book in the first place: improving summative assessments through comparative judgement (CJ). This previous post, which I wrote right after reading this chapter, asks some questions about CJ. The chapter starts by repeating some features for summative assessments. The first is ‘standard tasks in standard conditions’. But it isn’t really about that in the subsequent section, but ‘marker reliability’ (p. 182). The distinction between the previously described difficulty model and quality model, is useful. It is clear to me that essays and such are harder to mark objectively, even with a (detailed) rubric. It is pertinent to describe the difference between absolute and relative judgements. However, when the author concludes “research into marking accuracy…distortions and biases” she again disregards ways to mitigate these issues, even while the referenced Ofqual report does mention them. Indeed, many of the distortions are ‘frustrating’ judgements, and therefore a big danger of rubrics. I, however, find it strange that this risky point of rubrics is disregarded, when comparative judgement, it is suggested, can work with ‘exemplars’. As Christodoulou pointed out on p. 149 there is a danger that students work towards those exemplars. I saw this often in some of the Master modules I taught: a well-scoring exemplar’s structure of sub-headings was (awfully) applied by many of the students, as if they thought that adopting those headers surely had to result in top marks. So in a sense I agree with the author’s critique, I just don’t see how the proposed alternative isn’t just as flawed. Then comparative judgement is described as very promising. It is notable that most examples of its affordances feature English essays. It also is notable that ‘extended writing’ is mentioned, while some examples are notably shorter. The process of CJ is described neatly. I think the ‘which one is better’ is glanced over i.e. ‘in what way’? I also think more effort could have been put in describing the algorithm that ‘combines all those judgements, work out the rank order of all the scripts and associate a mark for each one’ (p. 187). The algorithm is part of the reason why reliability is high: inter-rater reliability can be compared with regard to rank orders; I am not sure if the criticised traditional method is based on rank orders. For instance, if one rater says 45% and another 50% it seems reasonable to say that both raters did not agree. Yet, if we just look at the rank order they might have agreed that one was better than the other. As CJ simply looks at those comparisons, reliability is high. But it’s not comparing like with like. A similar process with one marker and 30 scripts would involve ordering scripts, not marking them.I have to think about several challenges that are mentioned in this older AQA report. I don’t think these challenges have yet been addressed nor discussed.
I think it is also interesting that Christodoulou correctly contrasts with (p. 187) ‘traditional moderation methods’. Ah, so not the assessment method per se, but the moderation. The Jones et al. article is referenced, but the book fails to mention how the literature also mentions several caveats e.g. multidimensionality and length of the assessment. The mentioning of ‘tacit knowledge’ is fine but it is not necessarily tacit knowledge that improves reliability, in my view. It can be collective bias. I think it’s a far stretch to actually see the lack of feedback for the scripts as an advantage, because it ‘separates out the grading process from the formative process’. It even distributes the grading process over a large group of people; to a student it can be seen as an anonymous procedure in the background. Who does the student turn to if he/she wants to know why he/she got the mark she received? Sure, post-hoc you can analyse misfit, but can you really say -as classroom teacher- you ‘own’ the judgement? Maybe that is the reason why it is seen as advantage, but one can rightly so say the exact opposite. It is interesting to note that the Belgium D-PAC process actually seems to embrace the formative feedback element CJ affords. The section ends with ‘significant gain of being able to grade essays more reliably and quickly than previous’. I think the ‘reliably’ should be seen in the context of ‘rank ordering’, length of the work, and multidimensionality. ‘Quickly’ could be seen in more than just the pairwise comparisons (it is clear that they are short, if ‘holistic’ is only needed); but the collective time needed often surpasses the ‘traditional’ approach. ‘Opportunity cost’ comes to mind if we are talking about summative purposes through CJ. I am disappointed that these elements are not covered a bit more. The section, however, ends with what I *would* see as one big affordance of CJ: CJ as way to CPD and awareness of summative and formative marking practices. But this is something different than a complete overhaul of (summative) assessment, because of the limitations:
That’s quite a narrow field of application. And with the desire to stay in the summative realm, in England, only summative KS2 not-too-extended writing seems to be the only candidate (see for formative suggestions the previous blog on CJ). But careful:
In my opinion, page 188 also repeats a false choice regarding rubrics, as the described ‘exemplars’ can also be used with rubrics, not only with CJ. We do that in aforementioned Masters module (with the disadvantage it becomes a target for students). So although I agree this would be ‘extremely useful’, it actually is not CJ. Another unmentioned element is that CJ could be linked to peer assessment. To return to page 105 where bias is seen as human nature, one could argue that a statistical model is used to pave over human bias. In my opinion, this does not mean it’s not there, it’s just masked.
The second half of the chapter addresses curriculum-linked assessments. I don’t understand the purpose of mentioning GL assessment, CEM and NFER apart from using the unrealistic nature of using them to argue ‘we need something in-between’ summative and formative, and then to argue ‘curriculum linked’. As previous chapters, good points are raised but it feels the purported solutions aren’t really solutions; the problems are used to argue *something* must change but not so much why the suggested changes would really make a difference. For example, the plea for ‘scaled scores’ is nice but I would suggest only people who know how to deal with them, should use scaling; simply applying a scaling algorithm might also distort (think of some of the age-related assessments used in EEF studies, or PISA rankings).