…the blog that shall not be named
Sometimes I just get carried away a bit. I managed to get an early copy of Daisy Christodoulou’s new book on assessment called Making Good Progress. I read it, and I made notes. It seems a bit of a shame to do nothing with them, so I decided to publish them as blogs (6 of them as it was about 6000 words). They are only mildly annotated. I think they are fair and balanced, but you will only think so if you aren’t expecting an incredulous ‘oh, it’s the most important book ever’ or ‘it is absolutely useless’. I’ve encountered both in Twitter discussions.
PART 3 – DESCRIPTOR- AND EXAM-BASED ASSESSMENT
This part addresses chapters 4 and 5.
Chapter 4 critiques descriptor-based assessments. I think it is important here to distinguish a bad implementation of a good policy or simply a bad policy. It starts by describing ‘assessment with levels’. I notice that the author often takes reading examples, which in principle is fine, but the danger is that we too quickly think it applies to all subjects. I think the chapter does a good job at describing the drawbacks of descriptor-based systems. I do, however, feel that some of them are not less prominent in alternatives presented later. I also get the feeling that apples and oranges are sometimes compared in the ‘descriptive, not analytic’ section because there is no reason to not simply do both. The comment on ‘generic, not specific’ with regard to feedback is spot on, but again there is no reason to not then do both: generic AND more specific feedback, in my opinion. Actually, throughout the book I feel that the novice/expert cut that had so skillfully been exposed, is not taken into account in many of the pages. As reviews of feedback use have shown, the type of feedback (and timing) interacts with levels of expertise. The examples of different questions seem related to their specific goal e.g. on page 94 the question on Stalin can be an excellent multiple choice question on certain knowledge. However, if it was more about relationships of certain events multiple choice questions might give away the game too much. The same with equations: multiple choice questions do not make sense if your aim is to check equation solving skill, but would make sense if you want to check if they can check the correctness of solutions. I think there is some confusion about reliability and validity here, most prevalent in the example on fractions. Yes, the descriptor on fractions is general but that is often part of a necessarily somewhat vague set of descriptors in a curriculum. What Christodoulou then gives as example (page 99) seems to be more about validity and reliability of tests and assessments. Decades of psychometric research have provided insight in how to reliably improve assessment for summative purposes. It feels as if this is under-emphasised. Also, descriptor systems can be made more precise by mark-schemes and exemplars (as, by the way, later on presented in the comparative judgement context). A pattern in the book seems to be that
This could lead to a situation where readers might nod along with the critique but then incorrectly assume the proposed solutions will solve them. I think it is admirable to describe the challenges in this accessible way but would have preferred a more balanced approach. As a case in point, take the ‘bias and stereotyping’ of page 104. This is a real challenge, and rightly so seen as a point to address in descriptor-based assessment. Yet, as said before, there are ways to mitigate these drawbacks. Instead, the case is made that reform is necessary, and later on in the book a ‘solution’ is given that still uses teacher judgements but ‘simpler’ (not really, holistic judgement is not simple per se, only if you have a short uni-dimensional judgement to make, but the condemnation of teacher judgement wasn’t about that, it was about complex judgements). In my view it just ‘pretends’ to be a solution for these well-observed challenges.
Chapter 5 critically assesses another assessment type, namely exam-based assessment. The somewhat exaggerated style is exemplified by the first sentence “we saw that descriptor-based assessment struggles to produce valid formative and summative information”. The chapter first links the exam model to chapter 3’s distinction of the quality and the difficulty model. I am not convinced by the arguments that then try to explain why exam-based (summative) assessments are difficult to use for formative purposes. Sure, they are samples from a domain, but one can simply collect all summative questions on a certain topic or subject, to make valid inferences. Sure, questions differ in difficulty, but there are ways to analyse the difficulty. The comments on page 120 and 121 are fair (hard to say why right or wrong) but I can’t help but think about the ‘solutions’ provided later on with comparative judgement, which uses Rasch analysis and ‘just correct or incorrect’, suffer a same problem (granted, they are presented as ‘summative’ solution). With maths exams there are mark-schemes, so a more fine-grained analysis *is* possible for formative purposes. The chapter *does* provide a nice insight in the difficulties regarding marking and judgement. A third problem, it is suggested, is that marks aren’t designed to measure formative progress. I think again that the book asks some good critical questions, but ultimately too much sends out the message that old practices are bad. From page 130 the author argues there are issues with the summative affordances of exams as well. I think that this section, again with the fractions examples, exaggerates the ‘non validity’ of exams. Testing agencies have developed a raft of tools to make exams valid over years and between samples. Again, the challenges and difficulties are described well, but ways to mitigate the challenges are undermentioned. Further, the suggested ‘modular’ approach is good but is this really new? The next four chapters are about alternative systems.