Education Education Research

Can research literacy help our schools?

This is the English text of a blog that appeared on a Swedish site (kind translation by Sara Hjelm).

In efforts to debunk education myths there is a real danger that research is oversimplified. This is wholly understandable from the perspective of a teacher: finding and understanding research is a difficult process. The ‘wisdom of the crowds’ might help with this, but it often remains a challenge for all involved to translate complex research findings into concrete recommendations for teachers. It is certainly not the case that teachers can simply adopt and integrate these ideas into their daily practice. Furthermore, you can shout as often as you want that ‘according to research, X should work’, but if it is not working during teaching, you will make adjustments.

Why is it such a challenge for teachers to interpret research findings? As Howard-Jones (2014) indicates, this firstly might be because of cultural conditions, for example with regard to differences in terminology and language (e.g. see Lilienfeld et al., 2015; 2017). An example of this can be seen in the use of the words ‘significance’ and ‘reliability’. Both have a typical ‘daily use’ but also a specific statistical and assessment meaning. A second reason Howard-Jones mentions is that counter-evidence might be difficult to access. A third element might be that claims simply are untestable, for example because they assume knowledge about cognitive processes, or even the brain, that is (as yet) unknown to us. Finally, an important factor we can’t rule out is bias. When we evaluate and scrutinise evidence, a range of emotional, developmental and cultural biases interact with emerging myths. One particularly important bias is ‘publication bias’, which might be one of the biggest challenges for academia in general. Publication bias is sometimes called the ‘file drawer problem’ and refers to the situation in which what you read in research articles is often just the positive outcomes. If a study does not yield a ‘special’ finding, then unfortunately it is less likely to be published.

Because of these challenges, navigating your way through the research landscape is very time-consuming and requires a lot of research knowledge, for example on research designs, prior literature, statistical methods, key variables used and so forth. And even with this appropriate knowledge, understanding research still takes a lot of time. For a quick scan this might be 15 minutes or so, but for the full works you would have to look in detail at the instruments and the statistical methods, or you would have to follow up other articles referenced in a paper, often amounting to hours of work. This is time that busy practitioners haven’t got. Science is incremental, i.e. we build on an existing body of knowledge, and every new study provides a little bit more insight into the issue at hand. One study is most likely not enough to either confirm or disprove a set of existing studies. A body of knowledge can be more readily formed through triangulation and looking at the same phenomenon from different perspectives: ten experimental studies might sit next to ten qualitative studies, economic papers might sit next to classroom studies.

In my view, there are quite a lot of examples where there is a danger that simple conclusions might create new myths or misconceptions. Let me give two of them, which have been popular on social media. The first example is the work by E.D. Hirsch. I think his views can’t be seen separately from the US context. Hirsch is passionate about educational fairness, but the so-called GINI coefficient seems to indicate that systemic inequality is much larger in the US. Hirsch, in my view, also tends to disregard different types of knowledge: he is positive about ‘knowledge’ but quite negative about ‘skills’, for example. However, ‘skills’ could simply be seen as ‘practical knowledge’ (e.g. see Ohlsson, 2011), emphasising the important role of knowledge while still acknowledging that you need more than ‘declarative knowledge’ to be ‘skilled’. In his latest book, Hirsch also contends that a student-centred curriculum increased educational inequality in France, while more recent data and a more comprehensive analysis seem to indicate this is not the case. A second example might be the currently very popular Cognitive Load Theory by Professor John Sweller. Not everyone seems to realise that this theory does not include a view on motivation. Sweller is open about this, and that’s fine of course. It does not, however, mean that motivation is irrelevant. Research needs to indicate what its scope is, and what it does or does not include, and subsequent conclusions need to be commensurate with the research questions and scope. This precision in wording is important, but inevitably suffers from word-count restrictions, whether in articles, blogs or 280-character tweets. There is a tension between brevity, clarity and doing justice to the complex nature of the education context.

Ideally, I think, we can help each other out. We need practitioners, we need academics, we need senior leadership, we need exam boards, we need subject specialists, to all work together. We also need improved incentives to build these bridges. I am hopeful that, if we do that, we can genuinely make a positive contribution to our schools.

Dr. Christian Bokhove was a secondary maths and computer science teacher in the Netherlands from 1998 to 2012 and now is a lecturer in mathematics education at the University of Southampton. He tweets as @cbokhove and has a blog he should write more for at


Howard-Jones, P. (2014). Neuroscience and education: myths and messages. Nature Reviews Neuroscience, 15(12), 817-824.

Lilienfeld, S.O., Pydych, A.L., Lynn, S.J., Latzman, R.D., & Waldman, I.D. (2017). 50 Differences that make a difference: A compendium of frequently confused term pairs in psychology. Frontiers in Psychology,

Lilienfeld, S.O., Sauvigné, K.C., Lynn, S.J., Cautin, R.L., Latzman, R.D., & Waldman, I.D. (2015). Fifty psychological and psychiatric terms to avoid: a list of inaccurate, misleading, misused, ambiguous, and logically confused words and phrases. Frontiers in Psychology,

Ohlsson, S. (2011). Deep Learning: How the Mind Overrides Experience. Cambridge University Press: New York.


Presentation for HoDs mathematics of Trinity group

I gave a presentation about the spatial research I did recently with 85 year 7 pupils.


researchEd presentation on myths about myths

Last weekend I gave a Dutch and an English version of my ‘This is the new myth’ talk. This talk did not come about in some vain attempt to take over the mythical status of other excellent ‘mythbusters’, like Pedro De Bruyckere, Paul Kirschner and Casper Hulshof in their excellent book, but more out of frustration with how some facts opposed to certain myths became simplified beyond recognition, often distorting the original message. In other words, there is a danger that the debunking of myths creates new myths of its own. In this talk I go into how myths might come about, and give some examples, including one on iron in spinach. I then give some examples where I think facts are misrepresented on and in the (social) media. I have mainly chosen themes that are often highlighted by those who strive for a more evidence-informed approach to teaching and in that process purport to combat myths, but then -in my view- give an overly simplistic representation of some research findings. In the talk I cover sources that, for example, purportedly show ‘peer disruption costs students money’, ‘we believe research more quickly if there is a brain picture’, ‘less load is best and so there is no place for problem-based learning and inquiry in education’ and ‘student-centred policies cause inequality’. Maybe there are other robust studies that show this (although I would need to be convinced), but the sources I have observed on the web are almost always misrepresented, in my opinion. I realise that these descriptions *also* simplify these judgements, but the aim is not to focus on the errors per se, but on the fact that we need to be vigilant and aware of the mechanisms behind myth creation.

The slides for the talk are here:

A video of the talk is here:

I recently also saw an article (only in Dutch, I think) that nicely complements my talk and I might integrate some of the sources in a future version.


Educational inequality: old paper by Hanushek

Probably one of the most influential people in OECD policy has been Hanushek. For someone from the Netherlands, the constant ‘bashing’ of selection and ‘early tracking’ has been particularly noteworthy. Mainly because, anecdotally, I feel that system equality is a big factor, and also because ‘despite’ early tracking the Netherlands tends to do reasonably well in large-scale assessments (except, for some years now, TIMSS year 4, which is worrying).

The most often cited paper is this one by Hanushek and Woessmann. The important image is:

I have some issues with the inference that ‘early tracking’ tends to increase inequality based on these data, certainly for the Netherlands.

  1. The data is based on the dispersion of achievement (standard deviation). The Netherlands has the lowest spread in both situations, but contributes to ‘early tracking is bad’ because the SD increases. Yet it is still the lowest of all included countries.
  2. PIRLS and PISA reading are two very different large-scale assessments. PIRLS is published by the IEA and their studies tend to be more curriculum focused, while PISA reading less so. I don’t think you can compare them this way.
  3. This also is hard because, as far as I know, the cross-sectional sampling is different, with one looking at classrooms (PIRLS) and the other schools (PISA). At least, that is the case now. There are several years of schooling between the two measurements, and also the samples are different.
  4. Achievement scores in large-scale assessments are typically standardised around a mean of 500, and standard deviation of 100. Standardising this again to help a comparison of two completely different tests seems rather strange. Especially if you then argue that the *slopes* denote increase or decrease of inequality.
  5. Finally, of course, causation/correlation issues.
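The first objection can be made concrete with a toy calculation. This is only a sketch with invented numbers, not the actual PIRLS/PISA values:

```python
# Sketch of the dispersion comparison behind the Hanushek & Woessmann figure.
# All numbers are invented for illustration; they are NOT the real PIRLS/PISA values.
primary_sd = {"Netherlands": 62, "Germany": 80, "France": 78}     # dispersion at primary level
secondary_sd = {"Netherlands": 70, "Germany": 100, "France": 92}  # dispersion at secondary level

# Change in dispersion per country: a positive value is read as 'inequality increased'.
changes = {c: secondary_sd[c] - primary_sd[c] for c in primary_sd}

# The point of objection 1: a country can show an increase in SD (and so count
# as evidence against early tracking) while still having the lowest SD overall.
lowest = min(secondary_sd, key=secondary_sd.get)
print(changes, lowest)
```

In this invented example the ‘Netherlands’ shows a positive change in SD, yet remains the least dispersed country at both measurement points, which arguably tells a different story than ‘tracking increases inequality’.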

In sum, I think it is an original study, but it is hard to draw firm conclusions from it.



researchEd national conference

On 9 September 2017 I gave a talk at the national researchEd conference in London. The presentation was about how mythbusting might lead to new myths. The presentation covered the following:

  • I started by explaining how myths might come about, by referencing some papers about neuromyths.
  • I then used the case of iron in spinach to illustrate how criticising myths can lead to new myths (paper by Rekdal).
  • I gave examples of some themes that are in danger of becoming new myths.
  • I concluded that it is important to read a lot, stay critical and observe nuance. No false dichotomies please.

I will endeavour to write this up at some point. Slides below.


Hirsch: the case of France


I wanted to do a relatively quick post on something I have been looking at in some tweets. It is related to a part of Hirsch’s book on which I had already written. I think it’s quite clear that I like Hirsch’s emphasis on the ‘low achieving’, although we probably disagree on the role ‘systemic unfairness’ plays in schooling. This post, though, wants to focus on one of the pivotal examples Hirsch presents to argue that a skills-oriented curriculum, contrary to a knowledge-based curriculum, increases unfairness: the case of France from 1987 to 2007 (Loi Jospin). I could probably write pages on ‘knowledge’ versus ‘skills’ (aren’t skills just practical knowledge?), but let’s just assume that these labels are wholly justified. I will also assume, though I find the justification lacking, that Hirsch is right when he says that the amount of funding, buildings etc. did *not* have an influence on this. I think it’s quite difficult to simply ascribe changes to just a change of curriculum, even though I grant him that the curriculum change in France has been vast.

I tried to track down the data Hirsch used. In Appendix II Hirsch refers to documents from the DEPP. The data seems to come from a French department, and coincidentally 2015 data has recently (November 2016) been added to the database. The data is tabulated in this document on page 3. One of the headers states that ‘social inequality is still apparent but remains stable’.

The raw data indicates ‘errors’ made (the column Moyenne denotes the mean number of errors per sub-group, Ecart-type denotes the standard deviation). I did not look at the detail of the standardised tests themselves. The document mentions some limitations, for example that labels have changed a bit over time, but also the massive increase in people not wanting to give the social class of the parents (PCS de la personne responsable d’eleve).

Compared with the graph in Hirsch’s book two things can be seen:

  1. There seem to be more SES (Socio-Economic Status) categories in the data than in the book. My French is a bit rusty but I think at least one large category, probably Employes, is missing. I think that is strange.
  2. The gaps between the different groups seem to have diminished between 2007 and 2015, or -looking from 1987 to 2015- there does not seem to be a SES gap increase, i.e. no increase in unfairness.

To go a little bit further than just ‘having a look’ I then proceeded to create some graphs. I did not create Z-values and I wonder why Hirsch did, as the ‘number of errors’ in the test used is quite a standardised measure. I also tried to use the supplied ‘standard deviations’ to try and replicate the Z-values, but could not get all the numbers matched. Here is the graph Hirsch did, but now with only the errors:


Based on this graph (sorry, I know, labels are missing etc., and I’m embarrassed to say I used Excel) one could indeed conclude that from 1987 to 2007 gaps have increased, although the gap between craftsmen and shopkeepers (‘Artisans, commercants’) and ‘Professions intermediaires’ has decreased. As mentioned before, maybe it’s my command of the French language that is letting me down here. I also plotted the same graph but now with all categories.
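The kind of Z-standardisation I tried to replicate can be sketched as follows. The group means below are invented, not the actual DEPP figures, and the scaling shown is only one plausible reading of what Hirsch did:

```python
# Toy sketch of Z-standardising SES-group error counts ('Moyenne' in the DEPP
# tables would be the mean number of errors per group). All figures are invented.
import statistics

group_means = {"Cadres": 10.2, "Professions intermediaires": 13.5,
               "Employes": 16.1, "Ouvriers": 18.4}

overall_mean = statistics.mean(group_means.values())
# With the real tables one would use the published Ecart-type as the scale;
# here we use the spread of the group means purely for illustration.
overall_sd = statistics.pstdev(group_means.values())

z_scores = {g: (m - overall_mean) / overall_sd for g, m in group_means.items()}
print({g: round(z, 2) for g, z in z_scores.items()})
```

Note that because ‘number of errors’ is already a fairly standardised measure across these tests, it is not obvious what the extra Z-transformation adds, which is exactly why I wondered why Hirsch applied it.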

It seems as if the picture conveyed on page 145 of Hirsch’s book is far less pronounced. Of course, some of these categories are quite small, but in any case, one of the five largest groups has not been included in Hirsch’s book. I wonder what causes this discrepancy; it seems implausible that the use of Z-scores could explain that difference, but I’m open to be proven wrong.

The categories were the first element I wanted to look at; the second is the 2015 data. I plotted the new graphs with the 2015 data included. I did this both for the graph with only 5 categories and for the one with all of them.

It is clear that more errors are being made over the years, but the unfairness (i.e. increasing gaps between the different socio-economic strata) seems hard to maintain. Certainly the argument that tracing the lines backward in time would show positive equity effects (see below) of what Hirsch calls the ‘knowledge curriculum’ seems unlikely. Like the DEPP themselves state, I think it is hard to maintain that the unfairness has increased, at least based on this French data.


Hirsch – notes on ‘Why knowledge matters’

This is a quick post with some thoughts on E.D. Hirsch jr.’s latest book ‘Why knowledge matters’. Of course Hirsch is mentioned a lot by the ‘knowledge oriented’ blogosphere, and I can see why. I also had the feeling his message was somewhat distorted, though, and so set out to read it. It is an ‘ideas rich’ book that could have been presented a bit more coherently, which makes it hard to summarise. Nevertheless, there were numerous interesting points (my interpretation, of course).

Importantly, we need to acknowledge the large role the US context plays; Hirsch is truly interested in inequality and the US has a lot of it. It especially became clear that he is far more communitarian than sometimes depicted. Rather than individualism, he favours community, and that is something that appeals to me a lot. He does not seem overly attached to a *certain* curriculum or certain systems and structures, just as long as there is a coherent, knowledge-oriented curriculum. He gives several favourable examples from Japan, where the system is not like the charter system. I find this slightly ironic in the English context, because I feel that a lot of the communitarian aspect has actually been undermined in the last five years with systems and structures, like academies and free schools, which are even allowed to deviate from the national curriculum. Of course, I know the reactions to this, namely that the curriculum wasn’t fit for purpose, but given the communitarian ideals behind Hirsch’s thoughts I wonder whether getting rid of one, changing the system so it becomes more fragmented, and then starting a campaign to let everyone adopt one particular vision really is a communitarian thing to do. My feeling is that it actually has caused less ‘overall’ community within England, although within certain sub-cultures there is more.

Hirsch is critical of the interpretation of the Coleman Report as ‘it’s not the schools’. I can understand that; it seems that, especially in the US, a coherent curriculum was not on people’s minds. Devolving responsibility away from education is, in my view, not a good thing. Yet I now see the opposite: education as the ‘great equaliser’, allowing governments to get away with not addressing systemic inequality. I think there is plenty of evidence showing that different levels contribute to inequality (or equity): the individual level, family SES, teachers, schools, but also country-level policies.

Another interesting aspect was the numerous times that France featured. I thought the narrative of the introduction of the Jospin law was quite compelling with regard to the decrease in France’s achievement. A weaker point was the impact it supposedly had on increasing inequality. I base this on the latest measurement done in France:

The communitarian aspect returns again, and I appreciate that Hirsch tries to detach the developments from a political colour. He does this, for example, by contrasting the, I would say, centre-left developments in France with the, I would say, centre-right developments in Sweden. I would not, though, say it’s non-political: in both countries the communitarian ideal of a coherent (knowledge) curriculum was undermined, in one by generic skills ideals, in the other by system changes (friskolor; Sweden experts, correct me if I’m wrong).

Quite some space in the book is devoted to ‘educationally invalid testing’. It builds on what is described in the introduction regarding ‘generic skills’. Hirsch really seems to have big problems with the term ‘skills’ and at a certain point (p. 13) says: “Think how significantly our view of schooling might change if suddenly policy makers, instead of using the term skill, had to use the more accurate, knowledge-drenched term expertise.” I can see how one would start to dislike an opaque term like ‘skills’ when people use it interchangeably for all sorts of things. But what’s in a name? If we redefine skills, as for example Ohlsson does, as ‘practical knowledge’, I’m not sure that really makes a difference. Also, the term ‘expertise’ might, according to Hirsch, be ‘knowledge-drenched’, but in becoming an expert surely one needs to practise. We can stop using ‘skills’ when describing practice, but I feel this is mostly semantics: people favour generic skills, so let’s get rid of the word.

The best part of a whole chapter (chapter one) is then used to describe how the use of reading tests in the US is educationally invalid. I’m not sure how much of it can be ascribed to the US situation, but I do sense some overlap with other educational jurisdictions. Hirsch at first seemed to suggest that high-stakes tests were best removed, to allow teachers to pay more attention to the ‘long arc of knowledge acquisition’. I don’t, however, think this should be read as Hirsch being against testing per se, just as long as tests are ‘based on good, knowledge-based standards’ (p. 33). I find Hirsch slightly inconsistent here, apart from the ever-present ‘coherent knowledge based curriculum and standards’.

Hirsch, rightly so, is against scapegoating teachers and goes into Value Added Models. I think it makes sense, and I had to think of the balanced American Statistical Association statement on Value Added. The links, plus a balanced evaluation, can be found here. In another chapter, Hirsch covers the phenomenon of ‘fadeout’, which is challenging for every programme. Some took his mention of Direct Instruction and Success for All to be criticism of direct instruction (small letters), but it’s more about the Engelmann style (capital letters). Project Follow Through makes another appearance, as do the Reggio Emilia schools as an example of the ‘naturalistic’ approach. It is interesting, though, that he mentions that all programmes suffer fadeout; it seems to be the reason why he wants a long-term coherent curriculum. I think that makes sense, but it does make it hard to do evaluations. Hirsch mentions he is not very interested in, for example, Randomised Controlled Trials. I understand his position, but this does contrast with the Core Knowledge evidence base, which is rather mixed.

In sum, I enjoyed the themes in this book, although they are delivered in a fragmented way. I think Hirsch’s aims regarding equality are genuine and noteworthy, and he is clearly fed up with teachers getting the blame. I think he really focuses on ‘a coherent knowledge curriculum’ and not, as some seem to think, on systems and structures. I think his dislike of ‘skills’ being abused has been taken too far, though. At first it seems he’s against testing, but he’s not, as he wouldn’t mind knowledge tests. Interesting ideas; I hope we take them in, and not just pick what suits.


Notes on Making Good Progress – Summary blog

Just because I had written extensive notes, I thought I’d post them as a series of blogs. All blogs together are in this pdf (I might have made some slight changes over time in the blogs, which are not in the pdf).

Part 1 – foreword, introduction, chapter 1
Part 2 – chapters 2 and 3
Part 3 – chapters 4 and 5
Part 4 – chapters 6 and 7
Part 5 – chapter 8
Part 6 – chapter 9 and conclusion

In conclusion, I think that if a teacher wants to read a timely book with a lot of interesting content on assessment, they would do well to read this one. They should, however, read it with the frame of mind that in places the situation is presented somewhat one-sidedly: in my view too negative about the ‘old’ situation and too positive about alternative models. Teachers can profit from the book, but it can also mean that they miss out on decades of unmentioned research on curriculum, psychometrics and assessment. I would therefore encourage them to (i) read the book, (ii) follow up the references and (iii) also read a bit more widely. Of course, one cannot write a 1000-page ‘accessible’ book, but given the number of footnotes a bit more depth in some places would have been good. Particular points are:

  • Yes, the implementation of Assessment for Learning (AfL) has been problematic. The book covers some of the research on the importance of feedback, but not enough prior research is covered.
  • I recognise the generic versus domain-specific skills discussion, but in my view it is presented in too dichotomous a way. There is more than Willingham, for example Sternberg and Roediger on critical thinking. In addition, linking it to certain assessment practices (e.g. teaching to the test) is unevidenced. There also exist fair criticisms of deliberate practice.
  • The introduction of a quality and difficulty model is useful but again rather binary.
  • Reliability and validity are covered, but only quite superficially (types of validity, threats to validity etc.), and reliability -in my view- is not covered correctly: the example with 1kg on a scale is an example of a reliable AND valid measurement and does not tease out the essential test-retest characteristic of reliability.
  • Yes, there are problems with descriptor-based assessments but there is a raft of research addressing their validity and reliability.
  • The progression model makes sense but haven’t people been doing this for decades? (e.g. in good textbooks).
  • The attention given to the testing effect, spaced practice, multiple choice questions is well done.
  • Comparative Judgement is worth examining (critically), but (i) no silver bullet, (ii) probably only applicable for niche objectives, (iii) several pressing questions still to ask, (iv) maybe its strength lies even more in the formative realm.
  • The proposed integrated system describes what is already in place, with a plea to collaborate. This is good, but we must realise that the fact it has not worked out over the years is mainly, in my view, a funding issue.
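The reliability point in the bullets above can be illustrated with a quick sketch. The scale readings are invented, purely to show that consistency over repeated measurements (reliability) and closeness to the true value (validity) can come apart:

```python
# The 1kg-on-a-scale point: reliability (consistency across repeated
# measurements, the test-retest idea) is not the same as validity
# (closeness to the true value). All readings are invented.
import statistics

true_weight = 1.0  # kg
consistent_but_off = [1.20, 1.21, 1.20, 1.19, 1.20]  # reliable, NOT valid
accurate_but_noisy = [0.80, 1.25, 0.95, 1.15, 0.85]  # valid on average, NOT reliable

def spread(readings):   # low spread = high test-retest reliability
    return statistics.pstdev(readings)

def bias(readings):     # small bias = high validity
    return statistics.mean(readings) - true_weight

print(spread(consistent_but_off), bias(consistent_but_off))
print(spread(accurate_but_noisy), bias(accurate_but_noisy))
```

A scale that always reads 1.20 kg for a 1 kg weight is highly reliable but not valid; this is the distinction the book’s example does not tease out.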

One might wonder ‘why mention this, it’s great that this topic gets some attention?’, but I simply have to refer to what the author states towards the end of the book: ‘Assessment is a form of measurement’ and ‘flawed ideas about assessment have encouraged flawed classroom practice’ (p. 212). If these are the main aims behind the book, then it surely increases awareness, but without covering the basics more I fear we don’t get the complete picture. Overall, I would say it’s an interesting, good book, but not outstanding. 3.5*.


Notes on Making Good Progress – Conclusion

Sometimes I just get carried away a bit. I managed to get an early copy of Daisy Christodoulou’s new book on assessment, called Making Good Progress. I read it, and I made notes. It seems a bit of a shame to do nothing with them, so I decided to publish them as blogs (six of them, as it was about 6000 words). They are only mildly annotated. I think they are fair and balanced, but you will only think so if you aren’t expecting an incredulous ‘oh, it’s the most important book ever’ or ‘it is absolutely useless’. I’ve encountered both in Twitter discussions.


This part addresses chapter 9 and the conclusion.
Finally, chapter 9 tries to tie several things together in one ‘integrated assessment system’. There are no references in this chapter. Many elements have already been discussed, for example the ‘progression model’, which did not seem to offer really new insights (at least to me). Lesson plans and schemes of work appear out of the blue, together with ‘curriculum’. I agree that textbooks would be most helpful here. Another element is a ‘formative item bank’. Again, very useful, and there are already plenty out there. I am not sure the summative item bank would need to be a different bank; it is just the way the items are used and compiled into valid, rigorous summative assessments that needs scrutiny. I felt the ‘summative item bank’ for the quality model was far too much geared towards comparative judgement, an approach that in my view has limited scope; descriptor-based assessments can still play a role, especially in relation to exemplars. What the model *does* emphasise is that an assessment system should draw from several summative and formative sources, perhaps contradicting earlier parts of the book a little. This is also expressed on page 206 with the benefits (coherence, pupil ownership with adaptivity and gamification, self-improving with more adaptivity). Ultimately, though, I am left with the feeling that all these elements are already readily understood and even in place. Christodoulou seems to realise and state this on page 207, “Every individual element of this system exists already”, but does not address *how* organisations could come to an ‘unprecedented collaboration’. Maybe the challenge is that so many people *have* already tried and failed. Ideally I would have wanted the author to touch on the costs of the resources as well. Many item banks cost money, GL and CEM assessments cost money, No More Marking is not free, and textbooks and exam boards charge money. All in all, with a funding squeeze, it is unrealistic not to address the costs.

The conclusion of the book is rather meagre at three pages. There is some repetition, and again bold claims: ‘flawed ideas about assessment have encouraged flawed classroom practice’. I think this caricatures the situation. Sure, there are flawed practices, but one could also say that -in the quest for valid and reliable assessments- there always are flaws, even in some of the solutions Christodoulou proposes. Rather than exaggerating by calling practices flawed, it is better to look at how practices can be improved. Christodoulou has some suggestions that should be taken seriously, but also critically evaluated in light of the wide body of research on assessment.


Notes on Making Good Progress – Part 5




This part addresses chapter 8.
Chapter 8 addresses the topic for which I started reading this book in the first place: improving summative assessments through comparative judgement (CJ). This previous post, which I wrote right after reading this chapter, asks some questions about CJ. The chapter starts by repeating some features of summative assessments. The first is ‘standard tasks in standard conditions’. But the subsequent section isn’t really about that; it is about ‘marker reliability’ (p. 182). The distinction between the previously described difficulty model and quality model is useful. It is clear to me that essays and such are harder to mark objectively, even with a (detailed) rubric. It is pertinent to describe the difference between absolute and relative judgements. However, when the author concludes “research into marking accuracy…distortions and biases”, she again disregards ways to mitigate these issues, even though the referenced Ofqual report does mention them. Indeed, many of the distortions are ‘frustrating’ judgements, and therefore a big danger of rubrics. I, however, find it strange that this risky aspect of rubrics is disregarded when comparative judgement, it is suggested, can work with ‘exemplars’. As Christodoulou pointed out on p. 149, there is a danger that students work towards those exemplars. I saw this often in some of the Master’s modules I taught: a well-scoring exemplar’s structure of sub-headings was (awfully) applied by many of the students, as if they thought that adopting those headers surely had to result in top marks. So in a sense I agree with the author’s critique, I just don’t see how the proposed alternative isn’t just as flawed. Then comparative judgement is described as very promising. It is notable that most examples of its affordances feature English essays. It also is notable that ‘extended writing’ is mentioned, while some examples are notably shorter. The process of CJ is described neatly. I think the ‘which one is better’ question is glossed over, i.e.
‘in what way’? I also think more effort could have been put into describing the algorithm that ‘combines all those judgements, work out the rank order of all the scripts and associate a mark for each one’ (p. 187). The algorithm is part of the reason why reliability is high: inter-rater reliability can be compared with regard to rank orders; I am not sure the criticised traditional method is based on rank orders. For instance, if one rater says 45% and another 50%, it seems reasonable to say that the raters did not agree. Yet, if we just look at the rank order, they might have agreed that one script was better than the other. As CJ simply looks at those comparisons, reliability is high. But it’s not comparing like with like. A similar process with one marker and 30 scripts would involve ordering scripts, not marking them. I have to think about several challenges that are mentioned in this older AQA report. I don’t think these challenges have yet been addressed or discussed.
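The 45%/50% example can be made concrete with a small sketch (the marks are invented): two raters can disagree on every single mark and yet agree completely on the rank order, which is the sense in which CJ-style reliability figures are high.

```python
# Mark-level agreement vs rank-order agreement. Marks are invented.
rater_a = {"script1": 45, "script2": 60, "script3": 52}
rater_b = {"script1": 50, "script2": 70, "script3": 55}

# Mark-level agreement: how often do the raters give the same mark?
exact_matches = sum(rater_a[s] == rater_b[s] for s in rater_a)

# Rank-order agreement: do they order the scripts the same way?
order_a = sorted(rater_a, key=rater_a.get)
order_b = sorted(rater_b, key=rater_b.get)
print(exact_matches, order_a == order_b)
```

Here the raters never give the same mark, yet produce identical orderings, so a reliability measure based on rank orders looks perfect while a mark-based one looks poor. That is the ‘not comparing like with like’ point.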

I think it is also interesting that Christodoulou correctly contrasts CJ with (p. 187) ‘traditional moderation methods’. Ah, so not the assessment method per se, but the moderation. The Jones et al. article is referenced, but the book fails to mention the several caveats the literature also raises, e.g. multidimensionality and the length of the assessment. The mention of ‘tacit knowledge’ is fine, but in my view it is not necessarily tacit knowledge that improves reliability; it can be collective bias. I think it’s a far stretch to actually see the lack of feedback on the scripts as an advantage because it ‘separates out the grading process from the formative process’. It even distributes the grading process over a large group of people; to a student it can look like an anonymous procedure in the background. Who does the student turn to if he/she wants to know why he/she got the mark he/she received? Sure, post hoc you can analyse misfit, but can you really say, as classroom teacher, that you ‘own’ the judgement? Maybe that is the reason why it is seen as an advantage, but one could just as rightly say the exact opposite. It is interesting to note that the Belgian D-PAC project actually seems to embrace the formative feedback element CJ affords. The section ends with the ‘significant gain of being able to grade essays more reliably and quickly than previous’. I think the ‘reliably’ should be seen in the context of rank ordering, the length of the work, and multidimensionality. ‘Quickly’ should be seen as covering more than just the pairwise comparisons (it is clear that these are short, if only a ‘holistic’ judgement is needed); the collective time needed often surpasses the ‘traditional’ approach. ‘Opportunity cost’ comes to mind if we are talking about summative purposes through CJ. I am disappointed that these elements are not covered a bit more. The section ends, however, with what I *would* see as one big affordance of CJ: CJ as a route to CPD and awareness of summative and formative marking practices.
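The rank-ordering caveat can be made concrete with a toy check (the numbers are mine, purely illustrative): two raters who never award the same mark can still agree completely on the rank order of the scripts, which is the sense in which CJ reports high reliability.

```python
# Rater B is consistently 5-8 marks more generous than rater A,
# so they never agree on a mark, yet rank the scripts identically.
rater_a = [45, 50, 62, 70, 81]
rater_b = [50, 56, 68, 77, 88]

def rank(marks):
    """Rank positions (0 = lowest mark), assuming no ties."""
    order = sorted(range(len(marks)), key=lambda i: marks[i])
    positions = [0] * len(marks)
    for pos, i in enumerate(order):
        positions[i] = pos
    return positions

exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b))
rank_agreement = rank(rater_a) == rank(rater_b)
```

Measured on exact marks, these raters agree on zero scripts; measured on rank order, they agree perfectly. Whether the second number is the right basis for a reliability claim is precisely the ‘comparing like with like’ question.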
But this is something different from a complete overhaul of (summative) assessment, because of the limitations:

  • It needs to be a subjective task (quality model; otherwise there are more reliable methods)
  • The work can’t be too long (a holistic judgement would most probably not suffice)
  • The work can’t be multidimensional (a holistic judgement would most probably not suffice)

That’s quite a narrow field of application. And with the desire to stay in the summative realm, in England, summative KS2 not-too-extended writing seems to be the only candidate (see the previous blog on CJ for formative suggestions). But be careful:

In my opinion, page 188 also repeats a false choice regarding rubrics, as the described ‘exemplars’ can also be used with rubrics, not only with CJ. We do that in the aforementioned Master’s module (with the disadvantage that it becomes a target for students). So although I agree this would be ‘extremely useful’, it actually is not specific to CJ. Another unmentioned element is that CJ could be linked to peer assessment. To return to page 105, where bias is seen as human nature, one could argue that a statistical model is used to paper over human bias. In my opinion, this does not mean the bias is not there; it is just masked.

The second half of the chapter addresses curriculum-linked assessments. I don’t understand the purpose of mentioning GL Assessment, CEM and NFER, other than to use their unrealistic nature to argue that ‘we need something in-between’ summative and formative, and then to argue for ‘curriculum linked’ assessment. As in previous chapters, good points are raised, but it feels as if the purported solutions aren’t really solutions; the problems are used to argue that *something* must change, but not so much why the suggested changes would really make a difference. For example, the plea for ‘scaled scores’ is nice, but I would suggest that only people who know how to deal with them should use scaling; simply applying a scaling algorithm might also distort (think of some of the age-related assessments used in EEF studies, or PISA rankings).
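As a toy illustration of that scaling point (my own numbers, not taken from the book or from EEF/PISA): two perfectly defensible scalings preserve the rank order of raw marks yet tell very different stories about the gaps between pupils.

```python
raw = [20, 25, 30, 55, 60]  # raw marks out of 60

def linear_scale(x, raw_max=60, lo=80, hi=120):
    """Linear mapping of raw marks onto an assumed 80-120 scaled range."""
    return lo + x * (hi - lo) / raw_max

def percentile_scale(x, cohort):
    """Percentile-style mapping: position of x within the cohort (0-100)."""
    below = sum(1 for y in cohort if y < x)
    return 100 * below / (len(cohort) - 1)

linear = [round(linear_scale(x)) for x in raw]        # [93, 97, 100, 117, 120]
percentile = [percentile_scale(x, raw) for x in raw]  # [0.0, 25.0, 50.0, 75.0, 100.0]
```

Under the linear scaling, the 25-mark gap between the third and fourth pupil stays large; under the percentile scaling, it becomes identical to a 5-mark gap. Both are ‘scaled scores’, and which one is appropriate depends on the construct and the cohort, which is exactly why scaling should be left to people who understand it.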