Categories
ReproLab

Reproduction of Kestin et al. (2025)

In ReproLab I try to reproduce or replicate findings from research articles.

This reproduction is for:

Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2025). AI tutoring outperforms in-class active learning: An RCT introducing a novel research-based design in an authentic educational setting. Scientific Reports, 15(1), 17458. https://doi.org/10.1038/s41598-025-97652-6

This article went quite viral and is often used to support large claims about AI, especially to argue that AI tutors teach physics twice as well as Harvard professors.

Of course, currently, everyone is clamouring to get their AI studies out, because such studies are dated within a week. Still, it was surprising to see that the article was received on 25 March 2025 and accepted 13 days later on 7 April 2025. Good things about the study were:

  • Strong experimental design, compared with an active learning condition rather than ‘lecturing’.
  • Information about the AI design.
  • Sizable effect sizes; engagement and motivation were higher in the AI condition (which could be a novelty effect) but enjoyment and growth mindset were similar (so not superficial), and
  • Data made available (see below for the reproduction).

However, we must highlight limitations as well if this study is used as a blueprint for education systems as a whole:

  • The study covers one single institution (with a presumably high entry tariff), a single course, and only two weeks.
  • Short-term outcomes.
  • The AI tutor took a long time to design, so this can’t be compared with out-of-the-box AIs.
  • Ceiling effects, although that could also mean that the learning effect is underestimated.
  • Medium versus pedagogy: the AI condition was asynchronous, self-paced, and at home; the in-class condition was synchronous and in-person.
  • Novelty effect (see positives as well).

I’m sure there are other points, but you’d want more of these studies, and it therefore is slightly annoying that the Tutor and TeachGPT are not available, although the prompts are. LLMs are just too unpredictable to really rely on prompts alone, which is a general challenge in reproducing findings.

What is available is the data though, and I could reproduce the results. Figures produced with R are below (they look different but are essentially the same):

I used the link to the data in the article.

The R script is here.

Categories
Education Education Research

researchEd national conference

On 9 September 2017 I gave a talk at the national researchEd conference in London. The presentation was about how mythbusting might lead to new myths. The presentation covered the following:

  • I started by explaining how myths might come about, by referencing some papers about neuromyths.
  • I then used the case of iron in spinach to illustrate how criticising myths can lead to new myths (paper by Rekdal).
  • I gave examples of some themes that are in danger of becoming new myths.
  • I concluded that it is important to read a lot, stay critical and observe nuance. No false dichotomies please.

I will endeavor to write this up at some point. Slides below.

Categories
Education Research Math Education Tools

Seminar at Loughborough University

Dr. Christian Bokhove recently gave an invited seminar at Loughborough University:

Using technology to support mathematics education and research

Christian received his PhD in 2011 at Utrecht University and is a lecturer at the University of Southampton. In this talk Christian will present a wide spectrum of research initiatives that all involve the use of technology to support mathematics education itself and research into mathematics education. It will cover (i) design principles for algebra software, with an emphasis on automated feedback, (ii) the evolution from fragmented technology to coherent digital books, (iii) the use of technology to measure and develop Mental Rotation Skills, and (iv) the use of computer science techniques to study the development of mathematics education policy.

The talk referenced several articles Dr. Bokhove has authored over the years, for example:

  • Bokhove, C., & Drijvers, P. (2012). Effects of a digital intervention on the development of algebraic expertise. Computers & Education, 58(1), 197-208. doi:10.1016/j.compedu.2011.08.010
  • Bokhove, C. (in press). Using technology for digital maths textbooks: More than the sum of the parts. International Journal for Technology in Mathematics Education.
  • Bokhove, C., & Redhead, E. (2017). Training mental rotation skills to improve spatial ability. Online proceedings of the BSRLM, 36(3)
  • Bokhove, C. (2016). Exploring classroom interaction with dynamic social network analysis. International Journal of Research & Method in Education, doi:10.1080/1743727X.2016.1192116
  • Bokhove, C., & Drijvers, P. (2010). Digital tools for algebra education: criteria and evaluation. International Journal of Computers for Mathematical Learning, 15(1), 45-62. Online first. doi:10.1007/s10758-010-9162-x
Categories
Education Research

Economic papers about education (CPB part 2)

This is a follow-up post from this post in which I unpicked one part of a large education review. In that post I covered aspects of papers by Vardardottir, Kim, Wang and Duflo. In this post I cover another paper in that section (page 201).

Booij, A.S., E. Leuven and H. Oosterbeek, 2015, Ability Peer Effects in University: Evidence from a Randomized Experiment, IZA Discussion Paper 8769.
This is a discussion paper from the IZA series. This is a high quality series of working papers, but this, of course, is not yet a peer-reviewed journal version. Maybe there is one by now, but clearly this version was used for the review. Previously I had already noticed there could be considerable differences between working papers and the final version, just see Vardardottir’s evaluation of Duflo et al.’s paper.
The paper concerns undergraduate economics students. Of course a first observation would be that it might be difficult to generalize wider than ‘economics undergraduates from a general university in the Netherlands’. Towards the end it is however argued that together with other papers (Duflo, Carrell) a pattern of results is emerging. The first main result is in Table 4.
The columns show how the models were built. Column (1) has the basic model, in which only the mean of peers’ Grade Point Average (GPA) and ‘randomization controls’ are included. Column (2) adds controls like ‘gender’, ‘age’ and ‘professional college’. Column (3) adds the Standard Deviation (SD) of peers’ GPA in a tutorial group. Columns (1) to (3) do not show any effect. Only in column (4), where non-linear terms and an interaction are added, do some significant variables appear. This can be seen by the **. The main result seems rather borderline, but ok, in the context of ability grouping it is Table 5 that is more interesting.
In that table different tracking scenarios are studied. The first column is overall effects compared to ‘mixed’, so this looks at the ‘system’ as a whole. Columns (2) to (4) show the differentiated effects. From this table I would deduce:
  • Two-way tracking: lower ability students gain a little bit (10% significance in my book is not significant), higher ability students gain a little bit (borderline 5%).
  • Three-way tracking: middle and low gain some, high doesn’t.
  • Track Low: low gains, middle more (hypothesis less held back?), high doesn’t.
  • Track Middle: only middle gains (low slightly negative but not significant!)
  • Separate high ability: no one gains.

This is roughly the same as what is described in the article on page 20. The paper then also addresses average grade and dropout. Actually, the paper goes into many more things (teachers, for example) which I will not cover. It is interesting to look at the conclusions, and especially the abstract. I think the abstract follows from the data, although I would not have said “students of low and medium ability gain on average 0.2 SD units of achievement from switching from ability mixing to three-way tracking.” because it seems 0.20 and 0.18 respectively (so 19% as mentioned in the main body text). Only a minor quibble, which after querying, I heard has been changed in the final version. I found the discussion very limited. It is noted that in different contexts (Duflo, Carrell) roughly similar results are obtained (but see my notes on Duflo).

Overall, I find this an interesting paper which does what it says on the tin (bar some tiny comments). Together with my previous comments, though, I would still be wary about the specific contexts.

 

 

Categories
Education Research

Unpicking economic papers: a paper on direct instruction

This paper, by Schwerdt and Wuppermann, has the title “Is traditional teaching really all that bad?”, which makes clear that it sets out to show it isn’t. And without this paper I would have said the same thing, simply because I wouldn’t deny that ‘direct instruction’ has had a rough treatment in the last decades.

There are several versions of this paper on SSRN and other repositories. The published version is from ‘Economics of Education Review’, and this immediately shows why I have included it. With the advent of economics papers some have preferred to use this paper rather than a more sociological, psychological or education research approach.

The literature review is, as is often the case in my opinion in economics papers, a bit shallow. The study uses TIMSS 2003 year 8 data (I don’t know why they didn’t use 2007 data).

I find the wording “We standardize the test scores for each subject to be mean 0 and standard deviation 1.” a bit strange because the TIMSS dataset, as in later years, does not really have ‘test scores per subject’, because students do not take all the assessment items.

Instead, there are five so-called ‘plausible values’. Not using them might underestimate the standard error, which might lead to results reaching significance more readily. This variable is the outcome; another variable is question 20.
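To make the standard-error point concrete, here is a minimal sketch of how an estimate is usually pooled across plausible values (Rubin's combining rules): the analysis is run once per plausible value, and the spread between those runs is added to the sampling variance. The function name and all numbers are made up for illustration; they are not taken from the paper or from TIMSS, and in a real TIMSS analysis the sampling variances would come from the replicate weights.

```python
import math
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Pool one statistic that was estimated separately on each plausible value.

    Hypothetical helper for illustration. Returns (pooled estimate, pooled SE).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(sampling_variances)   # average sampling variance
    between = variance(estimates)       # variance across plausible values (n - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Made-up coefficients from five separate runs, one per plausible value.
estimates = [0.50, 0.54, 0.47, 0.52, 0.49]
sampling_variances = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled, se = combine_plausible_values(estimates, sampling_variances)
naive_se = math.sqrt(sampling_variances[0])  # SE when only one value is used

print(f"pooled estimate: {pooled:.3f}")
print(f"pooled SE: {se:.4f}  vs  single-value SE: {naive_se:.4f}")
```

The pooled standard error can never be smaller than the average single-run standard error, which is why ignoring the plausible values tends to make results look more certain than they are.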

The distinction between instruction and problem solving is based on three of these items: b is seen as direct instruction, c and d together as problem solving (note that one of these, of course, does mention ‘guidance’). There is an emphasis on ‘new material’ so I can see why these are chosen. Of course the use of percentages means that an absolute norm is not apparent, but I can see how lecture%/(lecture%+problemsolving%) denotes a ratio of lecturing. The other five elements are used together as controls. Mean imputation was used (I can agree that the imputation method probably did not make a difference), as were sample weights (also good, in contrast to not using plausible values).
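The ratio described above can be sketched in a few lines. This is only my reading of lecture%/(lecture%+problemsolving%), with item b as lecturing and items c and d summed as problem solving; the function name and the percentages are hypothetical.

```python
def lecture_share(lecture_pct, guided_ps_pct, unguided_ps_pct):
    """Share of lecturing within lecturing plus problem solving.

    lecture_pct: item b; guided_ps_pct and unguided_ps_pct: items c and d.
    Returns None when a teacher reports no time on any of the three items.
    """
    total = lecture_pct + guided_ps_pct + unguided_ps_pct
    if total == 0:
        return None
    return lecture_pct / total


# Two hypothetical teachers with very different absolute time use...
teacher_a = lecture_share(30, 25, 20)
teacher_b = lecture_share(15, 12.5, 10)

# ...end up with the same value of the ratio.
print(teacher_a, teacher_b)
```

The second teacher spends half as much lesson time on all three activities but gets the same share, which is the sense in which an absolute norm is not apparent.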

Table 1 in the paper tabulates all the variables and shows some differences between maths and science teachers, for example in the intensity of lecture style teaching. The paper then proposes a “standard education production function” model. In all the result tables we can certainly see the standard p=.10 and again with large N’s this, to me, seems unreasonable. A key result is in Table 4:

The first line is the lecture style teaching variable. Columns 1 and 3 show that Math is significant (but keep in mind, at 5% with high N; however, 0.514 does sound quite high) and Science is not. Columns 2 and 4 then have the same result but now by taking into account school sorting based on unobservable characteristics of students through inclusion of fixed school effects. I find the pooling a bit strange, and it reminds me of the EEF pooling of maths mastery for primary and secondary to gain statistically significant results. Yes, here too, both subjects then yield significant results. Together with the plausible values issue I would be cautious.
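The unease about 5% and 10% thresholds with a high N can be illustrated with a deliberately simplified sketch: a plain two-group comparison with a normal approximation, not the paper's regression model. The effect size of 0.05 SD and the group sizes are made up.

```python
import math


def two_sided_p(d, n_per_group):
    """Two-sided p-value for a standardised mean difference d between two
    equally sized groups, using a normal approximation (illustrative only)."""
    se = math.sqrt(2 / n_per_group)   # SE of the difference in SD units
    z = abs(d) / se
    return math.erfc(z / math.sqrt(2))


# The same tiny effect, evaluated at increasing sample sizes.
for n in (100, 1000, 10000):
    print(n, round(two_sided_p(0.05, n), 4))
```

The effect is identical in each row; only the sample size changes, and with enough observations even a very small difference passes any conventional threshold.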

Table 5 extends the analysis.

The same pattern arises. The key variable is significant at the questionable 10% level (column 1) and a bit stronger after adding confounding variables (at the 5% level, but again with high N). The article notes that over the columns the variable is quite constant, but also that it’s lower than the Table 4 results, showing that there are school effects.

There is a footnote on page 373 that might have received a bit more attention. I find the reporting a bit strange because the first line indicates that the variable ranges from 0.11 to 0.14, not 0.14 to 0.1 (and why go from a larger to a smaller number, is this a typo?). Overall, 1% of an SD seems very low. I think the discussion that follows is interesting and adds some thoughts. I thought it was interesting that it was said “Our results, therefore, do not call for more lecture style teaching in general. The results rather imply that simply reducing the amount of lecture style teaching and substituting it with more in-class problem solving without concern for how this is implemented is unlikely to raise overall student achievement in math and science.”. Well, that does seem a balanced conclusion, indeed. And again, a strong feature of most economics papers, the robustness checks are good.

In conclusion, I found this an interesting use of a TIMSS variable. Perhaps it could be repeated with 2011 data, and now include all five plausible values (perhaps a source of error). Nevertheless, although I think strong conclusions in favour of lecturing could be debated, likewise it could be said that there also are no negative effects of it: there’s nothing wrong with lecturing!

Categories
Education Research

Unpicking economic papers: a paper on behaviour

One of the papers that made a viral appearance on Twitter is a paper on behaviour in the classroom. Maybe it’s because of the heightened interest in behaviour, for example demonstrated in the DfE’s appointment of Tom Bennett, and behaviour having a prominent place in the Carter Review.

Carrell, S E, M Hoekstra and E Kuka (2016) “The long-run effects of disruptive peers”, NBER Working Paper 22042.

The paper contends that misbehaviour (actually, domestic violence) of pupils in a classroom apparently leads to large sums of money that people will miss out on later in life. There are, as always, some contextual questions of course: the paper is about the USA, and it seems to link domestic violence with classroom behaviour. But I don’t want to focus on that, I want to focus on the main result in the abstract: “Results show that exposure to a disruptive peer in classes of 25 during elementary school reduces earnings at age 26 by 3 to 4 percent. We estimate that differential exposure to children linked to domestic violence explains 5 to 6 percent of the rich-poor earnings gap in our data, and that removing one disruptive peer from a classroom for one year would raise the present discounted value of classmates’ future earnings by $100,000.”.

It’s perfectly sensible to look at peer effects of behaviour of course, but monetising it -especially with a back of envelope calculation (actual wording in the paper!)- is on very shaky ground. The paper respectively looks at the impact on test scores (table 3), college attendance and degree attainment (table 4), and labor outcomes (table 5). The latter is also the one reported in the abstract.

There are some interesting observations here. The abstract’s result is mentioned in the paper “Estimates across columns (3) through (8) in Panel A indicate that elementary school exposure to one additional disruptive student in a class of 25 reduces earnings by between 3 and 4 percent. All estimates are significant at the 10 percent level, and all but one is significant at the 5 percent level.” The fact economists would even want to use 10% (with such a large N) is already strange to me. Even 5% is tricky with those numbers. However, the main headline in the abstract can be confirmed. But have a look at panel C. It seems there is a difference between ‘reported’ and ‘unreported’ Domestic Violence. Actually, reported DV has a (non-significant) positive effect. Where was that in the abstract? Rather than a conclusion along the lines of whether DV was reported or not, the conclusion only focuses on the negative effects of *unreported* DV. I think it would be more fair to make a case for better signalling and monitoring of DV, so that negative effects of unreported DV are countered; after all, there are no negative effects on peers when reported.

 

 

Categories
Education Research Math Education MathEd

Slides from researchEd maths and science

Presentation for researchED maths and science on June 11th 2016.

References at the end (there might be some extra references from slides that were removed later on, these are interesting too 🙂)

Interested in discussing, contact me at C.Bokhove@soton.ac.uk or on Twitter @cbokhove

Categories
Education Education Research Games ICT Math Education MathEd Tools

Games in maths education

This is a translation of a review that appeared a while back in Dutch in the journal of the Mathematical Society (KWG) in the Netherlands. I wasn’t able to always check the original English wording in the book.

Computer games for Maths

Christian Bokhove, University of Southampton, United Kingdom

Recently, Keith Devlin (Stanford University), known for his newsletter Devlin’s Angle and popularisation of maths, released a computer game (app for the iPad) with his company Innertubegames called Wuzzit Trouble (http://innertubegames.net/). The game purports to, without actually calling them that, address linear Diophantine equations and build on principles from Devlin’s book on computer games and mathematics (Devlin, 2011) in which Devlin explains why computer games are an ‘ideal’ medium for teaching maths in secondary education. In twelve chapters the book discusses topics like street maths in Brazil, mathematical thinking, computer games, how these could contribute to the learning of maths, and concludes with some recommendations for successful educational computer games. The book has two aims: 1. To start a discussion in the world of maths education about the potential for games in education. 2. To convince the reader that well designed games will play an important role in our future maths education, especially in secondary education. In my opinion, Devlin succeeds in the first aim simply by writing a book about the topic. The second aim is less successful.

Firstly, Devlin uses a somewhat unclear definition of ‘mathematical thinking’: at first it’s ‘simplifying’, then ‘what a mathematician does’, and then something else yet again. Devlin remains quite tentative in his claims and undermines some of his initial statements later on in the book. Although this is appropriate it does weaken some of the arguments. The book subsequently feels like a set of disjointed claims that mainly serve to support the main claim of the book: computer games matter. A second point I noted is that the book seems very much aimed at the US. The book describes many challenges in US education that, in my view, might be less relevant for Europe. The US emphasis also might explain the extensive use of superlatives like an ‘ideal medium’. With these one would expect a good support of claims with evidence. This is not always the case, for example when Devlin claims that “to young players who have grown up in era of multimedia multitasking, this is no problem at all” (p. 141) or “In fact, technology has now rendered obsolete much of what teachers used to do” (p. 181). Devlin’s experiences with World of Warcraft are interesting but anecdotal and one-sided, as there are many more types of games. It also shows that the world of games changes quickly, a disadvantage of a paper book from 2011.

Devlin has written an original, but not very evidenced, book on a topic that will become more and more relevant over time. As an avid gamer myself I can see how computer games have conquered the world. It would be great if mathematics could tap into a fraction of the motivation, resources and concentration they might offer. It’s clear to me this can only happen with careful and rigorous research.

Devlin, Keith. (2011). Mathematics Education for a New Era: Video Games as a Medium for Learning.

Categories
Education Research

Some work presented in the last months

Some work was presented in the last months.

At Sunbelt XXXV I presented this work on classroom interaction and Social Network Analysis:

At ICTMT and PME my colleague presented our work on c-books

Categories
Education Research

Predatory journals

More and more I’m being confronted with questions about journal publications. I devote some words to it in a session for our MSc programme in the module ‘Understanding Education Research’ and recently, in a panel discussion at our local PGR conference, there were questions about how to judge a journal’s reputation. Note that in answering this question I certainly don’t want to be a ‘snob’, i.e. someone for whom only the conventional and traditional publication methods suffice. Actually, developments on blogging and Open Access are positive changes, in my view. Unfortunately there also is a darker side to all of this:

One place where I always look first when it comes to ‘vanity press’ and predatory journals is Beall’s List, which is “a list of questionable, scholarly open-access publishers.”. What I like about this list is that they are rather sensible about how to use the list: “We recommend that scholars read the available reviews, assessments and descriptions provided here, and then decide for themselves whether they want to submit articles, serve as editors or on editorial boards.”. The list of criteria for determining predatory open access journals is clear as well. One thing you can do is use the search function to see if a journal or publisher gets a mention. This is exactly what I did recently with some high profile research. I was surprised to find out articles were indeed published in such journals.

The first example is this high profile article mentioned in the Times Educational Supplement. It references a press release from Mitra’s university:  

 
The journal title did not ring a bell so I checked Beall’s list, and yes the journal and publisher are mentioned in this article on the list. Just a quick glance, also at the comments, should make most scholars think twice before publishing here, certainly if it is ‘groundbreaking’ stuff. This is not to say that articles per se are bad (although methodologically there is much to criticise as well, maybe later, although this blog does a good job at concisely flagging up some issues) but I am worried that high profile professors are publishing in journals like these (assuming it was done with the authors’ agreement, predatory journals sometimes just steal content to bump up their reputation). In the case of this person it has happened before, in 2012, when the ‘Center of Promoting Ideas’ (this name would be enough for me to not want to appear in their publications) published this article in a journal, which is also on Beall’s list. It is poignant that an Icelandic scholar really got into problems because of this. Some other examples: this article, CIR world also features on Beall’s list (Council for Innovative Research, again a name which raises suspicion by itself).

  

These publications serve as examples that even high end professors could fall victim to predatory journals. I do not mean that in a judgemental way; it shows that more education on the world of predatory journals is needed. Although I must admit there might be some naivety at play here: experienced scholars should know that ‘positive reviews only’, ‘dubious publishing fees’ and ‘unrealistic publication turnovers’ are very suspicious. Early Career Researchers often are targets of predatory journals and it therefore is important to be aware of this ‘dark side’ of Open Access publishing. Beall’s list covers these but recently there also are more and more ‘non open access’ journals that might be a bit dubious as well. In many cases it’s quite a challenge to judge the trustworthiness of publications. Certainly if in social sciences we would want to move away from the hegemony of the five big publishers, there is a lot to be gained from general skills to judge literature. Now, everyone has their own judgements to make when it comes to where they want to publish, but I would be very concerned about publishing in any journal (and for any publisher) on Beall’s list.