ReproLab – Christian Bokhove

Reproduction of Kestin et al. (2025)

In ReproLab I try to reproduce or replicate findings from research articles.

This reproduction is for:

Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2025). AI tutoring outperforms in-class active learning: An RCT introducing a novel research-based design in an authentic educational setting. Scientific Reports, 15(1), 17458. https://doi.org/10.1038/s41598-025-97652-6

This AI article went quite viral and is often used to support large claims about AI, especially the claim that AI tutors teach physics twice as well as Harvard professors.

Of course, currently, everyone is clamouring to get their AI studies out, because they are dated within a week. Even so, it was surprising to see that the article was received on 25 March 2025 and accepted 13 days later, on 7 April 2025. Good things about the study were:

Strong experimental design, compared with an active learning condition rather than ‘lecturing’.
Information about the AI design.
Sizable effect sizes; engagement and motivation were higher in the AI condition (which could be a novelty effect), but enjoyment and growth mindset were similar (so not superficial), and
Data made available (see below for the reproduction).

However, we must highlight limitations as well if this study is used as a blueprint for education systems as a whole:

The study covers a single institution (presumably with a high entry tariff), a single course, and only two weeks.
Short-term outcomes.
The AI tutor took a long time to design, so this can’t be compared with out-of-the-box AIs.
Ceiling effects…but that could also mean that the learning effect is underestimated.
Medium versus pedagogy: the AI condition was asynchronous, self-paced, and at home; the in-class condition was synchronous and in-person.
Novelty effect (see positives as well).

I’m sure there are other points, but you’d want more of these studies, and it is therefore slightly annoying that the Tutor and TeachGPT are not available, although the prompts are. LLMs are just too unpredictable to rely on prompts alone, which is a general challenge in reproducing findings.

What is available is the data though, and I could reproduce the results. Figures produced with R are below (they look different but are essentially the same):

I used the link to the data in the article.

The R script is here.

In ReproLab I try to reproduce or replicate findings from research articles.

This reproduction is for:

Reid, C., & Boeren, E. (2026). Growth mindset is positively associated with mathematics attainment in Scotland—But socioeconomic status plays a greater role. British Educational Research Journal. https://doi.org/10.1002/berj.70156

This article used Scottish PISA 2022 data to look at growth mindset in relation to mathematics achievement. The original article used SPSS, and especially with multilevel modelling it is always a bit of a guess whether findings will replicate with a different package (McCoach et al., 2018). As I mainly work in R, that is what I went for, using the raw data from the OECD website. The authors used multilevel models, but no weights and only the first plausible value (PV).

The key result is in Table 1:

The first column of this screenshot shows the results for Scotland. It was very easy to add other countries with the code, so I added the home countries:

The results reproduce almost perfectly, a great indication that the authors reported enough information in their article, which is a sign of rigour. The key result here, according to the article, is the non-significant interaction between GM_c and ESCS (growth mindset and socio-economic status). The GM*ESCS interaction for England is notable: growth mindset matters more for students from lower-SES backgrounds.
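To make the reading of such an interaction concrete, here is a minimal Python sketch of a simple-slope calculation. The function name and the coefficient values are invented for illustration only; they are not taken from Table 1 or from my R output. The idea is that, in a model with a GM x ESCS product term, the slope of growth mindset at a given ESCS value is the GM coefficient plus the interaction coefficient times that ESCS value.

```python
def simple_slope(b_gm, b_interaction, escs):
    """Slope of growth mindset on the outcome at a given ESCS value.

    Assumes a linear model containing GM, ESCS and their product:
    outcome = ... + b_gm * GM + b_escs * ESCS + b_interaction * GM * ESCS
    """
    return b_gm + b_interaction * escs


# Hypothetical coefficients, NOT the estimates from the article.
b_gm = 10.0           # effect of GM when ESCS is 0
b_interaction = -4.0  # negative: GM matters more at lower ESCS

for escs in (-1.0, 0.0, 1.0):
    slope = simple_slope(b_gm, b_interaction, escs)
    print(f"ESCS = {escs:+.1f}: GM slope = {slope:.1f}")
```

With a negative interaction coefficient the slope is larger at lower ESCS, which is the pattern described above for England.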

However, I wanted to check whether the choice not to include weights and to use only one PV would make a difference, so I also ran code using all ten PVs and the best weighting scenario according to Mang et al. (2021) (“The simulation results revealed three weighting approaches performing best in retrieving the true population parameters. One of them implies using only level two weights (here: final school weights) and is because of its simple implementation the most favourable one.”).
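For readers unfamiliar with how results from several plausible values are usually combined, here is a minimal Python sketch of Rubin's rules for one coefficient. It is an illustration under stated assumptions, not my actual R code: the function name is hypothetical, the numbers are made up, and the standard errors are assumed to be the model-based ones from each separate fit.

```python
from math import sqrt
from statistics import mean, variance


def pool_plausible_values(estimates, standard_errors):
    """Pool one coefficient fitted once per plausible value (Rubin's rules).

    estimates: the coefficient from each of the m separate model fits
    standard_errors: the matching standard error from each fit
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need at least two fits and one SE per estimate")

    pooled = mean(estimates)                          # average of the estimates
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                     # variance across PVs (m - 1)
    total = within + (1 + 1 / m) * between            # total variance
    return pooled, sqrt(total)


# Made-up values for three fits, purely for illustration.
estimate, se = pool_plausible_values([10.0, 12.0, 14.0], [2.0, 2.0, 2.0])
print(f"pooled estimate = {estimate:.2f}, pooled SE = {se:.2f}")
```

The pooled standard error is larger than any single-fit standard error whenever the estimates differ across PVs, which is one reason why using only the first PV can understate uncertainty.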

Notably, the estimates are different but most of the key results stay the same, except for mean school GM. Also note, though, that if one were to extend this to all home countries, the interaction between GM_c and ESCS is no longer significant for England.

The code is a bit messy, but I am happy to share it. I might add it to the post later. Interesting extensions could be the home countries and other countries in the global PISA 2022 sample.