Transcribing audio with less pain

Like so many people, I’ve never really liked transcribing audio, for example from interviews or focus groups. It is time-consuming and boring. Of course, you can outsource it, but that unfortunately costs money. So I thought: “How can I do this more quickly with available services?”

Last year, a colleague and I wrote an article on exactly this: using the Youtube auto-captioning feature to transcribe audio more quickly. The quality of Youtube’s voice recognition has improved considerably over the last decade. The paper gives three examples (interview audio, a classroom recording, and a Chilcot Inquiry interview) to show how useful this can be for producing a ‘first transcript version’. I have just posted the pre-publication version.


To demonstrate the procedure, I applied it to my recent podcast with TES.

  1. You first need to get hold of an audio file. I assume you already have one from your data collection. Sometimes you can also obtain audio with browser extensions such as DownThemAll! (for Firefox).
  2. Before you can upload to Youtube, you need to turn the audio into a video file. On Windows, I prefer Movie Maker. Unfortunately it has been discontinued, but you can still find it here. I make a video with a single image and the audio as the accompanying sound.
  3. Now this ‘movie’ (actually audio with one image) can be uploaded to Youtube. After a few hours, Youtube should have created automatic closed captions for the audio. Make sure the privacy settings are correct, for example by setting the video to private if the recording contains confidential material.
  4. The captions can be downloaded as a text file with various tools, such as DIY captions or downsub. Some of these tools run outside the web browser, and some also work with private videos (as long as you are the ‘owner’ of the file, of course). The result may be a subtitle file, which can be edited further with subtitle software.
  5. You can see that this version is already pretty good. I estimate that it captures around 80% of the content correctly. For a 40-minute audio file, it took maybe 15 minutes of actual work, plus some waiting time while Youtube generated the captions. This saves me a lot of time.

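Because Movie Maker (step 2) is no longer maintained, here is one possible alternative. This is my assumption and not the procedure used in the article: the free command-line tool ffmpeg can combine a still image and an audio file into a video. The minimal Python sketch below only builds and runs the ffmpeg command. It assumes ffmpeg is installed and on the PATH, and the file names and the helper `still_image_video_cmd` are hypothetical.

```python
from __future__ import annotations

import shlex
import subprocess
from pathlib import Path


def still_image_video_cmd(image: str, audio: str, output: str) -> list[str]:
    """Build an ffmpeg command that combines one still image with an audio track."""
    return [
        "ffmpeg",
        "-loop", "1",           # repeat the single image for the whole video
        "-i", image,
        "-i", audio,
        "-vf", "scale=trunc(iw/2)*2:trunc(ih/2)*2",  # H.264 in yuv420p needs even width/height
        "-c:v", "libx264",      # widely supported H.264 video codec
        "-tune", "stillimage",  # encoder tuning for a static picture
        "-pix_fmt", "yuv420p",  # pixel format most players (and Youtube) accept
        "-c:a", "aac",          # encode the audio as AAC
        "-shortest",            # stop when the shorter input (the audio) ends
        output,
    ]


if __name__ == "__main__":
    # Hypothetical file names: replace with your own image and recording.
    cmd = still_image_video_cmd("cover.png", "interview.mp3", "interview.mp4")
    print(shlex.join(cmd))
    # Only run ffmpeg (assumed to be installed and on the PATH) if the inputs exist.
    if Path("cover.png").exists() and Path("interview.mp3").exists():
        subprocess.run(cmd, check=True)
```

The `-shortest` option matters here: the looped image never ends on its own, so the video stops when the audio does.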

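The downloaded captions (step 4) often come as a subtitle file, for example in the SRT format. In that format, each cue is a number, a timing line, and one or more lines of text. If you only want the words as a first transcript version, a small script can strip out the numbers and timings. The following is a minimal sketch that assumes a standard SRT file. The function name and file names are hypothetical, and other formats such as WebVTT would need small changes.

```python
import re
import sys

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:05,000"
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}[,.]\d{3}")


def srt_to_text(srt: str) -> str:
    """Return only the caption text of an SRT subtitle file, joined into one string."""
    # Drop a byte-order mark and normalise Windows line endings.
    srt = srt.lstrip("\ufeff").replace("\r\n", "\n")
    kept = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.split("\n")
        # Keep only the lines after the timing line (this skips the cue number too).
        for i, line in enumerate(lines):
            if TIMING.match(line.strip()):
                kept.extend(t.strip() for t in lines[i + 1:] if t.strip())
                break
    return " ".join(kept)


if __name__ == "__main__":
    # Usage (hypothetical file names): python srt_to_text.py captions.srt > transcript.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as f:
            print(srt_to_text(f.read()))
```

Joining everything into one string throws away the timestamps. When you still need to check the text against the audio, it may be more practical to edit the subtitle file itself and only convert it at the end.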