Last year I wrote about a neat and lightweight implementation of the Whisper speech-to-text model. One of the potential applications I mentioned was converting recorded presentations (seminars, lectures, etc.) into written notes. A few weeks ago a review article I wrote using this approach was published in AAPPS Bulletin. Here's how I did it:
1. Identify source material. In this case, I had an online conference talk that had been recorded and uploaded to YouTube.
2. Download the raw audio using a tool such as yt-dlp.
3. Convert the audio to a text transcript. I used whisper.cpp, which can run on a CPU. The base and small model sizes are already fairly accurate and run quickly.
4. Transcript editing. Whisper won't have perfect accuracy, especially when attempting to transcribe scientific jargon. So it's necessary to carefully review the generated text.
5. Figure conversion. Since this was my own talk, I had access to high-resolution versions of the figures I wanted to include in the paper, so only minor reformatting was required.
6. Add references. While I cited papers in the slides, the citations need to be converted to a .bib file or other reference manager format. It would be helpful to have an AI assistant that could do this automatically.
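
For anyone who wants to script steps 2 and 3, here is a minimal Python sketch. It is not the exact set of commands I ran, and it rests on a few assumptions: that yt-dlp and ffmpeg are on your PATH, that the whisper.cpp executable is at ./main (newer builds name it whisper-cli), and that a base model has been downloaded to models/ggml-base.en.bin. The file stem "talk" and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def download_cmd(url: str, stem: str = "talk") -> list[str]:
    # -x keeps only the audio track; --audio-format wav converts it via ffmpeg.
    return ["yt-dlp", "-x", "--audio-format", "wav", "-o", f"{stem}.%(ext)s", url]


def resample_cmd(stem: str = "talk") -> list[str]:
    # whisper.cpp expects 16 kHz, mono, 16-bit PCM WAV input.
    return ["ffmpeg", "-y", "-i", f"{stem}.wav", "-ar", "16000", "-ac", "1",
            "-c:a", "pcm_s16le", f"{stem}_16k.wav"]


def transcribe_cmd(stem: str = "talk",
                   binary: str = "./main",  # assumed path to the whisper.cpp binary
                   model: str = "models/ggml-base.en.bin") -> list[str]:  # assumed model path
    # -otxt writes a plain-text transcript; -of sets the output name without extension.
    return [binary, "-m", model, "-f", f"{stem}_16k.wav", "-otxt", "-of", stem]


def run_pipeline(url: str, stem: str = "talk") -> Path:
    # Run download, resampling and transcription in order; stop on the first failure.
    for cmd in (download_cmd(url, stem), resample_cmd(stem), transcribe_cmd(stem)):
        subprocess.run(cmd, check=True)
    return Path(f"{stem}.txt")
```

Calling run_pipeline with the talk's URL should leave the transcript in talk.txt, ready for the manual review in step 4. Swapping the model path for a small model trades some speed for accuracy.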
And with that I had a first draft completed! Very nice, since the first draft is usually the hardest to write. I did spend some more time polishing the text, adding some details that didn't make it into the original talk, and making the language more formal in parts, but it ended up being a lot easier than writing the whole text from scratch!
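
Until an AI assistant handles step 6 for me, a small helper can at least take the tedium out of typing the .bib entries by hand. The sketch below is only an illustration: the function name bibtex_entry is my own, and the entry in the usage example is an obvious placeholder, not a real reference.

```python
def bibtex_entry(key: str, entry_type: str = "article", **fields: str) -> str:
    """Format one BibTeX entry from keyword fields (author, title, year, ...)."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        # Wrap each value in braces so BibTeX keeps it as written.
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Placeholder data for illustration only.
entry = bibtex_entry("placeholder2024", author="A. Author",
                     title="A Placeholder Title", year="2024")
print(entry)
```

Appending the returned strings to a file gives a .bib that LaTeX can read directly. The metadata itself still has to come from the slides or a lookup by hand.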