Audio | kumekay

Transcribe my life

TLDR: speech2md does a solid job transcribing speech in a mix of languages and splitting it by speaker on an NVIDIA GPU with 16 GB of memory. I wired it into an automated pipeline so that recordings from my phone get converted to text notes automatically.

Speech recognition is not a new technology. It’s been around for ages. Every phone has voice input on the keyboard. And it actually works pretty well. But there are some caveats.

First of all, in my case, I often mix languages when I speak. I might be talking in Russian, throw in a bunch of English-language terms, then drop in a couple of Czech words, and then continue in Russian or English.

The small models that ship on phones don’t handle that very well. You need bigger models. I’m using the word “model” because nowadays nearly all speech recognition is done by specialized large language models (LLMs) with a dedicated speech encoder.

Fortunately,