
# Transcribe my life
TL;DR: speech2md does a solid job of transcribing speech in a mix of languages and splitting it by speaker on an NVIDIA GPU with 16 GB of memory. I wired it into an automated pipeline so that recordings from my phone get converted to text notes automatically.
Speech recognition is not a new technology. It’s been around for ages. Every phone has voice input on the keyboard. And it actually works pretty well. But there are some caveats.
First of all, in my case, I often mix languages when I speak. I might be talking in Russian, throw in a bunch of English-language terms, then drop in a couple of Czech words, and then continue in Russian or English.
The small models that ship on phones don’t handle that very well. You need bigger models. I’m using the word “model” because nowadays nearly all speech recognition is done by special large language models (LLMs) with a dedicated speech encoder.
Fortunately, large models exist and are readily available, including in the cloud via API. And they're not really all that expensive. For example, gpt-4o-transcribe costs less than a dollar per hour of speech, with great recognition quality on a mix of languages. For important meetings and short notes it works just fine (though even there, there are caveats).
But what if I set out to record everything I say, then transcribe it, and slowly collect data for building a model of myself?
Right away I want to flag the ethical question. Most often I’m not talking to myself but to someone else. And you always need to ask the other person for permission to record.
But even if I'm just muttering my ideas and thoughts while pacing around the room, that still adds up to a fair amount of audio. I genuinely enjoy this way of working through a thought: just me, a voice recorder, and my imagination. In summer the room can be swapped for the park near my house, and during work hours I can walk between the flowerbeds outside the office.
Then I need to do something with these files. Storing them as audio is cheap, but it's much easier to work with them as text. And if you record several hours per week, cloud transcription isn't all that cheap anymore. Plus, the gpt-4o-transcribe mentioned above can't generate timecodes for each word (alignment) and can't separate different speakers (diarization). So I'd been thinking for a while about a solution of my own that could do all of this at once, do it well, and run on my own hardware. Conveniently, I have a computer with an NVIDIA RTX 3090 that's almost always on.
Last weekend I once again surveyed what speech recognition models are out there, and after a few experiments I settled on Qwen3 ASR 1.7B. It’s a fairly fresh release from Alibaba that’s probably the best on the planet at recognizing Chinese dialects. But OK, it also supports more than a hundred languages and works at a solid level on European ones too. Maximum file length is up to 20 minutes. It comes in two flavors: 600 million parameters (around 2 GB in RAM) and 1.7 billion (should fit in 4–5 GB). A nice size for a MacBook or a gaming GPU. They even shipped a Python library so you can run it quickly and conveniently. Cool, although for some reason that library requires fairly old versions of its dependencies. And it doesn’t work with the latest Python releases. On the upside, on my machine it lets me transcribe speech roughly 200x faster than real time. An hour of speech can be transcribed in 15–20 seconds.
To handle files longer than 20 minutes, I first find moments of silence in the original audio and split it on those silent points so that each chunk is 15–20 minutes long. The longer the chunk, the better the model maintains consistency. So it’s worth keeping them closer to the upper bound. And the assumption that there will be a moment of silence in any 5 minutes of speech holds up too. For finding silence I use ffmpeg.
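The splitting step can be sketched roughly like this. This is a simplification, not speech2md's actual code: it parses the `silence_start` lines that ffmpeg's `silencedetect` filter prints to stderr, then greedily picks the last silence before each 20-minute limit (the noise threshold and minimum silence duration are assumptions to tune):

```python
import re
import subprocess

def detect_silences(path, noise_db=-30, min_silence=0.5):
    """Run ffmpeg's silencedetect filter and return silence start times in seconds."""
    cmd = [
        "ffmpeg", "-i", path, "-af",
        f"silencedetect=noise={noise_db}dB:d={min_silence}",
        "-f", "null", "-",
    ]
    # silencedetect reports to stderr, e.g. "... silence_start: 913.2"
    out = subprocess.run(cmd, capture_output=True, text=True).stderr
    return [float(m) for m in re.findall(r"silence_start: ([\d.]+)", out)]

def pick_split_points(silences, total, max_chunk=20 * 60):
    """Greedily choose the last silence before each 20-minute limit,
    keeping chunks close to the upper bound."""
    points, start = [], 0.0
    while total - start > max_chunk:
        candidates = [s for s in silences if start < s <= start + max_chunk]
        # If no silence falls in the window, hard-cut at the limit.
        cut = candidates[-1] if candidates else start + max_chunk
        points.append(cut)
        start = cut
    return points
```

The chosen points can then be fed back to ffmpeg to actually cut the file without re-encoding.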
The model itself only outputs text, without specifying at what point in the audio it was said. Luckily, there’s another model for this task: Qwen3 ForcedAligner 0.6B. It only works with 11 languages, but it lets you mark when each word was spoken. The catch is that it supports a maximum of 5 minutes of speech at a time. Precise per-word timestamps are needed for subtitles and for the next stage, when there are several people and we want to figure out who said what.
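Working around the 5-minute limit comes down to aligning each short chunk separately and shifting the resulting timestamps back onto the global timeline. Here `align(audio, text)` is a hypothetical stand-in for whatever call the Qwen3 ForcedAligner library actually exposes, assumed to return `(word, start, end)` tuples relative to the start of the chunk:

```python
def align_long(chunks, align):
    """chunks: [(offset_seconds, audio_chunk, text_chunk), ...], where each
    audio chunk is already under the aligner's 5-minute limit.
    `align` is a hypothetical stand-in for the real aligner call.
    Returns one global list of (word, start, end) tuples."""
    words = []
    for offset, audio, text in chunks:
        for word, start, end in align(audio, text):
            # Shift chunk-relative timestamps to the global timeline.
            words.append((word, offset + start, offset + end))
    return words
```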
The pyannote package is probably the most popular solution for that task. And even though their top-tier model is paid, they have a community edition you (and I) can use for free.
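Invoking the pretrained diarization pipeline looks roughly like this. A minimal sketch, not speech2md's code: the model name and token argument reflect pyannote.audio 3.x, and the gated model requires a Hugging Face token; check the pyannote docs for your version:

```python
def diarize(path, hf_token):
    """Return speaker turns as (start, end, label) tuples for an audio file."""
    from pyannote.audio import Pipeline  # third-party: pip install pyannote.audio
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    annotation = pipeline(path)
    # Flatten the annotation into plain tuples, e.g. (0.0, 3.2, "SPEAKER_00")
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in annotation.itertracks(yield_label=True)
    ]
```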
Now for recorded meeting audio I get markdown files in this format:

```markdown
## SPEAKER-01
What was that?

## SPEAKER-00
Да нет, ничего, все в порядке.
```

(The second reply is Russian for "No, nothing, everything's fine.")
It does make mistakes from time to time and may invent a couple of words or attribute the last thing one person said to another, but overall I’m very happy with the result.
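The final assembly step can be sketched as follows, assuming per-word timestamps from the aligner and `(start, end, speaker)` turns from the diarizer. Assigning each word to the turn that covers its midpoint is a simplification of whatever speech2md actually does:

```python
def assign_speakers(words, turns):
    """words: [(word, start, end)]; turns: [(start, end, speaker)].
    Assign each word to the diarization turn containing its midpoint,
    then group consecutive same-speaker words into markdown blocks."""
    def speaker_at(t):
        for start, end, spk in turns:
            if start <= t < end:
                return spk
        return "UNKNOWN"

    blocks = []
    for word, start, end in words:
        spk = speaker_at((start + end) / 2)
        if blocks and blocks[-1][0] == spk:
            blocks[-1][1].append(word)
        else:
            blocks.append((spk, [word]))
    return "\n".join(f"## {spk}\n{' '.join(ws)}\n" for spk, ws in blocks)
```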
I wrapped all of these tools into a single command-line application, speech2md. In its current form it works only on Linux and only with NVIDIA cards, but adding support for macOS with the new Apple Silicon chips shouldn't be a problem.
Of course I didn't write all of the code by hand and made heavy use of AI agents (this post itself was written manually in Russian, translated by Claude, and then edited a bit once I went "WTF, I don't talk like this"), but I still feel that I learned a new applied skill and added a tool to my toolbox.
I also set up automation on my phone via Tasker. As soon as I record audio on my phone, it’s sent through the Folder Sync app to my home NAS, which uses Syncthing to sync it to the desktop, where another automation watches for new audio files, transcribes them, and puts the finished transcripts into my Obsidian. The chain is long, but it works smoothly. Usually less than a minute passes before the file shows up in my vault.
Now I just need to set up automatic knowledge extraction from these notes and organize everything into an LLM-powered wiki?
Or come up with a more elaborate scenario. For example, narrating all my thoughts during a code review, while in parallel recording which change I’m currently reading. Then an LLM can throw out all the swearing and automatically publish only the substantive comments. But that’s a project for another weekend.
Note: this is an almost direct translation of the original post in Russian, with a few details added.