speech2md
A CLI that turns long audio recordings into clean, readable prose markdown. It uses Qwen3-ASR served through vLLM, with optional word-level timestamps via the matching forced aligner.
It can also split the text by speaker using pyannote diarization via the --diarize flag.
Built for a single-GPU workstation, in my case an RTX 3090 with 24 GB of VRAM. On that hardware it transcribes at roughly 150–250× realtime: a 51-minute recording finishes in about 16 seconds once the model is loaded.
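The quoted speedup is easy to sanity-check from the numbers above; a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the realtime speedup quoted above.
audio_seconds = 51 * 60   # a 51-minute recording
wall_seconds = 16         # wall-clock time once the model is loaded
speedup = audio_seconds / wall_seconds
print(f"{speedup:.0f}x realtime")  # ~191x, inside the quoted 150-250x range
```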
Install
uv tool install --python 3.12 'speech2md[gpu] @ git+https://github.com/kumekay/speech2md'
Requires Python 3.11 or 3.12 (newer versions are not supported because an older vLLM is pinned for Qwen3-ASR), Linux with an NVIDIA GPU (~16 GB of free VRAM for transcription, plus ~2 GB for alignment), CUDA, and ffmpeg/ffprobe on $PATH.
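Before installing, it can help to verify the software prerequisites. A minimal preflight sketch (the version and tool requirements come from the line above; GPU/VRAM and CUDA checks are left to nvidia-smi):

```python
import shutil
import sys

# Preflight sketch: check the interpreter version and that ffmpeg/ffprobe
# resolve on $PATH. Hardware checks (VRAM, CUDA) are intentionally omitted.
python_ok = sys.version_info[:2] in {(3, 11), (3, 12)}
ffmpeg_ok = all(shutil.which(tool) for tool in ("ffmpeg", "ffprobe"))
print(f"python_ok={python_ok} ffmpeg_ok={ffmpeg_ok}")
```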
Usage
speech2md audio/recording.m4a
speech2md /path/to/recordings/*.m4a --skip-existing
Add --json to emit a per-chunk JSON sidecar, then run align-transcription on it to get word-level timestamps and SRT output.