Cross-platform command-line tool (Windows, Linux, and Mac) that transcribes audio from video and audio files (.mp4, .mov, .mkv, .mp3, .m4a, .wav) into Markdown files, using Whisper.cpp locally, 100% offline, without sending data to any server.
Before using vtte, install the following tools on your system.

FFmpeg

Mac (Homebrew):

    brew install ffmpeg

Linux (Debian/Ubuntu):

    sudo apt install ffmpeg

Windows (Winget):

    winget install ffmpeg

Whisper.cpp

Mac (Homebrew):

    brew install whisper-cpp

Linux / Windows (build from source):

    git clone https://github.com/ggerganov/whisper.cpp
    cd whisper.cpp
    cmake -B build && cmake --build build --config Release

Once built, copy the `whisper-cli` binary (or `whisper-cli.exe`) to a directory that is in your PATH.
Download the AI model. ggml-base is a good starting point (a balance between speed and accuracy):

    # Create a folder for the models
    mkdir -p ~/whisper-models
    # Download the base model
    curl -L https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin \
      -o ~/whisper-models/ggml-base.bin

Other available models (from lightest to most accurate):

- `ggml-tiny.bin`: fastest
- `ggml-base.bin`: recommended
- `ggml-small.bin`
- `ggml-medium.bin`
- `ggml-large-v3.bin`: most accurate, requires more memory
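
The other models can be fetched the same way as the base model. A minimal sketch, assuming every model file sits under the same Hugging Face path as `ggml-base.bin`; the `model_url` and `download_model` helpers are hypothetical and not part of vtte:

```shell
# Hypothetical helpers. Assumes each model follows the ggml-<name>.bin
# URL pattern used for the base model above.
model_url() {
  # $1 = model name: tiny, base, small, medium, large-v3
  printf 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-%s.bin\n' "$1"
}

download_model() {
  # Download the named model into ~/whisper-models
  mkdir -p "$HOME/whisper-models"
  curl -L "$(model_url "$1")" -o "$HOME/whisper-models/ggml-$1.bin"
}
```

For example, `download_model small` would save `ggml-small.bin` next to the base model.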
Requirement: Go 1.22+

    go install github.com/isadfrn/vtte@latest

Or build locally:

    git clone https://github.com/isadfrn/vtte
    cd vtte
    go build -o vtte .

Usage:

    vtte [options] <file_or_directory>

| Flag | Default | Description |
|---|---|---|
| `-lang` | `auto` | Transcription language (e.g., `pt`, `en`, `es`) or `auto` for automatic detection |
| `-model` | `ggml-base.bin` | Path to the Whisper model file |

The model can also be set via the VTTE_MODEL environment variable.
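
How the flag, the environment variable, and the default interact is not spelled out here. The sketch below assumes the usual precedence (the `-model` flag first, then `VTTE_MODEL`, then the default); `resolve_model` is a hypothetical helper for illustration, not vtte's actual code:

```shell
# Hypothetical: assumes -model overrides VTTE_MODEL, which overrides the default.
resolve_model() {
  # $1 = value passed to -model (empty string if the flag was not given)
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  elif [ -n "${VTTE_MODEL:-}" ]; then
    printf '%s\n' "$VTTE_MODEL"
  else
    printf '%s\n' "ggml-base.bin"
  fi
}
```

If vtte resolves the model differently, only the order of the branches would change.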
Transcribe a single file (language detected automatically):

    vtte meeting.mp4
    vtte podcast.mp3
    vtte interview.wav

Force the Portuguese language:

    vtte -lang pt meeting.mp4

Use a larger model for higher accuracy:

    vtte -model ~/whisper-models/ggml-large-v3.bin -lang pt meeting.mp4

Transcribe an entire folder (mixed formats):

    vtte -lang pt ~/recordings/

Set the model via an environment variable:

    export VTTE_MODEL=~/whisper-models/ggml-base.bin
    vtte folder/with/videos/

For each processed file, vtte generates a `.md` file in the same folder as the original:

    meeting.mp4 → meeting.md
    podcast.mp3 → podcast.md
    interview.wav → interview.md

The Markdown file contains a title with the name of the source video or audio file, followed by the transcribed text, ready to be imported into Google NotebookLM, Claude, or any other AI tool.
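
The naming rule and file layout described above can be sketched in a few lines of shell. The helper names are hypothetical, and the `# ` heading style for the title is an assumption about the generated file:

```shell
# Hypothetical helpers illustrating the output naming and layout.
md_path_for() {
  # Same folder, same base name, .md extension.
  dir=$(dirname -- "$1")
  name=$(basename -- "$1")
  printf '%s/%s.md\n' "$dir" "${name%.*}"
}

render_markdown() {
  # $1 = input file, $2 = transcribed text
  # Assumption: the title is written as a level-1 Markdown heading.
  name=$(basename -- "$1")
  printf '# %s\n\n%s\n' "${name%.*}" "$2"
}
```

For example, `md_path_for recordings/meeting.mp4` prints `recordings/meeting.md`.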
- Audio extraction: FFmpeg extracts the audio from the video and converts it to 16 kHz mono PCM (the ideal format for Whisper)
- Transcription: Whisper CLI processes the audio locally with the chosen model
- Markdown: The text is saved as `.md` with the video name as the title
- Cleanup: Temporary `.wav` and `.txt` files are removed automatically

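
The four steps above can be approximated with plain `ffmpeg` and `whisper-cli` calls. This is a sketch under stated assumptions, not vtte's implementation: the `transcribe_one` function, the `.vtte-tmp` temp-file prefix, and the `# ` title style are hypothetical, and vtte may pass different options to either tool.

```shell
# Sketch only: vtte's real options and temp-file names may differ.
transcribe_one() {
  # $1 = input media file, $2 = model path, $3 = language (default: auto)
  input=$1
  model=$2
  lang=${3:-auto}
  base=${input%.*}          # input path without its extension
  tmp="$base.vtte-tmp"      # hypothetical temp-file prefix

  # 1. Audio extraction: 16 kHz, mono, 16-bit PCM WAV.
  ffmpeg -y -loglevel error -i "$input" \
    -ar 16000 -ac 1 -c:a pcm_s16le "$tmp.wav" || return 1

  # 2. Transcription: -otxt with -of makes whisper-cli write "$tmp.txt".
  whisper-cli -m "$model" -l "$lang" -f "$tmp.wav" -otxt -of "$tmp" || {
    rm -f "$tmp.wav"
    return 1
  }

  # 3. Markdown: title line followed by the transcribed text.
  {
    printf '# %s\n\n' "$(basename -- "$base")"
    cat "$tmp.txt"
  } > "$base.md"

  # 4. Cleanup: remove the temporary files.
  rm -f "$tmp.wav" "$tmp.txt"
}
```

Calling `transcribe_one meeting.mp4 ~/whisper-models/ggml-base.bin pt` would leave `meeting.md` next to the input, assuming both tools are on the PATH.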
This repository uses the Gitflow workflow and Conventional Commits. To contribute:
- create a branch from the `develop` branch;
- make your contributions;
- open a Pull Request against the `develop` branch;
- wait for discussion and approval.
Thank you in advance for any contribution.
Maintaining