OpenAI’s Whisper is an automatic speech recognition (ASR) system that converts spoken language into text. It works remarkably well for German and can also run on local systems.
For some time now, we have been developing a solution based on Whisper, aimed at professional groups who still work with a dictation device (whether they like it or not) and subsequently need their dictations converted into text, e.g. lawyers, doctors, notaries and others.
Our solution combines the workflow of electronic dictation (e.g. via a smartphone app) with the ability to send the recording via email to a speech-to-text system and receive a transcription back – all on-premise in the customer’s environment, making it GDPR-compliant. In addition to processing formatting commands, our solution offers further advantages over pure Whisper transcription. But that’s not what we want to focus on today.
Since the processing time depends heavily on the hardware used (GPUs/graphics cards in particular perform very well), we wanted to find a sweet spot in terms of price/performance and ran tests on various systems (and gathered some results from the internet).
It is important to note that the largest and best Whisper model requires approximately 10GB of RAM or VRAM. When processing on a GPU/graphics card, at least 10GB of system RAM is needed in addition to the minimum 10GB of VRAM, otherwise the model cannot be loaded into the graphics card’s memory. This applies to each instance that runs in parallel on the same system, i.e. two parallel instances on one GPU require at least 20GB of VRAM.
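Whether a card has enough free memory for a desired number of parallel instances can be checked up front. Here is a minimal sketch in Python using PyTorch’s `torch.cuda.mem_get_info`; the helper name is ours, and the ~10GB per instance is the rough figure from above, not a hard limit:

```python
import torch

GB = 1024 ** 3

def fits_instances(n: int, per_instance_gb: float = 10.0) -> bool:
    """Rough check: does the GPU have enough free VRAM for n parallel
    Whisper large-v3 instances (~10GB each, see above)?"""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes >= n * per_instance_gb * GB

print(fits_instances(1))  # single instance: needs ~10GB free
print(fits_instances(2))  # two parallel instances: needs ~20GB free
```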
We used the OpenAI Whisper large-v3 model (not faster-whisper or similar derived models) within our own solution, but measured only the pure Whisper processing time. We used two audio files of 5.7MB and 11.25MB with audio lengths of 245.64s and 368.68s, respectively. Since the processing time can vary considerably even for the same file (independent of system load), some values are averaged (shown in the AVG column of the tables below, where enough runs were performed).
PpM (processing seconds per audio/recording minute) in the tables below denotes the processing time in seconds required per minute of audio; a PpM of 60 would correspond to real-time transcription, and lower is faster.
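To make the metric concrete, here is a minimal sketch of such a measurement using the openai-whisper Python package; the file name is a placeholder, and the resulting time naturally depends on the hardware:

```python
import time
import whisper  # pip install openai-whisper

model = whisper.load_model("large-v3")  # the model used in our tests

audio_seconds = 245.64  # length of our first test file
start = time.perf_counter()
model.transcribe("dictation.mp3", language="de")  # placeholder file name
processing_seconds = time.perf_counter() - start

# PpM = processing time in seconds per minute of audio
ppm = processing_seconds / (audio_seconds / 60.0)
print(f"PpM: {ppm:.2f}")  # e.g. ~19 if this file took about 78s to process
```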
Environment: PyTorch 2.2.2, CUDA 12.4, Debian 12.5 (bare metal, WSL, Hyper-V VM)
GPUs
| GPU Model | Environment | Min PpM | Max PpM | AVG PpM | Comment |
| --- | --- | --- | --- | --- | --- |
| RTX 4070 Super 12GB | Docker on Debian 12.5 (bare metal) | 17.60 | 27.91 | 19.11 | both files, four runs each |
| RTX 4070 Super 12GB | Podman on Windows WSL2 | 16.85 | 41.38 | 24.10 | both files, seven runs each |
| RTX 4060 Ti 16GB | Podman on Windows WSL2 | 16.26 | 20.76 | 18.01 | both files, five runs each |
| RTX 4090 24GB | n/a | n/a | n/a | 7.12 | third-party test: https://github.com/openai/whisper/discussions/918 |
| RTX 3090 24GB | n/a | 11.49 | 22.22 | n/a | third-party tests: https://youtu.be/lgP1LNnaUaQ?t=880, https://github.com/openai/whisper/discussions/918 |
| RTX 3060 12GB | n/a | n/a | n/a | 34.48 | third-party test: https://youtu.be/lgP1LNnaUaQ?t=880 |
CPUs
| CPU Model | Environment | Min PpM | Max PpM | AVG PpM | Comment |
| --- | --- | --- | --- | --- | --- |
| Ryzen 7 5700G @ 3.80 GHz, 8 cores | Podman on Windows WSL2 | 157.89 | 214.29 | 176.47 | both files, three runs each |
| i7-6700K @ 4.00 GHz, 4 cores | Podman on Windows WSL2 | 171.43 | 285.71 | n/a | both files, two runs each |
| i5-4460 @ 3.20 GHz, 4 cores (no HT) | Docker on Debian 12.5 (bare metal) | 142.86 | 240.00 | n/a | both files, one run each |
| Xeon Gold 5415+ @ 2.90 GHz, 4 vCores | Docker on Debian VM on Hyper-V host | 71.43 | 272.73 | 130.43 | 46 runs, all different files (a customer’s production system) |
| Xeon E5-2620 v3 @ 2.40 GHz, 4 vCores | Docker on Debian VM on Hyper-V host | 200.00 | 240.00 | n/a | both files, one run each |
As the numbers show, processing on CPUs is feasible only with a considerable investment of time. In our solution this is still usable, at least when the dictations are not urgent.
In this comparison, the price/performance sweet spot is likely the Nvidia GeForce RTX 4060 Ti with 16GB of VRAM. It is faster than the GeForce RTX 4070 Super with 12GB of VRAM, which currently costs around 200 euros more. The larger cards such as the RTX 4080/4090 remain far more expensive; for that money you could buy three RTX 4060 Tis and, given the memory requirements per instance, build a system that parallelizes much better if needed.
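As a rough illustration of that kind of parallelization (a sketch, not our actual solution), one Whisper worker can be pinned to each GPU; the file names and the worker count of three are placeholders:

```python
import multiprocessing as mp
import whisper  # pip install openai-whisper

def worker(gpu_id: int, jobs: mp.Queue) -> None:
    # One model instance per GPU; each needs ~10GB of VRAM (see above).
    model = whisper.load_model("large-v3", device=f"cuda:{gpu_id}")
    while (path := jobs.get()) is not None:
        result = model.transcribe(path, language="de")
        print(f"GPU {gpu_id}: {path}: {result['text'][:60]}...")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # safer with CUDA than the default fork
    jobs = mp.Queue()
    for path in ["dictation1.mp3", "dictation2.mp3", "dictation3.mp3"]:
        jobs.put(path)  # placeholder file names
    workers = [mp.Process(target=worker, args=(i, jobs)) for i in range(3)]
    for w in workers:
        w.start()
    for _ in workers:
        jobs.put(None)  # one stop signal per worker
    for w in workers:
        w.join()
```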