OpenAI’s Whisper is an automatic speech recognition (ASR) system that converts spoken language into text. It works remarkably well for German and can also run on local systems.
For some time now, we have been developing a solution based on Whisper, aimed at professional groups who still work with a dictation device (whether they like it or not) and subsequently need their dictations converted into text, e.g. lawyers, doctors, notaries and others.
Our solution combines the workflow of electronic dictation (e.g. via a smartphone app) with the ability to send the recording via email to a speech-to-text system and receive a transcription back – all on-premise in the customer’s environment, making it GDPR-compliant. In addition to processing formatting commands, our solution offers further advantages over pure Whisper transcription. But that’s not what we want to focus on today.
Since the processing time depends heavily on the hardware used (GPUs/graphics cards in particular perform very well), we wanted to find a sweet spot in terms of price/performance and ran tests on various systems (and gathered some results from the internet).
It is important to note that the largest and best Whisper model requires approximately 10GB of RAM or VRAM. When processing on a GPU/graphics card, at least 10GB of system RAM is needed in addition to the minimum 10GB of VRAM, otherwise the model cannot be loaded into the graphics card’s memory. This applies to each instance that runs in parallel on the same system, i.e. two parallel instances on one GPU require at least 20GB of VRAM.
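Whether a card has enough free memory for a desired number of parallel instances can be checked up front. Here is a minimal sketch in Python using PyTorch’s `torch.cuda.mem_get_info`; the helper name is ours, and the ~10GB per instance is the rough figure from above, not a hard limit:

```python
import torch

GB = 1024 ** 3

def fits_instances(n: int, per_instance_gb: float = 10.0) -> bool:
    """Rough check: does the GPU have enough free VRAM for n parallel
    Whisper large-v3 instances (~10GB each, see above)?"""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes >= n * per_instance_gb * GB

print(fits_instances(1))  # single instance: needs ~10GB free
print(fits_instances(2))  # two parallel instances: needs ~20GB free
```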
We used the OpenAI Whisper large-v3 model (not faster-whisper or similar derived models) within our own solution, but measured only the pure Whisper processing time. We used two audio files of 5.7MB and 11.25MB with audio lengths of 245.64s and 368.68s, respectively. Since the processing time can vary considerably even for the same file (independent of system load), some values are averaged (shown in the AVG column of the tables below, where enough runs were performed).
PpM (processing seconds per audio/recording minute) in the tables below denotes the processing time in seconds required per minute of audio; a PpM of 60 would correspond to real-time transcription, and lower is faster.
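To make the metric concrete, here is a minimal sketch of such a measurement using the openai-whisper Python package; the file name is a placeholder, and the resulting time naturally depends on the hardware:

```python
import time
import whisper  # pip install openai-whisper

model = whisper.load_model("large-v3")  # the model used in our tests

audio_seconds = 245.64  # length of our first test file
start = time.perf_counter()
model.transcribe("dictation.mp3", language="de")  # placeholder file name
processing_seconds = time.perf_counter() - start

# PpM = processing time in seconds per minute of audio
ppm = processing_seconds / (audio_seconds / 60.0)
print(f"PpM: {ppm:.2f}")  # e.g. ~19 if this file took about 78s to process
```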
Environment: PyTorch 2.2.2, CUDA 12.4, Debian 12.5 (bare metal, WSL, Hyper-V VM)
GPUs
| GPU Model | Environment | Min PpM | Max PpM | AVG PpM | Comment |
| --- | --- | --- | --- | --- | --- |
| RTX 4070 Super 12GB | Docker on Debian 12.5 (bare metal) | 17.60 | 27.91 | 19.11 | both files, four runs each |
| RTX 4070 Super 12GB | Podman on Windows WSL2 | 16.85 | 41.38 | 24.10 | both files, seven runs each |
| RTX 4060 Ti 16GB | Podman on Windows WSL2 | 16.26 | 20.76 | 18.01 | both files, five runs each |
| RTX 4090 24GB | n/a | n/a | n/a | 7.12 | third-party test: https://github.com/openai/whisper/discussions/918 |
| RTX 3090 24GB | n/a | 11.49 | 22.22 | n/a | third-party tests: https://youtu.be/lgP1LNnaUaQ?t=880, https://github.com/openai/whisper/discussions/918 |
| RTX 3060 12GB | n/a | n/a | n/a | 34.48 | third-party test: https://youtu.be/lgP1LNnaUaQ?t=880 |
CPUs
| CPU Model | Environment | Min PpM | Max PpM | AVG PpM | Comment |
| --- | --- | --- | --- | --- | --- |
| Ryzen 7 5700G @ 3.80 GHz, 8 cores | Podman on Windows WSL2 | 157.89 | 214.29 | 176.47 | both files, three runs each |
| i7-6700K @ 4.00 GHz, 4 cores | Podman on Windows WSL2 | 171.43 | 285.71 | n/a | both files, two runs each |
| i5-4460 @ 3.20 GHz, 4 cores (no HT) | Docker on Debian 12.5 (bare metal) | 142.86 | 240.00 | n/a | both files, one run each |
| Xeon Gold 5415+ @ 2.90 GHz, 4 vCores | Docker on Debian VM on Hyper-V host | 71.43 | 272.73 | 130.43 | 46 runs, all different files (a customer’s production system) |
| Xeon E5-2620 v3 @ 2.40 GHz, 4 vCores | Docker on Debian VM on Hyper-V host | 200.00 | 240.00 | n/a | both files, one run each |
As the numbers show, processing on CPUs is feasible only with a considerable investment of time. In our solution this is still usable, at least when the dictations are not urgent.
In this comparison, the price/performance sweet spot is likely the Nvidia GeForce RTX 4060 Ti with 16GB of VRAM. It is faster than the GeForce RTX 4070 Super with 12GB of VRAM, which currently costs around 200 euros more. The larger cards such as the RTX 4080/4090 remain far more expensive; for that money you could buy three RTX 4060 Tis and, given the memory requirements per instance, build a system that parallelizes much better if needed.
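As a rough illustration of that kind of parallelization (a sketch, not our actual solution), one Whisper worker can be pinned to each GPU; the file names and the worker count of three are placeholders:

```python
import multiprocessing as mp
import whisper  # pip install openai-whisper

def worker(gpu_id: int, jobs: mp.Queue) -> None:
    # One model instance per GPU; each needs ~10GB of VRAM (see above).
    model = whisper.load_model("large-v3", device=f"cuda:{gpu_id}")
    while (path := jobs.get()) is not None:
        result = model.transcribe(path, language="de")
        print(f"GPU {gpu_id}: {path}: {result['text'][:60]}...")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # safer with CUDA than the default fork
    jobs = mp.Queue()
    for path in ["dictation1.mp3", "dictation2.mp3", "dictation3.mp3"]:
        jobs.put(path)  # placeholder file names
    workers = [mp.Process(target=worker, args=(i, jobs)) for i in range(3)]
    for w in workers:
        w.start()
    for _ in workers:
        jobs.put(None)  # one stop signal per worker
    for w in workers:
        w.join()
```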