OpenAI Whisper performance benchmarks

OpenAI’s Whisper is an automatic speech recognition (ASR) system that converts spoken audio into text. It works remarkably well for the German language and can also run entirely on local systems.

For some time now, we have been developing a solution based on this, aimed at professional groups who still work with a dictation device (whether they like it or not) and subsequently need their dictations converted into text, e.g. lawyers, doctors and notaries.

Whisper Performance Benchmark

Our solution combines the workflow of electronic dictation (e.g. via a smartphone app) with the ability to send the recording via email to a speech-to-text system and receive a transcription back – all on-premise in the customer’s environment, making it GDPR-compliant. In addition to processing formatting commands, our solution offers further advantages over pure Whisper transcription. But that’s not what we want to focus on today.

Since the processing time depends heavily on the hardware used (and is particularly fast on GPUs/graphics cards), we wanted to find a sweet spot in terms of price/performance, so we conducted tests on various systems and gathered additional results from the internet.

It is important to note that the largest and best Whisper model requires approximately 10GB of RAM or VRAM. When processing on a GPU/graphics card, at least 10GB of system RAM is needed in addition to the minimum 10GB of VRAM, otherwise the model cannot be loaded into the graphics card’s memory. This applies to each instance run in parallel on the same system: two parallel instances on one GPU require at least 20GB of VRAM.
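The per-instance memory requirement translates into a simple sizing rule; a minimal sketch (the ~10GB figure is the approximate value from above, and the helper name is our own):

```python
PER_INSTANCE_GB = 10  # approximate VRAM (and RAM) per Whisper large-v3 instance

def min_vram_gb(parallel_instances: int) -> int:
    """Minimum GPU memory needed for the given number of parallel instances."""
    return parallel_instances * PER_INSTANCE_GB

# Two parallel instances on one GPU need min_vram_gb(2) = 20 GB,
# so a 12 GB or 16 GB card can only hold a single instance.
```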

We used the OpenAI Whisper Large-v3 model (not whisper-faster or similar derived models) in our own solution, but only measured the pure Whisper processing time. We used two different audio files with sizes of 5.7MB and 11.25MB and audio lengths of 245.64s and 368.68s, respectively. Since the processing time can vary greatly even for the same files (regardless of system load), the values are sometimes averaged (indicated by avg in the list, if enough runs were performed).
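Measuring only the pure Whisper processing time amounts to wrapping the transcription call in a timer. A minimal sketch, where `transcribe_fn` stands in for whatever callable runs the model (for the openai-whisper package that would be the `transcribe` method of a loaded model object):

```python
import time

def timed_run(transcribe_fn, audio_path):
    """Call transcribe_fn on audio_path and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = transcribe_fn(audio_path)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

The model should be loaded once outside the timer (e.g. `whisper.load_model("large-v3")`), so model loading does not distort the per-file numbers.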

PpM (processing seconds per audio/recording minute) in the following tables is the number of seconds required to process one minute of audio; lower is better.
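The metric itself is simple arithmetic; as a sketch in our own notation, with a hypothetical 110 s processing time for the longer test file:

```python
def ppm(processing_seconds: float, audio_seconds: float) -> float:
    """Processing seconds per audio minute (lower is better)."""
    return processing_seconds / (audio_seconds / 60.0)

# Hypothetical example: the 368.68 s test file processed in 110 s
# gives ppm(110, 368.68), about 17.9 s per audio minute.
```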

Environment: PyTorch 2.2.2, CUDA 12.4, Debian 12.5 (bare metal, WSL, Hyper-V VM)


| GPU Model | Environment | Min PpM | Max PpM | Avg PpM | Comment |
|---|---|---|---|---|---|
| RTX 4070 Super 12GB | Docker on Debian 12.5 (bare metal) | 17.60 | 27.91 | 19.11 | both files, four runs each |
| RTX 4070 Super 12GB | Podman on Windows WSL2 | 16.85 | 41.38 | 24.10 | both files, seven runs each |
| RTX 4060ti 16GB | Podman on Windows WSL2 | 16.26 | 20.76 | 18.01 | both files, five runs each |
| RTX 4090 24GB | n/a | n/a | n/a | 7.12 | third-party result |
| RTX 3090 24GB | n/a | 11.49 | 22.22 | n/a | third-party result |
| RTX 3060 12GB | n/a | n/a | n/a | 34.48 | third-party result |


| CPU Model | Environment | Min PpM | Max PpM | Avg PpM | Comment |
|---|---|---|---|---|---|
| Ryzen 7 5700G @ 3.80 GHz, 8 cores | Podman on Windows WSL2 | 157.89 | 214.29 | 176.47 | both files, three runs each |
| i7-6700K @ 4.00 GHz, 4 cores | Podman on Windows WSL2 | 171.43 | 285.71 | n/a | both files, two runs each |
| i5-4460 @ 3.2 GHz, 4 cores (no HT) | Docker on Debian 12.5 (bare metal) | 142.86 | 240.00 | n/a | both files, one run each |
| Xeon Gold 5415+ @ 2.90 GHz, 4 vCores | Docker on Debian VM on Hyper-V host | 71.43 | 272.73 | 130.43 | 46 runs, all different files (customer production system) |
| Xeon E5-2620 v3 @ 2.40 GHz, 4 vCores | Docker on Debian VM on Hyper-V host | 200.00 | 240.00 | n/a | both files, one run each |

As the numbers show, processing on CPUs takes considerably more time. In our solution this is still usable, at least when the dictations are not urgent.

The sweet spot in terms of price/performance in this comparison is likely the Nvidia GeForce RTX 4060ti with 16GB VRAM: it is faster than the GeForce RTX 4070 Super with 12GB VRAM, which currently costs around 200 euros more. Larger cards like the RTX 4080/4090 are far more expensive; for that price you could get three RTX 4060tis and, given the per-instance memory requirements, parallelize more effectively if needed by building an appropriate system.
