OpenAI’s Whisper is an automatic speech recognition (ASR) system that converts speech into text. It works remarkably well for German and can also run on local systems.
For some time now, we have been developing a solution based on it for professionals who still work with a dictation device (whether they like it or not) and then need their dictations converted into text, e.g. lawyers, doctors, notaries and others.
Our solution combines the workflow of electronic dictation (e.g. via a smartphone app) with the ability to send the recording by email to a speech-to-text system and receive a transcription back. Everything runs on-premises in the customer’s environment, which makes it GDPR-compliant. In addition to processing formatting commands, our solution offers further advantages over pure Whisper transcription. But that’s not what we want to focus on today.
Since processing time depends heavily on the hardware used (Whisper runs particularly well on GPUs), we wanted to find the price/performance sweet spot, so we ran tests on various systems and gathered further results from the internet.
It is important to note that the largest and best Whisper model requires approximately 10 GB of RAM or VRAM. When processing on a GPU, at least 10 GB of system RAM is required in addition to the minimum 10 GB of VRAM; otherwise the model cannot be loaded into the graphics card’s memory. This applies to each instance that is to be run in parallel on the same system: two parallel instances on one GPU, for example, need at least 20 GB of VRAM.
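To make the sizing rule concrete, here is a minimal Python sketch of the arithmetic. The 10 GB per-instance figure is the approximate value quoted above. The function names are hypothetical, and the sketch assumes that the system RAM requirement scales per instance in the same way as the VRAM requirement, which you should verify on your own hardware.

```python
# Rough capacity planning for parallel Whisper instances on one GPU.
# Assumption: each instance of the largest model needs ~10 GB of VRAM
# and ~10 GB of system RAM (the approximate figure quoted in the text).
MODEL_MEMORY_GB = 10.0


def required_memory_gb(instances: int, per_instance_gb: float = MODEL_MEMORY_GB) -> float:
    """Memory needed (VRAM, and by assumption also RAM) for `instances` parallel runs."""
    if instances < 0:
        raise ValueError("instances must not be negative")
    return instances * per_instance_gb


def max_parallel_instances(vram_gb: float, ram_gb: float,
                           per_instance_gb: float = MODEL_MEMORY_GB) -> int:
    """Largest instance count that fits into both VRAM and system RAM."""
    if per_instance_gb <= 0:
        raise ValueError("per_instance_gb must be positive")
    by_vram = int(vram_gb // per_instance_gb)
    by_ram = int(ram_gb // per_instance_gb)
    # The scarcer of the two resources is the limiting factor.
    return min(by_vram, by_ram)


if __name__ == "__main__":
    # Hypothetical example systems as (VRAM in GB, RAM in GB).
    for vram, ram in [(8, 32), (12, 16), (24, 32), (24, 16)]:
        count = max_parallel_instances(vram, ram)
        print(f"{vram} GB VRAM / {ram} GB RAM: {count} parallel instance(s)")
```

Under these assumptions, a card with 24 GB of VRAM in a machine with 32 GB of RAM fits two parallel instances, while an 8 GB card cannot load the largest model at all.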