Last week we compared self-hosted speech-to-text solutions on Apple Silicon. But what if you are a business that needs a transcription API? Or a developer who just wants to build a pet project with transcription? Let’s see what options are on the market and which one is the best (spoiler - there is no clear winner).
Why compare?
A fair question: why compare APIs at all? Surely someone has already done this before me. And indeed, if you google it, you will find the Artificial Analysis website, which compares seemingly every model in the world, including speech-to-text models. Here is an example of their provider columns:
https://artificialanalysis.ai/speech-to-text
The price is clear, the speed is also clear, but let’s see how they measure the error. Scroll down to the methodology and you will find that they calculate it on the Common Voice 16.1 dataset, in which the maximum audio length is 15 seconds. This means the most interesting thing is not checked at all - how each provider handles long files (Whisper itself processes audio in 30-second windows). Last week we saw that everyone has their own algorithms, and even with the same model you can get a different error, since each provider has its own implementation of stitching chunks together.
Error of different Whisper versions; note that all v3 results are identical (except Fireworks)
And the graph shows that all Whispers of the same configuration have exactly the same error, which, in my opinion, is impossible. In this article I will check how providers actually handle long files. There will also be beautiful bars and charts, don’t worry.
Players
Meet our players!
Foreign Solutions Team:
- AssemblyAI - judging by the landing page, the most accurate model and the largest feature set;
- OpenAI API - what everyone uses;
- Fireworks - I was surprised by the high speed and low accuracy; let’s see if they sacrificed accuracy for speed;
- DeepInfra - the cheapest Whisper provider, a contender for the win;
- fal - a very cheap provider with high speed; I also expect good results from them.
I chose them from the Artificial Analysis list (because they have the nicest bar colors)
First problems
I expected to spend exactly 0 rubles and 0 kopecks on testing the different services, but unfortunately not all of them have a free tier in their API. OpenAI and DeepInfra do not give a penny for API testing, and the case of VseGPT is especially comical: they give a whole 5 rubles for free! Thank you, very generous. Never mind that a minute of their Whisper costs one and a half rubles.
On the other end of the spectrum is AssemblyAI, which gives away a whopping $50 for free testing! That is actually enough to try out all of their API features (and they have a lot of them).
Rules of the game
This time I put together a benchmark of 16 audio files based on YouTube videos with author-made subtitles. It contains audio in German, English, Russian, Spanish and French with varying degrees of noise and speech clarity. I also included three songs in the dataset to see if any model could recognize the lyrics (spoiler - no).
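For the curious, here is roughly how such a benchmark can be assembled. The yt-dlp flags are from its documentation; the output formats are my own choice, not a description of my exact pipeline:

```python
# Sketch: download the audio track and the author-made subtitles for one
# video with yt-dlp (requires ffmpeg for the mp3 conversion).
import subprocess

def fetch(url: str) -> None:
    subprocess.run(
        ["yt-dlp",
         "-x", "--audio-format", "mp3",           # extract audio only
         "--write-subs", "--sub-format", "vtt",   # author subtitles, not auto-generated
         url],
        check=True,
    )

fetch("https://www.youtube.com/watch?v=...")  # placeholder URL
```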
I will measure the Word Error Rate (WER, %) and the speed of each provider (how many seconds of audio it transcribes per second of wall-clock time). WER is usually calculated after normalizing the text (removing punctuation and converting everything to lowercase), but I will not do that: I want to check how readable the text is without additional processing through an LLM. I take the author-made subtitles in the video as the reference for punctuation and quality and measure WER against them. Yes, they can contain not only words but also sound descriptions (for example, [Music Playing]), which will obviously inflate the error, but I am not comparing the absolute accuracy of the models, only the relative accuracy between providers.
If a request returns an error, I score it as 100% WER and zero speed. No compromises - APIs should always work, and the higher the uptime, the better for business.
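To make the rules concrete, here is a minimal sketch of the scoring logic. I assume the jiwer library here (any WER implementation would do), and normalization is skipped on purpose:

```python
# Minimal scoring sketch: WER without normalization, failures count as 100%.
import jiwer

def score(reference: str, hypothesis: str | None) -> float:
    """Return WER as a fraction; an API error or empty response scores 1.0."""
    if not hypothesis:
        return 1.0
    return jiwer.wer(reference, hypothesis)  # punctuation and casing included

# Casing and punctuation differences are counted as errors:
print(score("Bonjour tout le monde.", "bonjour tout le monde"))  # > 0
```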
Tests and first victims
Let’s start with the sad part, with the disqualification of some providers from the competition.
- fal (wizper) - unfortunately, it has no automatic language detection, which is a deal-breaker for me: a typical API user cannot detect the language on an ordinary backend, that would require extra capacity.
- VseGPT (stt-openai/whisper-1) - I put 500 rubles on the account (thanks to the generous free tier) and got the same number back from the server: every request returned 500 Internal Server Error, so I could not test them.
- Yandex SpeechKit - … let’s leave it for a separate section, perhaps.
What problems do others have:
- OpenAI API - you need a server abroad (or the use of prohibited technologies) and files no larger than 25 MB. This limitation prevented three files from the benchmark from being transcribed, so they got a minus. Splitting large files on the backend is genuinely hard (what do you do at the chunk junctions?), and this problem is simply shifted from the API to the backend. The backend, for example, cannot use the heuristics from the OpenAI article, because they require logits, which, naturally, are not returned via the API. I sketch a naive split in the OpenAI section below.
- DeepInfra - unfortunately, it also did not live up to my expectations: with such a tiny API price (the entire benchmark cost only 6 cents), you pay with reliability. 6 requests out of 16 returned a transcription error.
- Shopot - a complicated story. I was counting on them, and they… Firstly, one request returned some strange internal error. Secondly, their API is built in an interesting way: you send a request and then either periodically poll an endpoint for the finished transcription or use a webhook. On the one hand, this is a good design, since transcription can take a long time and keeping the connection open that long is not recommended. On the other hand, it complicates the backend: you have to bolt on webhook handling. In the end I set myself an expectation of a response within two minutes, and if it does not arrive, the request gets a minus (a rough polling sketch follows this list). For the same reason I did not calculate transcription speed for Shopot, since every audio would count as 120 seconds. The longest audio in my benchmark is 3811 seconds, and to fit into the two-minute limit the service would have to transcribe about 30-35 minutes of audio per minute. And since their service is an order of magnitude more expensive than the others, I would actually like to see that kind of speed.
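Here is the poll-until-done pattern with the 120-second deadline. The base URL and response fields are hypothetical, not Shopot’s real API - this only illustrates the deadline logic:

```python
# Hypothetical async-transcription polling loop with a hard 120 s deadline.
import time
import requests

API = "https://api.example-stt.com/v1"  # made-up base URL
DEADLINE_S = 120

def transcribe(payload: dict, token: str) -> str | None:
    headers = {"Authorization": f"Bearer {token}"}
    job = requests.post(f"{API}/transcriptions", json=payload, headers=headers).json()
    start = time.monotonic()
    while time.monotonic() - start < DEADLINE_S:
        st = requests.get(f"{API}/transcriptions/{job['id']}", headers=headers).json()
        if st["state"] == "done":
            return st["text"]
        if st["state"] == "error":
            return None                  # scored as 100% WER
        time.sleep(2)                    # poll every two seconds
    return None                          # timeout -> also 100% WER
```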
Yandex - we hear one thing, we write another
I did not use their API; I just decided to test it through their UI on the SpeechKit Playground. I started with the first video, which is in French. After a while I got a response, fed it into my Python script that calculates WER, and saw an error of 174% 🤨. (WER above 100% is possible because insertions count as errors, so extra words can outnumber the reference words.) I immediately thought that maybe I had not copied everything; I checked again and got the same thing. Then I started reading the text, and there…
Forgive me for Comic Sans, I like it so much
I have marked pairs of absolutely identical lines with different colors. I want to ask: in what casino were these lines shuffled into such a crooked order? They repeat without any algorithm, in seemingly random order. And the same goes for the other recordings! They earned their disqualification; I do not recommend using it. Still, I decided to manually clean this transcript of repetitions and got an error of 57% (there is no punctuation!). Not bad, but the average error of the other services on this recording is about 37%.
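In case the 174% above looks like a bug in my script: WER counts insertions as errors, so a transcript that repeats lines can easily exceed 100%. A tiny demonstration with jiwer:

```python
# WER above 100%: insertions alone can outnumber the reference words.
import jiwer

ref = "bonjour tout le monde"
hyp = "bonjour tout le monde " * 3     # the reference repeated three times
print(jiwer.wer(ref, hyp.strip()))     # 2.0, i.e. 200% WER
```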
TL;DR Results - Fireworks and Nexara are great
Comparison of price and Word Error Rate on the chart, green square - optimal price / quality
If you are not interested in further details, the result is on the chart. The first thing that catches the eye is Shopot: they are an order of magnitude more expensive than all competitors, but they offer speaker diarization and processing via an LLM (I am not sure how useful the LLM processing is; as you can see, it does not reduce the error much, but it raises the price considerably).
I was also very surprised that AssemblyAI showed such a bad result. Below we will try to figure out why.
Now let’s take a look at the speed and Word Error Rate comparison:
Comparison of WER and speed, everything in the green square is cool
As you can see, Shopot is missing, since I capped the wait time for all requests at 120 seconds and did not measure its speed. Fireworks Large-v3 Turbo is very fast with excellent accuracy. Good work, great API. Their slogan “Fastest inference for generative AI” really does not lie.
Comparison of the number of errors
A shaded bar means the API had other issues: in the case of OpenAI, the 25 MB limit; in the case of Shopot, slow transcription and timeouts.
More results
Let’s look at each recording separately and see the nuances of working with each API. Let’s start with the error.
And let’s finish with a speed comparison.
Sorry if the font is too small, I tried
Not every recording has all 6 bars, because if a request failed, I could not calculate its speed.
AssemblyAI (orange):
Let’s start with them, since they have the highest average error. Note that the error is either very low or very high: the model does a good job, but sometimes everything goes wrong at once. Look at the texts and you will see a sad picture - AssemblyAI sometimes guesses the language wrong, and then the entire transcription is instantly incorrect :-( In terms of speed the service is decent (much faster than OpenAI Whisper), and it has a lot of features, but a function as basic as language detection spoiled all their results.
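If your languages are known in advance, this failure mode looks avoidable by pinning the language instead of relying on detection. A sketch based on my reading of the AssemblyAI Python SDK (treat the exact parameter names as an assumption):

```python
# Sketch: explicit language vs. automatic detection in AssemblyAI's SDK.
import assemblyai as aai

aai.settings.api_key = "YOUR_KEY"

auto_cfg = aai.TranscriptionConfig(language_detection=True)  # let it guess
fr_cfg = aai.TranscriptionConfig(language_code="fr")         # or pin French

transcript = aai.Transcriber().transcribe("audio.mp3", config=fr_cfg)
print(transcript.text)
```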
Fireworks (large-v3 and turbo) (brown and pink):
Everything is fine, except that on the songs (recordings #10-#12) the error is 100%. Look at the response text and you will see that it is empty, and it is empty because of Voice Activity Detection (VAD) - a model that identifies the audio fragments where it hears a voice worth transcribing. This is a very good solution that speeds up computation (on recording #14, which has many fragments without a voice, the turbo model shows simply cosmic speed), but sometimes VAD mistakes a barely audible voice for plain noise and drops a piece of text. That is what happens with the songs: the VAD model apparently decided the whole track was noise, so the API returns an empty string.
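You can check what a VAD “hears” in a song yourself with the open Silero VAD model - not necessarily what Fireworks runs internally, but the same idea:

```python
# Sketch: run Silero VAD over a song and see whether it finds any speech.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("song.wav", sampling_rate=16000)
segments = get_speech_timestamps(wav, model, sampling_rate=16000)
print(segments)  # an empty list means "all noise" -> empty transcription downstream
```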
OpenAI Whisper (red):
There is nothing much to say about them, they just work well. The only files with a 100% error are the ones that do not fit into 25 MB.
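For illustration, a naive workaround for the 25 MB limit: split the audio with a small overlap (pydub) and transcribe each piece with the openai-python client. The chunk sizes are arbitrary, and the hard part - merging text at the junctions - is deliberately left unsolved, which is exactly the problem I complained about above:

```python
# Naive overlapping split for files over the 25 MB limit (requires ffmpeg).
from openai import OpenAI
from pydub import AudioSegment

CHUNK_MS = 10 * 60 * 1000   # 10-minute chunks
OVERLAP_MS = 5 * 1000       # 5-second overlap at each junction

client = OpenAI()           # reads OPENAI_API_KEY from the environment

def transcribe_big(path: str) -> list[str]:
    audio = AudioSegment.from_file(path)
    texts, start = [], 0
    while start < len(audio):               # len() is in milliseconds
        audio[start : start + CHUNK_MS].export("/tmp/piece.mp3", format="mp3")
        with open("/tmp/piece.mp3", "rb") as f:
            texts.append(client.audio.transcriptions.create(
                model="whisper-1", file=f).text)
        start += CHUNK_MS - OVERLAP_MS
    return texts  # duplicated words at the junctions still need deduplication
```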
DeepInfra (blue):
A good, fast service, but unfortunately 6 out of 16 recordings returned an error.
Nexara (lime):
Everything is also great, on par with Fireworks, but they are a bit slower than Fireworks’ Turbo model and apparently do not use VAD, since they try to guess something even in the songs. It is precisely thanks to at least a few correct words in the songs that Nexara got the lowest average error across all tests.
The service was a pleasant surprise: it appeared on the Radar only recently, and I did not expect anything out of the ordinary from it.
Shopot (blue):
Everything is fine: the error is on par with competitors, VAD can be toggled on and off in the API (turning it off can reduce the error in some cases), everything is convenient. The only issues are that one of their requests returned an error, and the API is slow overall.
And, if it is more convenient, you can look at the results in a table:
| Provider | Average WER | Average speed (sec audio / sec) | % successful requests | Price (₽ / 1000 min) |
|---|---|---|---|---|
| DeepInfra Whisper v3 Large | 0.59 | 56.3 | 62.5 | 39.6 |
| **Fireworks Large v3 Turbo** | 0.42 | 149.9 | 100 | 80 |
| **Fireworks Large-v3** | 0.41 | 89.6 | 100 | 132 |
| **Nexara** | 0.38 | 110.7 | 100 | 360 |
| OpenAI Whisper API | 0.54 | 24.4 | 81.2* | 528 |
| AssemblyAI | 0.79 | 51 | 100 | 542 |
| Shopot | 0.52 | NaN | 81.2* | 2000 |
* - the success rate counts timeouts (for Shopot) and files over the 25 MB limit (for OpenAI) as failures. Shopot has NaN for speed, since I did not use webhooks. Fireworks and Nexara are in bold, since they have the lowest error.
Summary
Almost any of the tested API services can be used, but there is no clear winner among them. Some have very few features (like Nexara, Fireworks or DeepInfra) and simply return a wall of text with no way to separate speakers. Some are slow, some inaccurate, some unreliable. And for some, nothing works at all (hello, Yandex and Deepgram).
What should I use then?
In my very first ML lecture at university I was told about the “No Free Lunch” concept: there is no machine learning algorithm that solves all problems well. One algorithm is always better than another at something, and there is no single clear winner. And since these APIs wrap ML models, I suggest applying the same approach. I am sure that for some niche application AssemblyAI will give minimal error and please the user; I am sure someone will find a use for Fireworks’ high speed; and I am sure many will find webhooks more convenient than holding a connection open, as with Shopot (if their requests did not fail, it would be really cool). When making a choice, test it yourself and do not trust no-names from the Internet :-)
What if I don’t want API?
If you just need to quickly run a couple of audios and get an answer, there are many free services, such as:
- https://huggingface.co/spaces?q=whisper - pick any card and transcribe to your heart’s content;
- a good service with a generous free tier, also found on the Radar.
I’ll be glad to join the discussion in the comments; share your experience with any speech transcription API :-)