Apple's New Transcription APIs Are Fast, But What About Accuracy?

I recently read about Apple’s new transcription APIs blowing past Whisper in speed tests. Since I already had an app in the same niche, I had to check it out.
I added support for the new Transcription APIs in my app, VoiceInk, and the results were surprisingly good.
Apple’s transcription used to be pretty bad compared to SOTA STT models like Whisper. But the word is that it’s much better now, so I thought it was time to test it.
For the past two days, I was constantly switching between Apple’s new Transcription APIs and Whisper Large V3 Turbo using VoiceInk. Based on my personal experience, Apple’s dictation is almost instant, while Whisper Large V3 Turbo takes a couple of seconds even for short dictations and messages. But when it comes to accuracy, Apple is still falling behind.
I’m a non-native English speaker, so the results might vary slightly for others.
I ran some experiments to test the speed and accuracy of Apple’s new transcription APIs against OpenAI’s Whisper model. Let’s see what we got.
Speed Test
Sample No. | Length | Apple Transcription Time | Apple RTFx | Whisper Transcription Time | Whisper RTFx |
---|---|---|---|---|---|
1 | 2m 10s | 2.66s | 49.1 | 5.60s | 23.4 |
2 | 19m 8s | 17.51s | 65.5 | 51.80s | 22.2 |
3 | 2hr 4m 39s | 1m 39s | 75.3 | 6m 0s | 20.7 |
Apple’s new Transcription APIs seem to be significantly faster than the Whisper Large V3 Turbo model.
For this test, I was running the Large V3 Turbo model with Core ML support in the VoiceInk app.
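In case the RTFx column is unfamiliar: it’s the inverse real-time factor, i.e., the length of the audio divided by the time it took to transcribe, so higher means faster. Here’s a minimal Python sketch (the rtfx helper is just for illustration) that roughly reproduces the numbers in the table; the small differences come from rounding of the measured times.

```python
def rtfx(audio_seconds: float, transcription_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second of compute."""
    return audio_seconds / transcription_seconds

# Sample 1: 2m 10s (130s) of audio
print(round(rtfx(130, 2.66), 1))   # Apple   -> 48.9 (49.1 in the table)
print(round(rtfx(130, 5.60), 1))   # Whisper -> 23.2 (23.4 in the table)

# Sample 3: 2h 4m 39s (7479s) of audio
print(round(rtfx(7479, 99), 1))    # Apple   -> 75.5 (75.3 in the table)
print(round(rtfx(7479, 360), 1))   # Whisper -> 20.8 (20.7 in the table)
```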
Accuracy Test
Apple’s new transcription APIs are ~3x faster than Whisper Large V3 Turbo. But what about accuracy? This is something everyone was asking in the Reddit comments.
After using it for some time with VoiceInk, I was pretty sure it was nowhere near the Whisper Large V3 Turbo model. Whisper is not the most accurate model out there, but it works really well with English and other popular languages.
I took the help of AI to calculate the accuracy of Apple’s new transcription APIs.
I wanted to compare the models by calculating the Word Error Rate (WER). By no means am I a professional, and I knew nothing about these tests, but with AI it was possible.
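For context, WER is just (substitutions + deletions + insertions) divided by the number of words in the reference transcript. What follows isn’t my exact script, only a minimal sketch of the kind of calculation involved, using the Python jiwer library with made-up sample strings:

```python
# pip install jiwer
import string

import jiwer

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting differences don't count as errors."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

reference  = normalize("I want to evaluate the word error rate of modern speech to text tools.")
hypothesis = normalize("I want to evaluate the WER of modus pesto text tolls.")

# WER = (substitutions + deletions + insertions) / number of words in the reference
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # -> WER: 50.00%
```

jiwer.wer also accepts lists of reference/hypothesis strings, which makes it easy to get a single aggregate WER per model across all 15 samples.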
I recorded 15 audio samples in English, ranging from 15 seconds to 2 minutes, and tested them against these 3 speech-to-text tools:
- Apple’s New Transcription APIs
- OpenAI Whisper Large V3 Turbo
- ElevenLabs Scribe v1
And the WER results are as follows:
Transcription Model | Word Error Rate (WER) |
---|---|
New Apple Transcription APIs | 7.63% |
Whisper Large V3 Turbo | 3.95% |
ElevenLabs Scribe v1 | 3.42% |
For reference, on the Hugging Face ASR leaderboard, the Large V3 Turbo model’s WER is 7.83%.
Here is one example of the transcripts from all 3 models, along with the accurate transcription.
New Apple Transcription APIs | Whisper Large V3 turbo | ElevenLabs Scribe v1 | Accurate Transcription |
---|---|---|---|
I want to evaluate the WRM modus pesto text tolls, light whisper, and Apple’s transcription APIs. To do this, I need high-quality audio samples paired with accurate transcripts, public data sets such as lift speech or common voice are good starting points. What error rate calculates mistakes in three categories, insertions, deletions, and substitutions? For example, some tools may share acronyms like CNN or GPD, or struggle with words like specificity and algorithm, background noise and varying sample rates may further impact accuracy. Apple’s transcription claims better real-time purpaments, while whisper supports murdering, you will use cases. How do they handle technical terms like biometrics of quantum? I also want to test diarization in multispeaker recordings. For fear comparison, I will use identical audio conditions, across altos, does a higher bit rate, like 48 kilh, improve bridges over the standard 16 kilh. Finally, I will check if what is a speaker ascent affect the WR significantly. This was generated with the help of deep sake. | I want to evaluate the WER of modern space to text tools like Whisper and Apple’s transcription APIs. To do this, I need high quality audio samples paired with accurate transcripts. Public datasets such as Libre Speech or Common Voice are good starting points. Word error rate calculates mistakes in three categories: insertions, deletions, and substitutions. For example, some tools may share acronyms like CNN or GPT or struggle with words like specificity and algorithm. Background noise and varying sample rates may further impact accuracy. Apple’s transcription claims better real-time performance while Whisper supports multi-regul use cases. How do they handle technical terms like biometrics or quantum? I also want to test diarization in multi-speaker recordings. For a fair comparison, I will use identical audio conditions across all those. Does a higher bitrate like 48 kHz improbred results over the standard 16 kHz? Finally, I will check if watches or speaker ascent affect the WER significantly. This was generated with the help of DeepSeq. | I want to evaluate the W-ER of modern speech-to-text tools like Whisper and Apple’s transcription APIs. To do this, I need high quality audio samples paired with accurate transcripts, public datasets such as LibriSpeech or CommonVoice are good starting points. Word error rate calculates mistakes in three categories: insertions, deletions, and substitutions. For example, some tools may share acronyms like CNN or GPT, or struggle with words like specificity and algorithm. Background noise and varying sample rates may, uh, further impact accuracy. Apple’s transcription claims better real time performance while Whisper supports multi-lingual use cases. How do they handle technical terms like biometrics or quantum? I also want to test diarization in multi-speaker recordings. For a ca- fair comparison, I will use identical audio conditions across all tools. Does a higher bitrate, like 48 kilohertz, improve results over the standard 16 kilohertz? Finally, I will check if voices or speaker accent affect the W-ER significantly. This was generated with the help of DeepSeek. | I want to evaluate the word error rate of modern speech-to-text tools like Whisper and Apple’s transcription APIs. To do this, I need high-quality audio samples paired with accurate transcripts. Public datasets such as LibriSpeech or Common Voice are good starting points. Word error rate calculates mistakes in three categories: insertions, deletions, and substitutions. For example, some tools mishear acronyms like CNN or GPT, or struggle with words like “specificity” and “algorithm.” Background noise and varying sample rates may further impact accuracy. Apple’s latest transcription claims better real-time performance, while Whisper supports multilingual use cases. How do they handle technical terms like “biometrics” or “quantum?” I also want to test diarization in multi-speaker recordings. For a fair comparison, I’ll use identical audio conditions across all tools. Does a higher bitrate like 48 kHz improve results over standard 16 kHz? Finally, I’ll check if pauses or speaker accents affect the WER significantly. This was generated with the help of DeepSeek. |
Note: All of these tests were carried out on a MacBook Pro M2 with 16GB RAM. The audio samples were recorded by me (a non-native English speaker) with some level of background noise.