Apple's New Transcription APIs Are Fast, But What About Accuracy?

Can Apple’s new transcription APIs beat OpenAI’s year-old Whisper models?

I recently read about Apple’s new transcription APIs blowing past Whisper in speed tests. And since I already had an app in the same niche, I had to check it out.

I added support for the new transcription APIs to my app, VoiceInk, and the results were surprisingly good.

Apple’s transcription used to be nowhere near SOTA STT models like Whisper. But the rumor is that it’s much better now.

I thought it was time to put that to the test.

For the past 2 days, I was constantly switching between Apple’s new transcription APIs and Whisper Large V3 Turbo using VoiceInk. In my experience, Apple’s dictation is almost instant, while Whisper Large V3 Turbo takes a couple of seconds for short dictations and messages. On accuracy, though, Apple is still falling behind.

I’m a non-native English speaker, so the results might vary slightly for others.

I carried out some experiments to test the speed and accuracy of Apple’s new transcription APIs against OpenAI’s Whisper model. Let’s see what we got.

Speed Test

| Sample No. | Length | Apple Transcription Time | RTFx | Whisper Transcription Time | RTFx |
| --- | --- | --- | --- | --- | --- |
| 1 | 2m 10s | 2.66s | 49.1 | 5.60s | 23.4 |
| 2 | 19m 8s | 17.51s | 65.5 | 51.80s | 22.2 |
| 3 | 2hr 4m 39s | 1m 39s | 75.3 | 6m 0s | 20.7 |

Apple’s new transcription APIs are significantly faster than the Whisper Large V3 Turbo model.

For this test, I ran the Large V3 Turbo model with Core ML support in the VoiceInk app.
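In the table, RTFx is the real-time factor: audio duration divided by transcription time, so higher means faster than real time. Here is a minimal sketch of that calculation using the samples from the table above; the printed values may differ slightly from the table because the displayed lengths are rounded to the second.

```python
# RTFx (real-time factor) = audio duration / transcription time.
# Higher means faster than real time. The durations below are the
# three samples from the speed-test table, converted to seconds.

samples = [
    # (label, audio length [s], Apple time [s], Whisper time [s])
    ("Sample 1 (2m 10s)",    130,  2.66,   5.60),
    ("Sample 2 (19m 8s)",    1148, 17.51,  51.80),
    ("Sample 3 (2h 4m 39s)", 7479, 99.0,   360.0),
]

for label, audio_s, apple_s, whisper_s in samples:
    print(f"{label}: Apple RTFx = {audio_s / apple_s:.1f}, "
          f"Whisper RTFx = {audio_s / whisper_s:.1f}")
```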

Accuracy Test

Apple’s new transcription APIs are ~3x faster than Whisper Large V3 Turbo. But what about accuracy? This is what everyone was asking in the Reddit comments.

After using it for some time with VoiceInk, I was pretty sure it was nowhere near the Whisper Large V3 Turbo model. Whisper isn’t the most accurate model out there, but it works really well with English and other popular languages.

I took the help of AI to calculate the accuracy of Apple’s new transcription APIs.

I wanted to compare them by calculating the word error rate (WER). By no means am I a professional, and I knew nothing about these tests, but with AI it was possible (see the sketch after the list below).

I recorded 15 audio samples in English, randomly ranging from 15 seconds to 2 minutes, and tested them against these 3 speech-to-text tools:

  • Apple’s New Transcription APIs
  • OpenAI Whisper Large V3 Turbo
  • ElevenLabs Scribe v1
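WER counts the insertions, deletions, and substitutions needed to turn a model’s output into the reference transcript, divided by the number of reference words. The snippet below is a minimal sketch of that calculation, not the exact script I used; the example strings are made up for illustration, and a real evaluation would also normalize punctuation and numbers before comparing.

```python
# Minimal WER sketch: word-level edit distance (substitutions,
# insertions, deletions) divided by the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution / match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# Made-up example: one substitution and one deletion against a
# 9-word reference gives a WER of about 22%.
reference = "I want to evaluate modern speech to text tools"
hypothesis = "I want to evaluate modern pesto text tools"
print(f"WER: {wer(reference, hypothesis):.2%}")
```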

And the WER results are as follows:

| Transcription Model | Word Error Rate (WER) |
| --- | --- |
| New Apple Transcription APIs | 7.63% |
| Whisper Large V3 Turbo | 3.95% |
| ElevenLabs Scribe v1 | 3.42% |

For reference, on the Hugging Face ASR leaderboard, the Large V3 Turbo model’s WER is 7.83%.

Here is one example of the transcripts from all 3 models, alongside the accurate transcription.

New Apple Transcription APIs:
I want to evaluate the WRM modus pesto text tolls, light whisper, and Apple’s transcription APIs. To do this, I need high-quality audio samples paired with accurate transcripts, public data sets such as lift speech or common voice are good starting points. What error rate calculates mistakes in three categories, insertions, deletions, and substitutions? For example, some tools may share acronyms like CNN or GPD, or struggle with words like specificity and algorithm, background noise and varying sample rates may further impact accuracy. Apple’s transcription claims better real-time purpaments, while whisper supports murdering, you will use cases. How do they handle technical terms like biometrics of quantum? I also want to test diarization in multispeaker recordings. For fear comparison, I will use identical audio conditions, across altos, does a higher bit rate, like 48 kilh, improve bridges over the standard 16 kilh. Finally, I will check if what is a speaker ascent affect the WR significantly. This was generated with the help of deep sake.

Whisper Large V3 Turbo:
I want to evaluate the WER of modern space to text tools like Whisper and Apple’s transcription APIs. To do this, I need high quality audio samples paired with accurate transcripts. Public datasets such as Libre Speech or Common Voice are good starting points. Word error rate calculates mistakes in three categories: insertions, deletions, and substitutions. For example, some tools may share acronyms like CNN or GPT or struggle with words like specificity and algorithm. Background noise and varying sample rates may further impact accuracy. Apple’s transcription claims better real-time performance while Whisper supports multi-regul use cases. How do they handle technical terms like biometrics or quantum? I also want to test diarization in multi-speaker recordings. For a fair comparison, I will use identical audio conditions across all those. Does a higher bitrate like 48 kHz improbred results over the standard 16 kHz? Finally, I will check if watches or speaker ascent affect the WER significantly. This was generated with the help of DeepSeq.

ElevenLabs Scribe v1:
I want to evaluate the W-ER of modern speech-to-text tools like Whisper and Apple’s transcription APIs. To do this, I need high quality audio samples paired with accurate transcripts, public datasets such as LibriSpeech or CommonVoice are good starting points. Word error rate calculates mistakes in three categories: insertions, deletions, and substitutions. For example, some tools may share acronyms like CNN or GPT, or struggle with words like specificity and algorithm. Background noise and varying sample rates may, uh, further impact accuracy. Apple’s transcription claims better real time performance while Whisper supports multi-lingual use cases. How do they handle technical terms like biometrics or quantum? I also want to test diarization in multi-speaker recordings. For a ca- fair comparison, I will use identical audio conditions across all tools. Does a higher bitrate, like 48 kilohertz, improve results over the standard 16 kilohertz? Finally, I will check if voices or speaker accent affect the W-ER significantly. This was generated with the help of DeepSeek.

Accurate Transcription:
I want to evaluate the word error rate of modern speech-to-text tools like Whisper and Apple’s transcription APIs. To do this, I need high-quality audio samples paired with accurate transcripts. Public datasets such as LibriSpeech or Common Voice are good starting points. Word error rate calculates mistakes in three categories: insertions, deletions, and substitutions. For example, some tools mishear acronyms like CNN or GPT, or struggle with words like “specificity” and “algorithm.” Background noise and varying sample rates may further impact accuracy. Apple’s latest transcription claims better real-time performance, while Whisper supports multilingual use cases. How do they handle technical terms like “biometrics” or “quantum?” I also want to test diarization in multi-speaker recordings. For a fair comparison, I’ll use identical audio conditions across all tools. Does a higher bitrate like 48 kHz improve results over standard 16 kHz? Finally, I’ll check if pauses or speaker accents affect the WER significantly. This was generated with the help of DeepSeek.

Note: All of these tests were carried out on a MacBook Pro M2 with 16GB RAM. The audio samples were recorded by me (a non-native English speaker) with some level of background noise.