Speech AI Is More Than Transcription
Most people still meet speech AI through transcription. They upload a recording, get text back, and judge the model by word error rate. That view is useful, but it is also strangely narrow. Speech is not just text carried by sound. It is timing, stress, rhythm, articulation, pause structure, pitch, breath, hesitation, intelligibility, and change over time.
That is why speech therapy and clinical speech analysis are such an interesting test for the current generation of voice models. The question is not just whether a model heard the words correctly. It is whether the system can preserve and interpret the speech features that a clinician, coach, or researcher actually cares about.
As of April 30, 2026, the answer is: partly, but only if we stop treating ASR as the whole system.
The models are strong, but incomplete
The strongest open models and backbones are now familiar: Whisper, WavLM, XLS-R, Meta's MMS and Omnilingual ASR, NVIDIA's Canary and Parakeet families, and newer reproducible work such as Ai2 OLMoASR. On the commercial side, the important services include OpenAI gpt-4o-transcribe, Google Chirp 3, Deepgram Nova-3, AssemblyAI Universal models, and Amazon Transcribe.
For therapy, the standout commercial feature set remains Azure Pronunciation Assessment. It exposes pronunciation-relevant scores at phoneme, syllable, word, and full-text levels, plus fluency, completeness, and prosody outputs in supported scenarios. That makes it unusually practical for scripted articulation drills, reading tasks, and guided home practice.
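As a rough illustration, here is a minimal sketch of a scripted drill using the Azure Speech SDK's pronunciation assessment. The key, region, reference text, and file name are placeholders, and the exact fields you read back depend on the scenario.

```python
# Minimal sketch of scripted pronunciation assessment with the Azure Speech
# SDK (azure-cognitiveservices-speech). Key, region, reference text, and
# audio path are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
audio_config = speechsdk.audio.AudioConfig(filename="drill_take_01.wav")

# Score the learner's reading against the scripted reference text,
# down to phoneme granularity.
pron_config = speechsdk.PronunciationAssessmentConfig(
    reference_text="The rabbit ran around the red rock.",
    grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
    granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
)

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
pron_config.apply_to(recognizer)

result = recognizer.recognize_once()
assessment = speechsdk.PronunciationAssessmentResult(result)
print("accuracy:", assessment.accuracy_score)
print("fluency:", assessment.fluency_score)
print("completeness:", assessment.completeness_score)
```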
But none of these systems is, by itself, a general clinical speech engine. A low word error rate does not prove correct articulation. A clean transcript does not tell you whether a child substituted one phoneme for another, whether a patient is developing atypical pause patterns, whether speech rate changed after treatment, or whether a stuttering event was handled correctly.
The useful architecture is a pipeline
A defensible speech therapy system has at least four layers.
First is capture and quality control: microphone checks, clipping detection, voice activity detection, denoising, and confidence gating. If the recording is bad, the system should say so before it turns noise into a clinical-looking score.
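A minimal sketch of that gating step, assuming 16 kHz mono WAV input and illustrative, untuned thresholds:

```python
# Minimal recording quality gates: clipping ratio and overall level.
# Thresholds here are illustrative, not tuned for any deployment.
import numpy as np
import soundfile as sf

def quality_check(path, clip_ratio_max=0.001, rms_min=0.005):
    audio, sr = sf.read(path)
    if audio.ndim > 1:                  # collapse to mono if needed
        audio = audio.mean(axis=1)
    clip_ratio = np.mean(np.abs(audio) >= 0.999)  # near full-scale samples
    rms = np.sqrt(np.mean(audio ** 2))            # overall signal level
    problems = []
    if clip_ratio > clip_ratio_max:
        problems.append(f"clipping ({clip_ratio:.2%} of samples)")
    if rms < rms_min:
        problems.append(f"signal too quiet (RMS {rms:.4f})")
    return problems  # empty list means the recording passes the gate

issues = quality_check("session_take_01.wav")
if issues:
    print("Recording rejected:", "; ".join(issues))
```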
Second is transcription. Any of Whisper, Canary, Parakeet, Chirp, gpt-4o-transcribe, Nova-3, AssemblyAI, or Amazon Transcribe can serve as an excellent front end here.
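For example, a sketch of this layer using faster-whisper, one common Whisper runtime; the model size, device, and file name are examples:

```python
# Transcription layer sketch with faster-whisper. Word timestamps are
# requested here because the next layer needs timing to work with.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("session_take_01.wav", word_timestamps=True)

print("detected language:", info.language)
for seg in segments:
    for word in seg.words:
        # Rough word timings; phone-level structure still needs an aligner.
        print(f"{word.start:6.2f}-{word.end:6.2f}  {word.word}")
```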
Third is alignment and feature extraction. This is the layer most product demos skip. Tools such as Montreal Forced Aligner, WhisperX, and NeMo Forced Aligner exist because base ASR models usually do not expose enough reliable phone-level structure on their own.
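A sketch of that alignment step with WhisperX, assuming the transcript comes from a Whisper front end; the device and file path are placeholders:

```python
# Alignment sketch with WhisperX: re-align ASR segments with a
# wav2vec2-based alignment model to get much tighter word timings.
import whisperx

device = "cuda"
audio = whisperx.load_audio("session_take_01.wav")

model = whisperx.load_model("large-v3", device)
result = model.transcribe(audio)

align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for seg in aligned["segments"]:
    for word in seg["words"]:
        # start/end can be missing for words the aligner could not place
        print(word["word"], word.get("start"), word.get("end"))
```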
Fourth is the clinical or coaching layer: mispronunciation detection, fluency scoring, dysarthria or aphasia adaptation, anomaly detection, severity estimation, and longitudinal change tracking. This layer needs task-specific data and human validation. It cannot be inferred safely from transcript quality alone.
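To make the distinction concrete, here is a toy sketch of one clinical-layer primitive: diffing the phones an aligner observed against the target phones for a scripted prompt. The phone sequences are hypothetical inputs from the alignment layer, and a real system would need validated clinical data behind any judgment like this.

```python
# Toy sketch: compare observed phones against target phones and report
# substitutions, omissions, and additions. Inputs are hypothetical.
from difflib import SequenceMatcher

target   = ["r", "ae", "b", "ih", "t"]   # intended phones for "rabbit"
observed = ["w", "ae", "b", "ih", "t"]   # /r/ -> /w/ gliding substitution

for op, i1, i2, j1, j2 in SequenceMatcher(a=target, b=observed).get_opcodes():
    if op == "replace":
        print(f"substitution: {target[i1:i2]} -> {observed[j1:j2]}")
    elif op == "delete":
        print(f"omission: {target[i1:i2]}")
    elif op == "insert":
        print(f"addition: {observed[j1:j2]}")
```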
What looks most promising
For research-grade private deployments, the most pragmatic stack is still Whisper large-v3 or turbo for transcription, WhisperX or Montreal Forced Aligner for timing, and WavLM or XLS-R for downstream speech-feature heads. Whisper has become the default open ASR starting point, while WavLM and XLS-R are better thought of as representation backbones for the features below text.
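A sketch of the representation-backbone idea, pulling frame-level WavLM features through Hugging Face transformers as input to a downstream speech-feature head; the checkpoint name is one public option, not a recommendation:

```python
# Extract frame-level WavLM representations as features for a downstream
# head. The random waveform stands in for 1 s of real 16 kHz audio.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

waveform = torch.randn(16000)  # stand-in for 1 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768)

# A fluency or severity head would be trained on top of these frames.
print(hidden.shape)
```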
For on-prem production systems, NVIDIA's Canary and Parakeet ecosystem is attractive because it gives a modern open ASR stack, streaming and chunked inference paths, timestamps, and an aligner in the same tooling family. That matters for hospitals, enterprises, and regulated teams that cannot simply send therapy audio to any cloud API.
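A sketch of what that looks like in practice, assuming NeMo's generic ASRModel interface; the Parakeet checkpoint name is one public example, and the return type of transcribe varies across NeMo versions:

```python
# On-prem transcription sketch with NVIDIA NeMo. Checkpoint name is one
# public example; exact transcribe() output type is version-dependent.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")
outputs = asr_model.transcribe(["session_take_01.wav"])
print(outputs[0])  # best transcript (plain text or a Hypothesis object)
```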
For scripted pronunciation and articulation coaching, Azure is currently the most mature off-the-shelf option. It is not a neurological diagnosis system, but it does expose the kind of phone and syllable evidence that many therapy exercises need.
For cloud-first products, gpt-4o-transcribe, Chirp 3, Nova-3, AssemblyAI, and Amazon Transcribe are credible transcript layers. They should be treated as inputs to a separate analysis stack, not as the analysis stack itself.
The dataset problem is still the bottleneck
The hard part is no longer only model architecture. It is data.
Clinical speech datasets are small, protected, heterogeneous, and often English-skewed. UA-Speech and TORGO are valuable for dysarthric speech, but small enough that bad splits, such as letting the same speaker appear in both training and test sets, can create misleadingly optimistic results. AphasiaBank is essential for aphasia research, but access and annotation are controlled. FluencyBank is important for stuttering and fluency work, but event-level temporal annotation remains limited compared with the scale of general ASR corpora.
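To make the split problem concrete, here is a toy sketch of a speaker-disjoint split, the kind of discipline small corpora force on you; the utterance dicts are a made-up metadata shape, not any corpus's real format.

```python
# Toy speaker-disjoint split. Utterance dicts with a "speaker" key are a
# made-up shape for illustration.
import random

def speaker_disjoint_split(utterances, test_fraction=0.2, seed=0):
    speakers = sorted({u["speaker"] for u in utterances})
    random.Random(seed).shuffle(speakers)
    held_out = set(speakers[: max(1, int(len(speakers) * test_fraction))])
    train = [u for u in utterances if u["speaker"] not in held_out]
    test = [u for u in utterances if u["speaker"] in held_out]
    return train, test

# With speaker overlap between train and test, a model can memorize voices
# instead of learning the disorder: the optimistic-split trap.
```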
That means evaluation has to be stricter than normal ASR benchmarking. WER and CER are not enough. Therapy systems should report phone-level errors, false acceptance, false rejection, diagnostic error, disfluency F1, severity correlation, boundary accuracy, and performance by severity, accent, age, sex, device, and language whenever the data allows.
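As one concrete example, a minimal sketch of false acceptance and false rejection for phone-level mispronunciation detection, using hypothetical per-phone labels (1 = mispronounced, 0 = correct):

```python
# False acceptance: the system accepts a phone humans rated mispronounced.
# False rejection: the system flags a phone humans rated correct.
def fa_fr_rates(gold, predicted):
    fa = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    fr = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    n_bad = sum(gold) or 1                 # mispronounced phones (guard /0)
    n_good = (len(gold) - sum(gold)) or 1  # correct phones (guard /0)
    return fa / n_bad, fr / n_good

gold      = [1, 0, 0, 1, 0, 0, 1, 0]  # human-rated phone judgments
predicted = [0, 0, 0, 1, 1, 0, 1, 0]  # system judgments
fa, fr = fa_fr_rates(gold, predicted)
print(f"false acceptance {fa:.2f}, false rejection {fr:.2f}")
```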
The most useful near-term work is not a leaderboard for "best speech model." It is a set of narrow, reproducible therapy tasks where we know what feature is being measured and can compare the model against human-rated evidence.
Why this matters
Speech models are starting to become good enough to change how therapy workflows are built. They can help with home practice, triage, progress tracking, accessibility, and clinical documentation. They may also help surface patterns that are difficult to measure consistently by ear alone.
But the systems need to be honest about what they are doing. A pronunciation coach, a dysarthria monitor, an aphasia screening assistant, and a general transcription API are different products. They need different data, metrics, governance, and review loops.
That is the point of our new research project, Speech AI Beyond Transcription. We are mapping the current model landscape from the speech-feature perspective: pronunciation, anomalies, fluency, speech-pattern issues, and therapy needs. The goal is to identify which model stacks are useful today, where the evidence is weak, and what a responsible evaluation protocol should look like.
The interesting frontier in speech AI is not just turning audio into text. It is turning speech into interpretable evidence without pretending that evidence is a diagnosis.