Speech AI Beyond Transcription
A research program on modern voice recognition models as feature extractors: mispronunciation, anomalies, fluency, speech-pattern issues, dysarthria, aphasia, and therapy workflows.
Core finding
The strongest systems are pipelines, not single models
Most people evaluate speech AI as transcription. For clinical and therapy use, the transcript is just one artifact. The system has to preserve the acoustic and phonetic evidence that clinicians care about.
ASR is only one layer
Transcript accuracy helps, but therapy workflows need audio quality control, alignment, prosody, task-specific scoring, and calibrated human review.
Speech features are the product
The most valuable signals live below text: phones, syllables, pauses, rhythm, stress, intelligibility, disfluency, and outlier speech patterns.
Clinical evidence is still thin
Recent models are useful research backbones, but disordered-speech datasets remain small, restricted, English-skewed, and easy to overfit.
Architecture
A therapy-grade speech stack
The project evaluates speech models as components in a larger system. Each layer has different failure modes, metrics, and privacy requirements.
Capture and quality control
VAD, denoising, microphone checks, clipping detection, reverberation estimates, and confidence gating before any clinical interpretation.
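The gating idea above can be sketched in a few lines. This is an illustrative sketch, not the project's implementation: the function name, thresholds, and flag names are all placeholders, and real deployments would tune them per device and add reverberation and SNR checks.

```python
def capture_quality(samples, clip_level=0.99, silence_rms=0.01):
    """Simple QC flags for one mono audio buffer with samples in [-1, 1].

    Thresholds are illustrative placeholders, not validated values.
    """
    n = len(samples)
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    rms = (sum(s * s for s in samples) / n) ** 0.5 if n else 0.0
    return {
        "clipping_ratio": clipped / n if n else 0.0,
        "rms": rms,
        # Gate: reject empty, clipped, or near-silent recordings outright.
        "ok": n > 0 and clipped / n < 0.001 and rms > silence_rms,
    }
```

Downstream layers would run only when `ok` is true, so a bad microphone never masquerades as a speech-pattern finding.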
Transcription front end
Whisper, Canary, Parakeet, Chirp, gpt-4o-transcribe, Nova, AssemblyAI, or Amazon Transcribe depending on language, latency, and privacy constraints.
Alignment and pronunciation
Phone, syllable, word, and segment timing from Azure Pronunciation Assessment, Montreal Forced Aligner, WhisperX, or NeMo Forced Aligner.
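Whichever aligner produces the timings, the therapy-relevant features are simple functions of them. A minimal sketch, assuming aligner output has been adapted into `(phone, start_s, end_s)` tuples; MFA, WhisperX, and NeMo Forced Aligner each need their own small adapter to produce that shape.

```python
def phone_durations(aligned):
    """aligned: list of (phone_label, start_seconds, end_seconds) tuples."""
    return [(p, round(e - s, 3)) for p, s, e in aligned]

def mean_phone_rate(aligned):
    """Phones per second across the aligned span (0.0 for empty input)."""
    if not aligned:
        return 0.0
    span = aligned[-1][2] - aligned[0][1]
    return len(aligned) / span if span > 0 else 0.0
```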
Clinical analysis heads
Mispronunciation, dysarthria, aphasia, stuttering, apraxia-like timing, anomaly, and severity models trained on task-specific data.
Model landscape
Research shortlist as of April 2026
Open models are most useful where auditability, local deployment, and adaptation matter. Commercial systems are strongest for rapid production transcription and scripted pronunciation workflows.
Open and research models
| Model | Type | Strength | Therapy fit | Main limit |
|---|---|---|---|---|
| Whisper large-v3 / turbo | Open ASR backbone | Strong multilingual transcription, robust noisy-speech baseline, active clinical fine-tuning literature. | Best default open ASR starting point when paired with WhisperX or MFA and pathology-specific adaptation. | Does not natively expose therapy-grade phone, prosody, or disorder scores. |
| WavLM Large | Self-supervised speech representation | Phoneme-aware English representations; strong for speaker, paralinguistic, and downstream speech-feature tasks. | Excellent candidate for pronunciation, anomaly, intelligibility, and severity heads. | English-centric pretraining and not a turnkey ASR service. |
| XLS-R | Multilingual SSL backbone | Large cross-lingual wav2vec 2.0 family trained across 128 languages. | Useful where clinical labels are scarce and multilingual generalization matters. | Heavy models and task-specific fine-tuning still required. |
| MMS and Omnilingual ASR | Very broad language coverage | Meta's MMS expanded open ASR to more than 1,100 languages; Omnilingual ASR pushes the low-resource frontier further. | Important for language access and low-resource screening research. | Language coverage does not imply validated clinical pronunciation or therapy scoring. |
| NVIDIA Canary / Parakeet | Open production ASR stack | FastConformer-based models with streaming, chunked inference, timestamps, and NeMo ecosystem tooling. | Strong option for on-prem, GPU-backed, multilingual or fast English transcript front ends. | Needs external clinical heads for feature-level interpretation. |
| Ai2 OLMoASR | Open reproducible ASR | Fully open training recipe and datasets compared with many opaque speech systems. | Attractive for auditable research where reproducibility matters. | Newer ecosystem; less direct clinical adaptation evidence than Whisper and SSL backbones. |
Commercial services and scoring layers
| Model | Type | Strength | Therapy fit | Main limit |
|---|---|---|---|---|
| Azure Speech Pronunciation Assessment | Commercial scoring layer | Documented phoneme, syllable, word, full-text, fluency, completeness, and prosody outputs. | Best turnkey choice for scripted articulation, reading, and guided practice workflows. | Reference-text centered; prosody support remains narrower than general STT coverage. |
| OpenAI gpt-4o-transcribe | Commercial STT / realtime front end | High-quality transcription with realtime and batch pathways for voice interfaces. | Good transcript layer for coaching products that will add their own feature analysis. | Opaque internals and no public phone-level clinical scoring surface. |
| Google Chirp 3 | Commercial multilingual STT | Production streaming, batch recognition, language ID, diarization, and broad language coverage. | Strong multilingual transcript service when Google Cloud is already acceptable. | Word timestamps and transcript features are not the same as therapy-grade phone analysis. |
| Deepgram Nova-3 / Nova-3 Medical | Commercial STT, hosted or self-hosted | Strong streaming, domain terms, multilingual support, and a deployment-control story for regulated environments. | Practical front end for healthcare products that need speed and possible self-hosting. | Still needs separate prosody, pronunciation, and disorder-analysis layers. |
| AssemblyAI Universal-2 / Universal-3 Pro | Commercial STT API | Good developer ergonomics, streaming, prompting, formatting, and language coverage across model families. | Useful for product teams that prioritize API speed and transcript quality. | Not a standalone clinical speech-analysis engine. |
| Amazon Transcribe / Medical | Commercial STT and medical dictation | Batch and streaming transcription with vocabularies and custom language models. | Useful baseline for medical dictation and terminology-heavy workflows. | Medical is US-English oriented and not designed for phone-level therapy scoring. |
Capability map
What to use for speech features
The practical question is not which model has the lowest WER. It is which combination exposes the right evidence for the clinical or coaching task.
Mispronunciation and articulation
Best fit: Azure Pronunciation Assessment for scripted practice; WavLM/XLS-R/Whisper plus MFA or WhisperX for research.
Measure: Phone-level accuracy, syllable timing, goodness-of-pronunciation (GOP) scores, and phonological-feature classifiers.
False acceptance matters clinically: a fluent transcript can still hide a wrong articulation.
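The classic GOP score compares the log posterior of the canonical phone against the best competing phone. A minimal sketch, assuming frame-averaged phone posteriors from some acoustic model; the phone set and values in the example are illustrative only.

```python
import math

def gop(posteriors, canonical):
    """GOP = log P(canonical | O) - max_q log P(q | O).

    Near 0 means the canonical phone is also the acoustically most
    likely one; strongly negative suggests a substitution.
    """
    best = max(posteriors.values())
    return math.log(posteriors[canonical]) - math.log(best)
```

Thresholds would be calibrated per phone against human ratings, since the same GOP value can mean different things for a fricative and a vowel.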
Prosody, fluency, and speech pattern issues
Best fit: Azure for out-of-the-box fluency/prosody in supported settings; custom acoustic feature channels elsewhere.
Measure: F0, intensity, pause structure, speech rate, stress, rhythm, disfluency events, and longitudinal change.
WER is often the wrong primary metric for stuttering, Parkinsonian speech, or apraxia-like timing.
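Pause structure is one of the features WER never sees. A sketch of extracting it from VAD output; the `(start_s, end_s)` segment tuples and the minimum-pause threshold are assumptions, not a fixed spec.

```python
def pause_features(speech_segments, min_pause=0.25):
    """speech_segments: time-sorted list of (start_s, end_s) speech intervals."""
    pauses = []
    for (s0, e0), (s1, _) in zip(speech_segments, speech_segments[1:]):
        gap = s1 - e0
        if gap >= min_pause:  # ignore micro-gaps below the pause threshold
            pauses.append(gap)
    total_speech = sum(e - s for s, e in speech_segments)
    return {
        "pause_count": len(pauses),
        "mean_pause_s": sum(pauses) / len(pauses) if pauses else 0.0,
        "speech_time_s": total_speech,
    }
```

Tracked longitudinally, these counts and durations surface exactly the rhythm and timing changes that transcripts hide.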
Dysarthria and aphasia
Best fit: Fine-tuned Whisper, WavLM, XLS-R, or Canary pipelines with speaker-independent clinical evaluation.
Measure: Transcript recovery, intelligibility, severity scores, linguistic feature preservation, and anomaly flags.
Small datasets make leakage, prompt overlap, and recording-condition confounds a serious risk.
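The leakage risk makes the split logic worth writing explicitly. A minimal sketch of a speaker-independent split, assuming utterance records carry a `"speaker"` key; on corpora the size of UA-Speech or TORGO, a random utterance-level split lets models memorize speakers instead of learning the disorder.

```python
def split_by_speaker(records, test_speakers):
    """Partition records so held-out speakers never leak into training."""
    test_speakers = set(test_speakers)
    train = [r for r in records if r["speaker"] not in test_speakers]
    test = [r for r in records if r["speaker"] in test_speakers]
    return train, test
```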
Anomaly detection
Best fit: Self-supervised embeddings plus calibrated one-class or supervised heads, with clinician review.
Measure: Outlier pronunciation, sudden decline, atypical pause patterns, code-switching breakdowns, or device-quality failures.
Anomaly is not diagnosis; the system must separate capture artifacts from clinically meaningful change.
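A one-class head over self-supervised embeddings can be as simple as a distance from a reference pool. This is a dependency-free sketch with a diagonal-covariance z-distance; a real head would use a calibrated model, separate capture-artifact checks, and clinician review, as above. All names and shapes here are illustrative.

```python
def fit_reference(embeddings):
    """Per-dimension mean and variance of a reference (typical-speech) pool."""
    dims = len(embeddings[0])
    n = len(embeddings)
    mean = [sum(e[d] for e in embeddings) / n for d in range(dims)]
    var = [sum((e[d] - mean[d]) ** 2 for e in embeddings) / n or 1e-8
           for d in range(dims)]  # floor zero variance to avoid division by 0
    return mean, var

def anomaly_score(embedding, mean, var):
    """Mean squared z-score across dimensions; larger means more atypical."""
    return sum((x - m) ** 2 / v for x, m, v in zip(embedding, mean, var)) / len(mean)
```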
Recommended stacks
Current implementation paths
The right stack depends on whether the priority is research evidence, production latency, language coverage, or out-of-the-box pronunciation feedback.
Research-grade private stack: Whisper large-v3 or turbo, WhisperX or Montreal Forced Aligner, then WavLM or XLS-R embeddings for clinical heads.
On-prem production stack: Canary or Parakeet with NeMo Forced Aligner, plus local prosody and disorder-scoring models.
Scripted pronunciation stack: Azure Pronunciation Assessment, especially for reference-text articulation drills and home practice.
Cloud transcript front end: gpt-4o-transcribe, Chirp 3, Nova-3, AssemblyAI, or Amazon Transcribe feeding a separate feature-analysis layer.
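Whichever stack is chosen, the layering is the same: gate on quality, transcribe, align, then analyze. A sketch of that composition with pluggable stages; every stage name here is a placeholder for whichever component a deployment wires in (e.g. Whisper for `transcribe`, WhisperX for `align`), not a real API.

```python
def run_pipeline(audio, quality_gate, transcribe, align, analyze):
    """Compose the layered stack; each stage is an injected callable."""
    qc = quality_gate(audio)
    if not qc["ok"]:
        # Reject before any clinical interpretation is attempted.
        return {"status": "rejected", "qc": qc}
    transcript = transcribe(audio)
    alignment = align(audio, transcript)
    return {"status": "ok", "qc": qc,
            "transcript": transcript,
            "features": analyze(alignment)}
```

Keeping the stages as injected callables is what lets the same clinical analysis heads sit behind Whisper, Canary, or a commercial STT front end.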
Data and metrics
UA-Speech / TORGO
Dysarthric English speech
High value for accessibility research, but small and sensitive to speaker, prompt, and recording-condition splits.
AphasiaBank / SONIVA
Aphasia and stroke-related language impairment
Useful for linguistic feature preservation and aphasia-oriented ASR, with access and annotation constraints.
FluencyBank
Stuttering and fluency
Important for event-level disfluency work where intended-speech recovery matters more than generic WER.
LibriSpeech, Common Voice, MLS, FLEURS
Healthy-speech baselines
Useful for pretraining and stabilization, but insufficient for clinical validation by themselves.
Evaluation should report WER/CER/PER only as a baseline. Therapy work also needs F1, AUROC, UAR, false acceptance, false rejection, diagnostic error, boundary accuracy, severity correlation, and longitudinal stability.
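False acceptance and false rejection deserve to be computed separately rather than folded into one accuracy number. A sketch for a mispronunciation detector, under the labeling convention (an assumption here) that `1` means "mispronounced" in the ground truth and "flagged" in the prediction.

```python
def fa_fr_rates(y_true, y_pred):
    """FA = mispronunciations the system accepted as correct;
    FR = correct productions the system wrongly rejected."""
    fa = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fr = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_bad = sum(y_true) or 1     # avoid division by zero on degenerate sets
    n_good = (len(y_true) - sum(y_true)) or 1
    return fa / n_bad, fr / n_good
```

The two rates trade off differently in therapy: false acceptance hides errors a clinician should hear, while false rejection erodes learner trust.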
Project plan
Research deliverables
This project turns the landscape review into a reproducible evaluation program for feature-level speech AI.
1. Model registry and evidence map
Maintain a living table of open and commercial speech models, supported languages, alignment surfaces, privacy posture, and therapy-relevant outputs.
2. Feature benchmark
Compare transcript, phone alignment, prosody, pause, and embedding features against human-rated pronunciation and clinical speech tasks.
3. Therapy protocol prototype
Design reference-text articulation drills, reading tasks, spontaneous-speech tasks, and longitudinal monitoring flows for clinician review.
4. Governance and deployment review
Assess GDPR, consent, data retention, model transparency, bias, medical-device boundaries, and human-in-the-loop requirements.
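The feature benchmark's severity-correlation comparison can be made auditable by writing the statistic out in plain Python. A sketch of Spearman rank correlation between model scores and human severity ratings; ties are not corrected here, which a production benchmark would handle (or delegate to a statistics library).

```python
def _ranks(xs):
    """Rank positions of each value (0 = smallest); no tie handling."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for rank, i in enumerate(order):
        ranks[i] = float(rank)
    return ranks

def spearman(model_scores, human_ratings):
    """Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(model_scores), _ranks(human_ratings)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```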
Safety
Clinical guardrails
The project treats speech AI as decision support, not autonomous diagnosis. A useful therapy system must be calibrated, reviewable, and honest about uncertainty.
- Separate capture-quality failure from speech-pattern anomalies.
- Use speaker-independent and site-independent splits whenever possible.
- Report performance by severity, age, sex, accent, language, and device when data allow.
- Keep clinician review in the loop for health claims and treatment changes.
- Design consent, retention, and audit trails before collecting therapy audio.
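The subgroup-reporting guardrail above amounts to computing every metric per group rather than pooled. A minimal sketch, assuming utterance records are dicts carrying grouping keys such as severity or device; the record shape and key names are illustrative.

```python
def metric_by_group(records, group_key, metric):
    """records: list of dicts; metric: fn(list_of_records) -> float.

    Returns {group_value: metric_over_that_group}, so a model that only
    works on mild cases cannot hide behind a pooled average.
    """
    groups = {}
    for r in records:
        groups.setdefault(r[group_key], []).append(r)
    return {g: metric(rs) for g, rs in groups.items()}
```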
Primary sources
We are looking for pilots, datasets, and evaluation partnerships.
The most useful next step is a narrow benchmark: one therapy task, a clear reference protocol, human ratings, and a model stack that can expose the evidence behind its score.
Get in touch