Speech AI Beyond Transcription
A research program on modern voice recognition models as feature extractors: mispronunciation, anomalies, fluency, speech-pattern issues, dysarthria, aphasia, and therapy workflows.
Core finding
The strongest systems are pipelines, not single models
Most people evaluate speech AI as transcription. For clinical and therapy use, the transcript is just one artifact. The system has to preserve the acoustic and phonetic evidence that clinicians care about.
ASR is only one layer
Transcript accuracy helps, but therapy workflows need audio quality control, alignment, prosody, task-specific scoring, and calibrated human review.
Speech features are the product
The most valuable signals live below text: phones, syllables, pauses, rhythm, stress, intelligibility, disfluency, and outlier speech patterns.
Clinical evidence is still thin
Recent models are useful research backbones, but disordered-speech datasets remain small, restricted, English-skewed, and easy to overfit.
Architecture
A therapy-grade speech stack
The project evaluates speech models as components in a larger system. Each layer has different failure modes, metrics, and privacy requirements.
Capture and quality control
VAD, denoising, microphone checks, clipping detection, reverberation estimates, and confidence gating before any clinical interpretation.
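The gating idea above can be sketched in a few lines. This is an illustrative sketch, not the project's implementation: the function name, thresholds, and flag names are all placeholders, and real deployments would tune them per device and add reverberation and SNR checks.

```python
def capture_quality(samples, clip_level=0.99, silence_rms=0.01):
    """Simple QC flags for one mono audio buffer with samples in [-1, 1].

    Thresholds are illustrative placeholders, not validated values.
    """
    n = len(samples)
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    rms = (sum(s * s for s in samples) / n) ** 0.5 if n else 0.0
    return {
        "clipping_ratio": clipped / n if n else 0.0,
        "rms": rms,
        # Gate: reject empty, clipped, or near-silent recordings outright.
        "ok": n > 0 and clipped / n < 0.001 and rms > silence_rms,
    }
```

Downstream layers would run only when `ok` is true, so a bad microphone never masquerades as a speech-pattern finding.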
Transcription front end
Whisper, Canary, Parakeet, Chirp, gpt-4o-transcribe, Nova, AssemblyAI, or Amazon Transcribe depending on language, latency, and privacy constraints.
Alignment and pronunciation
Phone, syllable, word, and segment timing from Azure Pronunciation Assessment, Montreal Forced Aligner, WhisperX, or NeMo Forced Aligner.
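Whichever aligner produces the timings, the therapy-relevant features are simple functions of them. A minimal sketch, assuming aligner output has been adapted into `(phone, start_s, end_s)` tuples; MFA, WhisperX, and NeMo Forced Aligner each need their own small adapter to produce that shape.

```python
def phone_durations(aligned):
    """aligned: list of (phone_label, start_seconds, end_seconds) tuples."""
    return [(p, round(e - s, 3)) for p, s, e in aligned]

def mean_phone_rate(aligned):
    """Phones per second across the aligned span (0.0 for empty input)."""
    if not aligned:
        return 0.0
    span = aligned[-1][2] - aligned[0][1]
    return len(aligned) / span if span > 0 else 0.0
```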
Clinical analysis heads
Mispronunciation, dysarthria, aphasia, stuttering, apraxia-like timing, anomaly, and severity models trained on task-specific data.
Model landscape
Research shortlist as of April 2026
Open models are most useful where auditability, local deployment, and adaptation matter. Commercial systems are strongest for rapid production transcription and scripted pronunciation workflows.
Open and research models
| Model | Type | Strength | Therapy fit | Main limit |
|---|---|---|---|---|
| Whisper large-v3 / turbo | Open ASR backbone | Strong multilingual transcription, robust noisy-speech baseline, active clinical fine-tuning literature. | Best default open ASR starting point when paired with WhisperX or MFA and pathology-specific adaptation. | Does not natively expose therapy-grade phone, prosody, or disorder scores. |
| WavLM Large | Self-supervised speech representation | Phoneme-aware English representations; strong for speaker, paralinguistic, and downstream speech-feature tasks. | Excellent candidate for pronunciation, anomaly, intelligibility, and severity heads. | English-centric pretraining and not a turnkey ASR service. |
| XLS-R | Multilingual SSL backbone | Large cross-lingual wav2vec 2.0 family trained across 128 languages. | Useful where clinical labels are scarce and multilingual generalization matters. | Heavy models and task-specific fine-tuning still required. |
| MMS and Omnilingual ASR | Very broad language coverage | Meta's MMS expanded open ASR to more than 1,100 languages; Omnilingual ASR pushes the low-resource frontier further. | Important for language access and low-resource screening research. | Language coverage does not imply validated clinical pronunciation or therapy scoring. |
| NVIDIA Canary / Parakeet | Open production ASR stack | FastConformer-based models with streaming, chunked inference, timestamps, and NeMo ecosystem tooling. | Strong option for on-prem, GPU-backed, multilingual or fast English transcript front ends. | Needs external clinical heads for feature-level interpretation. |
| Ai2 OLMoASR | Open reproducible ASR | Fully open training recipe and datasets compared with many opaque speech systems. | Attractive for auditable research where reproducibility matters. | Newer ecosystem; less direct clinical adaptation evidence than Whisper and SSL backbones. |
Commercial services and scoring layers
| Model | Type | Strength | Therapy fit | Main limit |
|---|---|---|---|---|
| Azure Speech Pronunciation Assessment | Commercial scoring layer | Documented phoneme, syllable, word, full-text, fluency, completeness, and prosody outputs. | Best turnkey choice for scripted articulation, reading, and guided practice workflows. | Reference-text centered; prosody support remains narrower than general STT coverage. |
| OpenAI gpt-4o-transcribe | Commercial STT / realtime front end | High-quality transcription with realtime and batch pathways for voice interfaces. | Good transcript layer for coaching products that will add their own feature analysis. | Opaque internals and no public phone-level clinical scoring surface. |
| Google Chirp 3 | Commercial multilingual STT | Production streaming, batch recognition, language ID, diarization, and broad language coverage. | Strong multilingual transcript service when Google Cloud is already acceptable. | Word timestamps and transcript features are not the same as therapy-grade phone analysis. |
| Deepgram Nova-3 / Nova-3 Medical | Commercial STT, hosted or self-hosted | Strong streaming, domain terms, multilingual support, and a deployment-control story for regulated environments. | Practical front end for healthcare products that need speed and possible self-hosting. | Still needs separate prosody, pronunciation, and disorder-analysis layers. |
| AssemblyAI Universal-2 / Universal-3 Pro | Commercial STT API | Good developer ergonomics, streaming, prompting, formatting, and language coverage across model families. | Useful for product teams that prioritize API speed and transcript quality. | Not a standalone clinical speech-analysis engine. |
| Amazon Transcribe / Medical | Commercial STT and medical dictation | Batch and streaming transcription with vocabularies and custom language models. | Useful baseline for medical dictation and terminology-heavy workflows. | Medical is US-English oriented and not designed for phone-level therapy scoring. |
Capability map
What to use for speech features
The practical question is not which model has the lowest WER. It is which combination exposes the right evidence for the clinical or coaching task.
Mispronunciation and articulation
Best fit: Azure Pronunciation Assessment for scripted practice; WavLM/XLS-R/Whisper plus MFA or WhisperX for research.
Measure: Phone-level accuracy, syllable timing, goodness-of-pronunciation (GOP) scores, and phonological-feature classifiers.
False acceptance matters clinically: a fluent transcript can still hide a wrong articulation.
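The classic GOP score compares the log posterior of the canonical phone against the best competing phone. A minimal sketch, assuming frame-averaged phone posteriors from some acoustic model; the phone set and values in the example are illustrative only.

```python
import math

def gop(posteriors, canonical):
    """GOP = log P(canonical | O) - max_q log P(q | O).

    Near 0 means the canonical phone is also the acoustically most
    likely one; strongly negative suggests a substitution.
    """
    best = max(posteriors.values())
    return math.log(posteriors[canonical]) - math.log(best)
```

Thresholds would be calibrated per phone against human ratings, since the same GOP value can mean different things for a fricative and a vowel.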
Prosody, fluency, and speech pattern issues
Best fit: Azure for out-of-the-box fluency/prosody in supported settings; custom acoustic feature channels elsewhere.
Measure: F0, intensity, pause structure, speech rate, stress, rhythm, disfluency events, and longitudinal change.
WER is often the wrong primary metric for stuttering, Parkinsonian speech, or apraxia-like timing.
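Pause structure is one of the features WER never sees. A sketch of extracting it from VAD output; the `(start_s, end_s)` segment tuples and the minimum-pause threshold are assumptions, not a fixed spec.

```python
def pause_features(speech_segments, min_pause=0.25):
    """speech_segments: time-sorted list of (start_s, end_s) speech intervals."""
    pauses = []
    for (s0, e0), (s1, _) in zip(speech_segments, speech_segments[1:]):
        gap = s1 - e0
        if gap >= min_pause:  # ignore micro-gaps below the pause threshold
            pauses.append(gap)
    total_speech = sum(e - s for s, e in speech_segments)
    return {
        "pause_count": len(pauses),
        "mean_pause_s": sum(pauses) / len(pauses) if pauses else 0.0,
        "speech_time_s": total_speech,
    }
```

Tracked longitudinally, these counts and durations surface exactly the rhythm and timing changes that transcripts hide.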
Dysarthria and aphasia
Best fit: Fine-tuned Whisper, WavLM, XLS-R, or Canary pipelines with speaker-independent clinical evaluation.
Measure: Transcript recovery, intelligibility, severity scores, linguistic feature preservation, and anomaly flags.
Small datasets make leakage, prompt overlap, and recording-condition confounds a serious risk.
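The leakage risk makes the split logic worth writing explicitly. A minimal sketch of a speaker-independent split, assuming utterance records carry a `"speaker"` key; on corpora the size of UA-Speech or TORGO, a random utterance-level split lets models memorize speakers instead of learning the disorder.

```python
def split_by_speaker(records, test_speakers):
    """Partition records so held-out speakers never leak into training."""
    test_speakers = set(test_speakers)
    train = [r for r in records if r["speaker"] not in test_speakers]
    test = [r for r in records if r["speaker"] in test_speakers]
    return train, test
```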
Anomaly detection
Best fit: Self-supervised embeddings plus calibrated one-class or supervised heads, with clinician review.
Measure: Outlier pronunciation, sudden decline, atypical pause patterns, code-switching breakdowns, or device-quality failures.
Anomaly is not diagnosis; the system must separate capture artifacts from clinically meaningful change.
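A one-class head over self-supervised embeddings can be as simple as a distance from a reference pool. This is a dependency-free sketch with a diagonal-covariance z-distance; a real head would use a calibrated model, separate capture-artifact checks, and clinician review, as above. All names and shapes here are illustrative.

```python
def fit_reference(embeddings):
    """Per-dimension mean and variance of a reference (typical-speech) pool."""
    dims = len(embeddings[0])
    n = len(embeddings)
    mean = [sum(e[d] for e in embeddings) / n for d in range(dims)]
    var = [sum((e[d] - mean[d]) ** 2 for e in embeddings) / n or 1e-8
           for d in range(dims)]  # floor zero variance to avoid division by 0
    return mean, var

def anomaly_score(embedding, mean, var):
    """Mean squared z-score across dimensions; larger means more atypical."""
    return sum((x - m) ** 2 / v for x, m, v in zip(embedding, mean, var)) / len(mean)
```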
Recommended stacks
Current implementation paths
The right stack depends on whether the priority is research evidence, production latency, language coverage, or out-of-the-box pronunciation feedback.
Research-grade private stack: Whisper large-v3 or turbo, WhisperX or Montreal Forced Aligner, then WavLM or XLS-R embeddings for clinical heads.
On-prem production stack: Canary or Parakeet with NeMo Forced Aligner, plus local prosody and disorder-scoring models.
Scripted pronunciation stack: Azure Pronunciation Assessment, especially for reference-text articulation drills and home practice.
Cloud transcript front end: gpt-4o-transcribe, Chirp 3, Nova-3, AssemblyAI, or Amazon Transcribe feeding a separate feature-analysis layer.
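Whichever stack is chosen, the layering is the same: gate on quality, transcribe, align, then analyze. A sketch of that composition with pluggable stages; every stage name here is a placeholder for whichever component a deployment wires in (e.g. Whisper for `transcribe`, WhisperX for `align`), not a real API.

```python
def run_pipeline(audio, quality_gate, transcribe, align, analyze):
    """Compose the layered stack; each stage is an injected callable."""
    qc = quality_gate(audio)
    if not qc["ok"]:
        # Reject before any clinical interpretation is attempted.
        return {"status": "rejected", "qc": qc}
    transcript = transcribe(audio)
    alignment = align(audio, transcript)
    return {"status": "ok", "qc": qc,
            "transcript": transcript,
            "features": analyze(alignment)}
```

Keeping the stages as injected callables is what lets the same clinical analysis heads sit behind Whisper, Canary, or a commercial STT front end.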
Data and metrics
UA-Speech / TORGO
Dysarthric English speech
High value for accessibility research, but small and sensitive to speaker, prompt, and recording-condition splits.
AphasiaBank / SONIVA
Aphasia and stroke-related language impairment
Useful for linguistic feature preservation and aphasia-oriented ASR, with access and annotation constraints.
FluencyBank
Stuttering and fluency
Important for event-level disfluency work where intended-speech recovery matters more than generic WER.
LibriSpeech, Common Voice, MLS, FLEURS
Healthy-speech baselines
Useful for pretraining and stabilization, but insufficient for clinical validation by themselves.
Evaluation should report WER/CER/PER only as a baseline. Therapy work also needs F1, AUROC, UAR, false acceptance, false rejection, diagnostic error, boundary accuracy, severity correlation, and longitudinal stability.
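False acceptance and false rejection deserve to be computed separately rather than folded into one accuracy number. A sketch for a mispronunciation detector, under the labeling convention (an assumption here) that `1` means "mispronounced" in the ground truth and "flagged" in the prediction.

```python
def fa_fr_rates(y_true, y_pred):
    """FA = mispronunciations the system accepted as correct;
    FR = correct productions the system wrongly rejected."""
    fa = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fr = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_bad = sum(y_true) or 1     # avoid division by zero on degenerate sets
    n_good = (len(y_true) - sum(y_true)) or 1
    return fa / n_bad, fr / n_good
```

The two rates trade off differently in therapy: false acceptance hides errors a clinician should hear, while false rejection erodes learner trust.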
Project plan
Research deliverables
This project turns the landscape review into a reproducible evaluation program for feature-level speech AI.
1. Model registry and evidence map
Maintain a living table of open and commercial speech models, supported languages, alignment surfaces, privacy posture, and therapy-relevant outputs.
2. Feature benchmark
Compare transcript, phone alignment, prosody, pause, and embedding features against human-rated pronunciation and clinical speech tasks.
3. Therapy protocol prototype
Design reference-text articulation drills, reading tasks, spontaneous-speech tasks, and longitudinal monitoring flows for clinician review.
4. Governance and deployment review
Assess GDPR, consent, data retention, model transparency, bias, medical-device boundaries, and human-in-the-loop requirements.
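The feature benchmark's severity-correlation comparison can be made auditable by writing the statistic out in plain Python. A sketch of Spearman rank correlation between model scores and human severity ratings; ties are not corrected here, which a production benchmark would handle (or delegate to a statistics library).

```python
def _ranks(xs):
    """Rank positions of each value (0 = smallest); no tie handling."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for rank, i in enumerate(order):
        ranks[i] = float(rank)
    return ranks

def spearman(model_scores, human_ratings):
    """Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(model_scores), _ranks(human_ratings)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```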
Safety
Clinical guardrails
The project treats speech AI as decision support, not autonomous diagnosis. A useful therapy system must be calibrated, reviewable, and honest about uncertainty.
- Separate capture-quality failure from speech-pattern anomalies.
- Use speaker-independent and site-independent splits whenever possible.
- Report performance by severity, age, sex, accent, language, and device when data allow.
- Keep clinician review in the loop for health claims and treatment changes.
- Design consent, retention, and audit trails before collecting therapy audio.
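The subgroup-reporting guardrail above amounts to computing every metric per group rather than pooled. A minimal sketch, assuming utterance records are dicts carrying grouping keys such as severity or device; the record shape and key names are illustrative.

```python
def metric_by_group(records, group_key, metric):
    """records: list of dicts; metric: fn(list_of_records) -> float.

    Returns {group_value: metric_over_that_group}, so a model that only
    works on mild cases cannot hide behind a pooled average.
    """
    groups = {}
    for r in records:
        groups.setdefault(r[group_key], []).append(r)
    return {g: metric(rs) for g, rs in groups.items()}
```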
Primary sources
We are looking for pilots, datasets, and evaluation partnerships.
The most useful next step is a narrow benchmark: one therapy task, a clear reference protocol, human ratings, and a model stack that can expose the evidence behind its score.
Get in touch