Spike — Audio-based phonological error detection (feasibility)

Ticket: PHON-45 (Task; labels: spike, audio, research; parent: PHON-44 Audio)
Date: 2026-04-24
Status: Spec for review
Timebox: ~1 week of focused fun; no sprint commitment

Goal

Determine whether an open-weight, small-model audio pipeline can detect phoneme-level pronunciation errors in English at a quality bar sufficient to build a clinical/AR product on top of. Produce a go/no-go recommendation for a full audio workstream.

The long-term product vision is a full diagnostic loop: patient speaks → system detects phonological patterns → recommendations + graduated curricula (via the existing governed-generation constraint engine) → practice content → listener-direction audio for comprehension checks → reassessment. Everything downstream of detection is either already solved (curriculum generation) or low-risk (TTS). Detection is the hinge.

Scope

In

  • English, constrained-input tasks (word and sentence reading with known targets)
  • L2 / accent-reduction population as the prototype target (HIPAA-friendly, data-rich)
  • Detection only

Out

  • Pediatric clinical data — the long-term target, but deferred to a follow-up spike so this spike isn't blocked on data access and consent
  • Freeform / open-vocabulary ASR — constrained tasks give us target transcripts for free
  • TTS — graceful fallback if detection fails; separate spike
  • Any production code, endpoint, package, or frontend integration
  • Fine-tuning — unless an off-the-shelf model is tantalizingly close to the quality bar, in which case note as follow-up and do not execute in the spike

Research questions

The spike must answer each of the following. Each answer lives in findings.md.

  1. Datasets. What's available, with what licensing, for L2/AR English pronunciation? Primary candidates: Speechocean762 (flagship — phoneme-level GOP ground truth), L2-ARCTIC, CommonVoice (with demographic filters), Speech Accent Archive. Adjacent disordered-speech corpora worth knowing about: TORGO, UA-Speech.
  2. Models. What small open-weight models produce phoneme-level output today? Candidates: wav2vec2-phoneme family, Allosaurus, WhisperX + forced alignment, Montreal Forced Aligner, GOP-based pipelines, phoneme-recognition fine-tunes on Hugging Face.
  3. Quality. On Speechocean762, what precision / recall can a candidate pipeline achieve on phoneme-level error detection? How does that compare to published SOTA?
  4. Clinical translation. Which SODA categories (Substitution / Omission / Distortion / Addition) and phonological processes are detectable vs. not? Where's the ceiling?
  5. Fit with PhonoLex. Does the detected-error representation map cleanly onto the existing IPA / feature-vector stack? Can detection output feed the governed-generation constraint engine to auto-produce targeted curricula?
  6. Analysis approach. Once we have a hypothesized phoneme sequence, is the right path rule-based, classification, or hybrid?
     • Rule-based — align hypothesized vs. target phoneme sequences, map differences onto phonological processes (fronting, stopping, cluster reduction, final consonant deletion, etc.) using the existing feature-vector distance machinery. No labeled training data needed. Clinically interpretable. Natural fit with PhonoLex's stack.
     • Classification — train a model to label errors directly from audio + target. Higher ceiling for distortions that don't reduce to phoneme substitutions (lisp quality, nasality, prosody). Needs labeled data that may not exist → would require manufacturing (forced-TTS with phoneme substitutions, data augmentation).
     • Hybrid (expected outcome) — rule-based for substitutions/omissions/additions; classification for distortion-type errors where phoneme labels don't capture the acoustics. If classification is implicated anywhere, the spike should also propose how to get or manufacture the data.
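The rule-based path above can be sketched in a few lines: align the hypothesized sequence against the target and read the SODA category off each alignment operation. This is a minimal illustration using Python's standard-library `SequenceMatcher`; the phoneme symbols and the example word are made up, and a real script would use the existing feature-vector distance machinery for the process mapping.

```python
"""Sketch of the rule-based analysis path: align hypothesized vs. target
phoneme sequences and map alignment operations onto SODA categories.
Symbols and examples are illustrative, not from any corpus."""
from difflib import SequenceMatcher

def classify_errors(target, hypothesis):
    """Return (category, target_phones, hyp_phones) tuples.

    Substitution / Omission / Addition fall out of the alignment directly;
    Distortion cannot be recovered from phoneme labels alone, which is why
    the hybrid outcome reserves a classification path for it.
    """
    errors = []
    sm = SequenceMatcher(a=target, b=hypothesis, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            errors.append(("substitution", target[i1:i2], hypothesis[j1:j2]))
        elif op == "delete":
            errors.append(("omission", target[i1:i2], []))
        elif op == "insert":
            errors.append(("addition", [], hypothesis[j1:j2]))
    return errors

# "spoon" /s p u n/ read as /p u n/ — an omission the process layer
# would label cluster reduction
print(classify_errors(["s", "p", "u", "n"], ["p", "u", "n"]))
```

A second pass would then map each tuple onto a phonological process (e.g. a substitution whose feature-vector difference is velar→alveolar is fronting); that mapping is exactly where PhonoLex's existing machinery plugs in.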

Deliverables

All under research/audio-detection-spike/:

  • scripts/ — focused one-off scripts: dataset loaders, model evaluation, metric computation. One script per question where possible.
  • LAB.md — running lab notebook, dated entries, numbers, dead ends, decisions as they happen.
  • findings.md — final synthesized report: landscape, prototype results, quality numbers, clinical-translation analysis, analysis-approach recommendation, go/no-go with rationale. If go: rough architecture sketch and proposed follow-up tickets.

Deliberately chosen over a Jupyter notebook: scripts + markdown are git-friendly, rerunnable, and easier to maintain. No ipynb in this spike.

Success criteria

The spike is done when:

  • At least 3 candidate models have been evaluated end-to-end on Speechocean762
  • Phoneme-level detection numbers are reported against ground truth (precision, recall, per-category breakdowns)
  • The clinical-translation section honestly addresses which error categories the best pipeline can and cannot detect
  • The analysis-approach question (rule-based / classification / hybrid) has an explicit recommendation with reasoning
  • findings.md ends with an unambiguous go / no-go and a one-paragraph rationale a skeptical reader can act on
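The phoneme-level numbers the criteria call for reduce to a binary decision per target phoneme (mispronounced or not) scored against ground truth. A minimal sketch of that scoring, with made-up flags rather than real Speechocean762 labels:

```python
"""Sketch of phoneme-level error-detection scoring: each target phoneme
gets a binary flag (1 = mispronounced), and we score predicted flags
against gold flags. The example labels are invented."""

def error_detection_prf(gold, pred):
    """gold, pred: parallel lists of 0/1 flags, one per target phoneme.

    Returns (precision, recall) of error detection, with 0.0 when a
    denominator is empty. Per-category breakdowns would filter the pairs
    by SODA category before computing the same quantities.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# toy run: 2 true errors, detector flags 3 phonemes and catches both
print(error_detection_prf([0, 1, 0, 1, 0], [1, 1, 0, 1, 0]))
```

Published Speechocean762 work typically reports these alongside F1, so the comparison to SOTA in findings.md should use the same definitions.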

Out-of-scope guardrails

  • No new package, no new endpoint, no frontend work
  • No fine-tuning unless an off-the-shelf model is on the edge of the bar — note as follow-up, don't execute
  • No pediatric data even if it appears accessible — defer to a follow-up spike
  • No attempt to integrate with governed generation in this spike — the fit question is answered in prose, not code

Follow-ups (not part of this spike, but likely implied if go)

  • Pediatric-transfer spike — whether L2-trained detection generalizes to child speech, and what pediatric data we can realistically access
  • TTS feasibility spike — phoneme-controllable open-weight TTS for contrastive playback
  • Data manufacturing spike — if classification is needed anywhere, can we synthesize error-labeled audio via forced TTS with phoneme substitutions
  • Architecture design — if go, full audio workstream architecture (capture, processing, integration with existing stack, UI)