Spike — Audio-based phonological error detection (feasibility)¶
Ticket: PHON-45 (Task, labels: spike audio research; parent: PHON-44 Audio)
Date: 2026-04-24
Status: Spec for review
Timebox: ~1 week of focused fun; no sprint commitment
Goal¶
Determine whether an open-weight, small-model audio pipeline can detect phoneme-level pronunciation errors in English at a quality bar sufficient to build a clinical/AR product on top of. Produce a go/no-go recommendation for a full audio workstream.
The long-term product vision is a full diagnostic loop: patient speaks → system detects phonological patterns → recommendations + graduated curricula (via the existing governed-generation constraint engine) → practice content → listener-direction audio for comprehension checks → reassessment. Everything downstream of detection is either already solved (curriculum generation) or low-risk (TTS). Detection is the hinge.
Scope¶
In
- English, constrained-input tasks (word and sentence reading with known targets)
- L2 / accent-reduction population as the prototype target (HIPAA-friendly, data-rich)
- Detection only
Out
- Pediatric clinical data — the long-term target, but deferred to a follow-up spike so this spike isn't blocked on data access and consent
- Freeform / open-vocabulary ASR — constrained tasks give us target transcripts for free
- TTS — graceful fallback if detection fails; separate spike
- Any production code, endpoint, package, or frontend integration
- Fine-tuning — unless an off-the-shelf model is tantalizingly close to the quality bar, in which case note as follow-up and do not execute in the spike
Research questions¶
The spike must answer each of the following. Each answer lives in findings.md.
- Datasets. What's available, with what licensing, for L2/AR English pronunciation? Primary candidates: Speechocean762 (flagship — phoneme-level GOP ground truth), L2-ARCTIC, CommonVoice (with demographic filters), Speech Accent Archive. Adjacent disordered-speech corpora worth knowing about: TORGO, UA-Speech.
- Models. What small open-weight models produce phoneme-level output today? Candidates: wav2vec2-phoneme family, Allosaurus, WhisperX + forced alignment, Montreal Forced Aligner, GOP-based pipelines, phoneme-recognition fine-tunes on Hugging Face.
- Quality. On Speechocean762, what precision / recall can a candidate pipeline achieve on phoneme-level error detection? How does that compare to published SOTA?
- Clinical translation. Which SODA categories (Substitution / Omission / Distortion / Addition) and phonological processes are detectable vs. not? Where's the ceiling?
- Fit with PhonoLex. Does the detected-error representation map cleanly onto the existing IPA / feature-vector stack? Can detection output feed the governed-generation constraint engine to auto-produce targeted curricula?
- Analysis approach. Once we have a hypothesized phoneme sequence, is the right path rule-based, classification, or hybrid?
- Rule-based — align hypothesized vs. target phoneme sequences, map differences onto phonological processes (fronting, stopping, cluster reduction, final consonant deletion, etc.) using the existing feature-vector distance machinery. No labeled training data needed. Clinically interpretable. Natural fit with PhonoLex's stack.
- Classification — train a model to label errors directly from audio + target. Higher ceiling for distortions that don't reduce to phoneme substitutions (lisp quality, nasality, prosody). Needs labeled data that may not exist → would require manufacturing (forced-TTS with phoneme substitutions, data augmentation).
- Hybrid (expected outcome) — rule-based for substitutions/omissions/additions; classification for distortion-type errors where phoneme labels don't capture the acoustics. If classification is implicated anywhere, the spike should also propose how to get or manufacture the data.
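The rule-based path can be sketched concretely: align the target and hypothesized phoneme sequences with a standard edit-script alignment, then read the S/O/A categories straight off the alignment ops (distortions, as noted, don't reduce to label differences). A minimal sketch using the standard library; the phoneme sequences are invented for illustration and the phonological-process mapping is left to the existing feature-vector machinery:

```python
from difflib import SequenceMatcher

def classify_errors(target, hypothesis):
    """Align two phoneme sequences and label each difference with the
    SODA category it implies (Distortion is not recoverable from labels alone)."""
    errors = []
    sm = SequenceMatcher(a=target, b=hypothesis, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            errors.append(("substitution", target[i1:i2], hypothesis[j1:j2]))
        elif op == "delete":
            errors.append(("omission", target[i1:i2], []))
        elif op == "insert":
            errors.append(("addition", [], hypothesis[j1:j2]))
    return errors

# Hypothetical ARPAbet-style example: /s/ dropped (cluster reduction)
# and G -> D (a substitution the feature machinery could label further).
target     = ["S", "P", "AH", "G", "EH", "T", "IY"]
hypothesis = ["P", "AH", "D", "EH", "T", "IY"]
for category, expected, observed in classify_errors(target, hypothesis):
    print(category, expected, observed)
```

A real pipeline would likely want a phonetically weighted alignment (feature-vector distance as substitution cost) rather than `difflib`'s identity matching, but the output shape — error category plus the involved phonemes — is what would feed the phonological-process rules.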
Deliverables¶
All under research/audio-detection-spike/:
- scripts/ — focused one-off scripts: dataset loaders, model evaluation, metric computation. One script per question where possible.
- LAB.md — running lab notebook: dated entries, numbers, dead ends, decisions as they happen.
- findings.md — final synthesized report: landscape, prototype results, quality numbers, clinical-translation analysis, analysis-approach recommendation, go/no-go with rationale. If go: rough architecture sketch and proposed follow-up tickets.
Deliberately chosen over a Jupyter notebook: scripts + markdown are git-friendly, rerunnable, and easier to maintain. No ipynb in this spike.
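For concreteness, the layout might look like this (script names are illustrative, not prescribed):

```
research/audio-detection-spike/
├── scripts/
│   ├── 01_dataset_landscape.py   # loaders + licensing notes per corpus
│   ├── 02_model_eval.py          # run a candidate model on Speechocean762
│   └── 03_metrics.py             # precision / recall vs. ground truth
├── LAB.md                        # dated lab notebook
└── findings.md                   # synthesized report + go/no-go
```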
Success criteria¶
The spike is done when:
- At least 3 candidate models have been evaluated end-to-end on Speechocean762
- Phoneme-level detection numbers are reported against ground truth (precision, recall, per-category breakdowns)
- The clinical-translation section honestly addresses which error categories the best pipeline can and cannot detect
- The analysis-approach question (rule-based / classification / hybrid) has an explicit recommendation with reasoning
- findings.md ends with an unambiguous go / no-go and a one-paragraph rationale a skeptical reader can act on
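To make the detection numbers unambiguous in findings.md, it helps to fix the metric definition up front. A toy sketch: treat each phone as a binary "mispronounced?" decision and score detection as precision/recall over those decisions. (Speechocean762's phone-level accuracy scores would be thresholded into such booleans; the threshold choice is an assumption the report should state.)

```python
def detection_metrics(truth, predicted):
    """Phoneme-level error-detection precision/recall.

    truth/predicted: parallel lists of booleans, True = phone flagged
    as mispronounced. Positive class = "error present".
    """
    tp = sum(t and p for t, p in zip(truth, predicted))
    fp = sum((not t) and p for t, p in zip(truth, predicted))
    fn = sum(t and (not p) for t, p in zip(truth, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented example: 3 true errors, detector catches 2 and false-alarms once.
truth     = [False, True, False, True, True, False]
predicted = [False, True, True,  True, False, False]
print(detection_metrics(truth, predicted))  # (0.666..., 0.666...)
```

Per-category breakdowns would apply the same computation restricted to phones whose ground-truth error falls in each SODA category.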
Out-of-scope guardrails¶
- No new package, no new endpoint, no frontend work
- No fine-tuning unless an off-the-shelf model is on the edge of the bar — note as follow-up, don't execute
- No pediatric data even if it appears accessible — defer to a follow-up spike
- No attempt to integrate with governed generation in this spike — the fit question is answered in prose, not code
Follow-ups (not part of this spike, but likely implied if go)¶
- Pediatric-transfer spike — whether L2-trained detection generalizes to child speech, and what pediatric data we can realistically access
- TTS feasibility spike — phoneme-controllable open-weight TTS for contrastive playback
- Data manufacturing spike — if classification is needed anywhere, can we synthesize error-labeled audio via forced TTS with phoneme substitutions
- Architecture design — if go, full audio workstream architecture (capture, processing, integration with existing stack, UI)