Spike — Audio-based phonological error detection (feasibility)

Ticket: PHON-45 (Task; labels: spike, audio, research; parent: PHON-44 Audio)
Date: 2026-04-24
Status: Spec for review
Timebox: ~1 week of focused fun; no sprint commitment

Goal

Determine whether an open-weight, small-model audio pipeline can detect phoneme-level pronunciation errors in English at a quality bar sufficient to build a clinical/AR product on top of. Produce a go/no-go recommendation for a full audio workstream.

The long-term product vision is a full diagnostic loop: patient speaks → system detects phonological patterns → recommendations + graduated curricula (via the existing governed-generation constraint engine) → practice content → listener-direction audio for comprehension checks → reassessment. Everything downstream of detection is either already solved (curriculum generation) or low-risk (TTS). Detection is the hinge.

Scope

In

  • English, constrained-input tasks (word and sentence reading with known targets)
  • L2 / accent-reduction population as the prototype target (HIPAA-friendly, data-rich)
  • Detection only

Out

  • Pediatric clinical data — the long-term target, but deferred to a follow-up spike so this spike isn't blocked on data access and consent
  • Freeform / open-vocabulary ASR — constrained tasks give us target transcripts for free
  • TTS — graceful fallback if detection fails; separate spike
  • Any production code, endpoint, package, or frontend integration
  • Fine-tuning — unless an off-the-shelf model is tantalizingly close to the quality bar, in which case note as follow-up and do not execute in the spike

Research questions

The spike must answer each of the following. Each answer lives in findings.md.

  1. Datasets. What's available, with what licensing, for L2/AR English pronunciation? Primary candidates: Speechocean762 (flagship — phoneme-level GOP ground truth), L2-ARCTIC, CommonVoice (with demographic filters), Speech Accent Archive. Adjacent disordered-speech corpora worth knowing about: TORGO, UA-Speech.
  2. Models. What small open-weight models produce phoneme-level output today? Candidates: wav2vec2-phoneme family, Allosaurus, WhisperX + forced alignment, Montreal Forced Aligner, GOP-based pipelines, phoneme-recognition fine-tunes on Hugging Face.
  3. Quality. On Speechocean762, what precision / recall can a candidate pipeline achieve on phoneme-level error detection? How does that compare to published SOTA?
  4. Clinical translation. Which SODA categories (Substitution / Omission / Distortion / Addition) and phonological processes are detectable vs. not? Where's the ceiling?
  5. Fit with PhonoLex. Does the detected-error representation map cleanly onto the existing IPA / feature-vector stack? Can detection output feed the governed-generation constraint engine to auto-produce targeted curricula?
  6. Analysis approach. Once we have a hypothesized phoneme sequence, is the right path rule-based, classification, or hybrid?
     • Rule-based — align hypothesized vs. target phoneme sequences, map differences onto phonological processes (fronting, stopping, cluster reduction, final consonant deletion, etc.) using the existing feature-vector distance machinery. No labeled training data needed. Clinically interpretable. Natural fit with PhonoLex's stack.
     • Classification — train a model to label errors directly from audio + target. Higher ceiling for distortions that don't reduce to phoneme substitutions (lisp quality, nasality, prosody). Needs labeled data that may not exist → would require manufacturing (forced-TTS with phoneme substitutions, data augmentation).
     • Hybrid (expected outcome) — rule-based for substitutions/omissions/additions; classification for distortion-type errors where phoneme labels don't capture the acoustics. If classification is implicated anywhere, the spike should also propose how to get or manufacture the data.
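The rule-based path above can be sketched in a few lines: align the hypothesized sequence against the target and read the SODA category off each alignment operation. This is a minimal illustration using Python's standard-library `SequenceMatcher`; the phoneme symbols and the example word are made up, and a real script would use the existing feature-vector distance machinery for the process mapping.

```python
"""Sketch of the rule-based analysis path: align hypothesized vs. target
phoneme sequences and map alignment operations onto SODA categories.
Symbols and examples are illustrative, not from any corpus."""
from difflib import SequenceMatcher

def classify_errors(target, hypothesis):
    """Return (category, target_phones, hyp_phones) tuples.

    Substitution / Omission / Addition fall out of the alignment directly;
    Distortion cannot be recovered from phoneme labels alone, which is why
    the hybrid outcome reserves a classification path for it.
    """
    errors = []
    sm = SequenceMatcher(a=target, b=hypothesis, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            errors.append(("substitution", target[i1:i2], hypothesis[j1:j2]))
        elif op == "delete":
            errors.append(("omission", target[i1:i2], []))
        elif op == "insert":
            errors.append(("addition", [], hypothesis[j1:j2]))
    return errors

# "spoon" /s p u n/ read as /p u n/ — an omission the process layer
# would label cluster reduction
print(classify_errors(["s", "p", "u", "n"], ["p", "u", "n"]))
```

A second pass would then map each tuple onto a phonological process (e.g. a substitution whose feature-vector difference is velar→alveolar is fronting); that mapping is exactly where PhonoLex's existing machinery plugs in.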

Deliverables

All under research/audio-detection-spike/:

  • scripts/ — focused one-off scripts: dataset loaders, model evaluation, metric computation. One script per question where possible.
  • LAB.md — running lab notebook, dated entries, numbers, dead ends, decisions as they happen.
  • findings.md — final synthesized report: landscape, prototype results, quality numbers, clinical-translation analysis, analysis-approach recommendation, go/no-go with rationale. If go: rough architecture sketch and proposed follow-up tickets.

Deliberately chosen over a Jupyter notebook: scripts + markdown are git-friendly, rerunnable, and easier to maintain. No ipynb in this spike.

Success criteria

The spike is done when:

  • At least 3 candidate models have been evaluated end-to-end on Speechocean762
  • Phoneme-level detection numbers are reported against ground truth (precision, recall, per-category breakdowns)
  • The clinical-translation section honestly addresses which error categories the best pipeline can and cannot detect
  • The analysis-approach question (rule-based / classification / hybrid) has an explicit recommendation with reasoning
  • findings.md ends with an unambiguous go / no-go and a one-paragraph rationale a skeptical reader can act on
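The phoneme-level numbers the criteria call for reduce to a binary decision per target phoneme (mispronounced or not) scored against ground truth. A minimal sketch of that scoring, with made-up flags rather than real Speechocean762 labels:

```python
"""Sketch of phoneme-level error-detection scoring: each target phoneme
gets a binary flag (1 = mispronounced), and we score predicted flags
against gold flags. The example labels are invented."""

def error_detection_prf(gold, pred):
    """gold, pred: parallel lists of 0/1 flags, one per target phoneme.

    Returns (precision, recall) of error detection, with 0.0 when a
    denominator is empty. Per-category breakdowns would filter the pairs
    by SODA category before computing the same quantities.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# toy run: 2 true errors, detector flags 3 phonemes and catches both
print(error_detection_prf([0, 1, 0, 1, 0], [1, 1, 0, 1, 0]))
```

Published Speechocean762 work typically reports these alongside F1, so the comparison to SOTA in findings.md should use the same definitions.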

Out-of-scope guardrails

  • No new package, no new endpoint, no frontend work
  • No fine-tuning unless an off-the-shelf model is on the edge of the bar — note as follow-up, don't execute
  • No pediatric data even if it appears accessible — defer to a follow-up spike
  • No attempt to integrate with governed generation in this spike — the fit question is answered in prose, not code

Follow-ups (not part of this spike, but likely implied if go)

  • Pediatric-transfer spike — whether L2-trained detection generalizes to child speech, and what pediatric data we can realistically access
  • TTS feasibility spike — phoneme-controllable open-weight TTS for contrastive playback
  • Data manufacturing spike — if classification is needed anywhere, can we synthesize error-labeled audio via forced TTS with phoneme substitutions
  • Architecture design — if go, full audio workstream architecture (capture, processing, integration with existing stack, UI)