v6 Audio — Design

Date: 2026-06-06
Status: Design (ready for implementation planning)
Supersedes: the "five targeted models" per-population architecture on the model axis (memory/project_audio_targeted_models.md). The "one model per need, independently retire-able" principle still governs the scoring/interpretation layers.


1. What this is

The audio pillar of v6: one unified faithful phoneme transcriber + a generalized error-scoring layer, delivered through a single Audio tab. It turns recorded/uploaded speech into a positioned, typed, population-contextualized error analysis an SLP, language teacher, or learner can act on.

Two decisions frame everything:

  • The model is population-agnostic. There is one transcriber, not a child/adult/L2 family. All population-specificity (developmental vs. L1-transfer vs. neurogenic) lives on the scoring side, where it is cheap to add and tune.
  • Real clinical data is deferred, not faked. The transcriber trains only on real audio + produced-phoneme labels; no synthetic/generated audio. Adult-disordered data acquisition is gated behind consent walls and is the subject of a separate commercial-IRB + Fall SBIR track, not this MVP.

This is decision-support, never diagnosis. Every surfaced signal is descriptive, faithful to source, and clinician-explainable.


2. Architecture — four layers

audio (record / upload, any length)
   │
   ▼
[L1] Unified faithful transcriber  ──►  broad-40 phonemes
   │                                    + per-frame timing
   │                                    + per-frame confidence / logits
   ├──────────────┬───────────────────┬───────────────────────────┐
   ▼              ▼                   ▼                           ▼
[L2] Symbolic   [L3] Acoustic       [L4] Temporal/prosodic      (target word /
 scoring         residual            block                        text + profile)
 cos_dist +      (integrated)        rate · artic-rate ·
 edit-align +    distortion +        pauses · F0
 error-typing +  covert-contrast
 profile-prior   cross-check
   │              │                   │
   └──────────────┴───────────────────┘
                  ▼
        Unified Audio tab: positioned error overlay (hover = score/delta/type/
        expectedness) + predicted-error-type panel + rate readout + scope banner

The transcriber's job is faithful transcription only. Everything interpretive is downstream and profile-conditioned.


3. Layer 1 — Unified faithful transcriber

Property: faithful = length-agnostic = position-independent. A CTC model scores every frame and emits either a phoneme label or the blank symbol for it; collapsing repeated labels and dropping blanks yields the phoneme sequence. It must transcribe any part of any waveform regardless of length or what population produced it. "It doesn't matter where it comes from."
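
A minimal sketch of the greedy CTC decode this property relies on, and of how per-frame timing and confidence fall out of it. The blank index, the 20 ms frame step, and the names `Segment` and `greedy_ctc_decode` are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass
from typing import List, Sequence

BLANK = 0  # assumption: index 0 is the CTC blank label


@dataclass
class Segment:
    phoneme: str
    start_s: float     # onset of the first frame in the run
    end_s: float       # offset of the last frame in the run
    confidence: float  # mean probability of the winning label over the run


def greedy_ctc_decode(
    frame_probs: Sequence[Sequence[float]],
    vocab: Sequence[str],
    frame_s: float = 0.02,  # assumption: 20 ms per frame
) -> List[Segment]:
    """Argmax each frame, merge runs of the same label, drop blank runs."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    segments: List[Segment] = []
    t, n = 0, len(best)
    while t < n:
        label, end = best[t], t
        while end < n and best[end] == label:
            end += 1
        if label != BLANK:
            conf = sum(frame_probs[i][label] for i in range(t, end)) / (end - t)
            segments.append(Segment(vocab[label], t * frame_s, end * frame_s, conf))
        t = end
    return segments
```

A blank between two identical labels keeps them as two separate phonemes, which is why the blank is more than a silence marker.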

The empirical basis (research/2026-06-06-audio-triage/FINDINGS.md, n=915 + n=5,247):

  • Off-the-shelf wav2vec2-lv-60-espeak is already length-agnostic — len_ratio ≈ 1.0 from single words to 10-word utterances. Length-generalization is the base model's native behavior.
  • ft-child (PHON-139, fine-tuned on single-word GFTA/PERCEPT) collapses with length: ratio 1.0 → 0.19, emitting a fixed ~one-word count regardless of input. The collapse is real on natural sentences, not a splicing artifact.
  • ft-l2 (PHON-142, fine-tuned on L2-ARCTIC sentences) is length-agnostic and the most accurate model on connected speech — even on clean LibriSpeech outside its L2 domain (ratio 1.0, PER 0.10, beating off-the-shelf).
  • The single determinant of whether a fine-tune keeps length-generalization is the length diversity of its training data — not population. ft-l2 is the live existence proof that half the unified recipe already works.
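
The two quantities cited above can be sketched as follows. This assumes len_ratio is hypothesis length over reference length and PER is Levenshtein distance over reference length; the findings file may define them differently, and the function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # reference phoneme deleted
                cur[j - 1] + 1,          # extra phoneme inserted
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = cur
    return prev[-1]


def len_ratio(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: emitted phoneme count / reference phoneme count."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return len(hyp) / len(ref)


def per(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: phoneme error rate = edit distance / reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under these definitions, a model that emits a roughly fixed one-word count shows a len_ratio that falls as the reference grows, which is the collapse signature described for ft-child.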

Recipe: fine-tune the wav2vec2-espeak base (faithful hard-CTC on produced labels) on a length-AND-population-diverse union of real audio: connected clean (LibriSpeech CC-BY), accented (Common Voice CC0), child/disordered single words (PhonBank), dysarthria (TORGO), L2 sentences (L2-ARCTIC). The deviant-labeled data keeps it faithful (it won't collapse to canonical forms); the connected data keeps it length-agnostic. Two exclusions hold. The off-the-shelf model is never used as-is (NO-GO on disordered speech). Synthetic audio is never used: a string generator can supply acoustic training data only through TTS, which is the domain-gap shortcut the IRB track exists to avoid.

Outputs: broad-40 phonemes, per-frame timestamps (→ Layer 4), and per-frame confidence/logits (→ Layer 3).

Honest limits: broad-phoneme only; distortions and covert contrast invisible by design at this layer (Layer 3 handles them). The whole scorer's validity is bounded by transcript fidelity.


4. Layer 2 — Symbolic scoring (the generalized-prior interpreter)

The core insight, generalized from Flege's Speech Learning Model beyond L2: a deviation is interpreted against what's expected for the speaker's profile. deviation × profile-prior → expected-vs-anomalous, plus the symbolic error type.

Metric (have it): cos_dist over learned 26-d articulatory feature vectors (packages/features/outputs/vectors.csv, PHON-126 validated, Mann-Whitney p=1.9e-4), with full edit-distance alignment — substitutions, deletions, insertions — each at its true position.
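
A minimal sketch of feature-weighted alignment that keeps every edit at its true position. It assumes substitution cost equals cos_dist and a flat insertion/deletion cost of 1.0. The 3-d `TOY_VECTORS` are hypothetical stand-ins for the 26-d vectors in vectors.csv, and the op field names are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vec = Tuple[float, ...]

# Hypothetical toy vectors; the real ones are 26-d and loaded from vectors.csv.
TOY_VECTORS: Dict[str, Vec] = {
    "k": (-1, -1, -1), "t": (-1, -1, 1), "s": (-1, 1, 1), "æ": (1, 1, -1),
}


def cos_dist(a: Vec, b: Vec) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return max(0.0, 1.0 - dot / (na * nb))  # clamp tiny negative rounding error


def align(canonical: Sequence[str], produced: Sequence[str],
          vectors: Dict[str, Vec], indel: float = 1.0) -> List[dict]:
    n, m = len(canonical), len(produced)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + indel
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cos_dist(vectors[canonical[i - 1]], vectors[produced[j - 1]])
            cost[i][j] = min(cost[i - 1][j - 1] + sub,
                             cost[i - 1][j] + indel,
                             cost[i][j - 1] + indel)
    ops: List[dict] = []
    i, j = n, m
    while i > 0 or j > 0:  # backtrace from the end, preferring the diagonal
        if i > 0 and j > 0:
            sub = cos_dist(vectors[canonical[i - 1]], vectors[produced[j - 1]])
            if math.isclose(cost[i][j], cost[i - 1][j - 1] + sub, abs_tol=1e-9):
                kind = "match" if canonical[i - 1] == produced[j - 1] else "sub"
                ops.append({"op": kind, "canonical_index": i - 1,
                            "canonical": canonical[i - 1],
                            "produced": produced[j - 1], "cos_dist": sub})
                i, j = i - 1, j - 1
                continue
        if i > 0 and (j == 0 or math.isclose(cost[i][j], cost[i - 1][j] + indel,
                                             abs_tol=1e-9)):
            ops.append({"op": "del", "canonical_index": i - 1,
                        "canonical": canonical[i - 1]})
            i -= 1
        else:
            # -1 means "before the first canonical phoneme"
            ops.append({"op": "ins", "after_canonical_index": i - 1,
                        "produced": produced[j - 1]})
            j -= 1
    ops.reverse()
    return ops
```

Because substitutions are priced by feature distance, a near-miss such as /k/ → /t/ aligns as one graded substitution instead of a deletion plus an insertion.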

Error typing (PHON-144, to build): feature-delta → clinical process labels (devoicing = [voice] flip, stopping = [continuant] change, fronting/backing = place shift, gliding, cluster reduction = onset deletion, epenthesis = insertion, …). Rule-based, no ML.
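
Since PHON-144 is still to be built, the following is only a sketch of what rule-based typing could look like. The feature names, the place ordering, the `FEATURES` table, and the `in_onset_cluster` flag are all hypothetical; the real rules would read deltas from the learned feature vectors.

```python
from typing import List, Optional

# Front-to-back ordering used to decide fronting vs. backing (assumption).
PLACE_ORDER = ["labial", "dental", "alveolar", "postalveolar",
               "palatal", "velar", "glottal"]

# Hypothetical hand-written features for a handful of phonemes.
FEATURES = {
    "t": {"voice": 0, "continuant": 0, "place": "alveolar", "cls": "obstruent"},
    "k": {"voice": 0, "continuant": 0, "place": "velar", "cls": "obstruent"},
    "s": {"voice": 0, "continuant": 1, "place": "alveolar", "cls": "obstruent"},
    "z": {"voice": 1, "continuant": 1, "place": "alveolar", "cls": "obstruent"},
    "r": {"voice": 1, "continuant": 1, "place": "alveolar", "cls": "liquid"},
    "w": {"voice": 1, "continuant": 1, "place": "labial", "cls": "glide"},
}


def type_error(op: str, canonical: Optional[str] = None,
               produced: Optional[str] = None,
               in_onset_cluster: bool = False) -> List[str]:
    """Map one aligned edit to clinical process labels using fixed rules."""
    if op == "ins":
        return ["epenthesis"]
    if op == "del":
        return ["cluster reduction"] if in_onset_cluster else ["deletion"]
    if canonical is None or produced is None:
        raise ValueError("substitutions need both phonemes")
    if canonical == produced:
        return []
    c, p = FEATURES[canonical], FEATURES[produced]
    if c["cls"] == "liquid" and p["cls"] == "glide":
        return ["gliding"]  # reported on its own, without a place label
    labels: List[str] = []
    if c["voice"] == 1 and p["voice"] == 0:
        labels.append("devoicing")
    if c["continuant"] == 1 and p["continuant"] == 0:
        labels.append("stopping")
    shift = PLACE_ORDER.index(p["place"]) - PLACE_ORDER.index(c["place"])
    if shift < 0:
        labels.append("fronting")
    elif shift > 0:
        labels.append("backing")
    return labels or ["unclassified substitution"]
```

One substitution can carry several labels (for example a voiced fricative produced as a voiceless stop), so the return type is a list, not a single label.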

The prior is structurally different per population (this is the key nuance — not one channel, three shapes):

| population | prior shape | source |
| --- | --- | --- |
| L2 / accent | confusion channel P(produced \| canonical, L1, position) (SLM-r + PAM) | have it: l1Prior.json (PHON-142) |
| child | process_prior(type, age, position, cluster): a two-axis tree, developmental-vs-atypical (atypical = backing, initial-CD, glottal replacement → flag at any age) × within-vs-past age-of-suppression | McLeod & Crowe 2018 (consonant acquisition); process-suppression ages = clinical consensus (soft) |
| adult | not one prior: aphasia = feature/neighborhood/frequency channel; AOS = variance/instability model, not a fixed channel; dysarthria → Layer 3 | partial; AphasiaBank gated |

Cross-cutting invariants (all three populations):

  1. The reference variety is explicit, and deviation-from-it ≠ error. The canonical CMU/SAE string is one dialect, not truth. Largest false-positive risk: scoring a non-rhotic L2 speaker's coda-/r/ as "deletion," or flagging systematic AAE features (cluster reduction, interdental fronting) as child disorder. Surface a dialect_plausible flag; dialect/accent is just another "expected" channel.
  2. Severity × impact-weight; expose error type, never a bare scalar. L2 = functional-load of the neutralized contrast (we have minimal-pair infrastructure); child = omission-weighting; adult = distortion-vs-substitution. A flat "accuracy %" is clinically misleading.
  3. The anomaly flag = low-prior + high-distance + profile-shared segment + inconsistent → SCREENING language only, never "disorder." Hard ethical line: don't pathologize accent, don't over-identify AAE, never make an adult-neuro diagnostic claim.
  4. Consistency/variability is first-class (requires repeated tokens of a target): child inconsistent disorder (Dodd ≥40%), AOS instability. Caveat: the classic "AOS = consistent errors" discriminator has flipped in the recent literature (Haley 2017) — treat consistency as suggestive, not load-bearing.
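
Invariants 3 and 4 can be sketched as a conjunction of signals that only ever yields screening language. The thresholds `prior_floor` and `dist_ceiling`, the output strings, and the way `profile_shared` is supplied are placeholders, not tuned or specified values; the 40% cutoff is the Dodd figure quoted above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def inconsistency_rate(productions: Dict[str, List[Tuple[str, ...]]]) -> float:
    """Share of targets, among those with 2+ tokens, produced in more than one way."""
    eligible = [toks for toks in productions.values() if len(toks) >= 2]
    if not eligible:
        raise ValueError("consistency needs repeated tokens of at least one target")
    variable = sum(1 for toks in eligible if len(set(toks)) > 1)
    return variable / len(eligible)


@dataclass
class Deviation:
    prior: float             # probability of this deviation under the profile prior
    cos_dist: float          # feature distance of the substitution
    profile_shared: bool     # supplied by the caller; derivation not specified here
    inconsistent: bool       # e.g. inconsistency_rate(...) >= 0.40
    dialect_plausible: bool = False


def screening_note(d: Deviation, prior_floor: float = 0.05,
                   dist_ceiling: float = 0.5) -> str:
    """Return descriptive screening wording; never a diagnostic label."""
    if d.dialect_plausible:
        return "consistent with a dialect or accent difference"
    if d.prior >= prior_floor:
        return "expected for profile"
    if d.cos_dist >= dist_ceiling and d.profile_shared and d.inconsistent:
        return "flag for clinician review (screening only)"
    return "unexpected, below screening threshold"
```

The dialect check runs first so that a dialect-plausible deviation can never reach the flag branch.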

Aggregate: population-appropriate roll-up — intelligibility/comprehensibility (L2) ↔ PCC / PCC-R / PPC + process density (child) ↔ error-weighted PCC (adult).
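
As one concrete roll-up, PCC (Shriberg & Kwiatkowski 1982) is the percentage of target consonants produced correctly. The sketch below computes a symbolic-only PCC from aligned (canonical, produced) pairs. The `VOWELS` set and the pair format are illustrative assumptions, and distortions are not visible at this layer.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical vowel set used to decide which targets count as consonants.
VOWELS = {"æ", "ɑ", "ə", "i", "ɪ", "u", "ʊ", "ɛ", "ɔ", "ʌ", "ɝ"}

Pair = Tuple[Optional[str], Optional[str]]  # (canonical, produced)


def pcc(pairs: Iterable[Pair]) -> float:
    """Percent consonants correct over target consonants.

    A deletion is (canonical, None) and counts as an error.
    An insertion is (None, produced) and is not a target, so it is skipped.
    """
    targets = [(c, p) for c, p in pairs if c is not None and c not in VOWELS]
    if not targets:
        raise ValueError("no consonant targets to score")
    correct = sum(1 for c, p in targets if c == p)
    return 100.0 * correct / len(targets)
```

Reporting this number alongside the typed errors, not instead of them, keeps it from becoming the bare scalar that invariant 2 warns against.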


5. Layer 3 — Acoustic residual (integrated, never standalone)

Why it's required, not optional: broad-phoneme symbolic scoring is structurally blind to two things that are the clinical signal:

  • Gradient distortion — which is the disorder in dysarthria, and the distinguishing feature of AOS (distorted substitutions). A broad transcriber either rounds a distortion back to canonical (false "correct") or snaps it to a wrong category (false substitution) — either way the diagnostic signal is destroyed.
  • Covert contrast: children (and adults) transcribed as merging /s/–/ʃ/ who in fact maintain a sub-phonemic acoustic distinction. A symbolic scorer false-merges the two sounds, and because these speakers have a better prognosis, that is a costly error.

This is exactly why the standalone Praat UI (/dev/acoustic, PHON-130) was scrapped: formants in isolation are meaningless. The acoustic residual only has meaning fused into scoring. The extraction substrate is live and validated: packages/audio/src/phonolex_audio/acoustic.py (Parselmouth F1–F3 / F0 / duration, agreeing with Praat to within ~0 Hz), plus the transcriber's per-frame confidence/logits as a distortion-attention signal. Integrate it as the residual layer under the scorer; never resurrect it as a display.

Until the residual layer lands: the distortion bucket reads "not measured," never folded into substitution — silently mis-bucketing a motor-execution distortion as a phoneme-selection error is a fabrication (a silent-drop regression).
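
One way to make "not measured" impossible to confuse with zero is to model the distortion bucket as an optional value. A minimal sketch, with hypothetical type and field names:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorBuckets:
    substitutions: int
    deletions: int
    insertions: int
    # None means Layer 3 has not measured distortion; 0 means measured, none found.
    distortions: Optional[int] = None


def distortion_readout(b: ErrorBuckets) -> str:
    """Render the distortion bucket without ever folding it into substitutions."""
    return "not measured" if b.distortions is None else str(b.distortions)
```

Keeping `None` distinct from `0` means a later Layer 3 integration changes only what fills the field, not how the rest of the scorer counts substitutions.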


6. Layer 4 — Temporal / prosodic block

Rate of speech is a required emitted field (clinically mandated). It is load-bearing across exactly the scored populations: dysarthria (slowed/variable rate), AOS (slow + segmented prosody — the length effect), hypokinetic rushes, child connected speech.

It is a direct payoff of building the transcriber length-agnostic: per-frame phoneme timestamps yield speech rate (syll/sec, phon/sec), articulation rate (rate excluding pauses), and pause structure; the acoustic layer adds the F0 contour for prosody. Not a bolt-on.
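
A sketch of how those fields could be derived from phoneme timestamps. It assumes segments arrive as (phoneme, start, end) in time order, that a gap of at least 0.25 s counts as a pause, and that each vowel is one syllable nucleus; the threshold, the `VOWELS` set, and the field names are illustrative.

```python
from typing import Dict, List, Tuple

VOWELS = {"æ", "ɑ", "ə", "i", "ɪ", "u", "ʊ", "ɛ", "ɔ", "ʌ", "ɝ"}  # hypothetical set

Seg = Tuple[str, float, float]  # (phoneme, start_s, end_s)


def temporal_block(segments: List[Seg], min_pause_s: float = 0.25) -> Dict[str, float]:
    if not segments:
        raise ValueError("no segments to measure")
    total = segments[-1][2] - segments[0][1]  # first onset to last offset
    # Gaps between consecutive phonemes that are long enough to count as pauses.
    pauses = [
        nxt[1] - prev[2]
        for prev, nxt in zip(segments, segments[1:])
        if nxt[1] - prev[2] >= min_pause_s
    ]
    speaking = total - sum(pauses)
    if total <= 0 or speaking <= 0:
        raise ValueError("segments must span a positive duration")
    n_phon = len(segments)
    n_syll = sum(1 for ph, _, _ in segments if ph in VOWELS)
    return {
        "speech_rate_phon_per_s": n_phon / total,
        "speech_rate_syll_per_s": n_syll / total,
        "articulation_rate_phon_per_s": n_phon / speaking,  # pauses excluded
        "pause_count": float(len(pauses)),
        "pause_total_s": sum(pauses),
    }
```

Speech rate divides by the whole span while articulation rate divides by speaking time only, so the two diverge exactly when pauses are present.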


7. The product surface — unified Audio tab

  • Three-population dropdown (L2 / child L1 / adult L1) — selects the scoring context (which prior, which norms, which aggregate), not a model. One transcriber behind all three.
  • Record or upload, any length.
  • Error overlay: every edit rendered at its true position — insertions at their after_canonical_index, subs/deletions in place, never bunched at the trailing edge (the PronunciationViewer bug we will not repeat; the scoring layer already emits positioned data). Hover = cos_dist + feature delta + error type + expectedness.
  • Predicted-error-type panel: the prospective process analysis for SLP review (Layer 2 typing + expected/delayed/atypical).
  • Rate readout: Layer 4 temporal/prosodic block.
  • Per-population scope banner: states coverage honesty (below).
  • Later (PHON-146): SLP transcript scoring methods (PCC/PPC) for clinician convenience.
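
The positioning rule for the overlay can be sketched as a sort key. The op dictionaries here are an assumed shape, with `canonical_index` for substitutions, deletions, and matches, and `after_canonical_index` for insertions (-1 meaning before the first phoneme). The viewer itself lives in the web package; this only illustrates the ordering logic.

```python
from typing import List, Tuple


def overlay_order(ops: List[dict]) -> List[dict]:
    """Order edits for display so each insertion follows its anchor phoneme."""

    def key(op: dict) -> Tuple[int, int]:
        if op["op"] == "ins":
            # Sort directly after the canonical slot the insertion follows.
            return (op["after_canonical_index"], 1)
        return (op["canonical_index"], 0)

    # sorted() is stable, so several insertions at one anchor keep their order.
    return sorted(ops, key=key)
```

Sorting on the emitted indices means the display order never depends on the order in which the scorer happened to list its edits.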

8. The honest coverage boundary

The symbolic MVP ships for exactly what it can honestly score:

  • Covered (symbolic): L2 / accent, aphasic phonemic paraphasia, categorical child SSD.
  • Needs the acoustic residual layer (Layer 3): dysarthria, AOS distortion, covert contrast.

The three dropdown populations are not equal-coverage at MVP — this is stated in the scope banner, not implied by a uniform score. Distortion reads "not measured" until Layer 3 lands. Nothing here is a diagnostic claim.
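
A sketch of how the banner could be driven from one coverage table, so the UI cannot imply uniform coverage. The assignment of conditions to dropdown entries and the banner wording are assumptions drawn from the lists above, not a settled spec.

```python
from typing import Dict, List

# Assumed mapping of the coverage lists above onto the three dropdown entries.
COVERAGE: Dict[str, Dict[str, List[str]]] = {
    "L2": {"covered": ["L2 / accent"], "needs_layer3": []},
    "child L1": {"covered": ["categorical child SSD"],
                 "needs_layer3": ["covert contrast"]},
    "adult L1": {"covered": ["aphasic phonemic paraphasia"],
                 "needs_layer3": ["dysarthria", "AOS distortion", "covert contrast"]},
}


def scope_banner(population: str) -> str:
    """Build the per-population banner text from the coverage table."""
    entry = COVERAGE[population]
    parts = ["Scored symbolically: " + ", ".join(entry["covered"]) + "."]
    if entry["needs_layer3"]:
        parts.append("Distortion not measured; needs acoustic layer: "
                     + ", ".join(entry["needs_layer3"]) + ".")
    parts.append("Decision support only, not a diagnosis.")
    return " ".join(parts)
```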


9. Data & training

  • Transcriber union — real audio + produced labels only: LibriSpeech (CC-BY), Common Voice (CC0), PhonBank child (CC-BY-NC, registration-gated — usable), TORGO (open download, dysarthria), L2-ARCTIC (CC-BY-NC). Licensing footing: we ship models trained on the data, not the data itself.
  • Priors: L1 from L2-ARCTIC (have); developmental from McLeod & Crowe + process-suppression norms (expose as ranges); adult from AphasiaBank etc. — approval-gated (consortium membership), the harder wall.
  • Real adult-disordered clinical data: deferred to a commercial-IRB-designed collection + the Fall SBIR application. We drive our own data rather than beg academic gatekeepers.

10. MVP vs. deferred

MVP: Layer 1 (unified transcriber) + Layer 2 (symbolic scoring) + Layer 4 (rate) + the Audio tab, for the honestly-covered populations.

Deferred:

  • Layer 3 acoustic residual → unlocks the distortion populations (substrate exists; integration is the work).
  • SLP transcript scoring methods (PHON-146).
  • Real adult clinical data (commercial-IRB + SBIR).
  • Gender axis for adults (gender-affirming voice care): feasible later via Common Voice's gender labels; parked.


11. Risks & open questions

  • The unified retrain hasn't been run. The recipe is grounded (the ft-l2 existence proof), but training the single length- and population-diverse model is the central build risk.
  • AOS variance-prior model is undesigned — a fixed confusion channel is the wrong shape; needs an entropy/instability formulation.
  • Dialect / AAE false positives — the dominant validity threat; the explicit-reference-variety mechanism must be real, not an afterthought.
  • Soft developmental norms — process-suppression ages are clinical consensus; expose as ranges, never hard cutoffs.
  • Thin priors — L2-ARCTIC is ~4 speakers/L1; adult priors are gated. Calibrate confidence; back off to inventory priors for sparse cells.
  • Transcriber fidelity ceiling — off-the-shelf ASR is brittle on dysarthric speech; the unified model needs population-specific validation before its scores are trusted there.
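
The thin-prior mitigation can be sketched as a count-gated lookup. It assumes the channel is stored as counts per (canonical, L1, position) cell and that an inventory-level prior is available as a callable. The `min_count` threshold, the cell layout, and the labels in the usage are placeholders, not the layout of l1Prior.json.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

Cell = Tuple[str, str, str]  # (canonical, L1, position)


def channel_prob(
    counts: Dict[Cell, Counter],
    inventory_prior: Callable[[str, str, str], float],
    canonical: str, l1: str, position: str, produced: str,
    min_count: int = 20,  # placeholder sparsity threshold
) -> Tuple[float, str]:
    """Return P(produced | canonical, L1, position) and which estimate was used."""
    cell = counts.get((canonical, l1, position), Counter())
    total = sum(cell.values())
    if total >= min_count:
        return cell[produced] / total, "cell"
    # Too few observations: fall back to the coarser inventory-level prior.
    return inventory_prior(canonical, l1, produced), "inventory-backoff"
```

Returning the source label with the probability lets the UI show lower confidence whenever the estimate came from the backoff path.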

12. References

Live assets: research/2026-06-03-phon-139-transcriber-ft/ckpt/fullA_s2/state.pt (ft-child), research/2026-06-05-phon-142-ft-l2/ckpt/full_s17/state.pt (ft-l2, the existence proof) · packages/audio/src/phonolex_audio/{transcribe.py,transcribe_ft.py,mapping.py,acoustic.py,server.py} · packages/features/outputs/vectors.csv · packages/web/workers/src/lib/pronunciationScore.ts · packages/web/workers/src/config/l1Prior.json.

Research this session: research/2026-06-06-audio-triage/ (FINDINGS.md, FINDINGS_L2LIBRI.md, harnesses, parquets) — the length-agnosticism evidence. Three scoring-literature syntheses (L2/Flege-PAM, child/developmental, adult-acquired) — synthesized into §4–§6; key sources: Flege & Bohn 2021 (SLM-r), Best 1995/2007 (PAM), Munro & Derwing (intelligibility), McLeod & Crowe 2018 (acquisition), Shriberg & Kwiatkowski 1982 (PCC), Dodd (differential diagnosis), Haley 2017 (AOS consistency), Darley-Aronson-Brown (dysarthria subtypes).

Tickets: PHON-128 (off-the-shelf baseline), PHON-139 (ft-child), PHON-142 (ft-l2 + L1 prior), PHON-129 (scorer), PHON-126 (graded metric), PHON-130 (acoustic extraction), PHON-122 (acoustic residual layer), PHON-144 (symbolic typing), PHON-145 (Audio tab), PHON-146 (SLP scoring methods), PHON-143 (adult — IRB/SBIR), PHON-141 (transcriber inventory expansion).

Memory: memory/project_v6_audio_mvp_unified.md.