v6 Audio — Continuous Feature-Prediction Model (the plan that never got written)

Author: recovered by Claude from session 154041c2 transcript (user's own messages, cited inline) + the 2026-03-13 continuous-feature-learning design.
Date: 2026-06-08
Status: DRAFT; design recovered, one training-objective decision open (§6). No code/compute until that's settled and this is checked.

This document exists because the design below was decided in conversation and never captured in a plan (user msg 108: "we DID have an entire conversation that SHOULD have been in the plan"; msg 140: "you never wrote it"). Everything in §2–§5 is the user's stated design, quoted. The phoneme-CTC framing in 2026-06-06-v6-audio-design.md §3 is wrong and is superseded by this doc.

1. The task

A user is prompted with a target word, says it, and produces something that deviates from the target. We do not transcribe "what they said" as a symbol string. We predict the deviation from the target, in continuous feature space.

msg 136: "Someone is prompted with a word, they say the word. They don't produce it perfectly, they produce something else. We don't predict that something else, but the deviation from the target. That is all in a continuous, not discrete, feature space. How glottal is it? How lateral? rhotic?"

1a. What we've already learned that this builds on (do NOT re-derive)

  1. Diphthongs are dynamic. composite.py: 5 diphthongs eɪ oʊ aɪ aʊ ɔɪ, each α·v_onset + β·v_offset (α≈1.50 onset-dominant, β≈0.20). The static composite (composites.csv, 63 segs) is the lexicon representation; for per-frame audio prediction the diphthong is a trajectory — frames move v_onset → v_offset. CTC targets for a diphthong span its frames as that traversal, not a single collapsed vector. The feature-learning fit already consumed Hillenbrand formant trajectory evidence (run.py) — the acoustics↔features bridge for vowels/diphthongs, directly relevant to an acoustic→feature model.
  2. Two vector sets, Bayesian, continuous. vectors.csv (58 segments, post-PHON-141) + composites.csv (+5 diphthongs). Hayes prior + ECCC + Hillenbrand + corpus-confusion; r=0.987 vs PHOIBLE, 8/8 voicing. NOT PHOIBLE. These are the prediction space and the alignment anchors.
  3. Faithful = length-agnostic (PHON-139/142). Word-only training collapses on sentences (fixed ~one-word output regardless of input). The fix is a length-diverse union with connected speech; "any part of any waveform → a feature vector, null for static."
  4. Canonical-collapse is the structural enemy (PHON-123 retro). Transcribe-then-align regressed disordered productions to canonical (47% SSD / 59% CAS). Train on produced targets, and measure collapse. This is §6a (contrast preservation) at the source — the same lesson.
  5. Per-frame confidence localizes distortion (PHON-139 compare viewer). Clear segments score ~0.99, distorted/uncertain ones drop (tap l=0.67, voicing p=0.56). Emit per-frame confidence alongside features — it's the distortion / covert-contrast attention signal (the PHON-122 acoustic-residual seam).
  6. The scoring layer already exists. cos_dist over the learned vectors (PHON-126, Mann-Whitney p=1.9e-4) + WPER sequence alignment (PHON-129). Deviation-from-target = per-aligned-position feature distance. Build the model to feed this, don't reinvent it.
  7. Produced-vs-canonical = confusion data (PHON-141). The same pairs that expanded the vectors are the audio training signal and carry the contrast to preserve.
  8. Staged training is how we train (PHON-139 recipe + continuous-feature §5.2). Freeze→unfreeze with warmup; add data/populations in measurable isolation. train_union.py already has --freeze-encoder/--warmup-steps.
  9. Encoder substrate (PHON-128). wav2vec2-lv-60-espeak-cv-ft, 9.3% PER native clean, reference-robust — the acoustic backbone.
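
Item 1 distinguishes the static composite (lexicon) from the per-frame trajectory (audio targets). A minimal sketch of the two, assuming toy 2-d vectors in place of the real 26-d ones and assuming linear interpolation for the traversal (the text only says frames move v_onset → v_offset; the path shape and the function names here are hypothetical, not taken from composite.py):

```python
from typing import List

ALPHA = 1.50  # onset weight, the approximate value quoted in item 1
BETA = 0.20   # offset weight, the approximate value quoted in item 1

def composite(v_onset: List[float], v_offset: List[float]) -> List[float]:
    """Static lexicon representation: alpha * v_onset + beta * v_offset."""
    return [ALPHA * a + BETA * b for a, b in zip(v_onset, v_offset)]

def trajectory(v_onset: List[float], v_offset: List[float], n_frames: int) -> List[List[float]]:
    """Per-frame path from v_onset to v_offset (linear path is an assumption)."""
    if n_frames == 1:
        return [list(v_onset)]
    frames = []
    for t in range(n_frames):
        w = t / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append([(1 - w) * a + w * b for a, b in zip(v_onset, v_offset)])
    return frames

onset, offset = [1.0, 0.0], [0.0, 1.0]   # toy vectors, not real segment vectors
static_vec = composite(onset, offset)     # one collapsed vector
frame_vecs = trajectory(onset, offset, 3) # three vectors, onset to offset
```

The composite is a single point; the trajectory starts at the onset vector and ends at the offset vector, which is what a per-frame target needs.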

2. What the model predicts — CONTINUOUS FEATURES, not phonemes

The model predicts the 26-d continuous articulatory feature vector per frame (the learned Hayes feature space, packages/features/outputs/vectors.csv), with a null/silence output for static. It does not emit a discrete phoneme class.

msg 134: "we predict features not phonemes. that's always been the design." msg 135: "why would you predict discrete phonemes when we don't have discrete phonemes." msg 60: "they stand in for feature vectors where the features are continuous, not binary. If we do the prediction on the features, we can map them to the closest phonemes for broad category while retaining fine-grained feature analysis and specific feature distances."

Phonemes are derived, not predicted: map the predicted feature vector to its nearest phoneme for a broad-category label, while the continuous features are retained for fine-grained analysis and exact feature distances.
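
The derivation step can be sketched as a nearest-anchor lookup. This is illustrative only: the anchors below are toy 3-d vectors rather than rows of vectors.csv, and squared Euclidean distance is assumed here because it matches the readout in §6b; `nearest_phoneme` is a hypothetical name.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_phoneme(e, anchors):
    """Return (label, squared distance) of the anchor closest to e.

    The continuous vector e is kept by the caller; the label is only
    a broad-category readout derived from it.
    """
    label = min(anchors, key=lambda k: sq_dist(e, anchors[k]))
    return label, sq_dist(e, anchors[label])

# Toy anchors (assumed values, not the learned vectors).
ANCHORS = {
    "t": [0.0, 0.0, 1.0],
    "ɾ": [1.0, 0.0, 1.0],
    "s": [0.0, 1.0, 0.0],
}

predicted = [0.7, 0.1, 0.9]
label, dist = nearest_phoneme(predicted, ANCHORS)
```

Here the prediction resolves to ɾ for the broad label, while the vector itself still records that it sits part of the way toward t.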

Faithful = token-length-agnostic (msg 20): "If it reads any part of a wave form, it better produce a [feature vector]. Or a null [output] for static… faithful MEANS it doesn't matter where it comes from." Any part of any waveform → a feature vector or null; no collapse with length.

3. The feature space (already built — PHON-141, done)

The learned Bayesian vectors (NOT PHOIBLE): Hayes prior + ECCC + Hillenbrand + corpus-as-confusion evidence. Two sets of vectors: segment vectors and diphthong composites (msg 41: "There are two sets of vectors. One for fucking diphthongs"). The composites use the α·v_onset + β·v_offset representation from the 2026-03-13 design §4. composites.csv held 45 = 40 mono + 5 diphthong composites before the PHON-141 expansion; the 63 quoted in §1a is 58 + 5 after it.

Expanded 40→58 in PHON-141 to carry the nonstandard segments the transcripts already produce (msg 35: "expanding the feature vectors to include nonstandard english vectors that MOST OF THE TRANSCRIPTS ALREADY PUT IN THE PRODUCED COLUMN… is produced vs canonical not confusion data?"). Done, validated, committed.

3a. The two differences from v3 (why this combination, and why it captures non-canonical phones)

v3 (our best prior model) is a broad-40 discrete phoneme-CTC transcriber. This model differs in exactly two ways, and they are inseparable:

  1. Feature-based prediction (continuous, §2) instead of discrete phoneme classes.
  2. Expanded 58-segment inventory (PHON-141) instead of broad-40. v3 resolved every production to broad-40, so the non-canonical phones the transcripts already carry — ʈ ɖ ɳ ɬ ʔ ɾ ɫ + central vowels — were structurally collapsed (ɾ→t, ʔ→∅, ɬ→s). The whole point of the interim work was to make those representable.

Why the two must go together, and why the discrete retrain I ran failed to emit them: a discrete-CTC class needs many labeled examples to train, so the sparse new classes (ʈ=1 train token, ɬ=6) were never emitted (my eval: emitted 1/18). The feature-based path removes that requirement:

  • The model learns one global acoustics→features map from all frames, not a per-class detector.
  • The non-canonical segments are fixed anchors in feature space (PHON-141), not classes that must be independently learned.
  • "Resolve phonemes" = nearest anchor in continuous space, so a retroflex production lands near V_ʈ and resolves to ʈ without needing many ʈ examples; the soft distance readout makes it a graded lean toward the non-canonical anchor rather than a hard collapse to canonical.
  • Honest caveat: genuinely rare segments stay data-limited for learning their acoustics, but they are now represented and resolved-toward (graded) instead of hard-collapsed. That is the capture broad-40 + discrete CTC structurally could not do.
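
The "graded lean" can be made concrete with the distance readout from §6b (logit_c = −‖e−V_c‖²/τ plus a blank). A minimal sketch under stated assumptions: two toy 2-d anchors stand in for V_t and V_ʈ, τ=1.0 is a placeholder (the text gives no value), and the learned blank is reduced to a fixed scalar logit.

```python
import math

def readout(e, anchors, blank_logit, tau=1.0):
    """Softmax over negative squared distances to each anchor, plus a blank.

    Returns a dict mapping each label (and '<blank>') to a probability.
    """
    logits = {"<blank>": blank_logit}
    for label, v in anchors.items():
        logits[label] = -sum((x - y) ** 2 for x, y in zip(e, v)) / tau
    top = max(logits.values())                      # subtract max for stability
    exps = {k: math.exp(z - top) for k, z in logits.items()}
    total = sum(exps.values())
    return {k: x / total for k, x in exps.items()}

# Toy anchors (assumed values): canonical t and non-canonical ʈ.
ANCHORS = {"t": [0.0, 0.0], "ʈ": [1.0, 0.0]}

leaning = readout([0.8, 0.0], ANCHORS, blank_logit=-5.0)   # nearer ʈ
midway = readout([0.5, 0.0], ANCHORS, blank_logit=-5.0)    # equidistant
```

A production near V_ʈ gets more mass on ʈ than on t without any ʈ-specific classifier weights, and one exactly between the anchors splits its mass evenly. That is the graded behaviour a hard class decision cannot express.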

4. Deviation + population live on the SCORING side, not the model

The model is unified and population-agnostic (msg 17: "It is not dependant on child or adult or l1 speech. All of that takes place on the scoring side. The model itself can even draw on broader phonetic transcriptions"). The model predicts the production's feature trajectory faithfully; the scorer computes deviation from the prompted target's feature sequence (how glottal/lateral/rhotic vs intended) and applies the population prior (L2 / child L1 / adult L1) — the existing pronunciationScore.ts / l1Prior direction, but over predicted features vs target features, not phoneme symbols.
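
What "how glottal/lateral/rhotic vs intended" means on the scoring side can be sketched as a signed per-feature difference between the predicted and target vectors. The three feature names and the 3-d vectors are hypothetical stand-ins for the 26 learned dimensions, and no population prior is applied here because the text does not specify its form.

```python
# Hypothetical dimension names; the real 26 feature names live with vectors.csv.
FEATURES = ["glottal", "lateral", "rhotic"]

def feature_deviation(predicted, target):
    """Signed per-feature deviation: positive means more of that feature than intended."""
    return {name: p - t for name, p, t in zip(FEATURES, predicted, target)}

# Toy values: the production is more glottal than the prompted target.
dev = feature_deviation(predicted=[0.75, 0.0, 0.25], target=[0.25, 0.0, 0.25])
```

The model only supplies `predicted`; the target vector and any population weighting belong to the scorer, which is why the same model output can be scored for L2, child L1 or adult L1 speech.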

5. Staged training (the methodology I abandoned)

msg 140: "You abandoned everything we ever learned. You're not doing a staged training. You're doing random shit."

We train staged, the way we always have — the continuous-feature design's "establish a clean baseline, add evidence, measure its impact in isolation" (§5.2) and the PHON-139 freeze→unfreeze anti-collapse recipe (already in train_union.py: --freeze-encoder, --warmup-steps), which the cold single-stage full-FT I ran bypassed. Concretely (schedule tunable):

  • Stage 1 — frozen encoder, train the feature head on clean/native first → establish a faithful baseline with stable acoustic representations (avoids the cold-start collapse).
  • Stage 2 — unfreeze with LR warmup, add accented → child/disordered, measuring each population's impact in isolation (diff against the stage-1 baseline), exactly as the feature-learning phases were measured.
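
The two stages can be written down as a small schedule. This is a sketch, not train_union.py's implementation: the stage-1 learning rate and the warmup length are placeholders (only the stage-2 rate of ~1e-5 appears in this document, in §8), and linear warmup is an assumption.

```python
# Placeholder schedule; only the stage-2 lr (~1e-5) comes from the document.
STAGES = {
    1: {"freeze_encoder": True,  "lr": 1e-3, "warmup_steps": 0},    # head only
    2: {"freeze_encoder": False, "lr": 1e-5, "warmup_steps": 500},  # gentle unfreeze
}

def trainable(stage):
    """Which parameter groups receive gradients in a given stage."""
    return {"encoder": not STAGES[stage]["freeze_encoder"], "head": True}

def lr_at(stage, step):
    """Learning rate at a 0-based step: linear warmup to the stage lr, then constant."""
    cfg = STAGES[stage]
    if cfg["warmup_steps"] <= 0:
        return cfg["lr"]
    return cfg["lr"] * min(1.0, (step + 1) / cfg["warmup_steps"])
```

Keeping the stages as data makes the "measure each addition in isolation" rule easier to enforce: a new population becomes a new entry diffed against the previous stage's metrics, not an edit to a running job.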

6b. Loss — faithful + contrast-margin + no-delete (2026-06-09, the corrected objective)

The original loss (§6 below) was pure CTC to the produced anchors. It is faithful in target but not in behaviour: because near-neighbour anchors (V_ʈ and V_t, ~2 features apart) satisfy CTC "well enough," the model emits e at the canonical neighbour (measured: retroflex productions land at cos_dist 0.035 from t, 0.17 from ʈ) and deletes brief segments such as ʔ via blank. It transcribed the canonical, the thing that never appears in the waveform. The waveform already carries the cue (F3 dip for retroflex, F0 break for glottal); the loss just never forced the model to use it. No Praat/formant input is added: the acoustics are in the waveform, and the loss is the fix. (Formants are a back-pocket, tightly-scoped guardrail for round 2 only if the margin fails: confident voiced frames, F1/F2/F3 dims, low weight. Praat is least reliable on exactly the creaky/child frames we care about, whereas the canonical reference is exact.)

The model is unchanged (wav2vec2 → per-frame e ∈ [0,1]²⁶ sigmoid; readout logit_c = −‖e−V_c‖²/τ over 58 anchors + learned blank). The loss gains two terms and needs the canonical sequence per utterance (added to the manifest/JSONL as canonical; for clean sources TIMIT/LibriSpeech canonical=produced):

L = L_ctc + λ_m · L_margin + λ_b · L_nodelete

  1. L_ctc — CTC over the produced anchor sequence (blank=0). Faithful alignment. (have it)
  2. L_margin — contrast preservation. A forced alignment A of the produced sequence to frames is computed from the current per-frame log-probs (CTC Viterbi / torchaudio.functional.forced_align, under no_grad). The produced and canonical sequences are Levenshtein-aligned once per utterance → canon_of[i] (the canonical phone paired with produced position i, or None for insertions). For each non-blank frame t aligned to produced position i with p=produced[i], c=canon_of[i], where c exists and p≠c: L_margin += max(0, margin − (d2(e_t, V_c) − d2(e_t, V_p))) (squared-euclidean; e_t must be closer to produced than canonical by margin). Averaged over qualifying frames. This is "maintain produced-vs-canonical distance where there's variation."
  3. L_nodelete — at non-blank forced-aligned frames, penalize blank probability: L_nodelete += −log(1 − p_blank,t), averaged. Fights the ʔ deletion (blank is no longer free where a produced segment is aligned).
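
The canon_of[i] mapping used by L_margin can be sketched as a plain Levenshtein alignment with backtrace. Unit costs and the tie-breaking order (diagonal first) are assumptions; the text only says the two sequences are Levenshtein-aligned once per utterance.

```python
def canon_of(produced, canonical):
    """For each produced position, the aligned canonical phone, or None for insertions."""
    n, m = len(produced), len(canonical)
    # dp[i][j] = edit distance between produced[:i] and canonical[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (produced[i - 1] != canonical[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    out = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + (produced[i - 1] != canonical[j - 1]):
            out[i - 1] = canonical[j - 1]  # match or substitution
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                         # produced phone has no canonical partner
        else:
            j -= 1                         # canonical phone was not produced
    return out

pairs = canon_of(["ɾ", "a", "t"], ["t", "a", "t"])
```

For the produced [ɾ a t] against canonical [t a t], only position 0 has p≠c, so only frames aligned to that ɾ would qualify for the margin term.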

Hyperparameters: margin=0.5, λ_m=0.5, λ_b=0.2 (tunable). The forced-alignment path is detached (no grad through alignment); gradients flow through e/logits at the aligned frames. L_margin/L_nodelete engage only after a short CTC-only warmup (alignments must be sane first).
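
The two added terms and their combination can be sketched in plain Python to make the arithmetic checkable by hand. Assumptions: the per-frame alignment `align` (produced position, or None for a blank frame) is given, standing in for the detached forced alignment; L_ctc is passed in as a number and not computed; the anchors are toy 2-d vectors. In training these would be tensor operations with gradients flowing through e and the logits.

```python
import math

def d2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def margin_loss(frames, align, produced, canon, anchors, margin=0.5):
    """Hinge on d2(e, V_canonical) - d2(e, V_produced), over frames where p != c."""
    terms = []
    for e_t, i in zip(frames, align):
        if i is None:
            continue                      # blank frame: not aligned to a segment
        p, c = produced[i], canon[i]
        if c is None or p == c:
            continue                      # insertion, or no variation to preserve
        gap = d2(e_t, anchors[c]) - d2(e_t, anchors[p])
        terms.append(max(0.0, margin - gap))
    return sum(terms) / len(terms) if terms else 0.0

def nodelete_loss(p_blank, align, eps=1e-8):
    """Mean of -log(1 - p_blank) over frames aligned to a produced segment."""
    terms = [-math.log(max(eps, 1.0 - pb))
             for pb, i in zip(p_blank, align) if i is not None]
    return sum(terms) / len(terms) if terms else 0.0

def total_loss(l_ctc, l_margin, l_nodelete, lam_m=0.5, lam_b=0.2):
    """L = L_ctc + lam_m * L_margin + lam_b * L_nodelete."""
    return l_ctc + lam_m * l_margin + lam_b * l_nodelete

ANCHORS = {"t": [0.0, 0.0], "ʈ": [1.0, 0.0]}  # toy values
```

With these toy anchors, a frame at [0.5, 0.0] aligned to a produced ʈ (canonical t) is equidistant from both, so it pays the full margin of 0.5; a frame at [1.0, 0.0] clears the margin and pays nothing.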

6. Training objective — ALIGNMENT (original; §6b supersedes the loss)

Not an open question after all — I'd manufactured one. Every decent prior approach (PHON-126 cos_dist, PHON-128/139/142 transcribers, PHON-129 WPER scorer) is alignment-based, and so is this:

  • The acoustic model predicts a continuous feature vector per frame (so a production between t and ɾ stays between them — never argmax'd to a discrete class).
  • It is trained with CTC alignment (CTC is alignment), where the per-segment targets are the feature vectors of the produced phonemes (msg: "the feature vector can be from the phonemes provided. Why not?") — the produced narrow transcription supplies the targets; nothing is forced-aligned per frame.
  • At scoring, the predicted feature sequence is sequence-aligned (Levenshtein/WPER, PHON-129) against the prompted target's feature sequence; the per-position feature distance (cos_dist over the learned vectors, PHON-126) is the deviation — how glottal/lateral/rhotic vs intended.

This is option (B) of the earlier draft, and it is the established recipe — not a new invention. The earlier "frame-aligned feature regression (A)" with MFA per-frame targets was woven from thin air and is removed.

The only real failure was execution, not design: the distance-readout was trained cold, single-stage, fell into the CTC blank attractor, and I thrashed. The fix is the staged freeze→unfreeze training in §5 — which train_union.py already supports (--freeze-encoder, --warmup-steps) and which I bypassed — building on the prior approaches, not replacing them.
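
The scoring step in the third bullet above (sequence-align, then read per-position feature distance) can be sketched as a weighted edit-distance alignment with cos_dist as the substitution cost. This is a stand-in, not the PHON-129 WPER scorer: the gap cost of 1.0, the function names and the toy vectors are all assumptions.

```python
import math

def cos_dist(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)

def align_deviation(pred, target, gap=1.0):
    """Align two feature sequences; return [(pred_idx, target_idx, cos_dist), ...]."""
    n, m = len(pred), len(target)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + cos_dist(pred[i - 1], target[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        d = cos_dist(pred[i - 1], target[j - 1])
        if math.isclose(dp[i][j], dp[i - 1][j - 1] + d, abs_tol=1e-12):
            pairs.append((i - 1, j - 1, d))   # aligned position: d is the deviation
            i, j = i - 1, j - 1
        elif math.isclose(dp[i][j], dp[i - 1][j] + gap, abs_tol=1e-12):
            i -= 1                            # extra predicted segment
        else:
            j -= 1                            # target segment with no prediction
    return pairs[::-1]
```

Each returned triple is one aligned position and its deviation, which is the quantity the population prior would then be applied to.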

6a. Contrast preservation — faithfulness at the feature level

User: "if we have produced v canonical, we very well should maintain the contrast in our faithful feature prediction where there is variation at all."

Where a production genuinely varies — from canonical, or between tokens that share one broad transcription — the predicted continuous features must preserve that variation. The clinical case: a speaker who maintains a sub-phonemic /s/–/θ/ distinction that a broad transcript merges to [θ] must receive different predicted feature vectors for the two productions, not both collapsed to V_θ. This is covert contrast, the highest-value signal (maintained contrast ⇒ better prognosis; a false-merge is the costly error).

Consequences for the design:

  • Third reason it's continuous features, not discrete phonemes: an argmax to a phoneme class is the contrast collapse.
  • Third reason the thin-air frame-regression (A) is wrong: hard-regressing every [θ] frame to the single symbol vector trains the contrast out. The soft CTC alignment (§6) treats the produced phoneme as a feature-distance anchor, not a hard target, so a covert-/s/-leaning [θ] can land between V_θ and V_s while still aligning to θ; the variation is retained in the emitted features.
  • Property to MEASURE (feature-level anti-collapse; the PHON-123 retro lesson, restated): on produced-vs-canonical pairs that differ, verify the predicted features preserve the produced-side variation and do not regularize toward canonical. Scope is "where there is variation at all": no variation, nothing to preserve. This is also a training-pressure check: CTC pulls aligned frames toward their anchor, so over-collapse is the failure mode to watch and the staged schedule / loss must not erase real variation.
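
One possible way to operationalise the property to measure is a canonical-collapse rate over segments where produced and canonical differ. The document does not define the metric, so the definition below (a segment counts as collapsed when its predicted vector is strictly closer to the canonical anchor than to the produced one) is an assumption, as are the toy anchors.

```python
def d2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def collapse_rate(items, anchors):
    """Fraction of differing (produced != canonical) items predicted nearer canonical.

    items: iterable of (predicted_vector, produced_label, canonical_label).
    Returns None when no item differs, since there is nothing to preserve.
    """
    differing = [(e, p, c) for e, p, c in items if p != c]
    if not differing:
        return None
    collapsed = sum(1 for e, p, c in differing
                    if d2(e, anchors[c]) < d2(e, anchors[p]))
    return collapsed / len(differing)

ANCHORS = {"t": [0.0, 0.0], "ʈ": [1.0, 0.0]}  # toy values
ITEMS = [
    ([0.1, 0.0], "ʈ", "t"),   # regularized toward canonical t
    ([0.9, 0.0], "ʈ", "t"),   # produced-side variation preserved
    ([0.0, 0.0], "t", "t"),   # no variation, excluded from the rate
]
rate = collapse_rate(ITEMS, ANCHORS)
```

Restricting the denominator to differing pairs mirrors the "where there is variation at all" scope: clean tokens cannot dilute the rate.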

7. Out of scope / settled elsewhere

  • Acoustic residual (distortion gradient) = later quality pass (msg 7), substrate = the predicted-feature confidence + acoustic.py.
  • Word boundaries + rate of speech layer = after faithful feature prediction (msg 20).
  • Real clinical data = deferred to IRB → Fall SBIR (msg 17).

8. Implementation strategy (locked) + staged task list

One line: continue-FT from v3 → swap to a continuous-feature head with a soft produced-phoneme-anchor CTC readout over the 58-segment inventory → stage-1 frozen-encoder head warmup (kills the blank attractor) → stage-2 gentle unfreeze, balance-pop, produced targets → resolve to nearest non-canonical anchor; score by feature-distance deviation from the prompted target.

All work in research/2026-06-06-audio-union/. Reuses the existing union (23,516 clips) + train_union.py (already has --freeze-encoder, --warmup-steps, --init-ckpt, --balance-pop, --child-boost). v3 weights at /Volumes/ExternalData1/audio-union/model_v3/state.pt.

  • Task 1 — Feature-emission head + soft-anchor readout (model). Linear(1024→26)+sigmoid = emitted e; readout −‖e−V_c‖²/τ over the 58 vectors.csv anchors + a learned null channel; per-frame confidence = max anchor-similarity. Encoder init from v3 (--init-ckpt model_v3/state.pt, mapping v3's transformer; the head is new). Gate: forward pass emits e∈[0,1]^26, finite CTC loss, confidence in range.
  • Task 2 — Targets from produced phonemes (label path). Produced narrow transcription → expanded-58 tokens (to_expanded, done) → each token's vectors.csv vector as its soft anchor id. Diphthongs: emit the produced diphthong as its v_onset then v_offset anchors across frames (trajectory), not the composite. No discrete class targets; no canonical. Gate: a produced [ɾ a t] aligns to the ɾ,a,t anchors (not t,a,t).
  • Task 3 — Stage 1: frozen-encoder head warmup (the blank-attractor fix). --freeze-encoder, blank-warmup ON, higher LR on the head only, on the full balanced union. Gate (HARD): non-empty decodes + loss descends past the ~4.1 plateau my cold runs stuck at; per-frame e is sharp (nearest-anchor distance ≪ second-nearest on clear segments).
  • Task 4 — Stage 2: gentle unfreeze. Load stage-1, unfreeze transformer, low LR (~1e-5) + LR warmup, --balance-pop, full union, produced targets. Gate: val feature-distance improves over stage-1; no blank regression.
  • Task 5 — Validation in feature space (not categorical). Per-population (clean/L2/child/dysarthria) mean feature-distance deviation (cos_dist); canonical-collapse rate (PHON-123 check); covert-contrast preservation (produced-vs-canonical pairs that differ keep their difference); length-ratio ≈1.0; non-canonical resolution (does ɾ/ʔ/ɬ get resolved where the production warrants, graded — not the discrete 1/18). Gate: beats v3's broad-40 ceiling on non-canonical capture without regressing clean/L2 deviation.
  • Task 6 — Scoring hook (reuse, don't reinvent). Predicted feature trajectory → WPER-align (PHON-129) to the prompted target's canonical feature sequence → per-position feature-distance = deviation → confidence flags distortion. Population prior on the scoring side (l1Prior).
  • Task 7 — Prediction triage automation (msg 39/40). Generate a batch of predictions across populations/lengths and surface the worst feature-distance / lowest-confidence / collapse cases for review — the "second set of eyes" you asked for, so we fix the right things in the quality pass.
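
Task 7's triage can be sketched as a sort over per-clip summaries. The record fields and the ranking order (collapsed cases first, then largest deviation, then lowest confidence) are hypothetical; the task only asks to surface the worst feature-distance, lowest-confidence and collapse cases.

```python
def triage(records, k=3):
    """Return the k records most in need of review.

    Each record is a dict with hypothetical keys:
      'clip' (id), 'deviation' (mean cos_dist), 'confidence' (0..1), 'collapsed' (bool).
    """
    ranked = sorted(
        records,
        key=lambda r: (not r["collapsed"], -r["deviation"], r["confidence"]),
    )
    return ranked[:k]

# Toy batch (made-up values for illustration only).
RECORDS = [
    {"clip": "a", "deviation": 0.10, "confidence": 0.99, "collapsed": False},
    {"clip": "b", "deviation": 0.40, "confidence": 0.60, "collapsed": False},
    {"clip": "c", "deviation": 0.20, "confidence": 0.70, "collapsed": True},
]
worst = triage(RECORDS, k=2)
```

Running this per population and per length bucket, not on the pooled batch, would keep one large population from crowding out the review queue.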

Staging cost: one A40 cycle (stage-1 + stage-2 chained), ~the v3 budget. No new data prep — the union + expanded labels already exist.