v6 Audio: Trajectory-Aware Scoring + the Diphthong Correction

SUPERSEDED 2026-06-13 by docs/superpowers/plans/2026-06-13-all-phones-are-trajectories.md. This doc encodes a two-tier "categorical anchor + formant-trajectory" design that preserved the single-vector collapse and a categorical anchor tier that were never asked for. Kept only as a record of the wrong turn. Read the plan, not this.

Date: 2026-06-13
Status: SUPERSEDED (see banner). Design converged in conversation 2026-06-12/13; written down because the last diphthong-as-trajectory intent lived only in memory and got dropped (§1).
Scope: within v6 = faithful feature transcriber → symbolic error classification → product tab (project_v6_scope_transcriber_only). This is the scoring/representation layer. It is NOT L1 prediction (attribution is back-pocket); it is decision support, never diagnosis.

1. The problem

Diphthongs were dropped entirely. lib_labels._tokenize had TWO = {tʃ, dʒ} and an inventory with no diphthongs, so oʊ/eɪ/aɪ/aʊ/ɔɪ matched neither the two-char set nor a single inventory entry and fell through to component chars → o, ʊ. Split by omission, everywhere: training labels split them; the scorer marked the atomic lexicon-canonical oʊ in_inventory:false and skipped it ("not in inventory" greyed cell). A diphthong word's nucleus — the point of the word — went unscored.
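The fall-through can be reproduced with a minimal sketch, assuming a greedy matcher that tries two-character units first. The names `tokenize`, `AFFRICATES` and `DIPHTHONGS` are illustrative stand-ins, not the actual lib_labels API; only the contents of the two-character set come from the text above.

```python
# Hypothetical stand-ins for the lib_labels sets; only the matching order matters here.
AFFRICATES = {"tʃ", "dʒ"}                      # the original TWO set
DIPHTHONGS = {"oʊ", "eɪ", "aɪ", "aʊ", "ɔɪ"}    # the units that were missing


def tokenize(text, two_char_units):
    """Greedy matcher: try a two-character unit first, else emit one character."""
    tokens = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in two_char_units:
            tokens.append(pair)
            i += 2
        else:
            # Fall-through: an unknown pair is silently split into its characters.
            tokens.append(text[i])
            i += 1
    return tokens


before = tokenize("boʊt", AFFRICATES)               # ['b', 'o', 'ʊ', 't']
after = tokenize("boʊt", AFFRICATES | DIPHTHONGS)   # ['b', 'oʊ', 't']
print(before, after)
```

The split is never reported as an error, which is why it went unnoticed in both the training labels and the scorer.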

The deeper issue the diphthong exposed: the model emits continuous per-frame features, a trajectory through the 26-d space, which was the whole v6 design ("predict continuous features per frame"). But the scoring collapses each phone to a single anchor point and measures point-distance, discarding the dynamics the model already computed. That collapse is invisible for steady-ish phones and catastrophic for diphthongs, whose entire identity is movement, and it still loses real signal for every phone (VOT/aspiration timing, affricate stop+frication structure, vowel-inherent spectral change). The point-anchor scoring was a simplification of the continuous-feature vision; diphthongs are where the seam tore.

2. Principle: all phones are trajectories

A /o/ has formant movement and transitions; an /oʊ/ has more; a stop is closure→burst→release. Every phone is a path through feature space over its frames; the single feature vector is a centroid summary we already accept for all of them. So a diphthong needs no special category: it is a phone whose trajectory is non-flat. The composite ((α·onset + β·offset)/(α+β), with α=1.43 and β=0.54, so oʊ ≈ 73% o / 27% ʊ) is just its centroid, exactly like any monophthong's vector. Treating diphthongs as a different kind of object requiring bespoke machinery was the error.
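The 73% / 27% split follows directly from the weights once the blend is normalized (the normalized form is the one §4 states). A small sketch of the arithmetic, using made-up 3-d vectors in place of the real 26-d anchors; `blend_anchor` is a hypothetical helper name:

```python
ALPHA, BETA = 1.43, 0.54  # onset / offset weights quoted in the text


def blend_anchor(onset, offset, alpha=ALPHA, beta=BETA):
    """Normalized convex blend (alpha*onset + beta*offset) / (alpha + beta)."""
    total = alpha + beta
    return [(alpha * a + beta * b) / total for a, b in zip(onset, offset)]


onset_share = ALPHA / (ALPHA + BETA)    # about 0.726
offset_share = BETA / (ALPHA + BETA)    # about 0.274

# Illustrative vectors only; the real anchors are 26-d learned features in [0, 1].
v_o = [1.0, 0.0, 0.5]
v_u = [0.0, 1.0, 0.5]
v_ou = blend_anchor(v_o, v_u)
print(round(onset_share * 100), round(offset_share * 100), v_ou)
```

Because the weights are normalized, every component of the blend stays between the two inputs, so anchors in [0,1] yield a composite in [0,1].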

Two consequences:
- Categorically, oʊ is one unit / one label / one anchor, like any phone.
- Dynamically, scoring should compare the emitted trajectory to a reference trajectory, not collapse both to points. This uses the model's full output and generalizes to all phones.

3. Architecture: two complementary layers

3a. Categorical layer (the feature model + symbolic alignment): exists, needs the diphthong retrain

Model emits 26-d learned features per frame; CTC aligns to phone anchors (incl. oʊ as one anchor, the composite blend, for alignment); decode → phone sequence.
A glided diphthong → the model emits oʊ (one token). A separated production (Tidewater "boh-oot") → the model emits o, ʊ (two monophthongs it has). A monophthongized one → o alone.
alignWPER(canonical, produced) against the one-token canonical oʊ makes the pattern fall straight out of the edit alignment — no dynamics machinery needed here:
produced oʊ → match (correct diphthong)
produced o ʊ → substitution + insertion (separation)
produced o → substitution/deletion (monophthongization)
Per-position deviation = cos_dist over the learned vectors at committed frames.

This layer answers which segments and catches separation/monophthong categorically through alignment. It is the immediate diphthong correction (§4).
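How the three patterns fall out of an edit alignment can be sketched with a plain Levenshtein alignment. This is an illustration only: `align` and `read_diphthong` are hypothetical stand-ins, and the real alignWPER may weight or tie-break operations differently.

```python
def align(canonical, produced):
    """Unit-cost Levenshtein alignment; returns (op, canonical_tok, produced_tok) tuples."""
    n, m = len(canonical), len(produced)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (canonical[i - 1] != produced[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (canonical[i - 1] != produced[j - 1]):
            op = "match" if canonical[i - 1] == produced[j - 1] else "substitution"
            ops.append((op, canonical[i - 1], produced[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            ops.append(("insertion", None, produced[j - 1]))
            j -= 1
        else:
            ops.append(("deletion", canonical[i - 1], None))
            i -= 1
    return ops[::-1]


def read_diphthong(canonical, produced, diphthong="oʊ"):
    """Classify the diphthong slot from the edit operations that touch it."""
    ops = align(canonical, produced)
    for k, (op, canon_tok, _) in enumerate(ops):
        if canon_tok != diphthong:
            continue
        if op == "match":
            return "glided"
        neighbours = ops[max(k - 1, 0):k] + ops[k + 1:k + 2]
        if op == "substitution" and any(o == "insertion" for o, _, _ in neighbours):
            return "separated"
        return "monophthongized"
    return "no diphthong target"


for produced in (["b", "oʊ", "t"], ["b", "o", "ʊ", "t"], ["b", "o", "t"]):
    print(produced, "→", read_diphthong(["b", "oʊ", "t"], produced))
```

The classification reads only the operation types around the one-token canonical slot, which is the sense in which no dynamics machinery is needed at this layer.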

3b. Dynamic / trajectory layer (formant space: Hillenbrand + acoustic.py): the acoustic-residual layer, in its right shape

Point-distance is blind to time-course. The dynamic layer scores the movement, grounded in real acoustics with tools we already have:

Reference trajectory = Hillenbrand's measured F1–F3 time-course per vowel/diphthong (it samples formants at multiple points across each vowel, incl. eɪ/oʊ; the dynamic data lives in bigdata.dat, already used by feature learning's load_formant_distances). Empirical, not invented.
Produced trajectory = acoustic.py (Parselmouth, Praat-validated, PHON-130) measures F1–F3/F0 over the segment located by the categorical alignment.
Score = trajectory comparison (time-normalize or DTW): a glide is a smooth F-transition; "boh-oot" is two F-plateaus with a step; a monophthong is flat. Later, VOT/aspiration and affricate stop+frication structure read the same way.

This is the acoustic residual the design always reserved (never parked), realized correctly. The standalone /dev/acoustic (PHON-130) was the wrong shape, not the wrong substrate: a raw F1 number in isolation is meaningless, but a formant trajectory scored against a reference and fused into the deviation is the dynamic signal. It generalizes to every phone because every phone has a formant trajectory.
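To make the three shapes concrete, here is a sketch of the time-normalize option only: both tracks are resampled onto a common number of points and compared by RMS difference. The F2 values are invented for illustration, `resample` and `rms_distance` are hypothetical names, and this is not the metric the build finally adopted (§8 settles on directed excursion).

```python
import math


def resample(track, n=11):
    """Linearly interpolate a list of Hz values onto n evenly spaced time points."""
    if len(track) == 1:
        return [track[0]] * n
    out = []
    for k in range(n):
        pos = k * (len(track) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(track) - 1)
        frac = pos - lo
        out.append(track[lo] * (1 - frac) + track[hi] * frac)
    return out


def rms_distance(produced, reference, n=11):
    """RMS difference in Hz between two tracks after time normalization."""
    p, r = resample(produced, n), resample(reference, n)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, r)) / n)


# Made-up F2 tracks (Hz); durations differ on purpose.
reference = [1200 - 50 * i for i in range(7)]   # smooth fall 1200 → 900
glide = [1200 - 25 * i for i in range(13)]      # same fall, produced more slowly
step = [1200] * 6 + [900] * 6                   # two plateaus with a step
flat = [1200] * 10                              # monophthongized

d_glide = rms_distance(glide, reference)
d_step = rms_distance(step, reference)
d_flat = rms_distance(flat, reference)
print(d_glide, d_step, d_flat)
```

Time normalization removes the duration difference, so the slow glide matches the reference while the stepped and flat productions do not; DTW would additionally tolerate non-uniform timing.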

Division of labor

Categorical layer → segmental identity + deviation + the categorical glide/separation/monophthong read (via alignment). Carries oʊ-as-unit after the retrain.
Trajectory layer → the dynamics (how the formants moved), grounded in Hillenbrand + acoustic.py. Diphthongs first; VOT/affricates/vowel-inherent-change generalize for free.

4. The diphthong correction (concrete)

Inventory — 5 diphthong composite anchors, normalized convex blend V_oʊ = (α·v_o + β·v_ʊ)/(α+β) in [0,1], on the un-pooled geometry. 58→63. build_diphthong_anchors.py → vectors_63_unpooled.csv. ✅ DONE.
Tokenizer — DIPHTHONGS added to TWO + INV58/INV40 in lib_labels; to_expanded('B OW T') → [b, oʊ, t]. ✅ DONE.
Relabel — union manifest stores only split tokens, so re-extract from raw sources through the fixed lib_labels (clips cached/reused) → glided diphthongs become oʊ.
Retrain — 63 anchors + relabeled union, combined with the un-pooled/freed-prior geometry (one proper next model; supersedes model_feat_unpooled, which is diphthong-blind). RunPod A40 (runpod/.pod_env). Validate diphthong capture like retroflexes: does a glide emit oʊ, a separation o ʊ?
Scoring — oʊ is now an anchor → feature_emitter scores it (no skip). The categorical separation read is just alignWPER. No new categorical code.

5. Trajectory layer build (3b): BUILT + VALIDATED 2026-06-13

Shipped (research branch research/phon-130-acoustic-analysis):
- References: build_excursion_refs.py → diphthong_excursion_refs.json (bundled at packages/audio/src/phonolex_audio/data/). Hillenbrand carries only eɪ/oʊ (the small glides §8 hands to the categorical layer), so all five directed-excursion refs are built from glided GOLD union clips (clean+L2 short clips, central-60% voiced, group women) through the SAME acoustic.py apparatus the scorer uses, which keeps the comparison apples-to-apples. Directions are phonetically correct (eɪ/ɔɪ front via +F2; oʊ/aʊ back via −F2; aɪ raises+fronts via −F1/+F2).
- Scorer: packages/audio/src/phonolex_audio/trajectory.py score_dynamics(). For each diphthong target the categorical alignment located, it spans the vowel between neighboring located-segment centers. CTC is peaky: it commits ~1 frame to the diphthong slot, which is far too narrow, whereas the inter-neighbor region adapts to the local phone rate. It then measures the produced F1/F2 track over that window (same smoothing + central-60% trim as the refs) and computes glide_realized = produced_disp / ref_disp and direction_cos. acoustic.py runs at most once per clip and ONLY when a diphthong slot has committed frames (no Parselmouth cost on non-diphthong words).
- Wiring: FeatureEmitter.review(..., group=) calls it and attaches a dynamic block to each diphthong per_position entry. FeatureEmitter now takes a configurable vectors_csv (+ $PHONOLEX_FEATURE_VECTORS, --feature-vectors) so the 63-anchor diphthong keeper, whose state.pt and vectors.csv must travel together, loads in serving without touching the packaged 58-anchor default.
- Reliability gate: reliable=true only for large-glide aɪ/aʊ/ɔɪ (§8). The small-glide eɪ/oʊ carry reliable=false: the dynamic read is reported but the categorical layer is authoritative there. Decision-support framing; a span with too few voiced formant frames returns measured=false (no fabricated number).
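The two numbers the scorer reports can be sketched as follows. This is not the shipped score_dynamics(): it assumes displacement is the Euclidean start-to-end vector in (F1, F2) Hz after the central-60% trim, skips smoothing, and uses a hypothetical `min_frames` threshold; the track values are made up.

```python
import math

RELIABLE = {"aɪ", "aʊ", "ɔɪ"}  # large-glide diphthongs, per the reliability gate


def central_trim(track, keep=0.6):
    """Keep the central fraction of frames (central-60% by default)."""
    n = len(track)
    drop = int(round(n * (1 - keep) / 2))
    return track[drop:n - drop]


def excursion(track):
    """Start-to-end displacement (dF1, dF2) in Hz of a list of (F1, F2) frames."""
    (f1_start, f2_start), (f1_end, f2_end) = track[0], track[-1]
    return (f1_end - f1_start, f2_end - f2_start)


def score_excursion(diphthong, produced_track, ref_disp, min_frames=5):
    """Return glide_realized and direction_cos, or measured=False when the span is too short."""
    trimmed = central_trim(produced_track)
    ref_mag = math.hypot(*ref_disp)
    if len(trimmed) < min_frames or ref_mag == 0:
        return {"measured": False}
    prod = excursion(trimmed)
    prod_mag = math.hypot(*prod)
    dot = prod[0] * ref_disp[0] + prod[1] * ref_disp[1]
    direction_cos = dot / (prod_mag * ref_mag) if prod_mag > 0 else 0.0
    return {
        "measured": True,
        "glide_realized": prod_mag / ref_mag,
        "direction_cos": direction_cos,
        "reliable": diphthong in RELIABLE,
    }


def linear_track(d_f1, d_f2, frames=20):
    """Made-up track moving linearly from (800, 1200) Hz by (d_f1, d_f2)."""
    return [(800 + d_f1 * i / (frames - 1), 1200 + d_f2 * i / (frames - 1)) for i in range(frames)]


# The reference goes through the same trim as the produced track.
ref_disp = excursion(central_trim(linear_track(-300, 600)))   # aɪ: −F1 / +F2

full = score_excursion("aɪ", linear_track(-300, 600), ref_disp)
reduced = score_excursion("aɪ", linear_track(-60, 120), ref_disp)
too_short = score_excursion("aɪ", linear_track(-300, 600, frames=3), ref_disp)
print(full, reduced, too_short)
```

A reduced production keeps the direction but loses magnitude, which is the pattern the validation table in §6 reports for aɪ.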

Unit coverage: packages/audio/tests/test_trajectory.py (model-free — synthetic track + per_position).

6. Validation: DONE 2026-06-13

validate_dynamics.py (63-anchor keeper, L2 test clips, group women), glided gold vs monophthongized-to-onset gold, median over n≤18/cohort:

diphthong	glided (realized / dir_cos)	reduced (realized / dir_cos)	reads
aɪ (reliable)	1.00 / +0.98	0.18 / +0.98 (n=2)	magnitude collapses, direction holds — clean separation ✓
aʊ (reliable)	0.60 / +0.90	(no reduced sample)	glide present, correct direction
eɪ (cat-owned)	0.29 / +0.91	0.27 / +0.95	no separation → categorical owns it ✓
oʊ (cat-owned)	0.23 / +0.79	0.46 / +0.10	no separation → categorical owns it ✓

The directed-excursion metric separates glided from reduced for the large-glide diphthong (aɪ 1.00 vs 0.18, direction stable — the magnitude deficit IS the reduction signal §8 predicted), and the reliable=false flag fires exactly where the metric doesn't separate (eɪ/oʊ), handing those to the categorical layer. (aʊ/ɔɪ reduced are sparse in the L2 test split; the offline probe Δdisp evidence + aɪ's decisive separation carry the large-glide claim.) Categorical metrics: no regression — same keeper, the dynamic layer is additive and runs only on diphthong slots.

7. What this is NOT

Not L1 prediction (the clinician knows the L1; attribution engine is back-pocket).
Not a standalone formant display (PHON-130's mistake); formants only as trajectory-vs-reference fused into scoring.
Not diagnosis — decision support, human-in-the-loop.

8. Metric: DECIDED off evidence (directed excursion, contrast-dependent)

Not magnitude-vs-monophthong (trajectory_ref.py: diphthongs do NOT move more than monophthongs — ʊ/ʌ/æ/ɔ 357–469 Hz exceed eɪ/oʊ 205–219; American monophthongs are heavily dynamic). The decisive evidence is trajectory_probe.py (acoustic.py smoothed F1/F2 over the central voiced portion, glided vs gold-reduced clips):

diphthong	Δdisp glided−reduced	direction (netF2)	metric verdict
aɪ	+194 Hz	flip (+77→−28, fronting lost)	excursion + direction — clear
aʊ	+321 Hz (n=5)	−144	excursion clear
eɪ	+141 Hz	noisy (n=12)	moderate
oʊ	+85 Hz	no help (−70→−67)	too small — categorical owns it

So the metric is directed excursion (start→end displacement + net F1/F2 direction along the reference glide), and which contrasts it carries is contrast-specific. For large-glide diphthongs (aɪ/aʊ), reduction shows up as a measurable excursion+direction deficit. For small-glide diphthongs (oʊ/eɪ), reduction is below the formant-noise floor, so the categorical layer (model emits oʊ vs o, ~0.5 capture) owns those, plus all separations (via alignment). Two complementary detectors, each owning what it's empirically good at. Caveats: raw per-frame path-length is jitter-dominated, so use smoothed displacement; region location in the probe was the central voiced portion (crude), and the real scorer should locate the diphthong via the model's frame alignment. References are saved to formant_trajectory_refs.json.

8b. Open questions

RESOLVED in the build:
- Trajectory space → formant (Hillenbrand-grounded apparatus + acoustic.py), confirmed: the produced/reference excursion comparison in F1/F2 separates the reliable contrasts (§6).
- Region location → inter-neighbor span, not the peaky single committed frame. CTC commits ~1 frame to a diphthong slot (diag_span.py); the vowel is spanned between neighboring located-segment centers.
- oʊ anchor reachable? Yes for categorical capture (keeper: aɪ/aʊ/ɔɪ strong, oʊ/eɪ ~0.55); for the DYNAMIC read oʊ/eɪ are reliable=false regardless (small glide, §6).
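The inter-neighbor span can be pictured with a short sketch. It assumes each located segment is summarized by one center frame index; the fallback to the clip edges for a word-initial or word-final diphthong is an assumption of this sketch, and `inter_neighbour_span` is a hypothetical name.

```python
def inter_neighbour_span(centres, slot, n_frames):
    """Frame window for the segment at `slot`, bounded by its neighbours' centres.

    centres  : centre frame index of each located segment, in order
    slot     : index of the diphthong within `centres`
    n_frames : total frames in the clip (used only at the word edges)
    """
    left = centres[slot - 1] if slot > 0 else 0
    right = centres[slot + 1] if slot + 1 < len(centres) else n_frames - 1
    return left, right


# Illustrative centres for b, oʊ, t: the diphthong itself holds ~1 committed frame (30),
# but the window handed to the formant tracker runs between its neighbours.
span = inter_neighbour_span([12, 30, 51], slot=1, n_frames=64)
print(span)
```

Because the window is bounded by the neighbors rather than a fixed width, it widens for slow speech and narrows for fast speech.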

REMAINING:
- Consonant reference trajectories (VOT, affricates): Hillenbrand is vowels only; needs another source or model-aggregated references. Diphthongs/vowels first.
- Group coverage: the excursion refs are women-keyed. Child clips currently score against women refs (direction is group-robust; magnitude scales); a children ref set is the clean follow-up when child dynamic scoring matters.
- Reduced aʊ/ɔɪ are sparse in the L2 test split: confirm separation on a larger or synthesized reduced set before leaning on those two contrasts.