v6 Audio: Trajectory-Aware Scoring + the Diphthong Correction
SUPERSEDED 2026-06-13 by
docs/superpowers/plans/2026-06-13-all-phones-are-trajectories.md. This doc encodes a two-tier "categorical anchor + formant-trajectory" design that preserved the single-vector collapse and a categorical anchor tier that were never asked for. Kept only as a record of the wrong turn. Read the plan, not this.
Date: 2026-06-13
Status: SUPERSEDED (see banner). Design (converged in conversation 2026-06-12/13; written down because the
last diphthong-as-trajectory intent lived only in memory and got dropped — §1).
Scope: within v6 = faithful feature transcriber → symbolic error classification →
product tab (project_v6_scope_transcriber_only). This is the scoring/representation
layer. NOT L1 prediction (attribution is back-pocket); decision-support, never diagnosis.
1. The problem
Diphthongs were dropped entirely. lib_labels._tokenize had TWO = {tʃ, dʒ} and
an inventory with no diphthongs, so oʊ/eɪ/aɪ/aʊ/ɔɪ matched neither the
two-char set nor a single inventory entry and fell through to component chars →
o, ʊ. Split by omission, everywhere: training labels split them; the scorer marked
the atomic lexicon-canonical oʊ in_inventory:false and skipped it ("not in
inventory" greyed cell). A diphthong word's nucleus — the point of the word — went
unscored.
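The fall-through described above can be reproduced with a minimal sketch. It assumes the tokenizer does a greedy two-character lookup before falling back to single characters; `tokenize`, `AFFRICATES`, and the local `DIPHTHONGS` set are hypothetical stand-ins, not the actual `lib_labels` code.

```python
# Hypothetical stand-in for the tokenizer behavior; not the real lib_labels code.
AFFRICATES = {"tʃ", "dʒ"}
DIPHTHONGS = {"oʊ", "eɪ", "aɪ", "aʊ", "ɔɪ"}


def tokenize(ipa: str, two_char_units: set) -> list:
    """Greedy match: try a two-character unit first, else emit one character."""
    tokens = []
    i = 0
    while i < len(ipa):
        pair = ipa[i:i + 2]
        if len(pair) == 2 and pair in two_char_units:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(ipa[i])
            i += 1
    return tokens


# Without diphthongs in the two-character set, the nucleus splits into its parts.
before = tokenize("boʊt", AFFRICATES)
# With them, the diphthong survives as one unit.
after = tokenize("boʊt", AFFRICATES | DIPHTHONGS)
print(before)  # ['b', 'o', 'ʊ', 't']
print(after)   # ['b', 'oʊ', 't']
```

Nothing in the first call rejects the diphthong explicitly; it is split simply because no entry matches it, which is the "split by omission" failure.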
The deeper issue the diphthong exposed: the model emits continuous per-frame features, a trajectory through the 26-d space, which was the whole v6 design ("predict continuous features per frame"). But the scoring collapses each phone to a single anchor point and measures point-distance, discarding the dynamics the model already computed. That collapse is invisible for steady-ish phones and catastrophic for diphthongs (whose entire identity is movement), but it loses real signal for every phone (VOT/aspiration timing, affricate stop+frication structure, vowel inherent spectral change). The point-anchor scoring was a simplification of the continuous-feature vision; diphthongs are where the seam tore.
2. Principle: all phones are trajectories
A /o/ has formant movement and transitions; an /oʊ/ has more; a stop is
closure→burst→release. Every phone is a path through feature space over its frames; the
single feature vector is a centroid summary we already accept for all of them. So a
diphthong needs no special category — it's a phone whose trajectory is non-flat. The
composite (α·onset + β·offset, α=1.43/β=0.54 → oʊ ≈ 73% o / 27% ʊ) is just its
centroid, exactly like any monophthong's vector. Treating diphthongs as a different kind
of object requiring bespoke machinery was the error.
Two consequences:
- Categorically, oʊ is one unit / one label / one anchor — like any phone.
- Dynamically, scoring should compare the emitted trajectory to a reference
trajectory, not collapse both to points. This uses the model's full output and
generalizes to all phones.
3. Architecture: two complementary layers
3a. Categorical layer (the feature model + symbolic alignment): exists, needs the diphthong retrain
- Model emits 26-d learned features per frame; CTC aligns to phone anchors (incl. oʊ as one anchor, the composite blend, for alignment); decode → phone sequence.
- A glided diphthong → the model emits oʊ (one token). A separated production (Tidewater "boh-oot") → the model emits o, ʊ (two monophthongs it has). A monophthongized one → o alone.
- alignWPER(canonical, produced) against the one-token canonical oʊ makes the pattern fall straight out of the edit alignment, with no dynamics machinery needed here:
  - produced oʊ → match (correct diphthong)
  - produced o ʊ → substitution + insertion (separation)
  - produced o → substitution/deletion (monophthongization)
- Per-position deviation = cos_dist over the learned vectors at committed frames.
This layer answers which segments and catches separation/monophthong categorically through alignment. It is the immediate diphthong correction (§4).
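The three alignment outcomes listed above can be sketched with a plain unit-cost Levenshtein alignment. This is an illustrative stand-in, not alignWPER itself, whose weighting and tie-breaking are not described here; `align` and `non_match_ops` are hypothetical names.

```python
from typing import List, Optional, Tuple

Op = Tuple[str, Optional[str], Optional[str]]


def align(canonical: List[str], produced: List[str]) -> List[Op]:
    """Unit-cost edit alignment; returns (op, canonical_token, produced_token)."""
    n, m = len(canonical), len(produced)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(canonical[i - 1] != produced[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner, preferring the diagonal step.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(canonical[i - 1] != produced[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                ops.append(("sub" if mismatch else "match",
                            canonical[i - 1], produced[j - 1]))
                i, j = i - 1, j - 1
                continue
        if j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            ops.append(("ins", None, produced[j - 1]))
            j -= 1
        else:
            ops.append(("del", canonical[i - 1], None))
            i -= 1
    return ops[::-1]


def non_match_ops(canonical: List[str], produced: List[str]) -> List[str]:
    """Sorted list of the edit operations that are not matches."""
    return sorted(op for op, _, _ in align(canonical, produced) if op != "match")


canonical = ["b", "oʊ", "t"]
print(non_match_ops(canonical, ["b", "oʊ", "t"]))      # []
print(non_match_ops(canonical, ["b", "o", "ʊ", "t"]))  # ['ins', 'sub']
print(non_match_ops(canonical, ["b", "o", "t"]))       # ['sub']
```

Which of the two produced tokens is labeled the substitution and which the insertion depends on tie-breaking in the backtrace; the sketch only relies on the pair of operations appearing together.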
3b. Dynamic / trajectory layer (formant space: Hillenbrand + acoustic.py): the acoustic-residual layer, in its right shape
Point-distance is blind to time-course. The dynamic layer scores the movement, grounded in real acoustics with tools we already have:
- Reference trajectory = Hillenbrand's measured F1–F3 time-course per vowel/diphthong (it samples formants at multiple points across each vowel, incl. eɪ/oʊ; the dynamic data lives in bigdata.dat, already used by the feature-learning's load_formant_distances). Empirical, not invented.
- Produced trajectory = acoustic.py (Parselmouth, Praat-validated, PHON-130) measures F1–F3/F0 over the segment located by the categorical alignment.
- Score = trajectory comparison (time-normalize or DTW): a glide is a smooth F-transition; "boh-oot" is two F-plateaus with a step; a monophthong is flat; later, VOT/aspiration and affricate stop+frication structure read the same way.
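The time-normalized option in the score item can be sketched as follows. It assumes each trajectory is a (frames, formants) array in Hz and uses an RMS distance after resampling both tracks to the same number of points. The function names and the synthetic tracks are illustrative only; they are not Hillenbrand data or acoustic.py output, and the DTW variant is not shown.

```python
import numpy as np


def time_normalize(track, n_points: int = 20) -> np.ndarray:
    """Resample a (frames, formants) track onto n_points in normalized time."""
    track = np.asarray(track, dtype=float)
    src = np.linspace(0.0, 1.0, len(track))
    dst = np.linspace(0.0, 1.0, n_points)
    columns = [np.interp(dst, src, track[:, k]) for k in range(track.shape[1])]
    return np.column_stack(columns)


def trajectory_rms(produced, reference, n_points: int = 20) -> float:
    """RMS Euclidean distance (Hz) between two time-normalized tracks."""
    p = time_normalize(produced, n_points)
    r = time_normalize(reference, n_points)
    return float(np.sqrt(np.mean(np.sum((p - r) ** 2, axis=1))))


# Synthetic (F1, F2) tracks in Hz, made up for illustration.
glide_ref = np.linspace([500.0, 1000.0], [400.0, 900.0], 30)
glide_prod = np.linspace([500.0, 1000.0], [400.0, 900.0], 17)   # same glide, fewer frames
step = np.vstack([np.tile([500.0, 1000.0], (10, 1)),            # two plateaus with a step
                  np.tile([400.0, 900.0], (10, 1))])
flat = np.tile([500.0, 1000.0], (20, 1))                        # monophthong-like

for name, track in [("glide", glide_prod), ("step", step), ("flat", flat)]:
    print(name, round(trajectory_rms(track, glide_ref), 1))
```

Time normalization removes duration differences, so the same glide sampled at a different frame count scores near zero, while the stepped and flat shapes score progressively farther from the reference.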
This is the acoustic residual the design always reserved (never parked), realized
correctly. The standalone /dev/acoustic (PHON-130) was the wrong shape, not the wrong
substrate: a raw F1 number in isolation is meaningless, but a formant trajectory scored
against a reference, fused into the deviation, is the dynamic signal. It generalizes to
every phone because every phone has a formant trajectory.
Division of labor
- Categorical layer → segmental identity + deviation + the categorical glide/separation/monophthong read (via alignment). Carries oʊ-as-unit after the retrain.
- Trajectory layer → the dynamics (how the formants moved), grounded in Hillenbrand + acoustic.py. Diphthongs first; VOT/affricates/vowel-inherent-change generalize for free.
4. The diphthong correction (concrete)
- Inventory: 5 diphthong composite anchors, normalized convex blend V_oʊ = (α·v_o + β·v_ʊ)/(α+β) in [0,1], on the un-pooled geometry. 58→63. build_diphthong_anchors.py → vectors_63_unpooled.csv. ✅ DONE.
- Tokenizer: DIPHTHONGS added to TWO + INV58/INV40 in lib_labels; to_expanded('B OW T') → [b, oʊ, t]. ✅ DONE.
- Relabel: union manifest stores only split tokens, so re-extract from raw sources through the fixed lib_labels (clips cached/reused) → glided diphthongs become oʊ.
- Retrain: 63 anchors + relabeled union, combined with the un-pooled/freed-prior geometry (one proper next model; supersedes model_feat_unpooled, which is diphthong-blind). RunPod A40 (runpod/.pod_env). Validate diphthong capture like retroflexes: does a glide emit oʊ, a separation o ʊ?
- Scoring: oʊ is now an anchor → feature_emitter scores it (no skip). The categorical separation read is just alignWPER. No new categorical code.
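The normalized convex blend in the inventory item works out as below, using the α=1.43 and β=0.54 weights quoted in §2. The 4-d onset and offset vectors are toy values chosen for illustration (the real anchors are 26-d), and `diphthong_anchor` is a hypothetical name, not the function in build_diphthong_anchors.py.

```python
import numpy as np

ALPHA, BETA = 1.43, 0.54  # onset / offset weights quoted in §2


def diphthong_anchor(v_onset, v_offset, alpha: float = ALPHA, beta: float = BETA) -> np.ndarray:
    """Normalized convex blend: (alpha * onset + beta * offset) / (alpha + beta)."""
    v_onset = np.asarray(v_onset, dtype=float)
    v_offset = np.asarray(v_offset, dtype=float)
    return (alpha * v_onset + beta * v_offset) / (alpha + beta)


# Effective mixing weights after normalization.
w_onset = ALPHA / (ALPHA + BETA)
w_offset = BETA / (ALPHA + BETA)
print(round(w_onset, 3), round(w_offset, 3))  # 0.726 0.274

# Toy 4-d feature vectors in [0, 1]; NOT real anchor values.
v_o = np.array([0.9, 0.1, 0.5, 0.0])
v_u = np.array([0.7, 0.3, 0.9, 1.0])
v_ou = diphthong_anchor(v_o, v_u)
print(v_ou)
```

Because the weights are divided by their sum, the blend is a weighted average: each component lies between the onset and offset values, so inputs in [0,1] give an anchor in [0,1].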
5. Trajectory layer build (3b): BUILT + VALIDATED 2026-06-13
Shipped (research branch research/phon-130-acoustic-analysis):
- References — build_excursion_refs.py → diphthong_excursion_refs.json (bundled
at packages/audio/src/phonolex_audio/data/). Hillenbrand carries only eɪ/oʊ (the
small glides §8 hands to the categorical layer), so all five directed-excursion refs are
built from glided GOLD union clips (clean+L2 short clips, central-60% voiced, group
women) through the SAME acoustic.py apparatus the scorer uses — apples-to-apples.
Directions are phonetically correct (eɪ/ɔɪ front via +F2; oʊ/aʊ back via −F2;
aɪ raises+fronts via −F1/+F2).
- Scorer — packages/audio/src/phonolex_audio/trajectory.py score_dynamics(). For
each diphthong target the categorical alignment located, it spans the vowel between
neighboring located-segment centers (CTC is peaky — it commits ~1 frame to the
diphthong slot, far too narrow; the inter-neighbor region adapts to the local phone
rate), measures the produced F1/F2 track over that window (same smoothing + central-60%
trim as the refs), and computes glide_realized = produced_disp / ref_disp and
direction_cos. acoustic.py runs at most once per clip and ONLY when a diphthong slot
has committed frames (no Parselmouth cost on non-diphthong words).
- Wiring — FeatureEmitter.review(..., group=) calls it and attaches a dynamic
block to each diphthong per_position entry. FeatureEmitter now takes a configurable
vectors_csv (+ $PHONOLEX_FEATURE_VECTORS, --feature-vectors) so the 63-anchor
diphthong keeper — whose state.pt and vectors.csv must travel together — loads in
serving without touching the packaged 58-anchor default.
- Reliability gate — reliable=true only for large-glide aɪ/aʊ/ɔɪ (§8). The
small-glide eɪ/oʊ carry reliable=false: the dynamic read is reported but the
categorical layer is authoritative there. Decision-support framing; a span with too few
voiced formant frames returns measured=false (no fabricated number).
Unit coverage: packages/audio/tests/test_trajectory.py (model-free — synthetic track +
per_position).
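The two quantities the scorer reports, glide_realized = produced_disp / ref_disp and direction_cos, can be sketched on a synthetic track. This is a simplified illustration, not the shipped score_dynamics(): the edge-mean definition of displacement, the `min_frames` threshold, and all function names are assumptions, and the smoothing step is omitted.

```python
import numpy as np


def central_trim(track: np.ndarray, keep: float = 0.60) -> np.ndarray:
    """Keep the central `keep` fraction of frames (central-60% trim)."""
    n = len(track)
    drop = int(round(n * (1.0 - keep) / 2.0))
    return track[drop:n - drop]


def excursion(track: np.ndarray, edge: int = 3) -> np.ndarray:
    """Start-to-end (F1, F2) displacement vector; edge means are an assumption."""
    return track[-edge:].mean(axis=0) - track[:edge].mean(axis=0)


def dynamic_block(produced_track, ref_excursion, min_frames: int = 8) -> dict:
    """Compare a produced formant track to a reference excursion vector."""
    track = central_trim(np.asarray(produced_track, dtype=float))
    if len(track) < min_frames:          # too few frames: report nothing
        return {"measured": False}
    prod = excursion(track)
    ref = np.asarray(ref_excursion, dtype=float)
    prod_disp = float(np.linalg.norm(prod))
    ref_disp = float(np.linalg.norm(ref))
    if ref_disp == 0.0:
        return {"measured": False}
    cos = float(prod @ ref / (prod_disp * ref_disp)) if prod_disp > 0 else 0.0
    return {"measured": True,
            "glide_realized": prod_disp / ref_disp,
            "direction_cos": cos}


# Synthetic (F1, F2) tracks in Hz; illustrative only, not measured references.
full = np.linspace([700.0, 1200.0], [400.0, 2000.0], 40)
half = np.linspace([700.0, 1200.0], [550.0, 1600.0], 40)   # same direction, half the movement
flat = np.tile([700.0, 1200.0], (40, 1))
ref = excursion(central_trim(full))

print(dynamic_block(full, ref))
print(dynamic_block(half, ref))
print(dynamic_block(flat, ref))
print(dynamic_block(full[:5], ref))   # too short: measured is False
```

A reduced glide shows up as a glide_realized well below 1 with the direction still aligned, which is the pattern the reliability gate relies on for the large-glide diphthongs.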
6. Validation: DONE 2026-06-13
validate_dynamics.py (63-anchor keeper, L2 test clips, group women), glided gold vs
monophthongized-to-onset gold, median over n≤18/cohort:
| diphthong | glided (realized / dir_cos) | reduced | reads |
|---|---|---|---|
| aɪ (reliable) | 1.00 / +0.98 | 0.18 / +0.98 (n=2) | magnitude collapses, direction holds — clean separation ✓ |
| aʊ (reliable) | 0.60 / +0.90 | (no reduced sample) | glide present, correct direction |
| eɪ (cat-owned) | 0.29 / +0.91 | 0.27 / +0.95 | no separation → categorical owns it ✓ |
| oʊ (cat-owned) | 0.23 / +0.79 | 0.46 / +0.10 | no separation → categorical owns it ✓ |
The directed-excursion metric separates glided from reduced for the large-glide
diphthong (aɪ 1.00 vs 0.18, direction stable — the magnitude deficit IS the reduction
signal §8 predicted), and the reliable=false flag fires exactly where the metric
doesn't separate (eɪ/oʊ), handing those to the categorical layer. (aʊ/ɔɪ reduced are
sparse in the L2 test split; the offline probe Δdisp evidence + aɪ's decisive separation
carry the large-glide claim.) Categorical metrics: no regression — same keeper, the
dynamic layer is additive and runs only on diphthong slots.
7. What this is NOT
- Not L1 prediction (the clinician knows the L1; attribution engine is back-pocket).
- Not a standalone formant display (PHON-130's mistake); formants only as trajectory-vs-reference fused into scoring.
- Not diagnosis — decision support, human-in-the-loop.
8. Metric, DECIDED off evidence: directed excursion, contrast-dependent
Not magnitude-vs-monophthong (trajectory_ref.py: diphthongs do NOT move more than
monophthongs — ʊ/ʌ/æ/ɔ 357–469 Hz exceed eɪ/oʊ 205–219; American
monophthongs are heavily dynamic). The decisive evidence is trajectory_probe.py
(acoustic.py smoothed F1/F2 over the central voiced portion, glided vs gold-reduced
clips):
| diph | Δdisp glided−reduced | direction (netF2) | metric verdict |
|---|---|---|---|
| aɪ | +194 Hz | flip (+77→−28, fronting lost) | excursion + direction — clear |
| aʊ | +321 Hz (n=5) | −144 | excursion clear |
| eɪ | +141 Hz | noisy (n=12) | moderate |
| oʊ | +85 Hz | no help (−70→−67) | too small — categorical owns it |
So the metric is directed excursion (start→end displacement + net F1/F2 direction
along the reference glide), and which contrasts it carries is contrast-specific:
large-glide diphthongs (aɪ/aʊ) reduction = measurable excursion+direction deficit;
small-glide (oʊ/eɪ) reduction is below the formant-noise floor → the categorical
layer (model emits oʊ vs o, ~0.5 capture) owns those, plus all separations (via
alignment). Two complementary detectors, each owning what it's empirically good at.
Caveats: raw per-frame path-length is jitter-dominated, so use smoothed displacement;
region-location was central-voiced-portion (crude), and the real scorer should locate the
diphthong via the model's frame alignment. References are saved in formant_trajectory_refs.json.
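The path-length caveat can be illustrated on synthetic data. The sketch below adds Gaussian jitter to a made-up linear glide and compares raw path length with start-to-end displacement after a moving-average smooth. The noise level, window width, and function names are assumptions for illustration, not the probe's actual settings.

```python
import numpy as np


def path_length(track: np.ndarray) -> float:
    """Sum of frame-to-frame Euclidean steps (Hz)."""
    return float(np.linalg.norm(np.diff(track, axis=0), axis=1).sum())


def smooth(track: np.ndarray, width: int = 5) -> np.ndarray:
    """Moving average over each formant column (valid region only)."""
    kernel = np.ones(width) / width
    columns = [np.convolve(track[:, k], kernel, mode="valid")
               for k in range(track.shape[1])]
    return np.column_stack(columns)


def displacement(track: np.ndarray) -> float:
    """Start-to-end Euclidean displacement (Hz)."""
    return float(np.linalg.norm(track[-1] - track[0]))


rng = np.random.default_rng(0)
clean = np.linspace([700.0, 1200.0], [500.0, 1420.0], 50)    # synthetic glide
noisy = clean + rng.normal(0.0, 30.0, size=clean.shape)      # assumed 30 Hz jitter

print("path length  clean / noisy:", path_length(clean), path_length(noisy))
print("smoothed displacement clean / noisy:",
      displacement(smooth(clean)), displacement(smooth(noisy)))
```

Jitter adds to every frame-to-frame step, so path length grows with the number of frames regardless of the glide, whereas the smoothed endpoints average the jitter out and the displacement stays close to the clean value.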
8b. Open questions
RESOLVED in the build:
- Trajectory space → formant (Hillenbrand-grounded apparatus + acoustic.py),
confirmed: the produced/reference excursion comparison in F1/F2 separates the reliable
contrasts (§6).
- Region location → inter-neighbor span, not the peaky single committed frame. CTC
commits ~1 frame to a diphthong slot (diag_span.py); the vowel is spanned between
neighboring located-segment centers.
- oʊ anchor reachable? Yes for categorical capture (keeper: aɪ/aʊ/ɔɪ strong, oʊ/eɪ
~0.55); for the DYNAMIC read oʊ/eɪ are reliable=false regardless (small glide, §6).
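The inter-neighbor span from the region-location item can be sketched as a small helper. It assumes the categorical alignment yields one center frame per located segment; `inter_neighbor_span` is a hypothetical name, and the fallback to the clip edges for a word-initial or word-final diphthong is an assumption, not documented behavior.

```python
from typing import List, Tuple


def inter_neighbor_span(centers: List[int], idx: int, n_frames: int) -> Tuple[int, int]:
    """Frame window for segment `idx`: previous segment's center to the next one's.

    Falls back to the clip start/end when there is no neighbor on that side
    (an assumption made for this sketch).
    """
    start = centers[idx - 1] if idx > 0 else 0
    end = centers[idx + 1] if idx + 1 < len(centers) else n_frames - 1
    return start, end


# Center frames for [b, oʊ, t] in a 60-frame clip (made-up numbers).
centers = [10, 25, 41]
print(inter_neighbor_span(centers, 1, 60))  # (10, 41)
```

The window widens or narrows with the distance between the neighboring centers, which is how the span adapts to the local phone rate instead of relying on the single committed frame.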
REMAINING:
- Consonant reference trajectories (VOT, affricates) — Hillenbrand is vowels only; needs
another source or model-aggregated references. Diphthongs/vowels first.
- Group coverage: the excursion refs are women-keyed. Child clips currently score against
women refs (direction is group-robust; magnitude scales) — a children ref set is the
clean follow-up when child dynamic scoring matters.
- Reduced-aʊ/ɔɪ are sparse in the L2 test split — confirm separation on a larger or
synthesized reduced set before leaning on those two contrasts.