v6 Audio: All Phones Are Trajectories (the plan)
Date: 2026-06-13
Status: ✅ RESEARCH COMPLETE — handed off 2026-06-14. Every research phase is done
and validated (see the §6 STATUS block at the bottom). Handed off to productionization
(PHON-150) + gated app reseed (PHON-151). Fresh-agent orientation:
research/2026-06-06-audio-union/README.md.
Supersedes the spec
docs/superpowers/specs/2026-06-13-trajectory-scoring-and-diphthongs.md (the two-tier
"categorical anchor + formant-trajectory" design, committed 2b6463f7), which preserved
the single-vector collapse and a categorical anchor tier that were never asked for.
Sources (transcript, conversation cb336cec): the agreement at idx 2336–2386 ("all
phones are trajectories", "vectors be vectoring") and the metric discussion at idx
2583/2586 (magnitude / shape / temporality).
0. The one-line statement
Every phone is a trajectory through the 26-d articulatory feature space. We rederive the 63 phones as trajectories (replacing the single composite vectors), cast the model's training corpus in those terms and retrain, and score trajectory-to-trajectory. There is no second "categorical" tier, no point-anchor, no using a trained model as the source of the phone definitions. Nothing is collapsed to a centroid.
1. Principle (settled)
- All phones are trajectories. A phone is a path through feature space over its duration. The single 26-d composite vector currently emitted per phone is a collapse to a centroid — a scoring/representation simplification that flattens the path and loses the dynamics. We remove the collapse; the phone stays a trajectory.
- Diphthongs are not special. A diphthong is a phone whose trajectory is non-flat; a monophthong's is flatter. Same object, same machinery. No diphthong special-casing, no separate dynamic layer, no "reliable-only-for-large-glides" gate.
- The model is a consumer, not the source. The phone trajectories are derived from our data (below). A trained audio model is retrained to them; it is never the place the reference trajectories come from.
2. Rederive the 63 phones as trajectories (foundation, local)
Today (packages/features): the Bayesian fit produces phi — one 26-d vector per
segment — and composites = α·phi[onset] + β·phi[offset] collapses each phone to a
single vector (outputs/vectors.csv, 63 rows × 26). Hillenbrand's 8-timepoint formant
data enters only as pairwise trajectory distances that constrain α/β
(model.py §"Phase 3", idx-215 region). The trajectory information is consumed and then
discarded.
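To make the loss concrete, here is a minimal sketch of the collapse. It is not the code in model.py: the function name, the 2-d toy vectors, and the weights α = β = 0.5 are all assumptions for illustration (the real weights are fitted and the real vectors are 26-d). It only shows that two paths running in opposite directions can collapse to the same composite.

```python
from typing import List

Vector = List[float]


def composite(phi_onset: Vector, phi_offset: Vector,
              alpha: float, beta: float) -> Vector:
    """Collapse an onset/offset pair to a single vector:
    alpha * phi[onset] + beta * phi[offset]."""
    if len(phi_onset) != len(phi_offset):
        raise ValueError("onset and offset vectors must have the same length")
    return [alpha * a + beta * b for a, b in zip(phi_onset, phi_offset)]


# Toy 2-d example (hypothetical values): two phones whose paths
# run in opposite directions through the same two endpoints.
rising = composite([0.0, 1.0], [1.0, 0.0], alpha=0.5, beta=0.5)
falling = composite([1.0, 0.0], [0.0, 1.0], alpha=0.5, beta=0.5)

# Both collapse to the same point, so the direction of travel is gone.
same_after_collapse = rising == falling
```

With unequal weights the two composites would differ, but a single point still cannot encode when or how the path moved.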
New: each phone is rederived as an 8-point trajectory in the 26-d feature space —
output shape (63, 8, 26) — the 8 timepoints grounded in Hillenbrand's 8-timepoint
formant trajectories (10–80% of vowel duration) carried through the feature-modeling
apparatus, instead of collapsed to a centroid. 8 follows Hillenbrand's sampling; it can
be a general K.
The geometry MUST be un-pooled (idx 1049–1117). It is fit on perceptual (ECCC) + acoustic (Hillenbrand) evidence only, freed prior — production/corpus confusion is excluded from the geometry. Perceptual confusion and productive substitution are two different generators; pooling them into one distance likelihood produces a compromise faithful to neither, and a scorer that has already absorbed the substitution pattern cannot flag the substitution. The rhotic is the proof (perceptual r/w moderately close; productive r→w dominant by markedness). So the trajectory rederivation is the un-pooled refit extended to per-timepoint paths — it is the foundation, not a superseded step.
DATA SOURCE — RESOLVED (2026-06-13): we have formants for ALL phones, measured.
The all-phones trajectory data is NOT Hillenbrand (a 12-vowel reference table) — it is
acoustic.py (Parselmouth) measuring the time-aligned corpus. Every phone is produced
in real audio with phone-level time alignment (TIMIT .PHN ships gold boundaries; TORGO,
L2-ARCTIC annotations too), so we measure each phone's F1–F3 trajectory over its aligned
span, aggregate across tokens, normalize to 8 points → a measured formant trajectory for
all 63 phones, consonants included (stop closure→burst, fricative frication).
Hillenbrand's 12 vowels are the high-fidelity vowel cross-check, not the source.
- Honest caveat: F1–F3 are clean for vowels/sonorants; obstruents have weak/sparse
formants (unvoiced frames), so their trajectories are noisier and may lean on the broader
spectral track — but they are measurable and all phones are covered. NOT flat.
- This supersedes the earlier "no consonant data → flat" framing (the regression that came
from anchoring on Hillenbrand-the-table instead of the measurement apparatus).
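The "measure, aggregate, normalize to 8 points" step can be sketched as follows. This is a hedged illustration, not acoustic.py: it assumes the per-frame formant track for each aligned token has already been extracted (no Parselmouth calls are shown), that unvoiced frames arrive as None, and that Hillenbrand's 10–80% window is applied to every phone. The function names resample_track and aggregate_tokens are hypothetical.

```python
from typing import List, Optional, Sequence

Track = Sequence[Optional[float]]  # one formant, one token; None = unvoiced frame


def resample_track(track: Track, k: int = 8,
                   lo: float = 0.10, hi: float = 0.80) -> Optional[List[float]]:
    """Linearly interpolate a frame-wise track at k points spaced evenly
    between lo and hi of the token's duration, using voiced frames only.
    Returns None when the token has no voiced frame at all."""
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(track)
    # Relative time of each voiced frame within the aligned span.
    voiced = [(i / (n - 1) if n > 1 else 0.0, v)
              for i, v in enumerate(track) if v is not None]
    if not voiced:
        return None
    out: List[float] = []
    for j in range(k):
        t = lo + (hi - lo) * j / (k - 1)
        if t <= voiced[0][0]:          # before first voiced frame: hold its value
            out.append(voiced[0][1])
        elif t >= voiced[-1][0]:       # after last voiced frame: hold its value
            out.append(voiced[-1][1])
        else:
            for (t0, v0), (t1, v1) in zip(voiced, voiced[1:]):
                if t0 <= t <= t1:
                    w = (t - t0) / (t1 - t0)
                    out.append(v0 + w * (v1 - v0))
                    break
    return out


def aggregate_tokens(tokens: Sequence[Track], k: int = 8) -> Optional[List[float]]:
    """Mean k-point trajectory across tokens, skipping fully unvoiced tokens."""
    resampled = [r for r in (resample_track(t, k) for t in tokens) if r is not None]
    if not resampled:
        return None
    return [sum(col) / len(col) for col in zip(*resampled)]
```

A plain mean is the simplest aggregate; a median or a voicing-weighted mean may suit the noisier obstruent tracks better, which is a choice this sketch does not settle.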
Then the feature-space rederivation: map the measured per-phone formant trajectories into the 26-d articulatory space (the feature apparatus already relates formants↔features), on the un-pooled geometry, giving the 8-point feature-space trajectory phones. Whether that mapping is a Bayesian refit with per-timepoint evidence or a direct construction is the remaining build detail.
Deliverable of this step: all 63 phone trajectories, shown for review, before anything downstream is built or trained against them.
2.1 One vector set: geometry / prior / readout (idx 1400–1414)
The un-pooling raises "do we need two or three sets of phones — perceptual, productive, continuous-productive?" The answer (idx 1406) is one vector set; the three generators are three different kinds of object, only one of which is phones:
Perception ↔ production is the universal pair; both halves are confusability data (ECCC perceptual confusion; L2-ARCTIC + PhonBank production confusion). Un-pooling keeps them in their roles:
- Perception = perceptual confusability (ECCC) = the geometry: the 8-point trajectory phones. Universal: how confusable two sounds are perceptually; one half of the pair. This is the metric. Already built: the un-pooled refit refit_unpooled_full/ vectors_unpooled.csv (refit_experiment.py, 4433fff3), fit on ECCC (adult perceptual) + Hillenbrand/formants, production dropped, freed prior; it also drives the associations edges. One set. Perception is NOT L1/L2; L1 indexing lives on the production side.
- Production = substitution / markedness (corpus: PhonBank child + L2-ARCTIC L1) = a directional prior, P(produced | target, …): a confusion table (ɹ→w ≠ w→ɹ), context-dependent, riding on top of the geometry (no phone positions of its own). This is the other half and where L1 / population indexing lives: the per-L1 fingerprints (per_l1_confusion.py, the seg × L1 table in RESULTS_L1_GEOMETRY.md) + l1Prior. "He merges l/ɹ like a Korean speaker" is read here, off production, not off perception.
- Motor gradient (continuous) = a readout: where a production lands in the one geometry. In a neighboring cluster = categorical substitution; between clusters = gradient distortion. A distance-to-nearest readout, not a separate set.
So: one feature geometry (un-pooled perceptual-confusability, trajectory) + production priors + a gradient readout. A deviation is read across these — confusability-manifold move vs markedness pattern vs between-cluster smear. The etiological labels (L1/L2, developmental, motor) are downstream readings, not the axes, and surfacing them as output stays back-pocket. The un-pooling itself is required regardless, because it is what lets the metric separate "differs from target" from "speakers tend to swap these."
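The three kinds of object can be sketched side by side. Everything concrete here is an assumption for illustration: the 2-d toy trajectories, the cluster radius, the prior values, and the names traj_dist and readout are hypothetical, and mean pointwise Euclidean distance stands in for whatever metric §4 eventually settles on.

```python
import math
from typing import Dict, List, Tuple

Traj = List[List[float]]  # K timepoints x feature dims


def traj_dist(a: Traj, b: Traj) -> float:
    """Mean pointwise Euclidean distance between two equal-length trajectories."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


# Geometry: ONE set of reference trajectories (toy 2-d, 2 timepoints).
geometry: Dict[str, Traj] = {
    "ɹ": [[0.0, 0.0], [0.0, 0.0]],
    "w": [[4.0, 0.0], [4.0, 0.0]],
}

# Production prior: a directional table keyed by (target, produced).
# Placeholder numbers only; note the asymmetry ɹ→w != w→ɹ.
prior: Dict[Tuple[str, str], float] = {("ɹ", "w"): 0.30, ("w", "ɹ"): 0.02}


def readout(produced: Traj, target: str, refs: Dict[str, Traj],
            radius: float) -> Tuple[str, str]:
    """Distance-to-nearest readout: where did the production land?"""
    nearest = min(refs, key=lambda p: traj_dist(produced, refs[p]))
    if traj_dist(produced, refs[nearest]) > radius:
        return ("gradient distortion", nearest)       # between clusters
    if nearest == target:
        return ("on target", nearest)
    return ("categorical substitution", nearest)      # inside a neighbor


on_target = readout([[0.2, 0.0], [0.1, 0.0]], "ɹ", geometry, radius=1.0)
swapped = readout([[3.9, 0.0], [4.0, 0.0]], "ɹ", geometry, radius=1.0)
smeared = readout([[2.0, 0.0], [2.0, 0.0]], "ɹ", geometry, radius=1.0)
```

The prior never moves a phone: it is consulted after the readout, for example to ask how expected the ɹ→w swap is for a given population.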
3. Cast the model corpus in trajectory terms + retrain
Once the phones are trajectories:
- The audio model's job is unchanged in spirit: given audio it emits a per-frame feature trajectory ("vectors be vectoring"), but its targets become the trajectory phones, not single-vector anchors. The training corpus (phone-sequence labels currently resolved to single anchors) is recast in trajectory terms and the model retrained.
- This is the "redo everything" consequence: any future model corpus must be cast in the new phone-trajectory terms before training.
- Retrain is pod work, gated on the rederived phones being right and on your sign-off.
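One way to picture "recast in trajectory terms" is to stretch each phone's K-point reference over the frames of its aligned span, so every frame gets its own target instead of a repeated anchor. This is a sketch under assumptions, not the training code: frame_targets and recast are hypothetical names, linear interpolation is one choice among several, and stretching the reference across the whole span (rather than only the 10–80% window it was sampled from) is a simplification.

```python
from typing import Dict, List, Sequence, Tuple

Traj = List[List[float]]  # K timepoints x feature dims


def frame_targets(ref: Traj, n_frames: int) -> Traj:
    """Stretch a K-point reference trajectory to n_frames per-frame targets
    by linear interpolation along the reference's time axis."""
    k = len(ref)
    if k < 2 or n_frames < 1:
        raise ValueError("need at least 2 reference points and 1 frame")
    out: Traj = []
    for f in range(n_frames):
        # Position on the reference's index axis; a single frame takes the midpoint.
        pos = (k - 1) / 2 if n_frames == 1 else f * (k - 1) / (n_frames - 1)
        i = min(int(pos), k - 2)
        w = pos - i
        out.append([a + w * (b - a) for a, b in zip(ref[i], ref[i + 1])])
    return out


def recast(labels: Sequence[Tuple[str, int]], refs: Dict[str, Traj]) -> Traj:
    """Turn (phone, n_frames) labels into one per-frame target sequence."""
    targets: Traj = []
    for phone, n_frames in labels:
        targets.extend(frame_targets(refs[phone], n_frames))
    return targets


# Toy 1-d references (hypothetical values).
toy_refs = {"a": [[0.0], [1.0], [2.0]], "b": [[10.0], [20.0]]}
toy_targets = recast([("a", 5), ("b", 1)], toy_refs)
```

Under the old scheme the five frames of "a" would all share one anchor; here they sweep from the start of the reference to its end.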
4. Scoring = trajectory-to-trajectory (metric deferred, decided on data)
- Score a produced phone's emitted trajectory against its phone's reference trajectory. Uniform across all 63 phones.
- The metric is a decomposition in articulation space (idx 2583/2586):
  - magnitude: excursion along a direction (under-shoot / reduction: the right path, not far enough);
  - shape: path conformance / curvature;
  - temporality: timing profile (e.g. diphthong separation = two plateaus vs one smooth glide).
- Which component carries a given contrast is contrast-dependent, and the exact decomposition/weighting is deferred — decided on empirical observation (lay real produced trajectories against the references and see which contrasts live in magnitude, which in shape, which in temporality). Not locked now.
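Since the decomposition is deferred, the following is only one candidate way to compute the three components, offered to make the terms concrete. All definitions are assumptions: magnitude as the produced net excursion projected onto the reference excursion, shape as a difference in path straightness, and temporality as the gap between normalized arc-length progress profiles. The names decompose, arc_lengths, and progress are hypothetical.

```python
import math
from typing import Dict, List, Optional

Traj = List[List[float]]  # K timepoints x feature dims


def arc_lengths(t: Traj) -> List[float]:
    """Cumulative path length at each timepoint (first entry is 0)."""
    out = [0.0]
    for p, q in zip(t, t[1:]):
        out.append(out[-1] + math.dist(p, q))
    return out


def progress(t: Traj) -> List[float]:
    """Fraction of the total path covered at each timepoint."""
    cum = arc_lengths(t)
    return [c / cum[-1] for c in cum] if cum[-1] > 0 else [0.0] * len(t)


def straightness(t: Traj) -> float:
    """Net displacement over path length; 1.0 = straight line."""
    total = arc_lengths(t)[-1]
    return math.dist(t[0], t[-1]) / total if total > 0 else 1.0


def decompose(produced: Traj, ref: Traj) -> Dict[str, Optional[float]]:
    ref_disp = [b - a for a, b in zip(ref[0], ref[-1])]
    prod_disp = [b - a for a, b in zip(produced[0], produced[-1])]
    ref_sq = sum(x * x for x in ref_disp)
    # Magnitude is undefined for a flat reference (no excursion to compare against).
    magnitude = (sum(a * b for a, b in zip(prod_disp, ref_disp)) / ref_sq
                 if ref_sq > 0 else None)
    shape = abs(straightness(produced) - straightness(ref))
    gaps = [abs(a - b) for a, b in zip(progress(produced), progress(ref))]
    temporality = sum(gaps) / len(gaps)
    return {"magnitude": magnitude, "shape": shape, "temporality": temporality}


glide = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]        # reference
undershoot = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.5, 0.0]]   # right path, half as far
plateaus = [[0.0, 0.0], [0.0, 0.0], [3.0, 0.0], [3.0, 0.0]]     # two plateaus, one jump

undershoot_score = decompose(undershoot, glide)
plateau_score = decompose(plateaus, glide)
```

In this toy case the undershoot shows up only in magnitude (0.5) and the two-plateau production only in temporality, which is the kind of separation the empirical pass would need to confirm on real trajectories.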
5. Blast radius (consequences to track, not silently absorb)
- Lexicon similarity (packages/web, soft-Levenshtein over the phoneme vectors / dot products) is built on the single-vector phone representation. Rederiving phones as trajectories changes that math too: similarity becomes trajectory-aware. Rebuild + reseed are gated on "happy" (nothing real breaks until then).
- Keepers (foundational, NOT superseded):
  - the un-pooled feature geometry (the packages/features refit on perceptual + acoustic evidence, freed prior, production confusion excluded; 4433fff3/refit_unpooled). This is the base the trajectory rederivation extends (§2). I earlier mislabeled this "superseded" because the audio model's name reused the word "unpooled". That was wrong; the geometry is the foundation.
  - the production priors (per_l1_confusion.py fingerprints, l1Prior): the §2.1 directional channel, riding on top.
  - the diphthong-as-atomic-token tokenizer fix (5d4481ee) + the relabeled union: a phone must be one token to have one trajectory; the relabeled corpus is reusable.
- Superseded: the two-tier scorer (08677638), the diphthong-excursion refs, the reliability gate, and the RunPod retrain (e04b533a → model_feat_diph_unpooled), specifically its single-point-anchor targets. That model was trained on the un-pooled geometry (good) but toward collapsed point anchors (the part that's wrong, per §3/idx 2364). Its weights/init and the corpus are reusable; the point-target geometry is what gets thrown out. Decide whether to revert these commits or leave them as dead research artifacts.
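For the lexicon-similarity consequence, a trajectory-aware soft Levenshtein could look roughly like the sketch below. It does not reproduce the packages/web implementation (which is not shown in this plan): the function names, the clipping of substitution cost to 1.0, the unit insert/delete cost, and the toy references are all assumptions.

```python
import math
from typing import Callable, Dict, List, Sequence

Traj = List[List[float]]  # K timepoints x feature dims


def make_sub_cost(refs: Dict[str, Traj]) -> Callable[[str, str], float]:
    """Substitution cost = mean pointwise trajectory distance, clipped to 1.0."""
    def cost(x: str, y: str) -> float:
        a, b = refs[x], refs[y]
        d = sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)
        return min(1.0, d)
    return cost


def soft_levenshtein(a: Sequence[str], b: Sequence[str],
                     sub_cost: Callable[[str, str], float],
                     indel: float = 1.0) -> float:
    """Edit distance where substituting phone x for y costs sub_cost(x, y)
    instead of a flat 1; inserts and deletes cost indel."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * indel]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + indel,               # delete x
                           cur[j - 1] + indel,            # insert y
                           prev[j - 1] + sub_cost(x, y))) # substitute
        prev = cur
    return prev[-1]


# Toy references (hypothetical): "b" and "c" are close, "a" is far from both.
toy_refs = {
    "a": [[0.0, 0.0], [0.0, 0.0]],
    "b": [[1.0, 0.0], [1.0, 0.0]],
    "c": [[1.0, 0.5], [1.0, 0.5]],
}
cost = make_sub_cost(toy_refs)
near_miss = soft_levenshtein(["a", "b"], ["a", "c"], cost)
```

Only the substitution cost changes relative to the single-vector version; the dynamic-programming recurrence is the standard Levenshtein one.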
6. Settled vs open (for close reading)
Settled:
- all phones are trajectories; one representation, no second categorical tier, no point-anchor scoring;
- the geometry is un-pooled: perceptual + acoustic evidence only, production confusion excluded (it confounds the scorer otherwise);
- one vector set: geometry (the trajectory phones) + production priors (ride on top) + gradient readout, not two or three sets of phones (§2.1);
- rederive the 63 phones as 8-point feature-space trajectories on the un-pooled geometry, replacing the single vectors (no collapse);
- recast the model corpus in trajectory terms and retrain; the model is a consumer, not the source of the phone definitions;
- scoring is trajectory-to-trajectory; deviations attribute across geometry/prior/readout, but attribution-as-output is back-pocket.
The produced side is not a separate question and is not the current
FeatureEmitter: the produced trajectory is simply what the §3-retrained model emits
(per-frame feature vectors = a trajectory), compared to the phone references. The current
wav2vec2 emitter is part of what §3 replaces.
Open (to decide, in order):
1. construction of the 8-point trajectories, esp. consonants (§2);
2. the metric decomposition/weighting, deferred to empirical data (§4).
STATUS (2026-06-14): research finish line reached; next = productionization
Full pipeline built + validated end-to-end:
- §2 representation — DONE: every phone a MEASURED feature trajectory (e[T,26]), not
formants. Full-63 gap-free (52 measured + 11 derived, provenance). full63_trajectories.json.
Viability PROVEN (obstruents 17/17 discriminable, semivowels situate). prove_*.
- §3 retrain — DONE (RunPod A40 ~$1.3, keeper model_feat_traj_target): trajectory-target
span loss; converged win.
- §4 scorer — DONE (score_trajectories.py) + DISCRIMINATIVE Fisher metric (build_refs_fisher.py):
cumulative error catch-rate 19→37%, match-id 60→63%.
- Source attribution — DONE (the disentanglement completion):
  - accent (L1) vs developmental: 100% (attribution_source.py).
  - 4-WAY source classifier (attribution_classifier.py, leave-one-subject-out): 90%;
    child 96% / accent 83% / typical 87% / motor (dysarthria) 100% (consonant-imprecision
    + rate features). ALL DISORDER AXES CLEAN (developmental→accent=0, motor→anything=0);
    only confusion is accent↔typical (clinically benign).
- NEXT = productionization (engineering): wire trajectory scorer + Fisher metric + source
attribution from these research/ scripts into the host/worker/frontend (dev-page →
product-tab). Research edges left: motor n=6 (more TORGO would harden), match-id single-
production-vs-mean ceiling (distributional rescore = future lever).
7. Sequence
- Rederive the phones as 8-point trajectories (local, packages/features). Resolve the §2 construction question, then produce + show all 63.
- Decide the metric empirically: produced trajectories vs references, which contrasts live in magnitude / shape / temporality.
- Recast the corpus + retrain the model to trajectory targets.
- Trajectory scoring in the audio host.
- (Gated on "happy") rebuild lexicon similarity + reseed.