PHON-126: Feature-Vector Graded Error Spike

Status: design approved, plan pending
Branch: feat/phon-126-feature-vector-graded-error (off release/v6-audio)
Parent: PHON-44 Audio workstream
Ticket: PHON-126
Date: 2026-05-28

1. Why

The PHON-53 (v6) audio tool needs graded phoneme-substitution scoring, not binary Levenshtein. An accented-but-acceptable realization should score as a near-miss (variant); a genuine clinical error should score as a real miss. The Berkeley phoneme-similarity recipe (arXiv 2507.14346) frames this as a Weighted PER where substitutions cost 1 − similarity(p₁, p₂). That recipe had to build the similarity matrix; PhonoLex already has a learned 26-d articulatory feature space (r=0.987 against theory-assigned features) powering the live similarity tool.

The risk to probe: those vectors were learned for symbolic similarity, not calibrated to acoustic distance. They may or may not transfer to acoustic error scoring. This spike measures that transfer before we bet the PHON-53 error layer on it.

2. Goal

Determine, with a cheap, controlled experiment, whether PhonoLex's learned feature vectors give a usable graded distance for phoneme substitutions such that:
- Variant-class substitutions (accent / L2 / allophony) score low.
- Error-class substitutions (SSD phonological processes) score high.

Output: a findings.md verdict (pass / pass-with-calibration / fail) that gates the PHON-53 error-layer design.

3. Method

3.1 Ground truth: synthetic known-error

Hand-curated inventory of variant-class and error-class substitution pairs, where the class label is fixed by definition (we made the call). No external label noise. No model-misperception noise.

3.2 Inventory: textbook + PERCEPT sanity check

Variant set (~20–30 pairs): canonical accent / L2 / allophony substitutions from Wells's lexical-set treatment and standard allophonic processes. Examples:
- /θ/ → /f/ (TH-fronting, common in many dialects)
- /t/ → /ɾ/ (American tap, intervocalic)
- /ɹ/ → /ɾ/ (Spanish-influenced)
- /v/ → /b/ (Spanish-influenced)
- vowel reductions (e.g. /æ/ → /ɛ/ in unstressed contexts)

Error set (~20–30 pairs): canonical SSD phonological processes from Hodson and Bernthal:
- Stopping: /s/ → /t/, /ʃ/ → /t/, /z/ → /d/
- Fronting (velar): /k/ → /t/, /g/ → /d/
- Gliding: /ɹ/ → /w/, /l/ → /w/
- Cluster reduction (modeled as deletion of one cluster member, scored separately from substitution costs)
- Final consonant deletion (modeled as deletion)

Each entry is (ipa_a, ipa_b, label, severity_rank, source_ref) with severity_rank ∈ {variant, mild_error, moderate_error, severe_error} for the Spearman correlation.
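One way to hold these entries is sketched below. The field names come from the tuple above; everything else is an assumption made for illustration: the class name `SubstitutionPair`, the reading of `ipa_a` as the canonical phoneme and `ipa_b` as its realization, the label values `"variant"` and `"error"`, the integer severity codes, and the placeholder severity assigned to /k/ → /t/.

```python
from dataclasses import dataclass

# Severity levels in the order given by the spec. Mapping them to 0..3 is an
# assumption so the ranks can feed a rank correlation later.
SEVERITY_ORDER = ("variant", "mild_error", "moderate_error", "severe_error")


@dataclass(frozen=True)
class SubstitutionPair:
    """Hypothetical container for one inventory entry."""

    ipa_a: str          # assumed: canonical (target) phoneme
    ipa_b: str          # assumed: realized (substituted) phoneme
    label: str          # assumed values: "variant" or "error"
    severity_rank: str  # one of SEVERITY_ORDER
    source_ref: str     # citation for the pair

    def __post_init__(self) -> None:
        if self.label not in ("variant", "error"):
            raise ValueError(f"unknown label: {self.label!r}")
        if self.severity_rank not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity_rank: {self.severity_rank!r}")

    @property
    def severity_code(self) -> int:
        """Ordinal position of severity_rank (0 = variant)."""
        return SEVERITY_ORDER.index(self.severity_rank)


variant_pairs = [
    SubstitutionPair("θ", "f", "variant", "variant", "Wells, Accents of English"),
]
error_pairs = [
    # The severity below is a placeholder, not a curated judgment.
    SubstitutionPair("k", "t", "error", "moderate_error", "Hodson & Bernthal"),
]
```

Validating in `__post_init__` keeps a typo in a hand-curated list from silently becoming a fifth severity level.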

PERCEPT sanity check (percept_check.py): mines actual_phonology vs model_phonology substitution frequencies from /Volumes/ExternalData1/phonbank/dataset_production.jsonl via alignment, and reports per-inventory-pair occurrence counts. Purpose: verify the listed errors actually occur in pediatric data — not to change the inventory. Skippable if the external drive isn't mounted (sanity, not core).
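A minimal sketch of the counting step follows. The spec does not name the alignment method or the record layout, so this is an assumption throughout: `difflib.SequenceMatcher` stands in for the aligner, each record is assumed to hold phoneme lists, `model_phonology` is treated as the target and `actual_phonology` as the child's production, and `count_substitutions` is a hypothetical helper name. The toy records are invented and do not come from the dataset.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_substitutions(records):
    """Count (target, produced) phoneme substitutions across records."""
    counts = Counter()
    for rec in records:
        target = rec["model_phonology"]     # assumed: list of target phonemes
        produced = rec["actual_phonology"]  # assumed: list of produced phonemes
        matcher = SequenceMatcher(a=target, b=produced, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only equal-length "replace" spans map one-to-one onto
            # substitutions; insertions and deletions are skipped here.
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                counts.update(zip(target[i1:i2], produced[j1:j2]))
    return counts


# Invented toy records, for illustration only.
records = [
    {"model_phonology": ["k", "æ", "t"], "actual_phonology": ["t", "æ", "t"]},
    {"model_phonology": ["s", "ʌ", "n"], "actual_phonology": ["t", "ʌ", "n"]},
    {"model_phonology": ["k", "ʌ", "p"], "actual_phonology": ["t", "ʌ", "p"]},
]
counts = count_substitutions(records)

# Per-inventory-pair occurrence counts; a zero flags a pair the data never shows.
inventory = [("k", "t"), ("s", "t"), ("ɹ", "w")]
coverage = {pair: counts.get(pair, 0) for pair in inventory}
```

A generic sequence matcher is not a phonetic aligner, so counts from a sketch like this are only a rough coverage signal, which matches the check's stated purpose.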

3.3 Similarity

cos_sim(p₁, p₂) = vectors[p₁] · vectors[p₂] / (‖vectors[p₁]‖ · ‖vectors[p₂]‖)
cos_dist(p₁, p₂) = clip(1 − cos_sim(p₁, p₂), 0, 1)

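Vectors are loaded from packages/features/outputs/vectors.csv, the same artifact the live similarity tool derives from. The clip handles the (rare) case where cos_sim is negative on far phoneme pairs, which would otherwise push 1 − cos_sim above 1. Clipping collapses those pairs to a distance of exactly 1, so it discards ordering only among pairs that are already maximally distant. It also keeps every substitution cost in [0, 1], no higher than the unit deletion and insertion costs in 3.4.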
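The two definitions can be sketched in plain Python as below. The layout of vectors.csv is not described here, so the example uses an in-memory dict of invented 3-d vectors in place of the 26-d rows; the key `"far"` is a made-up entry that points opposite to `"s"` to exercise the clip.

```python
import math

# Invented toy vectors standing in for rows of vectors.csv.
vectors = {
    "s": [1.0, 0.2, 0.0],
    "z": [0.9, 0.3, 0.1],
    "far": [-1.0, -0.2, 0.0],  # hypothetical entry, opposite direction to "s"
}


def cos_sim(p1: str, p2: str) -> float:
    """Cosine similarity between the vectors of two phonemes."""
    v1, v2 = vectors[p1], vectors[p2]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def cos_dist(p1: str, p2: str) -> float:
    """1 - cos_sim, clipped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - cos_sim(p1, p2)))
```

The lower clip bound also absorbs floating-point noise, where cos_sim of a vector with itself can land a hair above 1.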

3.4 WPER

Standard Levenshtein DP with:
- Substitution cost = cos_dist(p_pred, p_canonical)
- Deletion cost = 1
- Insertion cost = 1

Normalized: WPER = total_cost / N_canonical. Returns binary PER alongside WPER for comparison.
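A minimal sketch of that DP, under stated assumptions: the function name `wper`, its signature, and the injected `sub_cost` callable are illustrative choices, and the binary PER is computed with its own table rather than reusing the weighted alignment. The toy cost of 0.3 is invented.

```python
def wper(predicted, canonical, sub_cost):
    """Return (weighted PER, binary PER), both normalized by len(canonical)."""
    n, m = len(predicted), len(canonical)
    if m == 0:
        raise ValueError("canonical sequence must be non-empty")

    # w[i][j] / b[i][j]: weighted / binary cost of aligning
    # predicted[:i] with canonical[:j].
    w = [[0.0] * (m + 1) for _ in range(n + 1)]
    b = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # extra predicted phonemes: insertions
        w[i][0] = float(i)
        b[i][0] = i
    for j in range(1, m + 1):  # missing canonical phonemes: deletions
        w[0][j] = float(j)
        b[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p, c = predicted[i - 1], canonical[j - 1]
            same = p == c
            w[i][j] = min(
                w[i - 1][j] + 1.0,                                   # insertion
                w[i][j - 1] + 1.0,                                   # deletion
                w[i - 1][j - 1] + (0.0 if same else sub_cost(p, c)),  # match/sub
            )
            b[i][j] = min(
                b[i - 1][j] + 1,
                b[i][j - 1] + 1,
                b[i - 1][j - 1] + (0 if same else 1),
            )
    return w[n][m] / m, b[n][m] / m


# Toy usage: /k/ realized as /t/ with an invented substitution cost of 0.3.
toy_cost = lambda p, c: 0.3
weighted, binary = wper(["t", "æ", "t"], ["k", "æ", "t"], toy_cost)
```

Passing the cost as a callable keeps the DP independent of how the vectors are loaded, so the same function can run with `cos_dist` or with a constant cost of 1 to recover binary PER.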

3.5 Verdict criterion

Report all three; judge holistically in findings.md.

Mann-Whitney U — one-sided H1: variant_costs < error_costs. Tells us whether the two distributions are statistically separable.
Practical threshold — variant 75th percentile vs error 25th percentile. Tells us whether a hard threshold can be drawn (the deciding factor for using the metric as-is vs needing calibration).
Spearman ρ — between hand-assigned severity_rank (variant < mild_error < moderate_error < severe_error) and cos_dist. Tells us whether the metric tracks clinical severity, not just binary class.
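The three metrics could be computed along these lines. This is a sketch that assumes SciPy and NumPy are available; the cost arrays and severity codes are invented toy values, not spike results, and coding the severity ranks as 0 to 3 is an assumption.

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

# Invented toy costs, for illustration only.
variant_costs = np.array([0.05, 0.10, 0.12, 0.20, 0.22])
error_costs = np.array([0.35, 0.50, 0.55, 0.70, 0.90])

# Assumed ordinal coding: variant=0, mild=1, moderate=2, severe=3.
severity = np.array([0, 0, 0, 0, 0, 1, 1, 2, 2, 3])
all_costs = np.concatenate([variant_costs, error_costs])

# 1. Mann-Whitney U, one-sided: variant costs stochastically smaller.
u_stat, p_value = mannwhitneyu(variant_costs, error_costs, alternative="less")

# 2. Practical threshold: gap between variant 75th and error 25th percentile.
variant_q75 = np.percentile(variant_costs, 75)
error_q25 = np.percentile(error_costs, 25)
threshold_gap = error_q25 - variant_q75  # > 0 means a hard threshold fits

# 3. Spearman rho between severity rank and cos_dist.
rho, rho_p = spearmanr(severity, all_costs)
```

A positive `threshold_gap` means any cutoff between the two percentiles separates the middle half of each class.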

Verdict in findings.md:
- Pass: clean threshold separation + Mann-Whitney p<.01 + ρ ≥ 0.7. Use as-is in PHON-53.
- Pass-with-calibration: directional but overlapping; Mann-Whitney p<.01 but no clean threshold. Need a calibration step (per-phoneme weighting, scaling, or threshold tuning) before shipping. File follow-up.
- Fail: distributions overlap with no significant separation (Mann-Whitney p ≥ .01), or variant median ≥ error median. Vectors don't transfer to acoustic error scoring. Need an acoustic-grounded similarity matrix (Berkeley-style, learned from acoustics).
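Since the final call is made holistically, the rules can only be approximated mechanically. The sketch below is one possible first-pass reading, with several assumptions: "clean threshold" is taken to mean the variant 75th percentile sits strictly below the error 25th percentile, "fail" is taken to include a non-significant Mann-Whitney result, and a significant, cleanly separated result with ρ below 0.7 is routed to pass-with-calibration. The function name `first_pass_verdict` is hypothetical.

```python
def first_pass_verdict(
    p_value: float,
    variant_q75: float,
    error_q25: float,
    rho: float,
    variant_median: float,
    error_median: float,
) -> str:
    """Mechanical reading of the verdict rules; findings.md has the final say."""
    # Wrong direction, or no statistical separation at all.
    if variant_median >= error_median or p_value >= 0.01:
        return "fail"

    # Assumed meaning of "clean threshold separation".
    clean_threshold = variant_q75 < error_q25
    if clean_threshold and rho >= 0.7:
        return "pass"

    # Directional and significant, but overlapping or weakly rank-correlated.
    return "pass-with-calibration"
```

For example, `first_pass_verdict(0.004, 0.20, 0.50, 0.93, 0.12, 0.55)` falls in the pass branch, while raising `variant_q75` above `error_q25` moves the same inputs to pass-with-calibration.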

4. Architecture

Self-contained research artifact at research/2026-05-28-phon-126-feature-vector-graded-error/. Pure Python via uv run, PEP-723 inline deps where possible. No worker / API / D1 changes.
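For reference, a PEP 723 inline metadata block sits in comments at the top of a script, and `uv run <script>` reads it to build the environment. The Python floor and the dependency list below are assumptions for illustration; the spike's actual requirements are not stated here.

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "numpy",
#     "scipy",
# ]
# ///
"""Hypothetical top of a spike script such as run_pair_level.py."""


def main() -> int:
    # Placeholder body; a real script would load vectors and compute costs.
    return 0


if __name__ == "__main__":
    main()
```

Keeping dependencies inline means each script in the research directory can be run on its own without a shared project file.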

4.1 Components

| File | Role |
| --- | --- |
| `inventory.py` | Curated `variant_pairs[]` and `error_pairs[]` with severity ranks |
| `similarity.py` | Loads `vectors.csv`; `cos_sim`, `cos_dist`; self-test |
| `wper.py` | Levenshtein DP with cos-dist substitution cost; self-test |
| `percept_check.py` | Mines PERCEPT substitution frequencies; outputs `inventory_coverage.parquet` |
| `run_pair_level.py` | Per-pair cos_dist; outputs `pair_costs.parquet` + three diagnostic metrics |
| `run_word_level.py` | ~50 CMU strings × {variant, error} corruption; outputs `word_costs.parquet` + side-by-side distribution plot |
| `findings.md` | Writeup: data, method, results, verdict, PHON-53 implications |
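The word-level corruption step is not specified in detail, so the following is only a guess at its shape. Assumptions: canonical strings are already phoneme lists in IPA (the toy words are hand-written, not read from CMU), a corruption replaces the first occurrence of the canonical phoneme only, and words that lack the phoneme are skipped. The helper name `corrupt` is hypothetical.

```python
def corrupt(canonical, target, replacement):
    """Swap the first `target` phoneme for `replacement`; None if absent."""
    if target not in canonical:
        return None
    i = canonical.index(target)
    return canonical[:i] + [replacement] + canonical[i + 1:]


# Hand-written toy words, for illustration only.
words = {
    "cat": ["k", "æ", "t"],
    "sun": ["s", "ʌ", "n"],
}
# (canonical phoneme, realized phoneme, class label)
pairs = [("k", "t", "error"), ("θ", "f", "variant")]

rows = []
for word, phones in words.items():
    for target, replacement, label in pairs:
        corrupted = corrupt(phones, target, replacement)
        if corrupted is not None:
            rows.append((word, label, corrupted))
```

Each row pairs a canonical string with one known-class corruption, which is the input a word-level WPER comparison would need.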

4.2 Data flow

inventory.py (curated) ──┬─→ run_pair_level.py ──→ pair_costs.parquet  ──┐
                         │                                                │
similarity.py ───────────┤                                                ├─→ findings.md
                         │                                                │
inventory.py + CMU ──────┴─→ run_word_level.py ──→ word_costs.parquet ──┤
                                                                          │
percept_check.py ────────────→ inventory_coverage.parquet ────────────────┘

4.3 Self-tests inside scripts (no formal pytest)

similarity.py: identical phoneme → cos_dist = 0; far pair (/a/ vs /k/) → near 1.
wper.py: WPER on identical strings = 0; WPER on disjoint strings ≈ binary PER.

5. Done definition

All scripts run end-to-end via uv run from the research dir.
findings.md posted with three metrics, two plots, and verdict.
PHON-126 transitioned to Done with verdict transcribed in a Jira comment.
Spike findings link added to PHON-53 to inform the error-layer architecture.

6. Out of scope

No worker / live-similarity-tool changes — pure research artifact.
No real PHON-55 inference outputs — that's a deferred SLP-adjudicated phase, fileable as PHON-126b if needed.
No calibration scheme if it fails — separate follow-up ticket informed by what failed (per-phoneme weighting? acoustic-grounded matrix? per-position weighting?).
No insertion/deletion edge-case exploration beyond standard Levenshtein (substitutions dominate in pediatric data).

7. Risks / known unknowns

PERCEPT drive mount. /Volumes/ExternalData1/phonbank/dataset_production.jsonl requires the external drive. If unmounted, percept_check.py is skippable; pair-level + word-level analysis is unaffected.
Cosine geometry edge case. Negative cos_sim on far pairs → clip handles it; doesn't lose information for the variant/error question.
Inventory bias. Textbook lists are normative — they may miss substitutions common in PERCEPT data. The sanity check exposes this without changing the inventory.
Severity rank subjectivity. The severity_rank field is a hand-assigned ordering. The Spearman ρ is a soft check, not the primary verdict — it's flagged in findings.md as the most subjective metric of the three.

8. References

arXiv 2507.14346 — Berkeley phoneme-similarity recipe (variant-tolerant graded scoring via similarity matrix).
arXiv 2601.16230 — Audio-LLM weakness on fine phonetics (motivates symbolic-side investment).
Wells, J. C. — Accents of English (variant-class reference).
Hodson, B. & Bernthal, J. — SSD phonological process taxonomy (error-class reference).
packages/features/ — Bayesian learned articulatory feature space, r=0.987.

Spec written 2026-05-28. Implementation plan to follow via superpowers:writing-plans.