PHON-126 Feature-Vector Graded Error Spike — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Validate (or invalidate) PhonoLex's learned 26-d articulatory feature vectors as a graded phoneme-substitution distance for the PHON-53 audio-tool error layer, by comparing cosine-distance "WPER cost" on variant-class vs error-class synthetic substitutions.

Architecture: Self-contained research artifact under research/2026-05-28-phon-126-feature-vector-graded-error/. Pure Python scripts via uv run with PEP-723 inline deps. No worker / API / D1 / production-code changes. Reads packages/features/outputs/vectors.csv and (optionally) /Volumes/ExternalData1/phonbank/dataset_production.jsonl. Outputs three parquets + a plot + findings.md verdict.

Tech Stack: Python 3.11+, numpy, polars (parquet I/O — project convention), scipy.stats (Mann-Whitney U, Spearman ρ), matplotlib (distribution plots). All deps declared via PEP-723 inline blocks at the top of each script.

Spec: docs/superpowers/specs/2026-05-28-phon-126-feature-vector-graded-error-design.md

File Structure

| File | Responsibility |
| --- | --- |
| `research/2026-05-28-phon-126-feature-vector-graded-error/README.md` | Trailhead: spec link, how to run, output map |
| `research/2026-05-28-phon-126-feature-vector-graded-error/LAB.md` | Lab notebook — observations + decisions during the run |
| `research/2026-05-28-phon-126-feature-vector-graded-error/.gitignore` | Ignore parquet outputs + plot PNG (build artifacts) |
| `research/2026-05-28-phon-126-feature-vector-graded-error/similarity.py` | Load vectors.csv; `cos_sim`, `cos_dist`; self-test in `__main__` |
| `research/2026-05-28-phon-126-feature-vector-graded-error/wper.py` | Levenshtein DP with cos-dist substitution cost; self-test in `__main__` |
| `research/2026-05-28-phon-126-feature-vector-graded-error/inventory.py` | Variant pairs + error pairs lists with severity ranks; self-test in `__main__` |
| `research/2026-05-28-phon-126-feature-vector-graded-error/run_pair_level.py` | Per-pair cos_dist; writes `pair_costs.parquet`; prints 3 metrics |
| `research/2026-05-28-phon-126-feature-vector-graded-error/run_word_level.py` | Sample CMU strings → corrupt with variant/error → WPER + PER; writes `word_costs.parquet` + `word_costs.png` |
| `research/2026-05-28-phon-126-feature-vector-graded-error/percept_check.py` | PERCEPT sanity check — align actual vs canonical, output `inventory_coverage.parquet` (skippable if drive not mounted) |
| `research/2026-05-28-phon-126-feature-vector-graded-error/findings.md` | Final writeup with three metrics, plot, verdict, PHON-53 implications |

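Each script declares its own dependencies in a PEP-723 block, self-tests in __main__ where noted, and is runnable directly with uv run <script>.py. There is no shared package and no pyproject.toml for the research dir. Scripts import their siblings through a sys.path insert, so a script's PEP-723 block must also list the dependencies of every sibling it imports (for example numpy, which arrives through similarity.py).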

Task 1: Bootstrap research directory

Files:
- Create: research/2026-05-28-phon-126-feature-vector-graded-error/README.md
- Create: research/2026-05-28-phon-126-feature-vector-graded-error/LAB.md
- Create: research/2026-05-28-phon-126-feature-vector-graded-error/.gitignore

[ ] Step 1: Create the directory

Run: mkdir -p research/2026-05-28-phon-126-feature-vector-graded-error

[ ] Step 2: Write README.md

# PHON-126 — Feature-Vector Graded Error Spike

**Spec:** [`../../docs/superpowers/specs/2026-05-28-phon-126-feature-vector-graded-error-design.md`](../../docs/superpowers/specs/2026-05-28-phon-126-feature-vector-graded-error-design.md)
**Ticket:** [PHON-126](https://neumannsworkshop.atlassian.net/browse/PHON-126)
**Parent:** [PHON-44 Audio](https://neumannsworkshop.atlassian.net/browse/PHON-44)

## What this is

Probe: do PhonoLex's learned 26-d articulatory feature vectors give a usable graded distance such that variant-class phoneme substitutions score low and error-class substitutions score high?

## How to run

All scripts are self-contained with PEP-723 inline deps. From repo root:

```bash
uv run research/2026-05-28-phon-126-feature-vector-graded-error/similarity.py    # self-test
uv run research/2026-05-28-phon-126-feature-vector-graded-error/wper.py          # self-test
uv run research/2026-05-28-phon-126-feature-vector-graded-error/inventory.py     # self-test
uv run research/2026-05-28-phon-126-feature-vector-graded-error/run_pair_level.py
uv run research/2026-05-28-phon-126-feature-vector-graded-error/run_word_level.py
uv run research/2026-05-28-phon-126-feature-vector-graded-error/percept_check.py # optional
```

## Outputs

- `pair_costs.parquet` — per-pair cos_dist + severity rank
- `word_costs.parquet` — per-word WPER + binary PER, per class
- `word_costs.png` — side-by-side distribution plot
- `inventory_coverage.parquet` — PERCEPT-grounded frequency of each inventory pair (optional)
- `findings.md` — verdict + implications for PHON-53

## Verdict

See `findings.md`.

[ ] Step 3: Write LAB.md skeleton

# LAB Notebook — PHON-126

## 2026-05-28

(date) — Bootstrap.

## Observations

(Fill in during the run.)

## Decisions

(Fill in if anything diverges from the spec.)

[ ] Step 4: Write .gitignore

# Build artifacts
*.parquet
*.png

[ ] Step 5: Commit

git add research/2026-05-28-phon-126-feature-vector-graded-error/
git commit -m "research(phon-126): bootstrap spike directory"

Task 2: similarity.py

Files:
- Create: research/2026-05-28-phon-126-feature-vector-graded-error/similarity.py

Source data: packages/features/outputs/vectors.csv — 41 rows, 27 columns (1 IPA symbol col + 26 feature dims). Phonemes covered (40 listed here; the 41-row count presumably includes the header row): p b t d k ɡ tʃ dʒ f v θ ð s z ʃ ʒ h m n ŋ l ɹ w j i ɪ e ɛ æ a ɑ ɒ ɔ o ʊ u ʌ ə ɝ ɚ. Note the IPA ɡ (U+0261), not ASCII g.
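The vector table is keyed by the exact IPA string, so an ASCII `g` typed into a pair or word list would not match the `ɡ` row. A minimal sketch for checking which code points a symbol actually contains; the helper name `describe` is hypothetical and not part of the plan's scripts:

```python
import unicodedata


def describe(symbol: str) -> str:
    """Spell out every code point in a phoneme symbol (hypothetical helper)."""
    return " + ".join(f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in symbol)


print(describe("ɡ"))   # IPA script g, the form the plan says vectors.csv uses
print(describe("g"))   # ASCII g, which would not match
print(describe("tʃ"))  # an affricate symbol is two code points in one key
```

Pasting a suspicious symbol into `describe` is usually quicker than comparing glyphs by eye, since the two g forms render almost identically in many fonts.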

[ ] Step 1: Write the failing self-test (no implementation yet)

Create the file with the test block at the bottom but with cos_sim and cos_dist undefined (so running fails):

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "numpy>=1.24",
# ]
# ///
"""
PhonoLex feature-vector cosine similarity / distance.

Loads packages/features/outputs/vectors.csv and exposes cos_sim and cos_dist
over the 26-d learned articulatory feature space. Self-test verifies identical
phonemes get cos_dist = 0 and far pairs (vowel /a/ vs stop /k/) get a clearly
larger cos_dist (> 0.4).
"""
from __future__ import annotations

import csv
from pathlib import Path

import numpy as np

REPO_ROOT = Path(__file__).resolve().parents[2]
VECTORS_PATH = REPO_ROOT / "packages" / "features" / "outputs" / "vectors.csv"


if __name__ == "__main__":
    # Self-test: identical → 0; /a/ vs /k/ → clearly separated (> 0.4)
    assert cos_dist("p", "p") < 1e-9, "identical phoneme should have cos_dist 0"
    assert cos_dist("a", "k") > 0.4, "far pair /a/ vs /k/ should have cos_dist > 0.4"
    print(f"OK — cos_dist(p, p) = {cos_dist('p', 'p'):.6f}")
    print(f"OK — cos_dist(a, k) = {cos_dist('a', 'k'):.6f}")
    print(f"OK — cos_dist(s, t) = {cos_dist('s', 't'):.6f}  # stopping error")
    print(f"OK — cos_dist(θ, f) = {cos_dist('θ', 'f'):.6f}  # TH-fronting variant")

[ ] Step 2: Run the script to verify it fails

Run: uv run research/2026-05-28-phon-126-feature-vector-graded-error/similarity.py

Expected: NameError: name 'cos_dist' is not defined

[ ] Step 3: Implement load + cosine functions

Add this above the if __name__ == "__main__": block:

def _load_vectors() -> dict[str, np.ndarray]:
    """Return {ipa_symbol: 26-d feature vector}."""
    vectors: dict[str, np.ndarray] = {}
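    # UTF-8 is needed for the IPA symbols; newline="" is what the csv module expects.
    with VECTORS_PATH.open(encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        # Assumes the symbol column is named "ipa"; every other column is a feature dim.
        feature_cols = [c for c in (reader.fieldnames or []) if c != "ipa"]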
        for row in reader:
            vectors[row["ipa"]] = np.array(
                [float(row[c]) for c in feature_cols], dtype=np.float64
            )
    return vectors


VECTORS = _load_vectors()


def cos_sim(p1: str, p2: str) -> float:
    """Cosine similarity in the 26-d feature space."""
    v1, v2 = VECTORS[p1], VECTORS[p2]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


def cos_dist(p1: str, p2: str) -> float:
    """Distance = clip(1 - cos_sim, 0, 1)."""
    return float(np.clip(1.0 - cos_sim(p1, p2), 0.0, 1.0))
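To make the clipping in `cos_dist` concrete, here is a small sketch of the same formula on hand-picked 2-d vectors. The function name `toy_cos_dist` and the vectors are illustrative assumptions; nothing here reads vectors.csv.

```python
import math


def toy_cos_dist(v1: list[float], v2: list[float]) -> float:
    """Same formula as cos_dist above, on plain lists instead of VECTORS rows."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    # Raw 1 - cos_sim lies in [0, 2]; the clip folds everything above 1 down to 1.
    return min(max(1.0 - dot / norm, 0.0), 1.0)


print(toy_cos_dist([1.0, 0.0], [2.0, 0.0]))   # same direction, different length
print(toy_cos_dist([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(toy_cos_dist([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

This prints 0.0, 1.0 and 1.0. Two consequences follow for reading the results: vector length is ignored, and the clip makes an orthogonal pair indistinguishable from an opposite one, so any pair with negative cosine similarity saturates at the same cost as an insertion or deletion.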

[ ] Step 4: Run the script to verify it passes

Run: uv run research/2026-05-28-phon-126-feature-vector-graded-error/similarity.py

Expected output (the first line is exact; the other values depend on the vectors, and the /a/ vs /k/ value must exceed 0.4 for the assertion to pass):

OK — cos_dist(p, p) = 0.000000
OK — cos_dist(a, k) = 0.xxxxxx
OK — cos_dist(s, t) = 0.xxxxxx  # stopping error
OK — cos_dist(θ, f) = 0.xxxxxx  # TH-fronting variant

[ ] Step 5: Commit

git add research/2026-05-28-phon-126-feature-vector-graded-error/similarity.py
git commit -m "research(phon-126): similarity.py — cos_sim / cos_dist over learned vectors"

Task 3: wper.py

Files:
- Create: research/2026-05-28-phon-126-feature-vector-graded-error/wper.py

Standard Levenshtein DP. Substitution cost = cos_dist(p_pred, p_canonical). Deletion / insertion cost = 1. WPER = total_cost / N_canonical.
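The arithmetic can be sketched on a single word before the full implementation. This is a cost-only version of the DP with a hypothetical substitution table (`TOY_COSTS`) standing in for `cos_dist`; the function names and the 0.2 cost are assumptions for illustration.

```python
def weighted_edit_cost(pred, canonical, sub_cost):
    """Minimum total edit cost; insertions and deletions cost 1 each."""
    m, n = len(pred), len(canonical)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)  # pred has i extra phones
    for j in range(1, n + 1):
        dp[0][j] = float(j)  # pred is missing j phones
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = pred[i - 1] == canonical[j - 1]
            step = 0.0 if same else sub_cost(pred[i - 1], canonical[j - 1])
            dp[i][j] = min(
                dp[i - 1][j - 1] + step,  # match or substitution
                dp[i - 1][j] + 1.0,       # insertion
                dp[i][j - 1] + 1.0,       # deletion
            )
    return dp[m][n]


# Hypothetical substitution costs, standing in for cos_dist.
TOY_COSTS = {("t", "s"): 0.2}


def toy_sub_cost(pred_phone, canonical_phone):
    return TOY_COSTS.get((pred_phone, canonical_phone), 1.0)


canonical = ["s", "ʌ", "p"]
wper_sub = weighted_edit_cost(["t", "ʌ", "p"], canonical, toy_sub_cost) / len(canonical)
wper_del = weighted_edit_cost(["ʌ", "p"], canonical, toy_sub_cost) / len(canonical)
print(f"substitution: WPER={wper_sub:.3f}")
print(f"deletion:     WPER={wper_del:.3f}")
```

With these toy costs the substitution scores 0.2 / 3 and the deletion 1 / 3, so a deleted phone always costs at least as much as any substitution of it.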

[ ] Step 1: Write the failing self-test

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "numpy>=1.24",
# ]
# ///
"""
Weighted Phoneme Error Rate.

Levenshtein DP with substitution cost from similarity.cos_dist.
WPER = total_cost / N_canonical alongside standard binary PER.
"""
from __future__ import annotations

import sys
from pathlib import Path

# Allow importing similarity.py as a sibling.
sys.path.insert(0, str(Path(__file__).resolve().parent))

from similarity import cos_dist  # noqa: E402


if __name__ == "__main__":
    # Self-test
    w, p, _ = score(["k", "æ", "t"], ["k", "æ", "t"])
    assert w == 0.0 and p == 0.0, "identical strings should have WPER=PER=0"

    w, p, _ = score(["k", "æ", "t"], ["d", "ɔ", "ɡ"])
    assert p == 1.0, f"fully different equal-length should have binary PER=1, got {p}"
    assert 0.0 < w <= 1.0, f"WPER should be in (0, 1] for full substitution, got {w}"
    assert w < p, "WPER should be less than binary PER when substitutions are nontrivial"
    print(f"OK — identical: WPER=0, PER=0")
    print(f"OK — disjoint: WPER={w:.3f}, PER={p:.3f}")

    # Stopping error
    w, p, _ = score(["t", "ʌ", "p"], ["s", "ʌ", "p"])
    print(f"  tʌp vs sʌp (stopping): WPER={w:.3f}, PER={p:.3f}")

    # TH-fronting variant
    w, p, _ = score(["f", "ɪ", "ŋ"], ["θ", "ɪ", "ŋ"])
    print(f"  fɪŋ vs θɪŋ (TH-fronting variant): WPER={w:.3f}, PER={p:.3f}")

[ ] Step 2: Run the script to verify it fails

Run: uv run research/2026-05-28-phon-126-feature-vector-graded-error/wper.py

Expected: NameError: name 'score' is not defined

[ ] Step 3: Implement the DP

Add above the if __name__ == "__main__": block:

def score(
    pred: list[str], canonical: list[str]
) -> tuple[float, float, list[tuple[str, str, str]]]:
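    """
    Return (wper, binary_per, alignment) for a (predicted, canonical) phoneme pair.

    Alignment items are tuples: ("match"|"sub"|"del"|"ins", pred_phone, canonical_phone)
    where one of the phonemes is "" for del/ins.

    binary_per counts the edit operations along the WPER-optimal alignment, so it
    can exceed the unweighted Levenshtein minimum when several cheap substitutions
    beat an insertion/deletion pair.
    """
    m, n = len(pred), len(canonical)
    # Parallel tables: weighted cost, binary cost, and the op chosen at each cell.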
    dp_w = [[0.0] * (n + 1) for _ in range(m + 1)]
    dp_b = [[0] * (n + 1) for _ in range(m + 1)]
    parent: list[list[str]] = [[""] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        dp_w[i][0] = i  # insertions
        dp_b[i][0] = i
        parent[i][0] = "ins"
    for j in range(1, n + 1):
        dp_w[0][j] = j  # deletions
        dp_b[0][j] = j
        parent[0][j] = "del"

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == canonical[j - 1]:
                sub_w = dp_w[i - 1][j - 1]
                sub_b = dp_b[i - 1][j - 1]
                op = "match"
            else:
                sub_w = dp_w[i - 1][j - 1] + cos_dist(pred[i - 1], canonical[j - 1])
                sub_b = dp_b[i - 1][j - 1] + 1
                op = "sub"
            ins_w = dp_w[i - 1][j] + 1.0  # extra phone in pred
            ins_b = dp_b[i - 1][j] + 1
            del_w = dp_w[i][j - 1] + 1.0  # missing phone in pred
            del_b = dp_b[i][j - 1] + 1

            best = min((sub_w, op), (ins_w, "ins"), (del_w, "del"), key=lambda t: t[0])
            dp_w[i][j] = best[0]
            parent[i][j] = best[1]
            if best[1] == "match" or best[1] == "sub":
                dp_b[i][j] = sub_b
            elif best[1] == "ins":
                dp_b[i][j] = ins_b
            else:
                dp_b[i][j] = del_b

    # Backtrace for alignment
    alignment: list[tuple[str, str, str]] = []
    i, j = m, n
    while i > 0 or j > 0:
        op = parent[i][j]
        if op == "match" or op == "sub":
            alignment.append((op, pred[i - 1], canonical[j - 1]))
            i -= 1
            j -= 1
        elif op == "ins":
            alignment.append(("ins", pred[i - 1], ""))
            i -= 1
        else:  # del
            alignment.append(("del", "", canonical[j - 1]))
            j -= 1
    alignment.reverse()

    n_ref = max(n, 1)
    wper = dp_w[m][n] / n_ref
    per = dp_b[m][n] / n_ref
    return wper, per, alignment

[ ] Step 4: Run the script to verify it passes

Run: uv run research/2026-05-28-phon-126-feature-vector-graded-error/wper.py

Expected output (WPER values depend on the vectors; the PER values are exact):

OK — identical: WPER=0, PER=0
OK — disjoint: WPER=0.xxx, PER=1.000
  tʌp vs sʌp (stopping): WPER=0.xxx, PER=0.333
  fɪŋ vs θɪŋ (TH-fronting variant): WPER=0.xxx, PER=0.333

[ ] Step 5: Commit

git add research/2026-05-28-phon-126-feature-vector-graded-error/wper.py
git commit -m "research(phon-126): wper.py — Levenshtein DP with cos-dist sub cost"

Task 4: inventory.py

Files:
- Create: research/2026-05-28-phon-126-feature-vector-graded-error/inventory.py

Curated variant and error substitution pairs. Severity rank in {1: variant, 2: mild_error, 3: moderate_error, 4: severe_error}. Inventory constrained to phonemes present in vectors.csv (tap ɾ not available — variant set uses vowel mergers + L2 substitutions instead).

[ ] Step 1: Write the inventory and its self-test

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "numpy>=1.24",
# ]
# ///
"""
PHON-126 inventory: variant-class and error-class substitution pairs.

Each entry: (canonical_phoneme, substituted_phoneme, label, severity_rank, source)
- severity_rank: 1=variant, 2=mild_error, 3=moderate_error, 4=severe_error
- All phonemes must appear in packages/features/outputs/vectors.csv

Variant set: accent / L2 / dialectal substitutions that are NOT clinical errors.
Error set: SSD phonological processes from Hodson / Bernthal.
"""
from __future__ import annotations

import sys
from dataclasses import dataclass
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent))

from similarity import VECTORS  # noqa: E402


@dataclass(frozen=True)
class Pair:
    canonical: str
    substitute: str
    label: str
    severity_rank: int  # 1=variant, 2=mild_err, 3=moderate_err, 4=severe_err
    source: str


VARIANTS: list[Pair] = [
    # Vowel mergers / shifts (clear non-error)
    Pair("æ", "ɛ", "vowel-merger TRAP-DRESS", 1, "Wells"),
    Pair("ɛ", "æ", "vowel-merger DRESS-TRAP", 1, "Wells"),
    Pair("ɔ", "ɑ", "cot-caught merger", 1, "Wells"),
    Pair("ɑ", "ɔ", "cot-caught merger (reverse)", 1, "Wells"),
    Pair("ɪ", "i", "L2 no-laxing (Spanish/Italian)", 1, "Wells"),
    Pair("ʊ", "u", "L2 no-laxing (Spanish/Italian)", 1, "Wells"),
    Pair("ʌ", "ə", "unstressed alternation", 1, "Wells"),
    Pair("ɝ", "ɚ", "stressed/unstressed r-schwa", 1, "Wells"),
    # Consonant variants (L2 / dialectal, NOT clinical errors in adult speech)
    Pair("v", "b", "L2 v→b (Spanish)", 1, "Wells"),
    Pair("θ", "f", "TH-fronting (AAE/Cockney, dialectal)", 1, "Wells"),
    Pair("ð", "d", "TH-stopping (AAE, dialectal)", 1, "Wells"),
    Pair("ʒ", "dʒ", "ZH-affrication (foreign words)", 1, "Wells"),
]

ERRORS: list[Pair] = [
    # Fronting (velar → alveolar) — moderate, typical SSD
    Pair("k", "t", "velar fronting", 3, "Hodson"),
    Pair("ɡ", "d", "velar fronting (voiced)", 3, "Hodson"),
    # Stopping (fricative/affricate → stop) — moderate (all ranked 3 here)
    Pair("s", "t", "stopping /s/", 3, "Hodson"),
    Pair("z", "d", "stopping /z/", 3, "Hodson"),
    Pair("ʃ", "t", "stopping /ʃ/", 3, "Hodson"),
    Pair("tʃ", "t", "affricate stopping", 3, "Hodson"),
    Pair("dʒ", "d", "affricate stopping (voiced)", 3, "Hodson"),
    # Gliding (liquid → glide) — moderate
    Pair("ɹ", "w", "gliding /ɹ/", 3, "Hodson"),
    Pair("l", "w", "gliding /l/", 3, "Hodson"),
    # Lisp (interdental for /s/) — mild
    Pair("s", "θ", "interdental lisp", 2, "Bernthal"),
    Pair("z", "ð", "interdental lisp (voiced)", 2, "Bernthal"),
    # Backing (less common, often noted in CAS) — severe
    Pair("t", "k", "backing", 4, "Bernthal"),
    Pair("d", "ɡ", "backing (voiced)", 4, "Bernthal"),
    # Devoicing (less common in English SSD but reported) — mild
    Pair("b", "p", "final devoicing", 2, "Bernthal"),
    Pair("d", "t", "final devoicing", 2, "Bernthal"),
    Pair("ɡ", "k", "final devoicing", 2, "Bernthal"),
]


ALL_PAIRS: list[Pair] = VARIANTS + ERRORS


if __name__ == "__main__":
    # Self-test: every phoneme in every pair must be in the vector set
    missing: list[tuple[Pair, str]] = []
    for p in ALL_PAIRS:
        if p.canonical not in VECTORS:
            missing.append((p, p.canonical))
        if p.substitute not in VECTORS:
            missing.append((p, p.substitute))
    assert not missing, f"missing phonemes in VECTORS: {missing}"
    assert len(VARIANTS) >= 10, f"variant inventory too small: {len(VARIANTS)}"
    assert len(ERRORS) >= 10, f"error inventory too small: {len(ERRORS)}"
    # Severity ranks must be in {1, 2, 3, 4}
    ranks = {p.severity_rank for p in ALL_PAIRS}
    assert ranks <= {1, 2, 3, 4}, f"bad severity ranks: {ranks}"
    # Variants all rank 1, errors all rank > 1
    assert all(p.severity_rank == 1 for p in VARIANTS), "variants must rank 1"
    assert all(p.severity_rank > 1 for p in ERRORS), "errors must rank > 1"
    print(f"OK — {len(VARIANTS)} variants, {len(ERRORS)} errors, all phones in VECTORS")

[ ] Step 2: Run the script to verify it passes

(This file is self-contained — pairs are defined inline above the self-test, so it should pass on first run if VECTORS loads cleanly.)

Run: uv run research/2026-05-28-phon-126-feature-vector-graded-error/inventory.py

Expected output: OK — 12 variants, 16 errors, all phones in VECTORS

If a phone is missing from VECTORS (e.g. you added an entry with ɾ), the assertion will fail and the offending pair will be printed.
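A minimal sketch of what that failure looks like, using a hypothetical `ToyPair` and a small stand-in phone set (`KNOWN_PHONES`) instead of the real `Pair` and `VECTORS`:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPair:
    canonical: str
    substitute: str


# Hypothetical stand-in for the keys of VECTORS; the tap ɾ is deliberately absent.
KNOWN_PHONES = {"t", "d", "k", "s"}

pairs = [ToyPair("k", "t"), ToyPair("t", "ɾ")]

# Same check as the self-test: collect every (pair, phone) that has no vector.
missing = [
    (pair, phone)
    for pair in pairs
    for phone in (pair.canonical, pair.substitute)
    if phone not in KNOWN_PHONES
]
print(missing)
```

Only the second pair is reported, together with the offending phone, which is the same information the real assertion message carries.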

[ ] Step 3: Commit

git add research/2026-05-28-phon-126-feature-vector-graded-error/inventory.py
git commit -m "research(phon-126): inventory.py — variant + error pair lists with severity ranks"

Task 5: run_pair_level.py

Files:
- Create: research/2026-05-28-phon-126-feature-vector-graded-error/run_pair_level.py
- Writes: research/2026-05-28-phon-126-feature-vector-graded-error/pair_costs.parquet

Per-pair cos_dist + the three diagnostic metrics from Q3 (Mann-Whitney U, practical threshold, Spearman ρ).
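Two of the three diagnostics can be worked by hand on toy numbers, which helps when reading the script's output. The sketch below uses only the standard library and hypothetical costs. It assumes that the U statistic SciPy reports for the first sample corresponds to the pair count computed here, and it relies on `statistics.quantiles(..., method="inclusive")` interpolating linearly between order statistics, which is also NumPy's default percentile behaviour.

```python
import statistics

# Hypothetical costs, not results from the spike.
variant_costs = [0.02, 0.05, 0.08, 0.10]
error_costs = [0.09, 0.15, 0.20, 0.30]

# U for the first sample: pairs where a variant cost exceeds an error cost,
# with ties counted as one half. U = 0 would mean perfect separation.
u_stat = sum(
    1.0 if v > e else 0.5 if v == e else 0.0
    for v in variant_costs
    for e in error_costs
)

# Quartile cut points: index 0 is the 25th percentile, index 2 the 75th.
variant_75 = statistics.quantiles(variant_costs, n=4, method="inclusive")[2]
error_25 = statistics.quantiles(error_costs, n=4, method="inclusive")[0]

print(f"U = {u_stat} out of {len(variant_costs) * len(error_costs)} pairs")
print(f"variant 75th = {variant_75:.3f}, error 25th = {error_25:.3f}")
print("CLEAN" if variant_75 < error_25 else "OVERLAP")
```

Here a single variant cost (0.10) exceeds a single error cost (0.09), so U is 1 out of 16, and the quartile check still reports CLEAN because it tolerates overlap in the tails.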

[ ] Step 1: Write the script (no test-first — this is a runner)

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "numpy>=1.24",
#   "polars>=0.20",
#   "scipy>=1.11",
# ]
# ///
"""
PHON-126: Pair-level cos_dist evaluation.

For each pair in inventory.VARIANTS + inventory.ERRORS:
  - Compute cos_dist(canonical, substitute) using packages/features vectors.
Then report:
  1. Mann-Whitney U (one-sided: variant_costs < error_costs)
  2. Practical threshold: variant 75th vs error 25th percentile
  3. Spearman ρ between severity_rank and cos_dist

Outputs pair_costs.parquet for record.
"""
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np
import polars as pl
from scipy.stats import mannwhitneyu, spearmanr

sys.path.insert(0, str(Path(__file__).resolve().parent))

from inventory import ALL_PAIRS, ERRORS, VARIANTS  # noqa: E402
from similarity import cos_dist  # noqa: E402

OUT_PATH = Path(__file__).resolve().parent / "pair_costs.parquet"


def main() -> None:
    rows = []
    for p in ALL_PAIRS:
        rows.append(
            {
                "canonical": p.canonical,
                "substitute": p.substitute,
                "label": p.label,
                "severity_rank": p.severity_rank,
                "class": "variant" if p.severity_rank == 1 else "error",
                "source": p.source,
                "cos_dist": cos_dist(p.canonical, p.substitute),
            }
        )
    df = pl.DataFrame(rows)
    df.write_parquet(OUT_PATH)

    variant_costs = np.array(
        [r["cos_dist"] for r in rows if r["class"] == "variant"]
    )
    error_costs = np.array([r["cos_dist"] for r in rows if r["class"] == "error"])

    # Diagnostic 1: Mann-Whitney U (one-sided: variant < error)
    mw = mannwhitneyu(variant_costs, error_costs, alternative="less")

    # Diagnostic 2: Practical threshold
    variant_75 = float(np.percentile(variant_costs, 75))
    error_25 = float(np.percentile(error_costs, 25))
    clean_threshold = variant_75 < error_25

    # Diagnostic 3: Spearman ρ on severity_rank vs cos_dist
    ranks = [r["severity_rank"] for r in rows]
    dists = [r["cos_dist"] for r in rows]
    sp = spearmanr(ranks, dists)

    print(f"== PHON-126 Pair-Level Results ==")
    print(f"  variants: n={len(variant_costs)}, mean={variant_costs.mean():.3f}, "
          f"median={np.median(variant_costs):.3f}, 75th={variant_75:.3f}")
    print(f"  errors:   n={len(error_costs)}, mean={error_costs.mean():.3f}, "
          f"median={np.median(error_costs):.3f}, 25th={error_25:.3f}")
    print()
    print(f"  Mann-Whitney U (one-sided variant<error): "
          f"U={mw.statistic:.1f}, p={mw.pvalue:.4g}")
    print(f"  Practical threshold: variant 75th ({variant_75:.3f}) < "
          f"error 25th ({error_25:.3f}) → {'CLEAN' if clean_threshold else 'OVERLAP'}")
    print(f"  Spearman ρ (severity_rank vs cos_dist): "
          f"ρ={sp.statistic:.3f}, p={sp.pvalue:.4g}")
    print()
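    # Print the full path: relative_to(Path.cwd()) raises ValueError when the
    # output file is not under the current working directory.
    print(f"Pair costs written → {OUT_PATH}")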


if __name__ == "__main__":
    main()

[ ] Step 2: Run the script

Run: uv run research/2026-05-28-phon-126-feature-vector-graded-error/run_pair_level.py

Expected: parquet written, metrics printed. Capture the metrics: they go in findings.md (Task 8).

[ ] Step 3: Quick sanity check on the parquet

Run:

uv run --with polars python -c "
import polars as pl
df = pl.read_parquet('research/2026-05-28-phon-126-feature-vector-graded-error/pair_costs.parquet')
print(df.sort('cos_dist'))
"

Expected: 28 rows (12 variants + 16 errors), sorted by cos_dist ascending. If the hypothesis holds, variants cluster at the low end and errors at the high end; heavy interleaving is evidence against it.

[ ] Step 4: Commit

git add research/2026-05-28-phon-126-feature-vector-graded-error/run_pair_level.py
git commit -m "research(phon-126): run_pair_level.py — per-pair cos_dist + 3 diagnostic metrics"

(.parquet is gitignored — no need to add it.)

Task 6: run_word_level.py

Files:
- Create: research/2026-05-28-phon-126-feature-vector-graded-error/run_word_level.py
- Writes: research/2026-05-28-phon-126-feature-vector-graded-error/word_costs.parquet
- Writes: research/2026-05-28-phon-126-feature-vector-graded-error/word_costs.png

53 hardcoded CMU-style words with their IPA strings; the four that contain diphthongs are pruned at load time, leaving 49. For each word: apply the first variant pair in the inventory that matches a phone in the word, and likewise the first matching error pair, replacing the first occurrence of the canonical phone each time. Each corrupted version that results (variant, error, or both) is scored with WPER + binary PER against the clean canonical.

[ ] Step 1: Write the script

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "matplotlib>=3.7",
#   "numpy>=1.24",
#   "polars>=0.20",
# ]
# ///
"""
PHON-126: Word-level WPER vs binary PER on synthetic-corrupted canonical strings.

For each canonical word, attempt to apply ONE variant substitution and ONE error
substitution from the inventory. For each successful application, compute WPER
and binary PER between the corrupted form and the clean canonical, classified
as 'variant' or 'error'. Output:

  - word_costs.parquet — one row per (word, applied_pair, class)
  - word_costs.png    — side-by-side WPER and PER distributions by class
"""
from __future__ import annotations

import random
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import polars as pl

sys.path.insert(0, str(Path(__file__).resolve().parent))

from inventory import ERRORS, VARIANTS, Pair  # noqa: E402
from wper import score  # noqa: E402

OUT_PARQUET = Path(__file__).resolve().parent / "word_costs.parquet"
OUT_PNG = Path(__file__).resolve().parent / "word_costs.png"

# 53 common English words with hand-verified IPA (CMU-style, US English).
# Selected to give the inventory plenty of substitution targets. The four
# diphthong words are pruned below, leaving 49.
WORDS: list[tuple[str, list[str]]] = [
    ("cat", ["k", "æ", "t"]),
    ("dog", ["d", "ɔ", "ɡ"]),
    ("fish", ["f", "ɪ", "ʃ"]),
    ("ship", ["ʃ", "ɪ", "p"]),
    ("kite", ["k", "aɪ", "t"]),  # we'll skip diphthongs; this line gets pruned below
    ("rabbit", ["ɹ", "æ", "b", "ɪ", "t"]),
    ("zebra", ["z", "i", "b", "ɹ", "ə"]),
    ("snake", ["s", "n", "e", "k"]),
    ("frog", ["f", "ɹ", "ɔ", "ɡ"]),
    ("goat", ["ɡ", "o", "t"]),
    ("bird", ["b", "ɝ", "d"]),
    ("duck", ["d", "ʌ", "k"]),
    ("horse", ["h", "ɔ", "ɹ", "s"]),
    ("mouse", ["m", "aʊ", "s"]),  # diphthong; will be pruned
    ("pig", ["p", "ɪ", "ɡ"]),
    ("cow", ["k", "aʊ"]),         # diphthong; will be pruned
    ("sheep", ["ʃ", "i", "p"]),
    ("yes", ["j", "ɛ", "s"]),
    ("no", ["n", "o"]),
    ("hot", ["h", "ɑ", "t"]),
    ("cold", ["k", "o", "l", "d"]),
    ("big", ["b", "ɪ", "ɡ"]),
    ("small", ["s", "m", "ɔ", "l"]),
    ("red", ["ɹ", "ɛ", "d"]),
    ("blue", ["b", "l", "u"]),
    ("green", ["ɡ", "ɹ", "i", "n"]),
    ("happy", ["h", "æ", "p", "i"]),
    ("sad", ["s", "æ", "d"]),
    ("run", ["ɹ", "ʌ", "n"]),
    ("jump", ["dʒ", "ʌ", "m", "p"]),
    ("sleep", ["s", "l", "i", "p"]),
    ("eat", ["i", "t"]),
    ("drink", ["d", "ɹ", "ɪ", "ŋ", "k"]),
    ("walk", ["w", "ɔ", "k"]),
    ("talk", ["t", "ɔ", "k"]),
    ("sing", ["s", "ɪ", "ŋ"]),
    ("dance", ["d", "æ", "n", "s"]),
    ("book", ["b", "ʊ", "k"]),
    ("table", ["t", "e", "b", "ə", "l"]),
    ("water", ["w", "ɔ", "t", "ɝ"]),
    ("milk", ["m", "ɪ", "l", "k"]),
    ("juice", ["dʒ", "u", "s"]),
    ("apple", ["æ", "p", "ə", "l"]),
    ("banana", ["b", "ə", "n", "æ", "n", "ə"]),
    ("orange", ["ɔ", "ɹ", "ɪ", "n", "dʒ"]),
    ("thing", ["θ", "ɪ", "ŋ"]),
    ("this", ["ð", "ɪ", "s"]),
    ("that", ["ð", "æ", "t"]),
    ("five", ["f", "aɪ", "v"]),  # diphthong; will be pruned
    ("zero", ["z", "i", "ɹ", "o"]),
    ("very", ["v", "ɛ", "ɹ", "i"]),
    ("show", ["ʃ", "o"]),
    ("measure", ["m", "ɛ", "ʒ", "ɝ"]),
]

# Drop any words containing phones not in the vector set (diphthongs etc.).
from similarity import VECTORS  # noqa: E402

WORDS = [(w, ph) for (w, ph) in WORDS if all(p in VECTORS for p in ph)]


def apply_pair(phones: list[str], pair: Pair) -> list[str] | None:
    """Return phones with the FIRST occurrence of pair.canonical replaced by pair.substitute."""
    for i, ph in enumerate(phones):
        if ph == pair.canonical:
            return phones[:i] + [pair.substitute] + phones[i + 1 :]
    return None


def main() -> None:
    random.seed(0)
    rows = []
    for word, canonical_phones in WORDS:
        # Pick first applicable variant pair (if any)
        for pair in VARIANTS:
            corrupted = apply_pair(canonical_phones, pair)
            if corrupted is not None:
                wper, per, _ = score(corrupted, canonical_phones)
                rows.append(
                    {
                        "word": word,
                        "canonical": "".join(canonical_phones),
                        "corrupted": "".join(corrupted),
                        "pair_label": pair.label,
                        "class": "variant",
                        "severity_rank": pair.severity_rank,
                        "wper": wper,
                        "binary_per": per,
                        "n_phones": len(canonical_phones),
                    }
                )
                break
        # Pick first applicable error pair (if any)
        for pair in ERRORS:
            corrupted = apply_pair(canonical_phones, pair)
            if corrupted is not None:
                wper, per, _ = score(corrupted, canonical_phones)
                rows.append(
                    {
                        "word": word,
                        "canonical": "".join(canonical_phones),
                        "corrupted": "".join(corrupted),
                        "pair_label": pair.label,
                        "class": "error",
                        "severity_rank": pair.severity_rank,
                        "wper": wper,
                        "binary_per": per,
                        "n_phones": len(canonical_phones),
                    }
                )
                break

    df = pl.DataFrame(rows)
    df.write_parquet(OUT_PARQUET)

    variant_wper = df.filter(pl.col("class") == "variant")["wper"].to_numpy()
    error_wper = df.filter(pl.col("class") == "error")["wper"].to_numpy()
    variant_per = df.filter(pl.col("class") == "variant")["binary_per"].to_numpy()
    error_per = df.filter(pl.col("class") == "error")["binary_per"].to_numpy()

    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
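    # set_xticklabels avoids boxplot's `labels` keyword, which newer Matplotlib
    # releases deprecate in favor of `tick_labels`.
    axes[0].boxplot([variant_wper, error_wper])
    axes[0].set_xticklabels(["variant", "error"])
    axes[0].set_title("WPER (cos-dist substitution cost)")
    axes[0].set_ylabel("rate")
    axes[1].boxplot([variant_per, error_per])
    axes[1].set_xticklabels(["variant", "error"])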
    axes[1].set_title("Binary PER")
    fig.suptitle(f"PHON-126: word-level corruption (n={len(rows)} corruptions)")
    fig.tight_layout()
    fig.savefig(OUT_PNG, dpi=120)

    print(f"== PHON-126 Word-Level Results ==")
    print(f"  variant corruptions: n={len(variant_wper)}, "
          f"WPER mean={variant_wper.mean():.3f}, PER mean={variant_per.mean():.3f}")
    print(f"  error corruptions:   n={len(error_wper)}, "
          f"WPER mean={error_wper.mean():.3f}, PER mean={error_per.mean():.3f}")
    print(f"  WPER variant/error mean ratio: "
          f"{variant_wper.mean() / max(error_wper.mean(), 1e-9):.3f} "
          f"(closer to 0 = better separation)")
    print()
    # Print full paths: relative_to(Path.cwd()) raises ValueError when the
    # outputs are not under the current working directory.
    print(f"Word costs written → {OUT_PARQUET}")
    print(f"Plot written       → {OUT_PNG}")


if __name__ == "__main__":
    main()

[ ] Step 2: Run the script

Run: uv run research/2026-05-28-phon-126-feature-vector-graded-error/run_word_level.py

Expected: parquet + PNG written, summary stats printed.

[ ] Step 3: Inspect the plot

Open research/2026-05-28-phon-126-feature-vector-graded-error/word_costs.png and eyeball:
- WPER pane: if the hypothesis holds, the variant boxplot sits lower than the error boxplot.
- Binary PER pane: variant and error should look similar. Binary PER cannot distinguish them, because every corruption is a single substitution and scores 1 / n_phones regardless of class.
- The contrast between the two panes is the spike's headline visual.
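The reason the two panes differ comes down to one line of arithmetic per metric. A sketch with hypothetical distances (the 0.05 and 0.60 values are made up for illustration, not spike results):

```python
# Hypothetical per-pair distances, standing in for cos_dist values.
toy_cos_dist = {"variant": 0.05, "error": 0.60}

n_phones = 3  # a three-phone word with exactly one substituted phone

results = {
    cls: {
        "per": 1 / n_phones,      # one substitution counts as one full error
        "wper": dist / n_phones,  # the same substitution, weighted by distance
    }
    for cls, dist in toy_cos_dist.items()
}

for cls, r in results.items():
    print(f"{cls:8s} PER={r['per']:.3f}  WPER={r['wper']:.3f}")
```

Both classes get the same binary PER, so any spread in that pane reflects word length only, while the WPER pane spreads exactly as far as the underlying distances do.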

[ ] Step 4: Commit

git add research/2026-05-28-phon-126-feature-vector-graded-error/run_word_level.py
git commit -m "research(phon-126): run_word_level.py — synthetic-corruption WPER vs binary PER"

Task 7: percept_check.py (optional sanity check)

Files:
- Create: research/2026-05-28-phon-126-feature-vector-graded-error/percept_check.py
- Writes (if drive mounted): research/2026-05-28-phon-126-feature-vector-graded-error/inventory_coverage.parquet

Mines (canonical, actual) substitution frequencies from PERCEPT to verify the inventory pairs actually occur. Exits gracefully (warning, exit 0) if the external drive isn't mounted.
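The counting step can be sketched on hand-written alignments in the `(op, pred_phone, canonical_phone)` shape that `wper.score` returns. The three utterances below are hypothetical and are not taken from PERCEPT.

```python
from collections import Counter

# Hand-written alignments; each inner list is one utterance.
alignments = [
    [("sub", "t", "s"), ("match", "ʌ", "ʌ"), ("match", "p", "p")],
    [("sub", "t", "s"), ("match", "ɪ", "ɪ"), ("del", "", "t")],
    [("sub", "w", "ɹ"), ("match", "ɛ", "ɛ"), ("match", "d", "d")],
]

sub_counts = Counter()
for alignment in alignments:
    for op, pred_phone, canonical_phone in alignment:
        if op == "sub":
            # Key by (canonical, actual) so it lines up with inventory pairs.
            sub_counts[(canonical_phone, pred_phone)] += 1

print(sub_counts.most_common())
```

Looking up an inventory pair is then `sub_counts.get((pair.canonical, pair.substitute), 0)`; deletions and insertions never enter the table, so a process that mostly surfaces as omission will show a low count here.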

[ ] Step 1: Write the script

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "numpy>=1.24",
#   "polars>=0.20",
# ]
# ///
"""
PHON-126: PERCEPT sanity check.

Aligns canonical (model_phonology) vs actual (actual_phonology) phone strings
in /Volumes/ExternalData1/phonbank/dataset_production.jsonl, counts substitution
pairs, and reports per-inventory-pair occurrence counts in PERCEPT data.

Output: inventory_coverage.parquet
Skippable: if drive isn't mounted, prints warning and exits 0.
"""
from __future__ import annotations

import json
import sys
from collections import Counter
from pathlib import Path

import polars as pl

sys.path.insert(0, str(Path(__file__).resolve().parent))

from inventory import ALL_PAIRS  # noqa: E402
from wper import score  # noqa: E402

JSONL_PATH = Path("/Volumes/ExternalData1/phonbank/dataset_production.jsonl")
OUT_PATH = Path(__file__).resolve().parent / "inventory_coverage.parquet"


def iter_pairs(path: Path):
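    # Explicit UTF-8: the phone strings are IPA, so do not rely on the
    # platform default encoding.
    with path.open(encoding="utf-8") as f: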
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue
            canonical = rec.get("model_phonology") or []
            actual = rec.get("actual_phonology") or []
            if not canonical or not actual:
                continue
            yield canonical, actual


def main() -> None:
    if not JSONL_PATH.exists():
        print(f"[skip] PERCEPT dataset not found at {JSONL_PATH} "
              f"(external drive not mounted) — sanity check skipped.")
        return

    sub_counts: Counter[tuple[str, str]] = Counter()
    n_utts = 0
    for canonical, actual in iter_pairs(JSONL_PATH):
        n_utts += 1
        _, _, alignment = score(actual, canonical)
        for op, pred_ph, can_ph in alignment:
            if op == "sub":
                sub_counts[(can_ph, pred_ph)] += 1

    rows = []
    for p in ALL_PAIRS:
        count = sub_counts.get((p.canonical, p.substitute), 0)
        rows.append(
            {
                "canonical": p.canonical,
                "substitute": p.substitute,
                "label": p.label,
                "class": "variant" if p.severity_rank == 1 else "error",
                "percept_count": count,
            }
        )

    df = pl.DataFrame(rows).sort("percept_count", descending=True)
    df.write_parquet(OUT_PATH)

    print("== PHON-126 PERCEPT Sanity ==")
    print(f"  Aligned {n_utts} utterances.")
    print("  Inventory coverage (top 10 by PERCEPT count):")
    print(df.head(10))
    n_missing = df.filter(pl.col("percept_count") == 0).height
    print(f"  Inventory pairs with 0 PERCEPT occurrences: {n_missing}/{len(rows)}")
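    # Print the path as-is: relative_to(Path.cwd()) raises ValueError when the
    # script is run from outside the repository root.
    print(f"Output → {OUT_PATH}")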


if __name__ == "__main__":
    main()

[ ] Step 2: Run the script (or skip if drive not mounted)

Run: uv run research/2026-05-28-phon-126-feature-vector-graded-error/percept_check.py

Expected output if drive is mounted: per-pair occurrence counts printed, parquet written.

Expected output if drive is NOT mounted: [skip] PERCEPT dataset not found at ... (exit 0). That's fine. Note the skip in findings.md.

If the script runs but is slow: 77K utterances × Levenshtein DP can take a few minutes. Acceptable.
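If the run time does become a problem, one possible mitigation (an assumption, not part of the plan) is to skip the DP for utterances whose canonical and actual sequences are identical, since they cannot contribute substitutions. In the sketch below, `count_subs_fast` and `positional_align` are hypothetical names; `positional_align` is a stand-in for `wper.score` that only handles equal-length sequences.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence

Alignment = list[tuple[str, str, str]]  # (op, pred_phone, canonical_phone)


def count_subs_fast(
    pairs: Iterable[tuple[Sequence[str], Sequence[str]]],
    align: Callable[[Sequence[str], Sequence[str]], Alignment],
) -> tuple[Counter, int, int]:
    """Count substitutions, aligning only pairs that actually differ.

    Returns (counts, n_total, n_aligned).
    """
    counts: Counter = Counter()
    n_total = 0
    n_aligned = 0
    for canonical, actual in pairs:
        n_total += 1
        if list(canonical) == list(actual):
            continue  # identical sequences contain no substitutions
        n_aligned += 1
        for op, pred_ph, can_ph in align(actual, canonical):
            if op == "sub":
                counts[(can_ph, pred_ph)] += 1
    return counts, n_total, n_aligned


def positional_align(pred: Sequence[str], canonical: Sequence[str]) -> Alignment:
    """Stand-in aligner for equal-length sequences only (not Levenshtein)."""
    return [("match" if p == c else "sub", p, c) for p, c in zip(pred, canonical)]


demo_pairs = [
    (["k", "æ", "t"], ["k", "æ", "t"]),  # identical, skipped
    (["ɹ", "ɛ", "d"], ["w", "ɛ", "d"]),  # one substitution
    (["s", "ʌ", "n"], ["s", "ʌ", "n"]),  # identical, skipped
]
counts, n_total, n_aligned = count_subs_fast(demo_pairs, positional_align)
print(f"aligned {n_aligned} of {n_total} utterances; counts={dict(counts)}")
```

The substitution counts are unchanged by the shortcut; how much time it saves depends on how many utterances in the dataset are produced correctly.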

[ ] Step 3: Commit

git add research/2026-05-28-phon-126-feature-vector-graded-error/percept_check.py
git commit -m "research(phon-126): percept_check.py — sanity-check inventory pairs in PERCEPT"

Task 8: findings.md + run experiment end-to-end + Jira update¶

Files:

- Create: research/2026-05-28-phon-126-feature-vector-graded-error/findings.md
- Update: Jira PHON-126 (transition to Done with verdict comment)

[ ] Step 1: Re-run all scripts in order, capturing output

uv run research/2026-05-28-phon-126-feature-vector-graded-error/similarity.py
uv run research/2026-05-28-phon-126-feature-vector-graded-error/wper.py
uv run research/2026-05-28-phon-126-feature-vector-graded-error/inventory.py
uv run research/2026-05-28-phon-126-feature-vector-graded-error/run_pair_level.py
uv run research/2026-05-28-phon-126-feature-vector-graded-error/run_word_level.py
uv run research/2026-05-28-phon-126-feature-vector-graded-error/percept_check.py  # optional

Save the printed metrics — they go in the findings doc.
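One way to keep the printed metrics is a small wrapper that echoes each script's output and appends it to a log file. This is a sketch under assumptions: `run_and_log`, `run_all`, and the log file name are hypothetical and not part of the plan, and `run_all` expects to be called from the repository root with `uv` on the PATH.

```python
import subprocess
from pathlib import Path

RESEARCH_DIR = "research/2026-05-28-phon-126-feature-vector-graded-error"
SCRIPTS = [
    "similarity.py",
    "wper.py",
    "inventory.py",
    "run_pair_level.py",
    "run_word_level.py",
    "percept_check.py",  # optional: prints [skip] if the drive is not mounted
]


def run_and_log(cmd: list[str], log_path: Path) -> int:
    """Run cmd, echo its combined stdout/stderr, append it to log_path,
    and return the exit code."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        encoding="utf-8",
        errors="replace",
    )
    print(result.stdout, end="")
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"$ {' '.join(cmd)}\n{result.stdout}\n")
    return result.returncode


def run_all(log_path: Path) -> None:
    """Run every script in order, stopping at the first non-zero exit code."""
    for script in SCRIPTS:
        code = run_and_log(["uv", "run", f"{RESEARCH_DIR}/{script}"], log_path)
        if code != 0:
            raise SystemExit(f"{script} exited with code {code}")
```

Calling `run_all(Path("run_output.log"))` leaves a single file to copy numbers from when filling in the findings template.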

[ ] Step 2: Write findings.md from the captured numbers

Template (fill in the bracketed […] placeholders with actual values from Step 1):

# PHON-126 — Findings

**Date:** 2026-05-28
**Spec:** [`../../docs/superpowers/specs/2026-05-28-phon-126-feature-vector-graded-error-design.md`](../../docs/superpowers/specs/2026-05-28-phon-126-feature-vector-graded-error-design.md)
**Ticket:** [PHON-126](https://neumannsworkshop.atlassian.net/browse/PHON-126)
**Parent:** [PHON-44 Audio](https://neumannsworkshop.atlassian.net/browse/PHON-44)

## Question

Do PhonoLex's learned 26-d articulatory feature vectors give a graded phoneme-substitution distance that separates accent / variant substitutions from clinical SSD errors?

## Data

- Inventory: 12 variant-class pairs (Wells; vowel mergers, L2, dialectal) + 16 error-class pairs (Hodson/Bernthal SSD processes), severity-ranked 1–4.
- Vectors: `packages/features/outputs/vectors.csv`, 26-d Bayesian-learned articulatory features.
- PERCEPT sanity: `[ran / skipped — drive not mounted]`.
- Word-level corruptions: 50 canonical English words (CMU IPA), one variant- and one error-class corruption per word where applicable.

## Method

- `cos_dist(p₁, p₂) = clip(1 − cos_sim(v₁, v₂), 0, 1)`
- WPER = Levenshtein DP with substitution cost = `cos_dist`, del/ins = 1, normalized by canonical length.
- See spec §3 for full method.

## Results

### Pair-level (n=28 total: 12 variant + 16 error)

| metric | value |
|---|---|
| variant mean cos_dist | [v_mean] |
| variant 75th percentile | [v_75] |
| error mean cos_dist | [e_mean] |
| error 25th percentile | [e_25] |
| Mann-Whitney U (one-sided variant<error) | U=[U], p=[p_mw] |
| Practical threshold (v_75 < e_25) | [CLEAN / OVERLAP] |
| Spearman ρ (severity_rank vs cos_dist) | ρ=[rho], p=[p_sp] |

### Word-level (n=[n_word] corruptions across 50 words)

| metric | variant | error |
|---|---|---|
| WPER mean | [w_v_mean] | [w_e_mean] |
| binary PER mean | [p_v_mean] | [p_e_mean] |
| WPER variant/error mean ratio | [ratio] | |

Plot: `word_costs.png` — WPER pane shows separation; binary PER pane is flat across classes by definition (both are single-position substitutions).

### PERCEPT sanity check

`[ran / skipped]`.

If ran: [N inventory pairs of 28 had ≥1 occurrence in PERCEPT; top covered pairs were …]

## Verdict

**[Pass / Pass-with-calibration / Fail]**

Reasoning: [1-3 sentences citing the three metrics — does the geometry cleanly separate variant from error? Does a threshold exist? Does the metric rank by severity?]

## Implications for PHON-53

- **If Pass:** use `cos_dist(p_pred, p_canonical)` directly as the substitution cost in the PHON-53 error layer. WPER replaces binary Levenshtein. Variant-tolerance is in the metric, not a separate accent-detection branch. **Moat:** the similarity matrix is an asset we already own.
- **If Pass-with-calibration:** file follow-up to learn per-phoneme weights or a threshold on top of `cos_dist`. Probable approach: linear calibration from a small SLP-adjudicated set.
- **If Fail:** the symbolic-feature vectors don't transfer to acoustic graded error scoring. Need an acoustic-grounded similarity matrix (Berkeley-style; train from PERCEPT-R graded ratings). Reroutes the PHON-53 error layer significantly.

## Caveats

- **Severity rank subjectivity.** `severity_rank` is hand-assigned; Spearman ρ is the most subjective of the three metrics. Report the other two as primary.
- **Inventory bias.** Textbook-curated — the PERCEPT sanity check guards against testing on substitutions that never occur, but doesn't eliminate selection bias toward well-documented processes.
- **Synthetic single-position corruption.** Word-level test applies one substitution per word; real disordered speech often has multiple co-occurring processes. This is the cheap-probe form and is acknowledged as such in the spec.
- **Vector set coverage.** Tap `ɾ` is not in our vector set, so tap-based accent variants (American /t/ flap, Spanish /ɾ/) are not represented. Future inventory expansion would need vector coverage first.

## Follow-ups

- [PHON-126b candidate] SLP-adjudicated face-validity check on real PHON-55 inference outputs, if the spike passes.
- [PHON-53] Bake `cos_dist` into the error layer per the verdict.
- (If Pass-with-calibration) File a calibration ticket.
- (If Fail) File an acoustic-grounded similarity matrix ticket as PHON-53 unblocker.

[ ] Step 3: Commit findings.md

git add research/2026-05-28-phon-126-feature-vector-graded-error/findings.md
git commit -m "research(phon-126): findings.md — verdict and PHON-53 implications"

[ ] Step 4: Update Jira PHON-126

Add a comment on PHON-126 with the verdict + a link to research/2026-05-28-phon-126-feature-vector-graded-error/findings.md. Transition status: Backlog → Done.

(Use the Atlassian MCP — mcp__plugin_atlassian_atlassian__addCommentToJiraIssue and mcp__plugin_atlassian_atlassian__transitionJiraIssue.)

Comment body template:

Verdict: [Pass / Pass-with-calibration / Fail].

Pair-level (n=28): variant mean cos_dist = [v_mean], error mean = [e_mean]; Mann-Whitney U one-sided p = [p_mw]; practical threshold v75 [v_75] vs e25 [e_25] → [CLEAN / OVERLAP]; Spearman ρ = [rho].

Word-level (n=[n_word]): WPER variant/error ratio = [ratio].

Findings: research/2026-05-28-phon-126-feature-vector-graded-error/findings.md

Also link the findings on PHON-53 (Backlog) so the next person picking up PHON-53 can see the error-layer verdict.
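If it helps to avoid transcription slips, the comment body can be rendered from the captured numbers rather than typed by hand. The sketch below is an assumption-laden convenience, not part of the plan: `render_comment` and the metric key names are hypothetical, and the values in `placeholder_metrics` are dummy numbers that must be replaced with the real output from Step 1.

```python
FINDINGS_PATH = "research/2026-05-28-phon-126-feature-vector-graded-error/findings.md"


def render_comment(verdict: str, m: dict) -> str:
    """Fill the Jira comment template from a dict of captured metrics."""
    # Practical threshold as defined in the findings template: v_75 < e_25.
    threshold = "CLEAN" if m["v_75"] < m["e_25"] else "OVERLAP"
    return "\n\n".join(
        [
            f"Verdict: {verdict}.",
            f"Pair-level (n=28): variant mean cos_dist = {m['v_mean']:.3f}, "
            f"error mean = {m['e_mean']:.3f}; "
            f"Mann-Whitney U one-sided p = {m['p_mw']:.3g}; "
            f"practical threshold v75 {m['v_75']:.3f} vs e25 {m['e_25']:.3f} "
            f"→ {threshold}; Spearman ρ = {m['rho']:.3f}.",
            f"Word-level (n={m['n_word']}): "
            f"WPER variant/error ratio = {m['ratio']:.3f}.",
            f"Findings: {FINDINGS_PATH}",
        ]
    )


# Dummy numbers for illustration only; they are NOT results from the spike.
placeholder_metrics = {
    "v_mean": 0.0, "e_mean": 0.0, "p_mw": 1.0,
    "v_75": 0.0, "e_25": 0.0, "rho": 0.0,
    "n_word": 0, "ratio": 0.0,
}
print(render_comment("[Pass / Pass-with-calibration / Fail]", placeholder_metrics))
```

The verdict itself stays a human judgment passed in as a string; only the CLEAN / OVERLAP label is derived mechanically from the two percentiles.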

[ ] Step 5: Push branch and prepare PR

git push -u origin feat/phon-126-feature-vector-graded-error

PR target: release/v6-audio (not develop directly — v6-audio is the integration branch).

PR title: research(phon-126): feature-vector graded-error spike

PR body: link to spec + findings.md, summarize verdict in 1-2 sentences.

Self-Review¶

1. Spec coverage:

- Spec §3.1 (synthetic GT) → Task 4 inventory.
- Spec §3.2 (textbook + PERCEPT sanity) → Task 4 (textbook) + Task 7 (PERCEPT).
- Spec §3.3 (cos_sim / cos_dist) → Task 2.
- Spec §3.4 (Levenshtein DP) → Task 3.
- Spec §3.5 (three diagnostic metrics) → Task 5.
- Spec §4 components → Tasks 1-7 covered all 7 files.
- Spec §5 Done definition (scripts run end-to-end, findings.md, Jira) → Task 8.
- Spec §7 risks (PERCEPT mount, severity subjectivity) → handled in Task 7 skip-path + findings.md caveats.

All spec requirements have at least one task. No gaps.

2. Placeholder scan: No "TBD" / "TODO" / "add appropriate error handling" / "similar to Task N" patterns. The findings.md template has […] placeholders, which are explicitly the "fill in after run" step. These are not plan failures; filling them in is the intended workflow for findings docs.

3. Type consistency:

- cos_dist(p1, p2) -> float in Task 2; consumed by Task 3 (wper.score), Task 5, Task 6, Task 7. ✓
- score(pred, canonical) -> tuple[float, float, list[tuple[str, str, str]]] in Task 3; consumed by Task 6 and Task 7 (alignment traversal). ✓
- Pair dataclass in Task 4 with fields canonical, substitute, label, severity_rank, source; consumed by Tasks 5 and 6. ✓
- VARIANTS, ERRORS, ALL_PAIRS lists from Task 4; consumed by Tasks 5, 6, 7. ✓

No inconsistencies.