PHON-129: Model #2 L2/Accent Pronunciation Scorer Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Ship POST /api/audio/pronounce (target word + audio → per-position cos_dist score, overall score, variant/error class) plus a /dev/pronounce viewer page, validated on L2-ARCTIC.

Architecture: Scoring runs in the Worker (Approach A). The route sub-calls the existing local phonolex_audio inference server (FastAPI; /transcribe and /compare, reached via AUDIO_INFERENCE_URL, already wired) to get the produced phonemes, looks up the canonical phonemes in the D1 words table, and computes the WPER alignment with substitution cost cos_dist = clip(1 − cosine, 0, 1) over the learned feature vectors, using the PhonemeCache (norms + dots) that the similarity route already loads. The metric is PHON-126's, ported to TS and pinned to a frozen fixture to catch drift. The frontend is a dev page mirroring AudioTranscribeViewer.tsx, not the eventual unified tool. (RunPod is not used: the Worker is host-agnostic, and the production host is a later deploy decision.)
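
To make the arithmetic concrete, here is a minimal sketch of how a cosine can be recovered from precomputed squared norms and pairwise dot products and then clipped into cos_dist. The SketchCache shape, the sorted "a,b" key format, and the function names are assumptions made for illustration; the real PhonemeCache and phonemeCosine live in similarity.ts and may differ.

```typescript
// Hypothetical cache shape for illustration only; the real PhonemeCache is in similarity.ts.
interface SketchCache {
  normSq: Map<string, number>; // squared L2 norm per phoneme
  dots: Map<string, number>;   // dot product per unordered pair, keyed "a,b" with a <= b (assumed)
}

function sketchCosine(p1: string, p2: string, cache: SketchCache): number {
  if (p1 === p2) return 1; // identical symbols: cosine is 1 by definition
  const n1 = cache.normSq.get(p1);
  const n2 = cache.normSq.get(p2);
  const key = p1 < p2 ? `${p1},${p2}` : `${p2},${p1}`;
  const dot = cache.dots.get(key);
  // Unknown phoneme or zero vector: treated as maximally dissimilar (a choice made for this sketch).
  if (n1 === undefined || n2 === undefined || dot === undefined || n1 === 0 || n2 === 0) {
    return 0;
  }
  return dot / Math.sqrt(n1 * n2);
}

// cos_dist = clip(1 - cosine, 0, 1)
function sketchCosDist(p1: string, p2: string, cache: SketchCache): number {
  return Math.min(1, Math.max(0, 1 - sketchCosine(p1, p2, cache)));
}

const demoCache: SketchCache = {
  normSq: new Map([['a', 4], ['b', 1]]),
  dots: new Map([['a,b', 1.6]]),
};
console.log(sketchCosDist('a', 'b', demoCache));
```

With squared norms 4 and 1 and a dot product of 1.6, the cosine is 1.6 / sqrt(4 × 1) = 0.8, so the logged cos_dist is approximately 0.2.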

Tech Stack: TypeScript, Hono (Cloudflare Workers), D1 (SQLite), Vitest (@cloudflare/vitest-pool-workers + cloudflare:test), React + MUI + React Router, Python (validation harness, Polars/NumPy).

Spec: docs/superpowers/specs/2026-06-05-phon-129-l2-accent-scorer-design.md
Grounding: research/2026-06-05-phon-129-l2-accent-scorer/FLEGE_SLM.md

File Structure

Worker (scoring + route):
- Create packages/web/workers/src/lib/pronunciationScore.ts: pure scoring (cos_dist, WPER align + traceback, classify, overall). One responsibility: turn (produced, canonical, cache) into the score object.
- Modify packages/web/workers/src/lib/similarity.ts: export phonemeCosine (currently module-private) so the scorer reuses it. No logic change.
- Modify packages/web/workers/src/routes/audio.ts: add the pronounce handler (transcribe sub-call + D1 canonical lookup + cache load + score). Stays in audio.ts; it's one more handler in the audio surface.
- Create packages/web/workers/src/__tests__/pronunciationScore.test.ts: unit tests (synthetic cache, no D1).
- Create packages/web/workers/src/__tests__/pronounce-fixture.test.ts: PHON-126 drift fixture test.
- Modify packages/web/workers/src/__tests__/audio.test.ts: add /api/audio/pronounce route tests.

Frontend (dev page):
- Modify packages/web/frontend/src/services/audioApi.ts: add pronounceAudio() + PronunciationResult type.
- Create packages/web/frontend/src/services/audioApi.pronounce.test.ts: service tests (see Task 6).
- Create packages/web/frontend/src/components/tools/PronunciationViewer.tsx: dev page.
- Create packages/web/frontend/src/components/tools/PronunciationViewer.test.tsx: component tests.
- Modify packages/web/frontend/src/main.tsx: register /dev/pronounce route.

Validation harness (research, not shipped):
- Create research/2026-06-05-phon-129-l2-accent-scorer/gen_score_fixtures.py: emits the drift fixture from vectors.csv.
- Create research/2026-06-05-phon-129-l2-accent-scorer/score_fixtures.json: generated artifact (committed).
- Create research/2026-06-05-phon-129-l2-accent-scorer/01_run_l2arctic.py: transcribe + score L2-ARCTIC.
- Create research/2026-06-05-phon-129-l2-accent-scorer/02_metrics.py: PHON-126 metrics, pooled + per-L1.
- Create research/2026-06-05-phon-129-l2-accent-scorer/RESULTS.md: GO/NO-GO write-up.

Task 1: Export `phonemeCosine` from similarity.ts

Files:
- Modify: packages/web/workers/src/lib/similarity.ts (the phonemeCosine function declaration)

- [ ] Step 1: Export the function

In packages/web/workers/src/lib/similarity.ts, change the declaration from module-private to exported. Find:

function phonemeCosine(p1: string, p2: string, cache: PhonemeCache): number {

Change to:

export function phonemeCosine(p1: string, p2: string, cache: PhonemeCache): number {

- [ ] Step 2: Verify the workers package still type-checks

Run: cd packages/web/workers && npx tsc --noEmit
Expected: no errors.

- [ ] Step 3: Commit

git add packages/web/workers/src/lib/similarity.ts
git commit -m "refactor(phon-129): export phonemeCosine for reuse by the scorer"

Task 2: Scoring module (cos_dist + WPER alignment with traceback)

Files:
- Create: packages/web/workers/src/lib/pronunciationScore.ts
- Test: packages/web/workers/src/__tests__/pronunciationScore.test.ts

- [ ] Step 1: Write the failing test

Create packages/web/workers/src/__tests__/pronunciationScore.test.ts. The synthetic cache encodes three phonemes: a and b are close (cosine 0.8 → cos_dist 0.2), while c is orthogonal to both (cosine 0.0 → cos_dist 1.0). Norms are 1 (unit vectors), so cosine = dot.

import { describe, it, expect } from 'vitest';
import type { PhonemeCache } from '../lib/similarity';
import { cosDist, alignWPER } from '../lib/pronunciationScore';

// Unit-norm synthetic cache: dot == cosine. a·b = 0.8, a·c = 0.0, b·c = 0.0
function synthCache(): PhonemeCache {
  const normSq = new Map<string, number>([['a', 1], ['b', 1], ['c', 1]]);
  const dots = new Map<string, number>([['a,b', 0.8], ['a,c', 0.0], ['b,c', 0.0]]);
  return { normSq, dots };
}

describe('cosDist', () => {
  it('is 0 for identical phonemes', () => {
    expect(cosDist('a', 'a', synthCache())).toBeCloseTo(0, 6);
  });
  it('is 1 - cosine for a near pair, clamped to [0,1]', () => {
    expect(cosDist('a', 'b', synthCache())).toBeCloseTo(0.2, 6);
  });
  it('is 1 for an orthogonal pair', () => {
    expect(cosDist('a', 'c', synthCache())).toBeCloseTo(1.0, 6);
  });
});

describe('alignWPER', () => {
  it('scores a perfect match as wper 0, all positions match', () => {
    const r = alignWPER(['a', 'b'], ['a', 'b'], synthCache());
    expect(r.wper).toBeCloseTo(0, 6);
    expect(r.perPosition.map((p) => p.op)).toEqual(['match', 'match']);
    expect(r.insertions).toEqual([]);
  });

  it('records a substitution with its cos_dist at the canonical position', () => {
    // canonical [a,a] vs produced [a,b] → pos2 sub a→b, cos_dist 0.2
    const r = alignWPER(['a', 'a'], ['a', 'b'], synthCache());
    expect(r.perPosition[0]).toMatchObject({ canonical: 'a', produced: 'a', op: 'match' });
    expect(r.perPosition[1]).toMatchObject({ canonical: 'a', produced: 'b', op: 'sub' });
    expect(r.perPosition[1].cos_dist).toBeCloseTo(0.2, 6);
    expect(r.wper).toBeCloseTo(0.1, 6); // 0.2 / 2 canonical positions
  });

  it('records a deletion (omitted canonical phone) with cos_dist 1', () => {
    // canonical [a,b] vs produced [a] → b deleted
    const r = alignWPER(['a', 'b'], ['a'], synthCache());
    expect(r.perPosition[1]).toMatchObject({ canonical: 'b', produced: null, op: 'del' });
    expect(r.perPosition[1].cos_dist).toBe(1);
    expect(r.wper).toBeCloseTo(0.5, 6); // one indel / 2 canonical
  });

  it('records an insertion out-of-band keyed to the preceding canonical index', () => {
    // canonical [a] vs produced [a,c] → c inserted after canonical index 0
    const r = alignWPER(['a'], ['a', 'c'], synthCache());
    expect(r.perPosition).toHaveLength(1);
    expect(r.perPosition[0].op).toBe('match');
    expect(r.insertions).toEqual([{ produced: 'c', after_canonical_index: 0 }]);
  });

  it('treats an empty production as full deletion (wper 1)', () => {
    const r = alignWPER(['a', 'b'], [], synthCache());
    expect(r.perPosition.map((p) => p.op)).toEqual(['del', 'del']);
    expect(r.wper).toBeCloseTo(1, 6);
  });
});

- [ ] Step 2: Run test to verify it fails

Run: cd packages/web/workers && npx vitest run src/__tests__/pronunciationScore.test.ts
Expected: FAIL (pronunciationScore.ts does not exist yet, so cosDist cannot be imported).

- [ ] Step 3: Write minimal implementation

Create packages/web/workers/src/lib/pronunciationScore.ts:

/**
 * L2/accent pronunciation scoring — PHON-129 Model #2.
 *
 * WPER alignment of a produced phoneme sequence against the canonical target,
 * substitution cost = cos_dist over the learned feature vectors (PHON-126 metric).
 * Per-position output is keyed to CANONICAL positions (the targets the learner aims
 * at); extra produced phones are reported out-of-band as insertions.
 */
import { phonemeCosine, type PhonemeCache } from './similarity';

export type Op = 'match' | 'sub' | 'del';

export interface PositionScore {
  canonical: string;
  produced: string | null;
  cos_dist: number;
  op: Op;
}

export interface Insertion {
  produced: string;
  after_canonical_index: number;
}

export interface AlignResult {
  perPosition: PositionScore[];
  insertions: Insertion[];
  wper: number;
}

/** cos_dist = clip(1 - cosine, 0, 1). PHON-126's substitution cost. */
export function cosDist(a: string, b: string, cache: PhonemeCache): number {
  return Math.min(1, Math.max(0, 1 - phonemeCosine(a, b, cache)));
}

/**
 * Levenshtein DP over (canonical rows, produced cols) with soft substitution cost.
 * canonical[i-1] vs produced[j-1]. Indel = 1.0. Traceback recovers ops.
 * wper = total cost / canonical length.
 */
export function alignWPER(
  canonical: string[],
  produced: string[],
  cache: PhonemeCache,
): AlignResult {
  const n = canonical.length;
  const m = produced.length;

  const dp: number[][] = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let i = 0; i <= n; i++) dp[i][0] = i;
  for (let j = 0; j <= m; j++) dp[0][j] = j;

  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const sub = dp[i - 1][j - 1] + cosDist(canonical[i - 1], produced[j - 1], cache);
      const del = dp[i - 1][j] + 1; // omit canonical[i-1]
      const ins = dp[i][j - 1] + 1; // extra produced[j-1]
      dp[i][j] = Math.min(sub, del, ins);
    }
  }

  // Traceback from (n, m). Collect canonical-keyed ops + out-of-band insertions.
  const perPositionRev: PositionScore[] = [];
  const insertionsRev: Insertion[] = [];
  let i = n;
  let j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0) {
      const subCost = cosDist(canonical[i - 1], produced[j - 1], cache);
      if (Math.abs(dp[i][j] - (dp[i - 1][j - 1] + subCost)) < 1e-9) {
        perPositionRev.push({
          canonical: canonical[i - 1],
          produced: produced[j - 1],
          cos_dist: subCost,
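          // Compare symbols rather than testing subCost === 0, so float rounding in the
          // cosine cannot turn an identical phone into a 'sub'.
          op: canonical[i - 1] === produced[j - 1] ? 'match' : 'sub',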
        });
        i--; j--;
        continue;
      }
    }
    if (i > 0 && Math.abs(dp[i][j] - (dp[i - 1][j] + 1)) < 1e-9) {
      perPositionRev.push({ canonical: canonical[i - 1], produced: null, cos_dist: 1, op: 'del' });
      i--;
      continue;
    }
    // insertion: extra produced[j-1], lands after canonical index (i-1);
    // an index of -1 means it precedes the first canonical phone.
    insertionsRev.push({ produced: produced[j - 1], after_canonical_index: i - 1 });
    j--;
  }

  perPositionRev.reverse();
  insertionsRev.reverse();
  return {
    perPosition: perPositionRev,
    insertions: insertionsRev,
    wper: n === 0 ? (m === 0 ? 0 : 1) : dp[n][m] / n,
  };
}

- [ ] Step 4: Run test to verify it passes

Run: cd packages/web/workers && npx vitest run src/__tests__/pronunciationScore.test.ts
Expected: PASS (all cases).
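
For intuition about the dynamic programme in Step 3, the following self-contained sketch fills the same kind of table for the substitution test case. The softEditTable and toyCost names are hypothetical, and the toy cost function merely stands in for cosDist with the synthetic cache's values (a/b near at 0.2, everything else 1).

```typescript
// Soft-substitution edit distance: indels cost 1, substitutions cost subCost(c, p) in [0, 1].
function softEditTable(
  canonical: string[],
  produced: string[],
  subCost: (c: string, p: string) => number,
): number[][] {
  const n = canonical.length;
  const m = produced.length;
  const dp: number[][] = [];
  for (let i = 0; i <= n; i++) {
    const row: number[] = new Array<number>(m + 1).fill(0);
    row[0] = i; // deleting i canonical phones
    dp.push(row);
  }
  for (let j = 0; j <= m; j++) dp[0][j] = j; // inserting j produced phones
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j - 1] + subCost(canonical[i - 1], produced[j - 1]), // match or substitute
        dp[i - 1][j] + 1, // delete canonical[i-1]
        dp[i][j - 1] + 1, // insert produced[j-1]
      );
    }
  }
  return dp;
}

// Toy cost mirroring the synthetic cache: a/b are near (0.2), any other distinct pair is far (1).
const toyCost = (c: string, p: string): number => {
  if (c === p) return 0;
  const pair = [c, p].sort().join(',');
  return pair === 'a,b' ? 0.2 : 1;
};

const table = softEditTable(['a', 'a'], ['a', 'b'], toyCost);
const wperDemo = table[2][2] / 2; // total cost divided by canonical length
console.log(table, wperDemo);
```

For canonical [a, a] against produced [a, b] the table is [[0, 1, 2], [1, 0, 1], [2, 1, 0.2]], so the total cost is 0.2 and wper is 0.1, which is the value the substitution unit test expects.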

- [ ] Step 5: Commit

git add packages/web/workers/src/lib/pronunciationScore.ts packages/web/workers/src/__tests__/pronunciationScore.test.ts
git commit -m "feat(phon-129): WPER alignment + cos_dist scoring module"

Task 3: Classification + overall score + top-level `scorePronunciation`

Files:
- Modify: packages/web/workers/src/lib/pronunciationScore.ts
- Modify: packages/web/workers/src/__tests__/pronunciationScore.test.ts

- [ ] Step 1: Write the failing test

Append to packages/web/workers/src/__tests__/pronunciationScore.test.ts (move the import line up to join the file's existing imports):

import { scorePronunciation, VARIANT_ERROR_THRESHOLD } from '../lib/pronunciationScore';

describe('scorePronunciation', () => {
  it('pins the PHON-126 variant/error threshold', () => {
    // The synthetic cache has no sub-threshold pair (a→b is 0.2 > 0.112), so the
    // near-substitution-as-variant path is not exercised here; only the constant is pinned.
    expect(VARIANT_ERROR_THRESHOLD).toBeCloseTo(0.112, 6);
  });

  it('overall_score is 1 - wper', () => {
    const r = scorePronunciation(['a', 'a'], ['a', 'b'], synthCache());
    expect(r.overall_score).toBeCloseTo(0.9, 6); // wper 0.1
  });

  it('word class is error when any substitution meets or exceeds the threshold', () => {
    // a→b cos_dist 0.2 >= 0.112 → that position is an error → word error
    const r = scorePronunciation(['a', 'a'], ['a', 'b'], synthCache());
    expect(r.variant_vs_error_class).toBe('error');
    expect(r.threshold_basis).toBe('l1_agnostic');
  });

  it('word class is variant for a perfect match', () => {
    const r = scorePronunciation(['a', 'b'], ['a', 'b'], synthCache());
    expect(r.variant_vs_error_class).toBe('variant');
    expect(r.overall_score).toBeCloseTo(1, 6);
  });

  it('a deletion forces the word to error', () => {
    const r = scorePronunciation(['a', 'b'], ['a'], synthCache());
    expect(r.variant_vs_error_class).toBe('error');
  });
});

- [ ] Step 2: Run test to verify it fails

Run: cd packages/web/workers && npx vitest run src/__tests__/pronunciationScore.test.ts
Expected: FAIL (scorePronunciation and VARIANT_ERROR_THRESHOLD are not exported yet).

- [ ] Step 3: Write minimal implementation

Append to packages/web/workers/src/lib/pronunciationScore.ts:

/** PHON-126 practical boundary: midpoint of variant-75th (0.102) / error-25th (0.122).
 *  This is the knob SLM-r flags as L1-sensitive (see FLEGE_SLM.md); v6.1 is L1-agnostic. */
export const VARIANT_ERROR_THRESHOLD = 0.112;

export interface PronunciationScore {
  per_position: PositionScore[];
  insertions: Insertion[];
  overall_score: number;
  variant_vs_error_class: 'variant' | 'error';
  threshold_basis: 'l1_agnostic';
}

/** Per-word label = worst position (any error ⇒ word is error). Deletions are errors.
 *  Insertions raise wper (lowering overall_score) but do not affect the label. */
export function scorePronunciation(
  canonical: string[],
  produced: string[],
  cache: PhonemeCache,
): PronunciationScore {
  const { perPosition, insertions, wper } = alignWPER(canonical, produced, cache);
  const hasError = perPosition.some(
    (p) => p.op === 'del' || (p.op === 'sub' && p.cos_dist >= VARIANT_ERROR_THRESHOLD),
  );
  return {
    per_position: perPosition,
    insertions,
    overall_score: Math.min(1, Math.max(0, 1 - wper)),
    variant_vs_error_class: hasError ? 'error' : 'variant',
    threshold_basis: 'l1_agnostic',
  };
}

- [ ] Step 4: Run test to verify it passes

Run: cd packages/web/workers && npx vitest run src/__tests__/pronunciationScore.test.ts
Expected: PASS.
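
The labelling rule can be sketched in isolation, including the sub-threshold case that the synthetic cache cannot exercise. The names below (SketchPosition, classifyWord, SKETCH_THRESHOLD) are hypothetical stand-ins, and the two quartile values are simply the ones quoted in the threshold comment above.

```typescript
type SketchOp = 'match' | 'sub' | 'del';

interface SketchPosition {
  op: SketchOp;
  cos_dist: number;
}

// Midpoint of the two PHON-126 quartiles quoted in the plan: (0.102 + 0.122) / 2.
const VARIANT_P75 = 0.102;
const ERROR_P25 = 0.122;
const SKETCH_THRESHOLD = (VARIANT_P75 + ERROR_P25) / 2;

// A word is an error if any canonical position is deleted or substituted at/over the threshold.
function classifyWord(positions: SketchPosition[], threshold: number): 'variant' | 'error' {
  const hasError = positions.some(
    (p) => p.op === 'del' || (p.op === 'sub' && p.cos_dist >= threshold),
  );
  return hasError ? 'error' : 'variant';
}

console.log(SKETCH_THRESHOLD);
console.log(classifyWord([{ op: 'sub', cos_dist: 0.05 }], SKETCH_THRESHOLD)); // near substitution
console.log(classifyWord([{ op: 'sub', cos_dist: 0.2 }], SKETCH_THRESHOLD));  // far substitution
console.log(classifyWord([{ op: 'del', cos_dist: 1 }], SKETCH_THRESHOLD));    // deletion
```

A substitution at 0.05 stays a variant, while one at 0.2 or any deletion makes the word an error. The midpoint works out to roughly 0.112, matching VARIANT_ERROR_THRESHOLD.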

- [ ] Step 5: Commit

git add packages/web/workers/src/lib/pronunciationScore.ts packages/web/workers/src/__tests__/pronunciationScore.test.ts
git commit -m "feat(phon-129): variant/error classification + overall score"

Task 4: PHON-126 drift fixture (Python generator + TS pin test)

Files:
- Create: research/2026-06-05-phon-129-l2-accent-scorer/gen_score_fixtures.py
- Create: research/2026-06-05-phon-129-l2-accent-scorer/score_fixtures.json (generated)
- Create: packages/web/workers/src/__tests__/pronounce-fixture.test.ts

- [ ] Step 1: Write the fixture generator

Create research/2026-06-05-phon-129-l2-accent-scorer/gen_score_fixtures.py. It loads the same learned vectors PHON-126 used, computes the exact cos_dist and WPER for a few (canonical, produced) pairs, and emits BOTH the cache data (normSq + dots for the involved phonemes) and the expected outputs — so the TS test is self-contained and needs no D1.

"""Generate score_fixtures.json: PHON-126 cos_dist/WPER ground truth + the cache
data needed to reproduce it in TypeScript. Pins pronunciationScore.ts against the
validated Python metric. Run: uv run python gen_score_fixtures.py"""
import json
from pathlib import Path
import numpy as np
import polars as pl

VECTORS = Path(__file__).resolve().parents[2] / "packages/features/outputs/vectors.csv"
OUT = Path(__file__).resolve().parent / "score_fixtures.json"

# (canonical, produced) pairs spanning match / near-sub / far-sub / deletion.
CASES = [
    (["k", "æ", "t"], ["k", "æ", "t"]),     # perfect
    (["v", "ɛ", "ɹ", "i"], ["b", "ɛ", "ɹ", "i"]),  # v→b substitution
    (["k", "æ", "t"], ["k", "æ"]),          # final deletion
    (["s", "ɪ", "t"], ["ʃ", "ɪ", "t"]),     # s→ʃ substitution
]

def cos_dist(v1, v2):
    c = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return min(1.0, max(0.0, 1.0 - c))

def wper(canon, prod, vec):
    n, m = len(canon), len(prod)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): dp[i][0] = i
    for j in range(m + 1): dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i-1][j-1] + cos_dist(vec[canon[i-1]], vec[prod[j-1]])
            dp[i][j] = min(sub, dp[i-1][j] + 1, dp[i][j-1] + 1)
    return dp[n][m] / n if n else (0.0 if m == 0 else 1.0)

df = pl.read_csv(VECTORS)
feat_cols = [c for c in df.columns if c != "ipa"]
vec = {row["ipa"]: np.array([row[c] for c in feat_cols], dtype=float)
       for row in df.iter_rows(named=True)}

phones = sorted({p for c, pr in CASES for p in (*c, *pr)})
norm_sq = {p: float(vec[p] @ vec[p]) for p in phones}
dots = {}
for a in phones:
    for b in phones:
        if a <= b:
            dots[f"{a},{b}"] = float(vec[a] @ vec[b])

cases = []
for canon, prod in CASES:
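    cases.append({
        "canonical": canon,
        "produced": prod,
        "expected_wper": round(wper(canon, prod, vec), 9),
        # zip() stops at the shorter sequence; positional pairing is valid because every
        # CASE above aligns position-by-position (substitutions or a trailing deletion).
        "expected_per_position_cos_dist": [
            round(cos_dist(vec[c], vec[p]), 9) for c, p in zip(canon, prod)
        ],
    })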

OUT.write_text(json.dumps({"normSq": norm_sq, "dots": dots, "cases": cases}, indent=2))
print(f"wrote {OUT} ({len(cases)} cases, {len(phones)} phonemes)")

- [ ] Step 2: Generate the fixture

Run: cd research/2026-06-05-phon-129-l2-accent-scorer && uv run python gen_score_fixtures.py
Expected: wrote .../score_fixtures.json (4 cases, 11 phonemes; the four CASES use 11 distinct symbols). Confirm the file exists and contains normSq, dots, cases.
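
One way to confirm the generated file has the expected top-level shape is a small runtime guard, sketched below. The interface and function names are hypothetical, and the sample object uses placeholder numbers purely to exercise the guard; they are not values from vectors.csv.

```typescript
interface SketchFixtureCase {
  canonical: string[];
  produced: string[];
  expected_wper: number;
  expected_per_position_cos_dist: number[];
}

interface SketchFixture {
  normSq: Record<string, number>;
  dots: Record<string, number>;
  cases: SketchFixtureCase[];
}

function isNumberRecord(v: unknown): v is Record<string, number> {
  return (
    typeof v === 'object' && v !== null && !Array.isArray(v) &&
    Object.values(v as Record<string, unknown>).every((x) => typeof x === 'number')
  );
}

function isArrayOf(v: unknown, kind: 'string' | 'number'): boolean {
  return Array.isArray(v) && v.every((x) => typeof x === kind);
}

function isSketchFixture(v: unknown): v is SketchFixture {
  if (typeof v !== 'object' || v === null) return false;
  const f = v as Record<string, unknown>;
  if (!isNumberRecord(f.normSq) || !isNumberRecord(f.dots) || !Array.isArray(f.cases)) {
    return false;
  }
  return f.cases.every((c: unknown) => {
    if (typeof c !== 'object' || c === null) return false;
    const k = c as Record<string, unknown>;
    return (
      isArrayOf(k.canonical, 'string') &&
      isArrayOf(k.produced, 'string') &&
      typeof k.expected_wper === 'number' &&
      isArrayOf(k.expected_per_position_cos_dist, 'number')
    );
  });
}

// Placeholder values only, to exercise the guard; these are not real vector statistics.
const placeholderFixture: unknown = {
  normSq: { k: 1, t: 1 },
  dots: { 'k,k': 1, 'k,t': 0.5, 't,t': 1 },
  cases: [
    { canonical: ['k', 't'], produced: ['k', 't'], expected_wper: 0, expected_per_position_cos_dist: [0, 0] },
  ],
};
console.log(isSketchFixture(placeholderFixture));
```

A guard like this could run at the top of the pin test so that a malformed or stale fixture fails with a clear message rather than a confusing numeric mismatch.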

- [ ] Step 3: Write the TS pin test

Create packages/web/workers/src/__tests__/pronounce-fixture.test.ts:

import { describe, it, expect } from 'vitest';
import type { PhonemeCache } from '../lib/similarity';
import { alignWPER } from '../lib/pronunciationScore';
import fixture from '../../../../../research/2026-06-05-phon-129-l2-accent-scorer/score_fixtures.json';

function fixtureCache(): PhonemeCache {
  return {
    normSq: new Map(Object.entries(fixture.normSq as Record<string, number>)),
    dots: new Map(Object.entries(fixture.dots as Record<string, number>)),
  };
}

describe('pronunciationScore matches the PHON-126 Python metric', () => {
  const cache = fixtureCache();
  for (const c of fixture.cases as Array<{
    canonical: string[]; produced: string[];
    expected_wper: number; expected_per_position_cos_dist: number[];
  }>) {
    it(`reproduces wper for ${c.canonical.join('')} vs ${c.produced.join('')}`, () => {
      const r = alignWPER(c.canonical, c.produced, cache);
      expect(r.wper).toBeCloseTo(c.expected_wper, 6);
      const subOrMatch = r.perPosition
        .filter((p) => p.op !== 'del')
        .map((p) => p.cos_dist);
      c.expected_per_position_cos_dist.forEach((expected, idx) => {
        expect(subOrMatch[idx]).toBeCloseTo(expected, 6);
      });
    });
  }
});

Note: the five ../ segments in the import climb from packages/web/workers/src/__tests__/ through src, workers, web, and packages to the repo root, where research/ lives. If the test runner still reports an unresolved module, adjust the number of ../ to match the actual layout. Ensure resolveJsonModule is enabled in the workers tsconfig.json; if not, add "resolveJsonModule": true under compilerOptions.

- [ ] Step 4: Run test to verify it passes

Run: cd packages/web/workers && npx vitest run src/__tests__/pronounce-fixture.test.ts
Expected: PASS. TS reproduces the Python cos_dist/WPER values to six decimal places (toBeCloseTo(x, 6) accepts an absolute difference below 5e-7).
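
The tolerance behind toBeCloseTo is worth spelling out, since "6 digits" does not mean 1e-6. The helper below is a hypothetical re-statement of the rule as documented for Jest-style matchers (pass when the absolute difference is below 10^-digits / 2); it is a sketch for reasoning about the pin test, not Vitest's own code.

```typescript
// Documented toBeCloseTo rule: pass when |expected - received| < 10^(-digits) / 2.
function closeTo(received: number, expected: number, digits = 2): boolean {
  return Math.abs(expected - received) < Math.pow(10, -digits) / 2;
}

console.log(closeTo(0.1000004, 0.1, 6)); // difference of about 4e-7
console.log(closeTo(0.1000006, 0.1, 6)); // difference of about 6e-7
```

The first call is within the 5e-7 band and the second is outside it. The generator rounds expected values to nine decimals, so its rounding error (at most 5e-10) is far smaller than this band and should not cause false failures.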

- [ ] Step 5: Commit

git add research/2026-06-05-phon-129-l2-accent-scorer/gen_score_fixtures.py research/2026-06-05-phon-129-l2-accent-scorer/score_fixtures.json packages/web/workers/src/__tests__/pronounce-fixture.test.ts
git commit -m "test(phon-129): pin TS scorer to PHON-126 Python metric via frozen fixture"

Task 5: `/api/audio/pronounce` route handler

Files:
- Modify: packages/web/workers/src/routes/audio.ts
- Modify: packages/web/workers/src/__tests__/audio.test.ts

- [ ] Step 1: Write the failing route tests

Append to packages/web/workers/src/__tests__/audio.test.ts:

describe('POST /api/audio/pronounce', () => {
  it('returns 400 when no audio part is present', async () => {
    const fd = new FormData();
    fd.append('target_word', 'very');
    const res = await SELF.fetch('http://localhost/api/audio/pronounce', { method: 'POST', body: fd });
    expect(res.status).toBe(400);
    const body = await res.json() as Record<string, unknown>;
    expect(body).toHaveProperty('detail');
  });

  it('returns 400 when target_word is missing', async () => {
    const fd = new FormData();
    fd.append('audio', new Blob([new Uint8Array([1, 2, 3])], { type: 'audio/wav' }), 'clip.wav');
    const res = await SELF.fetch('http://localhost/api/audio/pronounce', { method: 'POST', body: fd });
    expect(res.status).toBe(400);
    const body = await res.json() as Record<string, unknown>;
    expect(body).toHaveProperty('detail');
  });

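  it('does not return 200 while the transcriber sub-call is warming', async () => {
    // The handler transcribes before the D1 lookup, so a warming sub-call should surface
    // as 503. 500 is tolerated for test environments where the route fails earlier
    // (for example, AUDIO_INFERENCE_URL unset). Either way it must not 200 without a transcript.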
    fetchMock
      .get('http://127.0.0.1:8000')
      .intercept({ path: '/transcribe', method: 'POST' })
      .reply(503, { warming: true });
    const fd = new FormData();
    fd.append('audio', new Blob([new Uint8Array([1, 2, 3])], { type: 'audio/wav' }), 'clip.wav');
    fd.append('target_word', 'very');
    const res = await SELF.fetch('http://localhost/api/audio/pronounce', { method: 'POST', body: fd });
    expect([500, 503]).toContain(res.status);
  });
});

Note: the success path needs a seeded D1 (words, phonemes, phoneme_dots), which CI does not provide — consistent with api.test.ts's convention. Do not add a success-path assertion that requires seeded D1; the scoring math is already covered by Tasks 2–4. The route tests cover validation + sub-call wiring.

- [ ] Step 2: Run tests to verify they fail

Run: cd packages/web/workers && npx vitest run src/__tests__/audio.test.ts
Expected: FAIL, because /api/audio/pronounce returns 404 (route not registered).

- [ ] Step 3: Implement the handler

In packages/web/workers/src/routes/audio.ts, add imports at the top (after the existing imports):

import { buildPhonemeCache, type PhonemeCache } from '../lib/similarity';
import { scorePronunciation } from '../lib/pronunciationScore';

Add a module-level cache (mirrors the similarity route's lazy load) above the route definitions:

// Phoneme vector cache (norms + dots), loaded once per isolate from D1.
let phonemeCache: PhonemeCache | null = null;
async function getPhonemeCache(db: D1Database): Promise<PhonemeCache> {
  if (phonemeCache) return phonemeCache;
  const { results: phons } = await db
    .prepare('SELECT ipa, norm_sq FROM phonemes')
    .all<{ ipa: string; norm_sq: number }>();
  const { results: dots } = await db
    .prepare('SELECT ipa1, ipa2, dot_product FROM phoneme_dots')
    .all<{ ipa1: string; ipa2: string; dot_product: number }>();
  phonemeCache = buildPhonemeCache(phons, dots);
  return phonemeCache;
}

/** Fetch the produced transcript from the inference host. off-the-shelf → /transcribe;
 *  ft → /compare (PHON-139 lineage), taking the .ft transcript (fallback to baseline). */
async function fetchTranscript(
  base: string, file: File, transcriber: string, language: string | null,
): Promise<{ phonemes: string[]; confidences?: number[]; duration_ms?: number;
            coverage?: string; limitations?: string[] } | { warming: true }> {
  const path = transcriber === 'ft' ? '/compare' : '/transcribe';
  const fwd = new FormData();
  fwd.append('audio', file, file.name || 'clip');
  if (language) fwd.append('language', language);
  let upstream: Response;
  try {
    upstream = await fetch(`${base}${path}`, { method: 'POST', body: fwd });
  } catch {
    return { warming: true };
  }
  if (upstream.status === 503) return { warming: true };
  if (!upstream.ok) throw new Error(`inference ${upstream.status}`);
  const body = await upstream.json() as Record<string, unknown>;
  if (transcriber === 'ft') {
    const ft = (body.ft ?? body.baseline) as { phonemes: string[] } | undefined;
    if (!ft) throw new Error('compare response missing ft/baseline');
    return ft as { phonemes: string[] };
  }
  return body as { phonemes: string[] };
}

const PRONOUNCE_LIMITS = [
  'Scores against the canonical target; assumes the speaker intends canonical.',
  'Broad-phoneme only; distortions/covert contrast not modeled (Models #3, #5).',
  'variant/error threshold is L1-agnostic in v6.1.',
];

Add the handler (register before export default audio;):

audio.post('/pronounce', async (c) => {
  // 1. Validate multipart
  const form = await c.req.formData().catch(() => null);
  if (!form) return c.json({ detail: 'Missing required multipart field: audio' }, 400);
  const fileEntry = form.get('audio');
  if (!fileEntry || typeof fileEntry === 'string') {
    return c.json({ detail: 'Missing required multipart field: audio' }, 400);
  }
  const file = fileEntry as File;
  if (file.size > MAX_BYTES) return c.json({ detail: 'Audio exceeds 10 MB limit' }, 400);
  if (file.type && !file.type.startsWith('audio/')) {
    return c.json({ detail: `Unsupported content type: ${file.type}` }, 400);
  }
  const targetWord = form.get('target_word');
  if (typeof targetWord !== 'string' || !targetWord.trim()) {
    return c.json({ detail: 'Missing required field: target_word' }, 400);
  }
  const transcriber = form.get('transcriber') === 'ft' ? 'ft' : 'off-the-shelf';
  const l1 = typeof form.get('l1') === 'string' ? (form.get('l1') as string) : null;
  const language = typeof form.get('language') === 'string' ? (form.get('language') as string) : null;

  const base = c.env.AUDIO_INFERENCE_URL?.replace(/\/$/, '');
  if (!base) return c.json({ detail: 'Audio inference host not configured' }, 500);

  // 2. Transcribe (produced phonemes)
  const transcript = await fetchTranscript(base, file, transcriber, language);
  if ('warming' in transcript) {
    return c.json({ warming: true, detail: 'Inference host is warming up. Retry shortly.' }, 503);
  }
  const produced = transcript.phonemes ?? [];

  // 3. Canonical phonemes from D1
  const row = await c.env.DB
    .prepare('SELECT phonemes FROM words WHERE word = ? LIMIT 1')
    .bind(targetWord.trim().toLowerCase())
    .first<{ phonemes: string | null }>();
  if (!row || !row.phonemes) {
    return c.json({ detail: `Word not in lexicon: ${targetWord}` }, 404);
  }
  const canonical = JSON.parse(row.phonemes) as string[];

  // 4. Score in-Worker
  const cache = await getPhonemeCache(c.env.DB);
  const score = scorePronunciation(canonical, produced, cache);

  // 5. Assemble response
  return c.json({
    target_word: targetWord,
    canonical_phonemes: canonical,
    transcript,
    per_position: score.per_position,
    insertions: score.insertions,
    overall_score: score.overall_score,
    variant_vs_error_class: score.variant_vs_error_class,
    threshold_basis: score.threshold_basis,
    l1,
    transcriber,
    coverage: 'broad-phoneme',
    limitations: PRONOUNCE_LIMITS,
  });
});

- [ ] Step 4: Run tests to verify they pass

Run: cd packages/web/workers && npx vitest run src/__tests__/audio.test.ts
Expected: PASS (validation 400s assert exactly; warming case accepts 500/503).
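
The order of the handler's validation checks can be summarised as a pure function, which is sometimes easier to reason about than the multipart code path. This is only a sketch: SketchFields is a hypothetical plain-object stand-in for the parsed form, and SKETCH_MAX_BYTES is an assumed value, since the real MAX_BYTES constant already lives in audio.ts.

```typescript
// Plain-object stand-in for the parsed multipart form (hypothetical shape for this sketch).
interface SketchFields {
  audio?: { size: number; type: string } | null;
  target_word?: unknown;
}

type SketchValidation =
  | { ok: true; word: string }
  | { ok: false; status: 400; detail: string };

// Assumed value; the real MAX_BYTES constant is defined in audio.ts.
const SKETCH_MAX_BYTES = 10 * 1024 * 1024;

function validatePronounceFields(f: SketchFields): SketchValidation {
  if (!f.audio) {
    return { ok: false, status: 400, detail: 'Missing required multipart field: audio' };
  }
  if (f.audio.size > SKETCH_MAX_BYTES) {
    return { ok: false, status: 400, detail: 'Audio exceeds 10 MB limit' };
  }
  if (f.audio.type && !f.audio.type.startsWith('audio/')) {
    return { ok: false, status: 400, detail: `Unsupported content type: ${f.audio.type}` };
  }
  if (typeof f.target_word !== 'string' || !f.target_word.trim()) {
    return { ok: false, status: 400, detail: 'Missing required field: target_word' };
  }
  // Same normalization the handler applies before the D1 lookup.
  return { ok: true, word: f.target_word.trim().toLowerCase() };
}

console.log(validatePronounceFields({ target_word: 'very' }));
console.log(validatePronounceFields({ audio: { size: 3, type: 'audio/wav' }, target_word: ' Very ' }));
```

The audio checks run before the target_word check, so a request missing both fields reports the audio problem first, which is what the two 400 route tests rely on.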

- [ ] Step 5: Type-check the workers package

Run: cd packages/web/workers && npx tsc --noEmit
Expected: no errors. (DB: D1Database should already be on Env, per types.ts.)

- [ ] Step 6: Commit

git add packages/web/workers/src/routes/audio.ts packages/web/workers/src/__tests__/audio.test.ts
git commit -m "feat(phon-129): /api/audio/pronounce route — transcribe + canonical lookup + score"

Task 6: Frontend service (`pronounceAudio()`)

Files:
- Modify: packages/web/frontend/src/services/audioApi.ts
- Create: packages/web/frontend/src/services/audioApi.pronounce.test.ts

- [ ] Step 1: Write the failing test

Create packages/web/frontend/src/services/audioApi.pronounce.test.ts:

import { describe, it, expect, vi, afterEach } from 'vitest';
import { pronounceAudio, TranscriberWarmingError } from './audioApi';

afterEach(() => vi.restoreAllMocks());

const sample = {
  target_word: 'very', canonical_phonemes: ['v', 'ɛ', 'ɹ', 'i'],
  transcript: { phonemes: ['b', 'ɛ', 'ɹ', 'i'], confidences: [], duration_ms: 1,
                coverage: 'broad-phoneme', limitations: [] },
  per_position: [], insertions: [], overall_score: 0.9,
  variant_vs_error_class: 'error', threshold_basis: 'l1_agnostic',
  l1: null, transcriber: 'off-the-shelf', coverage: 'broad-phoneme', limitations: [],
};

describe('pronounceAudio', () => {
  it('posts multipart and returns the score', async () => {
    const fetchSpy = vi.spyOn(globalThis, 'fetch').mockResolvedValue(
      new Response(JSON.stringify(sample), { status: 200 }),
    );
    const res = await pronounceAudio(new Blob(['x']), 'very', { l1: 'Spanish' });
    expect(res.overall_score).toBe(0.9);
    const [, init] = fetchSpy.mock.calls[0];
    expect((init as RequestInit).body).toBeInstanceOf(FormData);
  });

  it('throws TranscriberWarmingError on 503', async () => {
    vi.spyOn(globalThis, 'fetch').mockResolvedValue(
      new Response(JSON.stringify({ warming: true, detail: 'warming' }), { status: 503 }),
    );
    await expect(pronounceAudio(new Blob(['x']), 'very')).rejects.toBeInstanceOf(TranscriberWarmingError);
  });
});

- [ ] Step 2: Run test to verify it fails

Run: cd packages/web/frontend && npx vitest run src/services/audioApi.pronounce.test.ts
Expected: FAIL (pronounceAudio is not exported yet).

- [ ] Step 3: Implement

Append to packages/web/frontend/src/services/audioApi.ts:

export interface PositionScore {
  canonical: string;
  produced: string | null;
  cos_dist: number;
  op: 'match' | 'sub' | 'del';
}

export interface PronunciationResult {
  target_word: string;
  canonical_phonemes: string[];
  transcript: TranscriptResult;
  per_position: PositionScore[];
  insertions: { produced: string; after_canonical_index: number }[];
  overall_score: number;
  variant_vs_error_class: 'variant' | 'error';
  threshold_basis: 'l1_agnostic';
  l1: string | null;
  transcriber: 'off-the-shelf' | 'ft';
  coverage: string;
  limitations: string[];
}

export async function pronounceAudio(
  blob: Blob,
  targetWord: string,
  opts?: { transcriber?: 'off-the-shelf' | 'ft'; l1?: string; language?: string },
): Promise<PronunciationResult> {
  const fd = new FormData();
  fd.append('audio', blob, 'recording');
  fd.append('target_word', targetWord);
  if (opts?.transcriber) fd.append('transcriber', opts.transcriber);
  if (opts?.l1) fd.append('l1', opts.l1);
  if (opts?.language) fd.append('language', opts.language);

  const res = await fetch(`${baseUrl}/api/audio/pronounce`, {
    method: 'POST',
    headers: { 'X-Request-ID': freshRequestId() },
    body: fd,
  });

  if (res.status === 503) {
    const body = await res.json().catch(() => ({ detail: 'Warming up' })) as { detail?: string };
    throw new TranscriberWarmingError(body.detail || 'Inference host is warming up.');
  }
  if (!res.ok) {
    const detail = await res.text().catch(() => res.statusText);
    throw new Error(`Pronounce failed (${res.status}): ${detail}`);
  }
  return res.json();
}

- [ ] Step 4: Run test to verify it passes

Run: cd packages/web/frontend && npx vitest run src/services/audioApi.pronounce.test.ts
Expected: PASS.
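
Because the 503 response asks the client to retry shortly, callers of pronounceAudio may want a small retry wrapper. The sketch below is not part of the plan: retryWhileWarming is a hypothetical helper, and SketchWarmingError stands in for the TranscriberWarmingError exported by audioApi.ts so the example stays self-contained.

```typescript
// Stand-in for the TranscriberWarmingError exported by audioApi.ts.
class SketchWarmingError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, SketchWarmingError.prototype);
    this.name = 'SketchWarmingError';
  }
}

// Retries `call` while it throws a warming error, up to `attempts` total tries.
// Any other error, or a warming error on the last try, is rethrown unchanged.
async function retryWhileWarming<T>(
  call: () => Promise<T>,
  attempts = 3,
  delayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!(err instanceof SketchWarmingError) || attempt >= attempts) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

In the dev page this could wrap the service call, for example retryWhileWarming(() => pronounceAudio(blob, word)), with the attempt count and delay tuned to how long the inference host typically takes to warm.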

- [ ] Step 5: Commit

git add packages/web/frontend/src/services/audioApi.ts packages/web/frontend/src/services/audioApi.pronounce.test.ts
git commit -m "feat(phon-129): pronounceAudio frontend service"

Task 7: Frontend dev page (`PronunciationViewer.tsx`)

Files:
- Read first: packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx (mirror its structure)
- Create: packages/web/frontend/src/components/tools/PronunciationViewer.tsx
- Create: packages/web/frontend/src/components/tools/PronunciationViewer.test.tsx
- Modify: packages/web/frontend/src/main.tsx

[ ] Step 1: Read the existing viewer to mirror its patterns

Run: sed -n '1,120p' packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx
Note its imports (MUI components, transcribeAudio, TranscriberWarmingError, recorder hook/refs, preloaded-clip picker via loadAudioSamples) and reuse the same idioms.

[ ] Step 2: Write the failing component test

Create packages/web/frontend/src/components/tools/PronunciationViewer.test.tsx:

import { describe, it, expect, vi, afterEach } from 'vitest';
import { render, screen, fireEvent, waitFor } from '@testing-library/react';
import PronunciationViewer from './PronunciationViewer';
import * as api from '../../services/audioApi';

afterEach(() => vi.restoreAllMocks());

const result: api.PronunciationResult = {
  target_word: 'very', canonical_phonemes: ['v', 'ɛ', 'ɹ', 'i'],
  transcript: { phonemes: ['b', 'ɛ', 'ɹ', 'i'], confidences: [], duration_ms: 1,
                coverage: 'broad-phoneme', limitations: [] },
  per_position: [
    { canonical: 'v', produced: 'b', cos_dist: 0.31, op: 'sub' },
    { canonical: 'ɛ', produced: 'ɛ', cos_dist: 0, op: 'match' },
    { canonical: 'ɹ', produced: 'ɹ', cos_dist: 0, op: 'match' },
    { canonical: 'i', produced: 'i', cos_dist: 0, op: 'match' },
  ],
  insertions: [], overall_score: 0.92, variant_vs_error_class: 'error',
  threshold_basis: 'l1_agnostic', l1: null, transcriber: 'off-the-shelf',
  coverage: 'broad-phoneme', limitations: ['x'],
};

describe('PronunciationViewer', () => {
  it('renders the target-word input and a score after pronounce', async () => {
    vi.spyOn(api, 'loadAudioSamples').mockResolvedValue([]);
    vi.spyOn(api, 'pronounceAudio').mockResolvedValue(result);

    render(<PronunciationViewer />);
    const input = screen.getByLabelText(/target word/i);
    fireEvent.change(input, { target: { value: 'very' } });
    fireEvent.click(screen.getByRole('button', { name: /score|pronounce/i }));

    await waitFor(() => expect(screen.getByText(/0\.92|92/)).toBeInTheDocument());
    expect(screen.getByText('v')).toBeInTheDocument(); // canonical position rendered
  });
});

Note: as written, this test never supplies an audio clip (loadAudioSamples is mocked to return an empty list), while the component in Step 4 disables the Score button and returns early from run() until a blob is set. The click would therefore be a no-op and the waitFor would time out. Have the test select a clip or set a blob through the same mechanism the transcribe viewer's test uses: read AudioTranscribeViewer.test.tsx first and copy its clip-injection approach.

[ ] Step 3: Run test to verify it fails

Run: cd packages/web/frontend && npx vitest run src/components/tools/PronunciationViewer.test.tsx
Expected: FAIL (the component does not exist yet).

[ ] Step 4: Implement the component

Create packages/web/frontend/src/components/tools/PronunciationViewer.tsx. Mirror AudioTranscribeViewer.tsx's recorder + preloaded-clip + warming-state scaffolding; add the target-word input, optional L1 dropdown, transcriber toggle, and per-position heat row. Concretely:

/**
 * PHON-129 Model #2 dev page. Mirrors AudioTranscribeViewer; adds a target word,
 * optional L1 tag, transcriber toggle, and per-position cos_dist heat. NOT the
 * eventual user-facing tool (that's a later unification spec).
 */
import { useState } from 'react';
import {
  Box, TextField, Button, MenuItem, ToggleButton, ToggleButtonGroup,
  Typography, Chip, Stack, Tooltip, Alert,
} from '@mui/material';
import {
  pronounceAudio, TranscriberWarmingError,
  type PronunciationResult, type PositionScore,
} from '../../services/audioApi';

const L1S = ['Arabic', 'Chinese', 'Hindi', 'Korean', 'Spanish', 'Vietnamese', 'unknown'];

function heatColor(cosDist: number, op: PositionScore['op']): string {
  if (op === 'del') return '#b71c1c';
  const t = Math.min(1, cosDist / 0.5); // 0 → green, ≥0.5 → red
  const r = Math.round(76 + t * (183 - 76));
  const g = Math.round(175 - t * (175 - 28));
  return `rgb(${r}, ${g}, 60)`;
}

export default function PronunciationViewer() {
  const [blob, setBlob] = useState<Blob | null>(null);
  const [targetWord, setTargetWord] = useState('');
  const [l1, setL1] = useState('unknown');
  const [transcriber, setTranscriber] = useState<'off-the-shelf' | 'ft'>('off-the-shelf');
  const [result, setResult] = useState<PronunciationResult | null>(null);
  const [warming, setWarming] = useState(false);
  const [error, setError] = useState<string | null>(null);

  async function run() {
    if (!blob || !targetWord.trim()) return;
    setWarming(false); setError(null); setResult(null);
    try {
      const r = await pronounceAudio(blob, targetWord.trim(), {
        transcriber, l1: l1 === 'unknown' ? undefined : l1,
      });
      setResult(r);
    } catch (e) {
      if (e instanceof TranscriberWarmingError) setWarming(true);
      else setError(e instanceof Error ? e.message : 'Failed');
    }
  }

  return (
    <Box sx={{ p: 3, maxWidth: 760, mx: 'auto' }}>
      <Typography variant="h5" gutterBottom>Pronunciation Scorer (dev)</Typography>

      {/* Recorder / upload / preloaded-clip picker: reuse the same controls as
          AudioTranscribeViewer. On clip ready, call setBlob(clipBlob). */}
      {/* <AudioCapture onBlob={setBlob} /> — mirror the transcribe viewer's capture UI */}

      <Stack direction="row" spacing={2} sx={{ my: 2 }} alignItems="center">
        <TextField label="Target word" value={targetWord}
          onChange={(e) => setTargetWord(e.target.value)} size="small" />
        <TextField select label="L1" value={l1} size="small"
          onChange={(e) => setL1(e.target.value)} sx={{ minWidth: 130 }}>
          {L1S.map((x) => <MenuItem key={x} value={x}>{x}</MenuItem>)}
        </TextField>
        <ToggleButtonGroup exclusive size="small" value={transcriber}
          onChange={(_, v) => v && setTranscriber(v)}>
          <ToggleButton value="off-the-shelf">off-the-shelf</ToggleButton>
          <ToggleButton value="ft">ft</ToggleButton>
        </ToggleButtonGroup>
        <Button variant="contained" onClick={run} disabled={!blob || !targetWord.trim()}>
          Score
        </Button>
      </Stack>

      {warming && <Alert severity="info">Inference host is warming up. Retry shortly.</Alert>}
      {error && <Alert severity="error">{error}</Alert>}

      {result && (
        <Box sx={{ mt: 2 }}>
          <Stack direction="row" spacing={1} alignItems="center" sx={{ mb: 1 }}>
            <Typography variant="h6">Score: {result.overall_score.toFixed(2)}</Typography>
            <Chip label={result.variant_vs_error_class}
              color={result.variant_vs_error_class === 'error' ? 'error' : 'success'} size="small" />
            <Tooltip title="Threshold is L1-agnostic in v6.1">
              <Chip label={result.threshold_basis} variant="outlined" size="small" />
            </Tooltip>
          </Stack>
          <Stack direction="row" spacing={0.5}>
            {result.per_position.map((p, i) => (
              <Tooltip key={i} title={`${p.op} cos_dist ${p.cos_dist.toFixed(2)}`}>
                <Box sx={{ textAlign: 'center', minWidth: 36 }}>
                  <Box sx={{ bgcolor: heatColor(p.cos_dist, p.op), color: '#fff',
                             borderRadius: 1, px: 1, py: 0.5 }}>{p.canonical}</Box>
                  <Typography variant="caption">{p.produced ?? '∅'}</Typography>
                </Box>
              </Tooltip>
            ))}
          </Stack>
          {result.limitations.length > 0 && (
            <Alert severity="warning" sx={{ mt: 2 }}>{result.limitations.join(' ')}</Alert>
          )}
        </Box>
      )}
    </Box>
  );
}

Then wire the actual recorder/upload/preloaded controls by copying AudioTranscribeViewer's capture section and calling setBlob when a clip is ready (replace the commented placeholder). Keep the test's clip-injection mechanism aligned with how the transcribe viewer's test supplies a blob.

[ ] Step 5: Register the dev route

In packages/web/frontend/src/main.tsx, add the import and route alongside the existing /dev/audio:

import PronunciationViewer from './components/tools/PronunciationViewer.tsx';
// ...inside <Routes>:
<Route path="/dev/pronounce" element={<PronunciationViewer />} />

[ ] Step 6: Run test to verify it passes

Run: cd packages/web/frontend && npx vitest run src/components/tools/PronunciationViewer.test.tsx
Expected: PASS.

[ ] Step 7: Type-check + build the frontend

Run: cd packages/web/frontend && npx tsc --noEmit && npm run build
Expected: no type errors; build succeeds.

[ ] Step 8: Commit

git add packages/web/frontend/src/components/tools/PronunciationViewer.tsx packages/web/frontend/src/components/tools/PronunciationViewer.test.tsx packages/web/frontend/src/main.tsx
git commit -m "feat(phon-129): PronunciationViewer dev page + /dev/pronounce route"

Task 8: L2-ARCTIC validation harness¶

Files:
- Create: research/2026-06-05-phon-129-l2-accent-scorer/01_run_l2arctic.py
- Create: research/2026-06-05-phon-129-l2-accent-scorer/02_metrics.py
- Create: research/2026-06-05-phon-129-l2-accent-scorer/RESULTS.md

This task is a research harness, not unit-tested code. It runs against the external L2-ARCTIC drive and the local phonolex_audio inference server (start it with uv run python -m phonolex_audio --port 8000). The L2-ARCTIC audio and the model are both local, so neither RunPod nor uploads are involved. Follow the long-running-jobs policy: checkpoint, SIGINT flush, resume.

[ ] Step 1: Write 01_run_l2arctic.py

Create research/2026-06-05-phon-129-l2-accent-scorer/01_run_l2arctic.py. It iterates L2-ARCTIC annotated utterances, transcribes each via the off-the-shelf host, parses the canonical,perceived,errortype phone tier, and emits per-token rows with BOTH the transcriber-derived and oracle (human-perceived) cos_dist. Checkpoint every 200 utts.

"""PHON-129 validation step 1: transcribe + score L2-ARCTIC, emit per-token rows.
Two cos_dist per token: transcriber-derived (real chain) and oracle (human-perceived).
Checkpointed per the long-running-jobs policy. Run:
  uv run python 01_run_l2arctic.py --inference-url <host> --out rows.parquet"""
import argparse, json, pickle, signal, sys
from pathlib import Path
import numpy as np, polars as pl

L2ARCTIC = Path("/Volumes/ExternalData2/audio-datasets/l2arctic")
VECTORS = Path(__file__).resolve().parents[2] / "packages/features/outputs/vectors.csv"
SPK_L1 = {  # README speaker→L1 map
    **{s: "Arabic" for s in ["ABA", "SKA", "YBAA", "ZHAA"]},
    **{s: "Chinese" for s in ["BWC", "LXC", "NCC", "TXHC"]},
    **{s: "Hindi" for s in ["ASI", "RRBI", "SVBI", "TNI"]},
    **{s: "Korean" for s in ["HJK", "HKK", "YDCK", "YKWK"]},
    **{s: "Spanish" for s in ["EBVS", "ERMS", "MBMPS", "NJS"]},
    **{s: "Vietnamese" for s in ["HQTV", "PNV", "THV", "TLV"]},
}

# Load vectors → cos_dist helper (same metric as the TS scorer / PHON-126).
df = pl.read_csv(VECTORS)
feat = [c for c in df.columns if c != "ipa"]
VEC = {r["ipa"]: np.array([r[c] for c in feat], float) for r in df.iter_rows(named=True)}
def cos_dist(a, b):
    if a not in VEC or b not in VEC: return None
    c = float(VEC[a] @ VEC[b] / (np.linalg.norm(VEC[a]) * np.linalg.norm(VEC[b])))
    return min(1.0, max(0.0, 1.0 - c))

# parse_annotation(textgrid_path) → list[(canonical, perceived, errortype)] from the
# IPA phone tier (the tier whose error labels are IPA, e.g. "ð,d,s"); see DIAGNOSTIC.
# transcribe(url, wav_path) → list[str] produced phonemes via POST /transcribe.
# (Implement both against the L2-ARCTIC TextGrid format + the host contract.)

def load_ckpt(p): return pickle.loads(p.read_bytes()) if p.exists() else {"done": [], "rows": []}
def save_ckpt(p, st): p.write_bytes(pickle.dumps(st))

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--inference-url", required=True)
    ap.add_argument("--out", default="rows.parquet")
    ap.add_argument("--checkpoint", default="_ckpt.pkl")
    ap.add_argument("--checkpoint-every", type=int, default=200)
    a = ap.parse_args()
    ckpt = Path(a.checkpoint); st = load_ckpt(ckpt); done = set(st["done"])

    def flush(*_):
        save_ckpt(ckpt, st)
        print(f"[ckpt] saved at {len(st['done'])} utts")
    signal.signal(signal.SIGINT, lambda *_: (flush(), sys.exit(0)))

    grids = sorted(L2ARCTIC.glob("*/annotation/*.TextGrid"))
    for k, g in enumerate(grids):
        uid = f"{g.parts[-3]}/{g.stem}"
        if uid in done: continue
        spk = g.parts[-3]; wav = L2ARCTIC / spk / "wav" / f"{g.stem}.wav"
        gold = parse_annotation(g)              # [(canon, perceived, errtype)]
        produced = transcribe(a.inference_url, wav)
        # Buffer rows per utterance so a SIGINT mid-utterance cannot checkpoint
        # partial rows that would be duplicated when the utterance is re-run on resume.
        utt_rows = []
        for canon, perceived, et in gold:
            utt_rows.append({
                "utt": uid, "speaker": spk, "l1": SPK_L1.get(spk, "?"),
                "canonical": canon, "perceived": perceived, "errortype": et,
                "cos_dist_oracle": cos_dist(canon, perceived),
                # transcriber-derived: align produced→canonical, take this position's cost.
                # (Reuse the same WPER alignment; store the matched produced phone + cost.)
            })
        st["rows"].extend(utt_rows)
        done.add(uid); st["done"] = sorted(done)
        if (k + 1) % a.checkpoint_every == 0: flush()
    flush()
    pl.DataFrame(st["rows"]).write_parquet(a.out)
    print(f"wrote {a.out}: {len(st['rows'])} token rows")

if __name__ == "__main__":
    main()

Note: parse_annotation, transcribe, and the transcriber-derived alignment are marked inline. Implement them against the L2-ARCTIC TextGrid format (verified: IPA error labels like ð,d,s in the phone tier) and the inference host's /transcribe contract. Reuse the WPER alignment logic from gen_score_fixtures.py for the transcriber-derived per-position cost, and store that cost in a cos_dist_transcriber column: 02_metrics.py reads that column name in Step 3, and the row dict above does not emit it yet.
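The transcriber-derived cost mentioned in the note can be sketched as a weighted edit-distance alignment with traceback. This is a minimal illustration only, not the logic in gen_score_fixtures.py: the name align_costs, the unit insertion/deletion cost, and the toy_cost table are assumptions made for the example. In the harness the substitution cost would be the cos_dist helper defined in the script.

```python
from typing import Callable, List, Optional, Tuple

# One aligned canonical position: (canonical, produced-or-None, cost, op).
Position = Tuple[str, Optional[str], float, str]


def align_costs(
    canonical: List[str],
    produced: List[str],
    sub_cost: Callable[[str, str], float],
    indel: float = 1.0,  # ASSUMPTION: unit cost for insertions and deletions
) -> List[Position]:
    """Weighted edit-distance alignment; returns one entry per canonical phone."""
    n, m = len(canonical), len(produced)
    # dp[i][j] = cheapest cost of aligning canonical[:i] with produced[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel
    for j in range(1, m + 1):
        dp[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(canonical[i - 1], produced[j - 1]),
                dp[i - 1][j] + indel,  # canonical phone deleted
                dp[i][j - 1] + indel,  # extra produced phone inserted
            )

    # Traceback from the bottom-right corner, preferring match/sub on ties.
    out: List[Position] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            c = sub_cost(canonical[i - 1], produced[j - 1])
            if abs(dp[i][j] - (dp[i - 1][j - 1] + c)) < 1e-12:
                op = "match" if canonical[i - 1] == produced[j - 1] else "sub"
                out.append((canonical[i - 1], produced[j - 1], c, op))
                i, j = i - 1, j - 1
                continue
        if i > 0 and (j == 0 or abs(dp[i][j] - (dp[i - 1][j] + indel)) < 1e-12):
            out.append((canonical[i - 1], None, indel, "del"))
            i -= 1
        else:
            j -= 1  # insertion: no canonical position, so nothing is emitted
    out.reverse()
    return out


# Toy substitution cost standing in for cos_dist (hypothetical values).
def toy_cost(a: str, b: str) -> float:
    return 0.0 if a == b else 0.3


rows = align_costs(["v", "ɛ", "ɹ", "i"], ["b", "ɛ", "ɹ", "i"], toy_cost)
```

Each returned tuple is keyed to a canonical position, so its cost can be written straight into that token's row. Inserted phones are dropped here; a fuller version would collect them separately, as the route's out-of-band insertions list does.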

[ ] Step 2: Write 02_metrics.py

Create research/2026-06-05-phon-129-l2-accent-scorer/02_metrics.py. It computes PHON-126's three diagnostics on the rows, pooled and per-L1.

"""PHON-129 validation step 2: PHON-126's three diagnostics on real audio,
pooled AND per-L1. Run: uv run python 02_metrics.py --rows rows.parquet"""
import argparse
import numpy as np, polars as pl
from scipy.stats import mannwhitneyu, spearmanr

def diagnostics(variant, error):
    """variant/error = arrays of cos_dist. Returns the three PHON-126 metrics."""
    out = {}
    if len(variant) and len(error):
        u, p = mannwhitneyu(variant, error, alternative="less")
        out["mannwhitney_U"], out["mannwhitney_p"] = float(u), float(p)
        out["variant_75"] = float(np.percentile(variant, 75))
        out["error_25"] = float(np.percentile(error, 25))
        out["threshold_clean"] = out["variant_75"] < out["error_25"]
    return out

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--rows", default="rows.parquet")
    ap.add_argument("--col", default="cos_dist_oracle",
                    help="cos_dist_oracle | cos_dist_transcriber")
    a = ap.parse_args()
    df = pl.read_parquet(a.rows)

    # variant = L2-ARCTIC substitutions (accent); severity = the cos_dist itself for ρ.
    # error pole (PhonBank Clinical) appended later as errortype/source tags if present.
    subs = df.filter(pl.col("errortype") == "s").drop_nulls(a.col)

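    def report(frame, label):
        # Keep only rows where both the selected column and the oracle are present,
        # so v and ρ come from the same tokens and nulls cannot turn ρ into NaN.
        f = frame.drop_nulls(a.col).drop_nulls("cos_dist_oracle")
        v = f[a.col].to_numpy()
        # Without a disordered error pole, report distribution + severity self-consistency.
        # With --col cos_dist_oracle, ρ compares the column with itself and is uninformative.
        rho = spearmanr(v, f["cos_dist_oracle"].to_numpy()).correlation if len(f) > 1 else float("nan")
        mean = v.mean() if len(v) else float("nan")
        print(f"[{label}] n={len(f)} mean_cosdist={mean:.3f} "
              f"transcriber_vs_oracle_rho={rho:.3f}")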

    report(subs, "POOLED")
    for l1 in sorted(subs["l1"].unique().to_list()):
        report(subs.filter(pl.col("l1") == l1), l1)

    # When the PhonBank-Clinical error pole is appended (source column), run the full
    # variant<error separation: diagnostics(variant_cosdist, error_cosdist), pooled + per-L1.

if __name__ == "__main__":
    main()

Note: the full variant<error separation needs the PhonBank-Clinical error pole appended to rows.parquet (a source column distinguishing l2arctic vs phonbank_clinical). Step 1 can be extended to ingest PhonBank Clinical the same way; if deferred, 02_metrics.py reports the L2-ARCTIC distributions + the transcriber-vs-oracle agreement (ρ) per-L1, which is the load-bearing chain-fidelity result.
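Once a source column is present, the threshold part of the separation check reduces to splitting rows by source and comparing quartiles. The following is a minimal stdlib-only sketch with made-up cos_dist values. The helper name threshold_clean, the cos_dist key, and the sample numbers are assumptions for illustration. statistics.quantiles with method="inclusive" is used because it interpolates linearly between sorted data points, the same rule as np.percentile's default.

```python
import statistics
from typing import Dict, List


def threshold_clean(variant: List[float], error: List[float]) -> Dict[str, object]:
    """Compare the variant 75th percentile with the error 25th percentile."""
    # quantiles(n=4) returns the three quartile cut points [Q1, Q2, Q3].
    v75 = statistics.quantiles(variant, n=4, method="inclusive")[2]
    e25 = statistics.quantiles(error, n=4, method="inclusive")[0]
    return {"variant_75": v75, "error_25": e25, "clean": v75 < e25}


# Hypothetical rows; in the harness these would come from rows.parquet.
rows = [
    {"source": "l2arctic", "cos_dist": 0.02},
    {"source": "l2arctic", "cos_dist": 0.05},
    {"source": "l2arctic", "cos_dist": 0.08},
    {"source": "l2arctic", "cos_dist": 0.10},
    {"source": "l2arctic", "cos_dist": 0.12},
    {"source": "phonbank_clinical", "cos_dist": 0.20},
    {"source": "phonbank_clinical", "cos_dist": 0.25},
    {"source": "phonbank_clinical", "cos_dist": 0.30},
    {"source": "phonbank_clinical", "cos_dist": 0.40},
    {"source": "phonbank_clinical", "cos_dist": 0.55},
]

variant = [r["cos_dist"] for r in rows if r["source"] == "l2arctic"]
error = [r["cos_dist"] for r in rows if r["source"] == "phonbank_clinical"]
out = threshold_clean(variant, error)
```

A clean result means a single threshold between variant_75 and error_25 would classify at least three quarters of each pole correctly. The Mann-Whitney test in diagnostics() covers the distribution-level comparison.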

[ ] Step 3: Run the harness (manual, against the local server)

First start the local inference server in another shell:

uv run python -m phonolex_audio --port 8000   # off-the-shelf wav2vec2-espeak

Then run the harness against it:

cd research/2026-06-05-phon-129-l2-accent-scorer
uv run python 01_run_l2arctic.py --inference-url http://127.0.0.1:8000 --out rows.parquet
uv run python 02_metrics.py --rows rows.parquet --col cos_dist_transcriber
uv run python 02_metrics.py --rows rows.parquet --col cos_dist_oracle

Expected: per-token rows written; pooled + per-L1 metrics printed for both the transcriber chain and the oracle.

[ ] Step 4: Write RESULTS.md

Create research/2026-06-05-phon-129-l2-accent-scorer/RESULTS.md capturing:
- pooled + per-L1 metrics for the transcriber chain and the oracle;
- the transcriber−oracle gap;
- a GO/NO-GO read on whether the PHON-126 separation replicates on real audio;
- an explicit per-L1 sensitivity note: does the threshold/separation drift by L1? This is the empirical answer to the design's open question.
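One way to tabulate the transcriber−oracle gap called for in RESULTS.md is a per-L1 mean of the token-level difference. The sketch below assumes the gap is defined as mean(cos_dist_transcriber − cos_dist_oracle) over tokens where both values exist. That definition, the function name gap_by_l1, and the sample rows are illustrative assumptions, not something the plan fixes.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def gap_by_l1(rows: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
    """Mean (transcriber - oracle) cos_dist per L1, skipping tokens missing either value."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for r in rows:
        t, o = r.get("cos_dist_transcriber"), r.get("cos_dist_oracle")
        if t is None or o is None:
            continue  # phone missing from the vector table, or not aligned
        l1 = str(r["l1"])
        sums[l1] += float(t) - float(o)
        counts[l1] += 1
    return {l1: sums[l1] / counts[l1] for l1 in sums}


# Hypothetical token rows shaped like the harness output.
sample = [
    {"l1": "Spanish", "cos_dist_transcriber": 0.30, "cos_dist_oracle": 0.20},
    {"l1": "Spanish", "cos_dist_transcriber": 0.10, "cos_dist_oracle": 0.10},
    {"l1": "Korean", "cos_dist_transcriber": 0.25, "cos_dist_oracle": None},
    {"l1": "Korean", "cos_dist_transcriber": 0.40, "cos_dist_oracle": 0.50},
]
gaps = gap_by_l1(sample)
```

A positive gap for an L1 would indicate the transcriber chain scores those speakers as further from canonical than human annotators perceived, and a negative gap the reverse.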

[ ] Step 5: Commit

git add research/2026-06-05-phon-129-l2-accent-scorer/01_run_l2arctic.py research/2026-06-05-phon-129-l2-accent-scorer/02_metrics.py research/2026-06-05-phon-129-l2-accent-scorer/RESULTS.md
git commit -m "research(phon-129): L2-ARCTIC validation harness — pooled + per-L1 metrics"

Task 9: Full test matrix + final verification¶

Files: none (verification only)

[ ] Step 1: Run the workers test suite

Run: cd packages/web/workers && npm test
Expected: all pass (pronunciationScore, pronounce-fixture, audio route, existing suites).

[ ] Step 2: Run the frontend test suite + type-check + build

Run: cd packages/web/frontend && npx vitest run && npx tsc --noEmit && npm run build
Expected: tests pass, no type errors, build succeeds.

[ ] Step 3: Run the workers type-check

Run: cd packages/web/workers && npx tsc --noEmit
Expected: no errors.

[ ] Step 4: Confirm CI-equivalent checks pass (untracked-file trap)

Run from the repo root: git status. Confirm there are no stray untracked files that would break CI, then re-run the workers and frontend suites above to be sure nothing relied on uncommitted local state.

[ ] Step 5: Final commit if any verification fixups were needed

git add -A && git commit -m "chore(phon-129): verification fixups" || echo "nothing to commit"

Self-Review (completed during authoring)¶

Spec coverage:
- §2 architecture (Approach A, transcribe sub-call, in-Worker score) → Tasks 1,2,5 ✓
- §2.2 transcriber default off-the-shelf, ft selectable → Task 5 fetchTranscript ✓
- §3 contract (per_position keyed to canonical, insertions out-of-band, op set, overall_score, threshold_basis, l1 echo, error codes) → Tasks 2,3,5 ✓
- §4 scoring module (cosDist, alignWPER, classify, T=0.112) → Tasks 2,3 ✓
- §5 dev page (record/upload/preloaded, target word, L1 dropdown, transcriber toggle, per-position heat, warming) → Task 7 ✓
- §6.1 drift fixture → Task 4 ✓
- §6.2 L2-ARCTIC validation pooled + per-L1, transcriber-vs-oracle gap → Task 8 ✓
- §6.3 PhonBank error pole → Task 8 (noted as extension) ✓
- §7 L1 seam (optional l1, threshold_basis tag, stratified validation) → Tasks 5,8 ✓

Placeholder scan: the inline "implement against X" markers are in Task 8 (research harness: parse_annotation/transcribe/alignment), which is inherently host- and drive-dependent and cannot be fully literal. Task 7 also carries one commented capture-UI placeholder, which Step 4 says to replace by copying AudioTranscribeViewer's capture section. Every other shipped-code step (Tasks 1–7, 9) has complete code. No TBD/TODO in production code steps.

Type consistency: PhonemeCache ({normSq, dots} Maps) consistent across Tasks 2/4; cosDist/alignWPER/scorePronunciation signatures stable; PronunciationResult/PositionScore shared between service (Task 6) and component (Task 7); route response (Task 5) matches the PronunciationResult fields.