PHON-129: Model #2 L2/Accent Pronunciation Scorer Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Ship POST /api/audio/pronounce (target word + audio → per-position cos_dist score, overall score, variant/error class) plus a /dev/pronounce viewer page, validated on L2-ARCTIC.
Architecture: Scoring runs in the Worker (Approach A). The route sub-calls the existing local phonolex_audio inference server (FastAPI; /transcribe + /compare, reached via AUDIO_INFERENCE_URL, already wired) for the produced phonemes, looks up canonical phonemes from D1 words, and computes WPER alignment + cos_dist = clip(1 − cosine, 0, 1) over the learned feature vectors using the PhonemeCache (norms + dots) already loaded by the similarity route. The metric is PHON-126's, ported to TS and pinned to a frozen fixture for drift. The frontend is a dev page mirroring AudioTranscribeViewer.tsx, not the eventual unified tool. (RunPod is not used — the Worker is host-agnostic; the production host is a later deploy decision.)
Tech Stack: TypeScript, Hono (Cloudflare Workers), D1 (SQLite), Vitest (@cloudflare/vitest-pool-workers + cloudflare:test), React + MUI + React Router, Python (validation harness, Polars/NumPy).
Spec: docs/superpowers/specs/2026-06-05-phon-129-l2-accent-scorer-design.md
Grounding: research/2026-06-05-phon-129-l2-accent-scorer/FLEGE_SLM.md
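A minimal sketch of the metric named in the Architecture paragraph, assuming the cache stores a squared norm per phoneme and one dot product per unordered pair, as the synthetic cache in Task 2 does. The MiniCache type, the "x,y" key layout, and cosineFromCache are illustrative stand-ins; the real lookup lives in phonemeCosine in similarity.ts and may differ.

```typescript
// Hypothetical stand-in for PhonemeCache: squared norms plus pairwise dot products.
interface MiniCache {
  normSq: Map<string, number>;
  dots: Map<string, number>; // keyed "x,y" with x <= y (assumed layout)
}

// cosine(a, b) = (a . b) / (|a| |b|), rebuilt from cached scalars only.
function cosineFromCache(a: string, b: string, cache: MiniCache): number {
  if (a === b) return 1;
  const key = a <= b ? `${a},${b}` : `${b},${a}`;
  const dot = cache.dots.get(key);
  const na = cache.normSq.get(a);
  const nb = cache.normSq.get(b);
  // Unknown phoneme or zero vector: treat as maximally dissimilar (a choice made for this sketch).
  if (dot === undefined || !na || !nb) return 0;
  return dot / Math.sqrt(na * nb);
}

// cos_dist = clip(1 - cosine, 0, 1)
function cosDistSketch(a: string, b: string, cache: MiniCache): number {
  return Math.min(1, Math.max(0, 1 - cosineFromCache(a, b, cache)));
}

const demo: MiniCache = {
  normSq: new Map([['a', 1], ['b', 1], ['c', 1]]),
  dots: new Map([['a,b', 0.8], ['a,c', 0], ['b,c', 0]]),
};

console.log(cosDistSketch('a', 'b', demo)); // about 0.2 (1 - 0.8, up to float rounding)
console.log(cosDistSketch('b', 'a', demo)); // same value: the lookup is order-independent
console.log(cosDistSketch('a', 'c', demo)); // 1 (orthogonal pair)
```

Working from norms and dots means the Worker never has to load full feature vectors to score a word; only the scalar products for the phonemes involved are needed.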
File Structure
Worker (scoring + route):
- Create packages/web/workers/src/lib/pronunciationScore.ts — pure scoring (cos_dist, WPER align + traceback, classify, overall). One responsibility: turn (produced, canonical, cache) into the score object.
- Modify packages/web/workers/src/lib/similarity.ts — export phonemeCosine (currently module-private) so the scorer reuses it. No logic change.
- Modify packages/web/workers/src/routes/audio.ts — add the pronounce handler (transcribe sub-call + D1 canonical lookup + cache load + score). Stays in audio.ts; it's one more handler in the audio surface.
- Create packages/web/workers/src/__tests__/pronunciationScore.test.ts — unit tests (synthetic cache, no D1).
- Create packages/web/workers/src/__tests__/pronounce-fixture.test.ts — PHON-126 drift fixture test.
- Modify packages/web/workers/src/__tests__/audio.test.ts — add /api/audio/pronounce route tests.
Frontend (dev page):
- Modify packages/web/frontend/src/services/audioApi.ts — add pronounceAudio() + PronunciationResult type.
- Create packages/web/frontend/src/services/audioApi.pronounce.test.ts: service tests for pronounceAudio() (see Task 6).
- Create packages/web/frontend/src/components/tools/PronunciationViewer.tsx — dev page.
- Create packages/web/frontend/src/components/tools/PronunciationViewer.test.tsx — component tests.
- Modify packages/web/frontend/src/main.tsx — register /dev/pronounce route.
Validation harness (research, not shipped):
- Create research/2026-06-05-phon-129-l2-accent-scorer/gen_score_fixtures.py — emits the drift fixture from vectors.csv.
- Create research/2026-06-05-phon-129-l2-accent-scorer/score_fixtures.json — generated artifact (committed).
- Create research/2026-06-05-phon-129-l2-accent-scorer/01_run_l2arctic.py — transcribe + score L2-ARCTIC.
- Create research/2026-06-05-phon-129-l2-accent-scorer/02_metrics.py — PHON-126 metrics, pooled + per-L1.
- Create research/2026-06-05-phon-129-l2-accent-scorer/RESULTS.md — GO/NO-GO write-up.
Task 1: Export phonemeCosine from similarity.ts
Files:
- Modify: packages/web/workers/src/lib/similarity.ts (the function phonemeCosine declaration)
- [ ] Step 1: Export the function
In packages/web/workers/src/lib/similarity.ts, change the declaration from module-private to exported. Find:
function phonemeCosine(p1: string, p2: string, cache: PhonemeCache): number {
Change to:
export function phonemeCosine(p1: string, p2: string, cache: PhonemeCache): number {
- [ ] Step 2: Verify the workers package still type-checks
Run: cd packages/web/workers && npx tsc --noEmit
Expected: no errors.
- [ ] Step 3: Commit
git add packages/web/workers/src/lib/similarity.ts
git commit -m "refactor(phon-129): export phonemeCosine for reuse by the scorer"
Task 2: Scoring module (cos_dist + WPER alignment with traceback)
Files:
- Create: packages/web/workers/src/lib/pronunciationScore.ts
- Test: packages/web/workers/src/__tests__/pronunciationScore.test.ts
- [ ] Step 1: Write the failing test
Create packages/web/workers/src/__tests__/pronunciationScore.test.ts. The synthetic cache encodes three phonemes: a,b are close (cosine 0.8 → cos_dist 0.2), a,c are far (cosine 0.0 → cos_dist 1.0). Norms are 1 (unit vectors) so cosine = dot.
import { describe, it, expect } from 'vitest';
import type { PhonemeCache } from '../lib/similarity';
import { cosDist, alignWPER } from '../lib/pronunciationScore';
// Unit-norm synthetic cache: dot == cosine. a·b = 0.8, a·c = 0.0, b·c = 0.0
function synthCache(): PhonemeCache {
const normSq = new Map<string, number>([['a', 1], ['b', 1], ['c', 1]]);
const dots = new Map<string, number>([['a,b', 0.8], ['a,c', 0.0], ['b,c', 0.0]]);
return { normSq, dots };
}
describe('cosDist', () => {
it('is 0 for identical phonemes', () => {
expect(cosDist('a', 'a', synthCache())).toBeCloseTo(0, 6);
});
it('is 1 - cosine for a near pair, clamped to [0,1]', () => {
expect(cosDist('a', 'b', synthCache())).toBeCloseTo(0.2, 6);
});
it('is 1 for an orthogonal pair', () => {
expect(cosDist('a', 'c', synthCache())).toBeCloseTo(1.0, 6);
});
});
describe('alignWPER', () => {
it('scores a perfect match as wper 0, all positions match', () => {
const r = alignWPER(['a', 'b'], ['a', 'b'], synthCache());
expect(r.wper).toBeCloseTo(0, 6);
expect(r.perPosition.map((p) => p.op)).toEqual(['match', 'match']);
expect(r.insertions).toEqual([]);
});
it('records a substitution with its cos_dist at the canonical position', () => {
// canonical [a,a] vs produced [a,b] → pos2 sub a→b, cos_dist 0.2
const r = alignWPER(['a', 'a'], ['a', 'b'], synthCache());
expect(r.perPosition[0]).toMatchObject({ canonical: 'a', produced: 'a', op: 'match' });
expect(r.perPosition[1]).toMatchObject({ canonical: 'a', produced: 'b', op: 'sub' });
expect(r.perPosition[1].cos_dist).toBeCloseTo(0.2, 6);
expect(r.wper).toBeCloseTo(0.1, 6); // 0.2 / 2 canonical positions
});
it('records a deletion (omitted canonical phone) with cos_dist 1', () => {
// canonical [a,b] vs produced [a] → b deleted
const r = alignWPER(['a', 'b'], ['a'], synthCache());
expect(r.perPosition[1]).toMatchObject({ canonical: 'b', produced: null, op: 'del' });
expect(r.perPosition[1].cos_dist).toBe(1);
expect(r.wper).toBeCloseTo(0.5, 6); // one indel / 2 canonical
});
it('records an insertion out-of-band keyed to the preceding canonical index', () => {
// canonical [a] vs produced [a,c] → c inserted after canonical index 0
const r = alignWPER(['a'], ['a', 'c'], synthCache());
expect(r.perPosition).toHaveLength(1);
expect(r.perPosition[0].op).toBe('match');
expect(r.insertions).toEqual([{ produced: 'c', after_canonical_index: 0 }]);
});
it('treats an empty production as full deletion (wper 1)', () => {
const r = alignWPER(['a', 'b'], [], synthCache());
expect(r.perPosition.map((p) => p.op)).toEqual(['del', 'del']);
expect(r.wper).toBeCloseTo(1, 6);
});
});
- [ ] Step 2: Run test to verify it fails
Run: cd packages/web/workers && npx vitest run src/__tests__/pronunciationScore.test.ts
Expected: FAIL — pronunciationScore.ts does not exist / cosDist is not exported.
- [ ] Step 3: Write minimal implementation
Create packages/web/workers/src/lib/pronunciationScore.ts:
/**
* L2/accent pronunciation scoring — PHON-129 Model #2.
*
* WPER alignment of a produced phoneme sequence against the canonical target,
* substitution cost = cos_dist over the learned feature vectors (PHON-126 metric).
* Per-position output is keyed to CANONICAL positions (the targets the learner aims
* at); extra produced phones are reported out-of-band as insertions.
*/
import { phonemeCosine, type PhonemeCache } from './similarity';
export type Op = 'match' | 'sub' | 'del';
export interface PositionScore {
canonical: string;
produced: string | null;
cos_dist: number;
op: Op;
}
export interface Insertion {
produced: string;
after_canonical_index: number;
}
export interface AlignResult {
perPosition: PositionScore[];
insertions: Insertion[];
wper: number;
}
/** cos_dist = clip(1 - cosine, 0, 1). PHON-126's substitution cost. */
export function cosDist(a: string, b: string, cache: PhonemeCache): number {
return Math.min(1, Math.max(0, 1 - phonemeCosine(a, b, cache)));
}
/**
* Levenshtein DP over (canonical rows, produced cols) with soft substitution cost.
* canonical[i-1] vs produced[j-1]. Indel = 1.0. Traceback recovers ops.
* wper = total cost / canonical length.
*/
export function alignWPER(
canonical: string[],
produced: string[],
cache: PhonemeCache,
): AlignResult {
const n = canonical.length;
const m = produced.length;
const dp: number[][] = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
for (let i = 0; i <= n; i++) dp[i][0] = i;
for (let j = 0; j <= m; j++) dp[0][j] = j;
for (let i = 1; i <= n; i++) {
for (let j = 1; j <= m; j++) {
const sub = dp[i - 1][j - 1] + cosDist(canonical[i - 1], produced[j - 1], cache);
const del = dp[i - 1][j] + 1; // omit canonical[i-1]
const ins = dp[i][j - 1] + 1; // extra produced[j-1]
dp[i][j] = Math.min(sub, del, ins);
}
}
// Traceback from (n, m). Collect canonical-keyed ops + out-of-band insertions.
const perPositionRev: PositionScore[] = [];
const insertionsRev: Insertion[] = [];
let i = n;
let j = m;
while (i > 0 || j > 0) {
if (i > 0 && j > 0) {
const subCost = cosDist(canonical[i - 1], produced[j - 1], cache);
if (Math.abs(dp[i][j] - (dp[i - 1][j - 1] + subCost)) < 1e-9) {
perPositionRev.push({
canonical: canonical[i - 1],
produced: produced[j - 1],
cos_dist: subCost,
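// Label by symbol identity so float rounding in the cosine cannot turn a true match into a 'sub'.
op: canonical[i - 1] === produced[j - 1] ? 'match' : 'sub',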
});
i--; j--;
continue;
}
}
if (i > 0 && Math.abs(dp[i][j] - (dp[i - 1][j] + 1)) < 1e-9) {
perPositionRev.push({ canonical: canonical[i - 1], produced: null, cos_dist: 1, op: 'del' });
i--;
continue;
}
// insertion: extra produced[j-1], lands after canonical index (i-1)
insertionsRev.push({ produced: produced[j - 1], after_canonical_index: i - 1 });
j--;
}
perPositionRev.reverse();
insertionsRev.reverse();
return {
perPosition: perPositionRev,
insertions: insertionsRev,
wper: n === 0 ? (m === 0 ? 0 : 1) : dp[n][m] / n,
};
}
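To make the recurrence easier to check by hand, here is a stripped-down sketch of the same dynamic program that returns only the cost matrix. It takes the substitution cost as a plain function, so it needs no cache. costMatrix and the toy cost function are hypothetical helpers for illustration, not part of the module.

```typescript
// Soft-substitution Levenshtein: rows = canonical, cols = produced, indel = 1.
function costMatrix(
  canonical: string[],
  produced: string[],
  subCost: (c: string, p: string) => number,
): number[][] {
  const n = canonical.length;
  const m = produced.length;
  const dp: number[][] = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(0));
  for (let i = 0; i <= n; i++) dp[i][0] = i; // delete every canonical phone
  for (let j = 0; j <= m; j++) dp[0][j] = j; // insert every produced phone
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j - 1] + subCost(canonical[i - 1], produced[j - 1]),
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
      );
    }
  }
  return dp;
}

// Toy cost matching the synthetic cache: a/b are near (0.2), any other mismatch is far (1).
const cost = (c: string, p: string): number =>
  c === p ? 0 : [c, p].sort().join(',') === 'a,b' ? 0.2 : 1;

const dp = costMatrix(['a', 'a'], ['a', 'b'], cost);
// dp is [[0, 1, 2], [1, 0, 1], [2, 1, 0.2]]; wper = dp[2][2] / 2 = 0.1
console.log(dp, dp[2][2] / 2);
```

The bottom-right cell is the total alignment cost. Dividing it by the canonical length gives the wper value asserted in the substitution test.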
- [ ] Step 4: Run test to verify it passes
Run: cd packages/web/workers && npx vitest run src/__tests__/pronunciationScore.test.ts
Expected: PASS (all cases).
- [ ] Step 5: Commit
git add packages/web/workers/src/lib/pronunciationScore.ts packages/web/workers/src/__tests__/pronunciationScore.test.ts
git commit -m "feat(phon-129): WPER alignment + cos_dist scoring module"
Task 3: Classification + overall score + top-level scorePronunciation
Files:
- Modify: packages/web/workers/src/lib/pronunciationScore.ts
- Modify: packages/web/workers/src/__tests__/pronunciationScore.test.ts
- [ ] Step 1: Write the failing test
Append to packages/web/workers/src/__tests__/pronunciationScore.test.ts:
import { scorePronunciation, VARIANT_ERROR_THRESHOLD } from '../lib/pronunciationScore';
describe('scorePronunciation', () => {
it('pins the variant/error threshold at the PHON-126 boundary', () => {
// The synthetic a→b pair has cos_dist 0.2, above the threshold, so it cannot
// exercise the sub-threshold "variant" branch; this case only pins the constant.
expect(VARIANT_ERROR_THRESHOLD).toBeCloseTo(0.112, 6);
});
it('overall_score is 1 - wper', () => {
const r = scorePronunciation(['a', 'a'], ['a', 'b'], synthCache());
expect(r.overall_score).toBeCloseTo(0.9, 6); // wper 0.1
});
it('word class is error when any position meets or exceeds the threshold', () => {
// a→b cos_dist 0.2 >= 0.112 → that position is an error → word error
const r = scorePronunciation(['a', 'a'], ['a', 'b'], synthCache());
expect(r.variant_vs_error_class).toBe('error');
expect(r.threshold_basis).toBe('l1_agnostic');
});
it('word class is variant for a perfect match', () => {
const r = scorePronunciation(['a', 'b'], ['a', 'b'], synthCache());
expect(r.variant_vs_error_class).toBe('variant');
expect(r.overall_score).toBeCloseTo(1, 6);
});
it('a deletion forces the word to error', () => {
const r = scorePronunciation(['a', 'b'], ['a'], synthCache());
expect(r.variant_vs_error_class).toBe('error');
});
});
- [ ] Step 2: Run test to verify it fails
Run: cd packages/web/workers && npx vitest run src/__tests__/pronunciationScore.test.ts
Expected: FAIL — scorePronunciation / VARIANT_ERROR_THRESHOLD not exported.
- [ ] Step 3: Write minimal implementation
Append to packages/web/workers/src/lib/pronunciationScore.ts:
/** PHON-126 practical boundary: midpoint of variant-75th (0.102) / error-25th (0.122).
* This is the knob SLM-r flags as L1-sensitive (see FLEGE_SLM.md); v6.1 is L1-agnostic. */
export const VARIANT_ERROR_THRESHOLD = 0.112;
export interface PronunciationScore {
per_position: PositionScore[];
insertions: Insertion[];
overall_score: number;
variant_vs_error_class: 'variant' | 'error';
threshold_basis: 'l1_agnostic';
}
/** Per-word label = worst position (any error ⇒ word is error). Deletions are errors. */
export function scorePronunciation(
canonical: string[],
produced: string[],
cache: PhonemeCache,
): PronunciationScore {
const { perPosition, insertions, wper } = alignWPER(canonical, produced, cache);
const hasError = perPosition.some(
(p) => p.op === 'del' || (p.op === 'sub' && p.cos_dist >= VARIANT_ERROR_THRESHOLD),
);
return {
per_position: perPosition,
insertions,
overall_score: Math.min(1, Math.max(0, 1 - wper)),
variant_vs_error_class: hasError ? 'error' : 'variant',
threshold_basis: 'l1_agnostic',
};
}
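The worst-position rule can be sketched on its own, which also shows the sub-threshold "variant" case that the synthetic a/b pair (cos_dist 0.2) cannot exercise. classifyWord and the Pos shape are hypothetical simplifications of the module's types, and the 0.05 distance is a made-up value chosen only to fall below the threshold.

```typescript
type Pos = { op: 'match' | 'sub' | 'del'; cos_dist: number };

// Midpoint of the two PHON-126 quartiles quoted above: (0.102 + 0.122) / 2.
const THRESHOLD = 0.112;

// A word is an error if any canonical position is deleted, or substituted
// at or beyond the threshold; otherwise it counts as an acceptable variant.
function classifyWord(positions: Pos[]): 'variant' | 'error' {
  const hasError = positions.some(
    (p) => p.op === 'del' || (p.op === 'sub' && p.cos_dist >= THRESHOLD),
  );
  return hasError ? 'error' : 'variant';
}

const nearSub: Pos[] = [{ op: 'match', cos_dist: 0 }, { op: 'sub', cos_dist: 0.05 }];
const farSub: Pos[] = [{ op: 'match', cos_dist: 0 }, { op: 'sub', cos_dist: 0.2 }];
const dropped: Pos[] = [{ op: 'match', cos_dist: 0 }, { op: 'del', cos_dist: 1 }];

console.log(classifyWord(nearSub)); // "variant"
console.log(classifyWord(farSub)); // "error"
console.log(classifyWord(dropped)); // "error"
```

As in scorePronunciation, insertions do not enter the class decision here; they only lower overall_score through wper.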
- [ ] Step 4: Run test to verify it passes
Run: cd packages/web/workers && npx vitest run src/__tests__/pronunciationScore.test.ts
Expected: PASS.
- [ ] Step 5: Commit
git add packages/web/workers/src/lib/pronunciationScore.ts packages/web/workers/src/__tests__/pronunciationScore.test.ts
git commit -m "feat(phon-129): variant/error classification + overall score"
Task 4: PHON-126 drift fixture (Python generator + TS pin test)
Files:
- Create: research/2026-06-05-phon-129-l2-accent-scorer/gen_score_fixtures.py
- Create: research/2026-06-05-phon-129-l2-accent-scorer/score_fixtures.json (generated)
- Create: packages/web/workers/src/__tests__/pronounce-fixture.test.ts
- [ ] Step 1: Write the fixture generator
Create research/2026-06-05-phon-129-l2-accent-scorer/gen_score_fixtures.py. It loads the same learned vectors PHON-126 used, computes the exact cos_dist and WPER for a few (canonical, produced) pairs, and emits BOTH the cache data (normSq + dots for the involved phonemes) and the expected outputs — so the TS test is self-contained and needs no D1.
"""Generate score_fixtures.json: PHON-126 cos_dist/WPER ground truth + the cache
data needed to reproduce it in TypeScript. Pins pronunciationScore.ts against the
validated Python metric. Run: uv run python gen_score_fixtures.py"""
import json
from pathlib import Path
import numpy as np
import polars as pl
VECTORS = Path(__file__).resolve().parents[2] / "packages/features/outputs/vectors.csv"
OUT = Path(__file__).resolve().parent / "score_fixtures.json"
# (canonical, produced) pairs spanning match / near-sub / far-sub / deletion.
CASES = [
(["k", "æ", "t"], ["k", "æ", "t"]), # perfect
(["v", "ɛ", "ɹ", "i"], ["b", "ɛ", "ɹ", "i"]), # v→b substitution
(["k", "æ", "t"], ["k", "æ"]), # final deletion
(["s", "ɪ", "t"], ["ʃ", "ɪ", "t"]), # s→ʃ substitution
]
def cos_dist(v1, v2):
    c = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return min(1.0, max(0.0, 1.0 - c))
def wper(canon, prod, vec):
    n, m = len(canon), len(prod)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): dp[i][0] = i
    for j in range(m + 1): dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i-1][j-1] + cos_dist(vec[canon[i-1]], vec[prod[j-1]])
            dp[i][j] = min(sub, dp[i-1][j] + 1, dp[i][j-1] + 1)
    return dp[n][m] / n if n else (0.0 if m == 0 else 1.0)
df = pl.read_csv(VECTORS)
feat_cols = [c for c in df.columns if c != "ipa"]
vec = {row["ipa"]: np.array([row[c] for c in feat_cols], dtype=float)
       for row in df.iter_rows(named=True)}
phones = sorted({p for c, pr in CASES for p in (*c, *pr)})
norm_sq = {p: float(vec[p] @ vec[p]) for p in phones}
dots = {}
for a in phones:
    for b in phones:
        if a <= b:
            dots[f"{a},{b}"] = float(vec[a] @ vec[b])
cases = []
for canon, prod in CASES:
    cases.append({
        "canonical": canon,
        "produced": prod,
        "expected_wper": round(wper(canon, prod, vec), 9),
        "expected_per_position_cos_dist": [
            round(cos_dist(vec[c], vec[p]), 9) for c, p in zip(canon, prod)
        ][: min(len(canon), len(prod))],
    })
OUT.write_text(json.dumps({"normSq": norm_sq, "dots": dots, "cases": cases}, indent=2))
print(f"wrote {OUT} ({len(cases)} cases, {len(phones)} phonemes)")
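The generator stores each dot product once, under the key "a,b" with a <= b in Python's string order. A TypeScript reader has to order the pair the same way before looking it up. The helper below is a sketch under the assumption that lookups go through such an ordered key; dotKey is a hypothetical name, and whether phonemeCosine already orders its keys this way should be checked in similarity.ts.

```typescript
// Build the lookup key for an unordered phoneme pair.
// JavaScript compares strings by UTF-16 code unit, which matches Python's
// code-point order for characters in the Basic Multilingual Plane.
function dotKey(p1: string, p2: string): string {
  return p1 <= p2 ? `${p1},${p2}` : `${p2},${p1}`;
}

console.log(dotKey('t', 'k')); // "k,t"
console.log(dotKey('ʃ', 's')); // "s,ʃ"
console.log(dotKey('æ', 'æ')); // "æ,æ"
```

If the two sides disagree on ordering, half of the lookups miss silently, so this is worth confirming before trusting a passing fixture test.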
- [ ] Step 2: Generate the fixture
Run: cd research/2026-06-05-phon-129-l2-accent-scorer && uv run python gen_score_fixtures.py
Expected: wrote .../score_fixtures.json (4 cases, 11 phonemes). The four CASES use 11 distinct phonemes (k, æ, t, v, ɛ, ɹ, i, b, s, ɪ, ʃ). Confirm the file exists and contains normSq, dots, cases.
- [ ] Step 3: Write the failing TS pin test
Create packages/web/workers/src/__tests__/pronounce-fixture.test.ts:
import { describe, it, expect } from 'vitest';
import type { PhonemeCache } from '../lib/similarity';
import { alignWPER } from '../lib/pronunciationScore';
import fixture from '../../../../../research/2026-06-05-phon-129-l2-accent-scorer/score_fixtures.json';
function fixtureCache(): PhonemeCache {
return {
normSq: new Map(Object.entries(fixture.normSq as Record<string, number>)),
dots: new Map(Object.entries(fixture.dots as Record<string, number>)),
};
}
describe('pronunciationScore matches the PHON-126 Python metric', () => {
const cache = fixtureCache();
for (const c of fixture.cases as Array<{
canonical: string[]; produced: string[];
expected_wper: number; expected_per_position_cos_dist: number[];
}>) {
it(`reproduces wper for ${c.canonical.join('')} vs ${c.produced.join('')}`, () => {
const r = alignWPER(c.canonical, c.produced, cache);
expect(r.wper).toBeCloseTo(c.expected_wper, 6);
const subOrMatch = r.perPosition
.filter((p) => p.op !== 'del')
.map((p) => p.cos_dist);
c.expected_per_position_cos_dist.forEach((expected, idx) => {
expect(subOrMatch[idx]).toBeCloseTo(expected, 6);
});
});
}
});
Note: the five ../ segments climb from packages/web/workers/src/__tests__/ through src, workers, web, and packages to the repo root, where research/ lives. If the test runner still reports an unresolved module, adjust the number of ../ to match your checkout. The JSON import needs resolveJsonModule enabled in the workers tsconfig.json; if it is not, add "resolveJsonModule": true under compilerOptions.
- [ ] Step 4: Run test to verify it passes
Run: cd packages/web/workers && npx vitest run src/__tests__/pronounce-fixture.test.ts
Expected: PASS. TS reproduces the Python cos_dist/WPER values to six decimal places (toBeCloseTo with precision 6).
- [ ] Step 5: Commit
git add research/2026-06-05-phon-129-l2-accent-scorer/gen_score_fixtures.py research/2026-06-05-phon-129-l2-accent-scorer/score_fixtures.json packages/web/workers/src/__tests__/pronounce-fixture.test.ts
git commit -m "test(phon-129): pin TS scorer to PHON-126 Python metric via frozen fixture"
Task 5: /api/audio/pronounce route handler
Files:
- Modify: packages/web/workers/src/routes/audio.ts
- Modify: packages/web/workers/src/__tests__/audio.test.ts
- [ ] Step 1: Write the failing route tests
Append to packages/web/workers/src/__tests__/audio.test.ts:
describe('POST /api/audio/pronounce', () => {
it('returns 400 when no audio part is present', async () => {
const fd = new FormData();
fd.append('target_word', 'very');
const res = await SELF.fetch('http://localhost/api/audio/pronounce', { method: 'POST', body: fd });
expect(res.status).toBe(400);
const body = await res.json() as Record<string, unknown>;
expect(body).toHaveProperty('detail');
});
it('returns 400 when target_word is missing', async () => {
const fd = new FormData();
fd.append('audio', new Blob([new Uint8Array([1, 2, 3])], { type: 'audio/wav' }), 'clip.wav');
const res = await SELF.fetch('http://localhost/api/audio/pronounce', { method: 'POST', body: fd });
expect(res.status).toBe(400);
const body = await res.json() as Record<string, unknown>;
expect(body).toHaveProperty('detail');
});
it('passes the transcriber sub-call warming state through as 503', async () => {
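// The handler transcribes before it touches D1, so a warming upstream should
// surface as 503. 500 is also accepted in case the route fails earlier in the
// test environment (for example, AUDIO_INFERENCE_URL unset). Either way the
// route must not return 200 without a transcript.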
fetchMock
.get('http://127.0.0.1:8000')
.intercept({ path: '/transcribe', method: 'POST' })
.reply(503, { warming: true });
const fd = new FormData();
fd.append('audio', new Blob([new Uint8Array([1, 2, 3])], { type: 'audio/wav' }), 'clip.wav');
fd.append('target_word', 'very');
const res = await SELF.fetch('http://localhost/api/audio/pronounce', { method: 'POST', body: fd });
expect([500, 503]).toContain(res.status);
});
});
Note: the success path needs a seeded D1 (words, phonemes, phoneme_dots), which CI does not provide — consistent with api.test.ts's convention. Do not add a success-path assertion that requires seeded D1; the scoring math is already covered by Tasks 2–4. The route tests cover validation + sub-call wiring.
- [ ] Step 2: Run tests to verify they fail
Run: cd packages/web/workers && npx vitest run src/__tests__/audio.test.ts
Expected: FAIL — /api/audio/pronounce returns 404 (route not registered).
- [ ] Step 3: Implement the handler
In packages/web/workers/src/routes/audio.ts, add imports at the top (after the existing imports):
import { buildPhonemeCache, type PhonemeCache } from '../lib/similarity';
import { scorePronunciation } from '../lib/pronunciationScore';
Add a module-level cache (mirrors the similarity route's lazy load) above the route definitions:
// Phoneme vector cache (norms + dots), loaded once per isolate from D1.
let phonemeCache: PhonemeCache | null = null;
async function getPhonemeCache(db: D1Database): Promise<PhonemeCache> {
if (phonemeCache) return phonemeCache;
const { results: phons } = await db
.prepare('SELECT ipa, norm_sq FROM phonemes')
.all<{ ipa: string; norm_sq: number }>();
const { results: dots } = await db
.prepare('SELECT ipa1, ipa2, dot_product FROM phoneme_dots')
.all<{ ipa1: string; ipa2: string; dot_product: number }>();
phonemeCache = buildPhonemeCache(phons, dots);
return phonemeCache;
}
/** Fetch the produced transcript from the inference host. off-the-shelf → /transcribe;
* ft → /compare (PHON-139 lineage), taking the .ft transcript (fallback to baseline). */
async function fetchTranscript(
base: string, file: File, transcriber: string, language: string | null,
): Promise<{ phonemes: string[]; confidences?: number[]; duration_ms?: number;
coverage?: string; limitations?: string[] } | { warming: true }> {
const path = transcriber === 'ft' ? '/compare' : '/transcribe';
const fwd = new FormData();
fwd.append('audio', file, file.name || 'clip');
if (language) fwd.append('language', language);
let upstream: Response;
try {
upstream = await fetch(`${base}${path}`, { method: 'POST', body: fwd });
} catch {
return { warming: true };
}
if (upstream.status === 503) return { warming: true };
if (!upstream.ok) throw new Error(`inference ${upstream.status}`);
const body = await upstream.json() as Record<string, unknown>;
if (transcriber === 'ft') {
const ft = (body.ft ?? body.baseline) as { phonemes: string[] } | undefined;
if (!ft) throw new Error('compare response missing ft/baseline');
return ft as { phonemes: string[] };
}
return body as { phonemes: string[] };
}
const PRONOUNCE_LIMITS = [
'Scores against the canonical target; assumes the speaker intends canonical.',
'Broad-phoneme only; distortions/covert contrast not modeled (Models #3, #5).',
'variant/error threshold is L1-agnostic in v6.1.',
];
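The response handling in fetchTranscript can be isolated as a pure function, which makes the ft-with-baseline fallback easy to unit test without a network mock. This is a sketch only: pickTranscript is a hypothetical name, the /compare body shape (ft and baseline keys) is taken from the code above rather than from the inference server's schema, and the sketch keeps only phonemes and validates them, where the handler code casts the body and passes the other fields through.

```typescript
type Transcript = { phonemes: string[] };

// Select the transcript to score from an upstream JSON body.
// 'ft' reads the /compare shape and prefers body.ft, falling back to body.baseline;
// anything else treats the body itself as a /transcribe transcript.
function pickTranscript(body: Record<string, unknown>, transcriber: string): Transcript {
  const candidate = transcriber === 'ft' ? (body.ft ?? body.baseline) : body;
  const phonemes = (candidate as { phonemes?: unknown } | undefined)?.phonemes;
  if (!Array.isArray(phonemes) || !phonemes.every((p) => typeof p === 'string')) {
    throw new Error('upstream response has no phoneme list');
  }
  return { phonemes: phonemes as string[] };
}

// /compare body with no ft key: the baseline transcript is used instead.
const compareBody = { baseline: { phonemes: ['b', 'ɛ'] } };
console.log(pickTranscript(compareBody, 'ft').phonemes);
console.log(pickTranscript({ phonemes: ['k', 'æ', 't'] }, 'off-the-shelf').phonemes);
```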
Add the handler (register before export default audio;):
audio.post('/pronounce', async (c) => {
// 1. Validate multipart
const form = await c.req.formData().catch(() => null);
if (!form) return c.json({ detail: 'Missing required multipart field: audio' }, 400);
const fileEntry = form.get('audio');
if (!fileEntry || typeof fileEntry === 'string') {
return c.json({ detail: 'Missing required multipart field: audio' }, 400);
}
const file = fileEntry as File;
if (file.size > MAX_BYTES) return c.json({ detail: 'Audio exceeds 10 MB limit' }, 400);
if (file.type && !file.type.startsWith('audio/')) {
return c.json({ detail: `Unsupported content type: ${file.type}` }, 400);
}
const targetWord = form.get('target_word');
if (typeof targetWord !== 'string' || !targetWord.trim()) {
return c.json({ detail: 'Missing required field: target_word' }, 400);
}
const transcriber = form.get('transcriber') === 'ft' ? 'ft' : 'off-the-shelf';
const l1 = typeof form.get('l1') === 'string' ? (form.get('l1') as string) : null;
const language = typeof form.get('language') === 'string' ? (form.get('language') as string) : null;
const base = c.env.AUDIO_INFERENCE_URL?.replace(/\/$/, '');
if (!base) return c.json({ detail: 'Audio inference host not configured' }, 500);
// 2. Transcribe (produced phonemes)
const transcript = await fetchTranscript(base, file, transcriber, language);
if ('warming' in transcript) {
return c.json({ warming: true, detail: 'Inference host is warming up. Retry shortly.' }, 503);
}
const produced = transcript.phonemes ?? [];
// 3. Canonical phonemes from D1
const row = await c.env.DB
.prepare('SELECT phonemes FROM words WHERE word = ? LIMIT 1')
.bind(targetWord.trim().toLowerCase())
.first<{ phonemes: string | null }>();
if (!row || !row.phonemes) {
return c.json({ detail: `Word not in lexicon: ${targetWord}` }, 404);
}
const canonical = JSON.parse(row.phonemes) as string[];
// 4. Score in-Worker
const cache = await getPhonemeCache(c.env.DB);
const score = scorePronunciation(canonical, produced, cache);
// 5. Assemble response
return c.json({
target_word: targetWord,
canonical_phonemes: canonical,
transcript,
per_position: score.per_position,
insertions: score.insertions,
overall_score: score.overall_score,
variant_vs_error_class: score.variant_vs_error_class,
threshold_basis: score.threshold_basis,
l1,
transcriber,
coverage: 'broad-phoneme',
limitations: PRONOUNCE_LIMITS,
});
});
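The handler trusts words.phonemes to hold a JSON array of strings, so a malformed row would throw inside JSON.parse instead of returning a clean error. If that matters, a small guard could sit in front of the scorer. The sketch below is an optional hardening idea, not part of the plan as written; parseCanonical is a hypothetical helper.

```typescript
// Parse a words.phonemes cell into a phoneme list, or return null if it is unusable.
function parseCanonical(raw: string | null): string[] | null {
  if (!raw) return null;
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (!Array.isArray(value) || value.length === 0) return null;
  return value.every((p) => typeof p === 'string') ? (value as string[]) : null;
}

console.log(parseCanonical('["v","ɛ","ɹ","i"]')); // array of four phonemes
console.log(parseCanonical('v ɛ ɹ i')); // null (not JSON)
console.log(parseCanonical(null)); // null
```

A null result could then be mapped to the same 404 the handler already returns for a word that is not in the lexicon.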
- [ ] Step 4: Run tests to verify they pass
Run: cd packages/web/workers && npx vitest run src/__tests__/audio.test.ts
Expected: PASS (validation 400s assert exactly; warming case accepts 500/503).
- [ ] Step 5: Type-check the workers package
Run: cd packages/web/workers && npx tsc --noEmit
Expected: no errors. (Confirm DB: D1Database is on Env — it is, per types.ts.)
- [ ] Step 6: Commit
git add packages/web/workers/src/routes/audio.ts packages/web/workers/src/__tests__/audio.test.ts
git commit -m "feat(phon-129): /api/audio/pronounce route — transcribe + canonical lookup + score"
Task 6: Frontend service: pronounceAudio()
Files:
- Modify: packages/web/frontend/src/services/audioApi.ts
- Create: packages/web/frontend/src/services/audioApi.pronounce.test.ts
- [ ] Step 1: Write the failing test
Create packages/web/frontend/src/services/audioApi.pronounce.test.ts:
import { describe, it, expect, vi, afterEach } from 'vitest';
import { pronounceAudio, TranscriberWarmingError } from './audioApi';
afterEach(() => vi.restoreAllMocks());
const sample = {
target_word: 'very', canonical_phonemes: ['v', 'ɛ', 'ɹ', 'i'],
transcript: { phonemes: ['b', 'ɛ', 'ɹ', 'i'], confidences: [], duration_ms: 1,
coverage: 'broad-phoneme', limitations: [] },
per_position: [], insertions: [], overall_score: 0.9,
variant_vs_error_class: 'error', threshold_basis: 'l1_agnostic',
l1: null, transcriber: 'off-the-shelf', coverage: 'broad-phoneme', limitations: [],
};
describe('pronounceAudio', () => {
it('posts multipart and returns the score', async () => {
const fetchSpy = vi.spyOn(globalThis, 'fetch').mockResolvedValue(
new Response(JSON.stringify(sample), { status: 200 }),
);
const res = await pronounceAudio(new Blob(['x']), 'very', { l1: 'Spanish' });
expect(res.overall_score).toBe(0.9);
const [, init] = fetchSpy.mock.calls[0];
expect((init as RequestInit).body).toBeInstanceOf(FormData);
});
it('throws TranscriberWarmingError on 503', async () => {
vi.spyOn(globalThis, 'fetch').mockResolvedValue(
new Response(JSON.stringify({ warming: true, detail: 'warming' }), { status: 503 }),
);
await expect(pronounceAudio(new Blob(['x']), 'very')).rejects.toBeInstanceOf(TranscriberWarmingError);
});
});
- [ ] Step 2: Run test to verify it fails
Run: cd packages/web/frontend && npx vitest run src/services/audioApi.pronounce.test.ts
Expected: FAIL — pronounceAudio not exported.
- [ ] Step 3: Implement
Append to packages/web/frontend/src/services/audioApi.ts:
export interface PositionScore {
canonical: string;
produced: string | null;
cos_dist: number;
op: 'match' | 'sub' | 'del';
}
export interface PronunciationResult {
target_word: string;
canonical_phonemes: string[];
transcript: TranscriptResult;
per_position: PositionScore[];
insertions: { produced: string; after_canonical_index: number }[];
overall_score: number;
variant_vs_error_class: 'variant' | 'error';
threshold_basis: 'l1_agnostic';
l1: string | null;
transcriber: 'off-the-shelf' | 'ft';
coverage: string;
limitations: string[];
}
export async function pronounceAudio(
blob: Blob,
targetWord: string,
opts?: { transcriber?: 'off-the-shelf' | 'ft'; l1?: string; language?: string },
): Promise<PronunciationResult> {
const fd = new FormData();
fd.append('audio', blob, 'recording');
fd.append('target_word', targetWord);
if (opts?.transcriber) fd.append('transcriber', opts.transcriber);
if (opts?.l1) fd.append('l1', opts.l1);
if (opts?.language) fd.append('language', opts.language);
const res = await fetch(`${baseUrl}/api/audio/pronounce`, {
method: 'POST',
headers: { 'X-Request-ID': freshRequestId() },
body: fd,
});
if (res.status === 503) {
const body = await res.json().catch(() => ({ detail: 'Warming up' })) as { detail?: string };
throw new TranscriberWarmingError(body.detail || 'Inference host is warming up.');
}
if (!res.ok) {
const detail = await res.text().catch(() => res.statusText);
throw new Error(`Pronounce failed (${res.status}): ${detail}`);
}
return res.json();
}
- [ ] Step 4: Run test to verify it passes
Run: cd packages/web/frontend && npx vitest run src/services/audioApi.pronounce.test.ts
Expected: PASS.
- [ ] Step 5: Commit
git add packages/web/frontend/src/services/audioApi.ts packages/web/frontend/src/services/audioApi.pronounce.test.ts
git commit -m "feat(phon-129): pronounceAudio frontend service"
Task 7: Frontend dev page: PronunciationViewer.tsx
Files:
- Read first: packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx (mirror its structure)
- Create: packages/web/frontend/src/components/tools/PronunciationViewer.tsx
- Create: packages/web/frontend/src/components/tools/PronunciationViewer.test.tsx
- Modify: packages/web/frontend/src/main.tsx
- [ ] Step 1: Read the existing viewer to mirror its patterns
Run: sed -n '1,120p' packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx
Note its imports (MUI components, transcribeAudio, TranscriberWarmingError, recorder hook/refs, preloaded-clip picker via loadAudioSamples) and reuse the same idioms.
- [ ] Step 2: Write the failing component test
Create packages/web/frontend/src/components/tools/PronunciationViewer.test.tsx:
import { describe, it, expect, vi, afterEach } from 'vitest';
import { render, screen, fireEvent, waitFor } from '@testing-library/react';
import PronunciationViewer from './PronunciationViewer';
import * as api from '../../services/audioApi';
afterEach(() => vi.restoreAllMocks());
const result: api.PronunciationResult = {
target_word: 'very', canonical_phonemes: ['v', 'ɛ', 'ɹ', 'i'],
transcript: { phonemes: ['b', 'ɛ', 'ɹ', 'i'], confidences: [], duration_ms: 1,
coverage: 'broad-phoneme', limitations: [] },
per_position: [
{ canonical: 'v', produced: 'b', cos_dist: 0.31, op: 'sub' },
{ canonical: 'ɛ', produced: 'ɛ', cos_dist: 0, op: 'match' },
{ canonical: 'ɹ', produced: 'ɹ', cos_dist: 0, op: 'match' },
{ canonical: 'i', produced: 'i', cos_dist: 0, op: 'match' },
],
insertions: [], overall_score: 0.92, variant_vs_error_class: 'error',
threshold_basis: 'l1_agnostic', l1: null, transcriber: 'off-the-shelf',
coverage: 'broad-phoneme', limitations: ['x'],
};
describe('PronunciationViewer', () => {
it('renders the target-word input and a score after pronounce', async () => {
vi.spyOn(api, 'loadAudioSamples').mockResolvedValue([]);
vi.spyOn(api, 'pronounceAudio').mockResolvedValue(result);
render(<PronunciationViewer />);
const input = screen.getByLabelText(/target word/i);
fireEvent.change(input, { target: { value: 'very' } });
fireEvent.click(screen.getByRole('button', { name: /score|pronounce/i }));
await waitFor(() => expect(screen.getByText(/0\.92|92/)).toBeInTheDocument());
expect(screen.getByText('v')).toBeInTheDocument(); // canonical position rendered
});
});
Note: the Step 4 component keeps the Score button disabled until a clip blob is set, so clicking it with no clip never calls pronounceAudio and this test would fail as written. Have the test select a clip or set a blob before the click, using the same mechanism the transcribe viewer's test uses (read AudioTranscribeViewer.test.tsx first and copy its clip-injection approach).
- [ ] Step 3: Run test to verify it fails
Run: cd packages/web/frontend && npx vitest run src/components/tools/PronunciationViewer.test.tsx
Expected: FAIL — component does not exist.
- [ ] Step 4: Implement the component
Create packages/web/frontend/src/components/tools/PronunciationViewer.tsx. Mirror AudioTranscribeViewer.tsx's recorder + preloaded-clip + warming-state scaffolding; add the target-word input, optional L1 dropdown, transcriber toggle, and per-position heat row. Concretely:
/**
* PHON-129 Model #2 dev page. Mirrors AudioTranscribeViewer; adds a target word,
* optional L1 tag, transcriber toggle, and per-position cos_dist heat. NOT the
* eventual user-facing tool (that's a later unification spec).
*/
import { useState } from 'react';
import {
Box, TextField, Button, MenuItem, ToggleButton, ToggleButtonGroup,
Typography, Chip, Stack, Tooltip, Alert,
} from '@mui/material';
import {
pronounceAudio, TranscriberWarmingError,
type PronunciationResult, type PositionScore,
} from '../../services/audioApi';
const L1S = ['Arabic', 'Chinese', 'Hindi', 'Korean', 'Spanish', 'Vietnamese', 'unknown'];
function heatColor(cosDist: number, op: PositionScore['op']): string {
if (op === 'del') return '#b71c1c';
const t = Math.min(1, cosDist / 0.5); // 0 → green, ≥0.5 → red
const r = Math.round(76 + t * (183 - 76));
const g = Math.round(175 - t * (175 - 28));
return `rgb(${r}, ${g}, 60)`;
}
export default function PronunciationViewer() {
const [blob, setBlob] = useState<Blob | null>(null);
const [targetWord, setTargetWord] = useState('');
const [l1, setL1] = useState('unknown');
const [transcriber, setTranscriber] = useState<'off-the-shelf' | 'ft'>('off-the-shelf');
const [result, setResult] = useState<PronunciationResult | null>(null);
const [warming, setWarming] = useState(false);
const [error, setError] = useState<string | null>(null);
async function run() {
if (!blob || !targetWord.trim()) return;
setWarming(false); setError(null); setResult(null);
try {
const r = await pronounceAudio(blob, targetWord.trim(), {
transcriber, l1: l1 === 'unknown' ? undefined : l1,
});
setResult(r);
} catch (e) {
if (e instanceof TranscriberWarmingError) setWarming(true);
else setError(e instanceof Error ? e.message : 'Failed');
}
}
return (
<Box sx={{ p: 3, maxWidth: 760, mx: 'auto' }}>
<Typography variant="h5" gutterBottom>Pronunciation Scorer (dev)</Typography>
{/* Recorder / upload / preloaded-clip picker: reuse the same controls as
AudioTranscribeViewer. On clip ready, call setBlob(clipBlob). */}
{/* <AudioCapture onBlob={setBlob} /> — mirror the transcribe viewer's capture UI */}
<Stack direction="row" spacing={2} sx={{ my: 2 }} alignItems="center">
<TextField label="Target word" value={targetWord}
onChange={(e) => setTargetWord(e.target.value)} size="small" />
<TextField select label="L1" value={l1} size="small"
onChange={(e) => setL1(e.target.value)} sx={{ minWidth: 130 }}>
{L1S.map((x) => <MenuItem key={x} value={x}>{x}</MenuItem>)}
</TextField>
<ToggleButtonGroup exclusive size="small" value={transcriber}
onChange={(_, v) => v && setTranscriber(v)}>
<ToggleButton value="off-the-shelf">off-the-shelf</ToggleButton>
<ToggleButton value="ft">ft</ToggleButton>
</ToggleButtonGroup>
<Button variant="contained" onClick={run} disabled={!blob || !targetWord.trim()}>
Score
</Button>
</Stack>
{warming && <Alert severity="info">Inference host is warming up. Retry shortly.</Alert>}
{error && <Alert severity="error">{error}</Alert>}
{result && (
<Box sx={{ mt: 2 }}>
<Stack direction="row" spacing={1} alignItems="center" sx={{ mb: 1 }}>
<Typography variant="h6">Score: {result.overall_score.toFixed(2)}</Typography>
<Chip label={result.variant_vs_error_class}
color={result.variant_vs_error_class === 'error' ? 'error' : 'success'} size="small" />
<Tooltip title="Threshold is L1-agnostic in v6.1">
<Chip label={result.threshold_basis} variant="outlined" size="small" />
</Tooltip>
</Stack>
<Stack direction="row" spacing={0.5}>
{result.per_position.map((p, i) => (
<Tooltip key={i} title={`${p.op} cos_dist ${p.cos_dist.toFixed(2)}`}>
<Box sx={{ textAlign: 'center', minWidth: 36 }}>
<Box sx={{ bgcolor: heatColor(p.cos_dist, p.op), color: '#fff',
borderRadius: 1, px: 1, py: 0.5 }}>{p.canonical}</Box>
<Typography variant="caption">{p.produced ?? '∅'}</Typography>
</Box>
</Tooltip>
))}
</Stack>
{result.limitations.length > 0 && (
<Alert severity="warning" sx={{ mt: 2 }}>{result.limitations.join(' ')}</Alert>
)}
</Box>
)}
</Box>
);
}
Then wire the actual recorder/upload/preloaded controls by copying AudioTranscribeViewer's capture section and calling setBlob when a clip is ready (replace the commented placeholder). Keep the test's clip-injection mechanism aligned with how the transcribe viewer's test supplies a blob.
- [ ] Step 5: Register the dev route
In packages/web/frontend/src/main.tsx, add the import and route alongside the existing /dev/audio:
import PronunciationViewer from './components/tools/PronunciationViewer.tsx';
// ...inside <Routes>:
<Route path="/dev/pronounce" element={<PronunciationViewer />} />
- [ ] Step 6: Run test to verify it passes
Run: cd packages/web/frontend && npx vitest run src/components/tools/PronunciationViewer.test.tsx
Expected: PASS.
- [ ] Step 7: Type-check + build the frontend
Run: cd packages/web/frontend && npx tsc --noEmit && npm run build
Expected: no type errors; build succeeds.
- [ ] Step 8: Commit
git add packages/web/frontend/src/components/tools/PronunciationViewer.tsx packages/web/frontend/src/components/tools/PronunciationViewer.test.tsx packages/web/frontend/src/main.tsx
git commit -m "feat(phon-129): PronunciationViewer dev page + /dev/pronounce route"
Task 8: L2-ARCTIC validation harness
Files:
- Create: research/2026-06-05-phon-129-l2-accent-scorer/01_run_l2arctic.py
- Create: research/2026-06-05-phon-129-l2-accent-scorer/02_metrics.py
- Create: research/2026-06-05-phon-129-l2-accent-scorer/RESULTS.md
This task is a research harness, not unit-tested code. It runs against the external L2-ARCTIC drive and the local phonolex_audio inference server (start it with uv run python -m phonolex_audio --port 8000). Both the L2-ARCTIC audio and the model are local, so neither RunPod nor uploads are involved. Follow the long-running-jobs policy: checkpoint periodically, flush on SIGINT, and resume from the checkpoint on restart.
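The checkpoint / flush / resume policy can be sketched in isolation. This is a minimal, standard-library-only illustration, not the harness itself: `save_ckpt_atomic`, `run`, and the `stop_after` parameter are hypothetical names introduced here. It assumes that writing to a sibling temp file and renaming it over the target is an acceptable way to keep an interrupt from leaving a truncated checkpoint.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Optional


def save_ckpt_atomic(path: Path, state: dict) -> None:
    """Write the checkpoint to a temp file, then rename it over the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic on POSIX when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_ckpt(path: Path) -> dict:
    """Return the saved state, or a fresh one when no checkpoint exists yet."""
    if path.exists():
        return pickle.loads(path.read_bytes())
    return {"done": [], "rows": []}


def run(utts, ckpt: Path, every: int = 200, stop_after: Optional[int] = None) -> dict:
    """Process utterance ids, skipping those already recorded in the checkpoint."""
    state = load_ckpt(ckpt)
    done = set(state["done"])
    processed = 0
    for uid in utts:
        if uid in done:
            continue  # resume: this utterance was finished in an earlier run
        state["rows"].append({"utt": uid})  # stand-in for the real per-token rows
        done.add(uid)
        state["done"] = sorted(done)
        processed += 1
        if processed % every == 0:
            save_ckpt_atomic(ckpt, state)
        if stop_after is not None and processed >= stop_after:
            break  # simulate an interruption for the example
    save_ckpt_atomic(ckpt, state)  # final flush
    return state
```

Calling `run` twice with the same checkpoint path processes each utterance once: the second call picks up where the first stopped.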
- [ ] Step 1: Write 01_run_l2arctic.py
Create research/2026-06-05-phon-129-l2-accent-scorer/01_run_l2arctic.py. It iterates L2-ARCTIC annotated utterances, transcribes each via the off-the-shelf host, parses the canonical,perceived,errortype phone tier, and emits per-token rows with BOTH the transcriber-derived and oracle (human-perceived) cos_dist. Checkpoint every 200 utts.
"""PHON-129 validation step 1: transcribe + score L2-ARCTIC, emit per-token rows.
Two cos_dist per token: transcriber-derived (real chain) and oracle (human-perceived).
Checkpointed per the long-running-jobs policy. Run:
  uv run python 01_run_l2arctic.py --inference-url <host> --out rows.parquet"""
import argparse, pickle, signal, sys
from pathlib import Path
import numpy as np, polars as pl

L2ARCTIC = Path("/Volumes/ExternalData2/audio-datasets/l2arctic")
VECTORS = Path(__file__).resolve().parents[2] / "packages/features/outputs/vectors.csv"
SPK_L1 = {  # README speaker→L1 map
    **{s: "Arabic" for s in ["ABA", "SKA", "YBAA", "ZHAA"]},
    **{s: "Chinese" for s in ["BWC", "LXC", "NCC", "TXHC"]},
    **{s: "Hindi" for s in ["ASI", "RRBI", "SVBI", "TNI"]},
    **{s: "Korean" for s in ["HJK", "HKK", "YDCK", "YKWK"]},
    **{s: "Spanish" for s in ["EBVS", "ERMS", "MBMPS", "NJS"]},
    **{s: "Vietnamese" for s in ["HQTV", "PNV", "THV", "TLV"]},
}

# Load vectors → cos_dist helper (same metric as the TS scorer / PHON-126).
df = pl.read_csv(VECTORS)
feat = [c for c in df.columns if c != "ipa"]
VEC = {r["ipa"]: np.array([r[c] for c in feat], float) for r in df.iter_rows(named=True)}

def cos_dist(a, b):
    """clip(1 - cosine, 0, 1); None when either phone has no vector."""
    if a not in VEC or b not in VEC: return None
    c = float(VEC[a] @ VEC[b] / (np.linalg.norm(VEC[a]) * np.linalg.norm(VEC[b])))
    return min(1.0, max(0.0, 1.0 - c))

# parse_annotation(textgrid_path) → list[(canonical, perceived, errortype)] from the
# IPA phone tier (the tier whose error labels are IPA, e.g. "ð,d,s"); see DIAGNOSTIC.
# transcribe(url, wav_path) → list[str] produced phonemes via POST /transcribe.
# (Implement both against the L2-ARCTIC TextGrid format + the host contract.)
def parse_annotation(textgrid_path):
    raise NotImplementedError("implement against the L2-ARCTIC TextGrid format")

def transcribe(url, wav_path):
    raise NotImplementedError("implement against the host's /transcribe contract")

def load_ckpt(p): return pickle.loads(p.read_bytes()) if p.exists() else {"done": [], "rows": []}
def save_ckpt(p, st): p.write_bytes(pickle.dumps(st))

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--inference-url", required=True)
    ap.add_argument("--out", default="rows.parquet")
    ap.add_argument("--checkpoint", default="_ckpt.pkl")
    ap.add_argument("--checkpoint-every", type=int, default=200)
    a = ap.parse_args()
    ckpt = Path(a.checkpoint); st = load_ckpt(ckpt); done = set(st["done"])

    def flush(*_):
        save_ckpt(ckpt, st); print(f"[ckpt] saved at {len(st['done'])} utts")
    # Ctrl-C: persist progress, then exit cleanly so the next run resumes.
    signal.signal(signal.SIGINT, lambda *_: (flush(), sys.exit(0)))

    grids = sorted(L2ARCTIC.glob("*/annotation/*.TextGrid"))
    for k, g in enumerate(grids):
        uid = f"{g.parts[-3]}/{g.stem}"
        if uid in done: continue  # already processed in an earlier run
        spk = g.parts[-3]; wav = L2ARCTIC / spk / "wav" / f"{g.stem}.wav"
        gold = parse_annotation(g)  # [(canon, perceived, errtype)]
        produced = transcribe(a.inference_url, wav)
        for canon, perceived, et in gold:
            st["rows"].append({
                "utt": uid, "speaker": spk, "l1": SPK_L1.get(spk, "?"),
                "canonical": canon, "perceived": perceived, "errortype": et,
                "cos_dist_oracle": cos_dist(canon, perceived),
                # transcriber-derived: align produced→canonical, take this position's cost.
                # (Reuse the same WPER alignment; store the matched produced phone + cost.)
                # Store the cost under "cos_dist_transcriber": 02_metrics.py reads that column.
            })
        done.add(uid); st["done"] = sorted(done)
        if (k + 1) % a.checkpoint_every == 0: flush()
    flush()
    pl.DataFrame(st["rows"]).write_parquet(a.out)
    print(f"wrote {a.out}: {len(st['rows'])} token rows")

if __name__ == "__main__":
    main()
Note: parse_annotation, transcribe, and the transcriber-derived alignment are marked inline — implement them against the L2-ARCTIC TextGrid format (verified: IPA error labels like ð,d,s in the phone tier) and the inference host's /transcribe contract. Reuse the WPER alignment logic from gen_score_fixtures.py for the transcriber-derived per-position cost.
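The transcriber-derived per-position cost can be sketched as a weighted edit-distance alignment with traceback. This is an illustration under stated assumptions, not the logic in gen_score_fixtures.py: `TOY_VEC`, `INDEL`, and `align` are hypothetical names, the vectors are toy 2-D stand-ins for the learned feature vectors, insertions and deletions are assumed to cost 1.0, and unknown phones are given the maximum distance here (the harness's own cos_dist returns None instead).

```python
import math

# Toy 2-D vectors standing in for the learned feature vectors (assumption).
TOY_VEC = {
    "v": (1.0, 0.2),
    "b": (0.8, 0.6),
    "ɛ": (0.1, 1.0),
    "ɹ": (0.5, 0.5),
    "i": (0.0, 1.0),
}
INDEL = 1.0  # assumed insertion/deletion cost; the real WPER weights may differ


def cos_dist(a, b, vec=TOY_VEC):
    """clip(1 - cosine, 0, 1); identical phones are exactly 0."""
    if a == b:
        return 0.0
    if a not in vec or b not in vec:
        return 1.0  # assumption: unknown phones get the maximum distance
    va, vb = vec[a], vec[b]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.hypot(*va) * math.hypot(*vb)
    return min(1.0, max(0.0, 1.0 - dot / norm))


def align(canonical, produced):
    """Return (per_position, insertions), with per_position keyed to canonical."""
    n, m = len(canonical), len(produced)
    # cost[i][j] = cheapest alignment of canonical[:i] with produced[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * INDEL
    for j in range(1, m + 1):
        cost[0][j] = j * INDEL
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + cos_dist(canonical[i - 1], produced[j - 1]),
                cost[i - 1][j] + INDEL,   # canonical phone deleted
                cost[i][j - 1] + INDEL,   # extra produced phone inserted
            )
    per_position, insertions = [], []
    i, j = n, m
    while i > 0 or j > 0:  # walk back from the bottom-right corner
        if i > 0 and j > 0:
            d = cos_dist(canonical[i - 1], produced[j - 1])
            if math.isclose(cost[i][j], cost[i - 1][j - 1] + d, abs_tol=1e-12):
                op = "match" if canonical[i - 1] == produced[j - 1] else "sub"
                per_position.append({"canonical": canonical[i - 1],
                                     "produced": produced[j - 1],
                                     "cos_dist": d, "op": op})
                i, j = i - 1, j - 1
                continue
        if i > 0 and math.isclose(cost[i][j], cost[i - 1][j] + INDEL, abs_tol=1e-12):
            per_position.append({"canonical": canonical[i - 1], "produced": None,
                                 "cos_dist": 1.0, "op": "del"})
            i -= 1
        else:
            insertions.append(produced[j - 1])  # reported out-of-band
            j -= 1
    per_position.reverse()
    insertions.reverse()
    return per_position, insertions


positions, extra = align(["v", "ɛ", "ɹ", "i"], ["b", "ɛ", "ɹ", "i"])
for p in positions:
    print(p["canonical"], p["produced"], p["op"], round(p["cos_dist"], 3))
```

Keeping the result keyed to canonical positions means each gold (canonical, perceived, errortype) row can pick up the produced phone and cost at its own index, with insertions carried separately.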
- [ ] Step 2: Write 02_metrics.py
Create research/2026-06-05-phon-129-l2-accent-scorer/02_metrics.py. It computes PHON-126's three diagnostics on the rows, pooled and per-L1.
"""PHON-129 validation step 2: PHON-126's three diagnostics on real audio,
pooled AND per-L1. Run: uv run python 02_metrics.py --rows rows.parquet"""
import argparse
import numpy as np, polars as pl
from scipy.stats import mannwhitneyu, spearmanr

def diagnostics(variant, error):
    """variant/error = arrays of cos_dist. Returns the separation (Mann-Whitney)
    and threshold (percentile) diagnostics; severity ρ is computed in report()."""
    out = {}
    if len(variant) and len(error):
        u, p = mannwhitneyu(variant, error, alternative="less")
        out["mannwhitney_U"], out["mannwhitney_p"] = float(u), float(p)
        out["variant_75"] = float(np.percentile(variant, 75))
        out["error_25"] = float(np.percentile(error, 25))
        out["threshold_clean"] = out["variant_75"] < out["error_25"]
    return out

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--rows", default="rows.parquet")
    ap.add_argument("--col", default="cos_dist_oracle",
                    help="cos_dist_oracle | cos_dist_transcriber")
    a = ap.parse_args()
    df = pl.read_parquet(a.rows)
    # variant = L2-ARCTIC substitutions (accent); severity = the cos_dist itself for ρ.
    # error pole (PhonBank Clinical) appended later as errortype/source tags if present.
    subs = df.filter(pl.col("errortype") == "s").drop_nulls(a.col)

    def report(frame, label):
        # Keep only rows where both the chosen column and the oracle are present;
        # a null in either would turn the correlation into NaN.
        both = frame.drop_nulls(a.col).drop_nulls("cos_dist_oracle")
        if len(both) == 0:
            print(f"[{label}] n=0")
            return
        v = both[a.col].to_numpy()
        # Without a disordered error pole, report distribution + severity self-consistency.
        # With --col cos_dist_oracle this ρ is trivially 1.0.
        rho = spearmanr(v, both["cos_dist_oracle"].to_numpy()).correlation
        print(f"[{label}] n={len(both)} mean_cosdist={v.mean():.3f} "
              f"transcriber_vs_oracle_rho={rho:.3f}")

    report(subs, "POOLED")
    for l1 in sorted(subs["l1"].unique().to_list()):
        report(subs.filter(pl.col("l1") == l1), l1)
    # When the PhonBank-Clinical error pole is appended (source column), run the full
    # variant<error separation: diagnostics(variant_cosdist, error_cosdist), pooled + per-L1.

if __name__ == "__main__":
    main()
Note: the full variant<error separation needs the PhonBank-Clinical error pole appended to rows.parquet (a source column distinguishing l2arctic vs phonbank_clinical). Step 1 can be extended to ingest PhonBank Clinical the same way; if deferred, 02_metrics.py reports the L2-ARCTIC distributions + the transcriber-vs-oracle agreement (ρ) per-L1, which is the load-bearing chain-fidelity result.
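Once a source column exists, the pooled and per-L1 separation could look roughly like the sketch below. It is standard-library-only so it runs without the harness dependencies. The rows are made-up toy values, `separation` and `by_l1` are hypothetical names, and the pair-counting probability is a simple stand-in for the Mann-Whitney test (it gives no p-value). It also assumes each L1's variants are compared against one shared error pole.

```python
from collections import defaultdict
from statistics import quantiles

# Toy rows: values are invented purely to exercise the code.
ROWS = [
    {"source": "l2arctic", "l1": "Spanish", "cos_dist": 0.02},
    {"source": "l2arctic", "l1": "Spanish", "cos_dist": 0.05},
    {"source": "l2arctic", "l1": "Spanish", "cos_dist": 0.08},
    {"source": "l2arctic", "l1": "Korean", "cos_dist": 0.04},
    {"source": "l2arctic", "l1": "Korean", "cos_dist": 0.10},
    {"source": "l2arctic", "l1": "Korean", "cos_dist": 0.20},
    {"source": "phonbank_clinical", "l1": None, "cos_dist": 0.15},
    {"source": "phonbank_clinical", "l1": None, "cos_dist": 0.30},
    {"source": "phonbank_clinical", "l1": None, "cos_dist": 0.45},
    {"source": "phonbank_clinical", "l1": None, "cos_dist": 0.60},
]


def separation(variant, error):
    """Variant-below-error probability plus the percentile threshold check."""
    out = {}
    if len(variant) < 2 or len(error) < 2:
        return out  # quantiles() needs at least two points per pole
    # Fraction of (variant, error) pairs with variant < error; ties count half.
    wins = sum((v < e) + 0.5 * (v == e) for v in variant for e in error)
    out["p_variant_below_error"] = wins / (len(variant) * len(error))
    # method="inclusive" interpolates linearly between order statistics.
    out["variant_75"] = quantiles(variant, n=4, method="inclusive")[2]
    out["error_25"] = quantiles(error, n=4, method="inclusive")[0]
    out["threshold_clean"] = out["variant_75"] < out["error_25"]
    return out


def by_l1(rows):
    """Pooled and per-L1 separation against the shared error pole."""
    error = [r["cos_dist"] for r in rows if r["source"] == "phonbank_clinical"]
    groups = defaultdict(list)
    for r in rows:
        if r["source"] == "l2arctic":
            groups["POOLED"].append(r["cos_dist"])
            groups[r["l1"]].append(r["cos_dist"])
    return {label: separation(v, error) for label, v in sorted(groups.items())}


for label, metrics in by_l1(ROWS).items():
    print(label, metrics)
```

Reporting the same dictionary per L1 makes any drift in `variant_75` across language backgrounds directly comparable against the single `error_25` cut.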
- [ ] Step 3: Run the harness (manual, against the local server)
First start the local inference server in another shell:
uv run python -m phonolex_audio --port 8000 # off-the-shelf wav2vec2-espeak
Then, from the repo root in the original shell, run the harness:
cd research/2026-06-05-phon-129-l2-accent-scorer
uv run python 01_run_l2arctic.py --inference-url http://127.0.0.1:8000 --out rows.parquet
uv run python 02_metrics.py --rows rows.parquet --col cos_dist_transcriber
uv run python 02_metrics.py --rows rows.parquet --col cos_dist_oracle
- [ ] Step 4: Write RESULTS.md
Create research/2026-06-05-phon-129-l2-accent-scorer/RESULTS.md capturing: pooled + per-L1 metrics for transcriber and oracle, the transcriber−oracle gap, a GO/NO-GO read on whether the PHON-126 separation replicates on real audio, and an explicit per-L1 L1-sensitivity note (does the threshold/separation drift by L1? — the empirical answer to the design's open question).
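One way to produce the transcriber−oracle gap figures for RESULTS.md is sketched below. It is a standard-library illustration only: `gap_table`, `to_markdown`, and the table layout are hypothetical, and it assumes each row carries both `cos_dist_transcriber` and `cos_dist_oracle` (either may be None when a phone has no vector).

```python
from collections import defaultdict
from statistics import mean


def gap_table(rows):
    """Mean signed and absolute transcriber-minus-oracle gap, pooled and per L1."""
    groups = defaultdict(list)
    for r in rows:
        t, o = r["cos_dist_transcriber"], r["cos_dist_oracle"]
        if t is None or o is None:
            continue  # skip tokens that could not be scored on both sides
        groups["POOLED"].append(t - o)
        groups[r["l1"]].append(t - o)
    return {
        label: {
            "n": len(gaps),
            "mean_gap": mean(gaps),
            "mean_abs_gap": mean(abs(g) for g in gaps),
        }
        for label, gaps in sorted(groups.items())
    }


def to_markdown(table):
    """Render the gap table as a small markdown table."""
    lines = ["| group | n | mean gap | mean abs gap |", "| --- | --- | --- | --- |"]
    for label, m in table.items():
        lines.append(
            f"| {label} | {m['n']} | {m['mean_gap']:+.3f} | {m['mean_abs_gap']:.3f} |"
        )
    return "\n".join(lines)


# Toy rows with invented values, only to show the shape of the output.
toy = [
    {"l1": "Spanish", "cos_dist_transcriber": 0.10, "cos_dist_oracle": 0.08},
    {"l1": "Spanish", "cos_dist_transcriber": 0.30, "cos_dist_oracle": 0.40},
    {"l1": "Hindi", "cos_dist_transcriber": 0.20, "cos_dist_oracle": 0.20},
    {"l1": "Hindi", "cos_dist_transcriber": None, "cos_dist_oracle": 0.10},
]
print(to_markdown(gap_table(toy)))
```

The signed mean shows whether the transcriber systematically over- or under-estimates distance relative to the human-perceived labels, while the absolute mean shows how far off it is per token regardless of direction.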
- [ ] Step 5: Commit
git add research/2026-06-05-phon-129-l2-accent-scorer/01_run_l2arctic.py research/2026-06-05-phon-129-l2-accent-scorer/02_metrics.py research/2026-06-05-phon-129-l2-accent-scorer/RESULTS.md
git commit -m "research(phon-129): L2-ARCTIC validation harness — pooled + per-L1 metrics"
Task 9: Full test matrix + final verification
Files: none (verification only)
- [ ] Step 1: Run the workers test suite
Run: cd packages/web/workers && npm test
Expected: all pass (pronunciationScore, pronounce-fixture, audio route, existing suites).
- [ ] Step 2: Run the frontend test suite + type-check + build
Run: cd packages/web/frontend && npx vitest run && npx tsc --noEmit && npm run build
Expected: tests pass, no type errors, build succeeds.
- [ ] Step 3: Run the workers type-check
Run: cd packages/web/workers && npx tsc --noEmit
Expected: no errors.
- [ ] Step 4: Confirm CI-equivalent checks pass (untracked-file trap)
Run from repo root: git status — confirm no stray untracked files that would break CI; then re-run the two suites above to be sure nothing relied on uncommitted local state.
- [ ] Step 5: Final commit if any verification fixups were needed
git add -A && git commit -m "chore(phon-129): verification fixups" || echo "nothing to commit"
Self-Review (completed during authoring)
Spec coverage:
- §2 architecture (Approach A, transcribe sub-call, in-Worker score) → Tasks 1,2,5 ✓
- §2.2 transcriber default off-the-shelf, ft selectable → Task 5 fetchTranscript ✓
- §3 contract (per_position keyed to canonical, insertions out-of-band, op set, overall_score, threshold_basis, l1 echo, error codes) → Tasks 2,3,5 ✓
- §4 scoring module (cosDist, alignWPER, classify, T=0.112) → Tasks 2,3 ✓
- §5 dev page (record/upload/preloaded, target word, L1 dropdown, transcriber toggle, per-position heat, warming) → Task 7 ✓
- §6.1 drift fixture → Task 4 ✓
- §6.2 L2-ARCTIC validation pooled + per-L1, transcriber-vs-oracle gap → Task 8 ✓
- §6.3 PhonBank error pole → Task 8 (noted as extension) ✓
- §7 L1 seam (optional l1, threshold_basis tag, stratified validation) → Tasks 5,8 ✓
Placeholder scan: the inline "implement against X" markers are in Task 8 (research harness: parse_annotation/transcribe/alignment), which is inherently host- and drive-dependent and cannot be fully literal, and in Task 7's capture UI, which is a commented placeholder to be replaced by copying AudioTranscribeViewer's capture section. Every other shipped-code step in Tasks 1–7 and 9 has complete code. No TBD/TODO in production code steps.
Type consistency: PhonemeCache ({normSq, dots} Maps) consistent across Tasks 2/4; cosDist/alignWPER/scorePronunciation signatures stable; PronunciationResult/PositionScore shared between service (Task 6) and component (Task 7); route response (Task 5) matches the PronunciationResult fields.