PHON-129 - Model #2: L2 / Accent Pronunciation Scorer (v6.1)
Status: design — pending plan
Ticket: PHON-129
Parent: PHON-44 Audio
Umbrella spec: docs/superpowers/specs/2026-05-30-v6-audio-support-design.md §2.2, §3, §6
Research grounding: research/2026-06-05-phon-129-l2-accent-scorer/FLEGE_SLM.md
Date: 2026-06-05
Depends on: Model #1 (PHON-128, Done) — transcription sub-call.
1. Goal
Given a target word and an audio production, score how far the production is from canonical, per phoneme position, plus an overall score and a per-word variant-vs-error label. Serves L2 language learners, accent-modification clinicians, ESL teachers, and voice coaches — users who are explicitly trying to match a canonical target.
The metric is PHON-126's validated cos_dist over the learned articulatory feature vectors. Model #1
produces the transcript; this ticket adds the scoring layer + a dev-page surface.
1.1 Scope guard: this is a dev page, not the user-facing tool
The frontend deliverable is a dev/validation page mirroring AudioTranscribeViewer.tsx, in the
same spirit as the PHON-128 viewer. The eventual polished, user-facing Pronunciation tool is a
separate later spec that unifies the v6 audio dev surfaces into one tool. This ticket does not
build production tool polish (no onboarding, no marketing copy, no cross-tool threading beyond the
existing pattern). YAGNI on everything the unification spec will own.
1.2 Non-goals
- Not for disordered speech. Model #2 assumes the speaker intends canonical; canonical bias is
the feature. Disordered/covert-contrast cases are Models #3 and #5. The honest limit is surfaced
in the response limitations[] and on the dev page.
- Not L1-conditioned scoring yet. v6.1 ships population-agnostic scoring. It threads the L1 seam and reports L1-stratified validation, but does not yet condition the metric on L1 (see §7).
- Not a new inference host. Reuses the existing local phonolex_audio server (FastAPI; AUDIO_INFERENCE_URL, already wired in wrangler.toml/.dev.vars); all scoring runs in-Worker. RunPod (the umbrella spec's eventual public scale-to-zero deploy) is out of scope for PHON-129: the Worker is host-agnostic via AUDIO_INFERENCE_URL, so the production host is a later deploy decision, not a code concern here.
2. Architecture
2.1 Request flow (Approach A: scoring in the Worker)
POST /api/audio/pronounce { target_word, audio, transcriber?, l1? }
│
├─ 1. Validate multipart (reuse audio.ts proxy validation: 10 MB cap, audio/* type)
├─ 2. Sub-call the local phonolex_audio inference server (AUDIO_INFERENCE_URL) → produced phonemes[]
│ transcriber = "off-the-shelf" (default) | "ft"
├─ 3. Look up canonical phonemes for target_word from D1 `words.phonemes`
│ (word not found → 404 { detail })
├─ 4. Score in-Worker (pronunciationScore.ts):
│ WPER align(produced, canonical), sub-cost = cos_dist = clip(1 − cosine, 0, 1)
│ reuse the PhonemeCache (norms + dots) loaded from D1 phonemes/phoneme_dots
└─ 5. Return scoring contract (§3)
Why Approach A (scoring in TS, not co-located with the model on the inference host): zero new inference
infra; reuses the PhonemeCache + phonemeCosine already in similarity.ts; pure-TS scoring is
unit-testable under cloudflare:test; matches the umbrella spec's "in-Worker cos_dist" recipe. The
only cost — re-implementing PHON-126's ~30-line metric in TS — is neutralized by a frozen
cross-language fixture (§6.1).
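The drift guard named above amounts to a tolerance comparison between frozen reference outputs and the TS port. The sketch below is illustrative only: the Fixture shape, the fixtureDrift name, and the toy scorer are assumptions, not the layout of score_fixtures.json or the real vitest.

```typescript
// Hypothetical fixture row: one (produced, canonical) pair plus the frozen
// reference WPER. The real score_fixtures.json layout may differ.
interface Fixture {
  produced: string[];
  canonical: string[];
  wper: number;
}

type Scorer = (produced: string[], canonical: string[]) => number;

// Returns one message per fixture whose recomputed WPER drifts past `tol`.
function fixtureDrift(fixtures: Fixture[], score: Scorer, tol = 1e-6): string[] {
  const failures: string[] = [];
  for (const f of fixtures) {
    const got = score(f.produced, f.canonical);
    // Written as !(... <= tol) so a NaN result is reported as drift too.
    if (!(Math.abs(got - f.wper) <= tol)) {
      failures.push(f.canonical.join("") + ": expected " + f.wper + ", got " + got);
    }
  }
  return failures;
}

// Toy stand-in scorer (NOT the PHON-126 metric): fraction of canonical
// positions whose produced phone differs, for equal-length inputs only.
const toyScore: Scorer = (produced, canonical) => {
  let mismatches = 0;
  for (let i = 0; i < canonical.length; i++) {
    if (produced[i] !== canonical[i]) mismatches++;
  }
  return mismatches / canonical.length;
};

const demoFixtures: Fixture[] = [
  { produced: ["b", "ɛ", "ɹ", "i"], canonical: ["v", "ɛ", "ɹ", "i"], wper: 0.25 },
];
console.log(fixtureDrift(demoFixtures, toyScore));
```

In a real test the scorer argument would wrap the TS port, and a non-empty result would fail the run.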
2.2 Transcriber selection: off-the-shelf by default
transcriber defaults to "off-the-shelf" (wav2vec2-lv-60-espeak-cv-ft, no FT). Rationale:
- The scorer measures distance from what was actually produced to canonical. The transcriber must faithfully report the production. The PHON-139 FT model's faithfulness gain (collapse 17.4% vs off-the-shelf 33%) was measured on disordered child speech; L2 is not disordered, so the off-the-shelf model is the appropriate default for this population and matches the umbrella recipe.
- transcriber: "ft" is supported so the dev page can A/B both on L2 audio ("we're not limited"). It routes to the local server's /compare path (the PHON-139 lineage; phonolex_audio launched with --ft-checkpoint exposes off-the-shelf vs FT side by side).
3. API contract
POST /api/audio/pronounce — multipart/form-data:
| field | type | required | notes |
|---|---|---|---|
| audio | file (audio/*) | yes | ≤ 10 MB; client records via MediaRecorder |
| target_word | string | yes | looked up in D1 words for canonical phonemes |
| transcriber | "off-the-shelf" or "ft" | no | default off-the-shelf |
| l1 | string | no | forward-compat seam; echoed/tagged only in v6.1 |
Response (200):
{
"target_word": "very",
"canonical_phonemes": ["v","ɛ","ɹ","i"],
"transcript": { "phonemes": ["b","ɛ","ɹ","i"], "confidences": [...],
"duration_ms": 812, "coverage": "broad-phoneme", "limitations": [...] },
"per_position": [ // one entry PER CANONICAL position
{ "canonical": "v", "produced": "b", "cos_dist": 0.31, "op": "sub" },
{ "canonical": "ɛ", "produced": "ɛ", "cos_dist": 0.0, "op": "match" },
{ "canonical": "ɹ", "produced": "ɹ", "cos_dist": 0.0, "op": "match" },
{ "canonical": "i", "produced": "i", "cos_dist": 0.0, "op": "match" }
],
"insertions": [ // extra produced phones (out-of-band)
// { "produced": "ə", "after_canonical_index": 2 }
],
"overall_score": 0.92, // 1 − WPER, in [0,1]
"variant_vs_error_class": "error", // per-word aggregate = worst position
"threshold_basis": "l1_agnostic", // honest tag: the knob SLM-r says is L1-sensitive
"l1": null, // echoed if provided
"transcriber": "off-the-shelf",
"coverage": "broad-phoneme",
"limitations": ["Scores against the canonical target; assumes the speaker intends canonical.",
"Broad-phoneme only; distortions/covert contrast not modeled (Models #3, #5).",
"variant/error threshold is L1-agnostic in v6.1."]
}
Contract decisions:
- per_position is keyed to canonical positions (the targets). Substitutions and deletions
occupy a canonical slot; insertions (extra produced phones) go in insertions[] since they do
not map to a target position.
- op ∈ {match, sub, del} per canonical position. cos_dist for del = 1.0 (max).
- overall_score = 1 − WPER is the headline number.
- l1 optional, echoed; not used in scoring in v6.1.
- threshold_basis: "l1_agnostic" makes the SLM-r-flagged limitation legible in the payload and
marks the seam for the L1-conditioned upgrade.
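The contract decisions above can be written down as types plus a small consistency check. This is a sketch under assumptions: the type names, the contractViolations helper, and the use of null for the produced phone of a deleted slot are all hypothetical; only the field names come from the example response.

```typescript
// Hypothetical type names; field names follow the example response.
type Op = "match" | "sub" | "del";

interface PerPosition {
  canonical: string;
  produced: string | null; // assumption: null marks a deleted slot
  cos_dist: number;
  op: Op;
}

interface Insertion {
  produced: string;
  after_canonical_index: number;
}

// Only the scoring-related part of the response body.
interface ScoredFields {
  canonical_phonemes: string[];
  per_position: PerPosition[];
  insertions: Insertion[];
  overall_score: number;
}

// Lists contract violations; an empty array means the payload is consistent.
function contractViolations(r: ScoredFields): string[] {
  const out: string[] = [];
  if (r.per_position.length !== r.canonical_phonemes.length) {
    out.push("per_position must have one entry per canonical position");
  }
  r.per_position.forEach((p, i) => {
    if (p.canonical !== r.canonical_phonemes[i]) out.push("slot " + i + ": canonical mismatch");
    if (p.op === "del" && p.cos_dist !== 1) out.push("slot " + i + ": del must carry cos_dist 1.0");
    if (p.cos_dist < 0 || p.cos_dist > 1) out.push("slot " + i + ": cos_dist outside [0,1]");
  });
  if (r.overall_score < 0 || r.overall_score > 1) out.push("overall_score outside [0,1]");
  return out;
}

// The "very" example from the response above, scoring fields only.
const veryExample: ScoredFields = {
  canonical_phonemes: ["v", "ɛ", "ɹ", "i"],
  per_position: [
    { canonical: "v", produced: "b", cos_dist: 0.31, op: "sub" },
    { canonical: "ɛ", produced: "ɛ", cos_dist: 0.0, op: "match" },
    { canonical: "ɹ", produced: "ɹ", cos_dist: 0.0, op: "match" },
    { canonical: "i", produced: "i", cos_dist: 0.0, op: "match" },
  ],
  insertions: [],
  overall_score: 0.92,
};
console.log(contractViolations(veryExample));
```

A check of this kind could run in the route tests to catch a per_position array that drifts out of step with canonical_phonemes.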
Errors:
- target word not in lexicon → 404 { detail }
- transcriber host warming/unreachable → 503 { warming: true, detail } (passthrough from the
audio.ts proxy pattern)
- transcript empty / unintelligible clip → 200 with overall_score: 0 and a limitations note
(do not 500 on a legitimately-unintelligible production)
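One way to keep the error rules above in a single place is a pure mapping from an internal outcome to a status and body. The sketch below assumes a hypothetical Outcome union, a hypothetical toHttp name, and made-up detail and limitation wording; it is not the audio.ts proxy code.

```typescript
// Hypothetical outcome of the steps that run before the response is built.
type Outcome =
  | { kind: "word_not_found"; word: string }
  | { kind: "transcriber_unavailable"; detail: string }
  | { kind: "scored"; produced: string[]; overallScore: number; limitations: string[] };

type HttpResult =
  | { status: 404; body: { detail: string } }
  | { status: 503; body: { warming: true; detail: string } }
  | { status: 200; body: { overall_score: number; limitations: string[] } };

function toHttp(o: Outcome): HttpResult {
  switch (o.kind) {
    case "word_not_found":
      return { status: 404, body: { detail: "target word not in lexicon: " + o.word } };
    case "transcriber_unavailable":
      return { status: 503, body: { warming: true, detail: o.detail } };
    case "scored":
      if (o.produced.length === 0) {
        // An empty transcript is a valid (unintelligible) production, not a
        // server fault: answer 200 with a zero score and an extra note.
        return {
          status: 200,
          body: {
            overall_score: 0,
            limitations: o.limitations.concat(["No phonemes were recognized in the clip."]),
          },
        };
      }
      return { status: 200, body: { overall_score: o.overallScore, limitations: o.limitations } };
  }
}

console.log(toHttp({ kind: "word_not_found", word: "zzzz" }).status);
```

Keeping the mapping pure makes each row of the error list testable without a running inference host.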
4. Scoring module: pronunciationScore.ts
Pure TS, no D1/IO inside the hot path (cache passed in). Reuses phonemeCosine from similarity.ts.
- cosDist(a, b, cache) = clip(1 − phonemeCosine(a, b, cache), 0, 1)
- alignWPER(produced, canonical, cache): flat-sequence Levenshtein DP; sub-cost = cosDist, indel = 1.0; traceback recovers the op path. Returns { perPosition[], insertions[], wper } where wper = totalCost / canonical.length (normalized by canonical length, matching PHON-126).
- classify(perPosition):
  - per substitution: cos_dist < T → variant, else error; del → error.
  - T = 0.112, the midpoint of PHON-126's variant-75th (0.102) and error-25th (0.122). Single named constant, documented as the L1-agnostic boundary.
  - per-word variant_vs_error_class = worst position class (any error ⇒ word is error).
- overallScore = clamp(1 − wper, 0, 1).
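A minimal sketch of the module described above, under stated assumptions: the cache argument is replaced by an injected Cosine function so the block stands alone, toyCosine uses made-up similarity values, an insertion before the first canonical slot is given index -1, and an all-match word is labeled "variant". None of these choices is confirmed by the spec.

```typescript
// `Cosine` stands in for phonemeCosine(a, b, cache) from similarity.ts.
type Cosine = (a: string, b: string) => number;
type Op = "match" | "sub" | "del";
interface PerPosition { canonical: string; produced: string | null; cos_dist: number; op: Op }
interface Insertion { produced: string; after_canonical_index: number }

const T = 0.112; // L1-agnostic variant/error boundary

const cosDist = (a: string, b: string, cosine: Cosine): number =>
  Math.min(1, Math.max(0, 1 - cosine(a, b)));

function alignWPER(produced: string[], canonical: string[], cosine: Cosine) {
  const n = canonical.length;
  const m = produced.length;
  if (n === 0) throw new Error("canonical must be non-empty");
  // cost[i][j] = cheapest alignment of canonical[0..i) with produced[0..j).
  const cost: number[][] = [];
  for (let i = 0; i <= n; i++) {
    const row: number[] = [];
    for (let j = 0; j <= m; j++) {
      if (i === 0) row.push(j); // j insertions
      else if (j === 0) row.push(i); // i deletions
      else {
        const sub = cost[i - 1][j - 1] + cosDist(canonical[i - 1], produced[j - 1], cosine);
        row.push(Math.min(sub, cost[i - 1][j] + 1, row[j - 1] + 1));
      }
    }
    cost.push(row);
  }
  // Traceback from the corner, preferring sub/match, then del, then ins.
  const perPosition: PerPosition[] = [];
  const insertions: Insertion[] = [];
  let ci = n;
  let pi = m;
  while (ci > 0 || pi > 0) {
    if (ci > 0 && pi > 0) {
      const c = canonical[ci - 1];
      const p = produced[pi - 1];
      const d = cosDist(c, p, cosine);
      if (cost[ci][pi] === cost[ci - 1][pi - 1] + d) {
        perPosition.unshift({ canonical: c, produced: p, cos_dist: d, op: c === p ? "match" : "sub" });
        ci--;
        pi--;
        continue;
      }
    }
    if (ci > 0 && cost[ci][pi] === cost[ci - 1][pi] + 1) {
      perPosition.unshift({ canonical: canonical[ci - 1], produced: null, cos_dist: 1, op: "del" });
      ci--;
      continue;
    }
    // Assumption: an insertion before the first canonical slot gets index -1.
    insertions.unshift({ produced: produced[pi - 1], after_canonical_index: ci - 1 });
    pi--;
  }
  return { perPosition, insertions, wper: cost[n][m] / n };
}

// Any del, or any sub at or above T, makes the word "error".
// Assumption: a word with no errors (including all-match) is "variant".
function classify(perPosition: PerPosition[]): "variant" | "error" {
  const anyError = perPosition.some((p) => p.op === "del" || (p.op === "sub" && p.cos_dist >= T));
  return anyError ? "error" : "variant";
}

const overallScore = (wper: number): number => Math.min(1, Math.max(0, 1 - wper));

// Made-up similarities: v/b chosen so cos_dist is about 0.31 as in the §3 example.
const toyCosine: Cosine = (a, b) => (a === b ? 1 : a + b === "vb" || a + b === "bv" ? 0.69 : 0);
const very = alignWPER(["b", "ɛ", "ɹ", "i"], ["v", "ɛ", "ɹ", "i"], toyCosine);
console.log(very.wper, overallScore(very.wper), classify(very.perPosition));
```

With these toy values the "very" example costs about 0.31 over four canonical slots, which is consistent with the 0.92 headline score and the "error" label in the §3 payload. Because an empty transcript aligns as all deletions, it yields wper = 1 and an overall score of 0.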
Cache loading: the route loads PhonemeCache (norms from phonemes, dots from phoneme_dots) once
per isolate on cold start — same mechanism the similarity route already uses. No new tables.
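Loading once per isolate is commonly done by holding a promise in module scope, since module state persists across requests within a Worker isolate. The sketch below assumes that pattern; PhonemeCacheLike, getPhonemeCache, and the fake loader are hypothetical names, and the similarity route's actual mechanism may differ.

```typescript
// Hypothetical cache shape; the real PhonemeCache lives in similarity.ts.
interface PhonemeCacheLike {
  norms: Record<string, number>;
  dots: Record<string, number>;
}

// Module scope survives across requests within one isolate, so holding the
// promise here means the loader runs once per isolate.
let cachePromise: Promise<PhonemeCacheLike> | null = null;

function getPhonemeCache(load: () => Promise<PhonemeCacheLike>): Promise<PhonemeCacheLike> {
  if (cachePromise === null) {
    cachePromise = load().catch((err) => {
      cachePromise = null; // let the next request retry after a failed load
      throw err;
    });
  }
  return cachePromise;
}

// Fake loader standing in for the D1 queries; the values are made up.
let loadCount = 0;
const fakeLoad = (): Promise<PhonemeCacheLike> => {
  loadCount++;
  return Promise.resolve({ norms: { v: 1 }, dots: { "v|b": 0.69 } });
};

const firstCall = getPhonemeCache(fakeLoad);
const secondCall = getPhonemeCache(fakeLoad);
console.log(firstCall === secondCall, loadCount);
```

Storing the promise rather than the resolved value means concurrent cold-start requests share one load instead of each querying D1.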
5. Frontend: PronunciationViewer.tsx (dev page)
Mirrors AudioTranscribeViewer.tsx. Mounted as a dev route alongside the transcribe viewer.
Controls:
- Record (MediaRecorder) + file upload + curated preloaded-clip picker (reuse loadAudioSamples).
- target_word text input (the one new control).
- L1 dropdown — {Arabic, Chinese, Hindi, Korean, Spanish, Vietnamese, unknown} (the L2-ARCTIC
L1s); wired to l1, tags output only.
- Transcriber toggle — off-the-shelf / ft.
Display:
- Canonical phoneme row; under each, the produced phone; per-position cos_dist heat
(low=green → high=red); del slots marked empty, insertions[] rendered between slots.
- Overall score; variant_vs_error_class badge with the l1_agnostic caveat inline.
- Warming state reused from the transcribe viewer's TranscriberWarmingError handling.
Service: add pronounceAudio(blob, targetWord, opts) to audioApi.ts — multipart, same
warming-error semantics as transcribeAudio.
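A possible shape for the client call, as a sketch only: the PronounceOpts type, the buildPronounceForm helper, the placeholder filename, and the local WarmingError class (standing in for the viewer's TranscriberWarmingError, whose signature is not shown here) are all assumptions. The form field names follow the §3 table.

```typescript
interface PronounceOpts {
  transcriber?: "off-the-shelf" | "ft";
  l1?: string;
}

// Local stand-in for the viewer's TranscriberWarmingError (signature assumed).
class WarmingError extends Error {}

function buildPronounceForm(blob: Blob, targetWord: string, opts: PronounceOpts = {}): FormData {
  const form = new FormData();
  form.append("audio", blob, "clip.webm"); // filename is a placeholder
  form.append("target_word", targetWord);
  // Optional fields are omitted so the server-side defaults apply.
  if (opts.transcriber) form.append("transcriber", opts.transcriber);
  if (opts.l1) form.append("l1", opts.l1);
  return form;
}

async function pronounceAudio(
  blob: Blob,
  targetWord: string,
  opts: PronounceOpts = {},
): Promise<unknown> {
  // No explicit Content-Type: the runtime sets the multipart boundary itself.
  const res = await fetch("/api/audio/pronounce", {
    method: "POST",
    body: buildPronounceForm(blob, targetWord, opts),
  });
  if (res.status === 503) throw new WarmingError("transcriber warming");
  if (!res.ok) throw new Error("pronounce failed: " + res.status);
  return res.json();
}
```

Splitting the form construction from the fetch keeps the multipart layout testable without a network call.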
6. Validation
Lab dir: research/2026-06-05-phon-129-l2-accent-scorer/. Checkpoint per the long-running-jobs
policy (--checkpoint-dir, SIGINT flush, resume).
6.1 Metric port correctness (drift guard)
score_fixtures.json — run PHON-126's Python cos_dist/WPER on a set of (produced, canonical) pairs,
freeze outputs; a vitest asserts pronunciationScore.ts reproduces them to 1e-6. Gives the
"verbatim metric" guarantee without co-locating scoring on the model host.
6.2 Real-audio validation: L2-ARCTIC (primary)
Data: /Volumes/ExternalData2/audio-datasets/l2arctic — 24 speakers, balanced 6 L1s × 4 (Arabic,
Chinese, Hindi, Korean, Spanish, Vietnamese), ~150 annotated utts each (3,621 annotated TextGrids).
Phone-tier gold = canonical,perceived,errortype triples (s/d/a).
- 01_run_l2arctic.py: for each annotated utt, transcribe (off-the-shelf) → cos_dist per position; parse the gold phone tier.
  - Two cos_dist computations per token: (a) from the transcriber output (the real chain), and (b) from the human-perceived phone (oracle). The (a)−(b) gap isolates transcriber error from metric error.
- 02_metrics.py: PHON-126's three diagnostics extended to real audio:
  - Mann-Whitney U (one-sided, variant < error)
  - practical threshold (variant 75th pct < error 25th pct)
  - Spearman ρ (severity rank vs cos_dist)
  - reported per-L1 (6 groups) AND pooled.
Release gate: pooled separation replicates PHON-126 on real audio (Mann-Whitney significant + practical threshold clean), and the per-L1 breakdown is reported. Per-L1 threshold drift is a finding (feeds the §7 follow-on), not a release blocker.
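To make the practical-threshold diagnostic and its per-L1 breakdown concrete, here is a small sketch. It is written in TypeScript for consistency with the scoring module, although the lab scripts are Python; the Token shape, the function names, and the linear-interpolation percentile (NumPy's default method) are assumptions about 02_metrics.py, and the toy values are made up.

```typescript
// One scored token from the validation run (hypothetical shape).
interface Token {
  l1: string;
  cls: "variant" | "error";
  cosDist: number;
}

interface ThresholdCheck { v75: number; e25: number; clean: boolean }

// Percentile with linear interpolation between order statistics.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const s = values.slice().sort((a, b) => a - b);
  const idx = (p / 100) * (s.length - 1);
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  return s[lo] + (s[hi] - s[lo]) * (idx - lo);
}

// "Clean" means the variant 75th percentile sits below the error 25th.
function practicalThreshold(tokens: Token[]): ThresholdCheck {
  const pick = (cls: Token["cls"]) => tokens.filter((t) => t.cls === cls).map((t) => t.cosDist);
  const v75 = percentile(pick("variant"), 75);
  const e25 = percentile(pick("error"), 25);
  return { v75, e25, clean: v75 < e25 }; // NaN compares false, so empty groups are not clean
}

// Pooled result plus one entry per L1.
function byL1(tokens: Token[]): Record<string, ThresholdCheck> {
  const groups: Record<string, Token[]> = {};
  for (const t of tokens) {
    if (!groups[t.l1]) groups[t.l1] = [];
    groups[t.l1].push(t);
  }
  const out: Record<string, ThresholdCheck> = { pooled: practicalThreshold(tokens) };
  for (const l1 in groups) out[l1] = practicalThreshold(groups[l1]);
  return out;
}

// Made-up toy values, chosen only to show a pooled pass with one L1 failing.
const mk = (l1: string, cls: Token["cls"], xs: number[]): Token[] =>
  xs.map((cosDist) => ({ l1, cls, cosDist }));
const toyTokens = mk("Spanish", "variant", [0, 0.125, 0.25]).concat(
  mk("Spanish", "error", [0.5, 0.75, 1]),
  mk("Korean", "variant", [0.5]),
  mk("Korean", "error", [0.25]),
);
console.log(byL1(toyTokens));
```

In this toy data the pooled check is clean while one L1 group is not, which is the situation the release gate treats as a finding rather than a blocker.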
6.3 Error pole: PhonBank Clinical (secondary)
Disordered child speech (narrow actual) as the "error" pole for the full variant<error contrast on
real audio: L2-ARCTIC substitutions = variant class, PhonBank Clinical substitutions = error class.
Flag the age/recording confound (L2-ARCTIC adult vs PhonBank child) honestly; this is a
supporting check, not the gate.
7. The L1 seam (why it exists, what it is, what it is not)
Grounded in FLEGE_SLM.md. SLM-r: L2 substitution structure is conditioned by the speaker's L1 at the
position-sensitive allophone level (equivalence classification; perceived cross-language
dissimilarity). The El Kheir et al. (2023) L1-MultiMDD system — same encoder family, same L2-ARCTIC
corpus, same eSpeak tooling — shows L1-conditioning drops false-rejection rate 5.46→4.26 (the
"variant mis-flagged as error" mode) and PER 13.70→12.52.
v6.1 design response (seam, not implementation):
1. /api/audio/pronounce accepts optional l1; echoed + threshold_basis: "l1_agnostic" tag.
2. Validation is L1-stratified (§6.2) — the empirical tell for whether L1-conditioning is needed.
3. Upgrade path (out of scope for v6.1): PHON-137/138's confusion channel
P(produced | canonical, population, position) with population = L1 becomes the SLM-r-faithful,
position-sensitive, probabilistic L1 prior. The l1 seam means it slots in without an API change.
8. Files touched
New:
- packages/web/workers/src/routes/audio.ts → add pronounce handler (or audioPronounce.ts if
audio.ts grows past one responsibility).
- packages/web/workers/src/lib/pronunciationScore.ts — scoring module.
- packages/web/workers/src/__tests__/audioPronounce.test.ts — route + alignment + fixture tests.
- packages/web/frontend/src/components/tools/PronunciationViewer.tsx (+ .test.tsx).
- research/2026-06-05-phon-129-l2-accent-scorer/{01_run_l2arctic.py,02_metrics.py,score_fixtures.json,RESULTS.md}.
Modified:
- packages/web/frontend/src/services/audioApi.ts — add pronounceAudio().
- packages/web/workers/src/index.ts — already mounts /api/audio; new sub-route only.
- Dev-route registration for PronunciationViewer (mirror the transcribe viewer's wiring).
Reused unchanged: similarity.ts phonemeCosine + PhonemeCache; audio.ts multipart validation +
warming-proxy pattern; words.phonemes lookup; phonemes/phoneme_dots D1 tables.
9. Done when
- POST /api/audio/pronounce live with the §3 contract (off-the-shelf default, ft selectable, l1 seam).
- pronunciationScore.ts passes the frozen PHON-126 fixture (1e-6) + alignment edge-case tests.
- PronunciationViewer dev page mounted, mirroring the transcribe viewer (record/upload/preloaded, target word, L1 tag, transcriber toggle, per-position heat + score + class badge).
- L2-ARCTIC validation run: PHON-126's three metrics reported pooled + per-L1; RESULTS.md written with a GO/NO-GO read and an explicit per-L1 L1-sensitivity note.
- v6.1 release notes published.
10. References
- Umbrella: docs/superpowers/specs/2026-05-30-v6-audio-support-design.md
- research/2026-06-05-phon-129-l2-accent-scorer/FLEGE_SLM.md
- PHON-126 findings: research/2026-05-28-phon-126-feature-vector-graded-error/findings.md
- PHON-128 viewer (AudioTranscribeViewer.tsx) + audio.ts proxy pattern + audioApi.ts
- similarity.ts (cos_dist substrate), [[project_audio_data_reservoir]], [[project_audio_targeted_models]]