PHON-129 - Model #2: L2 / Accent Pronunciation Scorer (v6.1)

Status: design (pending plan)
Ticket: PHON-129
Parent: PHON-44 Audio
Umbrella spec: docs/superpowers/specs/2026-05-30-v6-audio-support-design.md §2.2, §3, §6
Research grounding: research/2026-06-05-phon-129-l2-accent-scorer/FLEGE_SLM.md
Date: 2026-06-05
Depends on: Model #1 (PHON-128, Done), which provides the transcription sub-call.

1. Goal

Given a target word and an audio production, score how far the production is from canonical, per phoneme position, plus an overall score and a per-word variant-vs-error label. Serves L2 language learners, accent-modification clinicians, ESL teachers, and voice coaches — users who are explicitly trying to match a canonical target.

The metric is PHON-126's validated cos_dist over the learned articulatory feature vectors. Model #1 produces the transcript; this ticket adds the scoring layer + a dev-page surface.

1.1 Scope guard: this is a dev page, not the user-facing tool

The frontend deliverable is a dev/validation page mirroring AudioTranscribeViewer.tsx, in the same spirit as the PHON-128 viewer. The eventual polished, user-facing Pronunciation tool is a separate later spec that unifies the v6 audio dev surfaces into one tool. This ticket does not build production tool polish (no onboarding, no marketing copy, no cross-tool threading beyond the existing pattern). YAGNI on everything the unification spec will own.

1.2 Non-goals

- Not for disordered speech. Model #2 assumes the speaker intends canonical; canonical bias is the feature. Disordered/covert-contrast cases are Models #3 and #5. The honest limit is surfaced in the response limitations[] and on the dev page.
- Not L1-conditioned scoring yet. v6.1 ships population-agnostic scoring. It threads the L1 seam and reports L1-stratified validation, but does not yet condition the metric on L1 (see §7).
- Not a new inference host. Reuses the existing local phonolex_audio server (FastAPI; AUDIO_INFERENCE_URL, already wired in wrangler.toml/.dev.vars); all scoring runs in-Worker. RunPod (the umbrella spec's eventual public scale-to-zero deploy) is out of scope for PHON-129: the Worker is host-agnostic via AUDIO_INFERENCE_URL, so the production host is a later deploy decision, not a code concern here.

2. Architecture

2.1 Request flow (Approach A: scoring in the Worker)

POST /api/audio/pronounce   { target_word, audio, transcriber?, l1? }
  │
  ├─ 1. Validate multipart (reuse audio.ts proxy validation: 10 MB cap, audio/* type)
  ├─ 2. Sub-call the local phonolex_audio inference server (AUDIO_INFERENCE_URL)  →  produced phonemes[]
  │        transcriber = "off-the-shelf" (default) | "ft"
  ├─ 3. Look up canonical phonemes for target_word from D1 `words.phonemes`
  │        (word not found → 404 { detail })
  ├─ 4. Score in-Worker (pronunciationScore.ts):
  │        WPER align(produced, canonical), sub-cost = cos_dist = clip(1 − cosine, 0, 1)
  │        reuse the PhonemeCache (norms + dots) loaded from D1 phonemes/phoneme_dots
  └─ 5. Return scoring contract (§3)

Why Approach A (scoring in TS, not co-located with the model on the inference host): zero new inference infra; reuses the PhonemeCache + phonemeCosine already in similarity.ts; pure-TS scoring is unit-testable under cloudflare:test; matches the umbrella spec's "in-Worker cos_dist" recipe. The only cost — re-implementing PHON-126's ~30-line metric in TS — is neutralized by a frozen cross-language fixture (§6.1).
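Steps 2 through 5 of the flow above can be sketched as one function with its collaborators injected. This is a minimal sketch, not the route as it will be written: `handlePronounce`, `PronounceDeps`, and the shapes of the injected functions are hypothetical names for illustration. The real route would wire them to the audio.ts proxy, the D1 lookup, and pronunciationScore.ts.

```typescript
// Hypothetical dependency shapes; the real route wires these to the
// inference sub-call, the D1 `words.phonemes` lookup, and pronunciationScore.ts.
interface PronounceDeps {
  transcribe: (audio: Uint8Array, transcriber: "off-the-shelf" | "ft") => Promise<string[]>;
  lookupCanonical: (word: string) => Promise<string[] | null>;
  score: (produced: string[], canonical: string[]) => Record<string, unknown>;
}

interface HttpResult {
  status: number;
  body: Record<string, unknown>;
}

async function handlePronounce(
  deps: PronounceDeps,
  targetWord: string,
  audio: Uint8Array,
  transcriber: "off-the-shelf" | "ft" = "off-the-shelf",
): Promise<HttpResult> {
  // Step 2: sub-call the inference host for the produced phonemes.
  const produced = await deps.transcribe(audio, transcriber);

  // Step 3: canonical phonemes from the lexicon; an unknown word is a 404.
  const canonical = await deps.lookupCanonical(targetWord);
  if (canonical === null) {
    return { status: 404, body: { detail: `"${targetWord}" is not in the lexicon` } };
  }

  // Step 4: score in-Worker. Step 5: return the contract fields.
  const scored = deps.score(produced, canonical);
  return {
    status: 200,
    body: { target_word: targetWord, canonical_phonemes: canonical, transcriber, ...scored },
  };
}
```

Injecting the three collaborators keeps the ordering testable without an inference host or D1, which is the same property the text claims for pure-TS scoring.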

2.2 Transcriber selection: off-the-shelf by default

transcriber defaults to "off-the-shelf" (wav2vec2-lv-60-espeak-cv-ft, no FT). Rationale:

- The scorer measures distance from what was actually produced to canonical, so the transcriber must faithfully report the production. The PHON-139 FT model's faithfulness gain (collapse 17.4% vs off-the-shelf 33%) was measured on disordered child speech; L2 is not disordered, so the off-the-shelf model is the appropriate default for this population and matches the umbrella recipe.
- transcriber: "ft" is supported so the dev page can A/B both transcribers on L2 audio; the default is a recommendation, not a restriction. It routes to the local server's /compare path (the PHON-139 lineage; phonolex_audio launched with --ft-checkpoint exposes off-the-shelf vs FT side by side).

3. API contract

POST /api/audio/pronounce — multipart/form-data:

| field | type | required | notes |
| --- | --- | --- | --- |
| `audio` | file (audio/*) | yes | ≤ 10 MB; client records via MediaRecorder |
| `target_word` | string | yes | looked up in D1 `words` for canonical phonemes |
| `transcriber` | `"off-the-shelf"` \| `"ft"` | no | default `off-the-shelf` |
| `l1` | string | no | forward-compat seam; echoed/tagged only in v6.1 |
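The field rules in the table can be sketched as a pure parsing step. This is an illustrative sketch under stated assumptions: `parsePronounceFields`, `UploadLike`, and `RawFields` are hypothetical names, the real route reuses the audio.ts validation instead, and "10 MB" is assumed here to mean 10 × 1024 × 1024 bytes.

```typescript
const MAX_AUDIO_BYTES = 10 * 1024 * 1024; // assumption: "10 MB" read as 10 MiB

type Transcriber = "off-the-shelf" | "ft";

// Minimal stand-in for an uploaded file (a File/Blob exposes both fields).
interface UploadLike {
  size: number;
  type: string;
}

interface RawFields {
  audio?: UploadLike;
  target_word?: string;
  transcriber?: string;
  l1?: string;
}

interface PronounceFields {
  audio: UploadLike;
  targetWord: string;
  transcriber: Transcriber;
  l1: string | null;
}

type ParseResult = { ok: true; fields: PronounceFields } | { ok: false; detail: string };

function parsePronounceFields(raw: RawFields): ParseResult {
  const audio = raw.audio;
  if (!audio) return { ok: false, detail: "audio is required" };
  if (!audio.type.startsWith("audio/")) return { ok: false, detail: "audio must be audio/*" };
  if (audio.size > MAX_AUDIO_BYTES) return { ok: false, detail: "audio exceeds 10 MB" };

  const targetWord = (raw.target_word ?? "").trim();
  if (targetWord === "") return { ok: false, detail: "target_word is required" };

  // Omitted transcriber falls back to the documented default.
  let transcriber: Transcriber;
  if (raw.transcriber === undefined || raw.transcriber === "off-the-shelf") {
    transcriber = "off-the-shelf";
  } else if (raw.transcriber === "ft") {
    transcriber = "ft";
  } else {
    return { ok: false, detail: `unknown transcriber: ${raw.transcriber}` };
  }

  // l1 is only echoed in v6.1, so it is normalized to string-or-null and never interpreted.
  const l1 = raw.l1 !== undefined && raw.l1.trim() !== "" ? raw.l1.trim() : null;
  return { ok: true, fields: { audio, targetWord, transcriber, l1 } };
}
```

The sketch returns a `detail` string rather than a status code because the spec does not state which status a malformed request gets.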

Response (200):

{
  "target_word": "very",
  "canonical_phonemes": ["v","ɛ","ɹ","i"],
  "transcript": { "phonemes": ["b","ɛ","ɹ","i"], "confidences": [...],
                  "duration_ms": 812, "coverage": "broad-phoneme", "limitations": [...] },
  "per_position": [                          // one entry PER CANONICAL position
    { "canonical": "v", "produced": "b", "cos_dist": 0.31, "op": "sub" },
    { "canonical": "ɛ", "produced": "ɛ", "cos_dist": 0.0,  "op": "match" },
    { "canonical": "ɹ", "produced": "ɹ", "cos_dist": 0.0,  "op": "match" },
    { "canonical": "i", "produced": "i", "cos_dist": 0.0,  "op": "match" }
  ],
  "insertions": [                            // extra produced phones (out-of-band)
    // { "produced": "ə", "after_canonical_index": 2 }
  ],
  "overall_score": 0.92,                     // 1 − WPER, in [0,1]
  "variant_vs_error_class": "error",         // per-word aggregate = worst position
  "threshold_basis": "l1_agnostic",          // honest tag: the knob SLM-r says is L1-sensitive
  "l1": null,                                // echoed if provided
  "transcriber": "off-the-shelf",
  "coverage": "broad-phoneme",
  "limitations": ["Scores against the canonical target; assumes the speaker intends canonical.",
                  "Broad-phoneme only; distortions/covert contrast not modeled (Models #3, #5).",
                  "variant/error threshold is L1-agnostic in v6.1."]
}

Contract decisions:

- per_position is keyed to canonical positions (the targets). Substitutions and deletions occupy a canonical slot; insertions (extra produced phones) go in insertions[] since they do not map to a target position.
- op ∈ {match, sub, del} per canonical position. cos_dist for del = 1.0 (max).
- overall_score = 1 − WPER is the headline number.
- l1 is optional and echoed; it is not used in scoring in v6.1.
- threshold_basis: "l1_agnostic" makes the SLM-r-flagged limitation legible in the payload and marks the seam for the L1-conditioned upgrade.

Errors:

- target word not in lexicon → 404 { detail }
- transcriber host warming/unreachable → 503 { warming: true, detail } (passthrough from the audio.ts proxy pattern)
- transcript empty / unintelligible clip → 200 with overall_score: 0 and a limitations note (do not 500 on a legitimately-unintelligible production)
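The three outcomes above can be sketched as follows. The names `Failure`, `failureToHttp`, and `emptyTranscriptResult` are hypothetical, as is the wording of the limitations note and the use of `produced: null` for a deleted slot. The empty-transcript case needs no special scoring path: with nothing produced, every canonical slot is a deletion at cost 1.0, so WPER = n / n = 1 and overall_score = 0.

```typescript
interface HttpResult {
  status: number;
  body: Record<string, unknown>;
}

type Failure =
  | { kind: "word_not_found"; detail: string }
  | { kind: "transcriber_warming"; detail: string };

function failureToHttp(f: Failure): HttpResult {
  switch (f.kind) {
    case "word_not_found":
      return { status: 404, body: { detail: f.detail } };
    case "transcriber_warming":
      // Passed through in the same shape the audio.ts proxy pattern uses.
      return { status: 503, body: { warming: true, detail: f.detail } };
  }
}

// An empty transcript is a legitimate production, so it is a 200, not a 500.
function emptyTranscriptResult(targetWord: string, canonical: string[]): HttpResult {
  return {
    status: 200,
    body: {
      target_word: targetWord,
      canonical_phonemes: canonical,
      // Assumption: a deleted slot carries produced = null.
      per_position: canonical.map((c) => ({ canonical: c, produced: null, cos_dist: 1.0, op: "del" })),
      insertions: [],
      overall_score: 0,
      variant_vs_error_class: "error", // del positions classify as error
      // Hypothetical wording; the spec only requires that a note be present.
      limitations: ["No phonemes were recognized in the clip; it may be silent or unintelligible."],
    },
  };
}
```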

4. Scoring module: `pronunciationScore.ts`

Pure TS, no D1/IO inside the hot path (cache passed in). Reuses phonemeCosine from similarity.ts.

- cosDist(a, b, cache) = clip(1 − phonemeCosine(a, b, cache), 0, 1)
- alignWPER(produced, canonical, cache): flat-sequence Levenshtein DP; sub-cost = cosDist, indel = 1.0; traceback recovers the op path. Returns { perPosition[], insertions[], wper } where wper = totalCost / canonical.length (normalized by canonical length, matching PHON-126).
- classify(perPosition):
  - per substitution: cos_dist < T → variant, else error; del → error.
  - T = 0.112, the midpoint of PHON-126's variant-75th (0.102) and error-25th (0.122). Single named constant, documented as the L1-agnostic boundary.
  - per-word variant_vs_error_class = worst position class (any error ⇒ word is error).
- overallScore = clip(1 − wper, 0, 1).
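The module described above can be sketched as below. This is a sketch under stated assumptions, not the ported PHON-126 code. The `Cosine` callback stands in for `phonemeCosine(a, b, cache)`. Ties in the traceback prefer the diagonal. An insertion before the first slot gets `after_canonical_index` of -1. A word with no substitutions or deletions is labeled "match", which the spec does not define.

```typescript
// Stand-in for phonemeCosine(a, b, cache) from similarity.ts (cache omitted here).
type Cosine = (a: string, b: string) => number;
type Op = "match" | "sub" | "del";
type WordClass = "match" | "variant" | "error";
interface Position { canonical: string; produced: string | null; cos_dist: number; op: Op }
interface Insertion { produced: string; after_canonical_index: number }

const T = 0.112;   // L1-agnostic variant/error boundary
const INDEL = 1.0; // insertion/deletion cost

const clip = (x: number, lo: number, hi: number): number => Math.min(hi, Math.max(lo, x));
const cosDist = (a: string, b: string, cos: Cosine): number => clip(1 - cos(a, b), 0, 1);

function alignWPER(produced: string[], canonical: string[], cos: Cosine) {
  const n = canonical.length;
  const m = produced.length;
  // dp[i][j] = minimum cost of aligning canonical[0..i) with produced[0..j).
  const dp: number[][] = [];
  for (let i = 0; i <= n; i++) {
    dp.push(new Array<number>(m + 1).fill(0));
    dp[i][0] = i * INDEL;
  }
  for (let j = 0; j <= m; j++) dp[0][j] = j * INDEL;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const sub = dp[i - 1][j - 1] + cosDist(produced[j - 1], canonical[i - 1], cos);
      dp[i][j] = Math.min(sub, dp[i - 1][j] + INDEL, dp[i][j - 1] + INDEL);
    }
  }
  // Traceback from the bottom-right corner, preferring the diagonal on ties.
  const perPosition: Position[] = [];
  const insertions: Insertion[] = [];
  let i = n;
  let j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0) {
      const d = cosDist(produced[j - 1], canonical[i - 1], cos);
      if (dp[i][j] === dp[i - 1][j - 1] + d) {
        const op: Op = produced[j - 1] === canonical[i - 1] ? "match" : "sub";
        perPosition.push({ canonical: canonical[i - 1], produced: produced[j - 1], cos_dist: d, op });
        i--;
        j--;
        continue;
      }
    }
    if (i > 0 && dp[i][j] === dp[i - 1][j] + INDEL) {
      perPosition.push({ canonical: canonical[i - 1], produced: null, cos_dist: 1.0, op: "del" });
      i--;
      continue;
    }
    // Otherwise the step was an insertion (extra produced phone).
    insertions.push({ produced: produced[j - 1], after_canonical_index: i - 1 });
    j--;
  }
  perPosition.reverse();
  insertions.reverse();
  const wper = dp[n][m] / Math.max(n, 1); // normalized by canonical length
  return { perPosition, insertions, wper };
}

function classify(perPosition: Position[]): WordClass {
  let worst: WordClass = "match"; // assumption: label for a word with no sub/del
  for (const p of perPosition) {
    if (p.op === "del" || (p.op === "sub" && p.cos_dist >= T)) return "error";
    if (p.op === "sub") worst = "variant";
  }
  return worst;
}

const overallScore = (wper: number): number => clip(1 - wper, 0, 1);
```

On the §3 example, a single v→b substitution at cos_dist 0.31 over four canonical positions gives WPER = 0.31 / 4 = 0.0775 and an overall score of about 0.92, and 0.31 ≥ T puts the word in the error class. The clip in `overallScore` matters because insertions can push WPER above 1.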

Cache loading: the route loads PhonemeCache (norms from phonemes, dots from phoneme_dots) once per isolate on cold start — same mechanism the similarity route already uses. No new tables.
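One common way to get "once per isolate" behavior is to memoize the load promise at module scope. The sketch below assumes that pattern. The spec does not say how the similarity route actually does it, and `PhonemeCacheLike`, `getPhonemeCache`, and the injected `load` function are hypothetical names.

```typescript
// Hypothetical shape; the real PhonemeCache lives in similarity.ts.
interface PhonemeCacheLike {
  norms: Map<string, number>;
  dots: Map<string, number>;
}

// Module scope survives across requests within one isolate, so the first
// request pays for the D1 reads and later requests reuse the same promise.
let cachePromise: Promise<PhonemeCacheLike> | null = null;

function getPhonemeCache(load: () => Promise<PhonemeCacheLike>): Promise<PhonemeCacheLike> {
  if (cachePromise === null) {
    cachePromise = load().catch((err) => {
      // Do not pin a failed load; let the next request retry.
      cachePromise = null;
      throw err;
    });
  }
  return cachePromise;
}
```

Caching the promise rather than the resolved value means concurrent cold-start requests share one load instead of each querying D1.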

5. Frontend: `PronunciationViewer.tsx` (dev page)

Mirrors AudioTranscribeViewer.tsx. Mounted as a dev route alongside the transcribe viewer.

Controls:

- Record (MediaRecorder) + file upload + curated preloaded-clip picker (reuse loadAudioSamples).
- target_word text input (the one new control).
- L1 dropdown: {Arabic, Chinese, Hindi, Korean, Spanish, Vietnamese, unknown} (the L2-ARCTIC L1s); wired to l1, tags output only.
- Transcriber toggle: off-the-shelf / ft.

Display:

- Canonical phoneme row; under each, the produced phone; per-position cos_dist heat (low=green → high=red); del slots marked empty, insertions[] rendered between slots.
- Overall score; variant_vs_error_class badge with the l1_agnostic caveat inline.
- Warming state reused from the transcribe viewer's TranscriberWarmingError handling.
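The heat mapping and the "insertions between slots" layout can be sketched as two pure helpers that the component would render from. `heatColor`, `layoutRow`, and the `Cell` type are hypothetical names, and the HSL saturation and lightness values are arbitrary choices for illustration.

```typescript
interface Slot {
  canonical: string;
  produced: string | null; // null for a del slot, rendered empty
  cos_dist: number;
  op: "match" | "sub" | "del";
}

interface Insertion {
  produced: string;
  after_canonical_index: number;
}

type Cell =
  | { kind: "slot"; slot: Slot; color: string }
  | { kind: "insertion"; produced: string };

// Maps cos_dist in [0, 1] to a hue from green (120) down to red (0).
function heatColor(cosDist: number): string {
  const t = Math.min(1, Math.max(0, cosDist));
  const hue = Math.round(120 * (1 - t));
  return `hsl(${hue}, 70%, 45%)`;
}

// Interleaves insertions with canonical slots in display order.
function layoutRow(perPosition: Slot[], insertions: Insertion[]): Cell[] {
  const cells: Cell[] = [];
  const pushInsertionsAfter = (idx: number): void => {
    for (const ins of insertions) {
      if (ins.after_canonical_index === idx) cells.push({ kind: "insertion", produced: ins.produced });
    }
  };
  pushInsertionsAfter(-1); // assumption: -1 marks insertions before the first slot
  perPosition.forEach((slot, idx) => {
    cells.push({ kind: "slot", slot, color: heatColor(slot.cos_dist) });
    pushInsertionsAfter(idx);
  });
  return cells;
}
```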

Service: add pronounceAudio(blob, targetWord, opts) to audioApi.ts — multipart, same warming-error semantics as transcribeAudio.
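The "same warming-error semantics" requirement can be sketched as the response-handling half of `pronounceAudio`, kept separate from the fetch so it can be tested without a network. `interpretPronounceResponse` and `PronounceResult` are hypothetical names, and the `TranscriberWarmingError` class below is a local stand-in for the one the transcribe viewer already handles.

```typescript
// Local stand-in for the error type used by the transcribe viewer.
class TranscriberWarmingError extends Error {
  constructor(detail: string) {
    super(detail);
    this.name = "TranscriberWarmingError";
  }
}

// Only the fields the viewer reads first; the rest of the §3 contract passes through.
interface PronounceResult {
  overall_score: number;
  variant_vs_error_class: string;
  [key: string]: unknown;
}

// pronounceAudio would POST the multipart form, parse the JSON body, and hand
// the status and body to this function.
function interpretPronounceResponse(status: number, body: Record<string, unknown>): PronounceResult {
  if (status === 503 && body.warming === true) {
    // The viewer shows its warming state on this error instead of a failure.
    throw new TranscriberWarmingError(String(body.detail ?? "transcriber is warming up"));
  }
  if (status !== 200) {
    throw new Error(String(body.detail ?? `request failed with status ${status}`));
  }
  return body as PronounceResult;
}
```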

6. Validation

Lab dir: research/2026-06-05-phon-129-l2-accent-scorer/. Checkpoint per the long-running-jobs policy (--checkpoint-dir, SIGINT flush, resume).

6.1 Metric port correctness (drift guard)

score_fixtures.json: run PHON-126's Python cos_dist/WPER on a set of (produced, canonical) pairs and freeze the outputs; a vitest then asserts that pronunciationScore.ts reproduces them to within 1e-6. This gives the "verbatim metric" guarantee without co-locating scoring on the model host.
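The comparison at the heart of the drift guard can be sketched as a small helper that the vitest would call. The fixture schema (`produced`, `canonical`, `wper`) and the names `ScoreFixture` and `checkFixtures` are assumptions; the actual layout of score_fixtures.json is not specified here.

```typescript
// Assumed fixture row: one frozen Python output per (produced, canonical) pair.
interface ScoreFixture {
  produced: string[];
  canonical: string[];
  wper: number;
}

const TOLERANCE = 1e-6;

function checkFixtures(
  fixtures: ScoreFixture[],
  scoreWper: (produced: string[], canonical: string[]) => number,
): { maxDrift: number; failures: ScoreFixture[] } {
  let maxDrift = 0;
  const failures: ScoreFixture[] = [];
  for (const f of fixtures) {
    const drift = Math.abs(scoreWper(f.produced, f.canonical) - f.wper);
    if (drift > maxDrift) maxDrift = drift;
    // Written as a negated <= so that a NaN score also counts as a failure.
    if (!(drift <= TOLERANCE)) failures.push(f);
  }
  return { maxDrift, failures };
}
```

Reporting the worst drift alongside the failing rows makes it easier to tell a systematic port bug from a single floating-point edge case.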

6.2 Real-audio validation: L2-ARCTIC (primary)

Data: /Volumes/ExternalData2/audio-datasets/l2arctic. 24 speakers, balanced 6 L1s × 4 (Arabic, Chinese, Hindi, Korean, Spanish, Vietnamese), ~150 annotated utts each (3,621 annotated TextGrids). Phone-tier gold = canonical,perceived,errortype triples, where errortype is s/d/a (substitution, deletion, addition).

- 01_run_l2arctic.py: per annotated utt, transcribe (off-the-shelf) → cos_dist per position; parse the gold phone tier.
  - Two cos_dist computations per token: (a) from the transcriber output (the real chain), and (b) from the human-perceived phone (oracle). The (a)−(b) gap isolates transcriber error from metric error.
- 02_metrics.py: PHON-126's three diagnostics extended to real audio:
  - Mann-Whitney U (one-sided, variant < error)
  - practical threshold (variant 75th pct < error 25th pct)
  - Spearman ρ (severity rank vs cos_dist)
  - reported per-L1 (6 groups) AND pooled.
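Two of the three diagnostics can be illustrated in a few lines. This sketch is in TypeScript only for consistency with the other examples; the lab scripts themselves are Python. It assumes linear-interpolation percentiles and non-empty inputs, computes the U statistic by direct pair counting with ties at 0.5, and omits the p-value and Spearman ρ. All function names are hypothetical.

```typescript
// Percentile with linear interpolation between closest ranks (assumed convention).
function percentile(values: number[], p: number): number {
  const s = [...values].sort((a, b) => a - b);
  const pos = (p / 100) * (s.length - 1);
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return s[lo] + (s[hi] - s[lo]) * (pos - lo);
}

// "Practical threshold clean": the variant 75th percentile sits below the error 25th.
function practicalThresholdClean(variant: number[], error: number[]): boolean {
  return percentile(variant, 75) < percentile(error, 25);
}

// U counts (variant, error) pairs where the variant distance is smaller.
// auc = U / (n1 * n2) is the probability that a random variant scores below
// a random error, so 1.0 is perfect separation and 0.5 is chance.
function mannWhitneyU(variant: number[], error: number[]): { u: number; auc: number } {
  let u = 0;
  for (const v of variant) {
    for (const e of error) {
      u += v < e ? 1 : v === e ? 0.5 : 0;
    }
  }
  return { u, auc: u / (variant.length * error.length) };
}

// Runs both diagnostics for one group (pooled, or a single L1).
function summarizeGroup(variant: number[], error: number[]) {
  return { ...mannWhitneyU(variant, error), thresholdClean: practicalThresholdClean(variant, error) };
}
```

Calling `summarizeGroup` once on the pooled tokens and once per L1 gives the pooled-plus-stratified report the release gate asks for.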

Release gate: pooled separation replicates PHON-126 on real audio (Mann-Whitney significant + practical threshold clean), and the per-L1 breakdown is reported. Per-L1 threshold drift is a finding (feeds the §7 follow-on), not a release blocker.

6.3 Error pole: PhonBank Clinical (secondary)

Disordered child speech (narrow actual) as the "error" pole for the full variant<error contrast on real audio: L2-ARCTIC substitutions = variant class, PhonBank Clinical substitutions = error class. Flag the age/recording confound (L2-ARCTIC adult vs PhonBank child) honestly; this is a supporting check, not the gate.

7. The L1 seam (why it exists, what it is, what it is not)

Grounded in FLEGE_SLM.md. SLM-r: L2 substitution structure is conditioned by the speaker's L1 at the position-sensitive allophone level (equivalence classification; perceived cross-language dissimilarity). The El Kheir et al. (2023) L1-MultiMDD system — same encoder family, same L2-ARCTIC corpus, same eSpeak tooling — shows L1-conditioning drops false-rejection rate 5.46→4.26 (the "variant mis-flagged as error" mode) and PER 13.70→12.52.

v6.1 design response (seam, not implementation):

1. /api/audio/pronounce accepts optional l1; echoed + threshold_basis: "l1_agnostic" tag.
2. Validation is L1-stratified (§6.2), which is the empirical tell for whether L1-conditioning is needed.
3. Upgrade path (out of scope for v6.1): PHON-137/138's confusion channel P(produced | canonical, population, position) with population = L1 becomes the SLM-r-faithful, position-sensitive, probabilistic L1 prior. The l1 seam means it slots in without an API change.

8. Files touched

New:

- packages/web/workers/src/routes/audio.ts → add pronounce handler (or audioPronounce.ts if audio.ts grows past one responsibility).
- packages/web/workers/src/lib/pronunciationScore.ts: scoring module.
- packages/web/workers/src/__tests__/audioPronounce.test.ts: route + alignment + fixture tests.
- packages/web/frontend/src/components/tools/PronunciationViewer.tsx (+ .test.tsx).
- research/2026-06-05-phon-129-l2-accent-scorer/{01_run_l2arctic.py,02_metrics.py,score_fixtures.json,RESULTS.md}.

Modified:

- packages/web/frontend/src/services/audioApi.ts: add pronounceAudio().
- packages/web/workers/src/index.ts: already mounts /api/audio; new sub-route only.
- Dev-route registration for PronunciationViewer (mirror the transcribe viewer's wiring).

Reused unchanged: similarity.ts phonemeCosine + PhonemeCache; audio.ts multipart validation + warming-proxy pattern; words.phonemes lookup; phonemes/phoneme_dots D1 tables.

9. Done when

- POST /api/audio/pronounce live with the §3 contract (off-the-shelf default, ft selectable, l1 seam).
- pronunciationScore.ts passes the frozen PHON-126 fixture (1e-6) + alignment edge-case tests.
- PronunciationViewer dev page mounted, mirroring the transcribe viewer (record/upload/preloaded, target word, L1 tag, transcriber toggle, per-position heat + score + class badge).
- L2-ARCTIC validation run: PHON-126's three metrics reported pooled + per-L1; RESULTS.md written with a GO/NO-GO read and an explicit per-L1 L1-sensitivity note.
- v6.1 release notes published.

10. References

Umbrella: docs/superpowers/specs/2026-05-30-v6-audio-support-design.md
research/2026-06-05-phon-129-l2-accent-scorer/FLEGE_SLM.md
PHON-126 findings: research/2026-05-28-phon-126-feature-vector-graded-error/findings.md
PHON-128 viewer (AudioTranscribeViewer.tsx) + audio.ts proxy pattern + audioApi.ts
similarity.ts (cos_dist substrate), [[project_audio_data_reservoir]], [[project_audio_targeted_models]]