
PHON-129 — Model #2: L2 / Accent Pronunciation Scorer (v6.1)

Status: design, pending plan
Ticket: PHON-129
Parent: PHON-44 Audio
Umbrella spec: docs/superpowers/specs/2026-05-30-v6-audio-support-design.md §2.2, §3, §6
Research grounding: research/2026-06-05-phon-129-l2-accent-scorer/FLEGE_SLM.md
Date: 2026-06-05
Depends on: Model #1 (PHON-128, Done) for the transcription sub-call.


1. Goal

Given a target word and an audio production, score how far the production is from canonical, per phoneme position, plus an overall score and a per-word variant-vs-error label. Serves L2 language learners, accent-modification clinicians, ESL teachers, and voice coaches — users who are explicitly trying to match a canonical target.

The metric is PHON-126's validated cos_dist over the learned articulatory feature vectors. Model #1 produces the transcript; this ticket adds the scoring layer + a dev-page surface.

1.1 Scope guard — this is a dev page, not the user-facing tool

The frontend deliverable is a dev/validation page mirroring AudioTranscribeViewer.tsx, in the same spirit as the PHON-128 viewer. The eventual polished, user-facing Pronunciation tool is a separate later spec that unifies the v6 audio dev surfaces into one tool. This ticket does not build production tool polish (no onboarding, no marketing copy, no cross-tool threading beyond the existing pattern). YAGNI on everything the unification spec will own.

1.2 Non-goals

  • Not for disordered speech. Model #2 assumes the speaker intends canonical; canonical bias is the feature. Disordered/covert-contrast cases are Models #3 and #5. The honest limit is surfaced in the response limitations[] and on the dev page.
  • Not L1-conditioned scoring yet. v6.1 ships population-agnostic scoring. It threads the L1 seam and reports L1-stratified validation, but does not yet condition the metric on L1 (see §7).
  • Not a new inference host. Reuses the existing local phonolex_audio server (FastAPI; AUDIO_INFERENCE_URL, already wired in wrangler.toml/.dev.vars); all scoring runs in-Worker. RunPod (the umbrella spec's eventual public scale-to-zero deploy) is out of scope for PHON-129 — the Worker is host-agnostic via AUDIO_INFERENCE_URL, so the production host is a later deploy decision, not a code concern here.

2. Architecture

2.1 Request flow (Approach A — scoring in the Worker)

POST /api/audio/pronounce   { target_word, audio, transcriber?, l1? }
  │
  ├─ 1. Validate multipart (reuse audio.ts proxy validation: 10 MB cap, audio/* type)
  ├─ 2. Sub-call the local phonolex_audio inference server (AUDIO_INFERENCE_URL)  →  produced phonemes[]
  │        transcriber = "off-the-shelf" (default) | "ft"
  ├─ 3. Look up canonical phonemes for target_word from D1 `words.phonemes`
  │        (word not found → 404 { detail })
  ├─ 4. Score in-Worker (pronunciationScore.ts):
  │        WPER align(produced, canonical), sub-cost = cos_dist = clip(1 − cosine, 0, 1)
  │        reuse the PhonemeCache (norms + dots) loaded from D1 phonemes/phoneme_dots
  └─ 5. Return scoring contract (§3)

Why Approach A (scoring in TS, not co-located with the model on the inference host): zero new inference infra; reuses the PhonemeCache + phonemeCosine already in similarity.ts; pure-TS scoring is unit-testable under cloudflare:test; matches the umbrella spec's "in-Worker cos_dist" recipe. The only cost — re-implementing PHON-126's ~30-line metric in TS — is neutralized by a frozen cross-language fixture (§6.1).
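Step 1 of the flow can be sketched as a pure check that runs before any sub-call. This is a minimal sketch, not the real audio.ts helper (which this spec does not show): the function name, the result shape, the 400/413/415 status codes, and reading "10 MB" as 10 × 1024 × 1024 bytes are all assumptions.

```typescript
// Hypothetical result shape; the real audio.ts validation is not shown in this spec.
type UploadCheck =
  | { ok: true }
  | { ok: false; status: number; detail: string };

// Assumption: "10 MB" is interpreted as 10 MiB.
const MAX_AUDIO_BYTES = 10 * 1024 * 1024;

// Validates the uploaded file's presence, MIME type, and size (step 1).
function checkAudioUpload(file: { size: number; type: string } | null): UploadCheck {
  if (!file) {
    return { ok: false, status: 400, detail: "audio field is required" };
  }
  if (!file.type.startsWith("audio/")) {
    return { ok: false, status: 415, detail: `unsupported type: ${file.type}` };
  }
  if (file.size > MAX_AUDIO_BYTES) {
    return { ok: false, status: 413, detail: "audio exceeds 10 MB" };
  }
  return { ok: true };
}
```

Keeping the check free of request objects means it can be unit-tested without a Worker runtime.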

2.2 Transcriber selection — off-the-shelf by default

transcriber defaults to "off-the-shelf" (wav2vec2-lv-60-espeak-cv-ft, no FT). Rationale:

  • The scorer measures distance from what was actually produced to canonical. The transcriber must faithfully report the production. The PHON-139 FT model's faithfulness gain (collapse 17.4% vs off-the-shelf 33%) was measured on disordered child speech; L2 is not disordered, so the off-the-shelf model is the appropriate default for this population and matches the umbrella recipe.
  • transcriber: "ft" is supported so the dev page can A/B both on L2 audio ("we're not limited"). Routes to the local server's /compare path (the PHON-139 lineage; phonolex_audio launched with --ft-checkpoint exposes off-the-shelf vs FT side by side).
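The default-and-validate behavior for the transcriber field could look like the sketch below. The function name and the choice to return null for an unrecognized value (leaving the status code to the caller) are assumptions, not part of the contract.

```typescript
type Transcriber = "off-the-shelf" | "ft";

// Returns the transcriber to use, defaulting to "off-the-shelf" when the
// field is absent or empty. Returns null for an unrecognized value so the
// route can reject it (how it rejects is not specified here).
function parseTranscriber(raw: string | null | undefined): Transcriber | null {
  if (raw === null || raw === undefined || raw === "") return "off-the-shelf";
  return raw === "off-the-shelf" || raw === "ft" ? raw : null;
}
```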

3. API contract

POST /api/audio/pronounce (multipart/form-data):

field        type                    required  notes
audio        file (audio/*)          yes       ≤ 10 MB; client records via MediaRecorder
target_word  string                  yes       looked up in D1 words for canonical phonemes
transcriber  "off-the-shelf" | "ft"  no        default off-the-shelf
l1           string                  no        forward-compat seam; echoed/tagged only in v6.1

Response (200):

{
  "target_word": "very",
  "canonical_phonemes": ["v","ɛ","ɹ","i"],
  "transcript": { "phonemes": ["b","ɛ","ɹ","i"], "confidences": [...],
                  "duration_ms": 812, "coverage": "broad-phoneme", "limitations": [...] },
  "per_position": [                          // one entry PER CANONICAL position
    { "canonical": "v", "produced": "b", "cos_dist": 0.31, "op": "sub" },
    { "canonical": "ɛ", "produced": "ɛ", "cos_dist": 0.0,  "op": "match" },
    { "canonical": "ɹ", "produced": "ɹ", "cos_dist": 0.0,  "op": "match" },
    { "canonical": "i", "produced": "i", "cos_dist": 0.0,  "op": "match" }
  ],
  "insertions": [                            // extra produced phones (out-of-band)
    // { "produced": "ə", "after_canonical_index": 2 }
  ],
  "overall_score": 0.92,                     // 1 − WPER, in [0,1]
  "variant_vs_error_class": "error",         // per-word aggregate = worst position
  "threshold_basis": "l1_agnostic",          // honest tag: the knob SLM-r says is L1-sensitive
  "l1": null,                                // echoed if provided
  "transcriber": "off-the-shelf",
  "coverage": "broad-phoneme",
  "limitations": ["Scores against the canonical target; assumes the speaker intends canonical.",
                  "Broad-phoneme only; distortions/covert contrast not modeled (Models #3, #5).",
                  "variant/error threshold is L1-agnostic in v6.1."]
}

Contract decisions:

  • per_position is keyed to canonical positions (the targets). Substitutions and deletions occupy a canonical slot; insertions (extra produced phones) go in insertions[] since they do not map to a target position.
  • op ∈ {match, sub, del} per canonical position. cos_dist for del = 1.0 (max).
  • overall_score = 1 − WPER is the headline number.
  • l1 is optional and echoed; it is not used in scoring in v6.1.
  • threshold_basis: "l1_agnostic" makes the SLM-r-flagged limitation legible in the payload and marks the seam for the L1-conditioned upgrade.
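The response shape can be written down as TypeScript types, with a small checker for the invariants the contract decisions imply. This is a sketch: the type and function names are hypothetical, produced: null for a deleted slot is an assumption (the spec only says del slots render empty), and the class value for an all-match word is not specified, so the field is typed loosely.

```typescript
type Op = "match" | "sub" | "del";

interface PerPosition {
  canonical: string;
  produced: string | null; // assumption: null when op is "del"
  cos_dist: number;
  op: Op;
}

interface Insertion {
  produced: string;
  after_canonical_index: number;
}

interface PronounceResponse {
  target_word: string;
  canonical_phonemes: string[];
  per_position: PerPosition[];
  insertions: Insertion[];
  overall_score: number;
  variant_vs_error_class: string; // "variant" | "error"; all-match value unspecified
  threshold_basis: "l1_agnostic";
  l1: string | null;
  transcriber: "off-the-shelf" | "ft";
  coverage: string;
  limitations: string[];
}

// Lists violations of the invariants stated in the contract decisions.
function contractViolations(r: PronounceResponse): string[] {
  const problems: string[] = [];
  if (r.per_position.length !== r.canonical_phonemes.length) {
    problems.push("per_position must have one entry per canonical position");
  }
  for (const slot of r.per_position) {
    if (slot.op === "del" && slot.cos_dist !== 1) {
      problems.push(`del at "${slot.canonical}" must have cos_dist 1.0`);
    }
  }
  if (r.overall_score < 0 || r.overall_score > 1) {
    problems.push("overall_score must be in [0, 1]");
  }
  return problems;
}
```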

Errors:

  • target word not in lexicon → 404 { detail }
  • transcriber host warming/unreachable → 503 { warming: true, detail } (passthrough from the audio.ts proxy pattern)
  • transcript empty / unintelligible clip → 200 with overall_score: 0 and a limitations note (do not 500 on a legitimately unintelligible production)
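The three non-happy paths above map onto status codes as in this sketch. The Outcome union, the function name, and the wording of the detail and limitations strings are hypothetical; only the status codes and body keys come from the list above.

```typescript
// Hypothetical internal outcome type for the non-happy paths.
type Outcome =
  | { kind: "word_not_found"; word: string }
  | { kind: "host_unavailable"; detail: string }
  | { kind: "empty_transcript" };

function toHttp(o: Outcome): { status: number; body: Record<string, unknown> } {
  switch (o.kind) {
    case "word_not_found":
      return { status: 404, body: { detail: `"${o.word}" is not in the lexicon` } };
    case "host_unavailable":
      // Passthrough shape from the audio.ts proxy pattern.
      return { status: 503, body: { warming: true, detail: o.detail } };
    case "empty_transcript":
      // A legitimately unintelligible clip is a valid result, not a server error.
      return {
        status: 200,
        body: {
          overall_score: 0,
          limitations: ["No phonemes were recognized in this clip."], // wording is illustrative
        },
      };
  }
}
```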


4. Scoring module — pronunciationScore.ts

Pure TS, no D1/IO inside the hot path (cache passed in). Reuses phonemeCosine from similarity.ts.

  • cosDist(a, b, cache) = clip(1 − phonemeCosine(a, b, cache), 0, 1)
  • alignWPER(produced, canonical, cache) — flat-sequence Levenshtein DP; sub-cost = cosDist, indel = 1.0; traceback recovers the op path. Returns { perPosition[], insertions[], wper } where wper = totalCost / canonical.length (normalized by canonical length, matching PHON-126).
  • classify(perPosition):
      • per substitution: cos_dist < T → variant, else error; del → error.
      • T = 0.112, the midpoint of PHON-126's variant-75th (0.102) and error-25th (0.122). Single named constant, documented as the L1-agnostic boundary.
      • per-word variant_vs_error_class = worst position class (any error ⇒ word is error).
  • overallScore = clamp(1 − wper, 0, 1).
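The module described above can be sketched as follows. This is not the shipped pronunciationScore.ts: the cosine function is injected as a stand-in for phonemeCosine(a, b, cache), whose real signature lives in similarity.ts, and several edge cases the spec leaves open are filled with labeled assumptions (produced: null for deletions, after_canonical_index of -1 for a leading insertion, wper of 0 for an empty canonical, and a "match" class for an all-match word).

```typescript
type Cosine = (a: string, b: string) => number; // stand-in for phonemeCosine
type Op = "match" | "sub" | "del";
type Cls = "match" | "variant" | "error";
interface Slot { canonical: string; produced: string | null; cos_dist: number; op: Op }
interface Insertion { produced: string; after_canonical_index: number }

const T_VARIANT = 0.112; // midpoint of 0.102 and 0.122 (L1-agnostic boundary)

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));
const cosDist = (a: string, b: string, cos: Cosine): number => clamp01(1 - cos(a, b));

function alignWPER(produced: string[], canonical: string[], cos: Cosine) {
  const n = produced.length;
  const m = canonical.length;
  // dp[i][j] = min cost of aligning produced[0..i) with canonical[0..j)
  const dp = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(0));
  for (let i = 1; i <= n; i++) dp[i][0] = i;
  for (let j = 1; j <= m; j++) dp[0][j] = j;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j - 1] + cosDist(produced[i - 1], canonical[j - 1], cos), // match/sub
        dp[i - 1][j] + 1, // insertion: extra produced phone
        dp[i][j - 1] + 1, // deletion: canonical phone missing
      );
    }
  }
  // Traceback from the bottom-right corner, preferring match/sub, then del.
  const perPosition: Slot[] = [];
  const insertions: Insertion[] = [];
  let i = n;
  let j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0) {
      const d = cosDist(produced[i - 1], canonical[j - 1], cos);
      if (dp[i][j] === dp[i - 1][j - 1] + d) {
        const op: Op = produced[i - 1] === canonical[j - 1] ? "match" : "sub";
        perPosition.push({ canonical: canonical[j - 1], produced: produced[i - 1], cos_dist: d, op });
        i--; j--;
        continue;
      }
    }
    if (j > 0 && dp[i][j] === dp[i][j - 1] + 1) {
      perPosition.push({ canonical: canonical[j - 1], produced: null, cos_dist: 1, op: "del" });
      j--;
      continue;
    }
    // Assumption: index -1 marks an insertion before the first canonical phone.
    insertions.push({ produced: produced[i - 1], after_canonical_index: j - 1 });
    i--;
  }
  perPosition.reverse();
  insertions.reverse();
  return { perPosition, insertions, wper: m === 0 ? 0 : dp[n][m] / m };
}

const classifySlot = (s: Slot): Cls =>
  s.op === "match" ? "match" : s.op === "sub" && s.cos_dist < T_VARIANT ? "variant" : "error";

// Per-word class = worst position class. Insertions are not classified here;
// the spec does not say whether they affect the word label.
function classifyWord(slots: Slot[]): Cls {
  const rank: Record<Cls, number> = { match: 0, variant: 1, error: 2 };
  return slots.map(classifySlot).reduce<Cls>((w, c) => (rank[c] > rank[w] ? c : w), "match");
}

const overallScore = (wper: number): number => clamp01(1 - wper);
```

With a toy cosine in which cos(b, v) = 0.69, aligning b ɛ ɹ i against v ɛ ɹ i gives one substitution at cos_dist ≈ 0.31, so wper ≈ 0.31 / 4 = 0.0775 and the overall score is about 0.92, in line with the §3 example.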

Cache loading: the route loads PhonemeCache (norms from phonemes, dots from phoneme_dots) once per isolate on cold start — same mechanism the similarity route already uses. No new tables.
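A common way to get "once per isolate" loading is a module-level memoized promise, sketched below. Whether the similarity route uses exactly this pattern is an assumption; the cache shape and the injected loader are hypothetical stand-ins for the real PhonemeCache and its D1 queries.

```typescript
// Hypothetical stand-in for the real PhonemeCache type.
interface PhonemeCacheLike {
  norms: Map<string, number>;
  dots: Map<string, number>;
}

// Module scope survives across requests within one isolate.
let cachePromise: Promise<PhonemeCacheLike> | null = null;

// Concurrent cold-start requests share one in-flight load; a failed load
// clears the memo so the next request can retry.
function getPhonemeCache(load: () => Promise<PhonemeCacheLike>): Promise<PhonemeCacheLike> {
  if (!cachePromise) {
    cachePromise = load().catch((err) => {
      cachePromise = null;
      throw err;
    });
  }
  return cachePromise;
}
```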


5. Frontend — PronunciationViewer.tsx (dev page)

Mirrors AudioTranscribeViewer.tsx. Mounted as a dev route alongside the transcribe viewer.

Controls:

  • Record (MediaRecorder) + file upload + curated preloaded-clip picker (reuse loadAudioSamples).
  • target_word text input (the one new control).
  • L1 dropdown {Arabic, Chinese, Hindi, Korean, Spanish, Vietnamese, unknown} (the L2-ARCTIC L1s); wired to l1, tags output only.
  • Transcriber toggle: off-the-shelf / ft.

Display:

  • Canonical phoneme row; under each, the produced phone; per-position cos_dist heat (low=green → high=red); del slots marked empty, insertions[] rendered between slots.
  • Overall score; variant_vs_error_class badge with the l1_agnostic caveat inline.
  • Warming state reused from the transcribe viewer's TranscriberWarmingError handling.
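The green-to-red heat can be produced by mapping cos_dist onto an HSL hue. This is a minimal sketch: the function name and the saturation and lightness values are illustrative choices, not part of the spec.

```typescript
// Maps cos_dist in [0, 1] to a CSS color: 0 → green (hue 120), 1 → red (hue 0).
function heatColor(cosDist: number): string {
  const d = Math.min(1, Math.max(0, cosDist)); // guard against out-of-range input
  const hue = Math.round(120 * (1 - d));
  return `hsl(${hue}, 70%, 45%)`;
}
```

A linear hue ramp passes through yellow at cos_dist 0.5, which keeps mid-range distances visually distinct from both ends.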

Service: add pronounceAudio(blob, targetWord, opts) to audioApi.ts — multipart, same warming-error semantics as transcribeAudio.
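The multipart body that pronounceAudio sends could be assembled as below. This sketch covers only form construction, using the field names from §3. The helper name, the options type, and the "clip.webm" filename are assumptions, and the fetch call and warming-error handling are left to the existing transcribeAudio pattern.

```typescript
interface PronounceOpts {
  transcriber?: "off-the-shelf" | "ft";
  l1?: string;
}

// Builds the multipart/form-data body for POST /api/audio/pronounce.
// Optional fields are omitted rather than sent empty, so server defaults apply.
function buildPronounceForm(blob: Blob, targetWord: string, opts: PronounceOpts = {}): FormData {
  const form = new FormData();
  form.append("audio", blob, "clip.webm"); // filename is illustrative
  form.append("target_word", targetWord);
  if (opts.transcriber) form.append("transcriber", opts.transcriber);
  if (opts.l1) form.append("l1", opts.l1);
  return form;
}
```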


6. Validation

Lab dir: research/2026-06-05-phon-129-l2-accent-scorer/. Checkpoint per the long-running-jobs policy (--checkpoint-dir, SIGINT flush, resume).

6.1 Metric port correctness (drift guard)

score_fixtures.json — run PHON-126's Python cos_dist/WPER on a set of (produced, canonical) pairs, freeze outputs; a vitest asserts pronunciationScore.ts reproduces them to 1e-6. Gives the "verbatim metric" guarantee without co-locating scoring on the model host.
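The drift guard reduces to comparing the TS scorer against frozen numbers within a tolerance. A sketch of that comparison follows. The fixture shape (produced, canonical, expected_wper) is hypothetical, since the actual layout of score_fixtures.json is not given here, and the scorer is injected so the helper stays independent of the module under test.

```typescript
// Hypothetical fixture row; the real score_fixtures.json layout may differ.
interface ScoreFixture {
  produced: string[];
  canonical: string[];
  expected_wper: number;
}

// Returns one message per fixture whose score drifts beyond the tolerance.
// The negated comparison also flags NaN results.
function driftFailures(
  fixtures: ScoreFixture[],
  score: (produced: string[], canonical: string[]) => number,
  tol = 1e-6,
): string[] {
  const failures: string[] = [];
  for (const f of fixtures) {
    const got = score(f.produced, f.canonical);
    if (!(Math.abs(got - f.expected_wper) <= tol)) {
      failures.push(`${f.canonical.join(" ")}: expected ${f.expected_wper}, got ${got}`);
    }
  }
  return failures;
}
```

In a vitest, the assertion would then be that driftFailures returns an empty array, which reports every drifting pair at once instead of stopping at the first.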

6.2 Real-audio validation — L2-ARCTIC (primary)

Data: /Volumes/ExternalData2/audio-datasets/l2arctic — 24 speakers, balanced 6 L1s × 4 (Arabic, Chinese, Hindi, Korean, Spanish, Vietnamese), ~150 annotated utts each (3,621 annotated TextGrids). Phone-tier gold = canonical,perceived,errortype triples (s/d/a).

  • 01_run_l2arctic.py — per annotated utt: transcribe (off-the-shelf) → cos_dist per position; parse the gold phone tier.
  • Two cos_dist computations per token: (a) from the transcriber output (the real chain), and (b) from the human-perceived phone (oracle). The (a)−(b) gap isolates transcriber error from metric error.
  • 02_metrics.py computes PHON-126's three diagnostics, extended to real audio and reported per-L1 (6 groups) AND pooled:
      • Mann-Whitney U (one-sided, variant < error)
      • practical threshold (variant 75th pct < error 25th pct)
      • Spearman ρ (severity rank vs cos_dist)
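Two of the three diagnostics are simple enough to illustrate inline. The sketch below computes the practical-threshold check and the Mann-Whitney U statistic in TypeScript for consistency with the rest of this spec; the lab scripts themselves are Python. Linear-interpolation percentiles are an assumption about PHON-126's convention, and the U statistic is reported without a p-value (significance testing and Spearman ρ are omitted).

```typescript
// Percentile with linear interpolation between closest ranks (p in [0, 1]).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty sample");
  const sorted = [...values].sort((a, b) => a - b);
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Practical threshold is "clean" when the variant 75th pct sits below the error 25th pct.
function practicalThresholdClean(variant: number[], error: number[]): boolean {
  return percentile(variant, 0.75) < percentile(error, 0.25);
}

// U counts (variant, error) pairs with variant < error; ties count one half.
// effect = U / (n1 * n2) is the probability that a random variant scores
// below a random error.
function mannWhitneyU(variant: number[], error: number[]): { u: number; effect: number } {
  let u = 0;
  for (const v of variant) {
    for (const e of error) {
      u += v < e ? 1 : v === e ? 0.5 : 0;
    }
  }
  return { u, effect: u / (variant.length * error.length) };
}
```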

Release gate: pooled separation replicates PHON-126 on real audio (Mann-Whitney significant + practical threshold clean), and the per-L1 breakdown is reported. Per-L1 threshold drift is a finding (feeds the §7 follow-on), not a release blocker.

6.3 Error pole — PhonBank Clinical (secondary)

Disordered child speech (narrow actual) as the "error" pole for the full variant<error contrast on real audio: L2-ARCTIC substitutions = variant class, PhonBank Clinical substitutions = error class. Flag the age/recording confound (L2-ARCTIC adult vs PhonBank child) honestly; this is a supporting check, not the gate.


7. The L1 seam (why it exists, what it is, what it is not)

Grounded in FLEGE_SLM.md. SLM-r: L2 substitution structure is conditioned by the speaker's L1 at the position-sensitive allophone level (equivalence classification; perceived cross-language dissimilarity). The El Kheir et al. (2023) L1-MultiMDD system — same encoder family, same L2-ARCTIC corpus, same eSpeak tooling — shows L1-conditioning drops false-rejection rate 5.46→4.26 (the "variant mis-flagged as error" mode) and PER 13.70→12.52.

v6.1 design response (seam, not implementation):

  1. /api/audio/pronounce accepts optional l1; echoed + threshold_basis: "l1_agnostic" tag.
  2. Validation is L1-stratified (§6.2), which is the empirical tell for whether L1-conditioning is needed.
  3. Upgrade path (out of scope for v6.1): PHON-137/138's confusion channel P(produced | canonical, population, position) with population = L1 becomes the SLM-r-faithful, position-sensitive, probabilistic L1 prior. The l1 seam means it slots in without an API change.
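In code, the v6.1 seam amounts to a function that accepts l1 and deliberately ignores it when choosing the boundary. A sketch, with hypothetical names; only the 0.112 constant and the "l1_agnostic" tag come from this spec.

```typescript
const T_VARIANT_L1_AGNOSTIC = 0.112;

interface ThresholdChoice {
  threshold: number;
  threshold_basis: "l1_agnostic";
  l1: string | null; // echoed back, never used for scoring in v6.1
}

// v6.1 behavior: every l1 value yields the same L1-agnostic boundary.
// An L1-conditioned upgrade would change this function's body and widen
// threshold_basis, without touching the request shape.
function chooseThreshold(l1?: string | null): ThresholdChoice {
  return {
    threshold: T_VARIANT_L1_AGNOSTIC,
    threshold_basis: "l1_agnostic",
    l1: l1 ?? null,
  };
}
```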


8. Files touched

New:

  • packages/web/workers/src/routes/audio.ts → add pronounce handler (or audioPronounce.ts if audio.ts grows past one responsibility).
  • packages/web/workers/src/lib/pronunciationScore.ts: scoring module.
  • packages/web/workers/src/__tests__/audioPronounce.test.ts: route + alignment + fixture tests.
  • packages/web/frontend/src/components/tools/PronunciationViewer.tsx (+ .test.tsx).
  • research/2026-06-05-phon-129-l2-accent-scorer/{01_run_l2arctic.py,02_metrics.py,score_fixtures.json,RESULTS.md}.

Modified:

  • packages/web/frontend/src/services/audioApi.ts: add pronounceAudio().
  • packages/web/workers/src/index.ts: already mounts /api/audio; new sub-route only.
  • Dev-route registration for PronunciationViewer (mirror the transcribe viewer's wiring).

Reused unchanged: similarity.ts phonemeCosine + PhonemeCache; audio.ts multipart validation + warming-proxy pattern; words.phonemes lookup; phonemes/phoneme_dots D1 tables.


9. Done when

  • POST /api/audio/pronounce live with the §3 contract (off-the-shelf default, ft selectable, l1 seam).
  • pronunciationScore.ts passes the frozen PHON-126 fixture (1e-6) + alignment edge-case tests.
  • PronunciationViewer dev page mounted, mirroring the transcribe viewer (record/upload/preloaded, target word, L1 tag, transcriber toggle, per-position heat + score + class badge).
  • L2-ARCTIC validation run: PHON-126's three metrics reported pooled + per-L1; RESULTS.md written with a GO/NO-GO read and an explicit per-L1 L1-sensitivity note.
  • v6.1 release notes published.

10. References

  • Umbrella: docs/superpowers/specs/2026-05-30-v6-audio-support-design.md
  • research/2026-06-05-phon-129-l2-accent-scorer/FLEGE_SLM.md
  • PHON-126 findings: research/2026-05-28-phon-126-feature-vector-graded-error/findings.md
  • PHON-128 viewer (AudioTranscribeViewer.tsx) + audio.ts proxy pattern + audioApi.ts
  • similarity.ts (cos_dist substrate), [[project_audio_data_reservoir]], [[project_audio_targeted_models]]