PHON-142: FT-L2 connected-speech transcriber, an L1-conditioning comparative study

Status: design, pending plan
Ticket: PHON-142 (to file; next free key verified 2026-06-05, highest was PHON-141)
Parent: PHON-44 Audio · feeds Model #2 (PHON-129)
Grounding: research/2026-06-05-phon-129-l2-accent-scorer/FLEGE_SLM.md, .../RESULTS.md
Date: 2026-06-05

1. Why

PHON-129 validated Model #2's metric (cos_dist replicates PHON-126 on real L2 audio, all 6 L1s) but exposed the transcriber as the binding constraint: the off-the-shelf model canonical-collapses ~59% of substitution positions on connected L2 speech (the error is emitted as the canonical phone, so it is invisible to the scorer). The PHON-139 FT model can't help: it was trained on single words and produces garbage on sentences (5 phonemes for a 38-phoneme utterance). So Model #2 needs a faithful, connected-speech L2 transcriber, and the open question (does conditioning on L1 help, and if so, where should L1 live?) should be answered by measurement, not assertion.

This spike runs the full comparison in one shot: a faithful connected-speech FT transcriber, an L1-conditioned-encoder variant, and an L1 scoring-prior, all evaluated against the off-the-shelf baseline on held-out speakers across all 6 L1s.

2. Experiment matrix (4 chains × 6 L1s, speaker-held-out)

Every chain produces a whole-sequence phoneme transcript, scored by text-aligning it to the known canonical (the PHON-129 metric). No audio segmenter is needed: the target is always known, so word/position attribution is deterministic from the alignment, and CTC timestamps anchor any UI highlight.

| # | Chain | L1 lives in | Question it answers |
|---|---|---|---|
| 0 | off-the-shelf `wav2vec2-lv-60-espeak-cv-ft` | n/a | baseline (the 59% collapse reference) |
| 1 | FT-L2-faithful (no L1 input) | nowhere | does faithful connected-speech FT beat off-the-shelf? |
| 2 | FT-L2-L1-encoder (El Kheir aux L1 head) | transcriber | does L1-in-the-acoustic-model beat faithful? |
| 3 | FT-L2-faithful + L1 scoring-prior | scorer | does L1-in-the-scorer beat faithful? |

Comparisons: 1 vs 0 = faithfulness payoff; 2 vs 1 and 3 vs 1 = does L1 help; 2 vs 3 = encoder vs scorer for L1.
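
A minimal sketch of how a chain's frame-level CTC output could become a phoneme transcript with frame spans for UI highlighting. It assumes greedy (argmax) decoding and the 0=pad / 1=blank index layout given for the head in section 4; `ctc_greedy_decode` and `id_to_phone` are illustrative names, not the harness API.

```python
from itertools import groupby

# Index layout assumed from the section 4 head: 0=pad, 1=blank, 2..41=phonemes.
PAD_ID, BLANK_ID = 0, 1


def ctc_greedy_decode(frame_ids, id_to_phone):
    """Collapse repeated frame labels, drop blank/pad, keep frame spans.

    frame_ids: per-frame argmax class ids from the CTC head.
    Returns a list of (phone, start_frame, end_frame_exclusive).
    """
    decoded = []
    frame = 0
    for class_id, run in groupby(frame_ids):
        length = len(list(run))
        if class_id not in (PAD_ID, BLANK_ID):
            decoded.append((id_to_phone[class_id], frame, frame + length))
        frame += length
    return decoded


# Toy example: a blank between two runs of the same id keeps them as two phones.
toy_frames = [1, 1, 2, 2, 1, 2, 3, 3, 3, 1]
toy_vocab = {2: "k", 3: "ae"}
spans = ctc_greedy_decode(toy_frames, toy_vocab)
```

Assuming the standard wav2vec2 frame stride of 20 ms, multiplying a frame index by 0.02 gives an approximate time in seconds for the highlight.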

3. Data

Corpus: L2-ARCTIC, all 24 speakers / 6 L1s (Arabic, Chinese, Hindi, Korean, Spanish, Vietnamese), per-phone gold (the IPA/ISU/ERROR-tier annotations the PHON-129 parser already handles).
Training target = the PRODUCED (perceived) broad-40 sequence reconstructed per utterance from the phone tier (canonical where ok, perceived where annotated). Never canonical — this is the structural anti-collapse property (lifted from PHON-139).
Split: speaker-held-out — 3 speakers/L1 train (18), 1 speaker/L1 test (6). Eval is on unseen speakers (the honest test). ~2,700 train / ~900 test annotated utts.
OOV phones (outside the broad-40 set) handled as in PHON-129 (excluded from cos_dist, not crashed).
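
The target reconstruction and the split above can be sketched as follows. This is an illustration under assumptions, not the PHON-129 parser: the `(canonical, perceived, tag)` tuple shape, the tag names, and the handling of deletions and additions are hypothetical; the source only states "canonical where ok, perceived where annotated".

```python
def produced_sequence(phone_tier):
    """Rebuild the produced (perceived) sequence for one utterance.

    phone_tier: list of (canonical, perceived, tag) per annotated segment.
    tag is hypothetical shorthand: "ok", "sub", "del", or "add".
    """
    produced = []
    for canonical, perceived, tag in phone_tier:
        if tag == "ok":
            produced.append(canonical)
        elif tag in ("sub", "add"):
            produced.append(perceived)  # never the canonical phone
        elif tag == "del":
            continue  # assumption: a deleted phone contributes nothing
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    return produced


def speaker_held_out_split(speakers_by_l1, test_speaker_by_l1):
    """Return (train, test) speaker lists with one held-out speaker per L1."""
    train, test = [], []
    for l1, speakers in speakers_by_l1.items():
        held_out = test_speaker_by_l1[l1]
        if held_out not in speakers:
            raise ValueError(f"{held_out!r} is not a {l1} speaker")
        test.append(held_out)
        train.extend(s for s in speakers if s != held_out)
    return train, test
```

Splitting by speaker rather than by utterance is what keeps the test speakers unseen: no utterance from a held-out speaker can reach the training set.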

4. Architectures

#1 Faithful: PHON-139 recipe A, near-verbatim — wav2vec2-lv-60-espeak base, broad-40 CTC head Linear(1024→42) (0=pad,1=blank,2..41=40 phonemes), conv front-end frozen, transformer+head fine-tuned on connected-speech perceived labels. Only change vs PHON-139: connected L2 sentences as data. Train ≥2 seeds for stability.
#2 L1-encoder: #1 + an auxiliary L1 classifier branch over the pooled encoder representation; the L1-aware embedding is fused (concat) into the CTC-head input; joint loss L = L_CTC + λ·L_L1-CE (El Kheir et al. 2023 blueprint, FLEGE_SLM.md §5). Declared L1 at inference (one-hot when known; inferred-embedding path noted as a future extension). Train ≥2 seeds.
#3 Scoring-prior: not a model — P(produced | canonical, L1, position) counted from the train-split annotations (Laplace-smoothed), applied at the variant/error decision on #1's output: an L1-typical substitution (high channel prob for that L1×position) is pulled toward variant, an L1-atypical one toward error. The PHON-137/138 channel form, conditioned on L1.
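
A minimal sketch of the #3 count-based prior, assuming add-alpha (Laplace) smoothing over the phone inventory: P = (count + alpha) / (total + alpha * |inventory|). The class name, the `position` labels, and the toy counts are hypothetical; how the probability is then combined with cos_dist at the variant/error decision is not specified here and is left out.

```python
from collections import Counter, defaultdict


class L1ChannelPrior:
    """Laplace-smoothed P(produced | canonical, L1, position) from counts."""

    def __init__(self, phone_inventory, alpha=1.0):
        self.phones = list(phone_inventory)
        self.alpha = alpha
        # (canonical, l1, position) -> Counter of produced phones
        self.counts = defaultdict(Counter)

    def observe(self, canonical, produced, l1, position):
        """Count one train-split annotation."""
        self.counts[(canonical, l1, position)][produced] += 1

    def prob(self, produced, canonical, l1, position):
        """Smoothed probability; unseen contexts fall back to uniform."""
        context = self.counts.get((canonical, l1, position), Counter())
        total = sum(context.values())
        return (context[produced] + self.alpha) / (
            total + self.alpha * len(self.phones)
        )


# Toy counts for illustration only, not corpus statistics.
prior = L1ChannelPrior(["s", "th", "z"])
for _ in range(3):
    prior.observe("th", "s", l1="L1-A", position="initial")
prior.observe("th", "th", l1="L1-A", position="initial")
typical = prior.prob("s", "th", l1="L1-A", position="initial")
unseen = prior.prob("s", "th", l1="L1-B", position="initial")
```

A substitution with a high value for the declared L1 would be the kind pulled toward variant; a low value, toward error.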

5. Scoring (shared across all chains)

transcript → align(produced, canonical) [WPER, PHON-129 pronunciationScore metric] → per-position cos_dist → ok/sub partition vs human gold. Reuses research/2026-06-05-phon-129-l2-accent-scorer/ harness (with the --transcriber registry extended to the new models). The metric is pinned to PHON-126 (score_fixtures.json).
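
The pipeline above can be sketched as below. Note the stand-ins: plain unit-cost Levenshtein alignment replaces the WPER / pronunciationScore metric, and `vectors` is a hypothetical phone-to-feature-vector map; neither is the PHON-129 harness code.

```python
import math


def align(produced, canonical):
    """Unit-cost edit-distance alignment.

    Returns a list of (canonical_phone, produced_phone, op) with
    op in {"ok", "sub", "del", "ins"}; the missing side is None.
    """
    n, m = len(canonical), len(produced)
    # cost[i][j] = edit distance between canonical[:i] and produced[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (canonical[i - 1] != produced[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # backtrace from the bottom-right cell
        mismatch = i > 0 and j > 0 and canonical[i - 1] != produced[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            op = "sub" if mismatch else "ok"
            ops.append((canonical[i - 1], produced[j - 1], op))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append((canonical[i - 1], None, "del"))
            i -= 1
        else:
            ops.append((None, produced[j - 1], "ins"))
            j -= 1
    ops.reverse()
    return ops


def cos_dist(u, v):
    """Cosine distance; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def position_distances(ops, vectors):
    """cos_dist at each ok/sub position; OOV phones are skipped, not raised."""
    scored = []
    for canonical, produced, op in ops:
        if op not in ("ok", "sub"):
            continue
        if canonical not in vectors or produced not in vectors:
            continue
        scored.append((canonical, produced, cos_dist(vectors[canonical], vectors[produced])))
    return scored
```

Because the canonical side is always known, each returned triple maps to a fixed canonical position, which is what makes the ok/sub comparison against human gold position-by-position.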

6. Infra

RunPod GPU pods (fresh; ~$185 balance; the deleted serverless endpoints are unrelated — these are training pods, spun up and torn down). Parallel containers: #1 and #2 train concurrently; seeds parallelize further. Est. cost well under budget (wav2vec2 FT on ~2.7k utts × few epochs ≈ 1–2 GPU-hr per run).
Checkpointing per the CLAUDE.md long-running-jobs policy (lifted from PHON-139 train.py): atomic full-state save, --checkpoint-every, resume on restart, SIGINT→final-save, expandable_segments:True.
Data to pod: package the annotated Spanish…Vietnamese wavs + reconstructed produced-label JSONL, upload to the pod. Licensing: L2-ARCTIC is CC-BY-NC (Tier B) — training/algorithm-dev is permitted, the data stays private on the pod and is not redistributed; the trained model can ship. Flag for lawyer at ship, consistent with the existing PhonBank/L2-ARCTIC stance.
Execution via the model-trainer agent to drive runs, compare architectures, and analyze metrics.
Dependency: RunPod API access (key) — to be provided by the user; pod-launch commands run via the session ! shell where interactive auth is needed.
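
A minimal sketch of the checkpointing behaviors listed above (atomic save, periodic checkpoint, resume on restart, SIGINT leading to a final save). It is not the PHON-139 train.py: `pickle` and the `{"step": ...}` dict stand in for a full model/optimizer state, and all function names are hypothetical.

```python
import os
import pickle
import signal
import tempfile


def atomic_save(state, path):
    """Write to a temp file in the target directory, then rename over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


class StopFlag:
    """Set by SIGINT so the loop can finish with a final save."""

    def __init__(self):
        self.requested = False

    def install(self):
        signal.signal(signal.SIGINT, lambda signum, frame: self._set())
        return self

    def _set(self):
        self.requested = True


def train(total_steps, path, checkpoint_every, stop=None):
    state = load_checkpoint(path) or {"step": 0}  # resume on restart
    while state["step"] < total_steps:
        if stop is not None and stop.requested:
            break
        state["step"] += 1  # stand-in for one optimizer step
        if state["step"] % checkpoint_every == 0:
            atomic_save(state, path)
    atomic_save(state, path)  # final save, also reached after SIGINT
    return state
```

Writing to a temp file and renaming means a pod killed mid-save leaves the previous checkpoint intact rather than a truncated file.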

7. Evaluation

On the 6 held-out test speakers, for all 4 chains, pooled and per-L1:
- Canonical-collapse rate at annotated substitution positions (the PHON-129 headline: off-the-shelf = 59%; lower is better).
- PHON-126 diagnostics D1 (Mann-Whitney ok<sub), D2 (practical threshold), D3 (Spearman severity).
- FRR (false-rejection: an L1-typical correct-for-that-accent production flagged as error), the El Kheir headline metric where L1-conditioning should show its value.
- Transcriber PER vs the produced (perceived) gold on held-out speakers (raw transcription quality).
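
A hedged sketch of two of these metrics, pooled and per-L1. The row fields (`canonical`, `gold`, `hyp`, `gold_label`, `system_label`, `l1`) are hypothetical, and the FRR denominator (all positions the gold accepts) is an assumption about how the metric would be operationalized here.

```python
from collections import defaultdict


def collapse_rate(rows):
    """Share of gold-annotated substitutions that the transcriber emitted as canonical."""
    subs = [r for r in rows if r["gold"] != r["canonical"]]
    if not subs:
        return None  # undefined without any substitution positions
    return sum(r["hyp"] == r["canonical"] for r in subs) / len(subs)


def false_rejection_rate(rows):
    """Share of gold-accepted productions that the system flagged as error."""
    accepted = [r for r in rows if r["gold_label"] == "accept"]
    if not accepted:
        return None
    return sum(r["system_label"] == "error" for r in accepted) / len(accepted)


def pooled_and_per_l1(metric, rows):
    """Apply one metric to all rows and to each L1's rows separately."""
    by_l1 = defaultdict(list)
    for r in rows:
        by_l1[r["l1"]].append(r)
    report = {"pooled": metric(rows)}
    report.update({l1: metric(group) for l1, group in sorted(by_l1.items())})
    return report
```

Reporting per-L1 alongside the pooled figure keeps one L1 with many substitution positions from dominating the headline number.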

8. Serving + tagging

Extend phonolex_audio to a multi-model registry: off-the-shelf, ft-l2, ft-child (rename the PHON-139 checkpoint slot). The transcriber field selects the model per request; /compare takes any pair; each model carries its own coverage/limitations metadata. (The PHON-129 route already passes a transcriber field, so only its allowed values need extending.)
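
One possible shape for such a registry, as a sketch only. `TranscriberEntry`, `TranscriberRegistry`, and the stub callables are hypothetical and do not reflect the existing phonolex_audio code; only the three model names come from this section.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TranscriberEntry:
    name: str
    transcribe: Callable[[bytes], List[str]]  # audio bytes -> phoneme sequence
    coverage: str
    limitations: str


class TranscriberRegistry:
    def __init__(self) -> None:
        self._entries: Dict[str, TranscriberEntry] = {}

    def register(self, entry: TranscriberEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"duplicate transcriber: {entry.name}")
        self._entries[entry.name] = entry

    def get(self, name: str) -> TranscriberEntry:
        """Resolve the per-request transcriber value, rejecting unknown names."""
        try:
            return self._entries[name]
        except KeyError:
            allowed = ", ".join(sorted(self._entries))
            raise ValueError(f"unknown transcriber {name!r}; allowed: {allowed}") from None

    def compare(self, name_a: str, name_b: str, audio: bytes) -> Dict[str, List[str]]:
        """Run any registered pair on the same audio."""
        return {n: self.get(n).transcribe(audio) for n in (name_a, name_b)}


# Stub models for illustration; real entries would wrap loaded checkpoints.
registry = TranscriberRegistry()
for model_name in ("off-the-shelf", "ft-l2", "ft-child"):
    registry.register(
        TranscriberEntry(model_name, lambda audio, n=model_name: [n], "placeholder", "placeholder")
    )
```

Keeping coverage and limitations on the entry itself lets a response carry the caveats of whichever model produced the transcript.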

9. Out of scope

Redo-child (connected-speech disordered-child FT): separate later ticket. Clinical child assessment is word-elicited (GFTA/PERCEPT), so word-level FT-Child is appropriate; connected/narrative child is a distinct population + scarcer gold.
Production word-boundary forced alignment (for open-vocabulary / unknown-target input): not needed here (Model #2 always knows the target → text-alignment suffices). Deferred to productization.
Inferred-L1 path for #2 (no declared L1): note as extension; v1 uses declared L1.

10. Done when

research/2026-06-05-phon-142-ft-l2/RESULTS.md ranks the 4 chains on collapse / D1–D3 / FRR / PER, pooled + per-L1, with: (a) a GO/NO-GO on whether a faithful FT-L2 transcriber should become Model #2's production transcriber (beating off-the-shelf's 59% collapse), and (b) a verdict on whether L1-conditioning earns its place and where it belongs (encoder vs scorer). Checkpoints + the produced-label dataset retained per policy. PR #124 (the scorer) is the downstream consumer of whichever transcriber wins.

11. References

research/2026-06-05-phon-129-l2-accent-scorer/{FLEGE_SLM.md, RESULTS.md} — grounding + the validation that motivates this
El Kheir, Chowdhury & Ali (2023), L1-aware Multilingual Mispronunciation Detection (arXiv 2309.07719) — encoder L1-conditioning blueprint (same wav2vec2 family, same L2-ARCTIC corpus)
research/2026-06-03-phon-139-transcriber-ft/train.py — the CTC anti-collapse trainer to adapt
PHON-126 — the validated cos_dist metric (score_fixtures.json pin)
packages/audio/src/phonolex_audio/{server.py, transcribe.py, transcribe_ft.py} — serving layer to extend