PHON-142 — FT-L2 L1-Conditioning Comparative Study — Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking. Training tasks (3, 4) should be dispatched to the model-trainer agent.
Goal: Train a faithful connected-speech L2 transcriber + an L1-conditioned variant, and rank 4 transcribe→score chains (off-the-shelf / FT-L2-faithful / FT-L2-L1-encoder / faithful+L1-scoring-prior) on held-out L2-ARCTIC across all 6 L1s, to decide Model #2's production transcriber and whether L1-conditioning earns its place.
Architecture: Anti-collapse CTC fine-tuning of wav2vec2-lv-60-espeak toward produced broad-40 phonemes (PHON-139 recipe) on L2-ARCTIC connected sentences; an El Kheir-style auxiliary-L1-head variant; a statistical L1 scoring-prior. Whole-sequence transcript → text-align to known canonical (PHON-129 metric, no audio segmenter). Trained on RunPod GPU pods, parallel.
Tech Stack: PyTorch, HuggingFace transformers (wav2vec2 CTC), Polars, librosa/soundfile, runpodctl + SSH, the PHON-129 eval harness.
Spec: docs/superpowers/specs/2026-06-05-phon-142-ft-l2-l1-transcriber-study.md
Lift from: research/2026-06-03-phon-139-transcriber-ft/train.py (trainer), research/2026-06-05-phon-129-l2-accent-scorer/01_run_l2arctic.py (parser + cos_dist).
File Structure
All new work under research/2026-06-05-phon-142-ft-l2/:
- build_l2_dataset.py — reconstruct produced-label dataset (all 6 L1s) + speaker-held-out split → data/{train,test}.jsonl. One responsibility: gold → training/eval JSONL.
- train_l2.py — connected-speech faithful FT (adapts PHON-139 train.py). One responsibility: train the faithful model.
- model_l1_encoder.py — the L1-aware model module (base + aux-L1 head + fusion). One responsibility: the architecture.
- train_l2_l1.py — thin trainer wrapping model_l1_encoder.py (reuses train_l2.py data/loop). One responsibility: train the L1-encoder model.
- scoring_prior.py — estimate + apply P(produced|canonical,L1,position). One responsibility: the L1 prior.
- eval_matrix.py — run all 4 chains on the test split, emit per-token rows. One responsibility: produce the comparison rows.
- metrics_matrix.py — collapse/D1-D3/FRR/PER, pooled + per-L1, across chains → tables. One responsibility: the comparison report.
- runpod/{provision.sh, sync.sh, run_training.sh} — pod lifecycle + data sync + launch.
- RESULTS.md — the ranking + GO recommendation.
Reuse unchanged: PHON-129 01_run_l2arctic.py parsing helpers (import them), score_fixtures.json (metric pin), packages/features/outputs/vectors.csv (broad-40 inventory).
Phase 0 — Data preparation (local, test-driven)
Task 1: Build the produced-label dataset + speaker-held-out split
Files:
- Create: research/2026-06-05-phon-142-ft-l2/build_l2_dataset.py
- Create: research/2026-06-05-phon-142-ft-l2/test_build_l2_dataset.py
- [ ] Step 1: Write the failing test

# test_build_l2_dataset.py (run: uv run python -m pytest test_build_l2_dataset.py -v)
from build_l2_dataset import reconstruct_produced, SPK_L1, TRAIN_SPK, TEST_SPK


def test_reconstruct_produced_uses_perceived_at_subs_canonical_elsewhere():
    # tokens: (canonical, perceived, errortype)
    toks = [("k", "k", "ok"), ("ae", "ae", "ok"), ("t", "d", "s")]
    assert reconstruct_produced(toks) == ["k", "ae", "d"]  # produced = perceived where sub


def test_reconstruct_drops_deletions_and_keeps_additions():
    toks = [("k", "k", "ok"), ("t", "sil", "d"), ("sil", "s", "a")]
    # deletion -> phone omitted from produced; addition -> extra produced phone present
    assert reconstruct_produced(toks) == ["k", "s"]


def test_split_is_speaker_disjoint_and_covers_6_l1s():
    assert set(TRAIN_SPK).isdisjoint(set(TEST_SPK))
    assert {SPK_L1[s] for s in TEST_SPK} == {"Arabic","Chinese","Hindi","Korean","Spanish","Vietnamese"}
    assert len(TEST_SPK) == 6 and len(TRAIN_SPK) == 18
- [ ] Step 2: Run it, expect FAIL (build_l2_dataset missing). Run:

cd research/2026-06-05-phon-142-ft-l2 && uv run python -m pytest test_build_l2_dataset.py -v

- [ ] Step 3: Implement build_l2_dataset.py

Import the PHON-129 parser (don't re-derive): parse_annotation, the IPA-tier logic, SPK_L1. Key pieces:
import sys, json
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "2026-06-05-phon-129-l2-accent-scorer"))
# Reuse the validated parser + speaker map from the PHON-129 harness:
from importlib import import_module
_h = import_module("01_run_l2arctic")  # module name starts with a digit -> import_module
parse_annotation = _h.parse_annotation
SPK_L1 = _h.SPK_L1  # {ABA:Arabic, ...} (24 speakers; suitcase_corpus is skipped because only SPK_L1 speakers are visited)

L2 = Path("/Volumes/ExternalData2/audio-datasets/l2arctic")

# Held-out: 1 speaker/L1 to TEST (picked deterministically: last alphabetically per L1), rest TRAIN.
_by_l1 = {}
for spk, l1 in sorted(SPK_L1.items()):
    _by_l1.setdefault(l1, []).append(spk)
TEST_SPK = sorted(v[-1] for v in _by_l1.values())  # 6
TRAIN_SPK = sorted(s for v in _by_l1.values() for s in v[:-1])  # 18


def reconstruct_produced(tokens):
    """tokens: list[(canonical, perceived, errortype)] -> produced broad phone list.
    ok/sub -> use perceived; deletion ('d') -> omit; addition ('a') -> include perceived.
    'sil' is never a phone."""
    out = []
    for canon, perceived, et in tokens:
        if et == "d":  # canonical phone deleted -> not produced
            continue
        ph = perceived
        if ph and ph != "sil":
            out.append(ph)
    return out


def main():
    out_dir = Path(__file__).resolve().parent / "data"
    out_dir.mkdir(exist_ok=True)
    for split, spks in (("train", TRAIN_SPK), ("test", TEST_SPK)):
        rows = []
        for spk in spks:
            for tg in sorted((L2 / spk / "annotation").glob("*.TextGrid")):
                wav = L2 / spk / "wav" / f"{tg.stem}.wav"
                if not wav.exists():
                    continue
                toks = parse_annotation(tg)  # [(canon, perceived, et), ...]
                produced = reconstruct_produced(toks)
                if not produced:
                    continue
                rows.append({"wav": str(wav), "speaker": spk, "l1": SPK_L1[spk],
                             "utt": tg.stem, "produced": produced,
                             "canonical": [c for c, _, _ in toks if c != "sil"]})
        # Trailing newline so that wc -l counts every row.
        (out_dir / f"{split}.jsonl").write_text("\n".join(json.dumps(r) for r in rows) + "\n")
        print(f"{split}: {len(rows)} utts, {len(spks)} speakers")


if __name__ == "__main__":
    main()
Confirm the actual return shape of parse_annotation (it may return per-utt token lists keyed differently) and adapt the call; the test pins reconstruct_produced, which is self-contained.
- [ ] Step 4: Run test, expect PASS, then build the data:

uv run python -m pytest test_build_l2_dataset.py -v
uv run python build_l2_dataset.py

Expected: train: ~2700 utts, 18 speakers / test: ~900 utts, 6 speakers.
- [ ] Step 5: Sanity-check the dataset (no speaker leak, produced≠canonical where subs exist):

uv run python -c "
import json,collections
tr=[json.loads(l) for l in open('data/train.jsonl')]
te=[json.loads(l) for l in open('data/test.jsonl')]
assert not ({r['speaker'] for r in tr} & {r['speaker'] for r in te}), 'speaker leak!'
print('train L1s:', collections.Counter(r['l1'] for r in tr))
print('test L1s:', collections.Counter(r['l1'] for r in te))
print('mean produced len:', sum(len(r['produced']) for r in tr)/len(tr))
"
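The one-liner above covers speaker leakage and label length, but not the second property named in the step (produced differs from canonical where substitutions exist). A minimal sketch of that check follows, assuming rows use the `produced` and `canonical` list fields written by build_l2_dataset.py. The helper names `divergence_rate` and `load_jsonl` are hypothetical and not part of the plan.

```python
import json
from pathlib import Path


def divergence_rate(rows):
    """Fraction of utterances whose produced labels differ from canonical.

    A rate of 0.0 would suggest the reconstruction fell back to the
    canonical sequence everywhere, which defeats faithful fine-tuning.
    """
    if not rows:
        return 0.0
    differing = sum(1 for r in rows if r["produced"] != r["canonical"])
    return differing / len(rows)


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    for split in ("train", "test"):
        path = Path("data") / f"{split}.jsonl"
        if path.exists():  # skip quietly when the dataset has not been built yet
            rows = load_jsonl(path)
            print(split, "utts with produced != canonical:",
                  round(divergence_rate(rows), 3))
```

A rate near zero on L2 speech would be a reason to inspect the parser output before training.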
- [ ] Step 6: Commit (.gitignore the data/ JSONL + any wav copies; Tier B, not committed):

echo "data/" >> research/2026-06-05-phon-142-ft-l2/.gitignore
git add research/2026-06-05-phon-142-ft-l2/build_l2_dataset.py research/2026-06-05-phon-142-ft-l2/test_build_l2_dataset.py research/2026-06-05-phon-142-ft-l2/.gitignore
git commit -m "data(phon-142): produced-label L2-ARCTIC dataset builder + speaker-held-out split"
Phase 1 — RunPod environment
Task 2: Provision a GPU pod, sync data + code, verify
Files: Create research/2026-06-05-phon-142-ft-l2/runpod/{provision.sh,sync.sh}
- [ ] Step 1: Provision script (runpod/provision.sh): a single GPU pod (e.g. RTX A5000/4090, PyTorch image). The user has run this pattern before; use runpodctl:

#!/usr/bin/env bash
# Provision a RunPod GPU pod for FT-L2 training. Prints the pod id + ssh.
set -euo pipefail
runpodctl create pod \
  --name phon142-ft-l2 \
  --gpuType "NVIDIA GeForce RTX 4090" \
  --imageName "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04" \
  --gpuCount 1 --volumeSize 60 --containerDiskSize 30 --ports "22/tcp"
runpodctl get pod  # capture id + ssh endpoint

- [ ] Step 2: Run it + record the pod id / SSH endpoint. Verify GPU:

ssh <pod> "nvidia-smi | head -15 && python -c 'import torch;print(torch.cuda.is_available())'"

Expected: GPU listed, True.
- [ ] Step 3: Sync script (runpod/sync.sh): push the trainer code + the (gitignored) data JSONL + the referenced wavs. Because the JSONL holds absolute Mac wav paths, either rewrite them to the pod path on sync, or rsync the wavs into a mirrored tree and pass a --wav-root to the trainer. Implement the --wav-root approach:

#!/usr/bin/env bash
set -euo pipefail
POD="$1"  # ssh target
rsync -az research/2026-06-05-phon-142-ft-l2/ "$POD:/workspace/phon142/"
# mirror only the wavs referenced by train+test (keeps upload small)
uv run python - <<'PY'
import json,sys
paths={r["wav"] for f in ("train","test") for r in map(json.loads, open(f"research/2026-06-05-phon-142-ft-l2/data/{f}.jsonl"))}
open("/tmp/phon142_wavs.txt","w").write("\n".join(sorted(paths)))
PY
rsync -az --files-from=/tmp/phon142_wavs.txt / "$POD:/workspace/wavroot/"
ssh "$POD" "cd /workspace/phon142 && pip install -q transformers librosa soundfile polars numpy"

(train_l2.py takes --wav-root /workspace/wavroot and resolves wav paths relative to it.)
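Because rsync with --files-from recreates each source path under the destination, the trainer has to join the absolute JSONL path onto the wav root without letting the leading slash discard the root. A minimal sketch of that resolution follows. The helper name `resolve_wav` and the sample file name are illustrative assumptions, not taken from the PHON-139 trainer.

```python
from pathlib import Path


def resolve_wav(wav_root, wav):
    """Map an absolute wav path from the JSONL onto the mirrored pod tree.

    rsync --files-from with source "/" keeps the full source path under the
    destination, so /Volumes/x/a.wav lands at <wav_root>/Volumes/x/a.wav.
    """
    # Strip the leading slash: joining an absolute path would drop wav_root.
    return Path(wav_root) / str(wav).lstrip("/")


# Illustrative path only; the real values come from data/train.jsonl.
example = "/Volumes/ExternalData2/audio-datasets/l2arctic/ABA/wav/arctic_a0001.wav"
print(resolve_wav("/workspace/wavroot", example))
```

Rewriting the paths inside the JSONL at sync time would also work; resolving at load time keeps one dataset file valid on both the Mac and the pod.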
- [ ] Step 4: Verify data on pod:

ssh <pod> "wc -l /workspace/phon142/data/*.jsonl && ls /workspace/wavroot | head"

- [ ] Step 5: Commit the runpod scripts:

git add research/2026-06-05-phon-142-ft-l2/runpod/
git commit -m "infra(phon-142): runpod pod provision + data sync scripts"
Phase 2 — Training (dispatch to the model-trainer agent)
Task 3: Faithful connected-speech FT (train_l2.py)
Files: Create research/2026-06-05-phon-142-ft-l2/train_l2.py
- [ ] Step 1: Adapt PHON-139 train.py. Lift its CTC trainer wholesale (wav2vec2-lv-60-espeak base, Linear(1024→42) broad-40 head, frozen conv front-end, checkpoint policy, recipe A perceived-hard CTC loss). The only substantive changes:
  - Data loader reads data/train.jsonl (this study's format): each row → (librosa.load(wav_root/wav, sr=16000), produced_labels). Map produced IPA → broad-40 ids (offset 2; pad=0, blank=1) via vectors.csv order. Connected-speech utts (no length cap beyond batch memory).
  - Args: --wav-root, --train data/train.jsonl, --val data/test.jsonl (held-out speakers as val), --checkpoint-dir ckpt/faithful_s{seed}, --epochs, --seed.
  - Keep recipe A (hard CTC); drop recipe B / sim-matrix (YAGNI for faithful).
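The label mapping in the data-loader bullet (pad=0, blank=1, phonemes from id 2 in inventory order) can be sketched as below. The inventory here is a toy four-phoneme list standing in for the broad-40 order read from vectors.csv, whose column layout is not shown in this plan. The function names are hypothetical.

```python
PAD_ID, BLANK_ID, OFFSET = 0, 1, 2


def build_label_map(inventory):
    """Map each phoneme to an id, preserving inventory order (ids start at OFFSET)."""
    return {ph: i + OFFSET for i, ph in enumerate(inventory)}


def encode(produced, label_map):
    """Convert a produced phoneme list to CTC target ids; fail loudly on unknowns."""
    unknown = [ph for ph in produced if ph not in label_map]
    if unknown:
        raise KeyError(f"phonemes outside the inventory: {unknown}")
    return [label_map[ph] for ph in produced]


# Toy inventory (assumption): the real order comes from vectors.csv.
toy_inventory = ["k", "ae", "t", "d"]
label_map = build_label_map(toy_inventory)
print(encode(["k", "ae", "d"], label_map))
```

With a 40-phoneme inventory this yields ids 2..41, which matches the 42-way head. Raising on unknown symbols, rather than skipping them, surfaces any produced label that falls outside broad-40 before training starts.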
- [ ] Step 2: Pilot gate (small, fast). MUST pass before the full run. On the pod:

python train_l2.py --wav-root /workspace/wavroot --train data/train.jsonl --val data/test.jsonl --pilot --pilot-train 200 --epochs 2 --checkpoint-dir ckpt/pilot_faithful

Expected: loss decreases; a held-out sanity decode of one val utt produces a multi-phoneme connected-speech transcript (NOT the 5-phoneme collapse the word-FT showed). If the pilot collapses, STOP and report: the connected-speech data/labels need inspection before burning the full run.
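The sanity decode in the pilot gate can be sketched as a greedy CTC decode followed by a rough collapse check. This assumes per-frame argmax ids are already available and uses the id layout from Step 1 (pad=0, blank=1). The `min_distinct` cutoff is a hypothetical heuristic, not a value from PHON-139.

```python
from itertools import groupby

PAD_ID, BLANK_ID = 0, 1


def ctc_greedy_decode(frame_ids):
    """Collapse repeated frame labels, then drop blank and pad ids."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label not in (PAD_ID, BLANK_ID)]


def looks_collapsed(decoded, min_distinct=6):
    """Heuristic flag: too few distinct phonemes for a connected sentence."""
    return len(set(decoded)) < min_distinct


# Frames: blank, blank, 2, 2, blank, 2, 3, 3, pad
decoded = ctc_greedy_decode([1, 1, 2, 2, 1, 2, 3, 3, 0])
print(decoded, looks_collapsed(decoded))
```

The blank between the two runs of id 2 is what lets CTC emit the same phoneme twice in a row, so the merge of repeats has to happen before blanks are removed.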
- [ ] Step 3: Full run, ≥2 seeds (parallel containers if provisioned).

for s in 1 2; do
  python train_l2.py --wav-root /workspace/wavroot --train data/train.jsonl --val data/test.jsonl --epochs 4 --seed $s --checkpoint-dir ckpt/faithful_s$s --checkpoint-every 300
done

(Run seeds on separate pods/containers for speed.)

- [ ] Step 4: Verify checkpoints + val PER.

ssh <pod> "ls -lh /workspace/phon142/ckpt/faithful_s*/state.pt"

Record per-seed val PER from the train log.
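For reference when reading the val PER from the log, phoneme error rate is conventionally the edit distance between hypothesis and reference divided by the reference length. The sketch below computes a corpus-level PER under that convention. Whether the PHON-139 trainer pools edits over the corpus or averages per utterance is not stated here, so treat the pooling as an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(pairs):
    """Corpus-level PER: total edits divided by total reference length."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    ref_len = sum(len(ref) for ref, _ in pairs)
    return edits / ref_len if ref_len else 0.0


pairs = [(["k", "ae", "t"], ["k", "ae", "d"]), (["s"], ["s"])]
print(phoneme_error_rate(pairs))
```

Here the reference is the produced gold sequence, not the canonical one, so a model that normalizes toward canonical pronunciations is penalized.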
- [ ] Step 5: Pull checkpoints back; commit the trainer (NOT the 3.5GB ckpts; gitignore them).

echo "ckpt/" >> research/2026-06-05-phon-142-ft-l2/.gitignore
# rsync ckpt/faithful_s*/state.pt down to research/2026-06-05-phon-142-ft-l2/ckpt/ (gitignored, local)
git add research/2026-06-05-phon-142-ft-l2/train_l2.py research/2026-06-05-phon-142-ft-l2/.gitignore
git commit -m "train(phon-142): faithful connected-speech L2 FT trainer + checkpoints (local)"
Task 4: L1-encoder variant (model_l1_encoder.py + train_l2_l1.py)
Files: Create model_l1_encoder.py, train_l2_l1.py
- [ ] Step 1: Architecture (model_l1_encoder.py). Wrap the base wav2vec2-CTC; add an auxiliary L1 head over the mean-pooled encoder output; fuse the L1 embedding into the CTC head input (El Kheir blueprint):

import torch, torch.nn as nn
from transformers import AutoModelForCTC

N_L1 = 6  # Arabic, Chinese, Hindi, Korean, Spanish, Vietnamese (sorted)
L1S = ["Arabic","Chinese","Hindi","Korean","Spanish","Vietnamese"]


class L1AwareCTC(nn.Module):
    def __init__(self, base_id, n_labels=42, l1_emb=64):
        super().__init__()
        self.base = AutoModelForCTC.from_pretrained(base_id)
        H = self.base.config.hidden_size
        self.base.lm_head = nn.Identity()  # we own the head
        self.l1_clf = nn.Linear(H, N_L1)  # aux L1 classifier (on mean-pooled)
        self.l1_emb = nn.Embedding(N_L1, l1_emb)
        self.ctc_head = nn.Linear(H + l1_emb, n_labels)  # fused head

    def forward(self, input_values, l1_id=None):
        # The fused head always needs an L1 id, so fail with a clear message instead of
        # an opaque embedding error when the caller forgets it.
        if l1_id is None:
            raise ValueError("l1_id is required: true L1 in training, declared L1 at inference")
        h = self.base.wav2vec2(input_values).last_hidden_state  # [B,T,H]
        pooled = h.mean(dim=1)  # [B,H]
        l1_logits = self.l1_clf(pooled)  # [B,6]
        # train: use the TRUE l1_id for the embedding (teacher-forced); infer: declared l1_id
        emb = self.l1_emb(l1_id)  # [B,l1_emb]
        fused = torch.cat([h, emb.unsqueeze(1).expand(-1, h.size(1), -1)], dim=-1)
        return self.ctc_head(fused), l1_logits  # CTC logits [B,T,42], l1 logits [B,6]
- [ ] Step 2: Trainer (train_l2_l1.py): reuse train_l2.py's data loop; loss L = L_CTC + λ·CE(l1_logits, true_l1) (λ=0.3); pass l1_id per batch (from the row's l1). Args mirror train_l2.py + --lambda-l1.
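The combined objective in Step 2 can be written out against the module in Step 1. The notation is introduced here for illustration and is not taken from the PHON-139 trainer: $h_t$ are the encoder frames, $\bar{h}$ their mean over time, $W, b$ the weights of `l1_clf`, $\ell$ the true L1 index, $x$ the audio, and $z$ the produced label sequence.

```latex
\begin{aligned}
\bar{h} &= \frac{1}{T}\sum_{t=1}^{T} h_t, \qquad
\hat{p}_{L1} = \operatorname{softmax}(W\bar{h} + b), \\
\mathcal{L}_{\mathrm{CE}} &= -\log \hat{p}_{L1}(\ell), \\
\mathcal{L} &= \mathcal{L}_{\mathrm{CTC}}(z \mid x, \ell) + \lambda\,\mathcal{L}_{\mathrm{CE}},
\qquad \lambda = 0.3 .
\end{aligned}
```

With six L1s, an uninformative classifier gives $\mathcal{L}_{\mathrm{CE}} = \ln 6 \approx 1.79$, so the auxiliary term starts near $0.3 \times 1.79 \approx 0.54$ and should fall as L1 accuracy rises above 1/6.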
- [ ] Step 3: Pilot gate (same as Task 3 Step 2, with train_l2_l1.py): confirm CTC loss drops AND L1 classification accuracy rises above chance (1/6). STOP if either fails.
- [ ] Step 4: Full run ≥2 seeds (parallel), ckpt/l1enc_s{seed}. Verify checkpoints + val PER + val L1-acc.
- [ ] Step 5: Commit trainer + module (ckpts gitignored):

git add research/2026-06-05-phon-142-ft-l2/model_l1_encoder.py research/2026-06-05-phon-142-ft-l2/train_l2_l1.py
git commit -m "train(phon-142): El Kheir-style L1-aware encoder variant + trainer"
Phase 3 — Scoring prior (local, test-driven)
Task 5: L1 scoring-prior (scoring_prior.py)
Files: Create scoring_prior.py, test_scoring_prior.py
- [ ] Step 1: Failing test

from scoring_prior import build_prior, classify_with_prior


def test_l1_typical_sub_pulled_to_variant():
    # train rows: Spanish frequently produces b for canonical v at onset
    rows = [{"l1":"Spanish","canonical":"v","produced":"b","position":"onset"}]*20
    prior = build_prior(rows)
    # an onset v->b for Spanish at moderate cos_dist should classify variant (L1-typical)
    assert classify_with_prior("v","b","onset","Spanish",cos_dist=0.30,prior=prior) == "variant"
    # the same substitution from Korean (unseen for that L1) stays error
    assert classify_with_prior("v","b","onset","Korean",cos_dist=0.30,prior=prior) == "error"
- [ ] Step 2: Run, expect FAIL.
- [ ] Step 3: Implement. build_prior(rows) → Laplace-smoothed P(produced|canonical,L1,position) counts from the train split. classify_with_prior(...): classify variant if cos_dist < T_PHON126 (0.112) OR the L1-conditioned channel probability of that produced-given-canonical-at-position exceeds a threshold P_MIN; else error. (Deletions → error.)
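One way Step 3 could be implemented so that the Step 1 test passes is sketched below. Several values are assumptions rather than decisions from the spec: `P_MIN = 0.10` is a placeholder to be tuned, the smoothing vocabulary is fixed at the broad-40 size, and a deletion is represented as `produced=None`. `channel_prob` is a hypothetical helper name.

```python
from collections import Counter

T_PHON126 = 0.112   # cos_dist threshold quoted in the plan
P_MIN = 0.10        # assumption: placeholder channel-probability threshold
N_PHONES = 40       # assumption: smoothing vocabulary = broad-40 inventory size
ALPHA = 1.0         # Laplace (add-one) smoothing


def build_prior(rows):
    """Count produced phones per (canonical, l1, position) context."""
    pair_counts, context_counts = Counter(), Counter()
    for r in rows:
        ctx = (r["canonical"], r["l1"], r["position"])
        pair_counts[ctx + (r["produced"],)] += 1
        context_counts[ctx] += 1
    return {"pair": pair_counts, "context": context_counts}


def channel_prob(prior, canonical, produced, position, l1):
    """Laplace-smoothed P(produced | canonical, l1, position)."""
    ctx = (canonical, l1, position)
    num = prior["pair"][ctx + (produced,)] + ALPHA
    den = prior["context"][ctx] + ALPHA * N_PHONES
    return num / den


def classify_with_prior(canonical, produced, position, l1, cos_dist, prior):
    """Return 'variant' or 'error' for one aligned token."""
    if produced is None:            # deletion: nothing was produced
        return "error"
    if cos_dist < T_PHON126:        # acoustically close enough on its own
        return "variant"
    if channel_prob(prior, canonical, produced, position, l1) > P_MIN:
        return "variant"            # L1-typical substitution
    return "error"
```

Smoothing over the fixed inventory size, rather than over the symbols seen so far, matters for the test: an unseen context then gets probability 1/40, which stays below the threshold, while the 20 Spanish observations give 21/60.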
- [ ] Step 4: Run test, expect PASS. Commit.

git add research/2026-06-05-phon-142-ft-l2/scoring_prior.py research/2026-06-05-phon-142-ft-l2/test_scoring_prior.py
git commit -m "feat(phon-142): L1 scoring-prior (P(produced|canonical,L1,position))"
Phase 4 — Evaluation & report
Task 6: Run the 4-chain matrix on held-out test (eval_matrix.py)
Files: Create eval_matrix.py
- [ ] Step 1: Implement. For each test utt, transcribe with each model and score (reuse PHON-129 cos_dist + wper_align):
  - chain 0: off-the-shelf (the local phonolex_audio or direct HF load).
  - chain 1: FT-L2-faithful (load ckpt/faithful_s1/state.pt, decode like transcribe_ft.py).
  - chain 2: FT-L2-L1-encoder (load L1AwareCTC, pass the row's L1).
  - chain 3: chain-1 transcript + scoring_prior.classify_with_prior.

  Emit per-token rows: utt, speaker, l1, canonical, perceived(gold), errortype, chain, cos_dist, collapsed(bool), class_pred. (Models can be loaded locally for eval; no serving registry needed for the study.)
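To make the per-token rows concrete, the text-alignment step can be pictured as an edit-distance alignment between the canonical sequence and the transcript. The sketch below uses plain unit costs and is not the PHON-129 wper_align function, whose weighting is not shown in this plan. The op codes mirror the errortype codes used in the Task 1 tests.

```python
def align(canonical, hyp):
    """Align two phoneme lists by unit-cost edit distance.

    Returns (canonical_phone, hyp_phone, op) triples with op in
    {"ok", "s", "d", "a"}; None marks the missing side of a deletion/addition.
    """
    n, m = len(canonical), len(hyp)
    # cost[i][j] = edit distance between canonical[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (canonical[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner, preferring the diagonal.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (canonical[i - 1] != hyp[j - 1])):
            op = "ok" if canonical[i - 1] == hyp[j - 1] else "s"
            out.append((canonical[i - 1], hyp[j - 1], op))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out.append((canonical[i - 1], None, "d"))
            i -= 1
        else:
            out.append((None, hyp[j - 1], "a"))
            j -= 1
    return out[::-1]


print(align(["k", "ae", "t"], ["k", "ae", "d"]))
```

Under this picture, a token counts as collapsed when the gold errortype is a substitution but the aligned transcript phone equals the canonical one.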
- [ ] Step 2: Run on the 6 held-out speakers → eval_rows.parquet. Verify row counts per chain match.
- [ ] Step 3: Commit eval_matrix.py (parquet gitignored).
Task 7: Comparison metrics + RESULTS.md (metrics_matrix.py)
Files: Create metrics_matrix.py, RESULTS.md
- [ ] Step 1: Implement metrics (extend PHON-129 02_metrics.py): per chain, pooled + per-L1: canonical-collapse rate at sub positions, D1 (Mann-Whitney, ok<sub), D2 (ok_p75<sub_p25), D3 (Spearman), FRR (fraction of ok/L1-typical tokens predicted error), and transcriber PER vs produced gold.
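Three of these metrics can be sketched directly over the per-token rows from Task 6. The sketch assumes the row fields listed there, the errortype codes "ok" and "s" from the Task 1 tests, and inclusive quartiles for D2. It also simplifies FRR to gold-ok tokens only, since how L1-typical tokens are identified is left to PHON-129's 02_metrics.py.

```python
from statistics import quantiles


def collapse_rate(rows):
    """Share of gold-substitution tokens the transcriber rendered as canonical."""
    subs = [r for r in rows if r["errortype"] == "s"]
    return sum(r["collapsed"] for r in subs) / len(subs) if subs else 0.0


def false_rejection_rate(rows):
    """Share of gold-ok tokens that the chain classified as 'error'."""
    oks = [r for r in rows if r["errortype"] == "ok"]
    return sum(r["class_pred"] == "error" for r in oks) / len(oks) if oks else 0.0


def d2_pass(rows):
    """D2: 75th percentile of ok cos_dist lies below the 25th percentile of sub cos_dist."""
    ok = [r["cos_dist"] for r in rows if r["errortype"] == "ok"]
    sub = [r["cos_dist"] for r in rows if r["errortype"] == "s"]
    if len(ok) < 2 or len(sub) < 2:
        return False  # too few tokens to estimate quartiles
    ok_p75 = quantiles(ok, n=4, method="inclusive")[2]
    sub_p25 = quantiles(sub, n=4, method="inclusive")[0]
    return ok_p75 < sub_p25


# Toy rows for one chain (illustrative values only).
toy_rows = [
    {"errortype": "ok", "collapsed": False, "class_pred": "variant", "cos_dist": 0.0},
    {"errortype": "ok", "collapsed": False, "class_pred": "error", "cos_dist": 0.05},
    {"errortype": "s", "collapsed": True, "class_pred": "variant", "cos_dist": 0.0},
    {"errortype": "s", "collapsed": False, "class_pred": "error", "cos_dist": 0.4},
]
print(collapse_rate(toy_rows), false_rejection_rate(toy_rows), d2_pass(toy_rows))
```

Filtering the rows by chain and by L1 before calling these functions gives the pooled and per-L1 cells of the ranking tables.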
- [ ] Step 2: Run; build the ranking tables (chains × metrics, pooled + per-L1).
- [ ] Step 3: Write RESULTS.md: the 4-chain ranking with: (a) GO/NO-GO on FT-L2-faithful replacing off-the-shelf as Model #2's transcriber (target: collapse ≪ 59%, D2 PASS), and (b) does L1-conditioning help + encoder (chain 2) vs scorer (chain 3) verdict (target: lower FRR without PER regression). Honest per-L1 notes.
- [ ] Step 4: Commit report.

git add research/2026-06-05-phon-142-ft-l2/{eval_matrix.py,metrics_matrix.py,RESULTS.md}
git commit -m "research(phon-142): 4-chain comparison metrics + RESULTS"
Task 8: Tear down pods + finalize
- [ ] Step 1: Pull all checkpoints down (gitignored local), then runpodctl remove pod <id> for each pod (stop billing). Verify runpodctl get pod shows none running.
- [ ] Step 2: File PHON-142 in Jira (verified next free key) linking this spec + RESULTS; set status per outcome.
- [ ] Step 3: Update memory ([[project_audio_targeted_models]]) with the verdict (which transcriber + where L1 lives).
Phase 5 — Conditional: serving registry (only if a winner emerges)
Task 9 (conditional): multi-model registry in phonolex_audio
Only if FT-L2-faithful (or L1-encoder) wins and we want it live for Model #2:
- [ ] Extend packages/audio/src/phonolex_audio/{server.py,__main__.py} to load a registry of named models (off-the-shelf, ft-l2, ft-child), transcriber selects per request, /compare takes any pair, each carries its own coverage/limitations. Tests in packages/audio/tests/. Rename the PHON-139 --ft-checkpoint path to ft-child. (Full TDD plan for this written when the study picks a winner — out of scope until then.)
Self-Review
Spec coverage: §2 matrix → Tasks 6/7 (all 4 chains) ✓; §3 data/split → Task 1 ✓; §4 faithful → Task 3, L1-encoder → Task 4, scoring-prior → Task 5 ✓; §5 scoring (no segmenter) → Task 6 ✓; §6 RunPod/parallel/checkpoint/model-trainer → Tasks 2–4 ✓; §7 eval metrics (collapse/D1-3/FRR/PER) → Task 7 ✓; §8 serving → Task 9 (conditional) ✓; §9 out-of-scope (redo-child, forced-align) honored; §10 RESULTS → Task 7 ✓.
Placeholder scan: training-internal details (full loop) are intentionally delegated to the model-trainer agent with exact architecture (model_l1_encoder.py given verbatim) + the PHON-139 trainer to lift — this is research granularity, not a TODO. Data/prior/metrics tasks have complete code or pinned tests. No "TBD".
Type/name consistency: reconstruct_produced, SPK_L1, TRAIN_SPK/TEST_SPK, L1AwareCTC(forward → ctc_logits, l1_logits), build_prior/classify_with_prior, broad-40 id offset (pad=0,blank=1,phonemes=2..41) consistent across tasks and with PHON-139/PHON-129.
Gates: pilot-before-full on both training tasks (collapse check), speaker-disjoint split test, scoring-prior unit test, metric pin to PHON-126 — fail-fast before GPU spend.