PHON-142 FT-L2 L1-Conditioning Comparative Study: Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking. Training tasks (3, 4) should be dispatched to the model-trainer agent.

Goal: Train a faithful connected-speech L2 transcriber + an L1-conditioned variant, and rank 4 transcribe→score chains (off-the-shelf / FT-L2-faithful / FT-L2-L1-encoder / faithful+L1-scoring-prior) on held-out L2-ARCTIC across all 6 L1s, to decide Model #2's production transcriber and whether L1-conditioning earns its place.

Architecture: Anti-collapse CTC fine-tuning of wav2vec2-lv-60-espeak toward produced broad-40 phonemes (PHON-139 recipe) on L2-ARCTIC connected sentences; an El Kheir-style auxiliary-L1-head variant; a statistical L1 scoring-prior. Whole-sequence transcript → text-align to known canonical (PHON-129 metric, no audio segmenter). Trained on RunPod GPU pods, parallel.

Tech Stack: PyTorch, HuggingFace transformers (wav2vec2 CTC), Polars, librosa/soundfile, runpodctl + SSH, the PHON-129 eval harness.

Spec: docs/superpowers/specs/2026-06-05-phon-142-ft-l2-l1-transcriber-study.md

Lift from: research/2026-06-03-phon-139-transcriber-ft/train.py (trainer), research/2026-06-05-phon-129-l2-accent-scorer/01_run_l2arctic.py (parser + cos_dist).

File Structure

All new work under research/2026-06-05-phon-142-ft-l2/:

- build_l2_dataset.py: reconstruct produced-label dataset (all 6 L1s) + speaker-held-out split → data/{train,test}.jsonl. One responsibility: gold → training/eval JSONL.
- train_l2.py: connected-speech faithful FT (adapts PHON-139 train.py). One responsibility: train the faithful model.
- model_l1_encoder.py: the L1-aware model module (base + aux-L1 head + fusion). One responsibility: the architecture.
- train_l2_l1.py: thin trainer wrapping model_l1_encoder.py (reuses train_l2.py data/loop). One responsibility: train the L1-encoder model.
- scoring_prior.py: estimate + apply P(produced|canonical,L1,position). One responsibility: the L1 prior.
- eval_matrix.py: run all 4 chains on the test split, emit per-token rows. One responsibility: produce the comparison rows.
- metrics_matrix.py: collapse/D1-D3/FRR/PER, pooled + per-L1, across chains → tables. One responsibility: the comparison report.
- runpod/{provision.sh, sync.sh, run_training.sh}: pod lifecycle + data sync + launch.
- RESULTS.md: the ranking + GO recommendation.

Reuse unchanged: PHON-129 01_run_l2arctic.py parsing helpers (import them), score_fixtures.json (metric pin), packages/features/outputs/vectors.csv (broad-40 inventory).

Phase 0: Data preparation (local, test-driven)

Task 1: Build the produced-label dataset + speaker-held-out split

Files:
- Create: research/2026-06-05-phon-142-ft-l2/build_l2_dataset.py
- Create: research/2026-06-05-phon-142-ft-l2/test_build_l2_dataset.py

[ ] Step 1: Write the failing test

# test_build_l2_dataset.py — run: uv run python -m pytest test_build_l2_dataset.py -v
from build_l2_dataset import reconstruct_produced, SPK_L1, TRAIN_SPK, TEST_SPK

def test_reconstruct_produced_uses_perceived_at_subs_canonical_elsewhere():
    # tokens: (canonical, perceived, errortype)
    toks = [("k", "k", "ok"), ("ae", "ae", "ok"), ("t", "d", "s")]
    assert reconstruct_produced(toks) == ["k", "ae", "d"]  # produced = perceived where sub

def test_reconstruct_drops_deletions_and_keeps_additions():
    toks = [("k", "k", "ok"), ("t", "sil", "d"), ("sil", "s", "a")]
    # deletion -> phone omitted from produced; addition -> extra produced phone present
    assert reconstruct_produced(toks) == ["k", "s"]

def test_split_is_speaker_disjoint_and_covers_6_l1s():
    assert set(TRAIN_SPK).isdisjoint(set(TEST_SPK))
    assert {SPK_L1[s] for s in TEST_SPK} == {"Arabic","Chinese","Hindi","Korean","Spanish","Vietnamese"}
    assert len(TEST_SPK) == 6 and len(TRAIN_SPK) == 18

[ ] Step 2: Run it, expect FAIL (build_l2_dataset missing). Run: cd research/2026-06-05-phon-142-ft-l2 && uv run python -m pytest test_build_l2_dataset.py -v
[ ] Step 3: Implement build_l2_dataset.py

Import the PHON-129 parser (don't re-derive): parse_annotation, the IPA-tier logic, SPK_L1. Key pieces:

import sys, json
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "2026-06-05-phon-129-l2-accent-scorer"))
# Reuse the validated parser + speaker map from the PHON-129 harness:
from importlib import import_module
_h = import_module("01_run_l2arctic")  # module name starts with a digit -> import_module
parse_annotation = _h.parse_annotation
SPK_L1 = _h.SPK_L1   # {ABA:Arabic, ...} (24 speakers; suitcase_corpus excluded below)

L2 = Path("/Volumes/ExternalData2/audio-datasets/l2arctic")
# Held-out: 1 speaker/L1 to TEST (pick deterministically — last alphabetically per L1), rest TRAIN.
_by_l1 = {}
for spk, l1 in sorted(SPK_L1.items()):
    _by_l1.setdefault(l1, []).append(spk)
TEST_SPK  = sorted(v[-1] for v in _by_l1.values())          # 6
TRAIN_SPK = sorted(s for v in _by_l1.values() for s in v[:-1])  # 18

def reconstruct_produced(tokens):
    """tokens: list[(canonical, perceived, errortype)] -> produced broad phone list.
    ok/sub -> use perceived; deletion ('d') -> omit; addition ('a') -> include perceived.
    'sil' is never a phone."""
    out = []
    for canon, perceived, et in tokens:
        if et == "d":            # canonical phone deleted -> not produced
            continue
        ph = perceived
        if ph and ph != "sil":
            out.append(ph)
    return out

def main():
    out_dir = Path(__file__).resolve().parent / "data"; out_dir.mkdir(exist_ok=True)
    for split, spks in (("train", TRAIN_SPK), ("test", TEST_SPK)):
        rows = []
        for spk in spks:
            for tg in sorted((L2 / spk / "annotation").glob("*.TextGrid")):
                wav = L2 / spk / "wav" / f"{tg.stem}.wav"
                if not wav.exists(): continue
                toks = parse_annotation(tg)              # [(canon, perceived, et), ...]
                produced = reconstruct_produced(toks)
                if not produced: continue
                rows.append({"wav": str(wav), "speaker": spk, "l1": SPK_L1[spk],
                             "utt": tg.stem, "produced": produced,
                             "canonical": [c for c,_,_ in toks if c != "sil"]})
        # one JSON object per line, newline-terminated so `wc -l` matches the utt count
        (out_dir / f"{split}.jsonl").write_text("".join(json.dumps(r) + "\n" for r in rows))
        print(f"{split}: {len(rows)} utts, {len(spks)} speakers")

if __name__ == "__main__":
    main()

Note: confirm the actual signature of the PHON-129 parse_annotation (it may return per-utt token lists keyed differently) and adapt the call; the test pins reconstruct_produced which is self-contained.

[ ] Step 4: Run the test (expect PASS), then build the data: uv run python -m pytest test_build_l2_dataset.py -v, then uv run python build_l2_dataset.py. Expected output: train: ~2700 utts, 18 speakers / test: ~900 utts, 6 speakers.

[ ] Step 5: Sanity-check the dataset (no speaker leak, produced≠canonical where subs exist):

uv run python -c "
import json,collections
tr=[json.loads(l) for l in open('data/train.jsonl')]; te=[json.loads(l) for l in open('data/test.jsonl')]
assert not ({r['speaker'] for r in tr} & {r['speaker'] for r in te}), 'speaker leak'
print('train L1s:', collections.Counter(r['l1'] for r in tr))
print('test  L1s:', collections.Counter(r['l1'] for r in te))
print('mean produced len:', sum(len(r['produced']) for r in tr)/len(tr))
print('train utts with produced != canonical:', sum(r['produced'] != r['canonical'] for r in tr))
"

[ ] Step 6: Commit (.gitignore the data/ JSONL + any wav copies — Tier B, not committed):

echo "data/" >> research/2026-06-05-phon-142-ft-l2/.gitignore
git add research/2026-06-05-phon-142-ft-l2/build_l2_dataset.py research/2026-06-05-phon-142-ft-l2/test_build_l2_dataset.py research/2026-06-05-phon-142-ft-l2/.gitignore
git commit -m "data(phon-142): produced-label L2-ARCTIC dataset builder + speaker-held-out split"

Phase 1: RunPod environment

Task 2: Provision a GPU pod, sync data + code, verify

Files: Create research/2026-06-05-phon-142-ft-l2/runpod/{provision.sh,sync.sh}

[ ] Step 1: Provision script (runpod/provision.sh) — a single GPU pod (e.g. RTX A5000/4090, PyTorch image). The user has run this pattern before; use runpodctl:

#!/usr/bin/env bash
# Provision a RunPod GPU pod for FT-L2 training. Prints the pod id + ssh.
set -euo pipefail
runpodctl create pod \
  --name phon142-ft-l2 \
  --gpuType "NVIDIA GeForce RTX 4090" \
  --imageName "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04" \
  --gpuCount 1 --volumeSize 60 --containerDiskSize 30 --ports "22/tcp"
runpodctl get pod   # capture id + ssh endpoint

[ ] Step 2: Run it + record the pod id / SSH endpoint. Verify GPU: ssh <pod> "nvidia-smi | head -15 && python -c 'import torch;print(torch.cuda.is_available())'" Expected: GPU listed, True.

[ ] Step 3: Sync script (runpod/sync.sh) — push the trainer code + the (gitignored) data JSONL + the referenced wavs. Because the JSONL holds absolute Mac wav paths, rewrite them to the pod path on sync, or rsync the wavs into a mirrored tree and pass a --wav-root to the trainer. Implement the --wav-root approach:

#!/usr/bin/env bash
set -euo pipefail
POD="$1"   # ssh target
rsync -az research/2026-06-05-phon-142-ft-l2/ "$POD:/workspace/phon142/"
# mirror only the wavs referenced by train+test (keeps upload small)
uv run python - <<'PY'
import json,sys
paths={r["wav"] for f in ("train","test") for r in map(json.loads, open(f"research/2026-06-05-phon-142-ft-l2/data/{f}.jsonl"))}
open("/tmp/phon142_wavs.txt","w").write("\n".join(sorted(paths)))
PY
rsync -az --files-from=/tmp/phon142_wavs.txt / "$POD:/workspace/wavroot/"
ssh "$POD" "cd /workspace/phon142 && pip install -q transformers librosa soundfile polars numpy"

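(train_l2.py takes --wav-root /workspace/wavroot and resolves wav paths relative to it. rsync --files-from removes the leading / from each listed path, so a wav recorded as /Volumes/ExternalData2/audio-datasets/l2arctic/<spk>/wav/<utt>.wav lands at /workspace/wavroot/Volumes/ExternalData2/audio-datasets/l2arctic/<spk>/wav/<utt>.wav. The trainer must strip that leading / too before joining, because pathlib discards the left operand when the right one is absolute.)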
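The --wav-root resolution could look roughly like this on the trainer side. This is a minimal sketch under stated assumptions: resolve_wav is a hypothetical helper name (not taken from the PHON-139 trainer), the JSONL still holds the absolute Mac paths, and the example speaker/utterance path is illustrative only.

```python
from pathlib import PurePosixPath


def resolve_wav(wav_root: str, stored_path: str) -> PurePosixPath:
    """Map an absolute path recorded in the JSONL onto the mirrored tree.

    Hypothetical helper. Joining an absolute path onto wav_root would
    silently discard wav_root, so the leading "/" is stripped first.
    """
    relative = stored_path.lstrip("/")
    return PurePosixPath(wav_root) / relative


# Illustrative path only; any row's "wav" value behaves the same way.
stored = "/Volumes/ExternalData2/audio-datasets/l2arctic/ABA/wav/arctic_a0001.wav"

# Naive join: the absolute right-hand side wins and wav_root is lost.
naive = PurePosixPath("/workspace/wavroot") / stored

# Stripped join: lands inside the mirrored tree created by sync.sh.
resolved = resolve_wav("/workspace/wavroot", stored)

print(naive)
print(resolved)
```

PurePosixPath is used here so the example behaves the same on any OS; on the pod, pathlib.Path would do the same job.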

[ ] Step 4: Verify data on pod: ssh <pod> "wc -l /workspace/phon142/data/*.jsonl && ls /workspace/wavroot | head"

[ ] Step 5: Commit the runpod scripts:

git add research/2026-06-05-phon-142-ft-l2/runpod/
git commit -m "infra(phon-142): runpod pod provision + data sync scripts"

Phase 2: Training (dispatch to the `model-trainer` agent)

Task 3: Faithful connected-speech FT (`train_l2.py`)

Files: Create research/2026-06-05-phon-142-ft-l2/train_l2.py

[ ] Step 1: Adapt PHON-139 train.py. Lift its CTC trainer wholesale (wav2vec2-lv-60-espeak base, Linear(1024→42) broad-40 head, frozen conv front-end, checkpoint policy, recipe A perceived-hard CTC loss). The only substantive changes:
Data loader reads data/train.jsonl (this study's format): each row → (librosa.load(wav_root/wav, sr=16000), produced_labels). Map produced IPA → broad-40 ids (offset 2; pad=0, blank=1) via vectors.csv order. Connected-speech utts (no length cap beyond batch memory).
Args: --wav-root, --train data/train.jsonl, --val data/test.jsonl (held-out speakers as val), --checkpoint-dir ckpt/faithful_s{seed}, --epochs, --seed.
Keep recipe A (hard CTC); drop recipe B / sim-matrix (YAGNI for faithful).
[ ] Step 2: Pilot gate (small, fast). It MUST pass before the full run. On the pod:

python train_l2.py --wav-root /workspace/wavroot --train data/train.jsonl --val data/test.jsonl --pilot --pilot-train 200 --epochs 2 --checkpoint-dir ckpt/pilot_faithful

Expected: loss decreases, and a held-out sanity decode of one val utt produces a multi-phoneme connected-speech transcript (NOT the 5-phoneme collapse the word-FT showed). If the pilot collapses, STOP and report: the connected-speech data/labels need inspection before burning the full run.

[ ] Step 3: Full run, ≥2 seeds. The loop below runs the seeds one after another on a single pod; for speed, launch each seed on its own pod/container instead (if provisioned).

for s in 1 2; do python train_l2.py --wav-root /workspace/wavroot --train data/train.jsonl --val data/test.jsonl --epochs 4 --seed $s --checkpoint-dir ckpt/faithful_s$s --checkpoint-every 300; done

[ ] Step 4: Verify checkpoints + val PER:

ssh <pod> "ls -lh /workspace/phon142/ckpt/faithful_s*/state.pt"

Record per-seed val PER from the train log.
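The label layout from Step 1 (pad=0, blank=1, phonemes from id 2 in inventory order) and the sanity decode from the pilot gate can be sketched together as below. Assumptions are flagged: the function names are hypothetical, the 4-symbol inventory is a toy stand-in for the broad-40 order in vectors.csv, and looks_collapsed with its min_distinct threshold is only one possible reading of "collapse", not the PHON-139 check.

```python
from itertools import groupby

PAD_ID, BLANK_ID, OFFSET = 0, 1, 2


def build_label_map(inventory):
    """Phoneme -> id in inventory order, starting at OFFSET (ids 0 and 1 are reserved)."""
    return {ph: i + OFFSET for i, ph in enumerate(inventory)}


def encode(produced, label_map):
    """Produced phoneme list -> CTC target ids; raises on out-of-inventory symbols."""
    missing = [ph for ph in produced if ph not in label_map]
    if missing:
        raise KeyError(f"phonemes not in inventory: {missing}")
    return [label_map[ph] for ph in produced]


def greedy_ctc_decode(frame_ids, label_map):
    """Per-frame argmax ids -> phonemes: merge repeats, then drop blank and pad."""
    id_to_ph = {i: ph for ph, i in label_map.items()}
    merged = [k for k, _ in groupby(frame_ids)]
    return [id_to_ph[i] for i in merged if i not in (PAD_ID, BLANK_ID)]


def looks_collapsed(decoded, min_distinct=6):
    """Hypothetical pilot heuristic: flag transcripts with very few distinct phonemes."""
    return len(set(decoded)) < min_distinct


# Toy inventory standing in for the 40 rows of vectors.csv (assumption).
label_map = build_label_map(["k", "ae", "t", "d"])
target_ids = encode(["k", "ae", "d"], label_map)

# Frame-level argmax as a model might emit it: repeats, blanks between phones, trailing pad.
decoded = greedy_ctc_decode([1, 2, 2, 1, 3, 3, 3, 1, 5, 5, 0, 0], label_map)
print(target_ids, decoded)
```

With a 40-symbol inventory the same mapping yields ids 2..41, which matches the 42-way head. Raising on unknown symbols (rather than skipping them) is a deliberate choice in this sketch so that inventory mismatches surface during the pilot instead of silently shortening targets.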

[ ] Step 5: Pull checkpoints back; commit the trainer (NOT the 3.5GB ckpts — gitignore them).

echo "ckpt/" >> research/2026-06-05-phon-142-ft-l2/.gitignore
# rsync ckpt/faithful_s*/state.pt down to research/2026-06-05-phon-142-ft-l2/ckpt/ (gitignored, local)
git add research/2026-06-05-phon-142-ft-l2/train_l2.py research/2026-06-05-phon-142-ft-l2/.gitignore
git commit -m "train(phon-142): faithful connected-speech L2 FT trainer + checkpoints (local)"

Task 4: L1-encoder variant (`model_l1_encoder.py` + `train_l2_l1.py`)

Files: Create model_l1_encoder.py, train_l2_l1.py

[ ] Step 1: Architecture (model_l1_encoder.py). Wrap the base wav2vec2-CTC; add an auxiliary L1 head over the mean-pooled encoder output; fuse the L1 embedding into the CTC head input (El Kheir blueprint):

import torch, torch.nn as nn
from transformers import AutoModelForCTC

N_L1 = 6  # Arabic, Chinese, Hindi, Korean, Spanish, Vietnamese (sorted)
L1S = ["Arabic","Chinese","Hindi","Korean","Spanish","Vietnamese"]

class L1AwareCTC(nn.Module):
    def __init__(self, base_id, n_labels=42, l1_emb=64):
        super().__init__()
        self.base = AutoModelForCTC.from_pretrained(base_id)
        H = self.base.config.hidden_size
        self.base.lm_head = nn.Identity()                 # we own the head
        self.l1_clf = nn.Linear(H, N_L1)                   # aux L1 classifier (on mean-pooled)
        self.l1_emb = nn.Embedding(N_L1, l1_emb)
        self.ctc_head = nn.Linear(H + l1_emb, n_labels)    # fused head
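    def forward(self, input_values, l1_id=None, attention_mask=None):
        # forward the attention mask so padded batches are handled by the encoder
        h = self.base.wav2vec2(input_values, attention_mask=attention_mask).last_hidden_state   # [B,T,H]
        pooled = h.mean(dim=1)                                    # [B,H] (plain mean over all frames)
        l1_logits = self.l1_clf(pooled)                          # [B,6]
        # train: use the TRUE l1_id for the embedding (teacher-forced); infer: declared l1_id
        if l1_id is None:
            # no declared L1: fall back to the aux head's prediction (assumption, not in the spec)
            l1_id = l1_logits.argmax(dim=-1)
        emb = self.l1_emb(l1_id)                                  # [B,l1_emb]
        fused = torch.cat([h, emb.unsqueeze(1).expand(-1, h.size(1), -1)], dim=-1)
        return self.ctc_head(fused), l1_logits                   # CTC logits [B,T,42], l1 logits [B,6]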

[ ] Step 2: Trainer (train_l2_l1.py) = reuse train_l2.py's data loop; loss L = L_CTC + λ·CE(l1_logits, true_l1) (λ=0.3); pass l1_id per batch (from the row's l1). Args mirror train_l2.py + --lambda-l1.
[ ] Step 3: Pilot gate (same as Task 3 Step 2, with train_l2_l1.py): confirm CTC loss drops AND L1 classification accuracy rises above chance (1/6). STOP if either fails.
[ ] Step 4: Full run ≥2 seeds (parallel), ckpt/l1enc_s{seed}. Verify checkpoints + val PER + val L1-acc.
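Written out, the Step 2 objective reads as follows. The notation is introduced here for illustration and is not taken from the spec: x is the audio, y the produced label sequence, ℓ the index of the true L1, and z the six aux-head logits.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CTC}}(x, y) \;+\; \lambda \cdot \mathrm{CE}(z, \ell),
\qquad
\mathrm{CE}(z, \ell) \;=\; -\log \frac{\exp(z_{\ell})}{\sum_{j=1}^{6} \exp(z_{j})},
\qquad
\lambda = 0.3
```

An aux head that spreads probability evenly over the six L1s gives CE = ln 6 ≈ 1.79 and accuracy 1/6, which is the baseline the Step 3 pilot gate has to beat.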

[ ] Step 5: Commit trainer + module (ckpts gitignored):

git add research/2026-06-05-phon-142-ft-l2/model_l1_encoder.py research/2026-06-05-phon-142-ft-l2/train_l2_l1.py
git commit -m "train(phon-142): El Kheir-style L1-aware encoder variant + trainer"

Phase 3: Scoring prior (local, test-driven)

Task 5: L1 scoring-prior (`scoring_prior.py`)

Files: Create scoring_prior.py, test_scoring_prior.py

[ ] Step 1: Failing test

from scoring_prior import build_prior, classify_with_prior

def test_l1_typical_sub_pulled_to_variant():
    # train rows: Spanish frequently produces b for canonical v at onset
    rows = [{"l1":"Spanish","canonical":"v","produced":"b","position":"onset"}]*20
    prior = build_prior(rows)
    # an onset v->b for Spanish at moderate cos_dist should classify variant (L1-typical)
    assert classify_with_prior("v","b","onset","Spanish",cos_dist=0.30,prior=prior) == "variant"
    # the same substitution from Korean (unseen for that L1) stays error
    assert classify_with_prior("v","b","onset","Korean",cos_dist=0.30,prior=prior) == "error"

[ ] Step 2: Run, expect FAIL.
[ ] Step 3: Implement. build_prior(rows) → Laplace-smoothed P(produced|canonical,L1,position) counts from the train split. classify_with_prior(...): classify variant if cos_dist < T_PHON126 (0.112) OR the L1-conditioned channel probability of that produced-given-canonical-at-position exceeds a threshold P_MIN; else error. (Deletions → error.)
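One possible shape for Step 3, as a sketch under explicit assumptions rather than the final module: P_MIN = 0.10 is a placeholder to be tuned, smoothing is add-one over the 40 broad phonemes, the prior is returned as a plain dict, channel_prob is a hypothetical helper, and a deletion is represented by produced=None. Only T_PHON126 = 0.112 and the two public function names come from the plan.

```python
from collections import Counter, defaultdict

T_PHON126 = 0.112   # cos_dist threshold quoted in Step 3
P_MIN = 0.10        # hypothetical value; to be tuned on the train split
N_PHONES = 40       # broad-40 inventory size, used as the smoothing vocabulary
ALPHA = 1.0         # Laplace (add-one) smoothing constant


def build_prior(rows, alpha=ALPHA, n_phones=N_PHONES):
    """Count produced phones per (canonical, l1, position) context."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[(r["canonical"], r["l1"], r["position"])][r["produced"]] += 1
    return {"counts": counts, "alpha": alpha, "n_phones": n_phones}


def channel_prob(prior, canonical, produced, position, l1):
    """Laplace-smoothed P(produced | canonical, L1, position)."""
    ctx = prior["counts"].get((canonical, l1, position), Counter())
    total = sum(ctx.values())
    a, v = prior["alpha"], prior["n_phones"]
    return (ctx[produced] + a) / (total + a * v)


def classify_with_prior(canonical, produced, position, l1, cos_dist, prior, p_min=P_MIN):
    """Return 'variant' or 'error' following the OR rule in Step 3."""
    if produced is None:                      # deletion (assumed encoding): always error
        return "error"
    if cos_dist < T_PHON126:                  # phonetically close enough on its own
        return "variant"
    if channel_prob(prior, canonical, produced, position, l1) > p_min:
        return "variant"                      # L1-typical substitution
    return "error"


# Same scenario as the Step 1 unit test.
rows = [{"l1": "Spanish", "canonical": "v", "produced": "b", "position": "onset"}] * 20
prior = build_prior(rows)
spanish = classify_with_prior("v", "b", "onset", "Spanish", cos_dist=0.30, prior=prior)
korean = classify_with_prior("v", "b", "onset", "Korean", cos_dist=0.30, prior=prior)
print(spanish, korean)
```

With add-one smoothing over 40 phones, an unseen context yields 1/40 = 0.025 and the 20 Spanish rows yield 21/60 = 0.35, so any P_MIN between those two values satisfies the Step 1 test; the real threshold still has to be chosen on data.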

[ ] Step 4: Run test, expect PASS. Commit.

git add research/2026-06-05-phon-142-ft-l2/scoring_prior.py research/2026-06-05-phon-142-ft-l2/test_scoring_prior.py
git commit -m "feat(phon-142): L1 scoring-prior (P(produced|canonical,L1,position))"

Phase 4: Evaluation & report

Task 6: Run the 4-chain matrix on held-out test (`eval_matrix.py`)

Files: Create eval_matrix.py

[ ] Step 1: Implement. For each test utt, transcribe with each model and score (reuse PHON-129 cos_dist + wper_align):
chain 0: off-the-shelf (the local phonolex_audio or direct HF load).
chain 1: FT-L2-faithful (load ckpt/faithful_s1/state.pt, decode like transcribe_ft.py).
chain 2: FT-L2-L1-encoder (load L1AwareCTC, pass the row's L1).
chain 3: chain-1 transcript + scoring_prior.classify_with_prior.
Emit per-token rows: utt, speaker, l1, canonical, perceived(gold), errortype, chain, cos_dist, collapsed(bool), class_pred. (Models can be loaded locally for eval; no serving registry is needed for the study.)
[ ] Step 2: Run on the 6 held-out speakers → eval_rows.parquet. Verify row counts per chain match.
[ ] Step 3: Commit eval_matrix.py (parquet gitignored).
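The Step 2 row-count check could be as small as the sketch below. It assumes the rows are available as a list of dicts (for example via polars.read_parquet("eval_rows.parquet").to_dicts()) and that the chain column holds the integers 0 to 3; both function names are hypothetical.

```python
from collections import Counter


def chain_row_counts(rows):
    """Count per-token rows emitted by each chain."""
    return Counter(r["chain"] for r in rows)


def assert_chains_aligned(rows, expected_chains=(0, 1, 2, 3)):
    """Raise if any chain is missing or the chains disagree on row count."""
    counts = chain_row_counts(rows)
    missing = [c for c in expected_chains if c not in counts]
    if missing:
        raise ValueError(f"chains with no rows: {missing}")
    if len(set(counts.values())) != 1:
        raise ValueError(f"row counts differ across chains: {dict(counts)}")
    return counts


# Toy rows: two canonical tokens scored by each of the four chains.
toy_rows = [{"chain": c, "canonical": ph} for c in range(4) for ph in ("k", "ae")]
counts = assert_chains_aligned(toy_rows)
print(dict(counts))
```

Equal counts are expected because every chain is scored against the same canonical token sequence; a mismatch would point at an alignment or filtering difference between chains.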

Task 7: Comparison metrics + RESULTS.md (`metrics_matrix.py`)

Files: Create metrics_matrix.py, RESULTS.md

[ ] Step 1: Implement metrics (extend PHON-129 02_metrics.py): per chain, pooled + per-L1 — canonical-collapse rate at sub positions, D1 (MW ok<sub), D2 (ok_p75<sub_p25), D3 (Spearman), FRR (fraction of ok/L1-typical tokens predicted error), and transcriber PER vs produced gold.
[ ] Step 2: Run; build the ranking tables (chains × metrics, pooled + per-L1).
[ ] Step 3: Write RESULTS.md — the 4-chain ranking with: (a) GO/NO-GO on FT-L2-faithful replacing off-the-shelf as Model #2's transcriber (target: collapse ≪ 59%, D2 PASS), and (b) does L1-conditioning help + encoder (chain 2) vs scorer (chain 3) verdict (target: lower FRR without PER regression). Honest per-L1 notes.
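Three of the Step 1 metrics (PER, FRR, collapse rate) reduce to simple counting, sketched below with the standard library only. Assumptions: the helper names are hypothetical, errortype uses the ok/s codes seen in the Task 1 tests, and FRR is computed over ok tokens only (the L1-typical part of the definition would need an extra label that this sketch does not have). D1 to D3 are left to the PHON-129 02_metrics.py code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme lists (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def per(pairs):
    """Corpus-level PER: total edits divided by total reference phones."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    ref_len = sum(len(ref) for ref, _ in pairs)
    return edits / ref_len if ref_len else 0.0


def rate(rows, errortype, predicate):
    """Fraction of rows with the given gold errortype that satisfy predicate."""
    subset = [r for r in rows if r["errortype"] == errortype]
    return sum(predicate(r) for r in subset) / len(subset) if subset else float("nan")


# (produced gold, transcriber output) pairs: one substitution, one deletion.
pairs = [(["k", "ae", "d"], ["k", "ae", "t"]), (["s", "ih", "t"], ["s", "ih"])]
toy_rows = [
    {"errortype": "ok", "class_pred": "variant", "collapsed": False},
    {"errortype": "ok", "class_pred": "error", "collapsed": False},
    {"errortype": "s", "class_pred": "error", "collapsed": True},
    {"errortype": "s", "class_pred": "error", "collapsed": False},
]
per_value = per(pairs)
frr = rate(toy_rows, "ok", lambda r: r["class_pred"] == "error")
collapse = rate(toy_rows, "s", lambda r: r["collapsed"])
print(per_value, frr, collapse)
```

Pooling edits before dividing (rather than averaging per-utterance PER) keeps long sentences from being under-weighted; per-L1 numbers follow by filtering pairs and rows on l1 before calling the same functions.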

[ ] Step 4: Commit report.

git add research/2026-06-05-phon-142-ft-l2/{eval_matrix.py,metrics_matrix.py,RESULTS.md}
git commit -m "research(phon-142): 4-chain comparison metrics + RESULTS"

Task 8: Tear down pods + finalize

[ ] Step 1: Pull all checkpoints down (gitignored local), then runpodctl remove pod <id> for each pod (stop billing). Verify runpodctl get pod shows none running.
[ ] Step 2: File PHON-142 in Jira (verified next free key) linking this spec + RESULTS; set status per outcome.
[ ] Step 3: Update memory ([[project_audio_targeted_models]]) with the verdict (which transcriber + where L1 lives).

Phase 5: Conditional: serving registry (only if a winner emerges)

Task 9 (conditional): multi-model registry in `phonolex_audio`

Only if FT-L2-faithful (or L1-encoder) wins and we want it live for Model #2:
- [ ] Extend packages/audio/src/phonolex_audio/{server.py,__main__.py} to load a registry of named models (off-the-shelf, ft-l2, ft-child): the transcriber is selected per request, /compare takes any pair, and each model carries its own coverage/limitations. Tests go in packages/audio/tests/. Rename the PHON-139 --ft-checkpoint path to ft-child. (The full TDD plan for this will be written when the study picks a winner; it is out of scope until then.)

Self-Review

Spec coverage: §2 matrix → Tasks 6/7 (all 4 chains) ✓; §3 data/split → Task 1 ✓; §4 faithful → Task 3, L1-encoder → Task 4, scoring-prior → Task 5 ✓; §5 scoring (no segmenter) → Task 6 ✓; §6 RunPod/parallel/checkpoint/model-trainer → Tasks 2–4 ✓; §7 eval metrics (collapse/D1-3/FRR/PER) → Task 7 ✓; §8 serving → Task 9 (conditional) ✓; §9 out-of-scope (redo-child, forced-align) honored; §10 RESULTS → Task 7 ✓.

Placeholder scan: training-internal details (full loop) are intentionally delegated to the model-trainer agent with exact architecture (model_l1_encoder.py given verbatim) + the PHON-139 trainer to lift — this is research granularity, not a TODO. Data/prior/metrics tasks have complete code or pinned tests. No "TBD".

Type/name consistency: reconstruct_produced, SPK_L1, TRAIN_SPK/TEST_SPK, L1AwareCTC(forward → ctc_logits, l1_logits), build_prior/classify_with_prior, broad-40 id offset (pad=0,blank=1,phonemes=2..41) consistent across tasks and with PHON-139/PHON-129.

Gates: pilot-before-full on both training tasks (collapse check), speaker-disjoint split test, scoring-prior unit test, metric pin to PHON-126 — fail-fast before GPU spend.