Skip to content

v6 Audio — Training Data Union Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Assemble a length-and-population-diverse training union of (16 kHz mono audio, produced broad-40 phoneme labels, metadata) with subject-disjoint train/val/test splits and a passing validation audit — the dataset the unified faithful transcriber will be fine-tuned on.

Architecture: A developer-local Python builder in a new research dir. Per-source extractors (lifted from this session's triage harnesses) each emit manifest rows + cached 16 kHz clips onto the external drive; a merge step normalizes labels to the broad-40 inventory and assigns subject-disjoint splits; an audit step enforces the gate (all-16k, label coverage, length diversity, split disjointness). Sources land in two waves: in-hand (PhonBank child, L2-ARCTIC, LibriSpeech — extractors already exist) then download-dependent enrichment (Common Voice, TORGO).

Tech Stack: Python 3.12, uv, polars (manifest), soundfile + librosa (audio I/O + resample), pytest (validation gates). Reuses packages/features/outputs/vectors.csv (broad-40 inventory), packages/data CMU loader + ARPA→IPA mapping.

Why this is the foundation: the ft-l2 existence proof showed length-generalization survives fine-tuning iff the union contains connected speech. This union deliberately mixes connected (LibriSpeech / L2-ARCTIC sentences) with word-level deviant data (PhonBank), which is the load-bearing property the retrain depends on.


Critical constraints (read before any task)

  • EVERY clip resamples to 16 kHz. PhonBank cache is already 16k; L2-ARCTIC and TORGO are 44.1 kHz; LibriSpeech is 16k native. The bug that cost hours this session: sf.read without resampling fed the model 44.1k → garbage. Always librosa.resample(orig_sr=sr, target_sr=16000) when sr != 16000.
  • Labels are PRODUCED phonemes, not canonical (faithful recipe), normalized to the broad-40 inventory in vectors.csv (the 26-d Bayesian set's segment list — the canonical inventory).
  • Splits are SUBJECT-disjoint, never utterance-random, or the val/test leak.
  • Outputs are gitignored, on the external drive (/Volumes/ExternalData1/audio-union/). Only the scripts + UNION.md are committed. This follows the reservoir convention ([[project_audio_data_reservoir]]).
  • Checkpoint the heavy extraction (clip resampling/caching): skip clips already cached so reruns resume. Reference the checkpoint policy in CLAUDE.md.

File Structure

research/2026-06-06-audio-union/
├── lib_labels.py        # narrow IPA + ARPABET → broad-40 normalization (single source of truth)
├── lib_manifest.py      # manifest row schema + parquet read/write helpers
├── extract_phonbank.py  # PhonBank child %pho → rows + 16k clips  (lift run_triage.load_words)
├── extract_l2arctic.py  # L2-ARCTIC perceived-phone TextGrids → rows + 16k clips (lift l2libri)
├── extract_librispeech.py # LibriSpeech + CMU-G2P → rows + 16k clips (lift l2libri)
├── extract_commonvoice.py # Common Voice CC0 subset → rows + 16k clips   (Wave 2)
├── extract_torgo.py     # TORGO dysarthria → rows + 16k clips            (Wave 2)
├── build_union.py       # orchestrate extractors → merge → subject-disjoint splits → manifest
├── audit_union.py       # THE GATE: 16k check, label coverage, length/pop histograms, split disjointness → UNION.md
├── tests/
│   └── test_union.py    # validation tests (developer-local, data-dependent; NOT in CI)
└── UNION.md             # lab notes: sources, counts, decisions, audit output

# Outputs (gitignored, external drive):
/Volumes/ExternalData1/audio-union/
├── clips_16k/<source>/<id>.wav     # cached resampled clips
└── union_manifest.parquet          # the dataset manifest

Manifest row schema (lib_manifest.py), one row per clip:

field type notes
id str <source>_<natural-key>
source str phonbank | l2arctic | librispeech | commonvoice | torgo
population str child | l2 | clean | dysarthria
length_class str word | sentence
subject str speaker/subject id — splits are disjoint on this
clip_path str absolute path to the 16k wav
produced list[str] produced broad-40 phoneme tokens
n_phonemes int len(produced)
duration_ms int clip duration
split str train | val | test (assigned in build_union)

Task 1: Scaffold + label normalization (lib_labels.py)

Files: - Create: research/2026-06-06-audio-union/lib_labels.py - Create: research/2026-06-06-audio-union/tests/test_union.py

  • [ ] Step 1: Write the failing test
# tests/test_union.py
import sys; from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from lib_labels import to_broad40, INV

def test_inventory_is_40ish_and_has_core_phonemes():
    assert {"p", "t", "k", "s", "ɹ", "i", "ɑ", "tʃ"} <= INV
    assert len(INV) >= 39

def test_narrow_ipa_strips_diacritics_to_broad():
    # length mark, primary stress, dentalized /s/ diacritic → broad /s/
    assert to_broad40("ˈsːe̪") == ["s", "e"]

def test_arpabet_maps_to_broad_ipa():
    # ARPABET with stress digits → broad IPA tokens
    assert to_broad40(["T", "W", "EH1", "N", "T", "IY0"], arpabet=True) == ["t", "w", "ɛ", "n", "t", "i"]

def test_affricates_kept_as_units():
    assert to_broad40("tʃiz") == ["tʃ", "i", "z"]
  • [ ] Step 2: Run it, verify it fails

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k labels -v Expected: FAIL — ModuleNotFoundError: lib_labels.

  • [ ] Step 3: Implement lib_labels.py

Lift the normalize() + INV/TWO/VARIANT logic from research/2026-06-06-audio-triage/run_triage.py, and the ARPABET→IPA map from packages/data (load_arpa_to_ipa). Single source of truth for label normalization across all extractors.

# lib_labels.py
import csv, sys, unicodedata
from pathlib import Path

VEC = Path("packages/features/outputs/vectors.csv")
INV = [r["ipa"] for r in csv.DictReader(open(VEC))]
INV_SET = set(INV)
INV = set(INV)  # public: membership set
TWO = {"tʃ", "dʒ"}
# variant collapses to broad-40 (ASCII g→ɡ, r→ɹ, flap→t, drop glottal/aspiration)
VARIANT = {"r": "ɹ", "g": "ɡ", "ʧ": "tʃ", "ʤ": "dʒ", "ɫ": "l", "ɾ": "t", "ʔ": "", "ʰ": ""}

def _arpa_map():
    sys.path.insert(0, "packages/data/src")
    from phonolex_data.mappings import load_arpa_to_ipa
    return load_arpa_to_ipa()
_ARPA = None

def to_broad40(seq, arpabet: bool = False) -> list[str]:
    """Narrow IPA string OR ARPABET token list → broad-40 IPA tokens."""
    global _ARPA
    if arpabet:
        if _ARPA is None: _ARPA = _arpa_map()
        s = "".join(_ARPA.get("".join(c for c in tok if not c.isdigit()).upper(), "") for tok in seq)
    else:
        s = seq
    s = unicodedata.normalize("NFD", s)
    s = "".join(c for c in s if not unicodedata.combining(c) and c not in "ˈˌːˑ ’.")
    s = "".join(VARIANT.get(c, c) for c in s)
    out, i = [], 0
    while i < len(s):
        if i + 1 < len(s) and s[i:i+2] in TWO:
            out.append(s[i:i+2]); i += 2; continue
        if s[i] in INV_SET:
            out.append(s[i])
        i += 1
    return out
  • [ ] Step 4: Run tests, verify pass

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k labels -v Expected: PASS (4 tests). If to_broad40 ARPABET path differs from the map's casing, fix the lookup, not the test.

  • [ ] Step 5: Commit
git add research/2026-06-06-audio-union/lib_labels.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): broad-40 label normalization (narrow IPA + ARPABET)"

Task 2: Manifest schema + I/O (lib_manifest.py)

Files: - Create: research/2026-06-06-audio-union/lib_manifest.py - Modify: research/2026-06-06-audio-union/tests/test_union.py

  • [ ] Step 1: Write the failing test
# append to tests/test_union.py
from lib_manifest import Row, rows_to_parquet, read_manifest

def test_row_roundtrips_through_parquet(tmp_path):
    rows = [Row(id="phonbank_x", source="phonbank", population="child",
                length_class="word", subject="s1", clip_path="/tmp/x.wav",
                produced=["k", "æ", "t"], duration_ms=600).asdict()]
    p = tmp_path / "m.parquet"
    rows_to_parquet(rows, p)
    df = read_manifest(p)
    assert df["n_phonemes"][0] == 3 and df["produced"][0].to_list() == ["k", "æ", "t"]
  • [ ] Step 2: Run it, verify it fails

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k roundtrip -v Expected: FAIL — ModuleNotFoundError: lib_manifest.

  • [ ] Step 3: Implement lib_manifest.py
# lib_manifest.py
from dataclasses import dataclass, field, asdict
import polars as pl
from pathlib import Path

@dataclass
class Row:
    id: str; source: str; population: str; length_class: str
    subject: str; clip_path: str; produced: list[str]; duration_ms: int
    split: str = "train"
    def asdict(self):
        d = asdict(self); d["n_phonemes"] = len(self.produced); return d

def rows_to_parquet(rows: list[dict], path: Path):
    pl.DataFrame(rows).write_parquet(path)

def read_manifest(path: Path) -> pl.DataFrame:
    return pl.read_parquet(path)
  • [ ] Step 4: Run tests, verify pass

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k roundtrip -v Expected: PASS.

  • [ ] Step 5: Commit
git add research/2026-06-06-audio-union/lib_manifest.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): manifest row schema + parquet I/O"

Task 3: Shared clip cache helper (lib_audio.py)

Files: - Create: research/2026-06-06-audio-union/lib_audio.py - Modify: tests/test_union.py

The single resample-and-cache function every extractor uses. Checkpoint = skip if cached.

  • [ ] Step 1: Write the failing test
# append to tests/test_union.py
import numpy as np, soundfile as sf
from lib_audio import cache_clip, SR

def test_cache_clip_resamples_to_16k_and_is_idempotent(tmp_path):
    src = tmp_path / "in.wav"
    sf.write(src, np.zeros(44100, dtype="float32"), 44100)  # 1s @ 44.1k
    out = cache_clip(str(src), tmp_path / "clips", "torgo", "u1", start_ms=None, end_ms=None)
    arr, sr = sf.read(out)
    assert sr == SR == 16000 and len(arr) == 16000
    mtime = out.stat().st_mtime
    cache_clip(str(src), tmp_path / "clips", "torgo", "u1", None, None)  # rerun
    assert out.stat().st_mtime == mtime  # not rewritten (checkpoint)
  • [ ] Step 2: Run it, verify it fails

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k cache_clip -v Expected: FAIL — ModuleNotFoundError: lib_audio.

  • [ ] Step 3: Implement lib_audio.py
# lib_audio.py
from pathlib import Path
import numpy as np, soundfile as sf
SR = 16000

def cache_clip(src_path, clips_dir: Path, source: str, clip_id: str,
               start_ms, end_ms) -> Path:
    out = Path(clips_dir) / source / f"{clip_id}.wav"
    if out.exists():
        return out                                   # checkpoint: already cached
    out.parent.mkdir(parents=True, exist_ok=True)
    arr, sr = sf.read(src_path, dtype="float32")
    if arr.ndim > 1: arr = arr.mean(axis=1)
    if start_ms is not None and end_ms is not None:
        arr = arr[int(start_ms/1000*sr):int(end_ms/1000*sr)]
    if sr != SR:
        import librosa
        arr = librosa.resample(arr, orig_sr=sr, target_sr=SR)
    sf.write(out, arr.astype("float32"), SR)
    return out
  • [ ] Step 4: Run tests, verify pass

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k cache_clip -v Expected: PASS.

  • [ ] Step 5: Commit
git add research/2026-06-06-audio-union/lib_audio.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): resample-to-16k clip cache (idempotent/checkpointed)"

Task 4: PhonBank child extractor (in-hand)

Files: - Create: research/2026-06-06-audio-union/extract_phonbank.py - Modify: tests/test_union.py

Lift load_words() from research/2026-06-06-audio-triage/run_triage.py (reads /Volumes/ExternalData1/phonbank/dataset_production_new_2026-06-03.jsonl, clips in _utt_cache_16k). population=child, length_class=word, label = to_broad40(actual_phonology), subject = corpus_name + subject.

  • [ ] Step 1: Write the failing test (data-dependent — guarded skip if drive absent)
# append to tests/test_union.py
import pytest
from pathlib import Path
PB = Path("/Volumes/ExternalData1/phonbank/dataset_production_new_2026-06-03.jsonl")

@pytest.mark.skipif(not PB.exists(), reason="ExternalData1 not mounted")
def test_phonbank_extractor_yields_valid_child_word_rows(tmp_path):
    from extract_phonbank import extract
    rows = extract(tmp_path / "clips", limit=20)
    assert len(rows) > 0
    r = rows[0]
    assert r["source"] == "phonbank" and r["population"] == "child" and r["length_class"] == "word"
    assert all(p in __import__("lib_labels").INV for p in r["produced"])
    arr, sr = __import__("soundfile").read(r["clip_path"]); assert sr == 16000
  • [ ] Step 2: Run it, verify it fails

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k phonbank -v Expected: FAIL — ModuleNotFoundError: extract_phonbank (or SKIP if drive unmounted — mount it).

  • [ ] Step 3: Implement extract_phonbank.py
# extract_phonbank.py
import json
from pathlib import Path
from lib_labels import to_broad40
from lib_audio import cache_clip
from lib_manifest import Row

JSONL = Path("/Volumes/ExternalData1/phonbank/dataset_production_new_2026-06-03.jsonl")

def extract(clips_dir: Path, limit: int | None = None) -> list[dict]:
    rows = []
    for line in open(JSONL):
        r = json.loads(line)
        if r.get("corpus_name") == "ACAD" or not r.get("audio_path"): continue
        s, e = r.get("start_ms"), r.get("end_ms")
        if s is None or e is None or e <= s or (e - s) > 30000: continue
        produced = to_broad40(r.get("actual_phonology") or "")
        if not produced: continue
        subj = f"{r['corpus_name']}_{r['subject']}"
        cid = f"{subj}_{s}_{e}"
        try:
            clip = cache_clip(r["audio_path"], clips_dir, "phonbank", cid, s, e)
        except Exception:
            continue
        rows.append(Row(id=f"phonbank_{cid}", source="phonbank", population="child",
                        length_class="word", subject=subj, clip_path=str(clip),
                        produced=produced, duration_ms=e - s).asdict())
        if limit and len(rows) >= limit: break
    return rows
  • [ ] Step 4: Run tests, verify pass

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k phonbank -v Expected: PASS.

  • [ ] Step 5: Commit
git add research/2026-06-06-audio-union/extract_phonbank.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): PhonBank child-word extractor"

Task 5: L2-ARCTIC extractor (in-hand)

Files: - Create: research/2026-06-06-audio-union/extract_l2arctic.py - Modify: tests/test_union.py

Lift the perceived-phone TextGrid parsing + word/sentence segmentation from research/2026-06-06-audio-triage/run_triage_l2libri.py (/Volumes/ExternalData2/audio-datasets/l2arctic/<SPK>/, annotation phones tier → perceived phone, skip deletions; word tier for l2_word; full sentence for l2_sentence; ARPABET→IPA→broad-40). population=l2, subject=speaker, two length_class values. 44.1k → resample via cache_clip.

  • [ ] Step 1: Write the failing test
# append to tests/test_union.py
L2 = Path("/Volumes/ExternalData2/audio-datasets/l2arctic")

@pytest.mark.skipif(not L2.exists(), reason="ExternalData2 not mounted")
def test_l2arctic_yields_l2_rows_both_lengths_16k(tmp_path):
    from extract_l2arctic import extract
    rows = extract(tmp_path / "clips", speakers=["ABA"], max_utts=10)
    assert any(r["length_class"] == "l2_sentence" for r in rows)
    for r in rows:
        assert r["source"] == "l2arctic" and r["population"] == "l2"
        arr, sr = __import__("soundfile").read(r["clip_path"]); assert sr == 16000
  • [ ] Step 2: Run it, verify it fails

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k l2arctic -v Expected: FAIL — ModuleNotFoundError.

  • [ ] Step 3: Implement extract_l2arctic.py

Port run_triage_l2libri.py's L2-ARCTIC section. Map length_class to l2_sentence / l2_word (treated as sentence/word by the audit's length-diversity check — keep the L2-specific labels for provenance, normalize to sentence|word in build_union). Reuse to_broad40(..., arpabet=...) and cache_clip. (Code: lift verbatim from the triage harness's read_textgrid + perceived-phone extraction; do not rewrite the TextGrid parser.)

  • [ ] Step 4: Run tests, verify pass

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k l2arctic -v Expected: PASS.

  • [ ] Step 5: Commit
git add research/2026-06-06-audio-union/extract_l2arctic.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): L2-ARCTIC perceived-phone extractor (word+sentence, 16k)"

Task 6: LibriSpeech extractor (in-hand)

Files: - Create: research/2026-06-06-audio-union/extract_librispeech.py - Modify: tests/test_union.py

Lift the LibriSpeech loader from run_triage_l2libri.py (hf-cache at /Volumes/ExternalData2/hf-cache, CMU-G2P of the transcript → broad-40 produced labels — canonical≈produced for clean speech; this is the connected-speech length anchor). population=clean, length_class=sentence, subject=speaker_id. 16k native (still route through cache_clip for uniformity).

  • [ ] Step 1: Write the failing test
# append to tests/test_union.py
HF = Path("/Volumes/ExternalData2/hf-cache")

@pytest.mark.skipif(not HF.exists(), reason="hf-cache not mounted")
def test_librispeech_yields_clean_sentences_16k(tmp_path):
    from extract_librispeech import extract
    rows = extract(tmp_path / "clips", n=8)
    assert rows and all(r["population"] == "clean" and r["length_class"] == "sentence" for r in rows)
    assert max(r["n_phonemes"] for r in rows) > 15   # genuinely connected
  • [ ] Step 2: Run it, verify it fails

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k librispeech -v Expected: FAIL — ModuleNotFoundError.

  • [ ] Step 3: Implement extract_librispeech.py

Port the LibriSpeech + CMU-G2P section from run_triage_l2libri.py (uses phonolex_data.loaders.cmudict.load_cmudict; drop OOV words; HF_HOME env). Write each utterance's audio to a temp wav then cache_clip (it's already 16k; resample is a no-op). Record OOV rate in a module-level counter for the audit.

  • [ ] Step 4: Run tests, verify pass

Run: HF_HOME=/Volumes/ExternalData2/hf-cache uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k librispeech -v Expected: PASS.

  • [ ] Step 5: Commit
git add research/2026-06-06-audio-union/extract_librispeech.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): LibriSpeech clean-sentence extractor (CMU-G2P labels)"

Task 7: Build + merge + subject-disjoint splits (build_union.py)

Files: - Create: research/2026-06-06-audio-union/build_union.py - Modify: tests/test_union.py

Orchestrate the in-hand extractors, normalize l2_sentence/l2_wordsentence/word, assign subject-disjoint 80/10/10 splits (a subject lands entirely in one split), write the manifest.

  • [ ] Step 1: Write the failing test
# append to tests/test_union.py
def test_splits_are_subject_disjoint_and_length_diverse():
    from build_union import assign_splits, normalize_length
    rows = ([{"subject": f"s{i}", "length_class": "word"} for i in range(8)] +
            [{"subject": f"s{i}", "length_class": "sentence"} for i in range(8, 16)])
    out = assign_splits(rows, seed=13)
    by_subj = {}
    for r in out: by_subj.setdefault(r["subject"], set()).add(r["split"])
    assert all(len(v) == 1 for v in by_subj.values())          # no subject leaks across splits
    assert {"train", "val", "test"} <= {r["split"] for r in out}
    assert normalize_length("l2_sentence") == "sentence"
  • [ ] Step 2: Run it, verify it fails

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k splits -v Expected: FAIL — ModuleNotFoundError: build_union.

  • [ ] Step 3: Implement build_union.py
# build_union.py
import argparse
from pathlib import Path
import numpy as np
from lib_manifest import rows_to_parquet

def normalize_length(lc: str) -> str:
    return "sentence" if "sentence" in lc else "word"

def assign_splits(rows: list[dict], seed: int = 13) -> list[dict]:
    subs = sorted({r["subject"] for r in rows})
    rng = np.random.default_rng(seed); rng.shuffle(subs)
    n = len(subs); val = set(subs[: max(1, n // 10)]); test = set(subs[n // 10: n // 5])
    for r in rows:
        r["split"] = "val" if r["subject"] in val else "test" if r["subject"] in test else "train"
    return rows

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--out-dir", default="/Volumes/ExternalData1/audio-union")
    ap.add_argument("--phonbank-limit", type=int, default=None)
    ap.add_argument("--libri-n", type=int, default=2000)
    ap.add_argument("--l2-max-utts", type=int, default=150)
    args = ap.parse_args()
    out = Path(args.out_dir); clips = out / "clips_16k"; clips.mkdir(parents=True, exist_ok=True)

    from extract_phonbank import extract as ph
    from extract_l2arctic import extract as l2
    from extract_librispeech import extract as ls
    rows = ph(clips, limit=args.phonbank_limit)
    rows += l2(clips, max_utts=args.l2_max_utts)
    rows += ls(clips, n=args.libri_n)
    for r in rows: r["length_class"] = normalize_length(r["length_class"])
    rows = assign_splits(rows)
    rows_to_parquet(rows, out / "union_manifest.parquet")
    print(f"[union] {len(rows)} rows -> {out/'union_manifest.parquet'}")

if __name__ == "__main__":
    main()
  • [ ] Step 4: Run tests, verify pass

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k splits -v Expected: PASS.

  • [ ] Step 5: Build the in-hand union for real, then commit the script

Run (long; checkpointed via clip cache — safe to rerun): HF_HOME=/Volumes/ExternalData2/hf-cache uv run python research/2026-06-06-audio-union/build_union.py Expected: prints [union] N rows with N in the tens of thousands.

git add research/2026-06-06-audio-union/build_union.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): build/merge + subject-disjoint splits (in-hand sources)"

Task 8: Audit gate (audit_union.py) — the pass/fail for the dataset

Files: - Create: research/2026-06-06-audio-union/audit_union.py - Create: research/2026-06-06-audio-union/UNION.md - Modify: tests/test_union.py

The audit is the deliverable's gate. It asserts: every clip is 16k; ≥99% of produced tokens are in the broad-40 inventory; length diversity present (both word and sentence non-trivially represented — the load-bearing property); population diversity present; splits subject-disjoint. Emits counts + a length histogram to UNION.md.

  • [ ] Step 1: Write the failing test
# append to tests/test_union.py
def test_audit_passes_on_a_synthetic_diverse_manifest(tmp_path):
    from audit_union import audit
    import polars as pl
    rows = []
    for i in range(50):
        lc = "word" if i % 2 else "sentence"; pop = ["child","l2","clean"][i % 3]
        rows.append(dict(id=f"x{i}", source="phonbank", population=pop, length_class=lc,
                         subject=f"s{i%10}", clip_path="x", produced=["k","æ","t"],
                         n_phonemes=3, duration_ms=600, split=["train","val","test"][i%3]))
    p = tmp_path / "m.parquet"; pl.DataFrame(rows).write_parquet(p)
    report = audit(p, check_audio=False)
    assert report["pass"] is True
    assert report["length_classes"]["word"] > 0 and report["length_classes"]["sentence"] > 0
  • [ ] Step 2: Run it, verify it fails

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k audit -v Expected: FAIL — ModuleNotFoundError: audit_union.

  • [ ] Step 3: Implement audit_union.py
# audit_union.py
import argparse
from pathlib import Path
import polars as pl, soundfile as sf
from lib_labels import INV

def audit(manifest_path, check_audio: bool = True) -> dict:
    df = pl.read_parquet(manifest_path)
    lc = dict(df.group_by("length_class").len().iter_rows())
    pop = dict(df.group_by("population").len().iter_rows())
    src = dict(df.group_by("source").len().iter_rows())
    # subject-disjoint splits
    leak = (df.group_by("subject").agg(pl.col("split").n_unique().alias("k"))
              .filter(pl.col("k") > 1).height)
    # label coverage
    toks = df["produced"].explode().drop_nulls().to_list()
    cov = sum(t in INV for t in toks) / max(1, len(toks))
    # SR check on a sample
    bad_sr = 0
    if check_audio:
        for p in df["clip_path"].sample(min(200, df.height), seed=1).to_list():
            try:
                if sf.info(p).samplerate != 16000: bad_sr += 1
            except Exception: bad_sr += 1
    report = {
        "rows": df.height, "length_classes": lc, "populations": pop, "sources": src,
        "subject_split_leaks": leak, "label_coverage": round(cov, 4), "bad_sr_sampled": bad_sr,
        "pass": (leak == 0 and cov >= 0.99 and bad_sr == 0
                 and lc.get("word", 0) > 0 and lc.get("sentence", 0) > 0
                 and len(pop) >= 2),
    }
    return report

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--manifest", default="/Volumes/ExternalData1/audio-union/union_manifest.parquet")
    args = ap.parse_args()
    rep = audit(args.manifest)
    import json; print(json.dumps(rep, indent=2))
    Path("research/2026-06-06-audio-union/UNION.md").write_text(
        "# Audio Training Union — audit\n\n```json\n" + json.dumps(rep, indent=2) + "\n```\n")
    assert rep["pass"], f"AUDIT FAILED: {rep}"

if __name__ == "__main__":
    main()
  • [ ] Step 4: Run tests, verify pass; then run the audit on the real union

Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k audit -v → PASS. Run: uv run python research/2026-06-06-audio-union/audit_union.py Expected: prints the report, "pass": true, and writes UNION.md. If pass is false, the report names the failing gate (leak / coverage / SR / missing length or population) — fix the offending extractor, rerun build_union.py, re-audit.

  • [ ] Step 5: Commit
git add research/2026-06-06-audio-union/audit_union.py research/2026-06-06-audio-union/UNION.md research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): dataset audit gate + UNION.md report"

Wave 2 — enrichment (download-dependent; do after the in-hand union passes audit)

These add accent/gender breadth (Common Voice) and the dysarthria population (TORGO). They are additive: the union is already trainable after Task 8. Each follows the same shape — extractor + test + add to build_union.py + re-audit.

Task 9: Common Voice extractor (CC0; accent + gender breadth)

Files: Create extract_commonvoice.py; modify build_union.py, tests/test_union.py.

  • [ ] Step 1: Download a CC0 subset. Run (documents the source in UNION.md): pull a bounded English subset via hf_hub_download per the LibriSpeech gotcha in research/2026-05-31-audio-data-reservoir/scripts/dl_librispeech_train.py (explicit-shard pattern, HF_HUB_DOWNLOAD_TIMEOUT=30). Target mozilla-foundation/common_voice_* English validated split, ~20–40h, to /Volumes/ExternalData2/hf-cache. Cap before realize (random-sample-to-cap is the first gate, per [[feedback_cap_before_realize]]).
  • [ ] Step 2: Write the failing testextract_commonvoice.extract yields population=clean (or accent), length_class=sentence, 16k, labels = to_broad40(CMU-G2P(text)), and carries accent/gender/age metadata fields (Common Voice provides them — they enable the future gender axis; store in the row even though the MVP ignores them).
  • [ ] Step 3: Implement the extractor (mirror extract_librispeech — CV is 48k MP3 → cache_clip resamples; CMU-G2P labels; drop OOV).
  • [ ] Step 4: Run test → PASS; add cv(...) to build_union.py; rerun build + audit → still "pass": true, now with a commonvoice source and a larger clean/accent population.
  • [ ] Step 5: Commit feat(audio-union): Common Voice CC0 extractor (accent/gender metadata).

Task 10: TORGO extractor (open download; dysarthria population)

Files: Create extract_torgo.py; modify build_union.py, tests/test_union.py.

  • [ ] Step 1: Download TORGO. Direct .tar.bz2 from the U-Toronto page (no gate) to /Volumes/ExternalData2/audio-datasets/torgo/. Document the "academic non-profit" license caveat in UNION.md (flagged for the deploy-time license review, like the other Tier-B/NC sources).
  • [ ] Step 2: Write the failing testextract_torgo.extract yields population=dysarthria, both length_class values where available, 16k (TORGO is 44.1k → cache_clip resamples), labels from the phn_* TIMIT-phone transcriptions → broad-40 (TIMIT phone set → IPA → to_broad40). Skip control speakers OR tag them population=clean (decide in UNION.md; default: include dysarthric speakers only for the deviant signal).
  • [ ] Step 3: Implement the extractor (parse phn_* alignment files for produced phones; map TIMIT→IPA; cache_clip).
  • [ ] Step 4: Run test → PASS; add torgo(...) to build_union.py; rerun build + audit → "pass": true, now with a dysarthria population present.
  • [ ] Step 5: Commit feat(audio-union): TORGO dysarthria extractor (TIMIT-phone labels, 16k).

Self-Review

1. Spec coverage (design §3, §9): The union mixes connected (LibriSpeech/L2-ARCTIC sentences) + word-level deviant (PhonBank) — the length-diversity property the existence proof requires (Task 7 normalizes length; Task 8 gates on both word and sentence present). Real-audio-only, produced labels, no synthetic — satisfied (no generator anywhere). Sources match §9 (LibriSpeech, Common Voice, PhonBank, TORGO, L2-ARCTIC). 16k resample enforced everywhere (Task 3 + audit SR check). Subject-disjoint splits (Task 7 + audit leak check). Gender metadata captured for the future axis (Task 9). Gap check: the design also names the retrain itself — that is deliberately a SEPARATE plan (Subsystem A continues after the union); this plan stops at a trainable, audited dataset. No other gaps.

2. Placeholder scan: No TBD/TODO. Wave-2 tasks (9, 10) compress the 5-step pattern into one line each because they are download-gated and structurally identical to Tasks 4–6 (which show full code) — the engineer mirrors those. Acceptable per "Similar to Task N" only because the template code is fully shown in 4–6; if executing 9–10 cold, copy the extract_librispeech/extract_phonbank shape.

3. Type consistency: Row.asdict() adds n_phonemes; to_broad40 signature (seq, arpabet=False) used consistently; cache_clip(src, clips_dir, source, clip_id, start_ms, end_ms) and audit(manifest_path, check_audio=True) stable across tasks; extract(...) returns list[dict] (manifest rows) everywhere.


Execution Handoff

Plan complete and saved to docs/superpowers/plans/2026-06-06-v6-audio-data-union.md. Two execution options:

1. Subagent-Driven (recommended) — I dispatch a fresh subagent per task, review between tasks, fast iteration. Well-suited here since each extractor is independent and the audit gate gives an objective per-task check.

2. Inline Execution — execute tasks in this session with checkpoints for review.

Which approach?