v6 Audio: Training Data Union Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Assemble a length-and-population-diverse training union of (16 kHz mono audio, produced broad-40 phoneme labels, metadata) with subject-disjoint train/val/test splits and a passing validation audit — the dataset the unified faithful transcriber will be fine-tuned on.
Architecture: A developer-local Python builder in a new research dir. Per-source extractors (lifted from this session's triage harnesses) each emit manifest rows + cached 16 kHz clips onto the external drive; a merge step normalizes labels to the broad-40 inventory and assigns subject-disjoint splits; an audit step enforces the gate (all-16k, label coverage, length diversity, split disjointness). Sources land in two waves: in-hand (PhonBank child, L2-ARCTIC, LibriSpeech — extractors already exist) then download-dependent enrichment (Common Voice, TORGO).
Tech Stack: Python 3.12, uv, polars (manifest), soundfile + librosa (audio I/O + resample), pytest (validation gates). Reuses packages/features/outputs/vectors.csv (broad-40 inventory), packages/data CMU loader + ARPA→IPA mapping.
Why this is the foundation: the ft-l2 existence proof showed length-generalization survives fine-tuning iff the union contains connected speech. This union deliberately mixes connected (LibriSpeech / L2-ARCTIC sentences) with word-level deviant data (PhonBank), which is the load-bearing property the retrain depends on.
Critical constraints (read before any task)
- EVERY clip resamples to 16 kHz. PhonBank cache is already 16k; L2-ARCTIC and TORGO are 44.1 kHz; LibriSpeech is 16k native. The bug that cost hours this session: sf.read without resampling fed the model 44.1k → garbage. Always call librosa.resample(arr, orig_sr=sr, target_sr=16000) when sr != 16000.
- Labels are PRODUCED phonemes, not canonical (faithful recipe), normalized to the broad-40 inventory in vectors.csv (the 26-d Bayesian set's segment list, which serves as the canonical inventory).
- Splits are SUBJECT-disjoint, never utterance-random; otherwise val/test leak.
- Outputs are gitignored, on the external drive (/Volumes/ExternalData1/audio-union/). Only the scripts + UNION.md are committed. This follows the reservoir convention ([[project_audio_data_reservoir]]).
- Checkpoint the heavy extraction (clip resampling/caching): skip clips already cached so reruns resume. Reference the checkpoint policy in CLAUDE.md.
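The subject-disjoint constraint can be checked mechanically. The sketch below is a minimal standalone illustration, not part of the planned modules: the helper name leaking_subjects and the toy rows are hypothetical. It groups rows by subject and reports every subject that shows up in more than one split.

```python
from collections import defaultdict

def leaking_subjects(rows):
    """Return subjects that appear in more than one split (hypothetical helper)."""
    seen = defaultdict(set)
    for r in rows:
        seen[r["subject"]].add(r["split"])
    return sorted(s for s, splits in seen.items() if len(splits) > 1)

# Utterance-random assignment: the same speaker lands in train and val.
utterance_random = [
    {"subject": "kid_a", "split": "train"},
    {"subject": "kid_a", "split": "val"},
    {"subject": "kid_b", "split": "test"},
]
# Subject-disjoint assignment: every subject stays in exactly one split.
subject_disjoint = [
    {"subject": "kid_a", "split": "train"},
    {"subject": "kid_a", "split": "train"},
    {"subject": "kid_b", "split": "test"},
]
```

An empty result is what the audit's leak check is meant to guarantee for the real manifest.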
File Structure
research/2026-06-06-audio-union/
├── lib_labels.py # narrow IPA + ARPABET → broad-40 normalization (single source of truth)
├── lib_manifest.py # manifest row schema + parquet read/write helpers
├── lib_audio.py # resample-to-16k clip cache shared by all extractors (Task 3)
├── extract_phonbank.py # PhonBank child %pho → rows + 16k clips (lift run_triage.load_words)
├── extract_l2arctic.py # L2-ARCTIC perceived-phone TextGrids → rows + 16k clips (lift l2libri)
├── extract_librispeech.py # LibriSpeech + CMU-G2P → rows + 16k clips (lift l2libri)
├── extract_commonvoice.py # Common Voice CC0 subset → rows + 16k clips (Wave 2)
├── extract_torgo.py # TORGO dysarthria → rows + 16k clips (Wave 2)
├── build_union.py # orchestrate extractors → merge → subject-disjoint splits → manifest
├── audit_union.py # THE GATE: 16k check, label coverage, length/pop histograms, split disjointness → UNION.md
├── tests/
│ └── test_union.py # validation tests (developer-local, data-dependent; NOT in CI)
└── UNION.md # lab notes: sources, counts, decisions, audit output
# Outputs (gitignored, external drive):
/Volumes/ExternalData1/audio-union/
├── clips_16k/<source>/<id>.wav # cached resampled clips
└── union_manifest.parquet # the dataset manifest
Manifest row schema (lib_manifest.py), one row per clip:
| field | type | notes |
|---|---|---|
| id | str | <source>_<natural-key> |
| source | str | phonbank \| l2arctic \| librispeech \| commonvoice \| torgo |
| population | str | child \| l2 \| clean \| dysarthria |
| length_class | str | word \| sentence |
| subject | str | speaker/subject id; splits are disjoint on this |
| clip_path | str | absolute path to the 16k wav |
| produced | list[str] | produced broad-40 phoneme tokens |
| n_phonemes | int | len(produced) |
| duration_ms | int | clip duration |
| split | str | train \| val \| test (assigned in build_union) |
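To make the schema concrete, here is a small sketch of a row check that mirrors the table. It is an illustration only: row_problems and ALLOWED are hypothetical names, and the plan itself does not prescribe a validator.

```python
# Allowed values per categorical field, copied from the schema table.
ALLOWED = {
    "source": {"phonbank", "l2arctic", "librispeech", "commonvoice", "torgo"},
    "population": {"child", "l2", "clean", "dysarthria"},
    "length_class": {"word", "sentence"},
    "split": {"train", "val", "test"},
}

def row_problems(row: dict) -> list[str]:
    """Return human-readable schema violations for one manifest row (hypothetical helper)."""
    problems = []
    for field, allowed in ALLOWED.items():
        if row.get(field) not in allowed:
            problems.append(f"{field}={row.get(field)!r} not in {sorted(allowed)}")
    produced = row.get("produced")
    if not isinstance(produced, list) or not produced:
        problems.append("produced must be a non-empty list")
    elif row.get("n_phonemes") != len(produced):
        problems.append("n_phonemes != len(produced)")
    if not str(row.get("id", "")).startswith(f"{row.get('source')}_"):
        problems.append("id must start with '<source>_'")
    if not isinstance(row.get("duration_ms"), int) or row["duration_ms"] <= 0:
        problems.append("duration_ms must be a positive int")
    return problems
```

A row that still carries a provenance label such as l2_sentence would be flagged here, which matches the plan's intent to normalize length_class before the manifest is written.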
Task 1: Scaffold + label normalization (lib_labels.py)
Files:
- Create: research/2026-06-06-audio-union/lib_labels.py
- Create: research/2026-06-06-audio-union/tests/test_union.py
- [ ] Step 1: Write the failing test
# tests/test_union.py
import sys; from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from lib_labels import to_broad40, INV

def test_inventory_is_40ish_and_has_core_phonemes():
    assert {"p", "t", "k", "s", "ɹ", "i", "ɑ", "tʃ"} <= INV
    assert len(INV) >= 39

def test_narrow_ipa_strips_diacritics_to_broad():
    # length mark, primary stress, dentalized /s/ diacritic → broad /s/
    assert to_broad40("ˈsːe̪") == ["s", "e"]

def test_arpabet_maps_to_broad_ipa():
    # ARPABET with stress digits → broad IPA tokens
    assert to_broad40(["T", "W", "EH1", "N", "T", "IY0"], arpabet=True) == ["t", "w", "ɛ", "n", "t", "i"]

def test_affricates_kept_as_units():
    assert to_broad40("tʃiz") == ["tʃ", "i", "z"]
- [ ] Step 2: Run it, verify it fails
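Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k "inventory or narrow or arpabet or affricates" -v
(The -k expression matches test names; none of the four tests contains "labels", so -k labels would select nothing.)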
Expected: FAIL — ModuleNotFoundError: lib_labels.
- [ ] Step 3: Implement lib_labels.py
Lift the normalize() + INV/TWO/VARIANT logic from research/2026-06-06-audio-triage/run_triage.py, and the ARPABET→IPA map from packages/data (load_arpa_to_ipa). Single source of truth for label normalization across all extractors.
# lib_labels.py
import csv, sys, unicodedata
from pathlib import Path

# Relative to the repo root; run pytest and the scripts from there.
VEC = Path("packages/features/outputs/vectors.csv")
with open(VEC, encoding="utf-8", newline="") as f:
    INV = {r["ipa"] for r in csv.DictReader(f)}  # public: membership set
INV_SET = INV  # alias used by the tokenizer below
TWO = {"tʃ", "dʒ"}
# variant collapses to broad-40 (ASCII g→ɡ, r→ɹ, flap→t, drop glottal/aspiration)
VARIANT = {"r": "ɹ", "g": "ɡ", "ʧ": "tʃ", "ʤ": "dʒ", "ɫ": "l", "ɾ": "t", "ʔ": "", "ʰ": ""}

def _arpa_map():
    sys.path.insert(0, "packages/data/src")
    from phonolex_data.mappings import load_arpa_to_ipa
    return load_arpa_to_ipa()

_ARPA = None  # loaded lazily on the first ARPABET call

def to_broad40(seq, arpabet: bool = False) -> list[str]:
    """Narrow IPA string OR ARPABET token list → broad-40 IPA tokens."""
    global _ARPA
    if arpabet:
        if _ARPA is None:
            _ARPA = _arpa_map()
        # strip stress digits, look up each symbol, drop symbols missing from the map
        s = "".join(_ARPA.get("".join(c for c in tok if not c.isdigit()).upper(), "") for tok in seq)
    else:
        s = seq
    # NFD separates base letters from combining diacritics so the diacritics can be dropped
    s = unicodedata.normalize("NFD", s)
    s = "".join(c for c in s if not unicodedata.combining(c) and c not in "ˈˌːˑ ’.")
    s = "".join(VARIANT.get(c, c) for c in s)
    out, i = [], 0
    while i < len(s):
        # greedy match: affricates are kept as single two-character tokens
        if i + 1 < len(s) and s[i:i+2] in TWO:
            out.append(s[i:i+2]); i += 2; continue
        if s[i] in INV_SET:
            out.append(s[i])  # symbols outside the inventory are dropped silently
        i += 1
    return out
- [ ] Step 4: Run tests, verify pass
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k "inventory or narrow or arpabet or affricates" -v
Expected: PASS (4 tests). If to_broad40 ARPABET path differs from the map's casing, fix the lookup, not the test.
- [ ] Step 5: Commit
git add research/2026-06-06-audio-union/lib_labels.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): broad-40 label normalization (narrow IPA + ARPABET)"
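The normalization steps (NFD, drop diacritics and suprasegmentals, collapse variants, greedy affricate match) can be tried without vectors.csv. The following is a self-contained sketch under assumptions: TOY_INV is a hypothetical eight-symbol stand-in for the real inventory, and toy_broad is an illustrative name, not the planned to_broad40.

```python
import unicodedata

TOY_INV = {"s", "e", "t", "i", "z", "ɹ", "tʃ", "dʒ"}  # hypothetical stand-in for vectors.csv
TWO = {"tʃ", "dʒ"}
VARIANT = {"r": "ɹ", "ʧ": "tʃ", "ʤ": "dʒ", "ʰ": ""}

def toy_broad(s: str) -> list[str]:
    """Narrow IPA string -> tokens from TOY_INV (illustration only)."""
    s = unicodedata.normalize("NFD", s)
    # drop combining marks (tie bars, dental marks) and stress/length symbols
    s = "".join(c for c in s if not unicodedata.combining(c) and c not in "ˈˌːˑ ")
    s = "".join(VARIANT.get(c, c) for c in s)
    out, i = [], 0
    while i < len(s):
        if s[i:i + 2] in TWO:          # keep affricates as one token
            out.append(s[i:i + 2])
            i += 2
            continue
        if s[i] in TOY_INV:            # anything else outside the inventory is skipped
            out.append(s[i])
        i += 1
    return out
```

Because out-of-inventory symbols are skipped rather than raised, it may be worth counting how many characters each extractor drops; that count is a suggestion here, not something the plan requires.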
Task 2: Manifest schema + I/O (lib_manifest.py)
Files:
- Create: research/2026-06-06-audio-union/lib_manifest.py
- Modify: research/2026-06-06-audio-union/tests/test_union.py
- [ ] Step 1: Write the failing test
# append to tests/test_union.py
from lib_manifest import Row, rows_to_parquet, read_manifest

def test_row_roundtrips_through_parquet(tmp_path):
    rows = [Row(id="phonbank_x", source="phonbank", population="child",
                length_class="word", subject="s1", clip_path="/tmp/x.wav",
                produced=["k", "æ", "t"], duration_ms=600).asdict()]
    p = tmp_path / "m.parquet"
    rows_to_parquet(rows, p)
    df = read_manifest(p)
    assert df["n_phonemes"][0] == 3 and df["produced"][0].to_list() == ["k", "æ", "t"]
- [ ] Step 2: Run it, verify it fails
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k roundtrip -v
Expected: FAIL — ModuleNotFoundError: lib_manifest.
- [ ] Step 3: Implement lib_manifest.py
# lib_manifest.py
from dataclasses import dataclass, asdict as _dc_asdict
from pathlib import Path
import polars as pl

@dataclass
class Row:
    id: str; source: str; population: str; length_class: str
    subject: str; clip_path: str; produced: list[str]; duration_ms: int
    split: str = "train"

    def asdict(self) -> dict:
        # dataclass fields plus the derived n_phonemes column
        d = _dc_asdict(self)
        d["n_phonemes"] = len(self.produced)
        return d

def rows_to_parquet(rows: list[dict], path: Path):
    pl.DataFrame(rows).write_parquet(path)

def read_manifest(path: Path) -> pl.DataFrame:
    return pl.read_parquet(path)
- [ ] Step 4: Run tests, verify pass
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k roundtrip -v
Expected: PASS.
- [ ] Step 5: Commit
git add research/2026-06-06-audio-union/lib_manifest.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): manifest row schema + parquet I/O"
Task 3: Shared clip cache helper (lib_audio.py)
Files:
- Create: research/2026-06-06-audio-union/lib_audio.py
- Modify: tests/test_union.py
The single resample-and-cache function every extractor uses. Checkpoint = skip if cached.
- [ ] Step 1: Write the failing test
# append to tests/test_union.py
import numpy as np, soundfile as sf
from lib_audio import cache_clip, SR

def test_cache_clip_resamples_to_16k_and_is_idempotent(tmp_path):
    src = tmp_path / "in.wav"
    sf.write(src, np.zeros(44100, dtype="float32"), 44100)  # 1s @ 44.1k
    out = cache_clip(str(src), tmp_path / "clips", "torgo", "u1", start_ms=None, end_ms=None)
    arr, sr = sf.read(out)
    assert sr == SR == 16000 and len(arr) == 16000
    mtime = out.stat().st_mtime
    cache_clip(str(src), tmp_path / "clips", "torgo", "u1", None, None)  # rerun
    assert out.stat().st_mtime == mtime  # not rewritten (checkpoint)
- [ ] Step 2: Run it, verify it fails
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k cache_clip -v
Expected: FAIL — ModuleNotFoundError: lib_audio.
- [ ] Step 3: Implement lib_audio.py
# lib_audio.py
from pathlib import Path
import soundfile as sf

SR = 16000

def cache_clip(src_path, clips_dir: Path, source: str, clip_id: str,
               start_ms, end_ms) -> Path:
    out = Path(clips_dir) / source / f"{clip_id}.wav"
    if out.exists():
        return out  # checkpoint: already cached
    out.parent.mkdir(parents=True, exist_ok=True)
    arr, sr = sf.read(src_path, dtype="float32")
    if arr.ndim > 1:
        arr = arr.mean(axis=1)  # downmix to mono
    if start_ms is not None and end_ms is not None:
        # slice in the source sample rate, before resampling
        arr = arr[int(start_ms / 1000 * sr):int(end_ms / 1000 * sr)]
    if sr != SR:
        import librosa  # lazy import: only needed for non-16k sources
        arr = librosa.resample(arr, orig_sr=sr, target_sr=SR)
    sf.write(out, arr.astype("float32"), SR)
    return out
- [ ] Step 4: Run tests, verify pass
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k cache_clip -v
Expected: PASS.
- [ ] Step 5: Commit
git add research/2026-06-06-audio-union/lib_audio.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): resample-to-16k clip cache (idempotent/checkpointed)"
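The "skip if cached" checkpoint treats any existing file as complete, so a run interrupted in the middle of a write could leave a truncated clip that later runs skip. One common way to guard against that is to write to a temporary file and rename it into place. The sketch below shows the pattern with plain bytes; write_atomically and cached are hypothetical helpers, not part of lib_audio.py as planned.

```python
import os
import tempfile
from pathlib import Path

def write_atomically(out: Path, data: bytes) -> Path:
    """Write to a temp file in the same directory, then rename into place."""
    out.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=out.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, out)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)    # never leave a partial file behind
        raise
    return out

def cached(out: Path, produce) -> Path:
    """Checkpoint: call produce() only when the final file is missing."""
    if out.exists():
        return out
    return write_atomically(out, produce())
```

If this were adapted to soundfile, the temporary name would no longer end in .wav, so sf.write would likely need its format argument set explicitly; that adaptation is left as an assumption to verify.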
Task 4: PhonBank child extractor (in-hand)
Files:
- Create: research/2026-06-06-audio-union/extract_phonbank.py
- Modify: tests/test_union.py
Lift load_words() from research/2026-06-06-audio-triage/run_triage.py (reads /Volumes/ExternalData1/phonbank/dataset_production_new_2026-06-03.jsonl, clips in _utt_cache_16k). population=child, length_class=word, label = to_broad40(actual_phonology), subject = corpus_name + subject.
- [ ] Step 1: Write the failing test (data-dependent — guarded skip if drive absent)
# append to tests/test_union.py
import pytest
from pathlib import Path
PB = Path("/Volumes/ExternalData1/phonbank/dataset_production_new_2026-06-03.jsonl")

@pytest.mark.skipif(not PB.exists(), reason="ExternalData1 not mounted")
def test_phonbank_extractor_yields_valid_child_word_rows(tmp_path):
    from extract_phonbank import extract
    rows = extract(tmp_path / "clips", limit=20)
    assert len(rows) > 0
    r = rows[0]
    assert r["source"] == "phonbank" and r["population"] == "child" and r["length_class"] == "word"
    # INV and sf are already imported earlier in this file (Tasks 1 and 3)
    assert all(p in INV for p in r["produced"])
    arr, sr = sf.read(r["clip_path"])
    assert sr == 16000
- [ ] Step 2: Run it, verify it fails
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k phonbank -v
Expected: FAIL — ModuleNotFoundError: extract_phonbank (or SKIP if drive unmounted — mount it).
- [ ] Step 3: Implement extract_phonbank.py
# extract_phonbank.py
import json
from pathlib import Path
from lib_labels import to_broad40
from lib_audio import cache_clip
from lib_manifest import Row

JSONL = Path("/Volumes/ExternalData1/phonbank/dataset_production_new_2026-06-03.jsonl")

def extract(clips_dir: Path, limit: int | None = None) -> list[dict]:
    rows = []
    with open(JSONL, encoding="utf-8") as f:
        for line in f:
            r = json.loads(line)
            # skip the ACAD corpus and records without audio
            if r.get("corpus_name") == "ACAD" or not r.get("audio_path"):
                continue
            s, e = r.get("start_ms"), r.get("end_ms")
            # skip missing, inverted, or longer-than-30s time spans
            if s is None or e is None or e <= s or (e - s) > 30000:
                continue
            produced = to_broad40(r.get("actual_phonology") or "")
            if not produced:
                continue
            subj = f"{r['corpus_name']}_{r['subject']}"
            cid = f"{subj}_{s}_{e}"
            try:
                clip = cache_clip(r["audio_path"], clips_dir, "phonbank", cid, s, e)
            except Exception:
                continue  # unreadable or missing audio: drop the row
            rows.append(Row(id=f"phonbank_{cid}", source="phonbank", population="child",
                            length_class="word", subject=subj, clip_path=str(clip),
                            produced=produced, duration_ms=e - s).asdict())
            if limit and len(rows) >= limit:
                break
    return rows
- [ ] Step 4: Run tests, verify pass
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k phonbank -v
Expected: PASS.
- [ ] Step 5: Commit
git add research/2026-06-06-audio-union/extract_phonbank.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): PhonBank child-word extractor"
Task 5: L2-ARCTIC extractor (in-hand)
Files:
- Create: research/2026-06-06-audio-union/extract_l2arctic.py
- Modify: tests/test_union.py
Lift the perceived-phone TextGrid parsing + word/sentence segmentation from research/2026-06-06-audio-triage/run_triage_l2libri.py (/Volumes/ExternalData2/audio-datasets/l2arctic/<SPK>/, annotation phones tier → perceived phone, skip deletions; word tier for l2_word; full sentence for l2_sentence; ARPABET→IPA→broad-40). population=l2, subject=speaker, two length_class values. 44.1k → resample via cache_clip.
- [ ] Step 1: Write the failing test
# append to tests/test_union.py
L2 = Path("/Volumes/ExternalData2/audio-datasets/l2arctic")

@pytest.mark.skipif(not L2.exists(), reason="ExternalData2 not mounted")
def test_l2arctic_yields_l2_rows_both_lengths_16k(tmp_path):
    from extract_l2arctic import extract
    rows = extract(tmp_path / "clips", speakers=["ABA"], max_utts=10)
    assert any(r["length_class"] == "l2_sentence" for r in rows)
    for r in rows:
        assert r["source"] == "l2arctic" and r["population"] == "l2"
        arr, sr = sf.read(r["clip_path"])
        assert sr == 16000
- [ ] Step 2: Run it, verify it fails
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k l2arctic -v
Expected: FAIL — ModuleNotFoundError.
- [ ] Step 3: Implement extract_l2arctic.py
Port run_triage_l2libri.py's L2-ARCTIC section. Map length_class to l2_sentence / l2_word (treated as sentence/word by the audit's length-diversity check — keep the L2-specific labels for provenance, normalize to sentence|word in build_union). Reuse to_broad40(..., arpabet=...) and cache_clip. (Code: lift verbatim from the triage harness's read_textgrid + perceived-phone extraction; do not rewrite the TextGrid parser.)
- [ ] Step 4: Run tests, verify pass
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k l2arctic -v
Expected: PASS.
- [ ] Step 5: Commit
git add research/2026-06-06-audio-union/extract_l2arctic.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): L2-ARCTIC perceived-phone extractor (word+sentence, 16k)"
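The "perceived phone, skip deletions" rule can be illustrated in isolation. This sketch assumes the annotation phone tier tags errors as canonical,perceived,type with type s (substitution), d (deletion), or a (addition), and that silence intervals are labeled sil, sp, or spn; both are assumptions to check against the triage harness, which remains the code to lift. perceived_phone is a hypothetical name.

```python
def perceived_phone(label: str):
    """Return the perceived ARPABET phone for one phone-tier label, or None to skip it."""
    label = label.strip()
    if not label or label.lower() in {"sil", "sp", "spn"}:
        return None                      # silence or pause interval
    parts = [p.strip() for p in label.split(",")]
    if len(parts) == 3:
        _canonical, perceived, kind = parts
        if kind.lower() == "d" or perceived.lower() == "sil":
            return None                  # deletion: nothing was produced
        return perceived.upper()         # substitution or addition: keep what was heard
    return label.upper()                 # untagged interval: produced as annotated
```

Stress digits are left in place here because to_broad40(..., arpabet=True) already strips them; any extra markers the corpus attaches to perceived phones are not handled by this sketch.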
Task 6: LibriSpeech extractor (in-hand)
Files:
- Create: research/2026-06-06-audio-union/extract_librispeech.py
- Modify: tests/test_union.py
Lift the LibriSpeech loader from run_triage_l2libri.py (hf-cache at /Volumes/ExternalData2/hf-cache, CMU-G2P of the transcript → broad-40 produced labels — canonical≈produced for clean speech; this is the connected-speech length anchor). population=clean, length_class=sentence, subject=speaker_id. 16k native (still route through cache_clip for uniformity).
- [ ] Step 1: Write the failing test
# append to tests/test_union.py
HF = Path("/Volumes/ExternalData2/hf-cache")

@pytest.mark.skipif(not HF.exists(), reason="hf-cache not mounted")
def test_librispeech_yields_clean_sentences_16k(tmp_path):
    from extract_librispeech import extract
    rows = extract(tmp_path / "clips", n=8)
    assert rows and all(r["population"] == "clean" and r["length_class"] == "sentence" for r in rows)
    assert max(r["n_phonemes"] for r in rows) > 15  # genuinely connected
- [ ] Step 2: Run it, verify it fails
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k librispeech -v
Expected: FAIL — ModuleNotFoundError.
- [ ] Step 3: Implement extract_librispeech.py
Port the LibriSpeech + CMU-G2P section from run_triage_l2libri.py (uses phonolex_data.loaders.cmudict.load_cmudict; drop OOV words; HF_HOME env). Write each utterance's audio to a temp wav then cache_clip (it's already 16k; resample is a no-op). Record OOV rate in a module-level counter for the audit.
- [ ] Step 4: Run tests, verify pass
Run: HF_HOME=/Volumes/ExternalData2/hf-cache uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k librispeech -v
Expected: PASS.
- [ ] Step 5: Commit
git add research/2026-06-06-audio-union/extract_librispeech.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): LibriSpeech clean-sentence extractor (CMU-G2P labels)"
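The dictionary lookup with OOV dropping and an OOV counter can be sketched without the real CMU dictionary. TOY_CMU is a hypothetical three-entry lexicon and g2p_drop_oov an illustrative name; the planned extractor uses load_cmudict instead.

```python
import re

TOY_CMU = {  # hypothetical stand-in for the CMU dictionary
    "the": ["DH", "AH0"],
    "cat": ["K", "AE1", "T"],
    "sat": ["S", "AE1", "T"],
}

def g2p_drop_oov(text: str, lexicon: dict):
    """Return (ARPABET tokens, n_words, n_oov); OOV words contribute no tokens."""
    words = re.findall(r"[a-z']+", text.lower())
    phones, n_oov = [], 0
    for w in words:
        pron = lexicon.get(w)
        if pron is None:
            n_oov += 1        # counted so the audit can report an OOV rate
            continue
        phones.extend(pron)
    return phones, len(words), n_oov
```

The token list would then go through to_broad40(tokens, arpabet=True). Note that a dropped word is still present in the audio, so a caller might prefer to skip utterances with n_oov > 0; that choice is not specified in the plan and is raised here only as a consideration.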
Task 7: Build + merge + subject-disjoint splits (build_union.py)
Files:
- Create: research/2026-06-06-audio-union/build_union.py
- Modify: tests/test_union.py
Orchestrate the in-hand extractors, normalize l2_sentence/l2_word → sentence/word, assign subject-disjoint 80/10/10 splits (a subject lands entirely in one split), write the manifest.
- [ ] Step 1: Write the failing test
# append to tests/test_union.py
def test_splits_are_subject_disjoint_and_length_diverse():
    from build_union import assign_splits, normalize_length
    rows = ([{"subject": f"s{i}", "length_class": "word"} for i in range(8)] +
            [{"subject": f"s{i}", "length_class": "sentence"} for i in range(8, 16)])
    out = assign_splits(rows, seed=13)
    by_subj = {}
    for r in out:
        by_subj.setdefault(r["subject"], set()).add(r["split"])
    assert all(len(v) == 1 for v in by_subj.values())  # no subject leaks across splits
    assert {"train", "val", "test"} <= {r["split"] for r in out}
    assert normalize_length("l2_sentence") == "sentence"
- [ ] Step 2: Run it, verify it fails
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k splits -v
Expected: FAIL — ModuleNotFoundError: build_union.
- [ ] Step 3: Implement build_union.py
# build_union.py
import argparse
from pathlib import Path
import numpy as np
from lib_manifest import rows_to_parquet

def normalize_length(lc: str) -> str:
    return "sentence" if "sentence" in lc else "word"

def assign_splits(rows: list[dict], seed: int = 13) -> list[dict]:
    # Shuffle subjects (not rows) so each subject lands in exactly one split.
    subs = sorted({r["subject"] for r in rows})
    rng = np.random.default_rng(seed)
    rng.shuffle(subs)
    n = len(subs)
    # ~10% val, ~10% test. The slices are adjacent and sized separately so that
    # with fewer than 10 subjects val and test do not pick the same subject.
    n_val = max(1, n // 10)
    n_test = max(1, n // 5 - n // 10)
    val = set(subs[:n_val])
    test = set(subs[n_val:n_val + n_test])
    for r in rows:
        r["split"] = "val" if r["subject"] in val else "test" if r["subject"] in test else "train"
    return rows

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--out-dir", default="/Volumes/ExternalData1/audio-union")
    ap.add_argument("--phonbank-limit", type=int, default=None)
    ap.add_argument("--libri-n", type=int, default=2000)
    ap.add_argument("--l2-max-utts", type=int, default=150)
    args = ap.parse_args()
    out = Path(args.out_dir)
    clips = out / "clips_16k"
    clips.mkdir(parents=True, exist_ok=True)
    # imported here so the split helpers stay testable without the drives mounted
    from extract_phonbank import extract as ph
    from extract_l2arctic import extract as l2
    from extract_librispeech import extract as ls
    rows = ph(clips, limit=args.phonbank_limit)
    rows += l2(clips, max_utts=args.l2_max_utts)
    rows += ls(clips, n=args.libri_n)
    for r in rows:
        r["length_class"] = normalize_length(r["length_class"])
    rows = assign_splits(rows)
    rows_to_parquet(rows, out / "union_manifest.parquet")
    print(f"[union] {len(rows)} rows -> {out/'union_manifest.parquet'}")

if __name__ == "__main__":
    main()
- [ ] Step 4: Run tests, verify pass
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k splits -v
Expected: PASS.
- [ ] Step 5: Build the in-hand union for real, then commit the script
Run (long; checkpointed via clip cache — safe to rerun):
HF_HOME=/Volumes/ExternalData2/hf-cache uv run python research/2026-06-06-audio-union/build_union.py
Expected: prints [union] N rows with N in the tens of thousands.
git add research/2026-06-06-audio-union/build_union.py research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): build/merge + subject-disjoint splits (in-hand sources)"
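Because the 80/10/10 ratio applies to subjects rather than clips, the row proportions per split can end up quite different when some subjects contribute far more clips than others. A small summary such as the sketch below makes that visible after a build; split_summary is a hypothetical helper working on plain manifest-row dicts.

```python
from collections import defaultdict

def split_summary(rows):
    """Per split: number of rows and number of distinct subjects."""
    n_rows = defaultdict(int)
    subjects = defaultdict(set)
    for r in rows:
        n_rows[r["split"]] += 1
        subjects[r["split"]].add(r["subject"])
    return {s: {"rows": n_rows[s], "subjects": len(subjects[s])} for s in sorted(n_rows)}

# Toy manifest: one prolific train subject, one val subject, one test subject.
toy = ([{"subject": "a", "split": "train"}] * 5
       + [{"subject": "b", "split": "val"}]
       + [{"subject": "c", "split": "test"}])
```

On the real manifest the same numbers could be read from the parquet file with polars; the pure-Python form is used here only to keep the example dependency-free.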
Task 8: Audit gate (audit_union.py), the pass/fail for the dataset
Files:
- Create: research/2026-06-06-audio-union/audit_union.py
- Create: research/2026-06-06-audio-union/UNION.md
- Modify: tests/test_union.py
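The audit is the deliverable's gate. It asserts: clips are 16k (checked on a random sample of up to 200 clips, not exhaustively); ≥99% of produced tokens are in the broad-40 inventory; length diversity is present (both word and sentence appear, which is the load-bearing property; the implementation only requires a non-zero count of each); population diversity is present (at least two populations); splits are subject-disjoint. It emits counts per length class, population, and source to UNION.md.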
- [ ] Step 1: Write the failing test
# append to tests/test_union.py
def test_audit_passes_on_a_synthetic_diverse_manifest(tmp_path):
    from audit_union import audit
    import polars as pl
    rows = []
    for i in range(50):
        lc = "word" if i % 2 else "sentence"; pop = ["child","l2","clean"][i % 3]
        subj = i % 10
        # the split is derived from the subject, so each subject stays in one split
        # (deriving it from i would spread a subject over all three and trip the leak gate)
        rows.append(dict(id=f"x{i}", source="phonbank", population=pop, length_class=lc,
                         subject=f"s{subj}", clip_path="x", produced=["k","æ","t"],
                         n_phonemes=3, duration_ms=600, split=["train","val","test"][subj % 3]))
    p = tmp_path / "m.parquet"; pl.DataFrame(rows).write_parquet(p)
    report = audit(p, check_audio=False)
    assert report["pass"] is True
    assert report["length_classes"]["word"] > 0 and report["length_classes"]["sentence"] > 0
- [ ] Step 2: Run it, verify it fails
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k audit -v
Expected: FAIL — ModuleNotFoundError: audit_union.
- [ ] Step 3: Implement audit_union.py
# audit_union.py
import argparse, json
from pathlib import Path
import polars as pl, soundfile as sf
from lib_labels import INV

def audit(manifest_path, check_audio: bool = True) -> dict:
    df = pl.read_parquet(manifest_path)
    lc = dict(df.group_by("length_class").len().iter_rows())
    pop = dict(df.group_by("population").len().iter_rows())
    src = dict(df.group_by("source").len().iter_rows())
    # subject-disjoint splits: count subjects that appear in more than one split
    leak = (df.group_by("subject").agg(pl.col("split").n_unique().alias("k"))
              .filter(pl.col("k") > 1).height)
    # label coverage: share of produced tokens that are in the inventory
    toks = df["produced"].explode().drop_nulls().to_list()
    cov = sum(t in INV for t in toks) / max(1, len(toks))
    # SR check on a random sample of up to 200 clips
    bad_sr = 0
    if check_audio:
        for p in df["clip_path"].sample(min(200, df.height), seed=1).to_list():
            try:
                if sf.info(p).samplerate != 16000:
                    bad_sr += 1
            except Exception:
                bad_sr += 1  # unreadable clips count as failures
    report = {
        "rows": df.height, "length_classes": lc, "populations": pop, "sources": src,
        "subject_split_leaks": leak, "label_coverage": round(cov, 4), "bad_sr_sampled": bad_sr,
        "pass": (leak == 0 and cov >= 0.99 and bad_sr == 0
                 and lc.get("word", 0) > 0 and lc.get("sentence", 0) > 0
                 and len(pop) >= 2),
    }
    return report

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--manifest", default="/Volumes/ExternalData1/audio-union/union_manifest.parquet")
    args = ap.parse_args()
    rep = audit(args.manifest)
    print(json.dumps(rep, indent=2))
    Path("research/2026-06-06-audio-union/UNION.md").write_text(
        "# Audio Training Union — audit\n\n```json\n" + json.dumps(rep, indent=2) + "\n```\n")
    # explicit exit instead of assert, so the gate still fires under python -O
    if not rep["pass"]:
        raise SystemExit(f"AUDIT FAILED: {rep}")

if __name__ == "__main__":
    main()
- [ ] Step 4: Run tests, verify pass; then run the audit on the real union
Run: uv run pytest research/2026-06-06-audio-union/tests/test_union.py -k audit -v → PASS.
Run: uv run python research/2026-06-06-audio-union/audit_union.py
Expected: prints the report, "pass": true, and writes UNION.md. If pass is false, the report names the failing gate (leak / coverage / SR / missing length or population) — fix the offending extractor, rerun build_union.py, re-audit.
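The report carries the raw numbers behind each gate, so the failing gate can be derived from it. A minimal sketch, assuming the report keys shown in audit() and the same thresholds; failing_gates is a hypothetical helper, not part of the planned audit_union.py.

```python
def failing_gates(report: dict) -> list[str]:
    """Translate an audit report into the names of the gates that failed."""
    fails = []
    if report["subject_split_leaks"] > 0:
        fails.append("leak")
    if report["label_coverage"] < 0.99:
        fails.append("coverage")
    if report["bad_sr_sampled"] > 0:
        fails.append("sr")
    for lc in ("word", "sentence"):
        if report["length_classes"].get(lc, 0) == 0:
            fails.append(f"missing length_class={lc}")
    if len(report["populations"]) < 2:
        fails.append("fewer than 2 populations")
    return fails

# Example report: two leaking subjects and no sentence-length rows.
example = {
    "subject_split_leaks": 2, "label_coverage": 1.0, "bad_sr_sampled": 0,
    "length_classes": {"word": 120}, "populations": {"child": 80, "l2": 40},
}
```

An empty list corresponds to "pass": true, so the helper and the pass flag should always agree as long as the thresholds are kept in sync.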
- [ ] Step 5: Commit
git add research/2026-06-06-audio-union/audit_union.py research/2026-06-06-audio-union/UNION.md research/2026-06-06-audio-union/tests/test_union.py
git commit -m "feat(audio-union): dataset audit gate + UNION.md report"
Wave 2: enrichment (download-dependent; do after the in-hand union passes audit)
These add accent/gender breadth (Common Voice) and the dysarthria population (TORGO). They are additive: the union is already trainable after Task 8. Each follows the same shape — extractor + test + add to build_union.py + re-audit.
Task 9: Common Voice extractor (CC0; accent + gender breadth)
Files: Create extract_commonvoice.py; modify build_union.py, tests/test_union.py.
- [ ] Step 1: Download a CC0 subset. Run (documents the source in UNION.md): pull a bounded English subset via hf_hub_download per the LibriSpeech gotcha in research/2026-05-31-audio-data-reservoir/scripts/dl_librispeech_train.py (explicit-shard pattern, HF_HUB_DOWNLOAD_TIMEOUT=30). Target mozilla-foundation/common_voice_* English validated split, ~20–40h, to /Volumes/ExternalData2/hf-cache. Cap before realize (random-sample-to-cap is the first gate, per [[feedback_cap_before_realize]]).
- [ ] Step 2: Write the failing test: extract_commonvoice.extract yields population=clean (or accent), length_class=sentence, 16k, labels = to_broad40(CMU-G2P(text)), and carries accent/gender/age metadata fields (Common Voice provides them; they enable the future gender axis, so store them in the row even though the MVP ignores them).
- [ ] Step 3: Implement the extractor (mirror extract_librispeech: CV is 48k MP3 → cache_clip resamples; CMU-G2P labels; drop OOV).
- [ ] Step 4: Run test → PASS; add cv(...) to build_union.py; rerun build + audit → still "pass": true, now with a commonvoice source and a larger clean/accent population.
- [ ] Step 5: Commit feat(audio-union): Common Voice CC0 extractor (accent/gender metadata).
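"Cap before realize" means choosing the capped random subset from metadata before any audio is downloaded or decoded. A minimal sketch, assuming each metadata row carries a duration in milliseconds; the field name duration_ms on metadata rows and the helper cap_to_hours are hypothetical.

```python
import random

def cap_to_hours(meta_rows, cap_hours: float, seed: int = 13):
    """Shuffle metadata rows and keep clips until the duration budget is spent."""
    rows = list(meta_rows)
    random.Random(seed).shuffle(rows)     # seeded, so reruns pick the same subset
    budget_ms = cap_hours * 3_600_000
    kept, total = [], 0
    for r in rows:
        if total + r["duration_ms"] > budget_ms:
            continue                      # would overshoot; a shorter clip may still fit
        kept.append(r)
        total += r["duration_ms"]
    return kept
```

Only the rows returned here would then be fetched and passed to cache_clip, which keeps the download bounded to the target of roughly 20 to 40 hours.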
Task 10: TORGO extractor (open download; dysarthria population)
Files: Create extract_torgo.py; modify build_union.py, tests/test_union.py.
- [ ] Step 1: Download TORGO. Direct .tar.bz2 from the U-Toronto page (no gate) to /Volumes/ExternalData2/audio-datasets/torgo/. Document the "academic non-profit" license caveat in UNION.md (flagged for the deploy-time license review, like the other Tier-B/NC sources).
- [ ] Step 2: Write the failing test: extract_torgo.extract yields population=dysarthria, both length_class values where available, 16k (TORGO is 44.1k → cache_clip resamples), labels from the phn_* TIMIT-phone transcriptions → broad-40 (TIMIT phone set → IPA → to_broad40). Skip control speakers OR tag them population=clean (decide in UNION.md; default: include dysarthric speakers only for the deviant signal).
- [ ] Step 3: Implement the extractor (parse phn_* alignment files for produced phones; map TIMIT→IPA; cache_clip).
- [ ] Step 4: Run test → PASS; add torgo(...) to build_union.py; rerun build + audit → "pass": true, now with a dysarthria population present.
- [ ] Step 5: Commit feat(audio-union): TORGO dysarthria extractor (TIMIT-phone labels, 16k).
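The TIMIT → IPA step for TORGO can be sketched as a lookup table plus a skip set. The table below is deliberately partial and illustrative (TIMIT_TO_IPA, SKIP, and timit_to_ipa are hypothetical names); the flap and glottal-stop handling mirrors the VARIANT rules in lib_labels.py, and the real extractor would need the full phone set.

```python
TIMIT_TO_IPA = {  # partial, illustrative table
    "p": "p", "t": "t", "k": "k", "b": "b", "d": "d", "g": "ɡ",
    "s": "s", "z": "z", "sh": "ʃ", "zh": "ʒ", "f": "f", "v": "v",
    "th": "θ", "dh": "ð", "ch": "tʃ", "jh": "dʒ",
    "m": "m", "n": "n", "ng": "ŋ", "l": "l", "r": "ɹ", "w": "w", "y": "j", "hh": "h",
    "iy": "i", "ih": "ɪ", "eh": "ɛ", "ae": "æ", "aa": "ɑ", "ah": "ʌ", "uw": "u", "uh": "ʊ",
    "dx": "t",  # flap collapsed to /t/, as VARIANT does for ɾ
}
# silence markers, glottal stop, and stop closures carry no broad-40 token
SKIP = {"h#", "pau", "epi", "q", "pcl", "tcl", "kcl", "bcl", "dcl", "gcl"}

def timit_to_ipa(phones):
    """Return (IPA tokens, unmapped TIMIT symbols) for a list of TIMIT phones."""
    out, unknown = [], []
    for p in phones:
        p = p.lower()
        if p in SKIP:
            continue
        ipa = TIMIT_TO_IPA.get(p)
        if ipa is None:
            unknown.append(p)   # surfaced so gaps in the table are visible
        else:
            out.append(ipa)
    return out, unknown
```

Joining the tokens and passing the string through to_broad40 would apply the same inventory filter as the other sources; returning the unmapped symbols separately is a design choice of this sketch so that missing table entries do not disappear silently.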
Self-Review
1. Spec coverage (design §3, §9): The union mixes connected (LibriSpeech/L2-ARCTIC sentences) + word-level deviant (PhonBank) — the length-diversity property the existence proof requires (Task 7 normalizes length; Task 8 gates on both word and sentence present). Real-audio-only, produced labels, no synthetic — satisfied (no generator anywhere). Sources match §9 (LibriSpeech, Common Voice, PhonBank, TORGO, L2-ARCTIC). 16k resample enforced everywhere (Task 3 + audit SR check). Subject-disjoint splits (Task 7 + audit leak check). Gender metadata captured for the future axis (Task 9). Gap check: the design also names the retrain itself — that is deliberately a SEPARATE plan (Subsystem A continues after the union); this plan stops at a trainable, audited dataset. No other gaps.
2. Placeholder scan: No TBD/TODO. Wave-2 tasks (9, 10) compress the 5-step pattern into one line each because they are download-gated and structurally identical to Tasks 4–6 (which show full code) — the engineer mirrors those. Acceptable per "Similar to Task N" only because the template code is fully shown in 4–6; if executing 9–10 cold, copy the extract_librispeech/extract_phonbank shape.
3. Type consistency: Row.asdict() adds n_phonemes; to_broad40 signature (seq, arpabet=False) used consistently; cache_clip(src, clips_dir, source, clip_id, start_ms, end_ms) and audit(manifest_path, check_audio=True) stable across tasks; extract(...) returns list[dict] (manifest rows) everywhere.
Execution Handoff
Plan complete and saved to docs/superpowers/plans/2026-06-06-v6-audio-data-union.md. Two execution options:
1. Subagent-Driven (recommended) — I dispatch a fresh subagent per task, review between tasks, fast iteration. Well-suited here since each extractor is independent and the audit gate gives an objective per-task check.
2. Inline Execution — execute tasks in this session with checkpoints for review.
Which approach?