# v6 Audio Trajectory Serving Harness Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Make the "all phones are trajectories" model serveable at all — one config-driven FastAPI host that loads the keeper emitter + full-63 Fisher refs + a baked attribution model and answers audio + canonical phones → per-position trajectory deviation + 4-way source attribution, runnable locally now and packageable for RunPod later.
Architecture: Lift the validated research scorer (score_trajectories.py) and attribution classifier (attribution_classifier.py) out of research/2026-06-06-audio-union/ into importable serving modules in packages/audio. The scorer is corpus-free at serve time. Attribution is split into a one-time offline bake (centroids + standardization + priors → attribution_model.json) and a cheap serve-time classify. Artifact paths + device come from env/flags so the identical harness runs on a laptop (MPS/CPU) or in a RunPod container (CUDA). No hosting is deployed in this plan — only made possible.
Tech Stack: Python 3.11, PyTorch (inference), transformers (wav2vec2-lv-60-espeak backbone), numpy, FastAPI/uvicorn, parselmouth (already a host dep). Frontend cleanup in React/TS.
Scope boundaries (explicit):
- IN: stripped serving artifact, scorer module, attribution bake + serve modules, /analyze endpoint, launch flags, local end-to-end smoke, RunPod packaging (buildable, not deployed), removal of the /dev/* pages from the shippable frontend.
- OUT (separate plans): the new product UI surface (a later reimagining — the /dev/* pages are scratch and are being removed, not productized); the worker /api/audio/analyze proxy route + canonical lookup (lands with the product UI); the PHON-151 app-wide vector reseed (gated on "happy").
Standing constraints (from research/2026-06-06-audio-union/README.md §7): dev/local only — no production, no remote reseed, no hosting deploy until the user says "happy." Never overwrite/rebuild the gitignored model weights on /Volumes/ExternalData1/audio-union/. The 63 inventory stays 63. Say "feature vectors," not "embeddings."
## File Structure
| Path | Responsibility | Action |
|---|---|---|
| `packages/audio/scripts/strip_checkpoint.py` | 3.5 GB training ckpt → ~1.26 GB serving weights | Created (Task 1, done) |
| `packages/audio/src/phonolex_audio/serving_config.py` | Resolve artifact paths + device from env/flags (local↔RunPod parity) | Create (Task 2) |
| `packages/audio/tests/test_serving_config.py` | Precedence unit tests (arg > env > default; no torch) | Create (Task 2) |
| `packages/audio/src/phonolex_audio/trajectory_scorer.py` | Pure trajectory scoring: align emission to canonical, per-position Fisher deviation + nearest-ref identity | Create (Task 3) |
| `packages/audio/src/phonolex_audio/analyzer.py` | `TrajectoryAnalyzer`: owns the emitter + refs + attribution model; `analyze(audio, canon) -> dict` | Create (Task 4), extend (Task 6) |
| `packages/audio/scripts/bake_attribution_model.py` | One-time offline build of `attribution_model.json` (means/stds, per-source centroids, L1/Dev priors) | Create (Task 5) |
| `research/2026-06-06-audio-union/attribution_classifier.py` | Expose the `build_subject_features()` seam used by the bake | Modify (Task 5) |
| `packages/audio/src/phonolex_audio/attribution.py` | Serve-time: load baked model, compute 6 features for a clip, standardize, nearest-centroid → source | Create (Task 6) |
| `packages/audio/src/phonolex_audio/server.py:142` | Add `/analyze` endpoint mirroring `/feature-review` | Modify (Task 7) |
| `packages/audio/src/phonolex_audio/__main__.py` | Flags: `--trajectory-refs`, `--attribution-model`; register the analyzer | Modify (Task 7) |
| `packages/audio/tests/test_trajectory_scorer.py` | Pure-function unit tests (synthetic arrays, no model) | Create (Task 3) |
| `packages/audio/tests/test_attribution.py` | Serve-time classify unit tests (synthetic features + tiny baked model) | Create (Task 6) |
| `packages/audio/tests/test_analyze_smoke.py` | `@pytest.mark.slow` end-to-end: real keeper + one clip | Create (Task 8) |
| `packages/audio/deploy/Dockerfile` + `handler.py` + `README.md` | RunPod-buildable image (HF cache baked, artifacts mounted/copied, env-driven) | Create (Task 9) |
| `packages/web/frontend/src/main.tsx` + `components/tools/*Viewer*` | Remove `/dev/*` routes + viewer components + their tests/services | Modify/Delete (Task 10) |
Phoneme-set note: trajectory_scorer and attribution import the canonical CONSONANTS / phone handling from one place. Define CONSONANTS once in trajectory_scorer.py and import it into attribution.py (do not duplicate the literal from attribution_classifier.py:40).
## Task 1: Stripped serving artifact (DONE)
Files: Create packages/audio/scripts/strip_checkpoint.py
Already implemented and run. Measured: state.pt 3,752 MB → state_serve.pt 1,262 MB fp32 (427 model tensors; optimizer/step/epoch/history dropped). The output keeps the {"model": ...} key so FeatureEmitter(checkpoint=...) loads it unchanged. fp16 (--half) yields ~630 MB and is available if image size pressures it.
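The stripping rule described above can be sketched as a pure function over the already-loaded checkpoint dict. The rule is: keep only the `"model"` entry, optionally cast floating-point tensors to fp16, and re-wrap. This is an illustration, not the committed script. `strip_training_checkpoint` is a hypothetical name, and the `torch.load`/`torch.save` round trip is assumed to happen in the caller, which keeps the logic testable without torch.

```python
from typing import Any


def strip_training_checkpoint(ckpt: dict[str, Any], half: bool = False) -> dict[str, Any]:
    """Keep only the weights under "model"; optimizer/step/epoch/history are dropped.

    Hypothetical helper illustrating the Task 1 behaviour. Values are expected to be
    torch tensors; anything without an ``is_floating_point`` method passes through.
    """
    if "model" not in ckpt:
        raise KeyError('checkpoint has no "model" key; FeatureEmitter expects one')
    model = ckpt["model"]
    if half:
        # fp16 only for floating-point tensors; integer buffers stay as they are.
        model = {
            name: t.half() if hasattr(t, "is_floating_point") and t.is_floating_point() else t
            for name, t in model.items()
        }
    # Preserve the {"model": ...} wrapper so FeatureEmitter(checkpoint=...) loads it unchanged.
    return {"model": model}
```

With torch, the call would look like `torch.save(strip_training_checkpoint(torch.load(src, map_location="cpu"), half=True), dst)`.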
- [x] Step 1: `strip_checkpoint.py` written.
- [x] Step 2: Run produced `state_serve.pt` (1,262 MB) on the drive (gitignored).
- [ ] Step 3: Commit the utility

```bash
git add packages/audio/scripts/strip_checkpoint.py
git commit -m "feat(audio): checkpoint-strip utility - 3.5GB training ckpt -> 1.26GB serving artifact"
```
## Task 2: Serving config (local↔RunPod parity)
Files:
- Create: packages/audio/src/phonolex_audio/serving_config.py
- Test: packages/audio/tests/test_serving_config.py
Resolve every serving artifact path and the device from env with explicit fallbacks, so the same code runs locally (drive paths, MPS/CPU) and in a container (copied paths, CUDA). No torch import here — pure path/string logic, fast to test.
- [ ] Step 1: Write the failing test
```python
# packages/audio/tests/test_serving_config.py
from phonolex_audio.serving_config import resolve_serving_config


def test_env_overrides_defaults(monkeypatch):
    monkeypatch.setenv("PHONOLEX_AUDIO_CHECKPOINT", "/tmp/state_serve.pt")
    monkeypatch.setenv("PHONOLEX_AUDIO_VECTORS", "/tmp/vectors.csv")
    monkeypatch.setenv("PHONOLEX_AUDIO_TRAJ_REFS", "/tmp/refs_fisher.json")
    monkeypatch.setenv("PHONOLEX_AUDIO_ATTRIBUTION", "/tmp/attribution_model.json")
    monkeypatch.setenv("PHONOLEX_AUDIO_DEVICE", "cuda")
    cfg = resolve_serving_config()
    assert cfg.checkpoint == "/tmp/state_serve.pt"
    assert cfg.vectors == "/tmp/vectors.csv"
    assert cfg.traj_refs == "/tmp/refs_fisher.json"
    assert cfg.attribution_model == "/tmp/attribution_model.json"
    assert cfg.device == "cuda"


def test_flag_overrides_env(monkeypatch):
    monkeypatch.setenv("PHONOLEX_AUDIO_DEVICE", "cpu")
    cfg = resolve_serving_config(device="mps")
    assert cfg.device == "mps"  # explicit arg wins over env
```
- [ ] Step 2: Run to verify it fails
Run: uv run --extra dev pytest packages/audio/tests/test_serving_config.py -v
Expected: FAIL (module not found)
- [ ] Step 3: Implement
```python
# packages/audio/src/phonolex_audio/serving_config.py
"""Resolve serving artifact paths + device from env/flags.

The same harness runs locally (external-drive paths, MPS/CPU) and in a RunPod
container (image-local paths, CUDA). Precedence: explicit arg > env var > default.
No torch import: device is resolved lazily by the caller when None.
"""
from __future__ import annotations

import os
from dataclasses import dataclass

DRIVE = "/Volumes/ExternalData1/audio-union"

_DEFAULTS = {
    "checkpoint": f"{DRIVE}/model_feat_traj_target/state_serve.pt",
    "vectors": f"{DRIVE}/model_feat_traj_target/vectors.csv",
    "traj_refs": f"{DRIVE}/refs_fisher.json",
    "attribution_model": f"{DRIVE}/attribution_model.json",
}

_ENV = {
    "checkpoint": "PHONOLEX_AUDIO_CHECKPOINT",
    "vectors": "PHONOLEX_AUDIO_VECTORS",
    "traj_refs": "PHONOLEX_AUDIO_TRAJ_REFS",
    "attribution_model": "PHONOLEX_AUDIO_ATTRIBUTION",
}


@dataclass(frozen=True)
class ServingConfig:
    checkpoint: str
    vectors: str
    traj_refs: str
    attribution_model: str
    device: str | None  # None = auto-detect at load (cuda > mps > cpu)


def _pick(arg: str | None, env_key: str, default: str | None) -> str | None:
    """Explicit argument wins; otherwise the env var; otherwise the default."""
    if arg is not None:
        return arg
    return os.environ.get(env_key, default)


def resolve_serving_config(
    *, checkpoint=None, vectors=None, traj_refs=None, attribution_model=None, device=None
) -> ServingConfig:
    return ServingConfig(
        checkpoint=_pick(checkpoint, _ENV["checkpoint"], _DEFAULTS["checkpoint"]),
        vectors=_pick(vectors, _ENV["vectors"], _DEFAULTS["vectors"]),
        traj_refs=_pick(traj_refs, _ENV["traj_refs"], _DEFAULTS["traj_refs"]),
        attribution_model=_pick(
            attribution_model, _ENV["attribution_model"], _DEFAULTS["attribution_model"]
        ),
        device=_pick(device, "PHONOLEX_AUDIO_DEVICE", None),
    )
```
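`ServingConfig.device` stays `None` when neither a flag nor `PHONOLEX_AUDIO_DEVICE` is set, which leaves the cuda > mps > cpu choice to load time. Below is a minimal sketch of that fallback. It assumes the caller passes the availability probes in, so that this logic, like `serving_config.py`, can be tested without importing torch. `pick_device` is a hypothetical name. Check first whether `FeatureEmitter` already does this internally.

```python
from __future__ import annotations

from typing import Callable


def pick_device(
    requested: str | None,
    cuda_available: Callable[[], bool],
    mps_available: Callable[[], bool],
) -> str:
    """Resolve the serving device: an explicit value wins, else cuda > mps > cpu."""
    if requested:
        return requested
    if cuda_available():
        return "cuda"
    if mps_available():
        return "mps"
    return "cpu"
```

At load time the probes would be the real torch functions: `pick_device(cfg.device, torch.cuda.is_available, torch.backends.mps.is_available)`.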
- [ ] Step 4: Run to verify it passes
Run: uv run --extra dev pytest packages/audio/tests/test_serving_config.py -v
Expected: PASS (2 passed)
- [ ] Step 5: Commit
```bash
git add packages/audio/src/phonolex_audio/serving_config.py packages/audio/tests/test_serving_config.py
git commit -m "feat(audio): serving config - env/flag artifact paths for local<->runpod parity"
```
## Task 3: Trajectory scorer module
Files:
- Create: packages/audio/src/phonolex_audio/trajectory_scorer.py
- Test: packages/audio/tests/test_trajectory_scorer.py
Lift the pure scoring math from research/.../score_trajectories.py (lev_align, resample, span-by-neighbour-midpoints, Fisher tdist, nearest-ref). Keep it model-free: functions take the already-emitted e[T,26], the aligned slot centers, the canonical phones, and the loaded refs/fisher. The emitter wiring lives in Task 4.
- [ ] Step 1: Write the failing tests (synthetic arrays — no torch, no model)
```python
# packages/audio/tests/test_trajectory_scorer.py
import numpy as np

from phonolex_audio.trajectory_scorer import K, lev_align, position_spans, resample, tdist


def test_resample_to_K_timepoints():
    seg = np.linspace(0, 1, 5)[:, None] * np.ones((5, 26))
    out = resample(seg)
    assert out.shape == (K, 26)


def test_resample_too_short_returns_none():
    assert resample(np.zeros((1, 26))) is None


def test_lev_align_marks_substitution_and_match():
    pairs = lev_align(["k", "a", "t"], ["k", "a", "p"])
    assert pairs == [("k", "k"), ("a", "a"), ("t", "p")]


def test_position_spans_uses_neighbour_midpoints():
    # centers at frames 10, 20, 30 -> middle slot spans the midpoints 15..25
    spans = position_spans([10.0, 20.0, 30.0], n_frames=40)
    assert spans[1] == (15, 26)  # end is exclusive, so midpoint frame 25 becomes 26


def test_tdist_fisher_weighting():
    P = np.zeros((K, 26))
    R = np.ones((K, 26))
    fisher_flat = np.ones((K, 26))
    # each timepoint contributes sqrt(26 * 1 * 1**2); the mean over K is sqrt(26)
    assert abs(tdist(P, R, fisher_flat) - np.sqrt(26)) < 1e-9
```

Note: the expected `tdist` value comes straight from the formula `sqrt((fisher * (P - R) ** 2).sum(1)).mean()`. With P = 0, R = 1 and fisher = 1, each of the K timepoints contributes sqrt(26) ≈ 5.099, so the mean is sqrt(26), not 1.0.
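Written out, with F the Fisher weights, K resampled timepoints and D = 26 feature dimensions, `tdist` computes:

$$
\operatorname{tdist}(P, R) = \frac{1}{K} \sum_{k=1}^{K} \sqrt{\sum_{d=1}^{D} F_{kd}\,\left(P_{kd} - R_{kd}\right)^{2}}
$$

For P = 0, R = 1 and F = 1, every inner sum equals D = 26, so tdist = sqrt(26) ≈ 5.099 regardless of K. `nearest_ref` takes the argmin of this same quantity over the reference trajectories.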
- [ ] Step 2: Run to verify it fails
Run: uv run --extra dev pytest packages/audio/tests/test_trajectory_scorer.py -v
Expected: FAIL (module not found)
- [ ] Step 3: Implement (lifted verbatim from `score_trajectories.py` lines 29–55, 82–83, 119–124, generalized)
```python
# packages/audio/src/phonolex_audio/trajectory_scorer.py
"""Pure trajectory-to-trajectory scoring (no model, no corpus).

Given the emitter's per-frame features e[T,26], the force-aligned slot centers,
the canonical phone sequence, and the loaded full-63 Fisher refs, score each
canonical position: Fisher-weighted deviation to its reference trajectory + which
reference the produced sub-trajectory is actually NEAREST (error identity).
Lifted from research/2026-06-06-audio-union/score_trajectories.py (validated).
"""
from __future__ import annotations

import numpy as np

K = 8  # timepoints per resampled sub-trajectory

# Single source of truth for the consonant set (imported by attribution.py).
CONSONANTS = set("p b t d k ɡ tʃ dʒ f v θ ð s z ʃ ʒ h m n ŋ l ɹ w j ʔ ɾ ɫ ʈ ɖ ɳ β ʋ ɬ c x ɥ".split())


def lev_align(ref, hyp):
    """Levenshtein-align two phone lists -> [(ref_phone|None, hyp_phone|None), ...]."""
    n, m = len(ref), len(hyp)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1, D[i-1][j-1] + (ref[i-1] != hyp[j-1]))
    # Backtrace, preferring match/substitution, then deletion, then insertion.
    i, j, out = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            out.append((ref[i-1], hyp[j-1]))
            i -= 1
            j -= 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            out.append((ref[i-1], None))
            i -= 1
        else:
            out.append((None, hyp[j-1]))
            j -= 1
    return out[::-1]


def resample(seg, k=K):
    """Linearly resample a [T, D] segment to [k, D]; None if it has fewer than 2 frames."""
    if len(seg) < 2:
        return None
    src = np.linspace(0, 1, len(seg))
    dst = np.linspace(0, 1, k)
    return np.stack([np.interp(dst, src, seg[:, d]) for d in range(seg.shape[1])], 1)


def position_spans(centers, n_frames):
    """For each slot center, the [lo, hi) frame span = inter-neighbour midpoints
    (CTC is peaky). centers is a list of float|None. Returns list of (lo, hi)|None."""
    spans = []
    for i, c in enumerate(centers):
        if c is None:
            spans.append(None)
            continue
        prev = next((centers[j] for j in range(i - 1, -1, -1) if centers[j] is not None), None)
        nxt = next((centers[j] for j in range(i + 1, len(centers)) if centers[j] is not None), None)
        # Edge slots with no aligned neighbour fall back to +/-3 frames around the center.
        lo = int((prev + c) / 2) if prev is not None else int(c - 3)
        hi = int((c + nxt) / 2) if nxt is not None else int(c + 3)
        spans.append((max(0, lo), min(n_frames, hi + 1)))
    return spans


def tdist(P, R, fisher):
    """Fisher-weighted mean per-timepoint distance (fisher all-ones -> euclidean)."""
    return float(np.sqrt((fisher * (P - R) ** 2).sum(1)).mean())


def nearest_ref(P, ref_stack, ref_names, fisher):
    """Name of the reference trajectory in ref_stack[N,K,26] with the smallest tdist to P."""
    d = np.sqrt((fisher[None] * (P[None] - ref_stack) ** 2).sum(2)).mean(1)
    return ref_names[int(d.argmin())]
```
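`nearest_ref` is the vectorized form of calling `tdist` against every reference and taking the argmin. `ref_stack` is [N, K, 26], and `fisher[None]` and `P[None]` broadcast against it. The standalone check below restates both functions and confirms that equivalence on random data. The loop version is only a readable reference, not something to ship. The `ph{i}` names are stand-ins for the 63 phone labels.

```python
import numpy as np

K, D = 8, 26


def tdist(P, R, fisher):
    """Fisher-weighted mean per-timepoint distance (same formula as trajectory_scorer.tdist)."""
    return float(np.sqrt((fisher * (P - R) ** 2).sum(1)).mean())


def nearest_ref(P, ref_stack, ref_names, fisher):
    """Vectorized: broadcast P[K,D] against ref_stack[N,K,D], reduce over D, then over K."""
    d = np.sqrt((fisher[None] * (P[None] - ref_stack) ** 2).sum(2)).mean(1)
    return ref_names[int(d.argmin())]


def nearest_ref_loop(P, ref_stack, ref_names, fisher):
    """Readable reference: one tdist call per reference trajectory."""
    dists = [tdist(P, ref_stack[i], fisher) for i in range(len(ref_names))]
    return ref_names[int(np.argmin(dists))]


rng = np.random.default_rng(0)
names = [f"ph{i}" for i in range(63)]
refs = rng.normal(size=(63, K, D))
fisher = rng.uniform(0.5, 2.0, size=(K, D))
probe = refs[17] + 0.01 * rng.normal(size=(K, D))  # a slightly perturbed copy of ref 17
```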
- [ ] Step 4: Run to verify it passes
Run: uv run --extra dev pytest packages/audio/tests/test_trajectory_scorer.py -v
Expected: PASS (5 passed)
- [ ] Step 5: Commit
```bash
git add packages/audio/src/phonolex_audio/trajectory_scorer.py packages/audio/tests/test_trajectory_scorer.py
git commit -m "feat(audio): trajectory scorer module (pure) - lifted from validated research scorer"
```
## Task 4: TrajectoryAnalyzer (emitter + refs wiring)
Files:
- Create: packages/audio/src/phonolex_audio/analyzer.py
Owns the loaded FeatureEmitter and the parsed refs. Mirrors the existing emitter usage in score_trajectories.py (em._decode_audio → em._emit → em.ph2id → _forced_align_positions). Produces the per-position list. Attribution is attached in Task 6. This task has no fast unit test (it requires the real model); it is exercised by the Task 8 smoke. Keep it thin — all math is Task 3 functions.
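The slot-to-frame bookkeeping in `_emit_and_align` is the easiest part to get subtly wrong, so here it is in isolation. Canonical phones missing from `ph2id` get no target token and therefore no center. Every other slot's center is the median of the frames the forced alignment assigned to its token index. This is a standalone sketch. `slot_centers` is a hypothetical helper name. It assumes `_forced_align_positions` returns a per-frame integer array holding the target-token index, with some out-of-range value such as -1 on frames not assigned to any token.

```python
import numpy as np


def slot_centers(canon, ph2id, pos):
    """Map each canonical slot to the median frame of its aligned token, or None.

    canon: canonical phone strings; ph2id: phone -> emitter token id;
    pos: per-frame index into the target-token list (assumed -1 where unassigned).
    Returns (target_ids, centers).
    """
    tids, slot_tp = [], []
    for p in canon:
        aid = ph2id.get(p)
        if aid is None:
            slot_tp.append(None)  # phone the emitter cannot align: no center
        else:
            slot_tp.append(len(tids))  # this slot's index into the target-token list
            tids.append(aid)
    centers = []
    for tp in slot_tp:
        frames = np.where(pos == tp)[0] if tp is not None else np.array([], dtype=int)
        centers.append(float(np.median(frames)) if frames.size else None)
    return tids, centers


pos = np.array([-1, 0, 0, 0, -1, 1, 1, -1])
tids, centers = slot_centers(["k", "?", "a"], {"k": 5, "a": 7}, pos)
```

Here "k" owns frames 1 to 3 (center 2.0), "?" is unknown to the emitter (no center), and "a" owns frames 5 to 6 (center 5.5).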
- [ ] Step 1: Implement
```python
# packages/audio/src/phonolex_audio/analyzer.py
"""TrajectoryAnalyzer: load the keeper emitter + full-63 Fisher refs once, then
analyze (audio, canonical_phones) -> per-position trajectory deviation. The
attribution read is attached by attribution.AttributionModel (Task 6).
Mirrors research/.../score_trajectories.py and attribution_classifier.py feature
extraction, but corpus-free at serve time.
"""
from __future__ import annotations

import json
from pathlib import Path

import numpy as np

from phonolex_audio import trajectory_scorer as ts


class TrajectoryAnalyzer:
    def __init__(
        self,
        checkpoint: str,
        vectors: str,
        traj_refs: str,
        device: str | None = None,
        attribution=None,  # optional AttributionModel (Task 6); None = positions only
    ):
        from phonolex_audio.feature_emitter import FeatureEmitter

        self.em = FeatureEmitter(checkpoint=checkpoint, vectors_csv=vectors, device=device)
        self.attribution = attribution
        raw = json.loads(Path(traj_refs).read_text())
        if "_fisher" in raw:
            # Fisher-weighted refs: {"_fisher": [...], "refs": {phone: {"traj": [[...]]}}}
            self.fisher = np.array(raw["_fisher"], float)
            src = raw["refs"]
        else:
            # Unweighted refs: an all-ones Fisher makes tdist plain euclidean
            self.fisher = np.ones((ts.K, 26))
            src = raw
        self.refs = {p: np.array(v["traj"], float) for p, v in src.items() if v.get("traj")}
        self.ref_names = list(self.refs)
        self.ref_stack = np.stack([self.refs[n] for n in self.ref_names])  # [63, K, 26]

    def _emit_and_align(self, audio_bytes: bytes, canon: list[str]):
        """-> (e[T,26], centers[list], produced[list])."""
        from phonolex_audio.feature_emitter import _forced_align_positions

        arr = self.em._decode_audio(audio_bytes)
        e, lp = self.em._emit(arr)
        # Placeholder accessor: see the implementer note below before relying on it.
        produced = self.em._decode_transcript(lp) if hasattr(self.em, "_decode_transcript") else []
        # Map canonical slots to target tokens; phones unknown to the emitter get no slot.
        tids, slot_tp = [], []
        for p in canon:
            aid = self.em.ph2id.get(p)
            if aid is not None:
                slot_tp.append(len(tids))
                tids.append(aid)
            else:
                slot_tp.append(None)
        if not tids:
            return e, [None] * len(canon), produced
        pos = _forced_align_positions(lp, tids)
        # Slot center = median frame the forced alignment assigned to that token.
        centers = [
            float(np.median(np.where(pos == tp)[0])) if (tp is not None and np.any(pos == tp)) else None
            for tp in slot_tp
        ]
        return e, centers, produced

    def positions(self, e, centers, canon):
        """Per-canonical-position deviation + nearest reference."""
        spans = ts.position_spans(centers, e.shape[0])
        out = []
        for i, ph in enumerate(canon):
            span = spans[i]
            if span is None or ph not in self.refs:
                out.append({"phone": ph, "deviation": None, "nearest": None})
                continue
            P = ts.resample(e[span[0]:span[1]])
            if P is None:
                out.append({"phone": ph, "deviation": None, "nearest": None})
                continue
            out.append({
                "phone": ph,
                "deviation": ts.tdist(P, self.refs[ph], self.fisher),
                "nearest": ts.nearest_ref(P, self.ref_stack, self.ref_names, self.fisher),
            })
        return out
```
Note for the implementer: Confirm the emitter's produced-transcript accessor name by reading feature_emitter.py (the research scorer used lp + lev_align(canon, prod) where prod came from the data row, not the emitter). If FeatureEmitter exposes no produced decode, derive produced from the run-coherence decode already in review() / _decode_trajectory, or pass produced through from the caller. The attribution accent-score (Task 6) needs produced; wire whichever path feature_emitter.py actually provides and adjust this stub to match — do not invent a method.
- [ ] Step 2: Commit (no test yet — covered by Task 8 smoke)
```bash
git add packages/audio/src/phonolex_audio/analyzer.py
git commit -m "feat(audio): TrajectoryAnalyzer - emitter + full-63 Fisher refs, per-position scoring"
```
## Task 5: Bake the attribution model (offline)
Files:
- Create: packages/audio/scripts/bake_attribution_model.py
The research classifier (attribution_classifier.py) computes per-source centroids + feature standardization + categorical priors from the labeled corpora at runtime. Serving cannot do that. This script runs that computation once and writes attribution_model.json (small; committable or drive-stored): {feature_names, mean[6], std[6], centroids:{source:[6]}, priors:{l1:{...}, dev:{...}}}. It reuses attribution_classifier.py feature extraction and attribution_prototype.build_prior/loglik unchanged — this is a thin driver, not new logic.
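Serving trusts this JSON completely. A cheap structural check at the end of the bake, and again when `AttributionModel` loads the file, catches a mismatched or truncated artifact before it silently misclassifies. The sketch below encodes only the contract stated above: six features, per-source centroids, and `l1`/`dev` priors. `validate_attribution_model` is a hypothetical name, and the exact checks are a suggestion.

```python
import math

REQUIRED_KEYS = ("feature_names", "mean", "std", "centroids", "priors")


def validate_attribution_model(m: dict, n_features: int = 6) -> None:
    """Raise ValueError if a baked attribution model breaks the serve-time contract."""
    missing = [k for k in REQUIRED_KEYS if k not in m]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    for key in ("feature_names", "mean", "std"):
        if len(m[key]) != n_features:
            raise ValueError(f"{key} has {len(m[key])} entries, expected {n_features}")
    # classify() divides by std, so every entry must be finite and strictly positive.
    if not all(math.isfinite(s) and s > 0 for s in m["std"]):
        raise ValueError("std entries must be finite and > 0")
    if not m["centroids"]:
        raise ValueError("no source centroids")
    for source, c in m["centroids"].items():
        if len(c) != n_features:
            raise ValueError(f"centroid {source!r} has {len(c)} entries, expected {n_features}")
    for group in ("l1", "dev"):
        if group not in m["priors"]:
            raise ValueError(f"priors missing {group!r} (needed for the accent score)")
```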
- [ ] Step 1: Implement (driver over existing research code; run on the machine with the drive + corpora)
```python
# packages/audio/scripts/bake_attribution_model.py
r"""Bake the attribution serving model ONCE from the labeled union.

Reuses research/2026-06-06-audio-union/attribution_classifier.py feature
extraction + attribution_prototype priors. Output attribution_model.json is the
ONLY artifact the serving attribution module needs; no corpora at serve time.

    cd research/2026-06-06-audio-union && PYTORCH_ENABLE_MPS_FALLBACK=1 uv run \
      --with torch --with transformers --with librosa --with soundfile --with numpy \
      python ../../packages/audio/scripts/bake_attribution_model.py \
      --out /Volumes/ExternalData1/audio-union/attribution_model.json
"""
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path

import numpy as np

# parents[3] = repo root (scripts -> audio -> packages -> root)
RESEARCH = Path(__file__).resolve().parents[3] / "research/2026-06-06-audio-union"
sys.path.insert(0, str(RESEARCH))


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--out", required=True)
    args = ap.parse_args()

    # Import the validated research harness and reuse its subject-feature builder.
    import attribution_classifier as AC

    # AC.main() prints a confusion matrix; we need its intermediate (F, lab, priors).
    # Step 2 refactors AC to expose build_subject_features() returning (F[N,6],
    # labels[N], feature_names, priors_dict): a pure extraction of AC.main lines
    # 49–148 with the LOSO/print block removed.
    F, lab, feature_names, priors = AC.build_subject_features()
    F = np.asarray(F, float)
    lab = np.asarray(lab)  # the elementwise `lab == pop` masks below need an array
    POP2SRC = AC.POP2SRC

    mean = F.mean(0)
    std = F.std(0) + 1e-9  # epsilon keeps a constant feature from dividing by zero
    Z = (F - mean) / std

    centroids = {}
    for pop in sorted(set(lab.tolist())):
        source = POP2SRC[pop]
        if source in centroids:
            # Two populations mapping to one source would silently overwrite a centroid.
            raise SystemExit(f"[bake] populations share source {source!r}; decide how to pool them first")
        centroids[source] = Z[lab == pop].mean(0).tolist()

    out = {
        "feature_names": feature_names,  # ["g","x","cg","cx","rate","a"]
        "mean": mean.tolist(),
        "std": std.tolist(),
        "centroids": centroids,  # source -> standardized 6-vector
        "priors": priors,  # {"l1": {...}, "dev": {...}} for accent-score
    }
    Path(args.out).write_text(json.dumps(out))
    counts = {POP2SRC[p]: int((lab == p).sum()) for p in sorted(set(lab.tolist()))}
    print(f"[bake] wrote {args.out}: sources={list(centroids)}, subjects={counts}")


if __name__ == "__main__":
    main()
```
- [ ] Step 2: Refactor `attribution_classifier.py` to expose `build_subject_features()`. Extract `main` lines 49–148 into a function returning `(F, lab, ["g","x","cg","cx","rate","a"], {"l1": ..., "dev": ...})`, and keep `main()` calling it before doing the LOSO print. This keeps the research entrypoint working and gives the bake script a clean seam.
- [ ] Step 3: Run the bake
Run: the command in the script docstring.
Expected: attribution_model.json written (a few KB), prints subject counts per source.
- [ ] Step 4: Commit (the script + the research refactor; the JSON if small enough — else document it stays on the drive like the weights)
```bash
git add packages/audio/scripts/bake_attribution_model.py research/2026-06-06-audio-union/attribution_classifier.py
git commit -m "feat(audio): bake attribution serving model (centroids+std+priors) from validated harness"
```
## Task 6: Attribution serve-time module
Files:
- Create: packages/audio/src/phonolex_audio/attribution.py
- Test: packages/audio/tests/test_attribution.py
Loads attribution_model.json; given the per-position arrays already computed by the analyzer (gradient g, excursion x, consonant-restricted cg/cx, rate, and the produced/canonical phone pair list for accent-score), produces {source, distances:{source:dist}} by standardize → nearest centroid. The 6-feature extraction logic is lifted from attribution_classifier.py:114–137; accent-score from :68–76 using the baked priors (no corpora).
- [ ] Step 1: Write the failing test (synthetic features + a tiny in-memory baked model)
```python
# packages/audio/tests/test_attribution.py
import json

import numpy as np

from phonolex_audio.attribution import AttributionModel


def _tiny_model():
    # two sources, 6 features; 'motor' centroid far from 'typical' in feature space
    return {
        "feature_names": ["g", "x", "cg", "cx", "rate", "a"],
        "mean": [0.0] * 6,
        "std": [1.0] * 6,
        "centroids": {"typical": [0, 0, 0, 0, 0, 0], "motor": [3, -3, 3, -3, 3, 0]},
        "priors": {"l1": {}, "dev": {}},
    }


def test_classify_nearest_centroid(tmp_path):
    p = tmp_path / "m.json"
    p.write_text(json.dumps(_tiny_model()))
    am = AttributionModel(str(p))
    feats = np.array([3, -3, 3, -3, 3, 0], float)  # exactly on 'motor'
    res = am.classify(feats)
    assert res["source"] == "motor"
    assert set(res["distances"]) == {"typical", "motor"}
```
- [ ] Step 2: Run to verify it fails
Run: uv run --extra dev pytest packages/audio/tests/test_attribution.py -v
Expected: FAIL (module not found)
- [ ] Step 3: Implement
```python
# packages/audio/src/phonolex_audio/attribution.py
"""Serve-time 4-way source attribution (typical/accent/developmental/motor).

Loads the baked attribution_model.json (Task 5). classify() standardizes a clip's
6-feature vector with the baked mean/std and returns the nearest source centroid.
Feature *extraction* (g/x/cg/cx/rate/a) is done by the analyzer using these helpers
so the math matches attribution_classifier.py exactly.
"""
from __future__ import annotations

import json
from pathlib import Path

import numpy as np

# Single source of truth; used by the ported cg/cx helpers.
from phonolex_audio.trajectory_scorer import CONSONANTS  # noqa: F401


class AttributionModel:
    def __init__(self, path: str):
        m = json.loads(Path(path).read_text())
        self.feature_names = m["feature_names"]
        self.mean = np.array(m["mean"], float)
        self.std = np.array(m["std"], float)
        self.centroids = {k: np.array(v, float) for k, v in m["centroids"].items()}
        self.priors = m["priors"]

    def classify(self, feats: np.ndarray) -> dict:
        """Standardize with the baked mean/std, then pick the nearest source centroid."""
        z = (np.asarray(feats, float) - self.mean) / self.std
        dist = {s: float(np.linalg.norm(z - c)) for s, c in self.centroids.items()}
        return {"source": min(dist, key=dist.get), "distances": dist}
```
(The g/x/cg/cx/rate/a extraction + accent_score(priors, subs) helper are lifted into attribution.py as module functions and called by analyzer.analyze(); port them verbatim from attribution_classifier.py:114–137 and :68–76, swapping the runtime build_prior calls for loglik against self.priors. Add a unit test for accent_score with a tiny prior dict.)
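The bake (Task 5) and `classify` only agree if both sides standardize with the same mean/std and the centroids live in standardized space. A small standalone round trip makes that contract concrete. `bake` and `classify` here are hypothetical stand-ins that mirror the Task 5 driver and `AttributionModel.classify`, run on made-up 2-feature data instead of the real 6.

```python
import numpy as np


def bake(F, labels):
    """Task 5 in miniature: per-feature mean/std, then per-label centroids in z-space."""
    mean = F.mean(0)
    std = F.std(0) + 1e-9
    Z = (F - mean) / std
    centroids = {lab: Z[labels == lab].mean(0) for lab in sorted(set(labels.tolist()))}
    return mean, std, centroids


def classify(feats, mean, std, centroids):
    """AttributionModel.classify in miniature: standardize, then nearest centroid."""
    z = (np.asarray(feats, float) - mean) / std
    dist = {s: float(np.linalg.norm(z - c)) for s, c in centroids.items()}
    return min(dist, key=dist.get), dist


F = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array(["typical", "typical", "motor", "motor"])
mean, std, centroids = bake(F, labels)
```

Classifying a raw (unstandardized) vector against z-space centroids would silently skew the result. So would re-standardizing with statistics computed at serve time. That is why mean and std travel inside the baked JSON.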
- [ ] Step 4: Run to verify it passes
Run: uv run --extra dev pytest packages/audio/tests/test_attribution.py -v
Expected: PASS
- [ ] Step 5: Wire into `TrajectoryAnalyzer.analyze()` by adding to `analyzer.py`:
```python
# method of TrajectoryAnalyzer (analyzer.py)
def analyze(self, audio_bytes: bytes, canon: list[str]) -> dict:
    e, centers, produced = self._emit_and_align(audio_bytes, canon)
    positions = self.positions(e, centers, canon)
    result = {"positions": positions}
    if self.attribution is not None:
        feats = self._attribution_features(e, centers, canon, produced)
        result["attribution"] = self.attribution.classify(feats)
    return result
```
(Implement _attribution_features using the ported g/x/cg/cx/rate/accent helpers; self.attribution is an optional AttributionModel passed to __init__.)
- [ ] Step 6: Commit
```bash
git add packages/audio/src/phonolex_audio/attribution.py packages/audio/src/phonolex_audio/analyzer.py packages/audio/tests/test_attribution.py
git commit -m "feat(audio): serve-time 4-way attribution (baked centroids) wired into analyzer"
```
## Task 7: /analyze endpoint + launch flags
Files:
- Modify: packages/audio/src/phonolex_audio/server.py (add endpoint after /feature-review)
- Modify: packages/audio/src/phonolex_audio/__main__.py (flags + analyzer registration)
- [ ] Step 1: Add the endpoint (mirrors `/feature-review`: `canonical` is a JSON phoneme array)
```python
# inside build_app(...), next to /feature-review; assumes server.py already
# imports File, Form, HTTPException and UploadFile from fastapi
@app.post("/analyze")
async def analyze(
    audio: UploadFile = File(...),
    canonical: str = Form(...),
) -> dict:
    """Trajectory analysis: audio + canonical (JSON phoneme array) ->
    per-position Fisher trajectory deviation + 4-way source attribution.
    Requires the trajectory analyzer to be loaded (--trajectory-refs)."""
    import json as _json

    analyzer = getattr(app.state, "analyzer", None)
    if analyzer is None:
        raise HTTPException(status_code=400, detail="Trajectory analyzer not loaded")
    try:
        canon = _json.loads(canonical)
    except _json.JSONDecodeError:
        raise HTTPException(status_code=400, detail="canonical must be a JSON array of phonemes")
    if not isinstance(canon, list) or not all(isinstance(x, str) for x in canon):
        raise HTTPException(status_code=400, detail="canonical must be a JSON array of phoneme strings")
    raw = await audio.read()
    return analyzer.analyze(raw, canon)
```
Set app.state.analyzer in build_app from an optional analyzer=None kwarg, and add "analyze": app.state.analyzer is not None to /health.
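The request validation in the endpoint is plain Python. If it is factored out, it can be unit-tested without starting FastAPI. `parse_canonical` below is a hypothetical helper with the same rules and messages. The endpoint would call it and turn `ValueError` into `HTTPException(status_code=400, detail=str(exc))`.

```python
import json


def parse_canonical(raw: str) -> list[str]:
    """Parse the /analyze `canonical` form field into a list of phoneme strings."""
    try:
        canon = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("canonical must be a JSON array of phonemes") from exc
    if not isinstance(canon, list) or not all(isinstance(x, str) for x in canon):
        raise ValueError("canonical must be a JSON array of phoneme strings")
    return canon
```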
- [ ] Step 2: Add launch flags in `__main__.py`:
```python
ap.add_argument("--trajectory-refs", default=None,
                help="full-63 Fisher refs JSON (refs_fisher.json); enables /analyze")
ap.add_argument("--attribution-model", default=None,
                help="baked attribution_model.json; enables the attribution read in /analyze")
```
After building transcribers, when --feature-checkpoint and --trajectory-refs are both given, construct TrajectoryAnalyzer(checkpoint, vectors, traj_refs, device) (+ optional AttributionModel) and pass it into build_app(..., analyzer=...). Use resolve_serving_config() so env vars work when flags are omitted.
- [ ] Step 3: Run the existing host tests (ensure no regression to /transcribe, /compare, /feature-review, /acoustic)
Run: uv run --extra dev --extra inference pytest packages/audio/tests/test_server.py -v
Expected: PASS (existing tests green; analyzer is optional so stub-built apps are unaffected)
- [ ] Step 4: Commit
```bash
git add packages/audio/src/phonolex_audio/server.py packages/audio/src/phonolex_audio/__main__.py
git commit -m "feat(audio): /analyze endpoint + launch flags (trajectory refs + attribution)"
```
## Task 8: Local end-to-end smoke (the "watch it run")
Files:
- Create: packages/audio/tests/test_analyze_smoke.py (@pytest.mark.slow, local-only)
Loads the real keeper via state_serve.pt and scores one short clip from the union test set. The test asserts the output structure plus a basic sanity signal: at least one position is scored, and every scored position has a non-negative deviation and a nearest reference. This is the proof the harness runs end-to-end.
- [ ] Step 1: Implement
```python
# packages/audio/tests/test_analyze_smoke.py
import json
from pathlib import Path

import pytest

DRIVE = Path("/Volumes/ExternalData1/audio-union")
pytestmark = pytest.mark.slow


@pytest.mark.skipif(
    not (DRIVE / "model_feat_traj_target/state_serve.pt").exists(),
    reason="keeper serving artifact not on this machine",
)
def test_analyze_runs_end_to_end():
    from phonolex_audio.analyzer import TrajectoryAnalyzer

    analyzer = TrajectoryAnalyzer(
        checkpoint=str(DRIVE / "model_feat_traj_target/state_serve.pt"),
        vectors=str(DRIVE / "model_feat_traj_target/vectors.csv"),
        traj_refs=str(DRIVE / "refs_fisher.json"),
    )
    # First test row; the file handle is closed before the (slow) analysis runs.
    with open(DRIVE / "train_pkg/test.jsonl") as f:
        row = json.loads(next(f))
    audio = (DRIVE / "train_pkg/wavroot" / row["wav"]).read_bytes()
    out = analyzer.analyze(audio, row.get("canonical") or [])

    assert "positions" in out and isinstance(out["positions"], list)
    scored = [p for p in out["positions"] if p["deviation"] is not None]
    assert scored, "expected at least one scored position"
    for p in scored:
        assert p["deviation"] >= 0 and p["nearest"] is not None
```
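The smoke reads the first row of test.jsonl and falls back to an empty canonical list when that row has none. In that case nothing is scored, and `assert scored` fails for a reason unrelated to the harness. If that turns out to matter, a hypothetical helper like the one below picks the first row that does carry a canonical sequence. The `canonical` and `wav` field names are taken from the smoke test, not verified against the file.

```python
import json
from typing import Iterable


def first_row_with_canonical(lines: Iterable[str]) -> dict:
    """Return the first JSONL row whose "canonical" field is a non-empty list."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        row = json.loads(line)
        if row.get("canonical"):
            return row
    raise LookupError("no row with a non-empty canonical sequence")
```

In the smoke this would replace the `next(f)` read: `with open(DRIVE / "train_pkg/test.jsonl") as f: row = first_row_with_canonical(f)`.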
- [ ] Step 2: Run the smoke
Run: PYTORCH_ENABLE_MPS_FALLBACK=1 uv run --extra dev --extra inference pytest packages/audio/tests/test_analyze_smoke.py -v -m slow
Expected: PASS (loads keeper, scores a clip). This is the milestone to show the user.
- [ ] Step 3: Commit
```bash
git add packages/audio/tests/test_analyze_smoke.py
git commit -m "test(audio): end-to-end /analyze smoke on the real keeper (slow, local-only)"
```
## Task 9: RunPod packaging (buildable, NOT deployed)
Files:
- Create: packages/audio/deploy/Dockerfile, packages/audio/deploy/handler.py, packages/audio/deploy/README.md
Adapt the archived pattern (git show archive/csp-generation-v5.2:packages/generation/server/Dockerfile and the 2026-04-16-runpod-serverless-deployment.md plan) to this host. The image bakes the HF base model cache (so Wav2Vec2Model.from_pretrained is offline), copies the serving artifacts (or mounts a network volume), and runs the FastAPI host (or a serverless handler.py wrapping analyzer.analyze). All paths via the Task 2 env vars. Do not deploy — this task ends at a buildable image + a README documenting docker build and the env contract for staging vs prod endpoints.
- [ ] Step 1: Write `Dockerfile` (`python:3.11-slim` base, `uv pip install` the `inference` extra, `HF_HOME` pre-warmed with `facebook/wav2vec2-lv-60-espeak-cv-ft`, `ENV PHONOLEX_AUDIO_DEVICE=cuda`).
- [ ] Step 2: Write `handler.py`. This is the RunPod serverless entry: it cold-loads `TrajectoryAnalyzer` once, then runs `def handler(job): return analyzer.analyze(b64decode(job["input"]["audio"]), job["input"]["canonical"])`. RunPod delivers the request payload under the job's `"input"` key.
- [ ] Step 3: Write `README.md`: `docker build`, the env contract (`PHONOLEX_AUDIO_*`), and the note that staging and prod are two separate endpoints selected by the Worker's `AUDIO_INFERENCE_URL` per environment. No deploy step.
- [ ] Step 4: Commit
```bash
git add packages/audio/deploy/
git commit -m "build(audio): RunPod-buildable serving image + serverless handler (env-driven; not deployed)"
```
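For Step 2, here is a minimal sketch of `handler.py` under these assumptions:

- The payload arrives under the job's `"input"` key as `{"audio": <base64 string>, "canonical": [phones]}`. The inner shape is this plan's choice.
- The analyzer is built from the Task 2 env contract on the first job.
- The optional `analyzer` argument exists only so the function can be exercised with a stand-in.

Attribution is left off for brevity.

```python
import base64

_ANALYZER = None  # cold-loaded on the first job, then reused for the worker's lifetime


def _get_analyzer():
    """Build TrajectoryAnalyzer once from the PHONOLEX_AUDIO_* env contract (Task 2)."""
    global _ANALYZER
    if _ANALYZER is None:
        # Deferred imports: torch/transformers load only when a real job arrives.
        from phonolex_audio.analyzer import TrajectoryAnalyzer
        from phonolex_audio.serving_config import resolve_serving_config

        cfg = resolve_serving_config()
        _ANALYZER = TrajectoryAnalyzer(cfg.checkpoint, cfg.vectors, cfg.traj_refs, cfg.device)
    return _ANALYZER


def handler(job, analyzer=None):
    """RunPod serverless entry point: job["input"] = {"audio": base64 str, "canonical": [str]}."""
    payload = job["input"]
    canon = payload.get("canonical")
    # Validate before touching the model so bad requests fail fast and cheaply.
    if not isinstance(canon, list) or not all(isinstance(x, str) for x in canon):
        raise ValueError("canonical must be a list of phoneme strings")
    audio = base64.b64decode(payload["audio"])
    if analyzer is None:
        analyzer = _get_analyzer()
    return analyzer.analyze(audio, canon)
```

The real `handler.py` would end with `runpod.serverless.start({"handler": handler})`, the RunPod Python SDK entry point. That line is omitted here so the sketch imports without the SDK. Loading the analyzer at import time instead of on the first job would move the cold-start cost to worker start-up; either satisfies "cold-load once".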
## Task 10: Remove the /dev/* pages from the shippable frontend
Files:
- Modify: packages/web/frontend/src/main.tsx (drop the three /dev/* routes + their imports)
- Delete: components/tools/{AudioTranscribeViewer,PronunciationViewer,AcousticViewer}.tsx + their .test.tsx
- Delete: now-unused frontend services (services/acousticApi.ts, and the /pronounce and /compare paths in services/audioApi.ts if unused elsewhere). Verify with grep before deleting.
The dev pages are scratch and must not ship. Remove the routes and components so they cannot appear in a build. Worker-side audio routes are left for the (later) product surface; this task is frontend-only.
- [ ] Step 1: Grep for residual references before deleting
Run: grep -rn "AudioTranscribeViewer\|PronunciationViewer\|AcousticViewer\|acousticApi\|/dev/" packages/web/frontend/src
Expected: references only in main.tsx + the components/tests themselves.
- [ ] Step 2: Remove routes + imports from `main.tsx` (delete lines 9–11 imports and 30–32 routes).
- [ ] Step 3: Delete the components + tests + dead services (only those grep proved unused elsewhere).
- [ ] Step 4: Run the frontend test + build gates
Run: cd packages/web/frontend && npm run test && npm run build
Expected: PASS (no dangling imports; bundle builds without the dev pages).
- [ ] Step 5: Commit
```bash
git add -A packages/web/frontend
git commit -m "chore(frontend): remove /dev/* scratch pages from the shippable build (not product)"
```
## Self-Review Notes (gaps the implementer must close, not skip)

- Emitter produced-transcript path (Task 4 note): the research scorer got `produced` from the data row, not the emitter. Serving must derive it from the emission. Read `review()` / `_decode_trajectory` in `feature_emitter.py` and wire the actual decode; do not invent a method. Accent-score depends on it.
- `build_subject_features()` refactor (Task 5): the bake script assumes `attribution_classifier.py` is refactored to expose this seam. That refactor is Step 2 of Task 5; it must land before the bake runs in Step 3.
- Artifact commit policy: `state_serve.pt` (1.26 GB) is NEVER committed (gitignored, drive-only). `attribution_model.json`, `full63_trajectories.json`, `refs_fisher.json` and `vectors.csv` are small; decide per file whether to commit into `packages/audio/` or keep on the drive referenced by `serving_config`. Default: keep on the drive for now (dev-only), and commit later when a hosting target is chosen.
- No hosting deploy in this plan. Task 9 produces a buildable image only. The local↔RunPod capability is satisfied; the decision of where to serve is the user's, post-smoke.