PHON-154 Variant-Aware Matching - Phase 4a: Audio host + worker (per-variant /analyze)
For agentic workers: REQUIRED SUB-SKILL: superpowers:subagent-driven-development or superpowers:executing-plans. Steps use checkbox (- [ ]) syntax.
Goal: /api/audio/analyze scores a production against EVERY attested pronunciation of the target word and returns a per-variant array, so a speaker who produces a valid variant isn't scored as deviating. The expensive emitter forward pass runs ONCE; only the cheap align+score loop repeats per variant.
Architecture: In the host TrajectoryAnalyzer, split _emit_and_align into _emit_audio (forward pass, once) + _align (per canonical), add _score_one (build one variant's result) and analyze_variants(audio, canons). The server /analyze accepts an optional canonicals JSON field (list of phoneme-lists) → analyze_variants, falling back to the legacy single canonical. The Worker /api/audio/analyze fetches the primary + all variant pronunciations from D1, sends canonicals (primary first, de-duped), and returns { produced, variants: [...] }.
Tech Stack: Python (FastAPI/NumPy host), TypeScript (Hono Worker), pytest + Vitest.
Spec: docs/superpowers/specs/2026-06-15-phon-154-variant-aware-matching-design.md. Depends on: Phase 1 (variants column populated — already true pre-PHON-154; words.variants JSON has full per-variant phonemes).
Note on reseed: the Worker's variant fetch only needs a local D1 that has the words.variants column. That column predates PHON-154, so it is already present in the existing local seed and the Worker change is testable without a reseed. Only the Phase 1 matching columns are missing before the reseed, and this route doesn't use them. Best-variant selection for session attribution is Phase 4b (frontend).
Reference (read before editing)
- packages/audio/src/phonolex_audio/analyzer.py: _emit_and_align(audio_bytes, canon) (lines 53-74) decodes the audio, runs e, lp = self.em._emit(arr) (the expensive forward pass, which depends ONLY on the audio), builds produced, then force-aligns lp to canon → centers. positions(e, centers, canon) and _attribution_features(e, centers, canon, produced) are canon-specific and cheap. analyze(audio_bytes, canon) (lines 125-140) assembles {canonical, produced, positions, attribution, features}.
- packages/audio/src/phonolex_audio/server.py: /analyze (lines 167-186) reads canonical: str = Form(...), JSON-parses it to canon, and returns app.state.analyzer.analyze(raw, canon).
- packages/web/workers/src/routes/audio.ts: /analyze (~386-427) fetches SELECT phonemes FROM words (primary only), sends canonical: JSON.stringify(canonical) to the host, and returns the host body.
Task 1: Host - refactor analyzer + add analyze_variants
Files:
- Modify: packages/audio/src/phonolex_audio/analyzer.py
- Test: packages/audio/tests/test_analyzer_variants.py (create)
- [ ] Step 1: Write the failing test (orchestration-only, no real model)
Create packages/audio/tests/test_analyzer_variants.py. It bypasses __init__ (which loads the model) and stubs the emit/align/score internals to verify analyze_variants emits ONCE and scores each canonical:
from phonolex_audio.analyzer import TrajectoryAnalyzer


def test_analyze_variants_emits_once_scores_each():
    a = TrajectoryAnalyzer.__new__(TrajectoryAnalyzer)  # bypass model load
    calls = {"emit": 0, "align": []}

    def fake_emit_audio(audio_bytes):
        calls["emit"] += 1
        return "E", "LP", ["p", "ɹ"]  # e, lp, produced

    def fake_align(lp, canon):
        calls["align"].append(list(canon))
        return [0.0] * len(canon)

    def fake_score_one(e, centers, canon, produced):
        return {"canonical": list(canon), "produced": produced,
                "positions": [], "attribution": None, "features": None}

    a._emit_audio = fake_emit_audio
    a._align = fake_align
    a._score_one = fake_score_one

    out = a.analyze_variants(b"audio", [["k", "æ", "t"], ["k", "ɛ", "t"]])

    assert calls["emit"] == 1, "forward pass must run once, not per variant"
    assert calls["align"] == [["k", "æ", "t"], ["k", "ɛ", "t"]]
    assert out["produced"] == ["p", "ɹ"]
    assert [v["canonical"] for v in out["variants"]] == [["k", "æ", "t"], ["k", "ɛ", "t"]]


def test_analyze_single_still_works():
    """The legacy single-canonical analyze() must keep its flat shape."""
    a = TrajectoryAnalyzer.__new__(TrajectoryAnalyzer)
    a._emit_audio = lambda b: ("E", "LP", ["k"])
    a._align = lambda lp, canon: [0.0] * len(canon)
    a._score_one = lambda e, c, canon, p: {"canonical": list(canon), "produced": p,
                                           "positions": [], "attribution": None, "features": None}

    out = a.analyze(b"audio", ["k", "æ", "t"])

    assert out["canonical"] == ["k", "æ", "t"]
    assert "variants" not in out  # flat, not wrapped
- [ ] Step 2: Run it — confirm fail
Run: cd /Users/jneumann/Repos/PhonoLex && uv run --with torch --with transformers --with librosa --with soundfile python -m pytest packages/audio/tests/test_analyzer_variants.py -v
Expected: FAIL (analyze_variants / _emit_audio / _align / _score_one don't exist).
(If the heavy --with deps make collection slow, that's fine; the test itself never loads the model.)
- [ ] Step 3: Refactor analyzer.py
Replace _emit_and_align (lines 53-74) with the methods below (_emit_audio, _align, a thin back-compat _emit_and_align wrapper, and _score_one), then refactor analyze (125-140) to use them and add analyze_variants:
def _emit_audio(self, audio_bytes: bytes):
    """-> (e[T,26], lp, produced[list[str]]). The expensive forward pass.
    It depends only on the audio, so it runs ONCE per clip across all variants."""
    arr = self.em._decode_audio(audio_bytes)
    e, lp = self.em._emit(arr)
    produced = self._produced_from_lp(lp)
    return e, lp, produced

def _align(self, lp, canon: list[str]):
    """-> centers[list of float|None]: force-align the produced trajectory to
    a canonical phone sequence. Cheap; called once per attested pronunciation."""
    from phonolex_audio.feature_emitter import _forced_align_positions
    # tids: vocabulary ids of the in-vocabulary phones, in order.
    # slot_tp: for each canonical slot, its index into tids (None if out of vocabulary).
    tids, slot_tp = [], []
    for p in canon:
        aid = self.em.ph2id.get(p)
        if aid is not None:
            slot_tp.append(len(tids))
            tids.append(aid)
        else:
            slot_tp.append(None)
    if not tids:
        return [None] * len(canon)
    pos = _forced_align_positions(lp, tids)
    # Center of each slot = median frame index aligned to that slot's token.
    return [
        float(np.median(np.where(pos == tp)[0])) if (tp is not None and np.any(pos == tp)) else None
        for tp in slot_tp
    ]

def _emit_and_align(self, audio_bytes: bytes, canon: list[str]):
    """-> (e, centers, produced). Back-compat wrapper for any remaining callers;
    analyze() below calls _emit_audio and _align directly."""
    e, lp, produced = self._emit_audio(audio_bytes)
    return e, self._align(lp, canon), produced

def _score_one(self, e, centers, canon: list[str], produced: list[str]) -> dict:
    """Assemble one pronunciation's result: positions + attribution + features."""
    result = {
        "canonical": list(canon),
        "produced": produced,
        "positions": self.positions(e, centers, canon),
        "attribution": None,
        "features": None,
    }
    if self.attribution is not None:
        feats = self._attribution_features(e, centers, canon, produced)
        if feats is not None:
            result["features"] = [float(x) for x in feats]
            result["attribution"] = self.attribution.classify(feats)
    return result
Replace analyze with the following, and add analyze_variants directly after it:
def analyze(self, audio_bytes: bytes, canon: list[str]) -> dict:
    """audio + one canonical -> flat {canonical, produced, positions, attribution, features}."""
    e, lp, produced = self._emit_audio(audio_bytes)
    return self._score_one(e, self._align(lp, canon), canon, produced)

def analyze_variants(self, audio_bytes: bytes, canons: list[list[str]]) -> dict:
    """audio + all attested pronunciations -> {produced, variants:[per-canon result]}.
    The emitter forward pass runs ONCE; only align+score repeats per canonical."""
    e, lp, produced = self._emit_audio(audio_bytes)
    variants = [self._score_one(e, self._align(lp, canon), canon, produced) for canon in canons]
    return {"produced": produced, "variants": variants}
(Keep positions, _attribution_features, _produced_from_lp unchanged.)
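The center computation inside _align can be illustrated on toy data. The sketch below is a pure-Python stand-in, not the real method: centers_from_alignment is a hypothetical helper, and it assumes the forced aligner returns one token index per frame (the real _forced_align_positions output shape is not shown in this plan). It uses statistics.median in place of NumPy.

```python
from statistics import median


def centers_from_alignment(pos, slot_tp):
    """Return one center frame (or None) per canonical slot.

    pos     -- per-frame token index (assumed shape of the aligner output)
    slot_tp -- per-slot token index, or None for an out-of-vocabulary phone
    """
    centers = []
    for tp in slot_tp:
        # Frames aligned to this slot's token; empty for out-of-vocabulary slots.
        frames = [t for t, p in enumerate(pos) if p == tp] if tp is not None else []
        centers.append(float(median(frames)) if frames else None)
    return centers


# Toy alignment: 6 frames over 3 in-vocabulary phones; slot 1 is out of vocabulary.
pos = [0, 0, 1, 1, 1, 2]
slot_tp = [0, None, 1, 2]
print(centers_from_alignment(pos, slot_tp))  # [0.5, None, 3.0, 5.0]
```

Out-of-vocabulary slots keep their place in the output as None, so the centers list stays the same length as the canonical sequence.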
- [ ] Step 4: Run — confirm pass
Run: same pytest command. Expected: both tests PASS.
- [ ] Step 5: Commit
git add packages/audio/src/phonolex_audio/analyzer.py packages/audio/tests/test_analyzer_variants.py
git commit -m "feat(phon-154): analyzer.analyze_variants — emit once, score each pronunciation"
Task 2: Host server - /analyze accepts canonicals
Files:
- Modify: packages/audio/src/phonolex_audio/server.py (/analyze, ~167-186)
- [ ] Step 1: Update the endpoint
Replace the /analyze handler body so it accepts an optional canonicals JSON field (list of phoneme-lists); when present, return analyze_variants, else keep the legacy single-canonical behavior:
@app.post("/analyze")
async def analyze(
    audio: UploadFile = File(...),
    canonical: str = Form(None),
    canonicals: str = Form(None),
) -> dict:
    if app.state.analyzer is None:
        raise HTTPException(status_code=400, detail="Trajectory analyzer not loaded")
    raw = await audio.read()

    # Per-variant path: canonicals is a JSON list of phoneme lists.
    if canonicals is not None:
        try:
            canons = json.loads(canonicals)
        except Exception:
            raise HTTPException(status_code=400, detail="canonicals must be a JSON array of phoneme arrays")
        if not isinstance(canons, list) or not all(
            isinstance(c, list) and all(isinstance(p, str) for p in c) for c in canons
        ):
            raise HTTPException(status_code=400, detail="canonicals must be a JSON array of phoneme arrays")
        return app.state.analyzer.analyze_variants(raw, canons)

    # Legacy path: a single canonical phoneme list.
    if canonical is None:
        raise HTTPException(status_code=400, detail="canonical or canonicals required")
    try:
        canon = json.loads(canonical)
    except Exception:
        raise HTTPException(status_code=400, detail="canonical must be a JSON array of phonemes")
    if not isinstance(canon, list) or not all(isinstance(p, str) for p in canon):
        raise HTTPException(status_code=400, detail="canonical must be a JSON array of phoneme strings")
    return app.state.analyzer.analyze(raw, canon)
(Keep the existing imports; Form/File/UploadFile/HTTPException/json are already used in this file.)
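Because Step 2 only checks that the file parses, it may help to exercise the canonicals validation on its own. The following is a minimal sketch that mirrors the handler's checks as a pure function; parse_canonicals is a hypothetical name that does not exist in server.py, and it raises ValueError where the handler raises HTTPException.

```python
import json


def parse_canonicals(raw: str) -> list[list[str]]:
    """Parse and validate a canonicals form value; raise ValueError on bad input."""
    message = "canonicals must be a JSON array of phoneme arrays"
    try:
        canons = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(message) from exc
    # Same shape check as the handler: a list of lists of strings.
    ok = isinstance(canons, list) and all(
        isinstance(c, list) and all(isinstance(p, str) for p in c) for c in canons
    )
    if not ok:
        raise ValueError(message)
    return canons


print(parse_canonicals('[["k", "æ", "t"], ["k", "ɛ", "t"]]'))
```

Note that an empty outer list passes this check, exactly as it does in the handler above; analyze_variants would then return an empty variants array.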
- [ ] Step 2: Sanity check (no model needed)
Run: cd /Users/jneumann/Repos/PhonoLex && uv run --with fastapi python -c "import ast; ast.parse(open('packages/audio/src/phonolex_audio/server.py').read()); print('parse ok')"
Expected: parse ok. (Full endpoint behavior is verified in the local-host smoke during Phase 4b verification.)
- [ ] Step 3: Commit
git add packages/audio/src/phonolex_audio/server.py
git commit -m "feat(phon-154): /analyze accepts canonicals (per-variant) with legacy canonical fallback"
Task 3: Worker - fetch variants, send canonicals, return per-variant
Files:
- Modify: packages/web/workers/src/routes/audio.ts (/analyze, ~386-427)
- Test: packages/web/workers/src/__tests__/audio.test.ts
- [ ] Step 1: Write/extend the failing test
Add to audio.test.ts (the POST /api/audio/analyze describe block). This asserts that when D1 is seeded the route sends a canonicals field to the host and returns the {produced, variants} shape; when unseeded, the route still fails with a structured error (404/500). Use the existing analyzeForm helper + fetchMock pattern:
it('forwards canonicals (primary + variants) and returns the per-variant shape when seeded', async () => {
  const seeded = (await SELF.fetch('http://localhost/api/words/cat')).status === 200;
  if (!seeded) return; // covered by the unseeded structured-failure test below

  let seenBody = '';
  fetchMock.get('http://127.0.0.1:8000').intercept({
    path: '/analyze', method: 'POST',
    body: (b) => { seenBody = typeof b === 'string' ? b : ''; return true; },
  }).reply(200, { produced: ['k','æ','t'], variants: [
    { canonical: ['k','æ','t'], produced: ['k','æ','t'], positions: [], attribution: null, features: null },
  ]});

  const res = await SELF.fetch('http://localhost/api/audio/analyze', { method: 'POST', body: analyzeForm('cat') });
  expect(res.status).toBe(200);
  expect(seenBody).toContain('name="canonicals"');
  const body = await res.json() as { produced: string[]; variants: unknown[] };
  expect(Array.isArray(body.variants)).toBe(true);
});
- [ ] Step 2: Run — confirm fail (or skip if unseeded)
Run: cd packages/web/workers && npx vitest run src/__tests__/audio.test.ts -t "forwards canonicals"
Expected: with a seeded local D1, FAIL (route still sends canonical, not canonicals). If the local D1 is unseeded the test early-returns — in that case rely on Step 4's manual local check.
- [ ] Step 3: Update the route
In routes/audio.ts /analyze, change the D1 fetch + host forward. Replace the canonical lookup + fwd construction so it fetches variants too, builds the de-duped canonical list (primary first), and sends canonicals:
// canonical + variant pronunciations from D1 (host aligns the production
// against each so a valid variant isn't scored as a deviation).
const row = await c.env.DB.prepare('SELECT phonemes, variants FROM words WHERE word = ? LIMIT 1')
  .bind(target.trim().toLowerCase())
  .first<{ phonemes: string | null; variants: string | null }>();
if (!row || !row.phonemes) {
  return c.json({ detail: `Word not in lexicon: ${target}` }, 404);
}
const primary = JSON.parse(row.phonemes) as string[];
const canons: string[][] = [primary];
if (row.variants) {
  try {
    const vs = JSON.parse(row.variants) as Array<{ phonemes?: string[] }>;
    for (const v of vs) {
      // Skip empty variants and any sequence already in the list.
      if (Array.isArray(v.phonemes) && v.phonemes.length
          && !canons.some((c2) => c2.length === v.phonemes!.length && c2.every((p, i) => p === v.phonemes![i]))) {
        canons.push(v.phonemes);
      }
    }
  } catch { /* malformed variants → primary only */ }
}
const fwd = new FormData();
fwd.append('audio', file, file.name || 'clip');
fwd.append('canonicals', JSON.stringify(canons));
(Leave the audioFetch/warming/error/JSON handling below unchanged — it already returns the host body verbatim, which is now {produced, variants}.)
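For building host-side fixtures that match what the Worker sends, the "primary first, de-duped" rule can be restated in Python. This is only an illustrative sketch: build_canonicals is a hypothetical helper, not part of either package, and it assumes the words.variants JSON is a list of objects with an optional phonemes list, as the route code above does.

```python
import json


def build_canonicals(phonemes_json, variants_json):
    """Return the canonical list: primary first, then each distinct non-empty variant."""
    canons = [json.loads(phonemes_json)]
    if not variants_json:
        return canons
    try:
        variants = json.loads(variants_json)
    except json.JSONDecodeError:
        return canons  # malformed variants: primary only
    if not isinstance(variants, list):
        return canons
    for v in variants:
        phonemes = v.get("phonemes") if isinstance(v, dict) else None
        # Keep non-empty sequences that are not already in the list.
        if isinstance(phonemes, list) and phonemes and phonemes not in canons:
            canons.append(phonemes)
    return canons


variants = '[{"phonemes": ["k", "æ", "t"]}, {"phonemes": ["k", "ɛ", "t"]}]'
print(build_canonicals('["k", "æ", "t"]', variants))
```

Keeping the primary at index 0 means a consumer can always treat variants[0] in the host response as the primary pronunciation's result.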
- [ ] Step 4: Run tests + type-check
Run: cd packages/web/workers && npm run type-check && npm test
Expected: type-check clean; suite green. The new test passes when seeded; the existing unseeded /analyze tests still pass (404/500 path). If the local D1 is unseeded and you want to confirm the seeded path, do a local-only reseed (Phase 1 Task 5 steps, NOT committed) — optional here, fully exercised in Phase 4b verification.
- [ ] Step 5: Commit
git add packages/web/workers/src/routes/audio.ts packages/web/workers/src/__tests__/audio.test.ts
git commit -m "feat(phon-154): /api/audio/analyze fetches variants + forwards canonicals"
Phase 4a done - exit criteria
- Host analyze_variants emits once, scores each pronunciation; /analyze accepts canonicals; legacy single-canonical still works.
- Worker /api/audio/analyze fetches primary + variants, forwards canonicals, returns {produced, variants}.
- Host orchestration unit tests + worker tests green; type-check clean.
Next: Phase 4b (frontend)
audioAnalysisApi types → { produced, variants: VariantResult[] }; AudioAnalysisTool picks the best-matching variant (lowest mean deviation) per production to feed attributeSession; ProductionCard/DeviationOverlay render per-variant rows + scores; Lookup displays all variant pronunciations; superscript has_variants flag on result rows.
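The "lowest mean deviation" selection can be sketched independently of the frontend. The example below is in Python for consistency with the host code, although Phase 4b itself is TypeScript. It assumes each entry in positions carries a numeric deviation field; that field name is hypothetical, since this plan does not specify the shape of positions.

```python
def mean_deviation(variant):
    """Mean of the per-position deviations; infinity when none are available."""
    devs = [p["deviation"] for p in variant["positions"] if p.get("deviation") is not None]
    return sum(devs) / len(devs) if devs else float("inf")


def pick_best_variant(variants):
    """Return the variant with the lowest mean deviation, or None for an empty list."""
    # min() returns the first minimal item, so ties resolve to the earliest
    # variant, which is the primary pronunciation when it is sent first.
    return min(variants, key=mean_deviation) if variants else None


variants = [
    {"canonical": ["k", "æ", "t"], "positions": [{"deviation": 0.4}, {"deviation": 0.6}]},
    {"canonical": ["k", "ɛ", "t"], "positions": [{"deviation": 0.1}, {"deviation": 0.2}]},
]
print(pick_best_variant(variants)["canonical"])  # ['k', 'ɛ', 't']
```

Treating a variant with no scored positions as infinitely deviant keeps it from winning by default; whether that is the right policy is a Phase 4b decision.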