PHON-154 Variant-Aware Matching — Phase 4a: Audio host + worker (per-variant /analyze)

For agentic workers: REQUIRED SUB-SKILL: superpowers:subagent-driven-development or superpowers:executing-plans. Steps use checkbox (- [ ]).

Goal: /api/audio/analyze scores a production against EVERY attested pronunciation of the target word and returns a per-variant array, so a speaker who produces a valid variant isn't scored as deviating. The expensive emitter forward pass runs ONCE; only the cheap align+score loop repeats per variant.

Architecture: In the host TrajectoryAnalyzer, split _emit_and_align into _emit_audio (forward pass, once) + _align (per canonical), add _score_one (build one variant's result) and analyze_variants(audio, canons). The server /analyze accepts an optional canonicals JSON field (list of phoneme-lists) → analyze_variants, falling back to the legacy single canonical. The Worker /api/audio/analyze fetches the primary + all variant pronunciations from D1, sends canonicals (primary first, de-duped), and returns { produced, variants: [...] }.
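
The response contract described above can be written down as types. This is only a sketch: the field names come from this plan and VariantResult is the name Phase 4b uses, but AnalyzeVariantsResponse is a hypothetical name, and the element types of positions and attribution are left as Any because the plan does not spell them out.

```python
from __future__ import annotations

from typing import Any, Optional, TypedDict


class VariantResult(TypedDict):
    """One attested pronunciation scored against the production."""

    canonical: list[str]            # the pronunciation this entry was aligned to
    produced: list[str]             # identical across variants: it depends on the audio only
    positions: list[Any]            # per-slot scores; element type not specified in this plan
    attribution: Optional[Any]      # None when attribution is unavailable
    features: Optional[list[float]]


class AnalyzeVariantsResponse(TypedDict):  # hypothetical name
    produced: list[str]
    variants: list[VariantResult]   # same order as the canonicals sent (primary first)


# Illustrative payload for "cat" with two pronunciations (values are made up).
example: AnalyzeVariantsResponse = {
    "produced": ["k", "ɛ", "t"],
    "variants": [
        {"canonical": ["k", "æ", "t"], "produced": ["k", "ɛ", "t"],
         "positions": [], "attribution": None, "features": None},
        {"canonical": ["k", "ɛ", "t"], "produced": ["k", "ɛ", "t"],
         "positions": [], "attribution": None, "features": None},
    ],
}
```

Each variant entry keeps the flat shape that the legacy analyze() returns, so a consumer that already understands the single-canonical result can read any element of variants unchanged.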

Tech Stack: Python (FastAPI/NumPy host), TypeScript (Hono Worker), pytest + Vitest.

Spec: docs/superpowers/specs/2026-06-15-phon-154-variant-aware-matching-design.md. Depends on: Phase 1 (variants column populated — already true pre-PHON-154; words.variants JSON has full per-variant phonemes).

Note on reseed: the Worker's variant fetch only needs a local D1 that has the words.variants column, which every existing seed already has because variants predate PHON-154. The Worker change is therefore testable against the existing local seed; the only columns missing before the reseed are the Phase 1 matching columns, which this route doesn't use. Best-variant selection for session attribution is Phase 4b (frontend).


Reference (read before editing)

  • packages/audio/src/phonolex_audio/analyzer.py: _emit_and_align(audio_bytes, canon) (lines 53-74) decodes the audio, runs e, lp = self.em._emit(arr) (the expensive forward pass, which depends ONLY on the audio), builds produced, then force-aligns lp to canon to get centers. positions(e, centers, canon) and _attribution_features(e, centers, canon, produced) are canon-specific and cheap. analyze(audio_bytes, canon) (125-140) assembles {canonical, produced, positions, attribution, features}.
  • packages/audio/src/phonolex_audio/server.py: /analyze (167-186) reads canonical: str = Form(...), JSON-parses to canon, returns app.state.analyzer.analyze(raw, canon).
  • packages/web/workers/src/routes/audio.ts: /analyze (~386-427) fetches SELECT phonemes FROM words (primary only), sends canonical: JSON.stringify(canonical) to the host, returns the host body.

Task 1: Host — refactor analyzer + add analyze_variants

Files:
  • Modify: packages/audio/src/phonolex_audio/analyzer.py
  • Test: packages/audio/tests/test_analyzer_variants.py (create)

  • [ ] Step 1: Write the failing test (orchestration-only, no real model)

Create packages/audio/tests/test_analyzer_variants.py. It bypasses __init__ (which loads the model) and stubs the emit/align/score internals to verify analyze_variants emits ONCE and scores each canonical:

from phonolex_audio.analyzer import TrajectoryAnalyzer


def test_analyze_variants_emits_once_scores_each():
    a = TrajectoryAnalyzer.__new__(TrajectoryAnalyzer)  # bypass model load
    calls = {"emit": 0, "align": []}

    def fake_emit_audio(audio_bytes):
        calls["emit"] += 1
        return "E", "LP", ["p", "ɹ"]  # e, lp, produced

    def fake_align(lp, canon):
        calls["align"].append(list(canon))
        return [0.0] * len(canon)

    def fake_score_one(e, centers, canon, produced):
        return {"canonical": list(canon), "produced": produced,
                "positions": [], "attribution": None, "features": None}

    a._emit_audio = fake_emit_audio
    a._align = fake_align
    a._score_one = fake_score_one

    out = a.analyze_variants(b"audio", [["k", "æ", "t"], ["k", "ɛ", "t"]])
    assert calls["emit"] == 1, "forward pass must run once, not per variant"
    assert calls["align"] == [["k", "æ", "t"], ["k", "ɛ", "t"]]
    assert out["produced"] == ["p", "ɹ"]
    assert [v["canonical"] for v in out["variants"]] == [["k", "æ", "t"], ["k", "ɛ", "t"]]


def test_analyze_single_still_works():
    """The legacy single-canonical analyze() must keep its flat shape."""
    a = TrajectoryAnalyzer.__new__(TrajectoryAnalyzer)
    a._emit_audio = lambda b: ("E", "LP", ["k"])
    a._align = lambda lp, canon: [0.0] * len(canon)
    a._score_one = lambda e, c, canon, p: {"canonical": list(canon), "produced": p,
                                            "positions": [], "attribution": None, "features": None}
    out = a.analyze(b"audio", ["k", "æ", "t"])
    assert out["canonical"] == ["k", "æ", "t"]
    assert "variants" not in out  # flat, not wrapped
  • [ ] Step 2: Run it — confirm fail

Run: cd /Users/jneumann/Repos/PhonoLex && uv run --with torch --with transformers --with librosa --with soundfile python -m pytest packages/audio/tests/test_analyzer_variants.py -v

Expected: FAIL (analyze_variants, _emit_audio, _align, and _score_one don't exist yet). If the heavy --with deps make collection slow, that's fine; the test itself never loads the model.

  • [ ] Step 3: Refactor analyzer.py

Replace _emit_and_align (lines 53-74) with the four methods below (_emit_audio, _align, a thin _emit_and_align back-compat wrapper, and _score_one), refactor analyze (125-140) to use them, then add analyze_variants:

    def _emit_audio(self, audio_bytes: bytes):
        """-> (e[T,26], lp, produced[list[str]]). The expensive forward pass —
        depends only on the audio, so it runs ONCE per clip across all variants."""
        arr = self.em._decode_audio(audio_bytes)
        e, lp = self.em._emit(arr)
        produced = self._produced_from_lp(lp)
        return e, lp, produced

    def _align(self, lp, canon: list[str]):
        """-> centers[list of float|None]: force-align the produced trajectory to
        a canonical phone sequence. Cheap; called once per attested pronunciation."""
        from phonolex_audio.feature_emitter import _forced_align_positions
        tids, slot_tp = [], []
        for p in canon:
            aid = self.em.ph2id.get(p)
            if aid is not None:
                slot_tp.append(len(tids))
                tids.append(aid)
            else:
                slot_tp.append(None)
        if not tids:
            return [None] * len(canon)
        pos = _forced_align_positions(lp, tids)
        return [
            float(np.median(np.where(pos == tp)[0])) if (tp is not None and np.any(pos == tp)) else None
            for tp in slot_tp
        ]

    def _emit_and_align(self, audio_bytes: bytes, canon: list[str]):
        """-> (e, centers, produced). Back-compat wrapper for any remaining
        callers; the refactored analyze() no longer goes through it."""
        e, lp, produced = self._emit_audio(audio_bytes)
        return e, self._align(lp, canon), produced

    def _score_one(self, e, centers, canon: list[str], produced: list[str]) -> dict:
        """Assemble one pronunciation's result: positions + attribution + features."""
        result = {
            "canonical": list(canon),
            "produced": produced,
            "positions": self.positions(e, centers, canon),
            "attribution": None,
            "features": None,
        }
        if self.attribution is not None:
            feats = self._attribution_features(e, centers, canon, produced)
            if feats is not None:
                result["features"] = [float(x) for x in feats]
                result["attribution"] = self.attribution.classify(feats)
        return result
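
To make the center computation in _align concrete, here is a pure-Python stand-in for the NumPy expression. It assumes, as the `pos == tp` comparison suggests, that _forced_align_positions returns one target index per frame; slot_centers is a hypothetical helper name, and the frame values are made up.

```python
from __future__ import annotations

from statistics import median
from typing import Optional


def slot_centers(pos: list[int], slot_tp: list[Optional[int]]) -> list[Optional[float]]:
    """Median frame index per canonical slot, or None when the slot has no frames.

    pos[t]     -> target index assigned to frame t (assumed encoding).
    slot_tp[i] -> canonical slot i's index into the target list, or None when the
                  phoneme is missing from the emitter vocabulary.
    """
    centers: list[Optional[float]] = []
    for tp in slot_tp:
        # Collect every frame aligned to this slot's target (none if the slot has no target).
        frames = [t for t, p in enumerate(pos) if p == tp] if tp is not None else []
        centers.append(float(median(frames)) if frames else None)
    return centers


# Canonical /k æ t/ where "æ" is (hypothetically) unknown to the emitter:
# the targets are [k, t], so slot_tp = [0, None, 1].
pos = [0, 0, 0, 1, 1, 1, 1]  # frames 0-2 -> target 0, frames 3-6 -> target 1
centers = slot_centers(pos, [0, None, 1])
# centers == [1.0, None, 4.5]
```

The slot without a target stays None, and an even number of frames yields a half-frame center (4.5), which matches what np.median returns for the same indices.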

Replace the body of analyze with:

    def analyze(self, audio_bytes: bytes, canon: list[str]) -> dict:
        """audio + one canonical -> flat {canonical, produced, positions, attribution, features}."""
        e, lp, produced = self._emit_audio(audio_bytes)
        return self._score_one(e, self._align(lp, canon), canon, produced)

    def analyze_variants(self, audio_bytes: bytes, canons: list[list[str]]) -> dict:
        """audio + all attested pronunciations -> {produced, variants:[per-canon result]}.
        The emitter forward pass runs ONCE; only align+score repeats per canonical."""
        e, lp, produced = self._emit_audio(audio_bytes)
        variants = [self._score_one(e, self._align(lp, canon), canon, produced) for canon in canons]
        return {"produced": produced, "variants": variants}

(Keep positions, _attribution_features, _produced_from_lp unchanged.)

  • [ ] Step 4: Run — confirm pass

Run: same pytest command. Expected: both tests PASS.

  • [ ] Step 5: Commit
git add packages/audio/src/phonolex_audio/analyzer.py packages/audio/tests/test_analyzer_variants.py
git commit -m "feat(phon-154): analyzer.analyze_variants — emit once, score each pronunciation"

Task 2: Host server — /analyze accepts canonicals

Files:
  • Modify: packages/audio/src/phonolex_audio/server.py (/analyze, ~167-186)

  • [ ] Step 1: Update the endpoint

Replace the /analyze handler body so it accepts an optional canonicals JSON field (list of phoneme-lists); when present, return analyze_variants, else keep the legacy single-canonical behavior:

    @app.post("/analyze")
    async def analyze(
        audio: UploadFile = File(...),
        canonical: str = Form(None),
        canonicals: str = Form(None),
    ) -> dict:
        if app.state.analyzer is None:
            raise HTTPException(status_code=400, detail="Trajectory analyzer not loaded")
        raw = await audio.read()
        if canonicals is not None:
            try:
                canons = json.loads(canonicals)
            except Exception:
                raise HTTPException(status_code=400, detail="canonicals must be a JSON array of phoneme arrays")
            if not isinstance(canons, list) or not all(
                isinstance(c, list) and all(isinstance(p, str) for p in c) for c in canons
            ):
                raise HTTPException(status_code=400, detail="canonicals must be a JSON array of phoneme arrays")
            return app.state.analyzer.analyze_variants(raw, canons)
        if canonical is None:
            raise HTTPException(status_code=400, detail="canonical or canonicals required")
        try:
            canon = json.loads(canonical)
        except Exception:
            raise HTTPException(status_code=400, detail="canonical must be a JSON array of phonemes")
        if not isinstance(canon, list) or not all(isinstance(p, str) for p in canon):
            raise HTTPException(status_code=400, detail="canonical must be a JSON array of phoneme strings")
        return app.state.analyzer.analyze(raw, canon)

(Keep the existing imports; Form/File/UploadFile/HTTPException/json are already used in this file.)
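
The canonicals validation in the handler can also be read as a small pure function, which is easier to unit-test without FastAPI. The sketch below is not part of the plan: parse_canonicals is a hypothetical helper name, and it raises ValueError where the handler raises HTTPException(400).

```python
from __future__ import annotations

import json


def parse_canonicals(raw: str) -> list[list[str]]:
    """Parse the `canonicals` form field; raise ValueError on malformed input."""
    try:
        canons = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("canonicals is not valid JSON") from exc
    # Same shape check as the handler: a list of lists of strings.
    if not isinstance(canons, list) or not all(
        isinstance(c, list) and all(isinstance(p, str) for p in c) for c in canons
    ):
        raise ValueError("canonicals must be a JSON array of phoneme arrays")
    return canons


ok = parse_canonicals('[["k", "æ", "t"], ["k", "ɛ", "t"]]')

try:
    parse_canonicals('["k", "æ", "t"]')  # flat list: the legacy `canonical` shape
    rejected_flat = False
except ValueError:
    rejected_flat = True
```

An empty array passes this check, exactly as it does in the handler, and analyze_variants then returns an empty variants list. Whether that should be a 400 instead is a decision the plan leaves open.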

  • [ ] Step 2: Sanity check (no model needed)

Run: cd /Users/jneumann/Repos/PhonoLex && uv run --with fastapi python -c "import ast; ast.parse(open('packages/audio/src/phonolex_audio/server.py').read()); print('parse ok')"

Expected: parse ok. (Full endpoint behavior is verified in the local-host smoke test during Phase 4b verification.)

  • [ ] Step 3: Commit
git add packages/audio/src/phonolex_audio/server.py
git commit -m "feat(phon-154): /analyze accepts canonicals (per-variant) with legacy canonical fallback"

Task 3: Worker — fetch variants, send canonicals, return per-variant

Files:
  • Modify: packages/web/workers/src/routes/audio.ts (/analyze, ~386-427)
  • Test: packages/web/workers/src/__tests__/audio.test.ts

  • [ ] Step 1: Write/extend the failing test

Add to audio.test.ts (the POST /api/audio/analyze describe block). The test asserts that, when D1 is seeded, the route sends a canonicals field to the host and returns the {produced, variants} shape. When D1 is unseeded the test returns early, and the structured failure (404/500) stays covered by the existing unseeded test. Use the existing analyzeForm helper + fetchMock pattern:

  it('forwards canonicals (primary + variants) and returns the per-variant shape when seeded', async () => {
    const seeded = (await SELF.fetch('http://localhost/api/words/cat')).status === 200;
    if (!seeded) return; // covered by the unseeded structured-failure test below
    let seenBody = '';
    fetchMock.get('http://127.0.0.1:8000').intercept({
      path: '/analyze', method: 'POST',
      body: (b) => { seenBody = typeof b === 'string' ? b : ''; return true; },
    }).reply(200, { produced: ['k','æ','t'], variants: [
      { canonical: ['k','æ','t'], produced: ['k','æ','t'], positions: [], attribution: null, features: null },
    ]});
    const res = await SELF.fetch('http://localhost/api/audio/analyze', { method: 'POST', body: analyzeForm('cat') });
    expect(res.status).toBe(200);
    expect(seenBody).toContain('name="canonicals"');
    const body = await res.json() as { produced: string[]; variants: unknown[] };
    expect(Array.isArray(body.variants)).toBe(true);
  });
  • [ ] Step 2: Run — confirm fail (or skip if unseeded)

Run: cd packages/web/workers && npx vitest run src/__tests__/audio.test.ts -t "forwards canonicals"

Expected: with a seeded local D1, FAIL (route still sends canonical, not canonicals). If the local D1 is unseeded, the test returns early and reports a pass; in that case rely on Step 4's manual local check.

  • [ ] Step 3: Update the route

In routes/audio.ts /analyze, change the D1 fetch + host forward. Replace the canonical lookup + fwd construction so it fetches variants too, builds the de-duped canonical list (primary first), and sends canonicals:

  // canonical + variant pronunciations from D1 (host aligns the production
  // against each so a valid variant isn't scored as a deviation).
  const row = await c.env.DB.prepare('SELECT phonemes, variants FROM words WHERE word = ? LIMIT 1')
    .bind(target.trim().toLowerCase())
    .first<{ phonemes: string | null; variants: string | null }>();
  if (!row || !row.phonemes) {
    return c.json({ detail: `Word not in lexicon: ${target}` }, 404);
  }
  const primary = JSON.parse(row.phonemes) as string[];
  const canons: string[][] = [primary];
  if (row.variants) {
    try {
      const vs = JSON.parse(row.variants) as Array<{ phonemes?: string[] }>;
      for (const v of vs) {
        if (Array.isArray(v.phonemes) && v.phonemes.length
            && !canons.some((c2) => c2.length === v.phonemes!.length && c2.every((p, i) => p === v.phonemes![i]))) {
          canons.push(v.phonemes);
        }
      }
    } catch { /* malformed variants → primary only */ }
  }

  const fwd = new FormData();
  fwd.append('audio', file, file.name || 'clip');
  fwd.append('canonicals', JSON.stringify(canons));

(Leave the audioFetch/warming/error/JSON handling below unchanged — it already returns the host body verbatim, which is now {produced, variants}.)
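
For readers who want to check the primary-first de-dup outside the Worker, the same logic can be sketched in Python. build_canonicals is a hypothetical helper name; it takes the two raw JSON column values and is slightly stricter than the route above in that it also skips entries whose phonemes are not all strings.

```python
from __future__ import annotations

import json
from typing import Optional


def build_canonicals(phonemes_json: str, variants_json: Optional[str]) -> list[list[str]]:
    """Primary pronunciation first, then each distinct non-empty variant in order."""
    primary = json.loads(phonemes_json)
    canons = [primary]
    seen = {tuple(primary)}  # tuples are hashable, so membership checks are O(1)
    if not variants_json:
        return canons
    try:
        variants = json.loads(variants_json)
    except json.JSONDecodeError:
        return canons  # malformed variants -> primary only
    for v in variants if isinstance(variants, list) else []:
        ph = v.get("phonemes") if isinstance(v, dict) else None
        if (
            isinstance(ph, list)
            and ph
            and all(isinstance(p, str) for p in ph)
            and tuple(ph) not in seen
        ):
            seen.add(tuple(ph))
            canons.append(ph)
    return canons


canons = build_canonicals(
    '["k", "æ", "t"]',
    '[{"phonemes": ["k", "æ", "t"]}, {"phonemes": ["k", "ɛ", "t"]}, {"phonemes": []}]',
)
# canons == [["k", "æ", "t"], ["k", "ɛ", "t"]]
```

Keeping the primary at index 0 means the first element of the returned variants array always corresponds to the legacy single-canonical result.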

  • [ ] Step 4: Run tests + type-check

Run: cd packages/web/workers && npm run type-check && npm test

Expected: type-check clean; suite green. The new test passes when seeded; the existing unseeded /analyze tests still pass (404/500 path). If the local D1 is unseeded and you want to confirm the seeded path, do a local-only reseed (Phase 1 Task 5 steps, NOT committed). That is optional here, since the seeded path is fully exercised in Phase 4b verification.

  • [ ] Step 5: Commit
git add packages/web/workers/src/routes/audio.ts packages/web/workers/src/__tests__/audio.test.ts
git commit -m "feat(phon-154): /api/audio/analyze fetches variants + forwards canonicals"

Phase 4a done — exit criteria

  • Host analyze_variants emits once, scores each pronunciation; /analyze accepts canonicals; legacy single-canonical still works.
  • Worker /api/audio/analyze fetches primary + variants, forwards canonicals, returns {produced, variants}.
  • Host orchestration unit tests + worker tests green; type-check clean.

Next: Phase 4b (frontend)

  • audioAnalysisApi types → { produced, variants: VariantResult[] }.
  • AudioAnalysisTool picks the best-matching variant (lowest mean deviation) per production to feed attributeSession.
  • ProductionCard/DeviationOverlay render per-variant rows + scores.
  • Lookup displays all variant pronunciations.
  • Superscript has_variants flag on result rows.
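
The "lowest mean deviation" selection planned for Phase 4b can be sketched as follows. It is written in Python to match the host code, although the real implementation will live in the frontend. The shape of each positions entry is an assumption: this plan does not define it, so the sketch treats every entry as a dict with a numeric `deviation` field (a hypothetical name), and mean_deviation and best_variant are hypothetical helpers.

```python
from __future__ import annotations

from math import inf
from statistics import fmean
from typing import Any, Optional


def mean_deviation(variant: dict[str, Any], key: str = "deviation") -> float:
    """Mean of the per-position scores; +inf when nothing could be scored.

    ASSUMPTION: each entry of variant["positions"] is a dict holding a numeric
    `deviation` (hypothetical field), or lacks it when the slot was unscorable.
    """
    scores = [
        p[key]
        for p in variant.get("positions", [])
        if isinstance(p, dict) and p.get(key) is not None
    ]
    return fmean(scores) if scores else inf


def best_variant(variants: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Lowest mean deviation wins; ties go to the earliest entry (primary first)."""
    return min(variants, key=mean_deviation) if variants else None


# Made-up scores: the speaker produced the /k ɛ t/ variant.
variants = [
    {"canonical": ["k", "æ", "t"],
     "positions": [{"deviation": 0.1}, {"deviation": 0.9}, {"deviation": 0.2}]},
    {"canonical": ["k", "ɛ", "t"],
     "positions": [{"deviation": 0.1}, {"deviation": 0.2}, {"deviation": 0.2}]},
]
best = best_variant(variants)
# best["canonical"] == ["k", "ɛ", "t"]
```

Because min returns the first minimal element, a tie resolves to the primary pronunciation, which keeps results stable for words whose variants score identically.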