
v6.0 Audio Transcribe Viewer Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: A dev-gated viewer that turns recorded/uploaded audio into a broad-phoneme IPA transcript with per-phoneme confidence and honest coverage caveats, backed by a local Python inference server proxied through a Worker route.

Architecture: Three units against a frozen JSON contract — a phonolex_audio FastAPI inference server (wav2vec2-espeak CTC, --checkpoint swappable, server-side eSpeak→PhonoLex mapping), a thin /api/audio/transcribe Worker proxy (env AUDIO_INFERENCE_URL, cold-start aware), and an unlinked /dev/audio React viewer. The Python track (Phase 0/A) and the TS track (Phase B/C) are independent once the §6 contract is frozen and can be built in parallel.

Tech Stack: Python (FastAPI, transformers, torch, librosa) · Hono on Cloudflare Workers · React + TypeScript + MUI · pytest · vitest (cloudflare:test + vitest browser/jsdom).

Spec: docs/superpowers/specs/2026-06-02-v6-audio-transcribe-viewer-design.md

Frozen API contract (§6 of the spec)

POST /transcribe (inference server) and POST /api/audio/transcribe (Worker, identical body) return:

{
  "phonemes": ["k", "æ", "t"],
  "confidences": [0.98, 0.91, 0.95],
  "duration_ms": 1230,
  "coverage": "broad-phoneme",
  "limitations": [
    "Broad-phoneme transcription only; distortions and covert contrast are not represented.",
    "Not validated on disordered speech."
  ]
}

phonemes and confidences are equal-length and positionally aligned. coverage and limitations[] are always present.
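
The invariants above can be checked mechanically. The following is a minimal sketch, not part of the plan's file list: `contract_violations` and `REQUIRED_KEYS` are hypothetical names, and the [0, 1] range check on confidences is taken from the decode tests later in this plan rather than from the spec itself.

```python
from typing import Any

# Keys named in the frozen contract above.
REQUIRED_KEYS = {"phonemes", "confidences", "duration_ms", "coverage", "limitations"}


def contract_violations(payload: dict[str, Any]) -> list[str]:
    """Return human-readable contract violations (empty list if the payload is valid)."""
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems: list[str] = []
    phonemes, confidences = payload["phonemes"], payload["confidences"]
    # phonemes[i] and confidences[i] describe the same position.
    if len(phonemes) != len(confidences):
        problems.append("phonemes and confidences differ in length")
    # Assumption: confidences are probabilities, as the decode tests assert.
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        problems.append("confidence outside [0, 1]")
    if not isinstance(payload["limitations"], list):
        problems.append("limitations must be a list")
    return problems
```

A check like this could back an assertion in either track's tests, so that both sides fail in the same way if the contract drifts.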

File structure

| File | Responsibility |
| --- | --- |
| packages/audio/pyproject.toml | New phonolex-audio package metadata + deps |
| packages/audio/src/phonolex_audio/__init__.py | Package marker |
| packages/audio/src/phonolex_audio/mapping.py | eSpeak→PhonoLex broad-phoneme projection (promoted from PHON-128) |
| packages/audio/src/phonolex_audio/transcribe.py | Model load + CTC decode + confidence + mapping → contract dict |
| packages/audio/src/phonolex_audio/server.py | FastAPI app: POST /transcribe, GET /health |
| packages/audio/src/phonolex_audio/__main__.py | CLI: python -m phonolex_audio --checkpoint ... --port 8000 |
| packages/audio/tests/test_mapping.py | Mapping unit tests |
| packages/audio/tests/test_transcribe.py | Transcribe smoke (local-only) + confidence alignment |
| packages/audio/tests/test_server.py | FastAPI route tests (mocked transcribe) |
| packages/audio/README.md | Dev run instructions + --checkpoint harness note |
| pyproject.toml | Add packages/audio to the uv workspace |
| packages/web/workers/src/types.ts | Add AUDIO_INFERENCE_URL to Env |
| packages/web/workers/src/routes/audio.ts | POST /api/audio/transcribe proxy |
| packages/web/workers/src/index.ts | Mount the audio route |
| packages/web/workers/wrangler.toml | AUDIO_INFERENCE_URL var (dev default) |
| packages/web/workers/.dev.vars | Local AUDIO_INFERENCE_URL override |
| packages/web/workers/src/__tests__/audio.test.ts | Worker route tests |
| packages/web/frontend/src/services/audioApi.ts | transcribeAudio(blob, language?) (FormData fetch) |
| packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx | The viewer component |
| packages/web/frontend/src/components/tools/AudioTranscribeViewer.test.tsx | Component tests |
| packages/web/frontend/src/main.tsx | Register unlinked /dev/audio route |

Phase 0 — Package scaffold

Task 0: Create the phonolex_audio package and register it in the workspace

Files:
- Create: packages/audio/pyproject.toml
- Create: packages/audio/src/phonolex_audio/__init__.py
- Create: packages/audio/tests/test_smoke.py
- Modify: pyproject.toml:8-16 (workspace members + sources)

  • [ ] Step 1: Write the failing test

packages/audio/tests/test_smoke.py:

def test_package_imports():
    import phonolex_audio
    assert phonolex_audio.__name__ == "phonolex_audio"

  • [ ] Step 2: Run it to verify it fails

Run: uv run python -m pytest packages/audio/tests/test_smoke.py -v Expected: FAIL — ModuleNotFoundError: No module named 'phonolex_audio'

  • [ ] Step 3: Create the package files

packages/audio/pyproject.toml:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "phonolex-audio"
version = "0.1.0"
description = "Audio → broad-phoneme transcription inference server for PhonoLex (v6 Model #1)"
license = "LicenseRef-Proprietary"
requires-python = ">=3.11"
dependencies = [
    "phonolex-data",
    "fastapi>=0.115",
    "uvicorn>=0.30",
    "python-multipart>=0.0.9",
    "transformers>=4.40",
    "torch>=2.2",
    "librosa>=0.10",
    "soundfile>=0.12",
    "huggingface_hub>=0.24",
    "numpy>=1.24",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "httpx>=0.27",
    "ruff>=0.4",
]

[tool.hatch.build.targets.wheel]
packages = ["src/phonolex_audio"]

[tool.ruff]
target-version = "py311"
line-length = 100

[tool.pytest.ini_options]
testpaths = ["tests"]

packages/audio/src/phonolex_audio/__init__.py:

"""PhonoLex audio inference — v6 Model #1: audio → broad-phoneme transcript."""

Modify pyproject.toml (root) — add the member and source:

[tool.uv.workspace]
members = [
    "packages/data",
    "packages/features",
    "packages/audio",
]

[tool.uv.sources]
phonolex-data = { workspace = true }
phonolex-features = { workspace = true }
phonolex-audio = { workspace = true }

  • [ ] Step 4: Sync and run the test

Run: uv sync --all-packages && uv run python -m pytest packages/audio/tests/test_smoke.py -v Expected: PASS

  • [ ] Step 5: Commit
git add packages/audio/pyproject.toml packages/audio/src/phonolex_audio/__init__.py packages/audio/tests/test_smoke.py pyproject.toml
git commit -m "feat(audio): scaffold phonolex_audio package (PHON-128)"

Phase A — Python inference server

Task A1: Promote the eSpeak→PhonoLex mapping into the package

The PHON-128 mapping lives at research/2026-05-31-phon-128-audio-transcript/scripts/lib_mapping.py and path-walks to arpa_to_ipa.json via parents[3]. When promoting it, load the target inventory through the phonolex_data package instead. That fixes the root cause: no cross-package path walking.

Files:
- Create: packages/audio/src/phonolex_audio/mapping.py
- Create: packages/audio/tests/test_mapping.py

  • [ ] Step 1: Write the failing test

packages/audio/tests/test_mapping.py:

from collections import Counter

from phonolex_audio.mapping import map_sequence, map_token, target_inventory


def test_target_inventory_is_broad_phoneme_set():
    inv = target_inventory()
    # Broad CMU→IPA inventory is ~39 phonemes.
    assert 35 <= len(inv) <= 45
    assert "k" in inv and "æ" in inv and "ɹ" in inv


def test_identity_tokens_pass_through():
    assert map_sequence(["k", "æ", "t"]) == ["k", "æ", "t"]


def test_length_marks_stripped():
    assert map_token("iː") == ["i"]
    assert map_token("ɔː") == ["ɔ"]


def test_rhotic_combo_expands():
    assert map_token("ɑːɹ") == ["ɑ", "ɹ"]


def test_flap_maps_to_t():
    assert map_token("ɾ") == ["t"]


def test_glottal_stop_dropped():
    assert map_token("ʔ") == []


def test_no_unmapped_on_native_english_token_set():
    # Tokens the model actually emits on native English (from the PHON-128
    # token histogram); projection must be 0% unmapped.
    tokens = ["k", "æ", "t", "ð", "ə", "ɹ", "iː", "ɑːɹ", "ɾ", "n̩", "əl", "ɝ"]
    unmapped: Counter = Counter()
    map_sequence(tokens, unmapped)
    assert sum(unmapped.values()) == 0

  • [ ] Step 2: Run it to verify it fails

Run: uv run python -m pytest packages/audio/tests/test_mapping.py -v Expected: FAIL — ModuleNotFoundError: No module named 'phonolex_audio.mapping'

  • [ ] Step 3: Create mapping.py

Copy the promoted module. The ONLY change from the research original is how target_inventory() locates arpa_to_ipa.json — via importlib.resources on phonolex_data, not a parents[n] walk.

packages/audio/src/phonolex_audio/mapping.py:

"""eSpeak/phonemizer-IPA -> PhonoLex broad-phoneme-IPA mapping.

Promoted from PHON-128's lib_mapping.py. `facebook/wav2vec2-lv-60-espeak-cv-ft`
emits the full multilingual eSpeak IPA set; PhonoLex's typed-input path speaks a
39-phone broad CMU->IPA inventory. This projects model output onto that inventory
so a transcript is interchangeable with typed input.

The target inventory is loaded from phonolex_data's arpa_to_ipa.json (single
source of truth) via importlib.resources — no cross-package path walking.
"""
from __future__ import annotations

import json
import unicodedata
from collections import Counter
from functools import lru_cache
from importlib.resources import files


@lru_cache(maxsize=1)
def target_inventory() -> frozenset[str]:
    """The PhonoLex broad-phoneme IPA inventory (the legal output alphabet)."""
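    # Read as UTF-8 explicitly: the file holds IPA, and the locale default may differ.
    data = (
        files("phonolex_data.mappings")
        .joinpath("arpa_to_ipa.json")
        .read_text(encoding="utf-8")
    )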
    return frozenset(json.loads(data).values())


# Explicit eSpeak-token -> [PhonoLex IPA tokens]. [] means "drop".
ESPEAK_MAP: dict[str, list[str]] = {
    "iː": ["i"], "uː": ["u"], "ɑː": ["ɑ"], "ɔː": ["ɔ"], "ɛː": ["ɛ"],
    "aː": ["ɑ"], "yː": ["u"], "øː": ["oʊ"], "ɜː": ["ɝ"], "ɜ": ["ɝ"],
    "oː": ["oʊ"], "eː": ["eɪ"],
    "e": ["eɪ"], "o": ["oʊ"], "ø": ["oʊ"], "œ": ["ɛ"], "ɒ": ["ɑ"], "ä": ["ɑ"],
    "ɐ": ["ə"], "ʉ": ["u"], "ɨ": ["ɪ"], "ɯ": ["u"], "ᵻ": ["ɪ"], "ɵ": ["oʊ"], "a": ["ɑ"],
    "e̞": ["eɪ"], "o̞": ["oʊ"], "y": ["u"],
    "ɑːɹ": ["ɑ", "ɹ"], "ɔːɹ": ["ɔ", "ɹ"], "oːɹ": ["ɔ", "ɹ"], "ɛɹ": ["ɛ", "ɹ"],
    "ɪɹ": ["ɪ", "ɹ"], "ʊɹ": ["ʊ", "ɹ"], "ɔɹ": ["ɔ", "ɹ"], "ɑɹ": ["ɑ", "ɹ"],
    "aɪɚ": ["aɪ", "ɚ"], "aɪə": ["aɪ", "ə"], "aʊɚ": ["aʊ", "ɚ"], "ɚɹ": ["ɚ"],
    "ai": ["aɪ"], "ei": ["eɪ"], "au": ["aʊ"], "ɑu": ["aʊ"], "oi": ["ɔɪ"], "oɪ": ["ɔɪ"],
    "ou": ["oʊ"], "iə": ["ɪ", "ə"], "uə": ["u", "ə"],
    "ər": ["ɚ"], "əɹ": ["ɚ"],
    "əl": ["ə", "l"], "l̩": ["ə", "l"], "n̩": ["ə", "n"], "m̩": ["ə", "m"],
    "ən": ["ə", "n"], "əm": ["ə", "m"],
    "ɾ": ["t"],
    "ʔ": [],
    "r": ["ɹ"], "ʁ": ["ɹ"], "ʀ": ["ɹ"], "ɻ": ["ɹ"], "ɽ": ["ɹ"],
    "x": ["k"], "ç": ["h"], "χ": ["k"], "ɣ": ["ɡ"], "β": ["b"], "ɸ": ["f"],
    "ʎ": ["l"], "ɲ": ["n", "j"], "ɳ": ["n"], "ɭ": ["l"], "ʋ": ["v"], "ɬ": ["l"],
    "ʝ": ["j"], "ɕ": ["ʃ"], "ʑ": ["ʒ"], "tɕ": ["tʃ"], "dʑ": ["dʒ"], "ts": ["t", "s"],
    "c": ["k"], "ɟ": ["ɡ"], "ʈ": ["t"], "ɖ": ["d"], "q": ["k"], "ɫ": ["l"],
    "ʲ": [], "ˤ": [],
}

_STRIP = {"ː", "ˑ", "ʲ", "ʷ", "ˤ", "ˈ", "ˌ", "̩", "̃", "̪", "̝", "̞", "̥", "̊", "."}
_DIGITS = set("0123456789")


def _strip_marks(tok: str) -> str:
    s = "".join(ch for ch in tok if ch not in _STRIP and ch not in _DIGITS)
    s = "".join(ch for ch in unicodedata.normalize("NFD", s)
                if not unicodedata.combining(ch))
    return s


def map_token(tok: str, unmapped: Counter | None = None) -> list[str]:
    """Project one eSpeak token onto the PhonoLex inventory (possibly 0 or >1)."""
    target = target_inventory()
    if tok in target:
        return [tok]
    if tok in ESPEAK_MAP:
        return ESPEAK_MAP[tok]
    # Fall back to the token with length, stress, and diacritic marks stripped.
    s = _strip_marks(tok)
    if s in target:
        return [s]
    if s in ESPEAK_MAP:
        return ESPEAK_MAP[s]
    if unmapped is not None:
        unmapped[tok] += 1
    return []


def map_sequence(tokens: list[str], unmapped: Counter | None = None) -> list[str]:
    out: list[str] = []
    for t in tokens:
        out.extend(map_token(t, unmapped))
    return out

  • [ ] Step 4: Run the test

Run: uv run python -m pytest packages/audio/tests/test_mapping.py -v Expected: PASS (all 7 tests)

  • [ ] Step 5: Commit
git add packages/audio/src/phonolex_audio/mapping.py packages/audio/tests/test_mapping.py
git commit -m "feat(audio): promote eSpeak->PhonoLex mapping into phonolex_audio (PHON-128)"

Task A2: Transcriber core — CTC decode + per-phoneme confidence + mapping

Wraps the wav2vec2-espeak CTC model. Decodes audio bytes → broad-phoneme contract dict. Confidence is the max-softmax of the CTC frame that emitted each token, carried through the mapping (a token that splits into two outputs replicates its confidence; a dropped token contributes nothing).
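
As a worked illustration of that confidence rule, here is a minimal stdlib-only sketch. `softmax_max`, `carry_confidence`, and `TOY_MAP` are hypothetical stand-ins for illustration; the real implementation in Step 3 operates on NumPy logits and the full mapping.

```python
import math


def softmax_max(logits: list[float]) -> float:
    """Probability of the arg-max class under a softmax over one CTC frame."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the peak for numerical stability
    return max(exps) / sum(exps)


# Hypothetical three-entry mapping, for illustration only (not the real ESPEAK_MAP).
TOY_MAP: dict[str, list[str]] = {"k": ["k"], "ɑːɹ": ["ɑ", "ɹ"], "ʔ": []}


def carry_confidence(emitted: list[tuple[str, float]]) -> tuple[list[str], list[float]]:
    """Project (token, confidence) pairs through the mapping, keeping arrays aligned."""
    phonemes: list[str] = []
    confidences: list[float] = []
    for token, conf in emitted:
        for ph in TOY_MAP.get(token, []):
            phonemes.append(ph)
            confidences.append(conf)  # a split token replicates its confidence
    return phonemes, confidences


# "ɑːɹ" splits in two (confidence repeated); "ʔ" is dropped (contributes nothing).
print(carry_confidence([("k", 0.98), ("ɑːɹ", 0.80), ("ʔ", 0.60)]))
# (['k', 'ɑ', 'ɹ'], [0.98, 0.8, 0.8])
```

Because every output phoneme is appended together with a confidence, the two arrays stay equal-length by construction, which is what the contract requires.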

Files:
- Create: packages/audio/src/phonolex_audio/transcribe.py
- Create: packages/audio/tests/test_transcribe.py

  • [ ] Step 1: Write the failing tests

packages/audio/tests/test_transcribe.py:

import numpy as np
import pytest

from phonolex_audio.transcribe import _decode_logits_to_phonemes


def test_decode_aligns_phonemes_and_confidences():
    # 3 vocab tokens: 0=<pad>, 1="k", 2="æ". Two clear frames -> "k","æ".
    id2tok = {0: "<pad>", 1: "k", 2: "æ"}
    # logits: frame0 -> token 1, frame1 -> token 2 (high confidence)
    logits = np.array([[0.0, 10.0, 0.0], [0.0, 0.0, 10.0]], dtype=np.float32)
    phonemes, confidences = _decode_logits_to_phonemes(logits, id2tok)
    assert phonemes == ["k", "æ"]
    assert len(confidences) == len(phonemes)
    assert all(0.0 <= c <= 1.0 for c in confidences)
    assert confidences[0] > 0.99  # softmax of a 10-vs-0 logit


def test_ctc_collapses_repeats_and_drops_pad():
    id2tok = {0: "<pad>", 1: "k"}
    # k, k (repeat -> collapse), pad, k (new run)
    logits = np.array(
        [[0.0, 9.0], [0.0, 9.0], [9.0, 0.0], [0.0, 9.0]], dtype=np.float32
    )
    phonemes, confidences = _decode_logits_to_phonemes(logits, id2tok)
    assert phonemes == ["k", "k"]
    assert len(confidences) == 2


@pytest.mark.slow
def test_transcribe_wav_smoke():
    # Local-only: downloads the model. Skipped in CI (no network / heavy).
    pytest.importorskip("torch")
    from phonolex_audio.transcribe import Transcriber

    import io

    import soundfile as sf

    # 0.5s of silence at 16k -> model returns *something* with aligned arrays.
    wav = np.zeros(8000, dtype=np.float32)
    buf = io.BytesIO()
    sf.write(buf, wav, 16000, format="WAV")
    t = Transcriber()  # default off-the-shelf checkpoint
    result = t.transcribe(buf.getvalue())
    assert set(result) == {"phonemes", "confidences", "duration_ms", "coverage", "limitations"}
    assert len(result["phonemes"]) == len(result["confidences"])
    assert result["coverage"] == "broad-phoneme"
    assert isinstance(result["limitations"], list) and result["limitations"]
    assert result["duration_ms"] == 500

  • [ ] Step 2: Run to verify failure

Run: uv run python -m pytest packages/audio/tests/test_transcribe.py -v -m "not slow" Expected: FAIL (collection error: ModuleNotFoundError: No module named 'phonolex_audio.transcribe')

  • [ ] Step 3: Implement transcribe.py

packages/audio/src/phonolex_audio/transcribe.py:

"""Audio bytes -> broad-phoneme transcript contract dict.

CTC decode of wav2vec2-espeak, per-phoneme confidence from frame softmax,
projected onto the PhonoLex inventory via mapping.map_token.
"""
from __future__ import annotations

import io
import json
from collections import Counter

import librosa
import numpy as np

from .mapping import map_token

MODEL_ID = "facebook/wav2vec2-lv-60-espeak-cv-ft"
TARGET_SR = 16000
SPECIAL = {"<s>", "</s>", "<unk>", "<pad>", "|", " ", ""}

COVERAGE = "broad-phoneme"
LIMITATIONS = [
    "Broad-phoneme transcription only; distortions and covert contrast are not represented.",
    "Not validated on disordered speech.",
]


def _softmax_max(row: np.ndarray) -> float:
    m = float(row.max())
    e = np.exp(row - m)
    return float(e.max() / e.sum())


def _decode_logits_to_phonemes(
    logits: np.ndarray, id2tok: dict[int, str]
) -> tuple[list[str], list[float]]:
    """CTC greedy decode -> (phonemes, confidences), projected + aligned.

    For each emitted (non-special, repeat-collapsed) frame we take the raw
    eSpeak token and its frame softmax-max confidence, then project through the
    mapping. A token that maps to N outputs replicates its confidence N times;
    a dropped token contributes nothing.
    """
    ids = logits.argmax(axis=-1).tolist()
    phonemes: list[str] = []
    confidences: list[float] = []
    unmapped: Counter[str] = Counter()
    prev = None
    for frame_idx, tok_id in enumerate(ids):
        if tok_id == prev:
            continue
        prev = tok_id
        tok = id2tok.get(tok_id, "<unk>")
        if tok in SPECIAL:
            continue
        conf = _softmax_max(logits[frame_idx])
        mapped = map_token(tok, unmapped)
        for ph in mapped:
            phonemes.append(ph)
            confidences.append(round(conf, 4))
    return phonemes, confidences


class Transcriber:
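    """wav2vec2-espeak CTC transcriber; the model loads eagerly in __init__."""

    def __init__(self, checkpoint: str = MODEL_ID, device: str | None = None):
        from pathlib import Path

        import torch
        from huggingface_hub import hf_hub_download
        from transformers import AutoModelForCTC, Wav2Vec2FeatureExtractor

        if device is None:
            device = "mps" if torch.backends.mps.is_available() else "cpu"
        self.device = device
        self.fe = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
        self.model = AutoModelForCTC.from_pretrained(checkpoint).to(device)
        self.model.eval()
        # --checkpoint may be a local directory (e.g. a fine-tuned checkpoint) or a
        # Hub id; hf_hub_download only handles Hub ids, so check locally first.
        vocab_path = Path(checkpoint) / "vocab.json"
        if not vocab_path.is_file():
            vocab_path = Path(hf_hub_download(checkpoint, "vocab.json"))
        with open(vocab_path, encoding="utf-8") as fh:
            tok2id = json.load(fh)
        self.id2tok = {int(i): t for t, i in tok2id.items()}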

    def transcribe(self, audio_bytes: bytes, language: str | None = None) -> dict:
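        import tempfile
        from pathlib import Path

        import torch

        # librosa falls back from soundfile to audioread/ffmpeg (needed for
        # webm/opus mic recordings) only when given a filesystem path, not a
        # file-like object, so spill the upload to a temp file before loading.
        with tempfile.TemporaryDirectory() as tmp_dir:
            clip = Path(tmp_dir) / "clip"
            clip.write_bytes(audio_bytes)
            arr, _ = librosa.load(str(clip), sr=TARGET_SR, mono=True)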
        duration_ms = round(len(arr) / TARGET_SR * 1000)
        inputs = self.fe(arr, sampling_rate=TARGET_SR, return_tensors="pt")
        with torch.no_grad():
            logits = self.model(inputs.input_values.to(self.device)).logits
        logits_np = logits[0].cpu().numpy()
        phonemes, confidences = _decode_logits_to_phonemes(logits_np, self.id2tok)
        return {
            "phonemes": phonemes,
            "confidences": confidences,
            "duration_ms": duration_ms,
            "coverage": COVERAGE,
            "limitations": LIMITATIONS,
        }

  • [ ] Step 4: Run the fast tests

Run: uv run python -m pytest packages/audio/tests/test_transcribe.py -v -m "not slow" Expected: PASS (2 decode tests). The slow smoke test is run manually once locally.

  • [ ] Step 5: Register the slow marker to avoid pytest warnings

In packages/audio/pyproject.toml, replace the [tool.pytest.ini_options] block created in Task 0 with the block below. Do not append a second copy: a duplicate table makes the TOML invalid.

[tool.pytest.ini_options]
testpaths = ["tests"]
markers = ["slow: requires model download / heavy compute (local-only)"]

  • [ ] Step 6: Commit
git add packages/audio/src/phonolex_audio/transcribe.py packages/audio/tests/test_transcribe.py packages/audio/pyproject.toml
git commit -m "feat(audio): CTC decode + per-phoneme confidence + mapping projection (PHON-128)"

Task A3: FastAPI server — POST /transcribe, GET /health

Files:
- Create: packages/audio/src/phonolex_audio/server.py
- Create: packages/audio/tests/test_server.py

  • [ ] Step 1: Write the failing tests

packages/audio/tests/test_server.py:

import io

from fastapi.testclient import TestClient

from phonolex_audio.server import build_app


class FakeTranscriber:
    def transcribe(self, audio_bytes: bytes, language=None) -> dict:
        return {
            "phonemes": ["k", "æ", "t"],
            "confidences": [0.98, 0.91, 0.95],
            "duration_ms": 1230,
            "coverage": "broad-phoneme",
            "limitations": ["x"],
        }


def client() -> TestClient:
    return TestClient(build_app(transcriber=FakeTranscriber()))


def test_health_ok():
    r = client().get("/health")
    assert r.status_code == 200
    assert r.json()["status"] == "ok"


def test_transcribe_returns_contract():
    files = {"audio": ("clip.wav", io.BytesIO(b"RIFFfake"), "audio/wav")}
    r = client().post("/transcribe", files=files)
    assert r.status_code == 200
    body = r.json()
    assert body["phonemes"] == ["k", "æ", "t"]
    assert len(body["phonemes"]) == len(body["confidences"])
    assert body["coverage"] == "broad-phoneme"


def test_transcribe_rejects_missing_audio():
    r = client().post("/transcribe")
    assert r.status_code == 422  # FastAPI validation: required field missing

  • [ ] Step 2: Run to verify failure

Run: uv run python -m pytest packages/audio/tests/test_server.py -v Expected: FAIL (collection error: ModuleNotFoundError: No module named 'phonolex_audio.server')

  • [ ] Step 3: Implement server.py

packages/audio/src/phonolex_audio/server.py:

"""FastAPI inference host for v6 Model #1.

build_app(transcriber=...) lets tests inject a fake. The production entry
(__main__) builds a real Transcriber at the requested --checkpoint.
"""
from __future__ import annotations

from fastapi import FastAPI, File, Form, UploadFile
from fastapi.middleware.cors import CORSMiddleware


def build_app(transcriber) -> FastAPI:
    app = FastAPI(title="PhonoLex Audio Inference", version="0.1.0")
    # The Worker proxies to this host server-side; CORS is permissive because it
    # is never browser-facing directly (localhost in dev, RunPod private later).
    app.add_middleware(
        CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"]
    )

    @app.get("/health")
    def health() -> dict:
        return {"status": "ok"}

    @app.post("/transcribe")
    async def transcribe(
        audio: UploadFile = File(...), language: str | None = Form(default=None)
    ) -> dict:
        raw = await audio.read()
        return transcriber.transcribe(raw, language=language)

    return app

  • [ ] Step 4: Run the tests

Run: uv run python -m pytest packages/audio/tests/test_server.py -v Expected: PASS (3 tests)

  • [ ] Step 5: Commit
git add packages/audio/src/phonolex_audio/server.py packages/audio/tests/test_server.py
git commit -m "feat(audio): FastAPI /transcribe + /health (PHON-128)"

Task A4: CLI entry with --checkpoint (the research-harness seam)

Files:
- Create: packages/audio/src/phonolex_audio/__main__.py
- Create: packages/audio/README.md

  • [ ] Step 1: Implement __main__.py

packages/audio/src/phonolex_audio/__main__.py:

"""Run the inference host:  uv run python -m phonolex_audio --checkpoint <path>

--checkpoint swaps off-the-shelf wav2vec2-espeak for a PHON-139 fine-tuned
checkpoint without touching the Worker or the viewer — the research-harness seam.
"""
from __future__ import annotations

import argparse

import uvicorn

from .server import build_app
from .transcribe import MODEL_ID, Transcriber


def main() -> None:
    ap = argparse.ArgumentParser(prog="phonolex_audio")
    ap.add_argument("--checkpoint", default=MODEL_ID,
                    help="HF id or local path; default = off-the-shelf wav2vec2-espeak")
    ap.add_argument("--host", default="127.0.0.1")
    ap.add_argument("--port", type=int, default=8000)
    args = ap.parse_args()

    print(f"Loading checkpoint: {args.checkpoint}")
    app = build_app(transcriber=Transcriber(checkpoint=args.checkpoint))
    uvicorn.run(app, host=args.host, port=args.port)


if __name__ == "__main__":
    main()

  • [ ] Step 2: Verify the CLI parses (no model load)

Run: uv run python -m phonolex_audio --help Expected: usage text listing --checkpoint, --host, --port.

  • [ ] Step 3: Write README.md

packages/audio/README.md:

# phonolex_audio — v6 Model #1 inference host

Audio → broad-phoneme transcript. FastAPI host that the Worker proxies to.

## Run locally
```bash
uv run python -m phonolex_audio --checkpoint facebook/wav2vec2-lv-60-espeak-cv-ft --port 8000
```

First run downloads the model (~1 GB). Runs on Apple MPS if available, else CPU. ffmpeg must be on PATH to decode mic recordings (webm/opus); WAV/MP3/FLAC need only soundfile.

## Research harness (PHON-139)

Point --checkpoint at a fine-tuned checkpoint to see its transcript in the same /dev/audio viewer: off-the-shelf vs. fine-tuned, A/B, with no other change.

## API

POST /transcribe (multipart: audio, optional language) → {phonemes[], confidences[], duration_ms, coverage, limitations[]}. GET /health → {status: "ok"}.

  • [ ] Step 4: Commit
git add packages/audio/src/phonolex_audio/__main__.py packages/audio/README.md
git commit -m "feat(audio): CLI entry with --checkpoint + README (PHON-128)"


Phase B — Worker proxy route

Task B1: Add AUDIO_INFERENCE_URL to the Worker env

Files:
- Modify: packages/web/workers/src/types.ts:9-11
- Modify: packages/web/workers/wrangler.toml:33-34 (and :61-62 staging)
- Create: packages/web/workers/.dev.vars

  • [ ] Step 1: Extend Env

packages/web/workers/src/types.ts — replace the Env interface:

export interface Env {
  DB: D1Database;
  /** Base URL of the audio inference host. localhost in dev, RunPod later. */
  AUDIO_INFERENCE_URL?: string;
}

  • [ ] Step 2: Add the dev-default var + local override

packages/web/workers/wrangler.toml — under the existing [vars] line (33):

[vars]
AUDIO_INFERENCE_URL = "http://127.0.0.1:8000"
Leave [env.staging.vars] empty (staging/prod get the RunPod URL at ship time — out of scope here).

Create packages/web/workers/.dev.vars:

AUDIO_INFERENCE_URL="http://127.0.0.1:8000"

  • [ ] Step 3: Verify .dev.vars is gitignored

Run: cd packages/web/workers && git check-ignore .dev.vars; echo "exit=$?" Expected: exit=0, meaning the file is ignored, which is correct for a local-only default. If the exit code is 1 (not ignored), add .dev.vars to packages/web/workers/.gitignore. The file holds only a localhost URL, not a secret, but keep it untracked for hygiene, and never commit secrets to it.

  • [ ] Step 4: Commit (types + wrangler only)
git add packages/web/workers/src/types.ts packages/web/workers/wrangler.toml
git commit -m "feat(audio): AUDIO_INFERENCE_URL env binding (PHON-128)"

Task B2: /api/audio/transcribe proxy route

Thin proxy: validate the multipart audio part, forward it to AUDIO_INFERENCE_URL, pass the contract through, translate cold-start/errors.
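
The error-translation half of that description can be sketched as a small pure function. This is an illustrative Python rendering only; the route itself is TypeScript and appears in Step 3. `translate_upstream` is a hypothetical name, and `status=None` is an assumed stand-in for "the fetch itself threw".

```python
from __future__ import annotations

WARMING_BODY = {"warming": True, "detail": "Inference host is warming up. Retry shortly."}


def translate_upstream(status: int | None, body: dict | None = None) -> tuple[int, dict]:
    """Map an upstream outcome to the (status, body) pair the proxy returns.

    status is None when the host could not be reached at all.
    """
    if status is None or status == 503:
        # Cold start or unreachable host: structured warm-up signal, not a hard error.
        return 503, dict(WARMING_BODY)
    if not 200 <= status < 300:
        # Any other upstream failure is reported as a bad gateway.
        return 502, {"detail": f"Inference host error ({status})"}
    # Success: the contract body passes through unchanged.
    return 200, body if body is not None else {}
```

Keeping the warm-up case distinct from other failures is what lets the viewer show a retry state instead of an error.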

Files:
- Create: packages/web/workers/src/routes/audio.ts
- Modify: packages/web/workers/src/index.ts:20,105 (import + mount)
- Create: packages/web/workers/src/__tests__/audio.test.ts

  • [ ] Step 1: Write the failing tests

packages/web/workers/src/__tests__/audio.test.ts:

import { describe, it, expect, beforeAll, afterEach } from 'vitest';
import { SELF, fetchMock } from 'cloudflare:test';

beforeAll(() => {
  fetchMock.activate();
  fetchMock.disableNetConnect();
});
afterEach(() => fetchMock.assertNoPendingInterceptors());

function audioForm(): FormData {
  const fd = new FormData();
  fd.append('audio', new Blob([new Uint8Array([1, 2, 3])], { type: 'audio/wav' }), 'clip.wav');
  return fd;
}

describe('POST /api/audio/transcribe', () => {
  it('returns 400 when no audio part is present', async () => {
    const res = await SELF.fetch('http://localhost/api/audio/transcribe', {
      method: 'POST',
      body: new FormData(),
    });
    expect(res.status).toBe(400);
    const body = await res.json() as Record<string, unknown>;
    expect(body).toHaveProperty('detail');
  });

  it('returns 400 when the uploaded part is not audio/*', async () => {
    const fd = new FormData();
    fd.append('audio', new Blob(['hello'], { type: 'text/plain' }), 'note.txt');
    const res = await SELF.fetch('http://localhost/api/audio/transcribe', {
      method: 'POST',
      body: fd,
    });
    expect(res.status).toBe(400);
  });

  it('proxies the contract through on success', async () => {
    fetchMock
      .get('http://127.0.0.1:8000')
      .intercept({ path: '/transcribe', method: 'POST' })
      .reply(200, {
        phonemes: ['k', 'æ', 't'],
        confidences: [0.98, 0.91, 0.95],
        duration_ms: 1230,
        coverage: 'broad-phoneme',
        limitations: ['x'],
      });

    const res = await SELF.fetch('http://localhost/api/audio/transcribe', {
      method: 'POST',
      body: audioForm(),
    });
    expect(res.status).toBe(200);
    const body = await res.json() as { phonemes: string[]; confidences: number[] };
    expect(body.phonemes).toEqual(['k', 'æ', 't']);
    expect(body.confidences.length).toBe(body.phonemes.length);
  });

  it('returns a warming-up state when the host is unavailable (503)', async () => {
    fetchMock
      .get('http://127.0.0.1:8000')
      .intercept({ path: '/transcribe', method: 'POST' })
      .reply(503, 'cold');

    const res = await SELF.fetch('http://localhost/api/audio/transcribe', {
      method: 'POST',
      body: audioForm(),
    });
    expect(res.status).toBe(503);
    const body = await res.json() as { warming: boolean; detail: string };
    expect(body.warming).toBe(true);
  });
});

  • [ ] Step 2: Run to verify failure

Run: cd packages/web/workers && npm test -- audio.test.ts Expected: FAIL — route not mounted (404), assertions fail.

  • [ ] Step 3: Implement routes/audio.ts

packages/web/workers/src/routes/audio.ts:

/**
 * Audio transcription proxy — v6 Model #1.
 *
 * Thin proxy: validate the multipart `audio` part, forward to the inference
 * host (AUDIO_INFERENCE_URL), pass the broad-phoneme contract straight through.
 * Cold-start aware: a 503 / network failure from the host becomes a structured
 * { warming: true } so the viewer can show a warm-up state instead of an error.
 */
import { Hono } from 'hono';
import type { Env } from '../types';

const audio = new Hono<{ Bindings: Env }>();

const MAX_BYTES = 10 * 1024 * 1024; // 10 MB upload cap

audio.post('/transcribe', async (c) => {
  const form = await c.req.formData().catch(() => null);
  const file = form?.get('audio');
  if (!form || !(file instanceof File)) {
    return c.json({ detail: 'Missing required multipart field: audio' }, 400);
  }
  if (file.size > MAX_BYTES) {
    return c.json({ detail: `Audio exceeds ${MAX_BYTES} byte limit` }, 400);
  }
  // Reject obviously non-audio uploads. File.type can be empty for some blobs;
  // only reject when a type IS present and is not audio/*.
  if (file.type && !file.type.startsWith('audio/')) {
    return c.json({ detail: `Unsupported content type: ${file.type}` }, 400);
  }

  const base = c.env.AUDIO_INFERENCE_URL;
  if (!base) {
    return c.json({ detail: 'Audio inference host not configured' }, 503);
  }

  // Re-pack into a fresh multipart body to forward.
  const fwd = new FormData();
  fwd.append('audio', file, file.name || 'clip');
  const language = form.get('language');
  if (typeof language === 'string') fwd.append('language', language);

  let upstream: Response;
  try {
    upstream = await fetch(`${base}/transcribe`, { method: 'POST', body: fwd });
  } catch {
    return c.json({ warming: true, detail: 'Inference host is warming up. Retry shortly.' }, 503);
  }

  if (upstream.status === 503) {
    return c.json({ warming: true, detail: 'Inference host is warming up. Retry shortly.' }, 503);
  }
  if (!upstream.ok) {
    return c.json({ detail: `Inference host error (${upstream.status})` }, 502);
  }

  const body = await upstream.json();
  return c.json(body);
});

export default audio;

  • [ ] Step 4: Mount the route

packages/web/workers/src/index.ts — add the import after line 20:

import audio from './routes/audio';
And mount after line 105 (app.route('/api/sentences', sentences);):
app.route('/api/audio', audio);

  • [ ] Step 5: Run the tests

Run: cd packages/web/workers && npm test -- audio.test.ts Expected: PASS (4 tests)

  • [ ] Step 6: Run the full Worker suite (no regression)

Run: cd packages/web/workers && npm test Expected: all existing tests still PASS.

  • [ ] Step 7: Commit
git add packages/web/workers/src/routes/audio.ts packages/web/workers/src/index.ts packages/web/workers/src/__tests__/audio.test.ts
git commit -m "feat(audio): /api/audio/transcribe proxy route + tests (PHON-128)"

Phase C — Dev-gated viewer

Task C1: audioApi.transcribeAudio (FormData fetch)

The existing apiClient always sets Content-Type: application/json, which breaks multipart uploads (the browser must set the boundary). A dedicated small service uses raw fetch with FormData.
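
To see why the boundary matters, here is a minimal stdlib-only sketch of how a multipart body is encoded. `encode_multipart` is a hypothetical helper for illustration, not part of the plan; browsers and HTTP libraries do the equivalent internally when given FormData.

```python
import uuid


def encode_multipart(field: str, filename: str, payload: bytes, mime: str) -> tuple[bytes, str]:
    """Encode a single file part; return (body, Content-Type header value)."""
    boundary = uuid.uuid4().hex  # fresh delimiter chosen per request
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    # The same boundary must appear in the header so the server can split the body.
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"
```

The delimiter is generated by the encoder and has to be echoed in the Content-Type header. A client that hard-codes the header cannot know it, so the server has no way to locate the audio part.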

Files:
- Create: packages/web/frontend/src/services/audioApi.ts

  • [ ] Step 1: Implement audioApi.ts

packages/web/frontend/src/services/audioApi.ts:

/**
 * Audio transcription client (v6 Model #1).
 *
 * Multipart upload — NOT the JSON apiClient (which forces application/json and
 * would break the multipart boundary). Returns the broad-phoneme contract.
 */
import { freshRequestId } from '../lib/logger';

export interface TranscriptResult {
  phonemes: string[];
  confidences: number[];
  duration_ms: number;
  coverage: string;
  limitations: string[];
}

export class TranscriberWarmingError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'TranscriberWarmingError';
  }
}

const baseUrl = import.meta.env.VITE_API_URL || '';

export async function transcribeAudio(blob: Blob, language?: string): Promise<TranscriptResult> {
  const fd = new FormData();
  fd.append('audio', blob, 'recording');
  if (language) fd.append('language', language);

  const res = await fetch(`${baseUrl}/api/audio/transcribe`, {
    method: 'POST',
    headers: { 'X-Request-ID': freshRequestId() },
    body: fd,
  });

  if (res.status === 503) {
    const body = await res.json().catch(() => ({ detail: 'Warming up' })) as { detail?: string };
    throw new TranscriberWarmingError(body.detail || 'Inference host is warming up.');
  }
  if (!res.ok) {
    const detail = await res.text().catch(() => res.statusText);
    throw new Error(`Transcription failed (${res.status}): ${detail}`);
  }
  return res.json();
}
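`transcribeAudio` returns `res.json()` without checking it against the contract. If a runtime check is wanted, one possible shape is sketched below. `isTranscriptResult` is a hypothetical guard, not part of the plan's `audioApi.ts`, and the interface is repeated only to keep the sketch self-contained.

```typescript
// Mirrors the TranscriptResult interface above so the sketch stands alone.
interface TranscriptResult {
  phonemes: string[];
  confidences: number[];
  duration_ms: number;
  coverage: string;
  limitations: string[];
}

function isStringArray(x: unknown): x is string[] {
  return Array.isArray(x) && x.every((s) => typeof s === 'string');
}

function isNumberArray(x: unknown): x is number[] {
  return Array.isArray(x) && x.every((n) => typeof n === 'number' && Number.isFinite(n));
}

// Hypothetical guard: true only when every contract field is present and well-typed.
function isTranscriptResult(value: unknown): value is TranscriptResult {
  if (typeof value !== 'object' || value === null) return false;
  const { phonemes, confidences, duration_ms, coverage, limitations } =
    value as Record<string, unknown>;
  return (
    isStringArray(phonemes) &&
    isNumberArray(confidences) &&
    phonemes.length === confidences.length && // equal-length, positionally aligned
    typeof duration_ms === 'number' &&
    typeof coverage === 'string' &&
    isStringArray(limitations) // always present per the contract
  );
}
```

A caller could throw a plain `Error` when the guard fails, which the viewer would surface through its existing `error` state.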

  • [ ] Step 2: Type-check

Run: cd packages/web/frontend && npx tsc --noEmit Expected: no errors.

  • [ ] Step 3: Commit
git add packages/web/frontend/src/services/audioApi.ts
git commit -m "feat(audio): transcribeAudio multipart client (PHON-128)"

Task C2: AudioTranscribeViewer component

Files:
- Create: packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx
- Create: packages/web/frontend/src/components/tools/AudioTranscribeViewer.test.tsx

  • [ ] Step 1: Write the failing test

packages/web/frontend/src/components/tools/AudioTranscribeViewer.test.tsx:

import { describe, it, expect, vi, beforeEach } from 'vitest';
import { render, screen, fireEvent, waitFor } from '@testing-library/react';
import AudioTranscribeViewer from './AudioTranscribeViewer';
import * as audioApi from '../../services/audioApi';

vi.mock('../../services/audioApi', async (orig) => {
  const actual = await orig() as typeof audioApi;
  return { ...actual, transcribeAudio: vi.fn() };
});

const mockedTranscribe = audioApi.transcribeAudio as unknown as ReturnType<typeof vi.fn>;

function selectFile() {
  const input = screen.getByTestId('audio-file-input') as HTMLInputElement;
  const file = new File([new Uint8Array([1, 2, 3])], 'clip.wav', { type: 'audio/wav' });
  fireEvent.change(input, { target: { files: [file] } });
}

describe('AudioTranscribeViewer', () => {
  beforeEach(() => mockedTranscribe.mockReset());

  it('renders the transcript and the coverage caveats on success', async () => {
    mockedTranscribe.mockResolvedValue({
      phonemes: ['k', 'æ', 't'],
      confidences: [0.98, 0.5, 0.95],
      duration_ms: 1230,
      coverage: 'broad-phoneme',
      limitations: ['Broad-phoneme transcription only; distortions and covert contrast are not represented.'],
    });
    render(<AudioTranscribeViewer />);
    selectFile();
    fireEvent.click(screen.getByRole('button', { name: /transcribe/i }));
    await waitFor(() => expect(screen.getByTestId('transcript')).toBeInTheDocument());
    expect(screen.getByTestId('transcript').textContent).toContain('k');
    expect(screen.getByTestId('transcript').textContent).toContain('æ');
    // caveats always present
    expect(screen.getByText(/Broad-phoneme transcription only/)).toBeInTheDocument();
  });

  it('shows a warming-up notice when the host is cold', async () => {
    mockedTranscribe.mockRejectedValue(new audioApi.TranscriberWarmingError('warming'));
    render(<AudioTranscribeViewer />);
    selectFile();
    fireEvent.click(screen.getByRole('button', { name: /transcribe/i }));
    await waitFor(() => expect(screen.getByTestId('warming-notice')).toBeInTheDocument());
  });
});

  • [ ] Step 2: Run to verify failure

Run: cd packages/web/frontend && npx vitest run src/components/tools/AudioTranscribeViewer.test.tsx Expected: FAIL — component file does not exist.

  • [ ] Step 3: Implement the component

packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx:

/**
 * v6 Model #1 — dev-gated audio transcription viewer.
 *
 * Audio in (mic record + file upload) → broad-phoneme IPA transcript with
 * per-phoneme confidence (opacity) and unavoidable coverage caveats. Not in
 * main nav; reachable only at /dev/audio.
 */
import { useRef, useState } from 'react';
import { Box, Button, Stack, Typography, Alert, Paper } from '@mui/material';
import { transcribeAudio, TranscriberWarmingError, type TranscriptResult } from '../../services/audioApi';

type State =
  | { kind: 'idle' }
  | { kind: 'loading' }
  | { kind: 'warming' }
  | { kind: 'error'; message: string }
  | { kind: 'done'; result: TranscriptResult };

export default function AudioTranscribeViewer() {
  const [blob, setBlob] = useState<Blob | null>(null);
  const [state, setState] = useState<State>({ kind: 'idle' });
  const mediaRecorder = useRef<MediaRecorder | null>(null);
  const chunks = useRef<Blob[]>([]);
  const [recording, setRecording] = useState(false);

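  async function startRecording() {
    let stream: MediaStream;
    try {
      stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    } catch (err) {
      // Permission denied or no input device: report it rather than leave an unhandled rejection.
      setState({ kind: 'error', message: `Microphone unavailable: ${(err as Error).message}` });
      return;
    }
    const mr = new MediaRecorder(stream);
    chunks.current = [];
    mr.ondataavailable = (e) => chunks.current.push(e.data);
    mr.onstop = () => {
      setBlob(new Blob(chunks.current, { type: mr.mimeType }));
      // Release the microphone once the clip is assembled.
      stream.getTracks().forEach((t) => t.stop());
    };
    mediaRecorder.current = mr;
    mr.start();
    setRecording(true);
  }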

  function stopRecording() {
    mediaRecorder.current?.stop();
    setRecording(false);
  }

  function onFile(e: React.ChangeEvent<HTMLInputElement>) {
    const f = e.target.files?.[0];
    if (f) setBlob(f);
  }

  async function run() {
    if (!blob) return;
    setState({ kind: 'loading' });
    try {
      const result = await transcribeAudio(blob);
      setState({ kind: 'done', result });
    } catch (err) {
      if (err instanceof TranscriberWarmingError) setState({ kind: 'warming' });
      else setState({ kind: 'error', message: (err as Error).message });
    }
  }

  return (
    <Box sx={{ p: 3, maxWidth: 720, mx: 'auto' }}>
      <Typography variant="h5" gutterBottom>Audio → Phoneme Transcript (dev)</Typography>

      <Stack direction="row" spacing={2} alignItems="center" sx={{ mb: 2 }}>
        {recording
          ? <Button variant="outlined" color="error" onClick={stopRecording}>Stop</Button>
          : <Button variant="outlined" onClick={startRecording}>Record</Button>}
        <Button component="label" variant="outlined">
          Upload
          <input data-testid="audio-file-input" hidden type="file" accept="audio/*" onChange={onFile} />
        </Button>
        <Button variant="contained" disabled={!blob || state.kind === 'loading'} onClick={run}>
          Transcribe
        </Button>
      </Stack>

      {blob && <Typography variant="caption" color="text.secondary">Audio ready ({Math.round(blob.size / 1024)} KB)</Typography>}

      {state.kind === 'warming' && (
        <Alert data-testid="warming-notice" severity="info" sx={{ mt: 2 }}>
          The transcription model is warming up (first request can take ~60s). Retry shortly.
        </Alert>
      )}
      {state.kind === 'error' && (
        <Alert severity="error" sx={{ mt: 2 }}>{state.message}</Alert>
      )}

      {state.kind === 'done' && (
        <Paper variant="outlined" sx={{ mt: 2, p: 2 }}>
          <Typography variant="overline" color="text.secondary">Transcript</Typography>
          <Box data-testid="transcript" sx={{ fontSize: 28, lineHeight: 1.8, fontFamily: 'serif' }}>
            {state.result.phonemes.map((p, i) => (
              <Box key={i} component="span" sx={{ mx: 0.5, opacity: 0.35 + 0.65 * state.result.confidences[i] }}>
                {p}
              </Box>
            ))}
          </Box>
          <Typography variant="caption" color="text.secondary">
            {state.result.duration_ms} ms · coverage: {state.result.coverage}
          </Typography>
          <Box sx={{ mt: 1.5 }}>
            {state.result.limitations.map((lim, i) => (
              <Typography key={i} variant="caption" display="block" color="text.secondary"> {lim}</Typography>
            ))}
          </Box>
        </Paper>
      )}
    </Box>
  );
}
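The component maps confidence to opacity inline as `0.35 + 0.65 * confidence`, so a confidence of 0 renders at 0.35 and a confidence of 1 renders fully opaque. A minimal sketch of the same map as a standalone function follows. `confidenceToOpacity` is a hypothetical name, and the clamping of out-of-range or non-finite values is an added assumption that the component itself does not make.

```typescript
// Floor of 0.35 keeps low-confidence phonemes legible; the remaining 0.65 scales linearly.
const MIN_OPACITY = 0.35;

function confidenceToOpacity(confidence: number): number {
  // Assumption (not in the component): clamp to [0, 1] and treat NaN/Infinity as 0.
  const c = Number.isFinite(confidence) ? Math.min(1, Math.max(0, confidence)) : 0;
  return MIN_OPACITY + (1 - MIN_OPACITY) * c;
}

// Worked values for the test fixture's confidences [0.98, 0.5, 0.95]:
//   0.98 -> 0.35 + 0.65 * 0.98 = 0.987
//   0.50 -> 0.35 + 0.65 * 0.50 = 0.675
//   0.95 -> 0.35 + 0.65 * 0.95 = 0.9675
const fixtureOpacities = [0.98, 0.5, 0.95].map(confidenceToOpacity);
```

Without the clamp, a missing `confidences[i]` would yield `NaN` opacity, so the guard matters mainly if the contract's equal-length guarantee is ever violated.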

  • [ ] Step 4: Run the test

Run: cd packages/web/frontend && npx vitest run src/components/tools/AudioTranscribeViewer.test.tsx Expected: PASS (2 tests)

  • [ ] Step 5: Commit
git add packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx packages/web/frontend/src/components/tools/AudioTranscribeViewer.test.tsx
git commit -m "feat(audio): AudioTranscribeViewer component + tests (PHON-128)"

Task C3: Register the unlinked /dev/audio route

Files: - Modify: packages/web/frontend/src/main.tsx

  • [ ] Step 1: Add the route

packages/web/frontend/src/main.tsx — add the import alongside the page imports:

import AudioTranscribeViewer from './components/tools/AudioTranscribeViewer.tsx';

And add a route inside <Routes> (after the /terms route). It is intentionally NOT linked from any nav; it is reachable only by typing the URL:

                <Route path="/dev/audio" element={<AudioTranscribeViewer />} />

  • [ ] Step 2: Type-check + build

Run: cd packages/web/frontend && npx tsc --noEmit && npm run build Expected: build succeeds.

  • [ ] Step 3: Commit
git add packages/web/frontend/src/main.tsx
git commit -m "feat(audio): mount unlinked /dev/audio viewer route (PHON-128)"

Phase D — End-to-end verification + harness demo

Task D1: Manual end-to-end run (and --checkpoint swap)

This is a manual verification task — no code, but it proves the rung and demonstrates the PHON-139 harness property.

  • [ ] Step 1: Start the inference host

Run: uv run python -m phonolex_audio --port 8000 Expected: model downloads (first run), then Uvicorn running on http://127.0.0.1:8000. Verify: curl http://127.0.0.1:8000/health → {"status":"ok"}.

  • [ ] Step 2: Start the Worker

Run: cd packages/web/workers && npx wrangler dev Expected: Worker on http://localhost:8787, AUDIO_INFERENCE_URL picked up from .dev.vars.

  • [ ] Step 3: Smoke the endpoint with a real wav

Run: curl -s -F "audio=@<some.wav>" http://localhost:8787/api/audio/transcribe | jq Expected: a JSON object with phonemes[], confidences[] (same length), coverage: "broad-phoneme", limitations[].

  • [ ] Step 4: Start the frontend and open the viewer

Run: cd packages/web/frontend && npm run dev, then open http://localhost:5173/dev/audio. Verify: upload a wav (or record) → Transcribe → the IPA transcript renders with confidence opacity and the caveats are visible. Confirm /dev/audio is reachable only by URL (not in nav).

  • [ ] Step 5: Demonstrate the harness seam

Restart the host with a different checkpoint, e.g. uv run python -m phonolex_audio --checkpoint <local-ft-path> --port 8000, and re-transcribe the same clip in the viewer. Confirm the transcript/confidence change with no other edits — this is the PHON-139 A/B harness.

  • [ ] Step 6: Run the full local CI-equivalent suite before any push

Run:

cd packages/web/workers && npm test && npx tsc --noEmit
cd ../frontend && npx vitest run && npx tsc --noEmit && npm run build
cd ../../.. && uv run python -m pytest packages/audio/tests/ -m "not slow"
Expected: all green. (The slow transcribe smoke is verified once in Step 3/4; CI does not run it — no model in CI.)

  • [ ] Step 7: Push the branch
git push -u origin feature/phon-128-audio-transcribe-viewer
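For the checkpoint swap in Step 5, eyeballing two transcripts is enough, but a small comparison can make the change explicit. The sketch below is one hypothetical way to do that. `diffTranscripts` and `PositionDiff` are illustrative names, not part of the plan, and the comparison is purely positional: it does not align sequences, so a single inserted phoneme will flag every later position.

```typescript
// Only the two aligned arrays from the contract are needed for a comparison.
interface TranscriptArrays {
  phonemes: string[];
  confidences: number[];
}

interface PositionDiff {
  index: number;
  a: string | null; // phoneme from checkpoint A, null if A is shorter
  b: string | null; // phoneme from checkpoint B, null if B is shorter
  confidenceDelta: number | null; // B minus A, null if either side is missing
}

// Hypothetical helper: lists positions where the phoneme or its confidence differs.
function diffTranscripts(a: TranscriptArrays, b: TranscriptArrays): PositionDiff[] {
  const length = Math.max(a.phonemes.length, b.phonemes.length);
  const diffs: PositionDiff[] = [];
  for (let i = 0; i < length; i++) {
    const pa = i < a.phonemes.length ? a.phonemes[i] : null;
    const pb = i < b.phonemes.length ? b.phonemes[i] : null;
    const both = i < a.confidences.length && i < b.confidences.length;
    const delta = both ? b.confidences[i] - a.confidences[i] : null;
    if (pa !== pb || (delta !== null && delta !== 0)) {
      diffs.push({ index: i, a: pa, b: pb, confidenceDelta: delta });
    }
  }
  return diffs;
}
```

An empty result for the same clip under two checkpoints would suggest the swap did not take effect.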

Notes for the implementer

  • CI awareness: CI has no model and no network for HF. All committed tests run without the model (-m "not slow", a mocked transcriber, fetchMock in the Worker tests, a mocked audioApi in the viewer test). Never add a committed test that downloads the checkpoint.
  • Contract is frozen: if you change the response shape, change it in the spec §6, transcribe.py (COVERAGE/LIMITATIONS + dict), audioApi.ts (TranscriptResult), and the Worker test fixture together — the three units only agree because the contract is identical.
  • Out of scope (do not add): threading the transcript into Constraint[]/the five tools, a nav tab, RunPod stand-up, fine-tuning. Those are later rungs / PHON-139.
  • ffmpeg: mic recordings are webm/opus; librosa needs ffmpeg on PATH to decode them. WAV uploads decode via soundfile with no ffmpeg. Document this if a recording fails to decode locally.