v6.0 Audio Transcribe Viewer Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: A dev-gated viewer that turns recorded/uploaded audio into a broad-phoneme IPA transcript with per-phoneme confidence and honest coverage caveats, backed by a local Python inference server proxied through a Worker route.
Architecture: Three units against a frozen JSON contract — a phonolex_audio FastAPI inference server (wav2vec2-espeak CTC, --checkpoint swappable, server-side eSpeak→PhonoLex mapping), a thin /api/audio/transcribe Worker proxy (env AUDIO_INFERENCE_URL, cold-start aware), and an unlinked /dev/audio React viewer. The Python track (Phase 0/A) and the TS track (Phase B/C) are independent once the §6 contract is frozen and can be built in parallel.
Tech Stack: Python (FastAPI, transformers, torch, librosa) · Hono on Cloudflare Workers · React + TypeScript + MUI · pytest · vitest (cloudflare:test + vitest browser/jsdom).
Spec: docs/superpowers/specs/2026-06-02-v6-audio-transcribe-viewer-design.md
Frozen API contract (§6 of the spec)
POST /transcribe (inference server) and POST /api/audio/transcribe (Worker, identical body) return:
{
"phonemes": ["k", "æ", "t"],
"confidences": [0.98, 0.91, 0.95],
"duration_ms": 1230,
"coverage": "broad-phoneme",
"limitations": [
"Broad-phoneme transcription only; distortions and covert contrast are not represented.",
"Not validated on disordered speech."
]
}
phonemes and confidences are equal-length and positionally aligned. coverage and limitations[] are always present.
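The two invariants above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper named `validate_contract` that is not part of the spec; the `[0, 1]` range check on confidences is also an assumption, inferred from the decode tests later in this plan.

```python
from typing import Any

REQUIRED_KEYS = {"phonemes", "confidences", "duration_ms", "coverage", "limitations"}


def validate_contract(payload: dict[str, Any]) -> list[str]:
    """Return contract violations; an empty list means the payload conforms."""
    problems: list[str] = []
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        # Without all keys the remaining checks cannot run safely.
        return [f"missing keys: {sorted(missing)}"]
    if len(payload["phonemes"]) != len(payload["confidences"]):
        problems.append("phonemes and confidences differ in length")
    if any(not 0.0 <= c <= 1.0 for c in payload["confidences"]):
        problems.append("confidence outside [0, 1]")  # assumed range, see lead-in
    if not isinstance(payload["limitations"], list):
        problems.append("limitations must be a list")
    return problems


example = {
    "phonemes": ["k", "æ", "t"],
    "confidences": [0.98, 0.91, 0.95],
    "duration_ms": 1230,
    "coverage": "broad-phoneme",
    "limitations": ["Not validated on disordered speech."],
}
violations = validate_contract(example)  # empty list for the spec's sample payload
```

A check like this could sit in either track's tests, since both sides build against the same frozen shape.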
File structure
| File | Responsibility |
|---|---|
| packages/audio/pyproject.toml | New phonolex-audio package metadata + deps |
| packages/audio/src/phonolex_audio/__init__.py | Package marker |
| packages/audio/src/phonolex_audio/mapping.py | eSpeak→PhonoLex broad-phoneme projection (promoted from PHON-128) |
| packages/audio/src/phonolex_audio/transcribe.py | Model load + CTC decode + confidence + mapping → contract dict |
| packages/audio/src/phonolex_audio/server.py | FastAPI app: POST /transcribe, GET /health |
| packages/audio/src/phonolex_audio/__main__.py | CLI: python -m phonolex_audio --checkpoint ... --port 8000 |
| packages/audio/tests/test_mapping.py | Mapping unit tests |
| packages/audio/tests/test_transcribe.py | Transcribe smoke (local-only) + confidence alignment |
| packages/audio/tests/test_server.py | FastAPI route tests (mocked transcribe) |
| packages/audio/README.md | Dev run instructions + --checkpoint harness note |
| pyproject.toml | Add packages/audio to the uv workspace |
| packages/web/workers/src/types.ts | Add AUDIO_INFERENCE_URL to Env |
| packages/web/workers/src/routes/audio.ts | POST /api/audio/transcribe proxy |
| packages/web/workers/src/index.ts | Mount the audio route |
| packages/web/workers/wrangler.toml | AUDIO_INFERENCE_URL var (dev default) |
| packages/web/workers/.dev.vars | Local AUDIO_INFERENCE_URL override |
| packages/web/workers/src/__tests__/audio.test.ts | Worker route tests |
| packages/web/frontend/src/services/audioApi.ts | transcribeAudio(blob, language?) (FormData fetch) |
| packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx | The viewer component |
| packages/web/frontend/src/components/tools/AudioTranscribeViewer.test.tsx | Component tests |
| packages/web/frontend/src/main.tsx | Register unlinked /dev/audio route |
Phase 0 — Package scaffold
Task 0: Create the phonolex_audio package and register it in the workspace
Files:
- Create: packages/audio/pyproject.toml
- Create: packages/audio/src/phonolex_audio/__init__.py
- Create: packages/audio/tests/test_smoke.py
- Modify: pyproject.toml:8-16 (workspace members + sources)
- [ ] Step 1: Write the failing test
packages/audio/tests/test_smoke.py:
def test_package_imports():
import phonolex_audio
assert phonolex_audio.__name__ == "phonolex_audio"
- [ ] Step 2: Run it to verify it fails
Run: uv run python -m pytest packages/audio/tests/test_smoke.py -v
Expected: FAIL — ModuleNotFoundError: No module named 'phonolex_audio'
- [ ] Step 3: Create the package files
packages/audio/pyproject.toml:
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "phonolex-audio"
version = "0.1.0"
description = "Audio → broad-phoneme transcription inference server for PhonoLex (v6 Model #1)"
license = "LicenseRef-Proprietary"
requires-python = ">=3.11"
dependencies = [
"phonolex-data",
"fastapi>=0.115",
"uvicorn>=0.30",
"python-multipart>=0.0.9",
"transformers>=4.40",
"torch>=2.2",
"librosa>=0.10",
"soundfile>=0.12",
"huggingface_hub>=0.24",
"numpy>=1.24",
]
[project.optional-dependencies]
dev = [
"pytest>=7.0",
"httpx>=0.27",
"ruff>=0.4",
]
[tool.hatch.build.targets.wheel]
packages = ["src/phonolex_audio"]
[tool.ruff]
target-version = "py311"
line-length = 100
[tool.pytest.ini_options]
testpaths = ["tests"]
packages/audio/src/phonolex_audio/__init__.py:
"""PhonoLex audio inference — v6 Model #1: audio → broad-phoneme transcript."""
Modify pyproject.toml (root) — add the member and source:
[tool.uv.workspace]
members = [
"packages/data",
"packages/features",
"packages/audio",
]
[tool.uv.sources]
phonolex-data = { workspace = true }
phonolex-features = { workspace = true }
phonolex-audio = { workspace = true }
- [ ] Step 4: Sync and run the test
Run: uv sync --all-packages && uv run python -m pytest packages/audio/tests/test_smoke.py -v
Expected: PASS
- [ ] Step 5: Commit
git add packages/audio/pyproject.toml packages/audio/src/phonolex_audio/__init__.py packages/audio/tests/test_smoke.py pyproject.toml
git commit -m "feat(audio): scaffold phonolex_audio package (PHON-128)"
Phase A — Python inference server
Task A1: Promote the eSpeak→PhonoLex mapping into the package
The PHON-128 mapping lives at research/2026-05-31-phon-128-audio-transcript/scripts/lib_mapping.py and reaches arpa_to_ipa.json by walking up the directory tree with parents[3]. When promoting it, load the target inventory through the phonolex_data package instead. That fixes the problem at its root: the module no longer depends on where it sits relative to another package's files.
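To illustrate the difference, here is a minimal sketch of loading a JSON resource by package name with `importlib.resources`. The package `demo_mappings` and the helper `load_inventory` are hypothetical stand-ins built in a temporary directory so the sketch runs without phonolex_data installed; the real resource location is whatever phonolex_data actually ships.

```python
import importlib
import json
import sys
import tempfile
from importlib.resources import files
from pathlib import Path


def load_inventory(package: str, resource: str) -> frozenset[str]:
    """Read a JSON resource shipped inside `package` and return its values."""
    text = files(package).joinpath(resource).read_text(encoding="utf-8")
    return frozenset(json.loads(text).values())


# Build a throwaway package (hypothetical name) holding a tiny mapping file.
_tmp = tempfile.mkdtemp()
_pkg = Path(_tmp) / "demo_mappings"
_pkg.mkdir()
(_pkg / "__init__.py").write_text("", encoding="utf-8")
(_pkg / "arpa_to_ipa.json").write_text(
    json.dumps({"K": "k", "AE": "æ", "T": "t"}), encoding="utf-8"
)
sys.path.insert(0, _tmp)
importlib.invalidate_caches()  # make sure the freshly written package is visible

# Resolved by import name, so it works wherever the package is installed.
inventory = load_inventory("demo_mappings", "arpa_to_ipa.json")
```

Because the lookup goes through the import system, moving the calling module to another directory does not change the result, which is the property a `parents[n]` walk lacks.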
Files:
- Create: packages/audio/src/phonolex_audio/mapping.py
- Create: packages/audio/tests/test_mapping.py
- [ ] Step 1: Write the failing test
packages/audio/tests/test_mapping.py:
from collections import Counter
from phonolex_audio.mapping import map_sequence, map_token, target_inventory
def test_target_inventory_is_broad_phoneme_set():
inv = target_inventory()
# Broad CMU→IPA inventory is ~39 phonemes.
assert 35 <= len(inv) <= 45
assert "k" in inv and "æ" in inv and "ɹ" in inv
def test_identity_tokens_pass_through():
assert map_sequence(["k", "æ", "t"]) == ["k", "æ", "t"]
def test_length_marks_stripped():
assert map_token("iː") == ["i"]
assert map_token("ɔː") == ["ɔ"]
def test_rhotic_combo_expands():
assert map_token("ɑːɹ") == ["ɑ", "ɹ"]
def test_flap_maps_to_t():
assert map_token("ɾ") == ["t"]
def test_glottal_stop_dropped():
assert map_token("ʔ") == []
def test_no_unmapped_on_native_english_token_set():
# Tokens the model actually emits on native English (from the PHON-128
# token histogram); projection must be 0% unmapped.
tokens = ["k", "æ", "t", "ð", "ə", "ɹ", "iː", "ɑːɹ", "ɾ", "n̩", "əl", "ɝ"]
unmapped: Counter = Counter()
map_sequence(tokens, unmapped)
assert sum(unmapped.values()) == 0
- [ ] Step 2: Run it to verify it fails
Run: uv run python -m pytest packages/audio/tests/test_mapping.py -v
Expected: FAIL — ModuleNotFoundError: No module named 'phonolex_audio.mapping'
- [ ] Step 3: Create mapping.py
Copy the promoted module. The ONLY change from the research original is how target_inventory() locates arpa_to_ipa.json — via importlib.resources on phonolex_data, not a parents[n] walk.
packages/audio/src/phonolex_audio/mapping.py:
"""eSpeak/phonemizer-IPA -> PhonoLex broad-phoneme-IPA mapping.
Promoted from PHON-128's lib_mapping.py. `facebook/wav2vec2-lv-60-espeak-cv-ft`
emits the full multilingual eSpeak IPA set; PhonoLex's typed-input path speaks a
39-phone broad CMU->IPA inventory. This projects model output onto that inventory
so a transcript is interchangeable with typed input.
The target inventory is loaded from phonolex_data's arpa_to_ipa.json (single
source of truth) via importlib.resources — no cross-package path walking.
"""
from __future__ import annotations
import json
import unicodedata
from collections import Counter
from functools import lru_cache
from importlib.resources import files
@lru_cache(maxsize=1)
def target_inventory() -> frozenset[str]:
    """The PhonoLex broad-phoneme IPA inventory (the legal output alphabet)."""
    resource = files("phonolex_data.mappings").joinpath("arpa_to_ipa.json")
    # Decode explicitly as UTF-8: the values are IPA and the platform default may differ.
    data = resource.read_text(encoding="utf-8")
    return frozenset(json.loads(data).values())
# Explicit eSpeak-token -> [PhonoLex IPA tokens]. [] means "drop".
ESPEAK_MAP: dict[str, list[str]] = {
"iː": ["i"], "uː": ["u"], "ɑː": ["ɑ"], "ɔː": ["ɔ"], "ɛː": ["ɛ"],
"aː": ["ɑ"], "yː": ["u"], "øː": ["oʊ"], "ɜː": ["ɝ"], "ɜ": ["ɝ"],
"oː": ["oʊ"], "eː": ["eɪ"],
"e": ["eɪ"], "o": ["oʊ"], "ø": ["oʊ"], "œ": ["ɛ"], "ɒ": ["ɑ"], "ä": ["ɑ"],
"ɐ": ["ə"], "ʉ": ["u"], "ɨ": ["ɪ"], "ɯ": ["u"], "ᵻ": ["ɪ"], "ɵ": ["oʊ"], "a": ["ɑ"],
"e̞": ["eɪ"], "o̞": ["oʊ"], "y": ["u"],
"ɑːɹ": ["ɑ", "ɹ"], "ɔːɹ": ["ɔ", "ɹ"], "oːɹ": ["ɔ", "ɹ"], "ɛɹ": ["ɛ", "ɹ"],
"ɪɹ": ["ɪ", "ɹ"], "ʊɹ": ["ʊ", "ɹ"], "ɔɹ": ["ɔ", "ɹ"], "ɑɹ": ["ɑ", "ɹ"],
"aɪɚ": ["aɪ", "ɚ"], "aɪə": ["aɪ", "ə"], "aʊɚ": ["aʊ", "ɚ"], "ɚɹ": ["ɚ"],
"ai": ["aɪ"], "ei": ["eɪ"], "au": ["aʊ"], "ɑu": ["aʊ"], "oi": ["ɔɪ"], "oɪ": ["ɔɪ"],
"ou": ["oʊ"], "iə": ["ɪ", "ə"], "uə": ["u", "ə"],
"ər": ["ɚ"], "əɹ": ["ɚ"],
"əl": ["ə", "l"], "l̩": ["ə", "l"], "n̩": ["ə", "n"], "m̩": ["ə", "m"],
"ən": ["ə", "n"], "əm": ["ə", "m"],
"ɾ": ["t"],
"ʔ": [],
"r": ["ɹ"], "ʁ": ["ɹ"], "ʀ": ["ɹ"], "ɻ": ["ɹ"], "ɽ": ["ɹ"],
"x": ["k"], "ç": ["h"], "χ": ["k"], "ɣ": ["ɡ"], "β": ["b"], "ɸ": ["f"],
"ʎ": ["l"], "ɲ": ["n", "j"], "ɳ": ["n"], "ɭ": ["l"], "ʋ": ["v"], "ɬ": ["l"],
"ʝ": ["j"], "ɕ": ["ʃ"], "ʑ": ["ʒ"], "tɕ": ["tʃ"], "dʑ": ["dʒ"], "ts": ["t", "s"],
"c": ["k"], "ɟ": ["ɡ"], "ʈ": ["t"], "ɖ": ["d"], "q": ["k"], "ɫ": ["l"],
"ʲ": [], "ˤ": [],
}
_STRIP = {"ː", "ˑ", "ʲ", "ʷ", "ˤ", "ˈ", "ˌ", "̩", "̃", "̪", "̝", "̞", "̥", "̊", "."}
_DIGITS = set("0123456789")
def _strip_marks(tok: str) -> str:
s = "".join(ch for ch in tok if ch not in _STRIP and ch not in _DIGITS)
s = "".join(ch for ch in unicodedata.normalize("NFD", s)
if not unicodedata.combining(ch))
return s
def map_token(tok: str, unmapped: Counter | None = None) -> list[str]:
"""Project one eSpeak token onto the PhonoLex inventory (possibly 0 or >1)."""
target = target_inventory()
if tok in target:
return [tok]
if tok in ESPEAK_MAP:
return ESPEAK_MAP[tok]
s = _strip_marks(tok)
if s in target:
return [s]
if s in ESPEAK_MAP:
return ESPEAK_MAP[s]
if len(s) == 1 and s in target:
return [s]
if unmapped is not None:
unmapped[tok] += 1
return []
def map_sequence(tokens: list[str], unmapped: Counter | None = None) -> list[str]:
out: list[str] = []
for t in tokens:
out.extend(map_token(t, unmapped))
return out
- [ ] Step 4: Run the test
Run: uv run python -m pytest packages/audio/tests/test_mapping.py -v
Expected: PASS (all 7 tests)
- [ ] Step 5: Commit
git add packages/audio/src/phonolex_audio/mapping.py packages/audio/tests/test_mapping.py
git commit -m "feat(audio): promote eSpeak->PhonoLex mapping into phonolex_audio (PHON-128)"
Task A2: Transcriber core — CTC decode + per-phoneme confidence + mapping
Wraps the wav2vec2-espeak CTC model and decodes audio bytes into the broad-phoneme contract dict. Confidence is the maximum softmax probability of the first CTC frame in the run that emitted each token, carried through the mapping: a token that splits into two outputs replicates its confidence, and a dropped token contributes nothing.
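The replicate-and-drop rule can be sketched in isolation. The table `TOY_MAP` and the helper `carry_confidence` below are illustrative names only; the real projection goes through `map_token` and the full ESPEAK_MAP.

```python
# Toy projection table (illustrative only; the real table is ESPEAK_MAP).
TOY_MAP: dict[str, list[str]] = {
    "ɑːɹ": ["ɑ", "ɹ"],  # one token splits into two phonemes
    "ʔ": [],  # dropped token
}


def carry_confidence(
    tokens: list[str], confs: list[float]
) -> tuple[list[str], list[float]]:
    """Project tokens through TOY_MAP, keeping phonemes and confidences aligned."""
    phonemes: list[str] = []
    out_confs: list[float] = []
    for tok, conf in zip(tokens, confs):
        mapped = TOY_MAP.get(tok, [tok])  # tokens not in the table pass through
        phonemes.extend(mapped)
        out_confs.extend([conf] * len(mapped))  # replicate once per output phoneme
    return phonemes, out_confs


phonemes, confs = carry_confidence(["k", "ɑːɹ", "ʔ"], [0.98, 0.80, 0.60])
# phonemes -> ["k", "ɑ", "ɹ"]; confs -> [0.98, 0.80, 0.80]
```

Extending both lists inside the same loop iteration is what keeps the two arrays equal-length by construction, so no separate alignment pass is needed.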
Files:
- Create: packages/audio/src/phonolex_audio/transcribe.py
- Create: packages/audio/tests/test_transcribe.py
- [ ] Step 1: Write the failing tests
packages/audio/tests/test_transcribe.py:
import math
import numpy as np
import pytest
from phonolex_audio.transcribe import _decode_logits_to_phonemes
def test_decode_aligns_phonemes_and_confidences():
# 3 vocab tokens: 0=<pad>, 1="k", 2="æ". Two clear frames -> "k","æ".
id2tok = {0: "<pad>", 1: "k", 2: "æ"}
# logits: frame0 -> token 1, frame1 -> token 2 (high confidence)
logits = np.array([[0.0, 10.0, 0.0], [0.0, 0.0, 10.0]], dtype=np.float32)
phonemes, confidences = _decode_logits_to_phonemes(logits, id2tok)
assert phonemes == ["k", "æ"]
assert len(confidences) == len(phonemes)
assert all(0.0 <= c <= 1.0 for c in confidences)
assert confidences[0] > 0.99 # softmax of a 10-vs-0 logit
def test_ctc_collapses_repeats_and_drops_pad():
id2tok = {0: "<pad>", 1: "k"}
# k, k (repeat -> collapse), pad, k (new run)
logits = np.array(
[[0.0, 9.0], [0.0, 9.0], [9.0, 0.0], [0.0, 9.0]], dtype=np.float32
)
phonemes, confidences = _decode_logits_to_phonemes(logits, id2tok)
assert phonemes == ["k", "k"]
assert len(confidences) == 2
@pytest.mark.slow
def test_transcribe_wav_smoke():
# Local-only: downloads the model. Skipped in CI (no network / heavy).
pytest.importorskip("torch")
from phonolex_audio.transcribe import Transcriber
# 0.5s of silence at 16k -> model returns *something* with aligned arrays.
wav = (np.zeros(8000, dtype=np.float32))
import io, soundfile as sf
buf = io.BytesIO()
sf.write(buf, wav, 16000, format="WAV")
t = Transcriber() # default off-the-shelf checkpoint
result = t.transcribe(buf.getvalue())
assert set(result) == {"phonemes", "confidences", "duration_ms", "coverage", "limitations"}
assert len(result["phonemes"]) == len(result["confidences"])
assert result["coverage"] == "broad-phoneme"
assert isinstance(result["limitations"], list) and result["limitations"]
assert result["duration_ms"] == 500
- [ ] Step 2: Run to verify failure
Run: uv run python -m pytest packages/audio/tests/test_transcribe.py -v -m "not slow"
Expected: FAIL — ImportError: cannot import name '_decode_logits_to_phonemes'
- [ ] Step 3: Implement transcribe.py
packages/audio/src/phonolex_audio/transcribe.py:
"""Audio bytes -> broad-phoneme transcript contract dict.
CTC decode of wav2vec2-espeak, per-phoneme confidence from frame softmax,
projected onto the PhonoLex inventory via mapping.map_sequence.
"""
from __future__ import annotations
import io
import json
from collections import Counter
import librosa
import numpy as np
from .mapping import map_token
MODEL_ID = "facebook/wav2vec2-lv-60-espeak-cv-ft"
TARGET_SR = 16000
SPECIAL = {"<s>", "</s>", "<unk>", "<pad>", "|", " ", ""}
COVERAGE = "broad-phoneme"
LIMITATIONS = [
"Broad-phoneme transcription only; distortions and covert contrast are not represented.",
"Not validated on disordered speech.",
]
def _softmax_max(row: np.ndarray) -> float:
m = float(row.max())
e = np.exp(row - m)
return float(e.max() / e.sum())
def _decode_logits_to_phonemes(
logits: np.ndarray, id2tok: dict[int, str]
) -> tuple[list[str], list[float]]:
"""CTC greedy decode -> (phonemes, confidences), projected + aligned.
For each emitted (non-special, repeat-collapsed) frame we take the raw
eSpeak token and its frame softmax-max confidence, then project through the
mapping. A token that maps to N outputs replicates its confidence N times;
a dropped token contributes nothing.
"""
ids = logits.argmax(axis=-1).tolist()
phonemes: list[str] = []
confidences: list[float] = []
unmapped: Counter[str] = Counter()
prev = None
for frame_idx, tok_id in enumerate(ids):
if tok_id == prev:
continue
prev = tok_id
tok = id2tok.get(tok_id, "<unk>")
if tok in SPECIAL:
continue
conf = _softmax_max(logits[frame_idx])
mapped = map_token(tok, unmapped)
for ph in mapped:
phonemes.append(ph)
confidences.append(round(conf, 4))
return phonemes, confidences
class Transcriber:
    """wav2vec2-espeak CTC transcriber; the model is loaded once, at construction."""

    def __init__(self, checkpoint: str = MODEL_ID, device: str | None = None):
        # Heavy imports stay local so importing this module (e.g. for --help) is cheap.
        import os

        import torch
        from huggingface_hub import hf_hub_download
        from transformers import AutoModelForCTC, Wav2Vec2FeatureExtractor

        if device is None:
            device = "mps" if torch.backends.mps.is_available() else "cpu"
        self.device = device
        self.fe = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
        self.model = AutoModelForCTC.from_pretrained(checkpoint).to(device)
        self.model.eval()
        # --checkpoint may be a local directory (fine-tuned) or a Hub id;
        # hf_hub_download only resolves the latter, so check the local path first.
        local_vocab = os.path.join(checkpoint, "vocab.json")
        vocab_path = (
            local_vocab
            if os.path.isfile(local_vocab)
            else hf_hub_download(checkpoint, "vocab.json")
        )
        with open(vocab_path, encoding="utf-8") as fh:
            tok2id = json.load(fh)
        self.id2tok = {int(i): t for t, i in tok2id.items()}

    def transcribe(self, audio_bytes: bytes, language: str | None = None) -> dict:
        # `language` is accepted to match the route signature; the decode ignores it.
        import torch

        arr, _ = librosa.load(io.BytesIO(audio_bytes), sr=TARGET_SR, mono=True)
        duration_ms = round(len(arr) / TARGET_SR * 1000)
        inputs = self.fe(arr, sampling_rate=TARGET_SR, return_tensors="pt")
        with torch.no_grad():
            logits = self.model(inputs.input_values.to(self.device)).logits
        logits_np = logits[0].cpu().numpy()
        phonemes, confidences = _decode_logits_to_phonemes(logits_np, self.id2tok)
        return {
            "phonemes": phonemes,
            "confidences": confidences,
            "duration_ms": duration_ms,
            "coverage": COVERAGE,
            "limitations": list(LIMITATIONS),  # copy so callers cannot mutate the constant
        }
- [ ] Step 4: Run the fast tests
Run: uv run python -m pytest packages/audio/tests/test_transcribe.py -v -m "not slow"
Expected: PASS (2 decode tests). The slow smoke test is run manually once locally.
- [ ] Step 5: Register the slow marker to avoid pytest warnings.
In packages/audio/pyproject.toml, add the markers entry to the existing table rather than appending a second one (TOML rejects a table that is defined twice):
[tool.pytest.ini_options]
testpaths = ["tests"]
markers = ["slow: requires model download / heavy compute (local-only)"]
(This replaces the [tool.pytest.ini_options] block created in Task 0.)
- [ ] Step 6: Commit
git add packages/audio/src/phonolex_audio/transcribe.py packages/audio/tests/test_transcribe.py packages/audio/pyproject.toml
git commit -m "feat(audio): CTC decode + per-phoneme confidence + mapping projection (PHON-128)"
Task A3: FastAPI server — POST /transcribe, GET /health
Files:
- Create: packages/audio/src/phonolex_audio/server.py
- Create: packages/audio/tests/test_server.py
- [ ] Step 1: Write the failing tests
packages/audio/tests/test_server.py:
import io
from fastapi.testclient import TestClient
from phonolex_audio.server import build_app
class FakeTranscriber:
def transcribe(self, audio_bytes: bytes, language=None) -> dict:
return {
"phonemes": ["k", "æ", "t"],
"confidences": [0.98, 0.91, 0.95],
"duration_ms": 1230,
"coverage": "broad-phoneme",
"limitations": ["x"],
}
def client() -> TestClient:
return TestClient(build_app(transcriber=FakeTranscriber()))
def test_health_ok():
r = client().get("/health")
assert r.status_code == 200
assert r.json()["status"] == "ok"
def test_transcribe_returns_contract():
files = {"audio": ("clip.wav", io.BytesIO(b"RIFFfake"), "audio/wav")}
r = client().post("/transcribe", files=files)
assert r.status_code == 200
body = r.json()
assert body["phonemes"] == ["k", "æ", "t"]
assert len(body["phonemes"]) == len(body["confidences"])
assert body["coverage"] == "broad-phoneme"
def test_transcribe_rejects_missing_audio():
r = client().post("/transcribe")
assert r.status_code == 422 # FastAPI validation: required field missing
- [ ] Step 2: Run to verify failure
Run: uv run python -m pytest packages/audio/tests/test_server.py -v
Expected: FAIL — ImportError: cannot import name 'build_app'
- [ ] Step 3: Implement server.py
packages/audio/src/phonolex_audio/server.py:
"""FastAPI inference host for v6 Model #1.
build_app(transcriber=...) lets tests inject a fake. The production entry
(__main__) builds a real Transcriber at the requested --checkpoint.
"""
from __future__ import annotations
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.middleware.cors import CORSMiddleware
def build_app(transcriber) -> FastAPI:
app = FastAPI(title="PhonoLex Audio Inference", version="0.1.0")
# The Worker proxies to this host server-side; CORS is permissive because it
# is never browser-facing directly (localhost in dev, RunPod private later).
app.add_middleware(
CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"]
)
@app.get("/health")
def health() -> dict:
return {"status": "ok"}
@app.post("/transcribe")
async def transcribe(
audio: UploadFile = File(...), language: str | None = Form(default=None)
) -> dict:
raw = await audio.read()
return transcriber.transcribe(raw, language=language)
return app
- [ ] Step 4: Run the tests
Run: uv run python -m pytest packages/audio/tests/test_server.py -v
Expected: PASS (3 tests)
- [ ] Step 5: Commit
git add packages/audio/src/phonolex_audio/server.py packages/audio/tests/test_server.py
git commit -m "feat(audio): FastAPI /transcribe + /health (PHON-128)"
Task A4: CLI entry with --checkpoint (the research-harness seam)
Files:
- Create: packages/audio/src/phonolex_audio/__main__.py
- Create: packages/audio/README.md
- [ ] Step 1: Implement __main__.py
packages/audio/src/phonolex_audio/__main__.py:
"""Run the inference host: uv run python -m phonolex_audio --checkpoint <path>
--checkpoint swaps off-the-shelf wav2vec2-espeak for a PHON-139 fine-tuned
checkpoint without touching the Worker or the viewer — the research-harness seam.
"""
from __future__ import annotations
import argparse
import uvicorn
from .server import build_app
from .transcribe import MODEL_ID, Transcriber
def main() -> None:
ap = argparse.ArgumentParser(prog="phonolex_audio")
ap.add_argument("--checkpoint", default=MODEL_ID,
help="HF id or local path; default = off-the-shelf wav2vec2-espeak")
ap.add_argument("--host", default="127.0.0.1")
ap.add_argument("--port", type=int, default=8000)
args = ap.parse_args()
print(f"Loading checkpoint: {args.checkpoint}")
app = build_app(transcriber=Transcriber(checkpoint=args.checkpoint))
uvicorn.run(app, host=args.host, port=args.port)
if __name__ == "__main__":
main()
- [ ] Step 2: Verify the CLI parses (no model load)
Run: uv run python -m phonolex_audio --help
Expected: usage text listing --checkpoint, --host, --port.
- [ ] Step 3: Write README.md
packages/audio/README.md:
# phonolex_audio — v6 Model #1 inference host
Audio → broad-phoneme transcript. FastAPI host that the Worker proxies to.
## Run locally
```bash
uv run python -m phonolex_audio --checkpoint facebook/wav2vec2-lv-60-espeak-cv-ft --port 8000
```
ffmpeg must be on PATH to decode mic recordings (webm/opus); WAV/MP3/FLAC need
only soundfile.
## Research harness (PHON-139)
Point --checkpoint at a fine-tuned checkpoint to see its transcript in the same
/dev/audio viewer — off-the-shelf vs FT, A/B, no other change.
## API
POST /transcribe (multipart: audio, optional language) →
{phonemes[], confidences[], duration_ms, coverage, limitations[]}.
GET /health → {status: "ok"}.
- [ ] Step 4: Commit
git add packages/audio/src/phonolex_audio/__main__.py packages/audio/README.md
git commit -m "feat(audio): CLI entry with --checkpoint + README (PHON-128)"
Phase B — Worker proxy route
Task B1: Add AUDIO_INFERENCE_URL to the Worker env
Files:
- Modify: packages/web/workers/src/types.ts:9-11
- Modify: packages/web/workers/wrangler.toml:33-34 (and :61-62 staging)
- Create: packages/web/workers/.dev.vars
- [ ] Step 1: Extend Env
packages/web/workers/src/types.ts — replace the Env interface:
export interface Env {
DB: D1Database;
/** Base URL of the audio inference host. localhost in dev, RunPod later. */
AUDIO_INFERENCE_URL?: string;
}
- [ ] Step 2: Add the dev-default var + local override
packages/web/workers/wrangler.toml — under the existing [vars] line (33):
[vars]
AUDIO_INFERENCE_URL = "http://127.0.0.1:8000"
Leave [env.staging.vars] empty (staging/prod get the RunPod URL at ship time, which is out of scope here).
Create packages/web/workers/.dev.vars:
AUDIO_INFERENCE_URL="http://127.0.0.1:8000"
- [ ] Step 3: Verify .dev.vars is gitignored or intentionally tracked
Run: cd packages/web/workers && git check-ignore .dev.vars; echo "exit=$?"
If exit=0 (ignored), that's fine — it's a local-only default. If not ignored, add .dev.vars to packages/web/workers/.gitignore. Do NOT commit secrets; this file holds only a localhost URL, but keep it untracked for hygiene.
- [ ] Step 4: Commit (types + wrangler only)
git add packages/web/workers/src/types.ts packages/web/workers/wrangler.toml
git commit -m "feat(audio): AUDIO_INFERENCE_URL env binding (PHON-128)"
Task B2: /api/audio/transcribe proxy route
Thin proxy: validate the multipart audio part, forward it to AUDIO_INFERENCE_URL, pass the contract through, translate cold-start/errors.
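The error-translation half of that description amounts to a small decision table. The sketch below mirrors it in Python purely for illustration (the route itself is TypeScript); the function name `translate_upstream` and the convention that `None` stands for a failed fetch are assumptions of this sketch.

```python
from typing import Any, Optional

WARMING = {"warming": True, "detail": "Inference host is warming up. Retry shortly."}


def translate_upstream(status: Optional[int], body: Any = None) -> tuple[int, Any]:
    """Map an upstream outcome to the (status, JSON body) the proxy returns.

    `status=None` stands for a network failure, i.e. the fetch itself threw.
    """
    if status is None or status == 503:
        # Cold start or unreachable host: structured warm-up signal, not a bare error.
        return 503, WARMING
    if not 200 <= status < 300:
        # Any other upstream failure is reported as a bad gateway.
        return 502, {"detail": f"Inference host error ({status})"}
    # Success: the contract body passes through untouched.
    return 200, body


cold = translate_upstream(None)
broken = translate_upstream(500)
ok = translate_upstream(200, {"phonemes": ["k"], "confidences": [0.9]})
```

Keeping the warm-up case distinct from the 502 case is what lets the viewer show a retry notice instead of a failure.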
Files:
- Create: packages/web/workers/src/routes/audio.ts
- Modify: packages/web/workers/src/index.ts:20,105 (import + mount)
- Create: packages/web/workers/src/__tests__/audio.test.ts
- [ ] Step 1: Write the failing tests
packages/web/workers/src/__tests__/audio.test.ts:
import { describe, it, expect, beforeAll, afterEach } from 'vitest';
import { SELF, fetchMock } from 'cloudflare:test';
beforeAll(() => {
fetchMock.activate();
fetchMock.disableNetConnect();
});
afterEach(() => fetchMock.assertNoPendingInterceptors());
function audioForm(): FormData {
const fd = new FormData();
fd.append('audio', new Blob([new Uint8Array([1, 2, 3])], { type: 'audio/wav' }), 'clip.wav');
return fd;
}
describe('POST /api/audio/transcribe', () => {
it('returns 400 when no audio part is present', async () => {
const res = await SELF.fetch('http://localhost/api/audio/transcribe', {
method: 'POST',
body: new FormData(),
});
expect(res.status).toBe(400);
const body = await res.json() as Record<string, unknown>;
expect(body).toHaveProperty('detail');
});
it('returns 400 when the uploaded part is not audio/*', async () => {
const fd = new FormData();
fd.append('audio', new Blob(['hello'], { type: 'text/plain' }), 'note.txt');
const res = await SELF.fetch('http://localhost/api/audio/transcribe', {
method: 'POST',
body: fd,
});
expect(res.status).toBe(400);
});
it('proxies the contract through on success', async () => {
fetchMock
.get('http://127.0.0.1:8000')
.intercept({ path: '/transcribe', method: 'POST' })
.reply(200, {
phonemes: ['k', 'æ', 't'],
confidences: [0.98, 0.91, 0.95],
duration_ms: 1230,
coverage: 'broad-phoneme',
limitations: ['x'],
});
const res = await SELF.fetch('http://localhost/api/audio/transcribe', {
method: 'POST',
body: audioForm(),
});
expect(res.status).toBe(200);
const body = await res.json() as { phonemes: string[]; confidences: number[] };
expect(body.phonemes).toEqual(['k', 'æ', 't']);
expect(body.confidences.length).toBe(body.phonemes.length);
});
it('returns a warming-up state when the host is unavailable (503)', async () => {
fetchMock
.get('http://127.0.0.1:8000')
.intercept({ path: '/transcribe', method: 'POST' })
.reply(503, 'cold');
const res = await SELF.fetch('http://localhost/api/audio/transcribe', {
method: 'POST',
body: audioForm(),
});
expect(res.status).toBe(503);
const body = await res.json() as { warming: boolean; detail: string };
expect(body.warming).toBe(true);
});
});
- [ ] Step 2: Run to verify failure
Run: cd packages/web/workers && npm test -- audio.test.ts
Expected: FAIL — route not mounted (404), assertions fail.
- [ ] Step 3: Implement routes/audio.ts
packages/web/workers/src/routes/audio.ts:
/**
* Audio transcription proxy — v6 Model #1.
*
* Thin proxy: validate the multipart `audio` part, forward to the inference
* host (AUDIO_INFERENCE_URL), pass the broad-phoneme contract straight through.
* Cold-start aware: a 503 / network failure from the host becomes a structured
* { warming: true } so the viewer can show a warm-up state instead of an error.
*/
import { Hono } from 'hono';
import type { Env } from '../types';
const audio = new Hono<{ Bindings: Env }>();
const MAX_BYTES = 10 * 1024 * 1024; // 10 MB upload cap
audio.post('/transcribe', async (c) => {
const form = await c.req.formData().catch(() => null);
const file = form?.get('audio');
if (!form || !(file instanceof File)) {
return c.json({ detail: 'Missing required multipart field: audio' }, 400);
}
if (file.size > MAX_BYTES) {
return c.json({ detail: `Audio exceeds ${MAX_BYTES} byte limit` }, 400);
}
// Reject obviously non-audio uploads. File.type can be empty for some blobs;
// only reject when a type IS present and is not audio/*.
if (file.type && !file.type.startsWith('audio/')) {
return c.json({ detail: `Unsupported content type: ${file.type}` }, 400);
}
const base = c.env.AUDIO_INFERENCE_URL;
if (!base) {
return c.json({ detail: 'Audio inference host not configured' }, 503);
}
// Re-pack into a fresh multipart body to forward.
const fwd = new FormData();
fwd.append('audio', file, file.name || 'clip');
const language = form.get('language');
if (typeof language === 'string') fwd.append('language', language);
let upstream: Response;
try {
upstream = await fetch(`${base}/transcribe`, { method: 'POST', body: fwd });
} catch {
return c.json({ warming: true, detail: 'Inference host is warming up. Retry shortly.' }, 503);
}
if (upstream.status === 503) {
return c.json({ warming: true, detail: 'Inference host is warming up. Retry shortly.' }, 503);
}
if (!upstream.ok) {
return c.json({ detail: `Inference host error (${upstream.status})` }, 502);
}
const body = await upstream.json();
return c.json(body);
});
export default audio;
- [ ] Step 4: Mount the route
packages/web/workers/src/index.ts — add the import after line 20:
import audio from './routes/audio';
Then mount the route after the existing sentences mount (app.route('/api/sentences', sentences);):
app.route('/api/audio', audio);
- [ ] Step 5: Run the tests
Run: cd packages/web/workers && npm test -- audio.test.ts
Expected: PASS (4 tests)
- [ ] Step 6: Run the full Worker suite (no regression)
Run: cd packages/web/workers && npm test
Expected: all existing tests still PASS.
- [ ] Step 7: Commit
git add packages/web/workers/src/routes/audio.ts packages/web/workers/src/index.ts packages/web/workers/src/__tests__/audio.test.ts
git commit -m "feat(audio): /api/audio/transcribe proxy route + tests (PHON-128)"
Phase C — Dev-gated viewer
Task C1: audioApi.transcribeAudio (FormData fetch)
The existing apiClient always sets Content-Type: application/json, which breaks multipart uploads (the browser must set the boundary). A dedicated small service uses raw fetch with FormData.
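Why the boundary matters can be shown with any multipart parser. The sketch below uses Python's standard-library `email` parser as a stand-in for the server-side form parser (an assumption of this sketch; neither the Worker nor FastAPI uses it). The boundary string and the helper `parse_parts` are illustrative.

```python
from email import message_from_bytes

BOUNDARY = "demo-boundary"  # arbitrary; browsers generate their own value

# A multipart body with one `audio` part, as a browser would send it.
body = (
    f"--{BOUNDARY}\r\n"
    'Content-Disposition: form-data; name="audio"; filename="clip.wav"\r\n'
    "Content-Type: audio/wav\r\n"
    "\r\n"
    "RIFFfake\r\n"
    f"--{BOUNDARY}--\r\n"
).encode("utf-8")


def parse_parts(content_type: str, payload: bytes) -> list[str]:
    """Return the form-field names found when `payload` is read under `content_type`."""
    raw = b"Content-Type: " + content_type.encode("utf-8") + b"\r\n\r\n" + payload
    msg = message_from_bytes(raw)
    if not msg.is_multipart():
        return []  # the parser had no boundary to split on
    return [
        part.get_param("name", header="content-disposition")
        for part in msg.get_payload()
    ]


with_boundary = parse_parts(f"multipart/form-data; boundary={BOUNDARY}", body)
forced_json = parse_parts("application/json", body)
```

The same bytes yield the `audio` field only when the header carries the boundary, which is why the client leaves Content-Type unset and lets the browser fill it in.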
Files:
- Create: packages/web/frontend/src/services/audioApi.ts
- [ ] Step 1: Implement audioApi.ts
packages/web/frontend/src/services/audioApi.ts:
/**
* Audio transcription client (v6 Model #1).
*
* Multipart upload — NOT the JSON apiClient (which forces application/json and
* would break the multipart boundary). Returns the broad-phoneme contract.
*/
import { freshRequestId } from '../lib/logger';
export interface TranscriptResult {
phonemes: string[];
confidences: number[];
duration_ms: number;
coverage: string;
limitations: string[];
}
export class TranscriberWarmingError extends Error {
constructor(message: string) {
super(message);
this.name = 'TranscriberWarmingError';
}
}
const baseUrl = import.meta.env.VITE_API_URL || '';
export async function transcribeAudio(blob: Blob, language?: string): Promise<TranscriptResult> {
const fd = new FormData();
fd.append('audio', blob, 'recording');
if (language) fd.append('language', language);
const res = await fetch(`${baseUrl}/api/audio/transcribe`, {
method: 'POST',
headers: { 'X-Request-ID': freshRequestId() },
body: fd,
});
if (res.status === 503) {
const body = await res.json().catch(() => ({ detail: 'Warming up' })) as { detail?: string };
throw new TranscriberWarmingError(body.detail || 'Inference host is warming up.');
}
if (!res.ok) {
const detail = await res.text().catch(() => res.statusText);
throw new Error(`Transcription failed (${res.status}): ${detail}`);
}
return res.json();
}
- [ ] Step 2: Type-check
Run: cd packages/web/frontend && npx tsc --noEmit
Expected: no errors.
- [ ] Step 3: Commit
git add packages/web/frontend/src/services/audioApi.ts
git commit -m "feat(audio): transcribeAudio multipart client (PHON-128)"
Task C2: AudioTranscribeViewer component
Files:
- Create: packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx
- Create: packages/web/frontend/src/components/tools/AudioTranscribeViewer.test.tsx
- [ ] Step 1: Write the failing test
packages/web/frontend/src/components/tools/AudioTranscribeViewer.test.tsx:
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { render, screen, fireEvent, waitFor } from '@testing-library/react';
import AudioTranscribeViewer from './AudioTranscribeViewer';
import * as audioApi from '../../services/audioApi';
vi.mock('../../services/audioApi', async (orig) => {
const actual = await orig() as typeof audioApi;
return { ...actual, transcribeAudio: vi.fn() };
});
const mockedTranscribe = audioApi.transcribeAudio as unknown as ReturnType<typeof vi.fn>;
function selectFile() {
const input = screen.getByTestId('audio-file-input') as HTMLInputElement;
const file = new File([new Uint8Array([1, 2, 3])], 'clip.wav', { type: 'audio/wav' });
fireEvent.change(input, { target: { files: [file] } });
}
describe('AudioTranscribeViewer', () => {
beforeEach(() => mockedTranscribe.mockReset());
it('renders the transcript and the coverage caveats on success', async () => {
mockedTranscribe.mockResolvedValue({
phonemes: ['k', 'æ', 't'],
confidences: [0.98, 0.5, 0.95],
duration_ms: 1230,
coverage: 'broad-phoneme',
limitations: ['Broad-phoneme transcription only; distortions and covert contrast are not represented.'],
});
render(<AudioTranscribeViewer />);
selectFile();
fireEvent.click(screen.getByRole('button', { name: /transcribe/i }));
await waitFor(() => expect(screen.getByTestId('transcript')).toBeInTheDocument());
expect(screen.getByTestId('transcript').textContent).toContain('k');
expect(screen.getByTestId('transcript').textContent).toContain('æ');
// caveats always present
expect(screen.getByText(/Broad-phoneme transcription only/)).toBeInTheDocument();
});
it('shows a warming-up notice when the host is cold', async () => {
mockedTranscribe.mockRejectedValue(new audioApi.TranscriberWarmingError('warming'));
render(<AudioTranscribeViewer />);
selectFile();
fireEvent.click(screen.getByRole('button', { name: /transcribe/i }));
await waitFor(() => expect(screen.getByTestId('warming-notice')).toBeInTheDocument());
});
});
- [ ] Step 2: Run to verify failure
Run: cd packages/web/frontend && npx vitest run src/components/tools/AudioTranscribeViewer.test.tsx
Expected: FAIL — component file does not exist.
- [ ] Step 3: Implement the component
packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx:
/**
* v6 Model #1 — dev-gated audio transcription viewer.
*
* Audio in (mic record + file upload) → broad-phoneme IPA transcript with
* per-phoneme confidence (opacity) and unavoidable coverage caveats. Not in
* main nav; reachable only at /dev/audio.
*/
import { useRef, useState } from 'react';
import { Box, Button, Stack, Typography, Alert, Paper } from '@mui/material';
import { transcribeAudio, TranscriberWarmingError, type TranscriptResult } from '../../services/audioApi';
type State =
| { kind: 'idle' }
| { kind: 'loading' }
| { kind: 'warming' }
| { kind: 'error'; message: string }
| { kind: 'done'; result: TranscriptResult };
export default function AudioTranscribeViewer() {
const [blob, setBlob] = useState<Blob | null>(null);
const [state, setState] = useState<State>({ kind: 'idle' });
const mediaRecorder = useRef<MediaRecorder | null>(null);
const chunks = useRef<Blob[]>([]);
const [recording, setRecording] = useState(false);
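async function startRecording() {
try {
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mr = new MediaRecorder(stream);
chunks.current = [];
// Skip empty chunks so the final Blob only contains recorded data.
mr.ondataavailable = (e) => { if (e.data.size > 0) chunks.current.push(e.data); };
mr.onstop = () => {
setBlob(new Blob(chunks.current, { type: mr.mimeType }));
// Release the microphone once the recording has been assembled.
stream.getTracks().forEach((t) => t.stop());
};
mediaRecorder.current = mr;
mr.start();
setRecording(true);
} catch (err) {
// Permission denied or no input device: show it instead of leaving an unhandled rejection.
setState({ kind: 'error', message: `Could not start recording: ${(err as Error).message}` });
}
}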
function stopRecording() {
mediaRecorder.current?.stop();
setRecording(false);
}
function onFile(e: React.ChangeEvent<HTMLInputElement>) {
const f = e.target.files?.[0];
if (f) setBlob(f);
}
async function run() {
if (!blob) return;
setState({ kind: 'loading' });
try {
const result = await transcribeAudio(blob);
setState({ kind: 'done', result });
} catch (err) {
if (err instanceof TranscriberWarmingError) setState({ kind: 'warming' });
else setState({ kind: 'error', message: (err as Error).message });
}
}
return (
<Box sx={{ p: 3, maxWidth: 720, mx: 'auto' }}>
<Typography variant="h5" gutterBottom>Audio → Phoneme Transcript (dev)</Typography>
<Stack direction="row" spacing={2} alignItems="center" sx={{ mb: 2 }}>
{recording
? <Button variant="outlined" color="error" onClick={stopRecording}>Stop</Button>
: <Button variant="outlined" onClick={startRecording}>Record</Button>}
<Button component="label" variant="outlined">
Upload
<input data-testid="audio-file-input" hidden type="file" accept="audio/*" onChange={onFile} />
</Button>
<Button variant="contained" disabled={!blob || state.kind === 'loading'} onClick={run}>
Transcribe
</Button>
</Stack>
{blob && <Typography variant="caption" color="text.secondary">Audio ready ({Math.round(blob.size / 1024)} KB)</Typography>}
{state.kind === 'warming' && (
<Alert data-testid="warming-notice" severity="info" sx={{ mt: 2 }}>
The transcription model is warming up (first request can take ~60s). Retry shortly.
</Alert>
)}
{state.kind === 'error' && (
<Alert severity="error" sx={{ mt: 2 }}>{state.message}</Alert>
)}
{state.kind === 'done' && (
<Paper variant="outlined" sx={{ mt: 2, p: 2 }}>
<Typography variant="overline" color="text.secondary">Transcript</Typography>
<Box data-testid="transcript" sx={{ fontSize: 28, lineHeight: 1.8, fontFamily: 'serif' }}>
{state.result.phonemes.map((p, i) => (
<Box key={i} component="span" sx={{ mx: 0.5, opacity: 0.35 + 0.65 * state.result.confidences[i] }}>
{p}
</Box>
))}
</Box>
<Typography variant="caption" color="text.secondary">
{state.result.duration_ms} ms · coverage: {state.result.coverage}
</Typography>
<Box sx={{ mt: 1.5 }}>
{state.result.limitations.map((lim, i) => (
<Typography key={i} variant="caption" display="block" color="text.secondary">⚠ {lim}</Typography>
))}
</Box>
</Paper>
)}
</Box>
);
}
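The component maps confidence to opacity inline as 0.35 + 0.65 * confidence, so a phoneme with confidence 0 still renders at 35% opacity and one with confidence 1 renders fully opaque. As a sketch only, the same mapping can be pulled into a small pure function. The name confidenceToOpacity, the clamp to [0, 1], and the fallback for a missing value are assumptions added for illustration; the plan's component does none of these.

```typescript
// Hypothetical extraction of the viewer's inline opacity expression.
const MIN_OPACITY = 0.35; // floor used by the component
const OPACITY_RANGE = 0.65; // span used by the component (0.35 + 0.65 = 1)

function confidenceToOpacity(confidence: number | undefined): number {
  // Assumption: a missing or NaN confidence is drawn at the floor.
  if (typeof confidence !== 'number' || isNaN(confidence)) return MIN_OPACITY;
  // Assumption: out-of-range values are clamped rather than passed through.
  const clamped = Math.min(1, Math.max(0, confidence));
  return MIN_OPACITY + OPACITY_RANGE * clamped;
}
```

Keeping the mapping in one function would also make it testable without rendering, if the opacity behaviour ever needs its own assertion.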
- [ ] Step 4: Run the test
Run: cd packages/web/frontend && npx vitest run src/components/tools/AudioTranscribeViewer.test.tsx
Expected: PASS (2 tests)
- [ ] Step 5: Commit
git add packages/web/frontend/src/components/tools/AudioTranscribeViewer.tsx packages/web/frontend/src/components/tools/AudioTranscribeViewer.test.tsx
git commit -m "feat(audio): AudioTranscribeViewer component + tests (PHON-128)"
Task C3: Register the unlinked /dev/audio route¶
Files:
- Modify: packages/web/frontend/src/main.tsx
- [ ] Step 1: Add the route
packages/web/frontend/src/main.tsx — add the import alongside the page imports:
import AudioTranscribeViewer from './components/tools/AudioTranscribeViewer.tsx';
Then add the route inside <Routes> (after the /terms route). It is intentionally NOT linked from any nav; it is reachable only by typing the URL:
<Route path="/dev/audio" element={<AudioTranscribeViewer />} />
- [ ] Step 2: Type-check + build
Run: cd packages/web/frontend && npx tsc --noEmit && npm run build
Expected: build succeeds.
- [ ] Step 3: Commit
git add packages/web/frontend/src/main.tsx
git commit -m "feat(audio): mount unlinked /dev/audio viewer route (PHON-128)"
Phase D — End-to-end verification + harness demo¶
Task D1: Manual end-to-end run (and --checkpoint swap)¶
This is a manual verification task — no code, but it proves the rung and demonstrates the PHON-139 harness property.
- [ ] Step 1: Start the inference host
Run: uv run python -m phonolex_audio --port 8000
Expected: model downloads (first run), then Uvicorn running on http://127.0.0.1:8000.
Verify: curl http://127.0.0.1:8000/health → {"status":"ok"}.
- [ ] Step 2: Start the Worker
Run: cd packages/web/workers && npx wrangler dev
Expected: Worker on http://localhost:8787, AUDIO_INFERENCE_URL picked up from .dev.vars.
- [ ] Step 3: Smoke the endpoint with a real wav
Run: curl -s -F "audio=@<some.wav>" http://localhost:8787/api/audio/transcribe | jq
Expected: a JSON object with phonemes[], confidences[] (same length), coverage: "broad-phoneme", limitations[].
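The Step 3 expectation can also be checked mechanically rather than by eye. The sketch below lists every way a parsed response deviates from the frozen contract; contractProblems and its messages are hypothetical names, and the field rules are taken only from the contract as stated in this plan.

```typescript
const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every((x) => typeof x === 'string');
const isNumberArray = (v: unknown): v is number[] =>
  Array.isArray(v) && v.every((x) => typeof x === 'number');

// Returns a list of contract violations; an empty list means the body conforms.
function contractProblems(body: unknown): string[] {
  if (typeof body !== 'object' || body === null) return ['body is not a JSON object'];
  const r = body as Record<string, unknown>;
  const phonemes = r['phonemes'];
  const confidences = r['confidences'];
  const problems: string[] = [];
  if (!isStringArray(phonemes)) problems.push('phonemes is not a string array');
  if (!isNumberArray(confidences)) problems.push('confidences is not a number array');
  // The two arrays must be positionally aligned, so their lengths must match.
  if (isStringArray(phonemes) && isNumberArray(confidences) && phonemes.length !== confidences.length) {
    problems.push('phonemes and confidences differ in length');
  }
  if (typeof r['duration_ms'] !== 'number') problems.push('duration_ms is not a number');
  if (r['coverage'] !== 'broad-phoneme') problems.push('coverage is not "broad-phoneme"');
  if (!isStringArray(r['limitations'])) problems.push('limitations is not a string array');
  return problems;
}
```

Passing the parsed output of the Step 3 request through this function should yield an empty list when the host and Worker agree with the contract.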
- [ ] Step 4: Start the frontend and open the viewer
Run: cd packages/web/frontend && npm run dev, then open http://localhost:5173/dev/audio.
Verify: upload a wav (or record) → Transcribe → the IPA transcript renders with confidence opacity and the caveats are visible. Confirm /dev/audio is reachable only by URL (not in nav).
- [ ] Step 5: Demonstrate the harness seam
Restart the host with a different checkpoint, e.g. uv run python -m phonolex_audio --checkpoint <local-ft-path> --port 8000, and re-transcribe the same clip in the viewer. Confirm the transcript/confidence change with no other edits — this is the PHON-139 A/B harness.
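To make "the transcript/confidence change" in Step 5 concrete, two runs over the same clip can be compared numerically. The sketch below is one possible comparison and not something the plan prescribes: it uses the standard Levenshtein edit distance over phoneme sequences plus a mean confidence, and phonemeEditDistance, meanConfidence, and Transcript are hypothetical names.

```typescript
interface Transcript {
  phonemes: string[];
  confidences: number[];
}

// Levenshtein distance over phoneme tokens: the minimum number of
// insertions, deletions, and substitutions that turn `a` into `b`.
function phonemeEditDistance(a: string[], b: string[]): number {
  // prev[j] holds the distance between the first i-1 tokens of a and the first j of b.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, substitution));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Mean of the per-phoneme confidences, or null for an empty transcript.
function meanConfidence(t: Transcript): number | null {
  if (t.confidences.length === 0) return null;
  return t.confidences.reduce((sum, c) => sum + c, 0) / t.confidences.length;
}
```

A distance of 0 with a shifted mean confidence would suggest the checkpoint changed how sure the model is without changing what it heard; a non-zero distance means the phoneme sequences themselves differ.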
- [ ] Step 6: Run the full local CI-equivalent suite before any push
Run:
cd packages/web/workers && npm test && npx tsc --noEmit
cd ../frontend && npx vitest run && npx tsc --noEmit && npm run build
cd ../../.. && uv run python -m pytest packages/audio/tests/ -m "not slow"
(The slow transcribe smoke is verified once in Steps 3 and 4; CI does not run it because there is no model in CI.)
- [ ] Step 7: Push the branch
git push -u origin feature/phon-128-audio-transcribe-viewer
Notes for the implementer¶
- CI awareness: CI has no model and no network for HF. All committed tests run without the model (-m "not slow", mocked transcriber, mocked fetchMock/apiClient). Never add a committed test that downloads the checkpoint.
- Contract is frozen: if you change the response shape, change it in the spec §6, transcribe.py (COVERAGE/LIMITATIONS + dict), audioApi.ts (TranscriptResult), and the Worker test fixture together. The three units only agree because the contract is identical.
- Out of scope (do not add): threading the transcript into Constraint[]/the five tools, a nav tab, RunPod stand-up, fine-tuning. Those are later rungs / PHON-139.
- ffmpeg: mic recordings are webm/opus; librosa needs ffmpeg on PATH to decode them. WAV uploads decode via soundfile with no ffmpeg. Document this if a recording fails to decode locally.
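The ffmpeg note can be turned into a small lookup for debugging a failed decode. This is a sketch under stated assumptions, not part of the plan: expectedDecodePath and DecodePath are hypothetical names, the WAV aliases are the common MIME spellings, and anything the note does not mention is reported as unknown rather than guessed.

```typescript
type DecodePath = 'soundfile' | 'ffmpeg' | 'unknown';

// Maps a Blob's MIME type to the decode path described in the ffmpeg note.
function expectedDecodePath(mimeType: string): DecodePath {
  // Drop parameters such as ";codecs=opus" and normalise case.
  const base = mimeType.split(';')[0].trim().toLowerCase();
  if (base === 'audio/wav' || base === 'audio/x-wav' || base === 'audio/wave') {
    return 'soundfile'; // WAV uploads decode without ffmpeg
  }
  if (base === 'audio/webm') {
    return 'ffmpeg'; // mic recordings (webm/opus) need ffmpeg on PATH
  }
  return 'unknown'; // not covered by the note; do not guess
}
```

For example, expectedDecodePath(blob.type) in the viewer's error branch could hint that a failed webm recording points at a missing ffmpeg on the inference host rather than at the model.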