v6.0: Audio Transcribe Viewer (Model #1, dev-gated)

Date: 2026-06-02
Ticket: PHON-128 (implementation) · parent PHON-44 (Audio workstream)
Related: PHON-139 (transcriber fine-tune research, separate track) · docs/superpowers/specs/2026-05-30-v6-audio-support-design.md (the broader v6 architecture)
Branch: feature/phon-128-audio-transcribe-viewer off develop

1. Goal

A dev-gated viewer that turns uploaded or recorded audio into a broad-phoneme IPA transcript and displays it, with per-phoneme confidence and honest coverage caveats, so we can watch the transcriber work and iterate on the model behind it.

This is the first shippable rung of the v6 audio pillar. It is built as the real product surface (real Worker route, real React component, real inference server), run locally now, productionized (RunPod) later via a single config swap. Same playbook as the PHON-134 nonce generator's local-demo-first shipping pattern.

1.1 Why viewer-first (non-goal for v1)

Threading the transcript into Constraint[] → the existing five tools ("audio as if typed") is deferred to a later productization rung. It carries its own UX decisions (which tools, entry point, how a transcript becomes a query) that deserve a dedicated design pass. v1's job is visibility + serving as the research harness for PHON-139, both fully met by the viewer alone.

1.2 Why dev-gated, not a nav tab

The eventual home is a sixth tab alongside the five tools, but that tab's shape is broader than this rung (it will eventually host more than Model #1's transcript). A dev-gated standalone page lets the transcriber earn its place without committing the nav surface prematurely.

2. Architecture

React AudioInput (mic record + file upload)
  → POST /api/audio/transcribe         Worker — thin proxy, env AUDIO_INFERENCE_URL
  → POST /transcribe                    phonolex_audio server (local now, RunPod later)
       · load wav2vec2-espeak @ --checkpoint   (off-the-shelf | PHON-139 FT)
       · CTC decode → eSpeak IPA + per-frame posteriors
       · lib_mapping: eSpeak → PhonoLex 39 broad phonemes (server-side)
       → { phonemes[], confidences[], duration_ms, coverage, limitations[] }
  ← Worker passes the JSON straight through
← Viewer renders: IPA transcript · per-phoneme confidence heat · coverage/limitations caveats

The localhost→RunPod transition is a single env var (AUDIO_INFERENCE_URL). Nothing else changes at ship time.

3. Unit 1: Inference server (`packages/audio/`)

A new Python package phonolex_audio. Not a research throwaway — this same FastAPI handler becomes the RunPod serverless handler at ship time.

Model loading: --checkpoint <path> flag; defaults to the off-the-shelf HF id facebook/wav2vec2-lv-60-espeak-cv-ft. Pointing it at a PHON-139 fine-tuned checkpoint is how the viewer doubles as the FT research harness — swap the checkpoint, watch the transcript and confidences change in the same UI.
Device: MPS on the M4 locally; CUDA on RunPod.
Endpoint: POST /transcribe — multipart/form-data audio (+ optional language hint) → the JSON contract in §6.
Mapping: done server-side, in this package. Promote lib_mapping.py (and its unit tests) out of research/2026-05-31-phon-128-audio-transcript/scripts/ into the package, still sourced from the same arpa_to_ipa.json single source of truth. This keeps the Worker a pure proxy and avoids a premature TS port.
Dependency isolation: torch is a dependency of this package only. The Worker never imports it. Audio decoding (e.g. soundfile/librosa) likewise lives only here.
Confidence: one value per emitted phoneme, taken from the CTC posteriors as the max softmax probability over the frames that decode to each output token, then projected through the same collapse the mapping applies.
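The confidence rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's implementation: it assumes greedy (argmax) CTC decoding, a blank token at index 0, and per-frame logits already available as nested lists. The function names and the tiny vocabulary are hypothetical, and the eSpeak → broad-phoneme projection is deliberately left out.

```python
import math
from itertools import groupby

BLANK = 0  # assumption: index of the CTC blank token in the vocabulary


def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_ctc_with_confidence(frame_logits, vocab):
    """Greedy CTC decode returning (tokens, confidences).

    Consecutive frames with the same argmax collapse into one token,
    blank runs are dropped, and a token's confidence is the highest
    softmax probability among the frames in its run.
    """
    best = []  # one (token_id, probability) pair per frame
    for logits in frame_logits:
        probs = softmax(logits)
        token_id = max(range(len(probs)), key=probs.__getitem__)
        best.append((token_id, probs[token_id]))

    tokens, confidences = [], []
    for token_id, run in groupby(best, key=lambda pair: pair[0]):
        if token_id == BLANK:
            continue  # blanks only separate tokens; they emit nothing
        tokens.append(vocab[token_id])
        confidences.append(max(prob for _, prob in run))
    return tokens, confidences


# Hypothetical four-symbol vocabulary and five frames of logits.
vocab = ["<blank>", "k", "æ", "t"]
frames = [
    [0.0, 4.0, 0.0, 0.0],  # k
    [0.0, 5.0, 0.0, 0.0],  # k again, same run
    [6.0, 0.0, 0.0, 0.0],  # blank
    [0.0, 0.0, 3.0, 0.0],  # æ
    [0.0, 0.0, 0.0, 4.0],  # t
]
tokens, confidences = greedy_ctc_with_confidence(frames, vocab)
print(tokens)  # ['k', 'æ', 't']
```

Because tokens and confidences are appended together, the two lists stay positionally aligned, which is the property the response contract relies on.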

4. Unit 2: Worker route (`packages/web/workers/src/routes/audio.ts`)

POST /api/audio/transcribe.

multipart/form-data in (audio blob + optional language); proxies the body to AUDIO_INFERENCE_URL (env, set in wrangler.toml / dev .dev.vars).
Returns the inference contract verbatim — no transformation in the Worker for v1 (mapping already applied server-side).
Cold-start aware: RunPod scale-to-zero cold start is ~60s. On a slow/503 backend response the route returns a structured "warming up" state the component can render, not a raw gateway timeout. (Local dev: server is always warm, so this path is exercised via test, not normal use.)
Limits: reject oversized uploads and non-audio content types with a 400 + message before proxying.
Mounted in index.ts alongside the existing route group.

5. Unit 3: Viewer component (`packages/web/frontend`)

AudioInput + a transcript panel, on a standalone dev-gated route (not in main nav, not threaded into the five tools).

Input: mic capture via MediaRecorder + file upload. One clear primary action ("Transcribe").
Service: apiClient.transcribeAudio(blob, language?) → /api/audio/transcribe.
Render:
The broad-phoneme IPA transcript.
Per-phoneme confidence as opacity/heat over each phoneme.
coverage ("broad-phoneme") and limitations[] shown inline and impossible to dismiss: broad-phoneme only; distortions/covert contrast invisible by design; not validated on disordered speech. These caveats must be present before this surface is ever shown to a clinician.
A "warming up" state wired to the Worker's cold-start signal.
Dev gating: there is no accounts/auth system to gate behind yet, so v1 uses an unlinked route (established precedent in this frontend) — e.g. /dev/audio, reachable by direct URL only, absent from the nav and from any discoverable surface. No auth, no feature flag; just not linked.

6. API contract

POST /api/audio/transcribe (and the inference POST /transcribe it proxies):

Request: multipart/form-data — audio (file/blob), optional language (hint).

Response:

{
  "phonemes": ["k", "æ", "t"],
  "confidences": [0.98, 0.91, 0.95],
  "duration_ms": 1230,
  "coverage": "broad-phoneme",
  "limitations": [
    "Broad-phoneme transcription only; distortions and covert contrast are not represented.",
    "Not validated on disordered speech."
  ]
}

phonemes and confidences are positionally aligned and equal length. coverage and limitations[] are always present so the consuming UI cannot omit the caveats.
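The invariants above are cheap to enforce at the server boundary. The following is a hedged sketch of such a check: `TranscribeResponse` and `parse_transcribe_response` are hypothetical names, and the requirement that confidences lie in [0, 1] is an assumption that follows from their being softmax probabilities, not something the contract states.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("phonemes", "confidences", "duration_ms", "coverage", "limitations")


@dataclass(frozen=True)
class TranscribeResponse:
    phonemes: list
    confidences: list
    duration_ms: int
    coverage: str
    limitations: list


def parse_transcribe_response(payload: dict) -> TranscribeResponse:
    """Validate a decoded JSON payload against the contract's invariants."""
    missing = [key for key in REQUIRED_FIELDS if key not in payload]
    if missing:
        # coverage and limitations are mandatory, so their absence is an error.
        raise ValueError(f"missing fields: {', '.join(missing)}")
    if len(payload["phonemes"]) != len(payload["confidences"]):
        raise ValueError("phonemes and confidences must be the same length")
    # Assumption: confidences are probabilities.
    if not all(0.0 <= c <= 1.0 for c in payload["confidences"]):
        raise ValueError("confidences must lie in [0, 1]")
    return TranscribeResponse(**{key: payload[key] for key in REQUIRED_FIELDS})
```

Running the same check in the pytest smoke test and in the handler would catch a misaligned response before the viewer ever renders it.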

7. Error handling

| Condition | Behavior |
| --- | --- |
| Backend warming (RunPod cold) | Worker returns a structured "warming up" state; component shows a warm-up notice |
| Oversized / non-audio upload | Worker `400` + message, before proxying |
| Inference failure | Worker `502` + safe message; component shows a retry affordance |
| OOV / hard audio | Still returns a transcript (broad phoneme), just lower confidences; never an error |
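The table reads as a two-stage decision: a pre-proxy check on the upload, then a classification of the backend's reply. The sketch below models only that logic, in Python for consistency with the other examples; the real route is the TypeScript Worker. The size limit, the function names, and the outcome labels are all hypothetical, and the spec does not say which HTTP status carries the "warming up" state, so none is assigned here.

```python
from typing import Optional

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # hypothetical limit; the spec sets no number


def precheck_upload(content_type: str, size_bytes: int) -> Optional[str]:
    """Return an error message for a 400 response, or None if the upload may be proxied.

    content_type is the type of the audio part, not of the multipart envelope.
    """
    if not content_type.lower().startswith("audio/"):
        return "Unsupported content type; expected audio/*."
    if size_bytes > MAX_UPLOAD_BYTES:
        return "Upload too large."
    return None


def classify_backend(status: Optional[int], timed_out: bool) -> str:
    """Map the backend's reply onto the outcomes in the table.

    status is None when no response arrived (for example, after a timeout).
    """
    if timed_out or status == 503:
        return "warming_up"  # rendered as a warm-up notice
    if status is not None and 200 <= status < 300:
        return "ok"  # contract passed through verbatim
    return "inference_failure"  # Worker answers 502 with a safe message
```

Low-confidence transcripts never reach the failure branch: they arrive as ordinary 2xx responses and are classified as `ok`.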

8. Testing

Python (packages/audio/): pytest smoke — a known short wav → expected broad-phoneme string; promoted lib_mapping unit tests (0% unmapped on native English, the +6.2-PER-point projection).
Worker: vitest cloudflare:test + SELF.fetch, with a mocked inference backend — asserts the proxy passes the contract through, the 400/502/warm-up paths, and multipart handling.
Frontend: vitest component test with a mocked apiClient — asserts the transcript, confidence rendering, and the coverage/limitations caveats all render; asserts the warm-up state renders on the cold-start signal.

9. Out of scope (explicit)

Threading transcript → Constraint[] → the five tools (later rung).
A nav tab / production-discoverable surface (later rung).
RunPod host stand-up (ship-time config swap; the handler is built and run locally now).
Fine-tuning the model (PHON-139, separate track). v1 runs the off-the-shelf checkpoint; the --checkpoint flag is the seam for swapping in PHON-139 output.
Disordered-speech transcription, distortion classification, covert contrast (Models #3/#5).

10. Done when

packages/audio/ server transcribes a wav → broad-phoneme contract, behind a --checkpoint flag, with passing tests.
POST /api/audio/transcribe Worker route live (dev), proxying via AUDIO_INFERENCE_URL, with passing tests including cold-start + error paths.
Dev-gated viewer renders transcript + confidence + caveats from real local inference, with passing component tests.
The same viewer renders a PHON-139 checkpoint's output by swapping --checkpoint (the research-harness property), demonstrated end to end locally.
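The `--checkpoint` seam in the last criterion might be as small as one argparse option. This is a sketch under assumptions: the default is the off-the-shelf id named in §3, while `parse_args`, the use of `phonolex_audio` as the program name, and the example path are hypothetical.

```python
import argparse

# Off-the-shelf checkpoint named in the spec; used when no flag is given.
DEFAULT_CHECKPOINT = "facebook/wav2vec2-lv-60-espeak-cv-ft"


def parse_args(argv=None):
    """Parse server options; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(prog="phonolex_audio")
    parser.add_argument(
        "--checkpoint",
        default=DEFAULT_CHECKPOINT,
        help="Hugging Face model id or local path to a fine-tuned checkpoint",
    )
    return parser.parse_args(argv)


# Off-the-shelf run versus a (hypothetical) PHON-139 fine-tuned checkpoint path.
print(parse_args([]).checkpoint)
print(parse_args(["--checkpoint", "checkpoints/phon-139-ft"]).checkpoint)
```

Keeping the model identity in a single option means the research-harness swap touches no server code, only the launch command.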