Beta Audio Tab — Design

Date: 2026-06-14
Status: Design, approved in brainstorming; for spec review before planning.
Branch: feature/phon-145-audio-beta-tab (off release/v6-audio).
Realizes: PHON-145 (the user-facing Audio tab), reimagined around the trajectory serving harness (PHON-150) rather than the original 3-model-dropdown framing.

Overview

This is the first user-facing surface for the v6 trajectory audio model. An SLP records or uploads a production against a target word and gets back the hero output: what was actually said (faithful transcript) and how it deviated from the target (per-position deviation overlay). A secondary, session-level source-attribution read (typical / accent / developmental / motor) is framed as bonus decision support.

The product value is the transcript + deviation, which work on a single clip. Attribution is honestly secondary: it is underdetermined on one short production and sharpens as more productions accumulate, so the UI is built around a session and tells the user plainly that more productions improve attribution accuracy.

This is decision support, clinician-in-the-loop — never autonomous diagnosis. It is a Beta surface, local-only (no hosted inference endpoint until the owner decides).

Goals / non-goals

Goals (v1):

  • A single-panel tool tab that runs the trajectory analyzer on word-level productions.
  • Faithful transcript aligned to the canonical target + per-position deviation overlay.
  • Session model: every analysis is a session; a single production is a session of 1.
  • Three ways to populate a session: record, upload, batch upload.
  • Target field is authoritative and corpus-verified (the lexicon must cover the word); the verified target auto-labels the production.
  • Secondary session-level attribution with an explicit "add more to sharpen" affordance.

Non-goals (v1), captured in Future, not built:

  • Sentence-level targets.
  • Slicing a longer utterance into word-level productions.
  • Hosted/remote inference (stays local).
  • The PHON-151 app-wide vector reseed (separate, gated).
  • SLP scoring metrics (PCC/PPC; PHON-146, later).

Surface & navigation

A 6th tool in packages/web/frontend/src/App_new.tsx: a TOOL_DEFS entry (id: 'audio', title e.g. "Speech Analysis", a Beta badge), a component factory entry, and a new component components/tools/AudioAnalysisTool/. It joins Word Lists / Contrast Sets / Sentences / Text Analysis / Lookup. No router/dev-page involvement (the /dev/* pages were removed).

Session model

A session is an in-memory list of productions. A production is:

{
  id: string,            // derived from the verified target, hyphenated + disambiguated
  target: string,        // the typed, lexicon-verified word
  canonical: string[],   // phonemes for the target (from the lexicon)
  audio: Blob,           // recorded or uploaded clip
  result?: AnalyzeResult // populated after /api/audio/analyze returns
}

  • A single production is a session of 1, with no special-casing.
  • Per-production targets: each production carries its own target, so a session can be "cat" ×3 or a whole word-list probe; both just accumulate substitution evidence.
  • Session state lives in the component (no persistence in v1).

Production label (id): generated in situ from the verified target — lowercased, spaces→hyphens (future sentences: the-cat-sat), with a -N suffix to disambiguate repeats of the same target (cat-1, cat-2). No manual file naming.
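The labeling rule above can be sketched as a small pure function. This is a minimal illustration, not the tab's actual code: the names slugifyTarget and nextProductionId are hypothetical, and it assumes every production gets a suffix starting at 1 (cat-1, cat-2), which is one reading of the example in the text.

```typescript
// Hypothetical helpers; names and the "always suffix from 1" rule are assumptions.

// Lowercase the verified target and turn runs of whitespace into hyphens.
function slugifyTarget(target: string): string {
  return target.trim().toLowerCase().split(/\s+/).join("-");
}

// Pick the next free "-N" suffix for this target, given ids already in the session.
function nextProductionId(target: string, existingIds: readonly string[]): string {
  const base = slugifyTarget(target);
  let max = 0;
  for (const id of existingIds) {
    if (!id.startsWith(base + "-")) continue;
    // Only ids of the exact form `${base}-<integer>` count as repeats.
    const n = Number(id.slice(base.length + 1));
    if (Number.isInteger(n) && n > max) max = n;
  }
  return `${base}-${max + 1}`;
}
```

Because the id is derived from the session's existing ids rather than a counter, the function stays stateless and easy to unit-test.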

Input — three entry modes

All three converge on the same (target, audio) production shape.

  1. Record — type a target (verified, below), record from the mic, add to the session.
  2. Upload — type a target (verified), choose a file, add.
  3. Batch upload — choose N files at once → N draft productions. Each filename seeds its target field (cat.wav → "cat") as a convenience only; the target field is still authoritative and coverage-checked, shown in an editable per-row list the SLP confirms before running. Rows with unverified/uncovered targets are flagged and not run until fixed.
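As a rough sketch of the batch flow, filename seeding and the "not run until fixed" gate might look like the following. The DraftRow shape and all function names are assumptions for illustration; the seed only strips the extension and lowercases, and verification is left to the coverage check.

```typescript
// Hypothetical draft-row shape for batch upload; not the tab's actual types.
interface DraftRow {
  filename: string;
  target: string;    // editable; seeded from the filename as a convenience
  verified: boolean; // set by the coverage check, never by the seed
}

// "cat.wav" -> "cat": drop any directory part and the final extension.
function seedTargetFromFilename(filename: string): string {
  const base = filename.split(/[\\/]/).pop() ?? filename;
  const dot = base.lastIndexOf(".");
  const stem = dot > 0 ? base.slice(0, dot) : base;
  return stem.trim().toLowerCase();
}

// N files -> N draft rows, all unverified until coverage is checked.
function draftRows(filenames: string[]): DraftRow[] {
  return filenames.map((filename) => ({
    filename,
    target: seedTargetFromFilename(filename),
    verified: false,
  }));
}

// Only rows with a verified target are eligible to run.
function runnableRows(rows: DraftRow[]): DraftRow[] {
  return rows.filter((r) => r.verified);
}
```

Keeping the seed separate from the verified flag reflects the rule that the filename is a hint and the target field stays authoritative.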

Target verification (corpus coverage): as the SLP types (debounced) or per batch row, the frontend checks the target against the lexicon via the existing words API (/api/words/:word or batch lookup) — a word is "supported" iff it has phonology. The field shows live coverage state: supported (with the canonical phonemes previewed) / not in the dictionary. Only supported targets can be run.
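The "supported iff it has phonology" rule could be expressed as a pure mapping from a lookup result to a coverage state. The sketch below assumes a response shape with a phonemes field; the actual shape of the words API response is not specified here, so WordLookup, coverageOf, and canRun are all hypothetical.

```typescript
// Hypothetical shape of a words-API lookup; the real response may differ.
interface WordLookup {
  word: string;
  phonemes?: string[]; // assumed field name for the word's phonology
}

type Coverage =
  | { state: "supported"; canonical: string[] }
  | { state: "unsupported" };

// A word is supported iff the lookup found it and it carries phonology.
// `null` stands for "word not found".
function coverageOf(lookup: WordLookup | null): Coverage {
  const phonemes = lookup?.phonemes ?? [];
  return phonemes.length > 0
    ? { state: "supported", canonical: phonemes }
    : { state: "unsupported" };
}

// Only supported targets with a clip attached can be run.
function canRun(coverage: Coverage, hasAudio: boolean): boolean {
  return coverage.state === "supported" && hasAudio;
}
```

The supported branch carries the canonical phonemes so the field can preview them without a second lookup.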

Hero output — transcript + deviation (per production, always shown)

For each production, after analysis:

  • Faithful transcript: the model's produced phoneme sequence ("what we heard"), rendered narrow, aligned under the canonical target.
  • Per-position deviation overlay: the canonical phones as chips, heat-colored by Fisher deviation; hover → the deviation value + the nearest reference (what the position actually sounded like). Positions where nearest ≠ target are flagged as substitutions, which is the error signal (e.g. target /ɹ/, nearest /w/). Reuses the app's phoneme-chip rendering.
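The substitution rule (nearest ≠ target) is simple enough to sketch directly. The field names on ScoredPosition below are assumptions about what a per-position result might carry, not the analyzer's actual schema.

```typescript
// Hypothetical per-position shape; field names are assumptions.
interface ScoredPosition {
  target: string;    // canonical phone at this position
  nearest: string;   // nearest reference phone to what was produced
  deviation: number; // Fisher deviation for the position
}

interface Chip extends ScoredPosition {
  substitution: boolean;
}

// Flag a position as a substitution when the nearest reference is not the target.
function toChips(positions: ScoredPosition[]): Chip[] {
  return positions.map((p) => ({ ...p, substitution: p.nearest !== p.target }));
}

// Human-readable labels for the flagged positions, e.g. "/ɹ/ → /w/".
function substitutionLabels(chips: Chip[]): string[] {
  return chips
    .filter((c) => c.substitution)
    .map((c) => `/${c.target}/ → /${c.nearest}/`);
}
```

For a production of "red" where the first position's nearest reference is /w/, substitutionLabels would yield a single entry for that position and leave the matching positions unflagged.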

This requires a small serving addition: TrajectoryAnalyzer.analyze() must return the produced transcript (and echo the canonical) alongside positions/attribution, so the frontend can render the transcript-vs-target alignment. The analyzer already computes produced internally; today analyze() drops it.

Bonus output — attribution (session-level, secondary)

A secondary panel showing the typical / accent / developmental / motor read computed over the session's accumulated substitutions, with:

  • An explicit quantity/confidence indicator and an "add more productions to sharpen this" affordance: the UI states that more productions improve accuracy rather than pretending a 3-phone clip is conclusive.
  • The accent-vs-disorder separation surfaced as the "don't pathologize an accent" guardrail (a feature, not fine print).
  • "Patterns like…" language, never "has X".

Attribution aggregation across the session. This must mirror how the model was validated: the research computed raw per-clip features, mean-pooled them to the subject level, then standardized (with the baked mean/std) and classified by nearest centroid. The session read reproduces that exactly; a session is the analogue of a subject:

  1. The host returns each production's raw 6-feature vector (pre-standardization) + its per-clip read.
  2. The session aggregate = mean-pool the raw per-production vectors → standardize with the baked mean/std → nearest centroid.

This keeps the per-production reads honest (noisy, as flagged) while the session read gains fidelity exactly as the validated subject-level numbers did. Whether step 2 runs in a small stateless host endpoint or is shipped into the worker (with the baked attribution_model.json) is a planning detail; the attribution.AttributionModel code already owns standardize + classify, and the raw-feature extraction already exists in analyzer._attribution_features (which currently mean-pools internally; it will be split to expose the raw per-production vector).
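The order of operations matters here (pool raw vectors first, standardize second, classify last), so a short sketch may help. Everything below is illustrative: the BakedModel shape is a guess at what attribution_model.json might contain, Euclidean distance is an assumed metric, and centroids are assumed to live in standardized space.

```typescript
// Hypothetical shape for the baked model; the real attribution_model.json may differ.
interface BakedModel {
  mean: number[];
  std: number[];
  centroids: Record<string, number[]>; // assumed to be in standardized space
}

// Step 2a: mean-pool the raw (pre-standardization) per-production vectors.
function meanPool(vectors: number[][]): number[] {
  if (vectors.length === 0) throw new Error("session has no scored productions");
  const first = vectors[0] ?? [];
  return first.map(
    (_, i) => vectors.reduce((sum, v) => sum + (v[i] ?? 0), 0) / vectors.length,
  );
}

// Step 2b: standardize with the baked mean/std (a zero std falls back to 1).
function standardize(v: number[], mean: number[], std: number[]): number[] {
  return v.map((x, i) => (x - (mean[i] ?? 0)) / (std[i] || 1));
}

// Step 2c: nearest centroid, using Euclidean distance (an assumption).
function nearestCentroid(z: number[], centroids: Record<string, number[]>): string {
  let best = "";
  let bestDist = Infinity;
  for (const [label, c] of Object.entries(centroids)) {
    const d = Math.sqrt(z.reduce((s, x, i) => s + (x - (c[i] ?? 0)) ** 2, 0));
    if (d < bestDist) {
      bestDist = d;
      best = label;
    }
  }
  return best;
}

// Session read: pool -> standardize -> classify, mirroring the subject-level procedure.
function sessionRead(rawVectors: number[][], model: BakedModel): string {
  const pooled = meanPool(rawVectors);
  return nearestCentroid(standardize(pooled, model.mean, model.std), model.centroids);
}
```

A session of 1 goes through the same path, since mean-pooling a single vector returns it unchanged.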

Architecture & data flow

Frontend AudioAnalysisTool (session state)
  │  target typed → coverage check ──────────────► /api/words/:word        (existing)
  │  (target, audio) per production ─────────────► /api/audio/analyze       (NEW worker route)
  │                                                   │ canonical lookup (lexicon → phonemes)
  │                                                   │ forward audio + canonical
  │                                                   ▼
  │                                                 host POST /analyze       (exists; +produced)
  │                                                   → {canonical, produced, positions, attribution, features}
  ▼
  render: per-production transcript+deviation (hero) + session attribution (bonus)

  • New worker route /api/audio/analyze (packages/web/workers/src/routes/audio.ts): multipart audio + target. The Worker resolves target → canonical phonemes from the lexicon (the same lookup /pronounce used), forwards audio + canonical (JSON array) to the host /analyze, and returns the host's response. Cold-start aware: a host 503/network failure becomes a { warming: true } shape so the tab shows a warm-up state (the existing audio.ts proxy already does this for /transcribe).
  • Host /analyze already serves; the only change is the analyzer returning produced (+ echoing canonical, and optionally each production's feature vector for session aggregation).
  • Port: the host runs on 127.0.0.1:8000 (matches wrangler.toml); the .dev.vars :8001 vs :8000 mismatch is reconciled to one value during implementation.
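The cold-start mapping described for the worker route can be sketched independently of any Worker framework. The HostReply type and both function names are hypothetical, and the host call is injected as a plain function so the sketch stays testable without a real fetch; how the existing audio.ts proxy implements this is not shown here.

```typescript
// Hypothetical reduced view of a host response: status plus parsed JSON body.
interface HostReply {
  status: number;
  body: unknown;
}

// Map a host outcome to the worker's response body.
// `null` stands for "the request itself failed" (host unreachable).
function toWorkerBody(reply: HostReply | null): unknown {
  if (reply === null || reply.status === 503) return { warming: true };
  return reply.body; // otherwise pass the host's response through
}

// Wrap the host call so a thrown network error also becomes { warming: true }.
// `callHost` is injected; in a real route it would forward audio + canonical.
async function analyzeViaHost(callHost: () => Promise<HostReply>): Promise<unknown> {
  try {
    return toWorkerBody(await callHost());
  } catch {
    return toWorkerBody(null);
  }
}
```

Injecting callHost matches the testing approach of mocking the host fetch, so no real model is needed in CI.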

Component decomposition (frontend)

components/tools/AudioAnalysisTool/:

  • AudioAnalysisTool.tsx: owns session state, orchestrates.
  • TargetField.tsx: target input + debounced coverage check + canonical preview.
  • CaptureControls.tsx: record/upload (single).
  • BatchUpload.tsx: multi-file → editable draft-production rows (filename-seeded targets).
  • ProductionCard.tsx: one production's hero output (transcript + deviation overlay).
  • DeviationOverlay.tsx: the heat-colored canonical chips + hover detail (reuses phoneme chips).
  • AttributionPanel.tsx: the session-level bonus read + quantity/confidence + "add more".
  • audioAnalysisApi.ts: service for /api/audio/analyze + coverage lookup.

Each unit is independently testable; the panel composes them.

Error handling

  • Cold start: first request wakes the model (~tens of seconds); { warming: true } → a clear warm-up state with copy, retry. (See the cold-start-caution copy convention.)
  • Unsupported target: the target field blocks the run and explains ("not in our dictionary"); batch rows with bad targets are flagged, not run.
  • No audio / empty production: disabled run until a clip exists.
  • Host down (local not running): a clear "audio model not running locally" state — there is no deployed endpoint in v1.
  • No scorable positions (alignment failure on a clip): the production shows a "couldn't score" state without breaking the session.

Testing

  • Worker: vitest over /api/audio/analyze — canonical resolution from a target, forward shape, { warming: true } on host failure, bad/uncovered target → 4xx. Mock the host fetch (no real model in CI).
  • Frontend: component tests for TargetField coverage states, BatchUpload row editing, DeviationOverlay rendering from a fixture AnalyzeResult, AttributionPanel quantity messaging. No real audio/model — fixture results.
  • Serving: a unit test that analyze() returns produced + canonical (extends the existing analyzer tests; the slow smoke already exercises the real model).

Future (captured, not built)

  • Sentence targets — the target field + hyphenated labeling already anticipate this (the-cat-sat); needs UX for longer canonical sequences and the verify step over phrases.
  • Slicing — segment a connected utterance into word-level productions, the bridge that makes sentence input clinically useful. The session/production model is built to receive sliced productions without restructuring.
  • SLP scoring metrics (PCC/PPC/PVC, PHON-146) over the produced-vs-canonical alignment.
  • Hosted inference (RunPod staging/prod) when the owner is ready — the harness is already RunPod-buildable.

Out of scope

  • Hosted/remote deployment of the model; the tab is local-only in v1.
  • The PHON-151 app-wide phoneme-vector reseed (separate, gated on "happy").
  • Persisting sessions across reloads.