Beta Audio Tab — Design¶
Date: 2026-06-14
Status: Design, approved in brainstorming; for spec review before planning.
Branch: feature/phon-145-audio-beta-tab (off release/v6-audio).
Realizes: PHON-145 (the user-facing Audio tab), reimagined around the trajectory
serving harness (PHON-150) rather than the original 3-model-dropdown framing.
Overview¶
The first user-facing surface for the v6 trajectory audio model. An SLP records or uploads a production against a target word and gets back two hero outputs: what was actually said (a faithful transcript) and how it deviated from the target (a per-position deviation overlay). A secondary, session-level source-attribution read (typical / accent / developmental / motor) is framed as bonus decision support.
The product value is the transcript + deviation, which work on a single clip. Attribution is honestly secondary: it is underdetermined on one short production and sharpens as more productions accumulate, so the UI is built around a session and tells the user plainly that more productions improve attribution accuracy.
This is decision support, clinician-in-the-loop — never autonomous diagnosis. It is a Beta surface, local-only (no hosted inference endpoint until the owner decides).
Goals / non-goals¶
Goals (v1):
- A single-panel tool tab that runs the trajectory analyzer on word-level productions.
- Faithful transcript aligned to the canonical target + per-position deviation overlay.
- Session model: every analysis is a session; a single production is a session of 1.
- Three ways to populate a session: record, upload, batch upload.
- Target field is authoritative and corpus-verified (the lexicon must cover the word); the verified target auto-labels the production.
- Secondary session-level attribution with an explicit "add more to sharpen" affordance.
Non-goals (v1), captured in Future, not built:
- Sentence-level targets.
- Slicing a longer utterance into word-level productions.
- Hosted/remote inference (stays local).
- The PHON-151 app-wide vector reseed (separate, gated).
- SLP scoring metrics (PCC/PPC; PHON-146, later).
Surface & navigation¶
A 6th tool in packages/web/frontend/src/App_new.tsx: a TOOL_DEFS entry
(id: 'audio', title e.g. "Speech Analysis", a Beta badge), a component factory entry,
and a new component components/tools/AudioAnalysisTool/. It joins Word Lists / Contrast
Sets / Sentences / Text Analysis / Lookup. No router/dev-page involvement (the /dev/*
pages were removed).
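As a rough illustration of the registration described above, the sketch below appends a sixth entry to a tool list. The `ToolDef` shape, the `beta` flag, and the ids of the five existing tools are assumptions made for the example; the real `TOOL_DEFS` in App_new.tsx may be structured differently.

```typescript
// Hypothetical shape: the real TOOL_DEFS type in App_new.tsx may differ.
interface ToolDef {
  id: string;
  title: string;
  beta?: boolean; // assumed flag that drives the Beta badge
}

// The five existing tools. Titles come from the design; ids are placeholders.
const EXISTING_TOOLS: ToolDef[] = [
  { id: "word-lists", title: "Word Lists" },
  { id: "contrast-sets", title: "Contrast Sets" },
  { id: "sentences", title: "Sentences" },
  { id: "text-analysis", title: "Text Analysis" },
  { id: "lookup", title: "Lookup" },
];

// The new sixth entry for the Audio tab.
const AUDIO_TOOL: ToolDef = { id: "audio", title: "Speech Analysis", beta: true };

const TOOL_DEFS_SKETCH: ToolDef[] = [...EXISTING_TOOLS, AUDIO_TOOL];
```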
Session model¶
A session is an in-memory list of productions. A production is:
{
  id: string,             // derived from the verified target, hyphenated + disambiguated
  target: string,         // the typed, lexicon-verified word
  canonical: string[],    // phonemes for the target (from the lexicon)
  audio: Blob,            // recorded or uploaded clip
  result?: AnalyzeResult  // populated after /api/audio/analyze returns
}
- A single production is a session of 1 — no special-casing.
- Per-production targets: each production carries its own target, so a session can be "cat" ×3 or a whole word-list probe; both just accumulate substitution evidence.
- Session state lives in the component (no persistence in v1).
Production label (id): generated in situ from the verified target — lowercased,
spaces→hyphens (future sentences: the-cat-sat), with a -N suffix to disambiguate repeats
of the same target (cat-1, cat-2). No manual file naming.
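The labeling rule above can be sketched as a small pure function. `makeProductionId` is a hypothetical helper name, and two choices here are assumptions rather than part of the design: every label gets a suffix (including the first), and the next suffix is one more than the highest existing one so that removing a production never causes a collision.

```typescript
// Hypothetical helper: builds a production label from a verified target.
// Lowercases, turns runs of whitespace into hyphens, and appends -N.
function makeProductionId(target: string, existingIds: string[]): string {
  const base = target.trim().toLowerCase().replace(/\s+/g, "-");
  const prefix = `${base}-`;
  let highest = 0;
  for (const id of existingIds) {
    if (!id.startsWith(prefix)) continue;
    // Only ids of the exact form `${base}-<digits>` count as repeats.
    const suffix = id.slice(prefix.length);
    if (/^\d+$/.test(suffix)) highest = Math.max(highest, Number(suffix));
  }
  return `${prefix}${highest + 1}`;
}
```

For example, `makeProductionId("Cat", [])` gives `cat-1`, and calling it again with `["cat-1"]` gives `cat-2`.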
Input — three entry modes¶
All three converge on the same (target, audio) production shape.
- Record: type a target (verified, below), record from the mic, add to the session.
- Upload: type a target (verified), choose a file, add.
- Batch upload: choose N files at once → N draft productions. Each filename seeds its target field (cat.wav → "cat") as a convenience only; the target field is still authoritative and coverage-checked, shown in an editable per-row list the SLP confirms before running. Rows with unverified/uncovered targets are flagged and not run until fixed.
Target verification (corpus coverage): as the SLP types (debounced) or per batch row,
the frontend checks the target against the lexicon via the existing words API
(/api/words/:word or batch lookup) — a word is "supported" iff it has phonology. The field
shows live coverage state: supported (with the canonical phonemes previewed) / not in the
dictionary. Only supported targets can be run.
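A minimal sketch of the filename seeding and the coverage rule, kept free of network calls so it can be unit tested. The `WordLookup` response shape and its `phonemes` field are assumptions about the words API, and the helper names are illustrative only.

```typescript
// Coverage state shown next to the target field.
type Coverage =
  | { state: "supported"; canonical: string[] }
  | { state: "unsupported" };

// Assumed lookup result: `phonemes` is a hypothetical field name, and null
// stands for "word not found".
type WordLookup = { phonemes?: string[] } | null;

// A word is supported if and only if the lexicon returns phonology for it.
function toCoverage(lookup: WordLookup): Coverage {
  const phonemes = lookup?.phonemes ?? [];
  return phonemes.length > 0
    ? { state: "supported", canonical: phonemes }
    : { state: "unsupported" };
}

// Seeds a draft target from a filename ("cat.wav" -> "cat"). Convenience only:
// the seeded value still has to pass the coverage check before it can run.
function seedTargetFromFilename(filename: string): string {
  const withoutDirs = filename.split(/[\\/]/).pop() ?? filename;
  return withoutDirs.replace(/\.[^.]+$/, "").trim().toLowerCase();
}

// A single production or batch row may run only when its target is supported.
function canRun(coverage: Coverage): boolean {
  return coverage.state === "supported";
}
```

Keeping the rule in a pure function means the debounced field and the batch rows can share it, with the actual request made elsewhere.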
Hero output — transcript + deviation (per production, always shown)¶
For each production, after analysis:
- Faithful transcript — the model's produced phoneme sequence ("what we heard"),
rendered narrow, aligned under the canonical target.
- Per-position deviation overlay — the canonical phones as chips, heat-colored by Fisher
deviation; hover → the deviation value + the nearest reference (what the position
actually sounded like). Positions where nearest ≠ target are flagged as substitutions —
the error signal (e.g. target /ɹ/, nearest /w/). Reuses the app's phoneme-chip rendering.
This requires a small serving addition: TrajectoryAnalyzer.analyze() must return the
produced transcript (and echo the canonical) alongside positions/attribution, so the
frontend can render the transcript-vs-target alignment. The analyzer already computes
produced internally; today analyze() drops it.
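The substitution rule above (nearest ≠ target) can be sketched against a hypothetical result shape. The field names below follow the prose, but the actual host payload is not specified here, and `maxDeviation` is an assumed display scale for the heat coloring.

```typescript
// Hypothetical result shape; the real /analyze payload may differ.
interface Position {
  target: string;    // canonical phone at this position
  nearest: string;   // nearest reference phone (what it sounded like)
  deviation: number; // Fisher deviation used for heat coloring
}

interface AnalyzeResultSketch {
  canonical: string[];
  produced: string[];
  positions: Position[];
}

// Returns the indices of positions flagged as substitutions (nearest != target).
function substitutionIndices(result: AnalyzeResultSketch): number[] {
  return result.positions
    .map((p, index) => (p.nearest !== p.target ? index : -1))
    .filter((index) => index >= 0);
}

// Maps a deviation to a 0..1 heat value, clamped at both ends.
function heat(deviation: number, maxDeviation: number): number {
  if (maxDeviation <= 0) return 0;
  return Math.min(1, Math.max(0, deviation / maxDeviation));
}
```

For a production of "red" heard as /w ɛ d/, only position 0 would be flagged, matching the target /ɹ/, nearest /w/ example.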
Bonus output — attribution (session-level, secondary)¶
A secondary panel showing the typical / accent / developmental / motor read computed over the session's accumulated substitutions, with:
- An explicit quantity/confidence indicator and an "add more productions to sharpen this" affordance. The UI states that more productions improve accuracy rather than pretending a 3-phone clip is conclusive.
- The accent-vs-disorder separation surfaced as the "don't pathologize an accent" guardrail (a feature, not fine print).
- "Patterns like…" language, never "has X".
Attribution aggregation across the session. This must mirror how the model was
validated: the research computed raw per-clip features, mean-pooled them to the
subject level, then standardized (with the baked mean/std) and classified by nearest
centroid. The session read reproduces that exactly — a session is the analogue of a subject:
1. the host returns each production's raw 6-feature vector (pre-standardization) + its
per-clip read;
2. the session aggregate = mean-pool the raw per-production vectors → standardize with
the baked mean/std → nearest centroid.
This keeps the per-production reads honest (noisy, as flagged) while the session read gains
fidelity exactly as the validated subject-level numbers did. Whether step 2 runs in a small
stateless host endpoint or is shipped into the worker (with the baked attribution_model.json)
is a planning detail; the attribution.AttributionModel code already owns standardize +
classify, and the raw-feature extraction already exists in analyzer._attribution_features
(which currently mean-pools internally — it will be split to expose the raw per-production
vector).
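The aggregation order (mean-pool raw vectors, then standardize, then nearest centroid) can be sketched as follows. This is TypeScript for illustration only; the production code lives in `attribution.AttributionModel`. The `AttributionModelSketch` shape is a hypothetical stand-in for the baked attribution_model.json, and squared Euclidean distance is an assumption, since the design only says "nearest centroid".

```typescript
type Label = "typical" | "accent" | "developmental" | "motor";

// Hypothetical stand-in for the baked attribution_model.json contents.
interface AttributionModelSketch {
  mean: number[];
  std: number[];
  centroids: Record<Label, number[]>; // expressed in standardized space
}

// Mean-pool the raw (pre-standardization) per-production feature vectors.
function meanPool(vectors: number[][]): number[] {
  if (vectors.length === 0) throw new Error("session has no productions");
  const dim = vectors[0].length;
  const sums = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    if (v.length !== dim) throw new Error("feature vectors differ in length");
    v.forEach((x, i) => { sums[i] += x; });
  }
  return sums.map((s) => s / vectors.length);
}

// Standardize the pooled vector with the baked mean/std.
function standardize(raw: number[], model: AttributionModelSketch): number[] {
  return raw.map((x, i) => (x - model.mean[i]) / model.std[i]);
}

// Pick the label whose centroid is closest (squared Euclidean, assumed).
function nearestCentroid(z: number[], model: AttributionModelSketch): Label {
  let best: Label = "typical";
  let bestDist = Infinity;
  for (const label of Object.keys(model.centroids) as Label[]) {
    const c = model.centroids[label];
    const dist = z.reduce((acc, x, i) => acc + (x - c[i]) ** 2, 0);
    if (dist < bestDist) {
      bestDist = dist;
      best = label;
    }
  }
  return best;
}

// Session read: pool first, standardize second, classify last.
function sessionRead(rawVectors: number[][], model: AttributionModelSketch): Label {
  return nearestCentroid(standardize(meanPool(rawVectors), model), model);
}
```

Pooling before standardizing is the point of the design: it treats the session the way validation treated a subject, instead of averaging per-clip reads that were each classified on their own.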
Architecture & data flow¶
Frontend AudioAnalysisTool (session state)
  │  target typed → coverage check ──────────────► /api/words/:word  (existing)
  │  (target, audio) per production ─────────────► /api/audio/analyze  (NEW worker route)
  │                                                  │ canonical lookup (lexicon → phonemes)
  │                                                  │ forward audio + canonical
  │                                                  ▼
  │                                                host POST /analyze  (exists; +produced)
  │                                                  → {canonical, produced, positions, attribution, features}
  ▼
render: per-production transcript+deviation (hero) + session attribution (bonus)
- New worker route /api/audio/analyze (packages/web/workers/src/routes/audio.ts): multipart audio + target. The Worker resolves target → canonical phonemes from the lexicon (the same lookup /pronounce used), forwards audio + canonical (JSON array) to the host /analyze, and returns the host's response. Cold-start aware: a host 503/network failure becomes a { warming: true } shape so the tab shows a warm-up state (the existing audio.ts proxy already does this for /transcribe).
- Host /analyze already serves; the only change is the analyzer returning produced (+ echoing canonical, and optionally each production's feature vector for session aggregation).
- Port: the host runs on 127.0.0.1:8000 (matches wrangler.toml); the .dev.vars :8001 vs :8000 mismatch is reconciled to one value during implementation.
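The worker route's two decisions (reject an uncovered target, and turn a host 503 or network failure into the warm-up shape) can be sketched as pure functions, separate from the actual request handling. Everything named here is hypothetical: the `Map` stands in for the lexicon lookup, 422 is one possible choice of 4xx, and the status returned alongside `{ warming: true }` is an assumption, since the design fixes only the body shape.

```typescript
// Either the canonical phonemes to forward, or a client error.
type ForwardPlan =
  | { ok: true; canonicalJson: string }
  | { ok: false; status: number; error: string };

// Resolves target → canonical phonemes; an uncovered target yields a 4xx.
function planForward(target: string, lexicon: Map<string, string[]>): ForwardPlan {
  const canonical = lexicon.get(target.trim().toLowerCase());
  if (!canonical || canonical.length === 0) {
    return { ok: false, status: 422, error: "target not in dictionary" };
  }
  // The host receives the canonical sequence as a JSON array.
  return { ok: true, canonicalJson: JSON.stringify(canonical) };
}

// Outcome of calling the host; a thrown fetch is modelled as "network-error".
type HostOutcome =
  | { kind: "network-error" }
  | { kind: "response"; status: number; body: unknown };

type ClientReply = { status: number; body: unknown };

// Cold-start aware mapping: host 503 or network failure becomes { warming: true }.
function toClientReply(outcome: HostOutcome): ClientReply {
  if (outcome.kind === "network-error" || outcome.status === 503) {
    return { status: 503, body: { warming: true } }; // outer status is assumed
  }
  // Anything else is passed through unchanged.
  return { status: outcome.status, body: outcome.body };
}
```

Splitting the logic this way also lines up with the worker tests, which mock the host fetch rather than running a real model.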
Component decomposition (frontend)¶
components/tools/AudioAnalysisTool/:
- AudioAnalysisTool.tsx — owns session state, orchestrates.
- TargetField.tsx — target input + debounced coverage check + canonical preview.
- CaptureControls.tsx — record/upload (single).
- BatchUpload.tsx — multi-file → editable draft-production rows (filename-seeded targets).
- ProductionCard.tsx — one production's hero output (transcript + deviation overlay).
- DeviationOverlay.tsx — the heat-colored canonical chips + hover detail (reuses phoneme
chips).
- AttributionPanel.tsx — the session-level bonus read + quantity/confidence + "add more".
- audioAnalysisApi.ts — service for /api/audio/analyze + coverage lookup.
Each unit is independently testable; the panel composes them.
Error handling¶
- Cold start: first request wakes the model (~tens of seconds); { warming: true } → a clear warm-up state with copy, retry. (See the cold-start-caution copy convention.)
- Unsupported target: the target field blocks the run and explains ("not in our dictionary"); batch rows with bad targets are flagged, not run.
- No audio / empty production: disabled run until a clip exists.
- Host down (local not running): a clear "audio model not running locally" state — there is no deployed endpoint in v1.
- No scorable positions (alignment failure on a clip): the production shows a "couldn't score" state without breaking the session.
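One way to keep these states from leaking into each other is a single discriminated union per production, as in the sketch below. The state names and input fields are illustrative. In particular, telling "warming" apart from "host down" by an exhausted retry budget is an assumption, because the design does not say how the tab distinguishes them.

```typescript
type ProductionView =
  | { kind: "blocked-target" } // unsupported target: explain, do not run
  | { kind: "needs-audio" }    // no clip yet: run stays disabled
  | { kind: "ready" }
  | { kind: "warming" }        // cold start: show warm-up copy, retry
  | { kind: "host-down" }      // "audio model not running locally"
  | { kind: "unscorable" }     // alignment produced no scorable positions
  | { kind: "scored"; positionCount: number };

interface ProductionInputs {
  targetSupported: boolean;
  hasAudio: boolean;
  // Latest reply for this production, if a run has been attempted.
  reply?: { warming: true } | { positions: unknown[] };
  // Assumption: the tab gives up on warm-up after some retry budget.
  warmupRetriesExhausted: boolean;
}

function viewFor(p: ProductionInputs): ProductionView {
  if (!p.targetSupported) return { kind: "blocked-target" };
  if (!p.hasAudio) return { kind: "needs-audio" };
  const reply = p.reply;
  if (!reply) return { kind: "ready" };
  if ("warming" in reply) {
    return p.warmupRetriesExhausted ? { kind: "host-down" } : { kind: "warming" };
  }
  // An empty position list is shown as "couldn't score", not as an error.
  return reply.positions.length === 0
    ? { kind: "unscorable" }
    : { kind: "scored", positionCount: reply.positions.length };
}
```

Because each production resolves to its own view, an unscorable clip affects only its own card and the rest of the session keeps rendering.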
Testing¶
- Worker: vitest over /api/audio/analyze: canonical resolution from a target, forward shape, { warming: true } on host failure, bad/uncovered target → 4xx. Mock the host fetch (no real model in CI).
- Frontend: component tests for TargetField coverage states, BatchUpload row editing, DeviationOverlay rendering from a fixture AnalyzeResult, AttributionPanel quantity messaging. No real audio/model, only fixture results.
- Serving: a unit test that analyze() returns produced + canonical (extends the existing analyzer tests; the slow smoke already exercises the real model).
Future (captured, not built)¶
- Sentence targets: the target field + hyphenated labeling already anticipate this (the-cat-sat); needs UX for longer canonical sequences and the verify step over phrases.
- Slicing: segment a connected utterance into word-level productions, the bridge that makes sentence input clinically useful. The session/production model is built to receive sliced productions without restructuring.
- SLP scoring metrics (PCC/PPC/PVC, PHON-146) over the produced-vs-canonical alignment.
- Hosted inference (RunPod staging/prod) when the owner is ready — the harness is already RunPod-buildable.
Out of scope¶
- Hosted/remote deployment of the model; the tab is local-only in v1.
- The PHON-151 app-wide phoneme-vector reseed (separate, gated on "happy").
- Persisting sessions across reloads.