v6: Audio Support Across PhonoLex

Status: design approved, plan(s) pending
Workstream parent: PHON-44 Audio (reframed scope per this spec)
Date: 2026-05-30

This spec is the umbrella for v6. It defines the architecture (five targeted models), the platform integration shape (how audio flows through the existing five tools and what new surfaces it grows), the release plan (v6.0 → v6.4 as point releases under one banner), and the explicit out-of-scope set that anchors retirement of the pre-retro audio tickets. Each model later gets its own per-model implementation spec when its release goes active.

1. Goal

PhonoLex accepts audio as an input modality across the platform. Today the platform takes typed text or pre-tokenized phoneme strings. With v6, a user uploads a speech sample and the platform routes it through the same analysis chain it uses for typed input, plus an acoustic-analysis layer that the typed-input path does not have.

Audio support is implemented as five targeted models, each serving one user need, shipped as a sequence of point releases under the v6 banner.

1.1 Non-goal

v6 does not try to close the SLP diagnostic-therapy-feedback loop autonomously. That framing in earlier memory was inherited overreach. SLP decision-support is the goal — give the clinician better data inside their existing workflow. The clinician closes the loop.

This is consistent with the field: clinical SOTA on disorder diagnosis is below the FDA bar (The Sound of Syntax, EMNLP 2025, arXiv 2509.16765: micro-F1 ~0.56 vs FDA-grade 0.80–0.85). v6's role is to be a useful adjunct to the clinician, not a replacement.

2. The five targeted models

Each is one task, one user need, one validation slice, independently shippable, independently retire-able. Failure of any one model doesn't take the others down. No single 6-hour-then-three-more-experiments commitment.

Rejected architecture (documented for the record): a single waveform→phonemes model serving every audio user simultaneously. The PHON-123 retro (2026-05-30) identified two structural failures in this spine:

1. Phone-CTC always emits some phoneme, so the gradient signal is lost before any downstream graded metric (including PHON-126) can use it.
2. Lexical-prior fine-tuning (the PHON-52 → PHON-55 → PHON-123 chain) regresses disordered productions to canonical in 47% of SSD cases and 59% of CAS cases. This is structural to transcribe-then-align as the spine; it is not recipe-tunable.

The five-model architecture sidesteps both by giving each user a model whose objective matches their actual need.

2.1 Model #1: General audio → phoneme transcript

Task: transcribe an arbitrary spoken word or utterance into a broad phoneme sequence.
User: anyone uploading audio to use Text Analysis / Lookup / Sentences as if they had typed it.
Recipe: off-the-shelf Whisper or wav2vec2 (Apache-2.0). No fine-tuning, no canonical pretraining of any kind. Candidate bases: OpenAI Whisper (whisper-base / whisper-small) or facebook/wav2vec2-lv-60-espeak-cv-ft. Final base selection deferred to the Model #1 implementation spec.
Validation: standard ASR phoneme accuracy (PER) on a clean adult-speech held-out slice. Target: PER ≤ the published baseline of the chosen base model on its own eval slice. Disordered-speech transcription explicitly out of scope.
Honest limit: broad-phoneme only. Distortions and covert contrast invisible by design (handled by Models #3 and #5). Caveat surfaced in API response ({coverage: "broad-phoneme", limitations: [...]}) and tool UX.
Dependencies: none new. Plugs into the existing Worker route surface.
Ship size: small. Pipeline integration, not model research.
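
The PER target in the validation line above is the usual edit-distance ratio. A minimal sketch of that computation follows, assuming reference and hypothesis are already tokenized into phoneme lists; the function names and the example phonemes are illustrative and not taken from the PhonoLex codebase.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over phoneme tokens (sub/ins/del all cost 1)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = edits / number of reference phonemes."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Invented example: one substitution plus one insertion against 3 reference phonemes.
ref = ["k", "æ", "t"]
hyp = ["k", "ɛ", "t", "s"]
per = phoneme_error_rate(ref, hyp)
```

Because insertions count as errors, PER can exceed 1.0 when the hypothesis is much longer than the reference.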

2.2 Model #2: L2 / accent pronunciation scorer

Task: given a target word + an audio production, score how far the production is from canonical, per phoneme position.
User: language learners, accent-modification clinicians, ESL teachers, voice coaches.
Recipe: Model #1 produces the transcript, then cos_dist(produced, canonical) over PhonoLex's learned vectors (packages/features/outputs/vectors.csv). Per-position cost plus an overall score. PHON-126 metric, validated.
Validation: PHON-126's three diagnostic metrics (Mann-Whitney variant<error, practical threshold, Spearman ρ on severity rank) extended from PHON-126's synthetic pair-level to real audio pairs. PHON-126 already proved the underlying metric; Model #2 validates that audio-derived inputs preserve the variant-vs-error separation.
Honest limit: requires the learner to be trying to match canonical. The metric is designed for "did the production match the target?" — canonical bias is the feature here, not the bug. Useless for disordered SLP cases where the goal is faithful transcription of non-canonical productions (those are served by Models #3 and #5).
Dependencies: Model #1 (sub-call for transcription); packages/features/ (vectors).
Ship size: small. Frontend + Model #1 + existing cos_dist.
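
As a rough illustration of the per-position scoring in the recipe above, the sketch below computes cosine distance between produced and canonical phoneme vectors. It makes several assumptions that this spec does not state: the two sequences are already aligned and equal in length, the overall score is the mean of the per-position costs, and the 3-dimensional vectors are invented stand-ins for the real learned vectors in vectors.csv.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Toy vectors, invented for illustration only (not from vectors.csv).
TOY_VECTORS: Dict[str, List[float]] = {
    "s": [1.0, 0.0, 0.0],
    "θ": [0.8, 0.6, 0.0],
    "æ": [0.0, 1.0, 0.0],
    "k": [0.0, 0.0, 1.0],
}


def cos_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def score_production(
    produced: Sequence[str],
    canonical: Sequence[str],
    vectors: Dict[str, List[float]],
) -> Tuple[List[float], float]:
    """Return (per-position costs, overall score)."""
    # Assumption: inputs are pre-aligned; a real scorer needs an alignment step.
    if not canonical or len(produced) != len(canonical):
        raise ValueError("sketch assumes non-empty, equal-length sequences")
    per_position = [
        cos_dist(vectors[p], vectors[c]) for p, c in zip(produced, canonical)
    ]
    # Assumption: overall score is the mean per-position cost.
    return per_position, sum(per_position) / len(per_position)


# Target "sack" /s æ k/ produced with [θ] in first position.
costs, overall = score_production(["θ", "æ", "k"], ["s", "æ", "k"], TOY_VECTORS)
```

Only the first position carries a non-zero cost here, which is the shape of output that the `cos_dist_per_position[]` response field would need to convey.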

2.3 Model #3: Distortion-type classifier

Task: binary or graded answer to a specific question like "is this /ɹ/ correctly produced?" — per error type, narrow.
User: SLPs doing per-phoneme assessment (pediatric SSD or adult dysarthria depending on the model's training data per type).
Recipe: per-error-type classifier trained on graded perceptual ratings. First model: rhotic distortion via PERCEPT-R (~9 crowd + ~3 expert graded ratings per token, already accessible per HANDOFF). Subsequent per-phoneme classifiers (sibilant lateralization, dentalized /s/, etc.) ship as separate v6.2.x releases if/when training data lands.
Validation: per-type ICC against graded human ratings; ROC vs. binary clinical judgment. Each new error type validates independently before its endpoint ships. No transcription dependency. The model never produces canonical phones, so canonical regression is structurally impossible.
Honest limit: one model per error type. Not a "general SSD classifier." Each model serves the specific error type it was trained on; expanding requires a new dataset and a new model.
Dependencies: PERCEPT-R (rhotic); equivalents for other error types as they're added.
Ship size: medium per error type. PHON-122 was already scoped this way before the retro.
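
For the "ROC vs. binary clinical judgment" part of the validation, one summary statistic is the area under the ROC curve. The sketch below computes it by pairwise comparison, assuming a higher classifier score means "more likely correctly produced" and that the clinical label is 1 for correct and 0 for distorted; both conventions and all the numbers are assumptions made for illustration.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive outscores a negative (ties = 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented scores and clinical judgments for five tokens.
scores = [0.9, 0.8, 0.4, 0.35, 0.2]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)  # 5 of 6 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of tokens, which is acceptable for a validation slice; a rank-based implementation would be the usual choice at scale.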

2.4 Model #4: Acoustic feature extraction

Task: extract formants (F1–F3 trajectory), F0, VOT, durations, COG, spectral moments from an audio sample.
User: phonetic researchers; voice quality assessment; prosody analysis; speech science.
Recipe: Parselmouth-driven Praat scripts. No ML. Layered value: PhonoLex's percentile-anchored norms wrap the extracted features ("F1 is at the 12th percentile for age-matched speakers"). This is the differentiator vs. raw Praat — the norms come from our existing properties data, not from the speaker's own production space (that's Model #5).
Validation: parity with Praat-direct extraction on a reference corpus (formants within ±10 Hz; F0 within ±2 Hz; durations within ±5 ms). Then percentile-overlay correctness against packages/web/workers/src/config/properties.ts norms.
Honest limit: descriptive only. No judgment layer. The percentile overlay is a normative reference, not a clinical opinion.
Dependencies: Parselmouth runtime, Praat. Native binaries — see §3.2 for deployment.
Ship size: small. Pipeline + frontend tool surface.
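
The percentile overlay described in the recipe can be pictured as a percentile-rank lookup against a reference distribution. The sketch below assumes the norms are available as a flat list of reference values for the matched group; the actual shape of the norms in properties.ts is not specified here, and the F1 values are invented.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence


def percentile_rank(value: float, reference: Sequence[float]) -> float:
    """Percent of reference values below `value`, counting ties as half."""
    if not reference:
        raise ValueError("reference distribution must be non-empty")
    ordered = sorted(reference)
    below = bisect_left(ordered, value)           # strictly smaller values
    equal = bisect_right(ordered, value) - below  # values equal to `value`
    return 100.0 * (below + 0.5 * equal) / len(ordered)


# Invented F1 reference values (Hz) for a hypothetical age-matched group.
f1_norms = [520, 560, 600, 640, 680, 700, 720, 750, 780, 820]
rank = percentile_rank(530, f1_norms)
```

The mid-rank treatment of ties is one common convention; whichever convention the real overlay uses should be documented alongside the response field, since it changes the reported percentile for values that sit exactly on a norm.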

2.5 Model #5: Speaker-relative acoustic comparison

Task: given N productions from one speaker, compute distance in their own acoustic space (formant, F0, duration).
User: SLPs doing covert contrast detection; voice clinicians tracking longitudinal change; researchers studying within-speaker variability.
Recipe: Model #4 → speaker-internal clustering / contrast detection. No external reference space needed; the speaker is their own baseline. Covert contrast detector: for each token where the perceptual label collapses a target contrast (e.g., /s/→[θ]), compute the acoustic distance from that token to the speaker's other [s] productions and to their other [θ] productions; flag the token if it sits closer to the speaker's [s] productions than to their [θ] productions.
Validation: covert-contrast detection rate against a hand-labeled gold subset (the corpus spec's T6 metric); within-speaker test-retest stability over repeated recordings of the same word. Hardest model to validate — detailed validation design deferred to per-model spec.
Honest limit: requires multiple samples per speaker. Useless for one-shot analysis. The speaker is the unit of analysis, not the utterance.
Why this model matters most for SLP: it sidesteps canonical bias entirely. A child's /s/ vs. /θ/ acoustic separation in their own production space says whether they're making the contrast, independent of whether a transcriber would categorically collapse them. This is the model that closes the gap PHON-123 retro identified.
Dependencies: Model #4 (provides the feature space).
Ship size: medium. Algorithm design + validation.
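
The covert contrast detector in the recipe can be sketched as a nearest-centroid test in the speaker's own feature space. This is only one possible reading of "acoustic distance": it assumes Euclidean distance to cluster centroids and feature values already scaled to comparable units (for example z-scored within the speaker). The feature values below are invented.

```python
import math
from typing import List, Sequence

Token = Sequence[float]  # one production's feature vector, pre-scaled


def centroid(tokens: Sequence[Token]) -> List[float]:
    """Per-dimension mean of a set of tokens."""
    if not tokens:
        raise ValueError("need at least one token per category")
    dims = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dims)]


def covert_contrast_flag(
    token: Token, s_tokens: Sequence[Token], theta_tokens: Sequence[Token]
) -> bool:
    """True if a token heard as [θ] sits closer to the speaker's own [s] cluster."""
    dist_to_s = math.dist(token, centroid(s_tokens))
    dist_to_theta = math.dist(token, centroid(theta_tokens))
    return dist_to_s < dist_to_theta


# Invented, pre-scaled 2-D features for one speaker.
s_tokens = [[1.0, 0.2], [1.2, 0.0], [0.8, 0.1]]
theta_tokens = [[-1.0, 0.0], [-0.8, -0.2], [-1.2, 0.2]]

flag_near_s = covert_contrast_flag([0.4, 0.1], s_tokens, theta_tokens)
flag_near_theta = covert_contrast_flag([-0.9, 0.0], s_tokens, theta_tokens)
```

With only a handful of tokens per category the centroids are noisy, which is one concrete reason the honest limit below requires multiple samples per speaker.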

2.6 What each user audience gets

Audience	Models used	Resulting capability
Adult SLP (aphasia, accent modification, voice)	1 + 4 + 5 (+ 3 for specific phonemes)	Audio in, percentile-anchored acoustic analysis, longitudinal tracking
L2 educator / language learner	1 + 2	Pronunciation feedback against canonical
Pediatric SLP (SSD)	3 + 5 (+ 1 for connected speech indexing)	Per-distortion classifier output + covert contrast detection. Honest about what the broad transcription can't do.
Phonetic researcher	1 + 4 + 5 + TextGrid I/O	Extract + analyze + percentile-anchored norms inside their existing tooling

Pediatric SSD is one use case among several. Not the workstream's reason for existing.

3. Platform integration shape

3.1 Worker route surface

New /api/audio/* routes; existing /api/words/*, /api/sentences, /api/text consume Model #1's transcript output as if it were typed input.

Route	Model	Request	Response
`POST /api/audio/transcribe`	#1	`multipart/form-data` audio + optional `language` hint	`{phonemes[], confidences[], duration_ms, coverage, limitations[]}`
`POST /api/audio/pronounce`	#2	`{target_word, audio_blob}`	`{cos_dist_per_position[], overall_score, variant_vs_error_class, transcript}`
`POST /api/audio/distortion/{phoneme}`	#3	`{audio_blob}`	`{score, confidence, classifier_version}`
`POST /api/audio/acoustic`	#4	`{audio_blob, target_age?, target_sex?}`	`{formants, f0, vot, duration_ms, cog, spectral_moments, percentile_overlay}`
`POST /api/audio/compare`	#5	`{audio_blobs[], target_phoneme?}`	`{within_speaker_distance_matrix, covert_contrast_flags[]}`

Audio input format: PCM WAV (16-bit min, 44.1 kHz min) on upload; Worker normalizes to 16 kHz mono before model-host hand-off. Client-side recording uses the browser's MediaRecorder API and produces an Opus or WAV blob.
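
The upload minimums above (16-bit, 44.1 kHz) can be checked from the WAV header before any model call. The sketch below uses Python's standard `wave` module purely for illustration; the production check would presumably live in the Worker, and the function name and error strings are hypothetical.

```python
import io
import wave
from typing import List

MIN_SAMPLE_WIDTH_BYTES = 2   # 16-bit minimum, per this spec
MIN_SAMPLE_RATE_HZ = 44_100  # 44.1 kHz minimum, per this spec


def check_upload(wav_bytes: bytes) -> List[str]:
    """Return a list of problems; an empty list means the upload is acceptable."""
    try:
        with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
            width = wav.getsampwidth()
            rate = wav.getframerate()
    except (wave.Error, EOFError):
        return ["not a readable PCM WAV file"]
    problems = []
    if width < MIN_SAMPLE_WIDTH_BYTES:
        problems.append(f"bit depth {8 * width} is below the 16-bit minimum")
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below the 44.1 kHz minimum")
    return problems


def make_test_wav(rate: int, width: int) -> bytes:
    """Build a tiny silent mono WAV in memory (test helper only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * width * 10)
    return buf.getvalue()


ok = check_upload(make_test_wav(44_100, 2))
too_slow = check_upload(make_test_wav(16_000, 2))
```

Opus blobs from MediaRecorder would fail this header check, so that path needs a decode or transcode step before the same minimums can be applied.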

3.2 Inference host

All five model-serving endpoints live on RunPod Serverless. The Worker proxies inbound requests through to RunPod; same proxy / scale-to-zero pattern as the retired CSP gen stack (recoverable from the archive/csp-generation-v5.2 tag). ~$185 prepaid balance still on the account per [[reference_runpod_endpoints]].

One RunPod endpoint per model — independent scaling per model, independent retire-ability per model.

Model #4 deployment is an open question. Parselmouth needs a native Praat runtime, which doesn't fit a model-serving image cleanly. Investigation item: can Praat + Parselmouth run on Cloudflare (Container revival, WASM build of Praat, or Pyodide-hosted Parselmouth)? Goal: the only external deployments are the model-serving endpoints; everything else (data, percentile overlay, route surface) stays on Cloudflare. Fallback: Model #4 on RunPod as a Python container with espeak-ng + praat installed.

3.3 Frontend

New AudioInput component:

- Records audio via MediaRecorder API (record button, stop button, playback control, re-record).
- Posts to the relevant /api/audio/* endpoint.
- Threads the result through the existing Constraint[] → WordSearchBody → results pipeline so Model #1's output is interchangeable with typed input across the existing five tools.

For Models #2-5, new tool surfaces (Pronunciation tool consuming Model #2; Acoustic Analysis tool consuming Model #4; etc.) get designed at per-model implementation-spec time. Not committed here.

3.4 Reuse of existing PhonoLex assets

Learned feature vectors (packages/features/, PHON-126-validated): Model #2 cos_dist scoring; eventually Models #3, #5 substrate where appropriate.
Lexicon + norms (packages/data/, properties.ts): Model #1 result lookups; Model #4 percentile overlay.
compileWordFilter (packages/web/workers/src/lib/wordFilter.ts): Model #1 transcript becomes typed-input equivalent across /api/words/* and /api/sentences without code changes.
Contrast Sets (/api/contrastive): Model #3 phoneme-target selection UX.
Sentences corpus (/api/sentences): Model #1 transcript can be used as a query against the corpus.

4. Release plan

v6 ships as a sequence of point releases under one banner — not a single big-bang release. No time pressure.

v6.0 — Model #1 (audio → phoneme transcript) live across /api/audio/transcribe + existing tools accept audio input via AudioInput component. This is the "PhonoLex accepts audio" milestone.
v6.1 — Model #2 (L2/accent pronunciation scorer) + new Pronunciation tool surface.
v6.2 — Model #3 (first distortion classifier — rhotic, via PERCEPT-R). Subsequent per-phoneme classifiers ship as v6.2.x point releases as their training data lands.
v6.3 — Model #4 (acoustic feature extraction) + new Acoustic Analysis tool surface for researchers.
v6.4 — Model #5 (speaker-relative comparison) + covert-contrast detection UX. The most differentiated capability for SLP decision-support.

Each release validates the architecture and informs the next. v7 planning is whatever comes after all five audio capabilities are live — explicitly not specified here.

5. Out of scope

Explicit list with reasoning. Each line directly anchors a ticket retirement.

v6 does NOT close the SLP diagnostic-therapy-feedback loop autonomously. Earlier memory ("the endgame is the diagnostic-therapy-feedback loop SLPs run manually, automated") was inherited overreach. The goal was always SLP decision-support — give the clinician better data inside their existing workflow. The clinician closes the loop.
v6 does NOT ship a unified SSD diagnostic tool. PHON-53's original framing as a single audio diagnostic tool is retired. Transcribe-then-align is structurally broken per the PHON-123 retro (47% SSD / 59% CAS canonical regression). The five targeted models replace this. → PHON-53 split into per-model tickets.
v6 does NOT iterate on the transcribe-then-align spine. PHON-121 (close SSD/CAS precision floor via expanded mixed-cohort FT), PHON-124 (PERCEPT-R / narrative / bilingual coverage extension), PHON-125 (variant-aware phonological LM as denoiser), PHON-127 (audio-grounded LLM correction probe) — all retired. They iterate the architecture PHON-123 retro identified as broken. → PHON-121, PHON-124, PHON-125, PHON-127 closed as Won't Do.
v6 does NOT depend on the 500-child pediatric annotation corpus. That corpus (pediatric_speech_annotation_spec.md + the in-repo review at pediatric_speech_annotation_spec_review.md) is a separate SBIR-scale effort; PhonoLex can be infrastructure for it but v6 ships without it. Long-horizon collaboration vehicle, not a v6 dependency. → No ticket action; corpus stays as its own future-track item.
v6 does NOT promise per-model expansion beyond honest limits. Each model has documented "what it can't tell you." Future tickets expanding a model beyond its scope require a new architectural spec.
v6 is NOT pediatric-specific. The pediatric narrowing in PHON-44 was inherited momentum from PHON-51's data choice, never a user requirement. v6 serves adult SLP (aphasia, accent modification, voice), L2 educators, phonological researchers, and pediatric SLP equally — pediatric is one downstream use case via Models #3 + #5. → PHON-44 scope statement reframed in Jira.

6. Validation strategy

Per-model, scoped to that model's user need. No unified "is v6 done" gate.

Each validation is per-release-blocking for that model. v6.0 ships when Model #1 hits its target. v6.1 ships when Model #2 hits, and so on.

Model	Validation
#1	Standard ASR phoneme accuracy (PER) on a clean adult-speech held-out slice. Target: PER ≤ chosen base model's published baseline on its eval slice. Disordered-speech transcription explicitly out of scope — caveat documented.
#2	PHON-126's three diagnostic metrics (Mann-Whitney variant<error, practical threshold, Spearman ρ on severity rank) extended from synthetic pair-level to real audio pairs.
#3	Per-type ICC against graded human ratings (PERCEPT-R for rhotic); ROC vs. binary clinical judgment. Each new error type validates independently before its endpoint ships.
#4	Parity with Praat-direct extraction on a reference corpus (formants ±10 Hz; F0 ±2 Hz; durations ±5 ms). Then percentile-overlay correctness against `properties.ts` norms.
#5	Covert-contrast detection rate against a hand-labeled gold subset; within-speaker test-retest stability over repeated recordings of the same word. Detailed validation design deferred to per-model spec.
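
The Model #4 parity gate in the table above reduces to a tolerance comparison per measurement. A minimal sketch follows, assuming each extraction is summarized as a flat dict; the key names and the example values are hypothetical, while the tolerances are the ones stated in this spec.

```python
from typing import Dict, List

# Tolerances from the Model #4 validation row.
TOLERANCES: Dict[str, float] = {
    "formant_hz": 10.0,
    "f0_hz": 2.0,
    "duration_ms": 5.0,
}


def parity_failures(ours: Dict[str, float], praat: Dict[str, float]) -> List[str]:
    """Return the measurement keys whose absolute difference exceeds tolerance."""
    return [
        key
        for key, tolerance in TOLERANCES.items()
        if abs(ours[key] - praat[key]) > tolerance
    ]


# Invented values for one token: pipeline output vs. Praat-direct extraction.
ours = {"formant_hz": 512.0, "f0_hz": 201.5, "duration_ms": 148.0}
praat = {"formant_hz": 505.0, "f0_hz": 198.0, "duration_ms": 150.0}
failures = parity_failures(ours, praat)
```

In this example only F0 falls outside tolerance (3.5 Hz against a 2 Hz limit), so the token would block the release gate until the discrepancy is explained.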

7. Ticket actions (derived from this spec)

Ticket	Title	Action
PHON-44	Audio workstream	Reframe — scope statement updated to "Audio support across PhonoLex, modality-agnostic." Stays as parent epic.
PHON-45	Audio detection spike	Already Done. Findings retained as data point.
PHON-46	Speechocean762 fine-tune spike	Already Done. Findings retained as data point for L2/adult use cases.
PHON-51	Pediatric cross-corpus eval	Already Done. Choice (Fork B) retired post-PHON-123 retro; not load-bearing.
PHON-52	pedTD canonical asserter	Already Done. Reframe in memory: the "16x FP-clean improvement" was canonical-bias induction, not calibration.
PHON-53	Productize the v1 audio tool	Won't Do as currently scoped. Split into Model #1–#5 implementation tickets (new PHON-128, PHON-129, PHON-130, PHON-131; PHON-122 already covers Model #3).
PHON-55	Mixed-cohort actual-target FT	Already Done. Reframe: "wins on F1" overstated; canonical-bias issue masked by aggregated metrics.
PHON-121	Close SSD/CAS precision floor	Won't Do. Iterates the broken architecture.
PHON-122	Distortion classification spike	Keep, reframe as Model #3 v6.2 release ticket. Scope was already correct.
PHON-123	Model B re-baseline	Already Done. Findings.md committed; license cleared regardless. Final verdict: structural canonical bias; do not adopt as v1.
PHON-124	PERCEPT-R / narrative / bilingual coverage extension	Won't Do. Iterates the broken architecture.
PHON-125	Variant-aware phonological LM	Won't Do. Downstream of transcribe-then-align spine.
PHON-126	Feature-vector graded error	Already Done. Metric is the substrate for Model #2 and any future graded scoring.
PHON-127	Audio-grounded LLM correction probe	Won't Do. Downstream of transcribe-then-align spine.
(new) PHON-128	Model #1 — Audio → phoneme transcript implementation (v6.0)	File. First implementation ticket.
(new) PHON-129	Model #2 — L2/accent pronunciation scorer implementation (v6.1)	File. Depends on Model #1 shipping.
PHON-122 (reframe)	Model #3 — Distortion-type classifier, rhotic via PERCEPT-R (v6.2)	(Already exists; reframe scope.)
(new) PHON-130	Model #4 — Acoustic feature extraction (v6.3)	File. Includes the Praat-on-Cloudflare investigation as a blocker subtask.
(new) PHON-131	Model #5 — Speaker-relative acoustic comparison (v6.4)	File. Depends on Model #4 shipping.

Each new ticket points at a section of this spec for scope. Each retired ticket links to §5 for reasoning.

8. References

[[project_audio_workstream]] — workstream-level state + the PHON-123 retro narrative
[[project_audio_targeted_models]] — five-model architecture detail
[[feedback_question_inherited_framings]] — failure-mode pattern (pediatric narrowing + Stage 1 in PHON-123)
PHON-126 findings — graded-error metric validation
PHON-123 findings — canonical-bias diagnosis
archive/csp-generation-v5.2 — proxy/scale-to-zero pattern for Worker → RunPod
pediatric_speech_annotation_spec.md + pediatric_speech_annotation_spec_review.md — long-horizon corpus vehicle, decoupled from v6
The Sound of Syntax (EMNLP 2025, arXiv 2509.16765) — clinical SOTA reference

Spec written 2026-05-30. Per-model implementation specs follow when each release ticket activates.