Audio Model

The technical basis of the Speech Analysis tool: how PhonoLex turns an audio clip into a per-position deviation read and a speaker-pattern attribution. The attribution layer is computed server-side but is not currently surfaced in the UI (the panel was removed in v6.2.2 until the diagnostic loop is closed).

Status

The audio model is a research-validated Beta. The numbers below are measured on held-out speakers from the named corpora; they are not a claim of clinical-grade or population-general accuracy.

One representation: phones as trajectories

The model is a faithful feature emitter: a wav2vec2-lv-60-espeak self-supervised backbone with a small linear head (Linear(1024, 26) + sigmoid) that emits, for every ~20 ms frame, a 26-dimensional articulatory feature vector — the same learned feature space used for phonological similarity elsewhere in PhonoLex.
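The head described above can be sketched in a few lines of plain Python. This is a minimal illustration only: the weights and hidden states are random stand-ins, the names `emit_features`, `weight`, `bias`, and `hidden_states` are hypothetical, and the backbone itself is not run here.

```python
import math
import random

HIDDEN_DIM = 1024  # backbone hidden size, from Linear(1024, 26)
N_FEATURES = 26    # articulatory feature dimensions

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def emit_features(hidden, weight, bias):
    """Apply the linear head plus sigmoid to one frame's hidden state."""
    return [
        sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(weight, bias)
    ]

# Random stand-ins (assumption): real weights and hidden states come from the trained model.
rng = random.Random(0)
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)] for _ in range(N_FEATURES)]
bias = [0.0] * N_FEATURES
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)] for _ in range(5)]

# Five frames at ~20 ms each cover roughly 100 ms of audio.
features = [emit_features(h, weight, bias) for h in hidden_states]
print(len(features), len(features[0]))
```

Because of the sigmoid, every emitted value lies strictly between 0 and 1, so each of the 26 dimensions can be read as a graded feature activation per frame.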

Every phoneme is represented not as a single point but as a trajectory: an 8-point path through that 26-d space over the phone's duration. One uniform representation, measured one way, makes all phones directly comparable — including obstruents (whose formants are unreliable) and semivowels/liquids (which a single point cannot place on the consonant–vowel gradient). The emitter, not formant tracking, is the yardstick: it captures frication, burst, and glide uniformly.
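One way to turn a variable-length run of frames into a fixed 8-point trajectory is linear interpolation along the time axis. The sketch below assumes that approach; the text does not state how PhonoLex resamples, and `resample_trajectory` is a hypothetical name.

```python
def resample_trajectory(frames, n_points=8):
    """Linearly interpolate a list of per-frame vectors to n_points vectors.

    Assumption: evenly spaced sample positions from the first to the last frame.
    """
    if not frames:
        raise ValueError("phone has no frames")
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    if len(frames) == 1:
        # A single frame cannot be interpolated; repeat it.
        return [list(frames[0]) for _ in range(n_points)]
    last = len(frames) - 1
    out = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)  # fractional frame index
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b for a, b in zip(frames[lo], frames[hi])])
    return out

# Toy 2-d frames (the real vectors are 26-d): a 4-frame phone becomes 8 points.
short_phone = [[0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
trajectory = resample_trajectory(short_phone)
```

A short stop and a long vowel both end up as 8 points, which is what makes a single distance measure applicable to every phone.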

This is "faithful" in the clinical sense — it transcribes what was produced, not what should have been produced.

Reference trajectories (63 phones)

Each target phoneme is scored against a reference trajectory. The reference set covers the full 63-phone inventory with explicit provenance:

Source	Count	Method
Measured	52	Per-phone feature trajectories aggregated from time-aligned TIMIT + L2-ARCTIC productions (gold/narrow-IPA boundaries)
Derived	11	Nearest-measured-neighbour transplant for phones absent from those corpora — provenance-tagged, never presented as measured
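The provenance tagging can be pictured as a small record type plus a transplant step. This is a sketch under stated assumptions: the `Reference` class, the `derive_reference` function, and the `canonical` feature vectors used to decide which measured phone is "nearest" are all hypothetical, since the text does not say in which space the neighbour is chosen.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reference:
    phone: str
    trajectory: tuple      # 8 points x 26 dims in the real model
    provenance: str        # "measured" or "derived"
    donor: str = ""        # measured phone a derived entry was copied from

def derive_reference(phone, canonical, measured):
    """Transplant the trajectory of the nearest measured phone.

    `canonical` maps phone -> hypothetical canonical feature vector;
    `measured` maps phone -> measured Reference.
    """
    target = canonical[phone]
    donor = min(
        measured,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(canonical[p], target)),
    )
    return Reference(phone, measured[donor].trajectory, "derived", donor)

# Toy 3-d canonical vectors and 2-point trajectories, for illustration only.
# "\u026c" is the IPA voiceless lateral fricative, standing in for an unmeasured phone.
canonical = {"s": (1.0, 0.0, 0.0), "z": (1.0, 0.0, 1.0), "\u026c": (1.0, 0.2, 0.0)}
measured = {
    "s": Reference("s", ((0.9, 0.1), (0.8, 0.1)), "measured"),
    "z": Reference("z", ((0.9, 0.9), (0.8, 0.9)), "measured"),
}
derived = derive_reference("\u026c", canonical, measured)
```

Keeping `provenance` and `donor` on the record means any downstream consumer can tell a transplanted reference from a measured one without consulting a separate list.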

Scoring

A produced phone's emitted sub-trajectory is compared to its target's reference with a Fisher-weighted trajectory distance: each (timepoint, dimension) cell is weighted by the ratio of its between-phone variance to its within-phone variance, so place- and manner-bearing dimensions count more and noisy ones count less. Running the same comparison against all references yields the nearest phone, that is, the sound actually produced, which is what lets the tool name a substitution rather than only flag a miss.
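The weighting and the nearest-phone lookup can be sketched as follows. The between-phone over within-phone variance ratio comes from the description above; the weighted squared-difference form of the distance, the small `EPS` guard, and all function names are assumptions made for illustration, and the toy data uses 2 timepoints and 2 dimensions rather than 8 and 26.

```python
from statistics import mean, pvariance

EPS = 1e-8  # assumption: guards against division by zero

def fisher_weights(examples):
    """examples: phone -> list of trajectories (each T x D). Returns T x D weights."""
    phones = list(examples)
    n_t = len(examples[phones[0]][0])
    n_d = len(examples[phones[0]][0][0])
    weights = []
    for t in range(n_t):
        row = []
        for d in range(n_d):
            per_phone = [[traj[t][d] for traj in examples[p]] for p in phones]
            between = pvariance([mean(v) for v in per_phone])  # spread of phone means
            within = mean(pvariance(v) for v in per_phone)     # average spread inside a phone
            row.append(between / (within + EPS))
        weights.append(row)
    return weights

def mean_trajectory(trajectories):
    """Cell-wise mean over a list of trajectories."""
    return [[mean(cell) for cell in zip(*points)] for points in zip(*trajectories)]

def weighted_distance(produced, reference, weights):
    return sum(
        w * (a - b) ** 2
        for p_row, r_row, w_row in zip(produced, reference, weights)
        for a, b, w in zip(p_row, r_row, w_row)
    )

def nearest_phone(produced, references, weights):
    return min(references, key=lambda p: weighted_distance(produced, references[p], weights))

# Toy data: dimension 0 separates the phones, dimension 1 is noise.
examples = {
    "s": [[[0.9, 0.2], [0.9, 0.8]], [[0.8, 0.8], [0.8, 0.2]]],
    "t": [[[0.1, 0.8], [0.1, 0.2]], [[0.2, 0.2], [0.2, 0.8]]],
}
weights = fisher_weights(examples)
references = {p: mean_trajectory(trajs) for p, trajs in examples.items()}

produced = [[0.3, 0.1], [0.25, 0.9]]  # target was "s", but it sits nearer to "t"
heard = nearest_phone(produced, references, weights)
```

In this toy case the noisy dimension receives a weight of zero, so the large excursions of `produced` on that dimension do not affect which phone is named.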

The discriminative weighting and the per-frame trajectory targets together roughly doubled the model's error catch-rate over a centroid baseline on held-out clips (a research metric, not a clinical sensitivity figure).

Source attribution

A second layer attributes the source of deviation across four categories: typical, accent (L1-transfer), developmental, and motor. The intuition is that perceptual confusability, productive substitution patterns, and motor execution are distinct generators, each with a different signature.

Per production, six features are computed and aggregated across the session before classification:

Feature	Captures
gradient (global + consonant-restricted)	distance-to-nearest-reference — the "disordered" axis
excursion (global + consonant-restricted)	range of articulatory movement — reduced vs full
rate	frames per phone — slowed production
accent-score	how much the substitution pattern matches L1-transfer priors vs a developmental prior
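A rough sketch of how the first five features in the table might be computed per production and pooled per session. Everything specific here is an assumption: excursion is taken as the mean per-dimension range, pooling is a plain mean, and the function and key names are hypothetical. The accent-score feature is omitted because it depends on prior tables not described here.

```python
from statistics import mean

def excursion(trajectory):
    """Mean per-dimension range (max minus min) along one trajectory (assumed definition)."""
    return mean(max(dim) - min(dim) for dim in zip(*trajectory))

def production_features(trajectory, n_frames, nearest_distance, is_consonant):
    """Features for a single produced phone."""
    return {
        "gradient": nearest_distance,        # distance to the nearest reference
        "excursion": excursion(trajectory),  # range of articulatory movement
        "rate": n_frames,                    # frames per phone
        "is_consonant": is_consonant,
    }

def pool_session(productions):
    """Average each feature over the session, globally and over consonants only."""
    if not productions:
        raise ValueError("session has no productions")
    consonants = [p for p in productions if p["is_consonant"]]

    def avg(rows, key):
        return mean(r[key] for r in rows) if rows else float("nan")

    return {
        "gradient": avg(productions, "gradient"),
        "gradient_consonant": avg(consonants, "gradient"),
        "excursion": avg(productions, "excursion"),
        "excursion_consonant": avg(consonants, "excursion"),
        "rate": avg(productions, "rate"),
    }

# Toy 2-d trajectories and made-up distances, for illustration only.
p1 = production_features([[0.0, 1.0], [0.5, 1.0], [1.0, 0.0]], 4, 2.0, True)
p2 = production_features([[0.5, 0.5], [0.5, 0.0]], 6, 4.0, False)
pooled = pool_session([p1, p2])
```

Pooling over many productions is what gives the session vector its stability; with one or two productions the averages are dominated by whichever phones happened to be sampled.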

A session's pooled feature vector is classified by nearest standardized centroid. On a leave-one-subject-out evaluation across the training populations, the 4-way classifier reached 90% accuracy (chance 25%): child/developmental 96%, accent 83%, typical 87%, motor 100%. The disorder axes separate cleanly; the main confusion is accent ↔ typical, which is clinically benign since neither is a disorder, and is exactly the "don't pathologize an accent" guardrail expressed as a measurement.
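Nearest-standardized-centroid classification and a leave-one-subject-out loop can be sketched as below. The data is synthetic and chosen to be trivially separable; it is not PhonoLex data and says nothing about the reported accuracy. Function names and the two-feature layout are assumptions.

```python
from statistics import mean, pstdev

def standardize(x, mu, sigma):
    return [(v - m) / s for v, m, s in zip(x, mu, sigma)]

def fit(rows):
    """rows: list of (subject, label, feature_vector). Returns (mu, sigma, centroids)."""
    cols = list(zip(*(vec for _, _, vec in rows)))
    mu = [mean(c) for c in cols]
    sigma = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero on constant features
    centroids = {}
    for label in sorted({lab for _, lab, _ in rows}):
        z = [standardize(vec, mu, sigma) for _, lab, vec in rows if lab == label]
        centroids[label] = [mean(c) for c in zip(*z)]
    return mu, sigma, centroids

def classify(vec, model):
    """Return the label whose centroid is closest in standardized space."""
    mu, sigma, centroids = model
    z = standardize(vec, mu, sigma)
    return min(centroids, key=lambda lab: sum((a - b) ** 2 for a, b in zip(z, centroids[lab])))

def loso_accuracy(rows):
    """Hold out every session of one subject at a time, train on the rest."""
    hits = 0
    for subject in sorted({s for s, _, _ in rows}):
        train = [r for r in rows if r[0] != subject]
        held_out = [r for r in rows if r[0] == subject]
        model = fit(train)
        hits += sum(classify(vec, model) == lab for _, lab, vec in held_out)
    return hits / len(rows)

# Synthetic, well-separated toy sessions: (subject, label, [feature_a, feature_b]).
rows = [
    ("s1", "typical", [1.0, 1.0]), ("s2", "typical", [1.2, 0.8]),
    ("s3", "accent", [1.0, 5.0]), ("s4", "accent", [1.2, 5.2]),
    ("s5", "developmental", [5.0, 1.0]), ("s6", "developmental", [5.2, 1.2]),
    ("s7", "motor", [9.0, 9.0]), ("s8", "motor", [9.2, 8.8]),
]
accuracy = loso_accuracy(rows)
label = classify([9.1, 9.0], fit(rows))
```

Refitting the mean, standard deviation, and centroids inside each fold keeps the held-out subject from leaking into the standardization, which is the point of evaluating per subject rather than per production.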

Validated per speaker, served per session

The 90% figure is leave-one-subject-out over many productions per speaker. The Worker computes a per-session read that pools each production's features the same way; a single short production carries little signal. The shipped UI surfaces the transcript and the per-position deviation overlay only — the attribution read is computed but not displayed (panel removed in v6.2.2). At serve time the produced transcript comes from the emitter (not a gold transcript), so the accent-score feature is noisier than in the offline validation — it informs the read, it does not drive a verdict.

Training data

The emitter and references are built from a union of open and research-licensed speech corpora, used under a non-commercial research posture (LDC agreement held by Neumann's Workshop, LLC; entity decision 2026-07-12, PHON-163). Models trained on this data must not ship in paid products. The corpora and their license bases:

Corpus	Role	License basis
TIMIT (LDC93S1)	clean native-adult, phone-aligned	LDC, non-commercial research/technology development
L2-ARCTIC	L1-transfer (accent) productions	CC BY-NC
PhonBank / TalkBank	child + disordered productions	CC BY-NC-SA; algorithm-development clause
TORGO	dysarthria (motor)	academic / non-profit
LibriSpeech, Common Voice	native breadth, problem-phone coverage	CC BY 4.0 (LibriSpeech) / CC0 (Common Voice)

Model weights are a trained transformation, not redistribution of any corpus; no source audio is hosted by PhonoLex.

Known limits

Encoder-limited phones. Retroflex place and the glottal stop /ʔ/ cluster but do not finely separate in the current encoder — a documented limitation, not a framework gap.
Per-clip noise. Attribution is reliable in aggregate; single short clips are indicative only.
Broad-phoneme, word-level. Fine sub-phonemic distortion is not separately modeled; sentence input and utterance slicing are planned.

Serving

The model is served by a FastAPI inference host (phonolex_audio); the Worker proxies /api/audio/analyze (target → canonical lookup → host) and /api/audio/attribute (session feature pooling). Both endpoints are live, but the session-level attribution panel was removed from the UI in v6.2.2 — the frontend currently renders the transcript + deviation overlay only.