Audio Model¶
The technical basis of the Speech Analysis tool: how PhonoLex turns an audio clip into a per-position deviation read and a speaker-pattern attribution.
Status
The audio model is a research-validated Beta. The numbers below are measured on held-out speakers from the corpora named, not a claim of clinical-grade or population-general accuracy.
One representation: phones as trajectories¶
The model is a faithful feature emitter: a wav2vec2-lv-60-espeak self-supervised backbone with a small linear head (Linear(1024, 26) + sigmoid) that emits, for every ~20 ms frame, a 26-dimensional articulatory feature vector — the same learned feature space used for phonological similarity elsewhere in PhonoLex.
Every phoneme is represented not as a single point but as a trajectory: an 8-point path through that 26-d space over the phone's duration. One uniform representation, measured one way, makes all phones directly comparable — including obstruents (whose formants are unreliable) and semivowels/liquids (which a single point cannot place on the consonant–vowel gradient). The emitter, not formant tracking, is the yardstick: it captures frication, burst, and glide uniformly.
This is "faithful" in the clinical sense — it transcribes what was produced, not what should have been produced.
Reference trajectories (63 phones)¶
Each target phoneme is scored against a reference trajectory. The reference set covers the full 63-phone inventory with explicit provenance:
| Source | Count | Method |
|---|---|---|
| Measured | 52 | Per-phone feature trajectories aggregated from time-aligned TIMIT + L2-ARCTIC productions (gold/narrow-IPA boundaries) |
| Derived | 11 | Nearest-measured-neighbour transplant for phones absent from those corpora — provenance-tagged, never presented as measured |
Scoring¶
A produced phone's emitted sub-trajectory is compared to its target's reference with a Fisher-weighted trajectory distance — each (timepoint, dimension) is weighted by its between-phone / within-phone variance, so place- and manner-bearing dimensions count more and noise counts less. The same comparison against all references yields the nearest phone — the sound actually produced — which is what lets the tool name a substitution rather than only flag a miss.
The discriminative weighting and the per-frame trajectory targets together roughly doubled the model's error catch-rate over a centroid baseline on held-out clips (a research metric, not a clinical sensitivity figure).
Source attribution¶
A second layer attributes the source of deviation across four categories: typical, accent (L1-transfer), developmental, and motor. The intuition is that perceptual confusability, productive substitution patterns, and motor execution are distinct generators, each with a different signature.
Per production, six features are computed and aggregated across the session before classification:
| Feature | Captures |
|---|---|
| gradient (global + consonant-restricted) | distance-to-nearest-reference — the "disordered" axis |
| excursion (global + consonant-restricted) | range of articulatory movement — reduced vs full |
| rate | frames per phone — slowed production |
| accent-score | how much the substitution pattern matches L1-transfer priors vs a developmental prior |
A session's pooled feature vector is classified by nearest standardized centroid. On a leave-one-subject-out evaluation across the training populations, the 4-way classifier reached 90% (chance 25%): child/developmental 96%, accent 83%, typical 87%, motor 100%. The disorder axes separate cleanly; the only confusion is accent ↔ typical — clinically benign, since neither is a disorder, and exactly the "don't pathologize an accent" guardrail expressed as a measurement.
Validated per speaker, served per session
The 90% figure is leave-one-subject-out over many productions per speaker. The tool serves a per-session read that pools each production's features the same way; a single short production carries little signal, which is why the UI states confidence by quantity. At serve time the produced transcript comes from the emitter (not a gold transcript), so the accent-score feature is noisier than in the offline validation — it informs the read, it does not drive a verdict.
Training data¶
The emitter and references are built from a union of openly-licensed speech corpora, used under a non-commercial, nonprofit posture (Just Semantics):
| Corpus | Role | License basis |
|---|---|---|
| TIMIT (LDC93S1) | clean native-adult, phone-aligned | LDC, non-commercial research/technology development |
| L2-ARCTIC | L1-transfer (accent) productions | CC BY-NC |
| PhonBank / TalkBank | child + disordered productions | CC BY-NC-SA; algorithm-development clause |
| TORGO | dysarthria (motor) | academic / non-profit |
| LibriSpeech, Common Voice | native breadth, problem-phone coverage | open / CC0 |
Model weights are a trained transformation, not redistribution of any corpus; no source audio is hosted by PhonoLex.
Known limits¶
- Encoder-limited phones. Retroflex place and the glottal stop /ʔ/ cluster but do not finely separate in the current encoder — a documented limitation, not a framework gap.
- Per-clip noise. Attribution is reliable in aggregate; single short clips are indicative only.
- Broad-phoneme, word-level. Fine sub-phonemic distortion is not separately modeled; sentence input and utterance slicing are planned.
Serving¶
The model is served by a FastAPI inference host (phonolex_audio); the Worker proxies /api/audio/analyze (target → canonical lookup → host) and /api/audio/attribute (session feature pooling).