Skip to content

Audio Model

The technical basis of the Speech Analysis tool: how PhonoLex turns an audio clip into a per-position deviation read and a speaker-pattern attribution.

Status

The audio model is a research-validated Beta. The numbers below are measured on held-out speakers from the corpora named, not a claim of clinical-grade or population-general accuracy.

One representation: phones as trajectories

The model is a faithful feature emitter: a wav2vec2-lv-60-espeak self-supervised backbone with a small linear head (Linear(1024, 26) + sigmoid) that emits, for every ~20 ms frame, a 26-dimensional articulatory feature vector — the same learned feature space used for phonological similarity elsewhere in PhonoLex.

Every phoneme is represented not as a single point but as a trajectory: an 8-point path through that 26-d space over the phone's duration. One uniform representation, measured one way, makes all phones directly comparable — including obstruents (whose formants are unreliable) and semivowels/liquids (which a single point cannot place on the consonant–vowel gradient). The emitter, not formant tracking, is the yardstick: it captures frication, burst, and glide uniformly.

This is "faithful" in the clinical sense — it transcribes what was produced, not what should have been produced.

Reference trajectories (63 phones)

Each target phoneme is scored against a reference trajectory. The reference set covers the full 63-phone inventory with explicit provenance:

Source Count Method
Measured 52 Per-phone feature trajectories aggregated from time-aligned TIMIT + L2-ARCTIC productions (gold/narrow-IPA boundaries)
Derived 11 Nearest-measured-neighbour transplant for phones absent from those corpora — provenance-tagged, never presented as measured

Scoring

A produced phone's emitted sub-trajectory is compared to its target's reference with a Fisher-weighted trajectory distance — each (timepoint, dimension) is weighted by its between-phone / within-phone variance, so place- and manner-bearing dimensions count more and noise counts less. The same comparison against all references yields the nearest phone — the sound actually produced — which is what lets the tool name a substitution rather than only flag a miss.

The discriminative weighting and the per-frame trajectory targets together roughly doubled the model's error catch-rate over a centroid baseline on held-out clips (a research metric, not a clinical sensitivity figure).

Source attribution

A second layer attributes the source of deviation across four categories: typical, accent (L1-transfer), developmental, and motor. The intuition is that perceptual confusability, productive substitution patterns, and motor execution are distinct generators, each with a different signature.

Per production, six features are computed and aggregated across the session before classification:

Feature Captures
gradient (global + consonant-restricted) distance-to-nearest-reference — the "disordered" axis
excursion (global + consonant-restricted) range of articulatory movement — reduced vs full
rate frames per phone — slowed production
accent-score how much the substitution pattern matches L1-transfer priors vs a developmental prior

A session's pooled feature vector is classified by nearest standardized centroid. On a leave-one-subject-out evaluation across the training populations, the 4-way classifier reached 90% (chance 25%): child/developmental 96%, accent 83%, typical 87%, motor 100%. The disorder axes separate cleanly; the only confusion is accent ↔ typical — clinically benign, since neither is a disorder, and exactly the "don't pathologize an accent" guardrail expressed as a measurement.

Validated per speaker, served per session

The 90% figure is leave-one-subject-out over many productions per speaker. The tool serves a per-session read that pools each production's features the same way; a single short production carries little signal, which is why the UI states confidence by quantity. At serve time the produced transcript comes from the emitter (not a gold transcript), so the accent-score feature is noisier than in the offline validation — it informs the read, it does not drive a verdict.

Training data

The emitter and references are built from a union of openly-licensed speech corpora, used under a non-commercial, nonprofit posture (Just Semantics):

Corpus Role License basis
TIMIT (LDC93S1) clean native-adult, phone-aligned LDC, non-commercial research/technology development
L2-ARCTIC L1-transfer (accent) productions CC BY-NC
PhonBank / TalkBank child + disordered productions CC BY-NC-SA; algorithm-development clause
TORGO dysarthria (motor) academic / non-profit
LibriSpeech, Common Voice native breadth, problem-phone coverage open / CC0

Model weights are a trained transformation, not redistribution of any corpus; no source audio is hosted by PhonoLex.

Known limits

  • Encoder-limited phones. Retroflex place and the glottal stop /ʔ/ cluster but do not finely separate in the current encoder — a documented limitation, not a framework gap.
  • Per-clip noise. Attribution is reliable in aggregate; single short clips are indicative only.
  • Broad-phoneme, word-level. Fine sub-phonemic distortion is not separately modeled; sentence input and utterance slicing are planned.

Serving

The model is served by a FastAPI inference host (phonolex_audio); the Worker proxies /api/audio/analyze (target → canonical lookup → host) and /api/audio/attribute (session feature pooling).