Speech Analysis (Beta)

Record or upload a spoken production against a target word and get back what was actually said (a narrow phonetic transcript) and how it deviated from the target (a per-position deviation overlay). A session of several productions also yields a secondary read of which speaker pattern the productions most resemble.

Beta — decision support, not diagnosis

Speech Analysis is a clinician-in-the-loop decision-support tool, not an autonomous assessment. It surfaces structured phonetic evidence — a faithful transcript, graded per-position deviations, and a speaker-pattern read — for you to interpret. It does not diagnose, and its outputs are not a clinical score. The audio model is under active development.

What it does

For each production you supply (a target word + an audio clip), the tool returns two things:

  • The faithful transcript — the phonemes the model actually heard, in narrow form ("what we heard"), aligned against the canonical target.
  • The deviation overlay — each target phoneme, colored by how far the production drifted from it, with the nearest reference sound on hover. Where the nearest sound differs from the target, the position is flagged as a substitution (e.g. target /ɹ/, nearest /w/).

This is the core of the tool, and it works on a single clip.
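As a rough illustration of how a graded overlay could be derived from per-position results, the sketch below bins each deviation into a color band and flags a substitution when the nearest sound differs from the target. The band edges, band labels, and the shape of each position record are assumptions made for this example; the tool's actual thresholds and output format are not documented here.

```python
from bisect import bisect_right

# Hypothetical band edges and labels, chosen only for illustration.
BAND_EDGES = [0.5, 1.0, 2.0]
BAND_LABELS = ["on target", "mild", "moderate", "marked"]


def grade(deviation):
    """Map a deviation value to a display band."""
    return BAND_LABELS[bisect_right(BAND_EDGES, deviation)]


def build_overlay(positions):
    """Add a band and a substitution flag to each per-position record."""
    overlay = []
    for pos in positions:
        overlay.append({
            **pos,
            "band": grade(pos["deviation"]),
            # A substitution is flagged when the nearest sound is not the target.
            "substitution": pos["nearest"] != pos["target"],
        })
    return overlay


# Made-up per-position results for the target word "red".
positions = [
    {"target": "ɹ", "nearest": "w", "deviation": 1.6},
    {"target": "ɛ", "nearest": "ɛ", "deviation": 0.2},
    {"target": "d", "nearest": "d", "deviation": 0.7},
]
overlay = build_overlay(positions)
```

In this toy data only the first position is flagged, matching the target /ɹ/, nearest /w/ case described above.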

A session of multiple productions additionally surfaces a source-attribution read: whether the productions, taken together, pattern like typical speech, an accent (L1 transfer), a developmental pattern, or a motor speech pattern. This read is secondary, explicitly framed as supporting context, and it sharpens as you add more productions (see Source attribution below).

How it works

PhonoLex's audio model emits, for every frame of audio, a 26-dimensional articulatory feature vector in the same learned feature space the rest of the platform uses for phonological similarity. Every phoneme is represented as a short trajectory (a path) through that space, and your production is scored against a reference trajectory for each target phoneme. The deviation is a discriminatively weighted distance: the larger it is, the further the production's path through articulatory space drifted from the target's. The nearest reference is the phoneme whose trajectory the production came closest to, so the tool can name the sound that was produced, not just flag that the target was missed.
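The scoring idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the model's implementation: it uses 3 feature dimensions instead of 26, assumes the produced and reference trajectories have already been resampled to the same number of frames, and uses a weighted Euclidean distance averaged over frames. The weights, reference trajectories, and function names are all made up for the example.

```python
import math


def weighted_distance(produced, reference, weights):
    """Mean weighted Euclidean distance between two equal-length trajectories."""
    if len(produced) != len(reference):
        raise ValueError("trajectories must have the same number of frames")
    total = 0.0
    for p_frame, r_frame in zip(produced, reference):
        total += math.sqrt(
            sum(w * (p - r) ** 2 for w, p, r in zip(weights, p_frame, r_frame))
        )
    return total / len(produced)


def score_position(produced, target, references, weights):
    """Score one target position and name the nearest reference phoneme."""
    distances = {
        phoneme: weighted_distance(produced, trajectory, weights)
        for phoneme, trajectory in references.items()
    }
    nearest = min(distances, key=distances.get)
    return {
        "target": target,
        "deviation": distances[target],  # distance to the target's own trajectory
        "nearest": nearest,              # the phoneme the production came closest to
        "substitution": nearest != target,
    }


# Hypothetical per-dimension weights and two-frame reference trajectories.
WEIGHTS = [1.0, 2.0, 0.5]
REFERENCES = {
    "ɹ": [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    "w": [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
}

produced = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.0]]
result = score_position(produced, "ɹ", REFERENCES, WEIGHTS)
```

Here the produced path sits much closer to the /w/ reference than to the /ɹ/ reference, so the position has a large deviation from its target and /w/ is reported as the nearest sound.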

For the model internals, see Technical → Audio Model.

Using it

1. Set a target

Type the target word. The tool checks it against the PhonoLex lexicon in real time and shows whether it is supported, meaning the lexicon has a canonical pronunciation to score against. When the word is supported, its canonical phonemes are previewed. Words not in the lexicon cannot be scored.

Recording and upload stay disabled until a supported target is set, so a clip is always attached to a target.
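The gating described above amounts to a lookup followed by a boolean check. A minimal sketch, assuming a tiny in-memory stand-in for the lexicon; the dictionary contents, the normalization step, and the function names `check_target` and `capture_enabled` are hypothetical and do not reflect the PhonoLex API.

```python
# Stand-in lexicon for illustration only: word -> canonical phonemes.
LEXICON = {
    "red": ["ɹ", "ɛ", "d"],
    "rabbit": ["ɹ", "æ", "b", "ɪ", "t"],
}


def check_target(word, lexicon=LEXICON):
    """Return support status and a phoneme preview for a typed target."""
    key = word.strip().lower()
    phonemes = lexicon.get(key)
    return {
        "word": key,
        "supported": phonemes is not None,
        "phonemes": list(phonemes or []),
    }


def capture_enabled(status):
    """Recording and upload are allowed only for a supported target."""
    return status["supported"]


supported = check_target(" Red ")
unsupported = check_target("wug")
```

With this stand-in, `supported` previews three phonemes and enables capture, while `unsupported` leaves capture disabled.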

2. Add a production

Three ways to add a production to the session:

  • Record — capture from the microphone against the current target.
  • Upload — attach a single audio clip to the current target.
  • Batch upload — select several files at once. Each becomes an editable row, its target seeded from the filename and verified against the lexicon. Fix or remove any row, then run the verified ones on demand — nothing is analyzed silently.
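To make the batch flow concrete, here is a small sketch of seeding a target from each filename, verifying it, and selecting only verified rows to run. The filename convention (first alphabetic run of the file stem), the row fields, and the stand-in lexicon are assumptions for this example; the tool's actual seeding rule is not specified here.

```python
import re
from pathlib import PurePosixPath

# Stand-in lexicon for illustration only.
LEXICON = {"red", "rabbit", "sun"}


def seed_target(filename):
    """Guess a target word from a filename: first alphabetic run of the stem."""
    stem = PurePosixPath(filename).stem.lower()
    match = re.search(r"[a-z]+", stem)
    return match.group(0) if match else ""


def build_rows(filenames, lexicon=LEXICON):
    """One editable row per file, with its seeded target and verification state."""
    rows = []
    for name in filenames:
        target = seed_target(name)
        rows.append({"file": name, "target": target, "verified": target in lexicon})
    return rows


def runnable(rows):
    """Only verified rows are eligible to run; the rest wait for a fix or removal."""
    return [row for row in rows if row["verified"]]


rows = build_rows(["red_take2.wav", "01_rabbit.wav", "untitled7.wav"])
ready = runnable(rows)
```

The third file seeds the target "untitled", which fails verification, so it stays out of the run until its target is corrected or the row is removed.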

A single production is just a session of one. Each production carries its own target, so a session can be the same word repeated or a whole word-list probe.

3. Read the result

Each production renders as a card: the target, the faithful transcript beneath it, and the deviation overlay. Hover any position for its deviation value and the nearest sound. Substitutions are flagged.

Source attribution (bonus)

Below the productions, the session-level panel reports which speaker pattern the accumulated productions most resemble — typical / accent / developmental / motor — with the contributing distances.

Read it as a tendency, not a verdict

Attribution is computed across all the productions in the session and was validated at the speaker level (over many productions per speaker). A single short production is genuinely underdetermined — the panel says so, and shows a confidence indicator that rises as you add productions. The phrasing is deliberate: a session "patterns like" a category; it never asserts a speaker has a condition. The accent-vs-disorder separation is built in specifically so that an accent is not mistaken for a disorder.
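One way to picture a session-level read is as an aggregate over per-production distances. The sketch below is purely illustrative: it assumes each production carries one distance per category, averages them across the session, and uses a confidence value that grows linearly with the number of productions. The averaging rule, the `full_confidence_at` parameter, and the field names are hypothetical, since the actual attribution method and confidence indicator are not described here.

```python
from statistics import mean

CATEGORIES = ("typical", "accent", "developmental", "motor")


def attribute_session(productions, full_confidence_at=10):
    """Summarize which category a session patterns like, with a count-based confidence."""
    if not productions:
        raise ValueError("a session needs at least one production")
    mean_distances = {
        category: mean(p[category] for p in productions) for category in CATEGORIES
    }
    # Smaller mean distance means the session patterns more like that category.
    patterns_like = min(mean_distances, key=mean_distances.get)
    # Hypothetical indicator: rises with the number of productions, capped at 1.0.
    confidence = min(1.0, len(productions) / full_confidence_at)
    return {
        "patterns_like": patterns_like,
        "mean_distances": mean_distances,
        "confidence": confidence,
    }


# Made-up distances for a two-production session.
session = [
    {"typical": 1.2, "accent": 0.4, "developmental": 0.9, "motor": 1.5},
    {"typical": 1.0, "accent": 0.6, "developmental": 1.1, "motor": 1.3},
]
read = attribute_session(session)
```

With only two productions the confidence in this toy scheme stays low, which mirrors the point that a short session is underdetermined and the read should be taken as a tendency.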

Scope and limits

  • Word-level, broad-phoneme. Targets are single words drawn from the lexicon. Connected-sentence input and automatic slicing of longer utterances into word-level productions are planned but not yet available. Scoring is over a broad phonetic inventory; fine sub-phonemic distortions are not separately modeled.
  • Beta. The audio model is under active development; its coverage and outputs may change as it matures.
  • Supporting evidence only. Treat every output as structured input to your own clinical judgment.