PhonoLex Documentation

PhonoLex is a phonological analysis and corpus-retrieval platform for speech-language pathologists (SLPs), linguists, researchers, and educators.

What's in the box

PhonoLex consolidates a CMU-grounded lexicon, in-house psycholinguistic norms, a learned phoneme feature space, and a curated naturalistic English corpus into a single edge-deployed API + SPA:

~125,000 phonology-bearing entries from the CMU Pronouncing Dictionary, with a ~47,000-word canonical content-POS subset (NOUN / VERB / ADJ / ADV) carrying the full ~150-column psycholinguistic norm set
~1.6 million word-similarity edges: mostly Qwensim neural-embedding cosine similarity computed over FineWeb-Edu, plus thin tails of ECCC perceptual confusability and WordSim-353 human-rated similarity. The semantic similarity comes from a sentence-transformer model, not from free-association norm data.
642K minimal pairs with learned-feature distance + sonorant-diff metrics
~236K curated corpus sentences (CoLA, UD-EWT, GUM, Tatoeba, OpenSubtitles), gated for SLP suitability and indexed for fast constraint queries
Learned phoneme feature vectors via in-house Bayesian inference (r=0.987 cosine correlation vs theory-assigned features)
Word→image mapping — ~1,700 words with picture-card imagery (Mulberry Symbols + OpenMoji, CC BY-SA), with a has_image filter and a "Picture pairs only" mode in Contrast Sets
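
As an illustration of what a minimal pair is at the data level, the sketch below finds words whose phoneme sequences have the same length and differ at exactly one position. It is a toy example over hand-written ARPAbet pronunciations, not the PhonoLex pipeline: the PRONUNCIATIONS table and both function names are assumptions made for this example, and the learned-feature distance and sonorant-diff metrics are not modeled.

```python
from itertools import combinations

# Toy ARPAbet pronunciations in the style of the CMU Pronouncing Dictionary,
# with stress digits already stripped. Illustrative data only.
PRONUNCIATIONS = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "cut": ["K", "AH", "T"],
    "cast": ["K", "AE", "S", "T"],
}


def differing_position(a, b):
    """Return the index of the single differing phoneme, or None if the two
    sequences are not a minimal pair (different lengths, identical, or more
    than one difference)."""
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def minimal_pairs(pronunciations):
    """Yield (word_a, word_b, position, phoneme_a, phoneme_b) tuples."""
    for (w1, p1), (w2, p2) in combinations(sorted(pronunciations.items()), 2):
        pos = differing_position(p1, p2)
        if pos is not None:
            yield (w1, w2, pos, p1[pos], p2[pos])


pairs = list(minimal_pairs(PRONUNCIATIONS))
for pair in pairs:
    print(pair)
```

The pairwise comparison is quadratic in the number of words, which is fine for a toy table. A lexicon-scale build would more plausibly bucket words by a "pronunciation with one slot masked" key so that only candidates sharing a key are compared.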

Live tools (clinician-facing)

Custom Word Lists — IPA pattern matching combined with property filters across ~150 psycholinguistic dimensions, CV-shape selection, sound-similarity anchoring, and a has_image filter for picture-card material
Text Analysis — passage analysis with percentile statistics and per-word property-overlay highlighting
Contrast Sets — minimal pairs, maximal opposition, multiple opposition, with a "Picture pairs only" mode
Lookup — word details, phoneme features, neighboring words via Qwensim semantic similarity, percentile profiles, picture cards where imagery exists
Sentences — curated corpus retrieval with the full constraint vocabulary and a per-result word-highlight overlay
Speech Analysis (Beta) — faithful narrow transcript + per-position deviation overlay from recorded or uploaded productions; decision support for a clinician-in-the-loop, not a diagnosis
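
To make the CV-shape and has_image filters from Custom Word Lists concrete, here is a minimal sketch that derives a consonant/vowel shape from ARPAbet phonemes and filters a small word list by it. The LEXICON records, their field names, and the select helper are hypothetical and exist only for this example. The vowel inventory is the standard CMU Pronouncing Dictionary set, and treating ER as a vowel is an assumption about how shapes are assigned.

```python
# ARPAbet vowel symbols as used by the CMU Pronouncing Dictionary.
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def cv_shape(phonemes):
    """Map each phoneme to 'V' (vowel) or 'C' (consonant).

    Trailing stress digits (e.g. 'AE1') are stripped before lookup.
    """
    return "".join(
        "V" if p.rstrip("012") in ARPABET_VOWELS else "C" for p in phonemes
    )


# Hypothetical mini-lexicon; the has_image values are invented for this example.
LEXICON = [
    {"word": "cat", "phonemes": ["K", "AE1", "T"], "has_image": True},
    {"word": "stop", "phonemes": ["S", "T", "AA1", "P"], "has_image": False},
    {"word": "dog", "phonemes": ["D", "AO1", "G"], "has_image": True},
    {"word": "eat", "phonemes": ["IY1", "T"], "has_image": True},
]


def select(lexicon, shape, require_image=False):
    """Return words whose CV shape matches, optionally only those with imagery."""
    return [
        entry["word"]
        for entry in lexicon
        if cv_shape(entry["phonemes"]) == shape
        and (entry["has_image"] or not require_image)
    ]


cvc_with_image = select(LEXICON, "CVC", require_image=True)
print(cvc_with_image)  # ['cat', 'dog']
```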

Visit phonolex.com to use them in your browser. No installation required.

R&D workstreams (path to the closed loop)

The end goal is to automate the diagnostic-therapy-feedback cycle that SLPs currently run manually:

Audio Detection — diagnostic input + progress feedback. Research complete (2026-06-14); a beta slice ships as Speech Analysis, full productization is gated follow-on work.
Curriculum Recommender — diagnostic profile → graded sequence of targets delivered through the live tools. Successor framing for the older "Content Catalog" concept.
Governed Generation — paused. The CSP-solver + reranker stack was retired in v5.2 in favor of corpus retrieval; returns when curricula need synthetic material the corpus can't supply.
Adaptive Loop — glue closing diagnostic → curriculum → feedback → re-recommendation.

Architecture

PhonoLex is a uv-workspace monorepo with three packages:

packages/data — shared Python data layer (loaders, phonology, runtime parquet + D1 SQL emit)
packages/features — learned phoneme feature vectors via Bayesian inference
packages/web — workers/ Hono API on Cloudflare Workers + D1, and frontend/ React + MUI SPA

The deploy artifact is a single LFS-tracked d1-seed.sql. Developers build it locally, and CI applies the seed via wrangler d1 execute --remote. No Python pipeline runs in CI.

See Architecture for the full system diagram.

API

PhonoLex provides a REST API at api.phonolex.com. See API reference for the route catalog.
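
A client can be as small as a URL builder plus a JSON fetch. The sketch below uses only the Python standard library. The host comes from this page, but the HTTPS scheme, the /hypothetical/words path, and the has_image query parameter are placeholders invented for illustration, so take real routes and parameters from the API reference.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Host taken from this page; the HTTPS scheme is an assumption.
BASE_URL = "https://api.phonolex.com"


def build_url(path, params=None):
    """Join the base URL, a route path, and optional query parameters."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_URL}{path}{query}"


def get_json(path, params=None, timeout=10):
    """GET a route and decode the JSON body. Performs a real network call."""
    request = Request(
        build_url(path, params), headers={"Accept": "application/json"}
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# HYPOTHETICAL route and parameter, shown only to demonstrate URL construction.
# Replace both with entries from the route catalog before calling get_json.
url = build_url("/hypothetical/words", {"has_image": "true"})
print(url)
```

Only build_url runs here. get_json is defined but not called, since the placeholder route is not expected to exist.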