PhonoLex Documentation
PhonoLex is a phonological analysis and corpus-retrieval platform for speech-language pathologists, linguists, researchers, and educators.
What's in the box
PhonoLex consolidates a CMU-grounded lexicon, in-house psycholinguistic norms, a learned phoneme feature space, and a curated naturalistic English corpus into a single edge-deployed API + SPA:
- ~125,000 phonology-bearing entries from the CMU Pronouncing Dictionary, with a ~47,000-word canonical content-POS subset (NOUN / VERB / ADJ / ADV) carrying the full ~150-column psycholinguistic norm set
- ~1.6 million word-similarity edges: mostly Qwensim neural-embedding cosine similarity over FineWeb-Edu, plus thin tails of ECCC perceptual confusability and WordSim-353 human-rated similarity. The similarity is semantic, derived from a sentence-transformer; it is not free-association norm data.
- 642K minimal pairs with learned-feature distance + sonorant-diff metrics
- ~236K curated corpus sentences (CoLA, UD-EWT, GUM, Tatoeba, OpenSubtitles), gated for SLP suitability and indexed for fast constraint queries
- Learned phoneme feature vectors via in-house Bayesian inference (r=0.987 cosine correlation vs theory-assigned features)
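To make the minimal-pair notion concrete, here is a minimal sketch, assuming pronunciations are ARPAbet phoneme sequences with stress digits stripped (the notation used by the CMU Pronouncing Dictionary) and that a minimal pair is two equal-length sequences differing in exactly one position. The toy lexicon and the function names are hypothetical, not PhonoLex's implementation, and the learned-feature distance is not modeled.

```python
from itertools import combinations

# Toy lexicon: word -> ARPAbet phonemes, stress digits stripped (illustrative only).
LEXICON = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cut": ("K", "AH", "T"),
    "cab": ("K", "AE", "B"),
    "dog": ("D", "AO", "G"),
}


def differing_position(a, b):
    """Return the index of the single differing phoneme, or None if not a minimal pair."""
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def minimal_pairs(lexicon):
    """Yield (word1, word2, position, phoneme1, phoneme2) for every minimal pair."""
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        pos = differing_position(p1, p2)
        if pos is not None:
            yield (w1, w2, pos, p1[pos], p2[pos])


PAIRS = list(minimal_pairs(LEXICON))
for pair in PAIRS:
    print(pair)
```

The all-pairs comparison is quadratic in lexicon size, which is fine for a toy example; a lexicon of ~125,000 entries would more plausibly be handled by bucketing words on their phoneme sequence with one position masked out.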
Live tools (clinician-facing)
- Custom Word Lists — IPA pattern matching combined with property filters across ~150 psycholinguistic dimensions, CV-shape selection, and sound-similarity anchoring
- Text Analysis — passage analysis with percentile statistics and per-word property-overlay highlighting
- Contrast Sets — minimal pairs, maximal opposition, multiple opposition
- Lookup — word details, phoneme features, neighboring words via Qwensim semantic similarity, percentile profiles
- Sentences — curated corpus retrieval with the full constraint vocabulary and a per-result word-highlight overlay
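As an illustration of how CV-shape selection can combine with a sound constraint and a property filter, here is a minimal sketch. It assumes ARPAbet transcriptions; the `Entry` type, the `select` function, and the age-of-acquisition values are all hypothetical placeholders invented for the example, not PhonoLex's schema or norm data.

```python
from dataclasses import dataclass

# The 15 ARPAbet vowel symbols; every other phoneme is treated as a consonant.
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


@dataclass(frozen=True)
class Entry:
    word: str
    phonemes: tuple
    aoa: float  # hypothetical age-of-acquisition value in years (made up)


def cv_shape(phonemes):
    """Map a phoneme sequence to its consonant/vowel skeleton, e.g. K AE T -> CVC."""
    return "".join("V" if p in ARPABET_VOWELS else "C" for p in phonemes)


def select(entries, shape, initial=None, max_aoa=None):
    """Return words matching a CV shape, an optional initial phoneme, and an AoA cap."""
    matches = []
    for e in entries:
        if cv_shape(e.phonemes) != shape:
            continue
        if initial is not None and e.phonemes[:1] != (initial,):
            continue
        if max_aoa is not None and e.aoa > max_aoa:
            continue
        matches.append(e.word)
    return matches


# Placeholder entries; the numeric values are invented for illustration.
ENTRIES = [
    Entry("cat", ("K", "AE", "T"), 3.0),
    Entry("key", ("K", "IY"), 3.5),
    Entry("kelp", ("K", "EH", "L", "P"), 9.0),
    Entry("cup", ("K", "AH", "P"), 3.2),
    Entry("sun", ("S", "AH", "N"), 3.1),
]

RESULT = select(ENTRIES, "CVC", initial="K", max_aoa=5.0)
print(RESULT)
```

Each filter is optional and applied independently, so the same function covers shape-only, shape-plus-sound, and shape-plus-property queries.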
Visit phonolex.com to use them in your browser. No installation required.
R&D workstreams (path to the closed loop)
The end goal is to automate the diagnostic-therapy-feedback cycle that SLPs currently run manually:
- Audio Detection — diagnostic input + progress feedback. Transcriber model trained (F1 0.43, PER 0.093).
- Curriculum Recommender — diagnostic profile → graded sequence of targets delivered through the live tools. Successor framing for the older "Content Catalog" concept.
- Governed Generation — paused. The CSP-solver + reranker stack was retired in v5.2 in favor of corpus retrieval; returns when curricula need synthetic material the corpus can't supply.
- Adaptive Loop — glue closing diagnostic → curriculum → feedback → re-recommendation.
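For readers unfamiliar with the PER figure quoted for the transcriber, here is a minimal sketch, assuming PER means phoneme error rate under its conventional definition: the Levenshtein edit distance between the reference and hypothesized phoneme sequences, divided by the reference length. The function names are hypothetical, and the source does not state how its own metric was computed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by reference length; can exceed 1.0 with many insertions."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# "rabbit" produced with /w/ for /r/: one substitution over five reference phonemes.
REFERENCE = ["R", "AE", "B", "AH", "T"]
HYPOTHESIS = ["W", "AE", "B", "AH", "T"]
PER = phoneme_error_rate(REFERENCE, HYPOTHESIS)
print(PER)
```

Under this definition a PER of 0.093 would correspond to roughly one phoneme error per eleven reference phonemes.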
Architecture
PhonoLex is a uv-workspace monorepo with three packages:
- packages/data: shared Python data layer (loaders, phonology, runtime parquet + D1 SQL emit)
- packages/features: learned phoneme feature vectors via Bayesian inference
- packages/web: workers/ (Hono API on Cloudflare Workers + D1) and frontend/ (React + MUI SPA)
The deploy artifact is a single LFS-tracked d1-seed.sql. A developer builds it locally, and CI applies the seed via wrangler d1 execute --remote. No Python pipeline runs in CI.
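The idea of emitting a SQL seed from Python can be sketched as follows. This is a minimal illustration, not the project's emitter: the table name, column names, and rows are hypothetical, and the functions `sql_literal` and `emit_seed` are invented for the example. Because D1 is built on SQLite, the sketch smoke-tests the generated text against an in-memory database using Python's standard sqlite3 module.

```python
import sqlite3


def sql_literal(value):
    """Render a Python value as a SQLite literal (finite numbers and text only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return repr(value)
    # Text: wrap in single quotes and double any embedded single quote.
    return "'" + str(value).replace("'", "''") + "'"


def emit_seed(table, columns, rows):
    """Build seed SQL text: drop and recreate the table, then one INSERT per row."""
    cols = ", ".join(columns)  # untyped columns, which SQLite accepts
    lines = [
        f"DROP TABLE IF EXISTS {table};",
        f"CREATE TABLE {table} ({cols});",
    ]
    for row in rows:
        values = ", ".join(sql_literal(v) for v in row)
        lines.append(f"INSERT INTO {table} ({cols}) VALUES ({values});")
    return "\n".join(lines) + "\n"


# Hypothetical table and rows, for illustration only.
SEED = emit_seed(
    "words",
    ["word", "ipa", "syllables"],
    [("cat", "kæt", 1), ("o'clock", "əˈklɑk", 2)],
)

# Smoke test: load the generated SQL into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript(SEED)
ROWS = conn.execute("SELECT word, syllables FROM words ORDER BY word").fetchall()
print(ROWS)
```

Checking the seed locally before committing it fits the workflow described above, where CI only applies the artifact and runs no Python.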
See Architecture for the full system diagram.
API
PhonoLex provides a REST API at api.phonolex.com. See API reference for the route catalog.
License
Proprietary. Copyright (c) 2025-2026 Neumann's Workshop, LLC. All rights reserved. See License.