PhonoLex Documentation

PhonoLex is a phonological analysis and corpus-retrieval platform for speech-language pathologists, linguists, researchers, and educators.

What's in the box

PhonoLex consolidates a CMU-grounded lexicon, in-house psycholinguistic norms, a learned phoneme feature space, and a curated naturalistic English corpus into a single edge-deployed API + SPA:

  • ~125,000 phonology-bearing entries from the CMU Pronouncing Dictionary, with a ~47,000-word canonical content-POS subset (NOUN / VERB / ADJ / ADV) carrying the full ~150-column psycholinguistic norm set
  • ~1.6 million word-similarity edges: mostly Qwensim neural-embedding cosine similarity over FineWeb-Edu, plus thin tails of ECCC perceptual confusability and WordSim-353 human-rated similarity. The similarity is semantic, computed by a sentence-transformer; it is not free-association norm data.
  • 642K minimal pairs with learned-feature distance + sonorant-diff metrics
  • ~236K curated corpus sentences (CoLA, UD-EWT, GUM, Tatoeba, OpenSubtitles), gated for SLP suitability and indexed for fast constraint queries
  • Learned phoneme feature vectors via in-house Bayesian inference (r=0.987 cosine correlation vs theory-assigned features)
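
To make the minimal-pair entry above concrete: a minimal pair is two words whose pronunciations differ in exactly one phoneme position. The sketch below finds such pairs in a toy lexicon of ARPAbet transcriptions. The lexicon contents and the function names `differing_position` and `minimal_pairs` are illustrative assumptions, not part of the PhonoLex data layer, and the sketch does not compute the learned-feature distance or sonorant-diff metrics.

```python
from itertools import combinations

# Toy ARPAbet lexicon (stress markers omitted); hypothetical sample data.
TOY_LEXICON = {
    "bat": ("B", "AE", "T"),
    "pat": ("P", "AE", "T"),
    "bad": ("B", "AE", "D"),
    "bit": ("B", "IH", "T"),
    "spat": ("S", "P", "AE", "T"),
}


def differing_position(a, b):
    """Return the index of the single differing phoneme, or None if not a minimal pair."""
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def minimal_pairs(lexicon):
    """Return (word1, word2, phoneme1, phoneme2) for every minimal pair in the lexicon."""
    pairs = []
    # Sort by word so the output order is deterministic.
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        pos = differing_position(p1, p2)
        if pos is not None:
            pairs.append((w1, w2, p1[pos], p2[pos]))
    return pairs


if __name__ == "__main__":
    for pair in minimal_pairs(TOY_LEXICON):
        print(pair)
```

The all-pairs comparison is quadratic in lexicon size, which is fine for a toy example; a lexicon of ~125,000 entries would need an indexed approach instead.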

Live tools (clinician-facing)

  1. Custom Word Lists — IPA pattern matching combined with property filters across ~150 psycholinguistic dimensions, CV-shape selection, and sound-similarity anchoring
  2. Text Analysis — passage analysis with percentile statistics and per-word property-overlay highlighting
  3. Contrast Sets — minimal pairs, maximal opposition, multiple opposition
  4. Lookup — word details, phoneme features, neighboring words via Qwensim semantic similarity, percentile profiles
  5. Sentences — curated corpus retrieval with the full constraint vocabulary and a per-result word-highlight overlay
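
As a rough illustration of how Custom Word Lists combines CV-shape selection, pattern matching, and property filters, here is a minimal sketch. It makes several assumptions: it matches over ARPAbet strings rather than IPA, the `Entry` class and its `frequency` field stand in for one of the ~150 norm columns, and `cv_shape` and `select` are hypothetical helpers, not PhonoLex functions.

```python
import re
from dataclasses import dataclass

# ARPAbet vowel symbols used by the CMU Pronouncing Dictionary (stress digits stripped).
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


@dataclass
class Entry:
    word: str
    phonemes: tuple
    frequency: float  # hypothetical stand-in for a psycholinguistic norm


def cv_shape(phonemes):
    """Map each phoneme to C or V, ignoring stress digits on vowels."""
    return "".join(
        "V" if p.rstrip("012") in ARPABET_VOWELS else "C" for p in phonemes
    )


def select(entries, shape=None, pattern=None, min_frequency=None):
    """Return words that pass every filter that was supplied."""
    words = []
    for e in entries:
        if shape is not None and cv_shape(e.phonemes) != shape:
            continue
        # The pattern is a regex over the space-joined phoneme string.
        if pattern is not None and not re.search(pattern, " ".join(e.phonemes)):
            continue
        if min_frequency is not None and e.frequency < min_frequency:
            continue
        words.append(e.word)
    return words


ENTRIES = [
    Entry("cat", ("K", "AE1", "T"), 50.0),
    Entry("kit", ("K", "IH1", "T"), 20.0),
    Entry("stop", ("S", "T", "AA1", "P"), 80.0),
    Entry("tack", ("T", "AE1", "K"), 5.0),
]

if __name__ == "__main__":
    # CVC words beginning with /k/ and a frequency of at least 10.
    print(select(ENTRIES, shape="CVC", pattern=r"^K ", min_frequency=10))
```

Each filter is optional and they are ANDed together, which mirrors the idea of stacking constraints on one word list.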

Visit phonolex.com to use them in your browser. No installation required.

R&D workstreams (path to the closed loop)

The endgame is to automate the diagnostic-therapy-feedback cycle that SLPs currently run manually:

  • Audio Detection — diagnostic input + progress feedback. Transcriber model trained (F1 0.43, PER 0.093).
  • Curriculum Recommender — diagnostic profile → graded sequence of targets delivered through the live tools. Successor framing for the older "Content Catalog" concept.
  • Governed Generation — paused. The CSP-solver + reranker stack was retired in v5.2 in favor of corpus retrieval; returns when curricula need synthetic material the corpus can't supply.
  • Adaptive Loop — glue closing diagnostic → curriculum → feedback → re-recommendation.

Architecture

PhonoLex is a uv-workspace monorepo with three packages:

  • packages/data — shared Python data layer (loaders, phonology, runtime parquet + D1 SQL emit)
  • packages/features — learned phoneme feature vectors via Bayesian inference
  • packages/web, split into workers/ (Hono API on Cloudflare Workers + D1) and frontend/ (React + MUI SPA)

The deploy artifact is a single LFS-tracked d1-seed.sql. A developer builds it locally, and CI applies the seed via wrangler d1 execute --remote. No Python pipeline runs in CI.

See Architecture for the full system diagram.

API

PhonoLex provides a REST API at api.phonolex.com. See API reference for the route catalog.

License

Proprietary. Copyright (c) 2025-2026 Neumann's Workshop, LLC. All rights reserved. See License.