Technical Architecture¶
Technical details on PhonoLex's architecture and computational methods.
System overview¶
PhonoLex is a Hono API on Cloudflare Workers + D1 (SQLite at the edge) with a React + TypeScript + MUI frontend. Data is consolidated locally by a developer-facing Python pipeline and shipped as a single LFS-tracked d1-seed.sql artifact; CI applies the seed via wrangler d1 execute --remote. No Python pipeline runs in CI.
React frontend (TypeScript + MUI, vite + Cloudflare Pages)
↓ HTTPS / JSON
Hono backend (TypeScript, Cloudflare Workers)
↓ SQL queries
D1 (SQLite at the edge, 13 tables)
├── words ~125K phonology-bearing entries (is_canonical flag marks the ~47K content-POS subset)
├── word_properties ~47K canonical norm rows (~150 columns surfaced)
├── word_freq_bands ~47K raw freq-band cols (not surfaced)
├── word_percentiles ~47K per-property percentile ranks
├── edges ~1.6M word-similarity edges (Qwensim 99.8% + ECCC + WordSim)
├── pairs 642K precomputed minimal pairs (feature_distance + sonorant_diff)
├── corpus_sentences_index ~236K sentence headers (text, sources[], rarity_score, content_lemma_sig)
├── corpus_sentences ~2.1M (sentence_id, surface, is_content) membership rows
├── phonemes / phoneme_dots / components / word_syllables (similarity infrastructure)
└── metadata key/value config
Consolidated data¶
Word properties (~150 columns)¶
Each canonical content-POS word carries properties across the following categories:
| Category | Columns | Sources |
|---|---|---|
| Phonological Complexity | syllable_count, phoneme_count, wcm_score, cv_shape | CMU dict + Stoel-Gammon (2010) WCM |
| Phonotactic Probability | phono_prob_avg, positional_prob_avg, stressed variants, neighborhood_density | Method: Vitevitch & Luce (2004); computed locally from CMU |
| Lexical Frequency | frequency, log_frequency, contextual_diversity | PhonoLex derivation from FineWeb-Edu |
| Developmental Frequency | freq_age_2y / 5y / 8y / 12y (child PRODUCTION), freq_age_all (alias for frequency) | PhonoLex from CHILDES + PhonBank |
| Child-Corpus Frequency | freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13 | CYP-LEX (Korochkina et al., 2024, CC BY 4.0) |
| Lexical Timing | aoa | PhonoLex in-house gpt-4.1-mini cloze, 1-7 age-banded (Spearman 0.868 vs Glasgow) |
| Semantic | concreteness, familiarity, imageability, boi, iconicity, socialness, semantic_diversity, semd_* | PhonoLex (in-house gpt-4.1-mini cloze) |
| Affective | valence, arousal | PhonoLex (in-house, Warriner-scale anchor) |
| Morphological | morpheme_count, n_prefixes, n_suffixes, is_monomorphemic | Algorithmic + MorphyNet (CC BY-SA 3.0) |
The in-house norm columns are derived locally via gpt-4.1-mini cloze-prompt ratings. The original-author papers cited above are scale anchors — values served are PhonoLex's, validated against held-out oracles. See Data & Methods for full per-column provenance.
Word similarity graph (1.6M edges)¶
| Source | Edges | What it measures |
|---|---|---|
| Qwensim | ~1.63M | PhonoLex — Qwen3-Embedding-0.6B cosine over FineWeb-Edu. Bulk of the graph. |
| ECCC | ~2.5K | Perceptual confusability in noise (Marxer et al., 2016, CC BY 4.0) |
| WordSim-353 | ~351 | Human-rated semantic relatedness (Finkelstein et al., 2001) |
This is neural-embedding similarity, not free-association norm data (the USF / SWOW / MEN / SPP / SimLex datasets were retired in the licensing audit).
Phonological feature vectors¶
The 41-phoneme CMU inventory is represented in phonemes / phoneme_dots / components tables as 26-d Bayesian posterior vectors learned from theory-assigned priors (Hayes 2009) + ECCC perceptual confusion evidence + Hillenbrand vowel acoustic measurements. Posteriors achieve r=0.987 cosine correlation vs theory-assigned features at convergence — see packages/features/.
Curated corpus (~236K sentences)¶
The Sentences tool's source corpus: CoLA + UD English-EWT + GUM + Tatoeba + OpenSubtitles, run through a multi-stage build pipeline (vocab coverage, profanity filter, V/A + AFINN content-safety gate, PROPN cap, parataxis rejection, contraction-stem glue, cross-source identical-text merge with aggressive normalization, coverage-aware rarity dedup). CHILDES + PhonBank conversational transcripts were retired 2026-05-25 — they remain in use only for the developmental-frequency derivations above.
Hono backend (Cloudflare Workers + D1)¶
Build pipeline (developer-local)¶
- build_lexical_database in packages/data/src/phonolex_data/pipeline/ reads source TSVs (CMU + in-house phonolex_*.tsv)
- derived.py computes per-property percentiles (frequency-class columns treat 0 as NULL), phoneme dot products, syllable components, minimal pairs
- emit_parquet.py writes data/runtime/{words,pairs,corpus_sentences,corpus_sentences_index}.parquet (gitignored local cache)
- emit_d1_sql.py writes the LFS-tracked packages/web/workers/scripts/d1-seed.sql
- chunk-seed-sql.py splits the ~550 MB seed into 40 MB chunks for wrangler d1 execute --file (D1 statement size cap)
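The "treat 0 as NULL" rule in the percentile step can be illustrated with a small sketch. The function name, the `zero_is_null` flag, and the definition of a percentile as "share of non-missing values less than or equal to this one" are assumptions made for illustration; derived.py may rank differently.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def percentile_ranks(
    values: Sequence[Optional[float]], zero_is_null: bool = False
) -> List[Optional[float]]:
    """Return a 0-100 percentile rank per value, or None where the value is missing."""

    def is_missing(v: Optional[float]) -> bool:
        # Frequency-class columns pass zero_is_null=True so that a 0 count
        # is handled like an absent measurement rather than a real low value.
        return v is None or (zero_is_null and v == 0)

    present = sorted(v for v in values if not is_missing(v))
    if not present:
        return [None] * len(values)

    ranks: List[Optional[float]] = []
    for v in values:
        if is_missing(v):
            ranks.append(None)
        else:
            # Share of non-missing values that are <= v (hypothetical definition).
            ranks.append(100.0 * bisect_right(present, v) / len(present))
    return ranks
```

With `zero_is_null=True`, a zero neither receives a rank nor drags down the ranks of the remaining values, which is the practical effect of treating it as NULL.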
Cold start¶
The similarity engine loads phoneme norms (~45 entries), pairwise phoneme dot products (~990), and component phoneme sequences (~600) into Worker isolate memory on cold start. All other routes query D1 directly.
API endpoints¶
| Method | Path | Description |
|---|---|---|
| GET | /api/health | Health check |
| GET | /api/stats | Vocabulary statistics |
| GET | /api/property-metadata | Property definitions, categories, labels (drives the frontend filter UI) |
| GET | /api/edge-types | Word-similarity edge type definitions |
| GET | /api/words/{word} | Word detail with full property profile |
| POST | /api/words/search | Unified search (patterns + filters + cv_shape + similar_to + bounds + pagination) |
| POST | /api/words/word-list | Flat word list for a constraint set |
| POST | /api/words/batch | Batch word lookup (up to 1000 words) |
| POST | /api/similarity/search | Phonological similarity with adjustable weights |
| POST | /api/sentences | Sentence retrieval (pattern + cv_shape + bound + contrastive constraints; returns match_count + rarity_score + per-word highlights) |
| GET | /api/associations/{word} | Word-similarity neighbors by edge type |
| GET | /api/associations/{word}/confusability | ECCC perceptual confusability |
| GET | /api/phonemes, /{ipa}, POST /compare, POST /search | Phoneme inventory + feature lookups |
| POST | /api/contrastive/{minimal-pairs,maximal-opposition/*,multiple-opposition/*} | Pair-graph predicates served from pairs table |
| POST | /api/text/analyze | Per-passage analysis with percentile aggregation |
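Because /api/words/batch accepts up to 1000 words per call, a client with a longer list has to split it. The sketch below only builds the requests and does not send them. The host name and the `words` key in the JSON body are assumptions, since the request schema is not documented on this page.

```python
import json
from urllib.request import Request

BASE_URL = "https://phonolex.example"  # hypothetical host, replace with the real deployment
MAX_BATCH = 1000  # documented cap for /api/words/batch


def build_batch_requests(words, base_url=BASE_URL):
    """Split `words` into chunks of at most MAX_BATCH and build one POST per chunk."""
    requests = []
    for start in range(0, len(words), MAX_BATCH):
        chunk = words[start:start + MAX_BATCH]
        # The {"words": [...]} body shape is an assumption, not the documented schema.
        body = json.dumps({"words": chunk}).encode("utf-8")
        requests.append(
            Request(
                f"{base_url}/api/words/batch",
                data=body,
                headers={"Content-Type": "application/json"},
                method="POST",
            )
        )
    return requests
```

Each returned object can be passed to `urllib.request.urlopen` once the real host and body schema are confirmed.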
Property metadata system¶
Property definitions are served via GET /api/property-metadata, providing per-column ID, label, short label, category grouping, source, scale description, interpretation, display format, and filter configuration (slider step, log-scale flag, integer flag). The frontend loads this once at startup and drives all UI (filter sliders, table columns, bound pickers) from the metadata — no hardcoded property lists in TypeScript.
Phonological similarity¶
PhonoLex uses two-level soft Levenshtein with precomputed phoneme-level dot products.
Phoneme-level precomputation¶
At build time, the pipeline computes:
- Norm squares (||v||²) for each of the 41 CMU phonemes
- Pairwise dot products between all phoneme pairs (~820 pairs)
- Component phoneme sequences (e.g., onset /kr/) preserving cluster structure — no vector averaging
Runtime similarity¶
Level 1 — Component similarity (within onset, nucleus, or coda):
Soft Levenshtein DP on phoneme sequences. Substitution cost = 1 - cosine(phoneme_a, phoneme_b) using precomputed norms and dot products. Cluster length penalties (e.g., /kr/ vs /k/ onset) emerge naturally from the DP.
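A minimal sketch of Level 1, assuming insertion and deletion cost 1.0 and normalization by the longer sequence (both mirror the word-level formula and are not confirmed for the component level). The lookup tables stand in for the precomputed norm squares and dot products; the toy values are invented and are not PhonoLex's learned vectors.

```python
import math

# Toy precomputed tables (invented values for illustration only).
TOY_NORM_SQ = {"K": 1.0, "G": 1.0, "R": 1.0}
TOY_DOTS = {("K", "G"): 0.8, ("K", "R"): 0.0, ("G", "R"): 0.6}


def make_cosine(norm_sq, dots):
    """Build a cosine lookup from ||v||^2 values and unordered pairwise dot products."""

    def cosine(a, b):
        if a == b:
            return 1.0
        dot = dots.get((a, b), dots.get((b, a)))
        return dot / math.sqrt(norm_sq[a] * norm_sq[b])

    return cosine


def comp_sim(seq1, seq2, cosine):
    """Soft Levenshtein similarity between two phoneme sequences (e.g. two onsets)."""
    n, m = len(seq1), len(seq2)
    if n == 0 and m == 0:
        return 1.0  # assumption: two empty components count as a perfect match
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1.0 - cosine(seq1[i - 1], seq2[j - 1])
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,      # deletion
                dp[i][j - 1] + 1.0,      # insertion
                dp[i - 1][j - 1] + sub,  # soft substitution
            )
    return 1.0 - dp[n][m] / max(n, m)
```

With the toy tables, `comp_sim(["K", "R"], ["K"], make_cosine(TOY_NORM_SQ, TOY_DOTS))` evaluates to 0.5: the shared /k/ aligns at zero cost and the extra /r/ costs one deletion over a length of two. This is the cluster length penalty emerging from the DP.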
Level 2 — Word similarity (across syllable sequences):
Syllable similarity = weighted average of component similarities:
syl_sim = (w_o · compSim(o1, o2) + w_n · compSim(n1, n2) + w_c · compSim(c1, c2))
/ (w_o + w_n + w_c)
Soft Levenshtein DP on syllable sequences with substitution cost = 1 - syl_sim:
DP[i][j] = min(
    DP[i-1][j] + 1.0,                          # deletion
    DP[i][j-1] + 1.0,                          # insertion
    DP[i-1][j-1] + (1 - syl_sim(s1_i, s2_j))   # substitution
)
word_similarity = 1 - (DP[n][m] / max(n, m))
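The recurrence above can be written out as a runnable sketch. Syllables are modeled as (onset, nucleus, coda) tuples and the component similarity is passed in as a function; `exact_match` is a stand-in used only to keep the example self-contained, not the phoneme-level measure PhonoLex uses.

```python
def syl_sim(s1, s2, weights, comp_sim):
    """Weighted average of onset, nucleus, and coda similarities.

    Assumes at least one weight is non-zero.
    """
    total = sum(weights)
    score = sum(w * comp_sim(c1, c2) for w, c1, c2 in zip(weights, s1, s2))
    return score / total


def word_similarity(syls1, syls2, weights, comp_sim):
    """Soft Levenshtein over syllable sequences, normalized by the longer word."""
    n, m = len(syls1), len(syls2)
    if max(n, m) == 0:
        return 1.0  # assumption: two empty words are identical
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1.0 - syl_sim(syls1[i - 1], syls2[j - 1], weights, comp_sim)
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,      # deletion
                dp[i][j - 1] + 1.0,      # insertion
                dp[i - 1][j - 1] + sub,  # substitution
            )
    return 1.0 - dp[n][m] / max(n, m)


def exact_match(c1, c2):
    """Stand-in component similarity: 1.0 if identical, else 0.0."""
    return 1.0 if c1 == c2 else 0.0
```

For two one-syllable words that differ only in the onset, the balanced weights give a similarity of about 0.67, while weights of (0.0, 0.5, 0.5) give exactly 1.0 because the onset no longer contributes.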
Weight presets¶
| Preset | Onset | Nucleus | Coda |
|---|---|---|---|
| Balanced | 0.33 | 0.33 | 0.33 |
| Rhymes | 0.0 | 0.5 | 0.5 |
| Alliteration | 1.0 | 0.5 | 0.0 |
| Assonance | 0.0 | 1.0 | 0.0 |
| Consonance | 0.5 | 0.0 | 0.5 |
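As a worked example, the presets can be applied to one fixed set of component similarities. The values (onset 0.0, nucleus 1.0, coda 1.0) are illustrative and correspond to two syllables that share everything except the onset; the dictionary keys and function names are hypothetical.

```python
# Weights copied from the preset table, in (onset, nucleus, coda) order.
PRESETS = {
    "balanced": (0.33, 0.33, 0.33),
    "rhymes": (0.0, 0.5, 0.5),
    "alliteration": (1.0, 0.5, 0.0),
    "assonance": (0.0, 1.0, 0.0),
    "consonance": (0.5, 0.0, 0.5),
}


def weighted_syllable_similarity(component_sims, weights):
    """Weighted average of component similarities, normalized by the weight sum."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, component_sims)) / total


def score_all_presets(component_sims):
    """Apply every preset to one (onset, nucleus, coda) similarity triple."""
    return {
        name: weighted_syllable_similarity(component_sims, weights)
        for name, weights in PRESETS.items()
    }
```

For (0.0, 1.0, 1.0) this yields roughly 0.67 for balanced, 1.0 for rhymes and assonance, 0.5 for consonance, and about 0.33 for alliteration. The alliteration weights sum to 1.5, so dividing by the weight sum is what keeps its score on the same 0-1 scale as the others.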
Contrast sets¶
Minimal pairs¶
Precomputed at build time. Two words differing by exactly one phoneme at the same position. The pairs D1 table carries 642K rows with feature_distance (continuous L2 over learned vectors) + sonorant_diff (whether the contrast crosses the sonorant class) + is_canonical (whether both members are content-POS).
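One way to find such pairs without comparing every word against every other is to bucket words by their phoneme sequence with one position masked. This is a sketch of that general technique, not necessarily how the PhonoLex pipeline does it; the lexicon format (word mapped to a tuple of phonemes) is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def find_minimal_pairs(lexicon):
    """Return (word_a, word_b, position) for words differing in exactly one phoneme.

    `lexicon` maps each word to a tuple of phoneme symbols.
    """
    buckets = defaultdict(list)
    for word, phones in lexicon.items():
        for i in range(len(phones)):
            # Mask position i; words sharing this key match everywhere else.
            key = phones[:i] + (None,) + phones[i + 1:]
            buckets[key].append(word)

    pairs = []
    for key, words in buckets.items():
        pos = key.index(None)
        for a, b in combinations(sorted(words), 2):
            # Homophones share every bucket but do not contrast, so skip them.
            if lexicon[a][pos] != lexicon[b][pos]:
                pairs.append((a, b, pos))
    return pairs
```

Words of different lengths never share a masked key, so the "same position" requirement holds by construction.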
Maximal opposition (Gierut, 1989)¶
Filters pairs on sonorant_diff >= threshold. Contrasts that cross a major class (sonorant vs. obstruent) are predicted to produce broader generalization across the phonological system.
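The filter can be sketched over in-memory rows as follows. The row shape (dicts carrying the sonorant_diff and feature_distance columns) and the ordering by descending feature_distance are assumptions for illustration; the actual route runs against the pairs table in D1.

```python
def maximal_oppositions(pairs, threshold=1):
    """Keep pairs whose sonorant_diff meets the threshold, widest contrast first."""
    kept = [p for p in pairs if p["sonorant_diff"] >= threshold]
    # Sorting by feature_distance is an illustrative choice, not documented behavior.
    return sorted(kept, key=lambda p: p["feature_distance"], reverse=True)
```

With a threshold of 1, a pair such as /m/ vs. /s/ (sonorant vs. obstruent) would pass, while /p/ vs. /b/ (both obstruents) would be filtered out.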
Multiple opposition (Williams, 2000)¶
A substitute phoneme contrasted with N target phonemes. Surfaced at the word level (Contrast Sets tool). The /api/sentences contrastive_multopp constraint exists but is not surfaced in the Sentences UI — a single sentence almost never witnesses one substitute against ≥2 distinct target phonemes.
Sentences route ranking¶
Tiered globally by per-query match_count DESC (multi-hit > single-hit, regardless of source), then source-interleaved within tier by static rarity_score. Bound rules use NULL-fail semantics for percentile columns, NULL-pass for raw norms. Contrastive constraints (minpair / maxopp / multopp) apply via self-join through pairs — the sentence must contain BOTH members of at least one witness pair. Per-result highlights payload returns the surfaces that earned the slot, used by the frontend overlay.
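The tier-then-interleave ordering might look like the sketch below. Several details are assumptions: each row carries a single source (the index stores a sources[] array), higher rarity_score ranks first within a source, and sources take turns in alphabetical order.

```python
from collections import defaultdict
from itertools import zip_longest


def rank_sentences(rows):
    """Order rows by match_count tier, then round-robin across sources within a tier."""
    tiers = defaultdict(list)
    for row in rows:
        tiers[row["match_count"]].append(row)

    ranked = []
    for match_count in sorted(tiers, reverse=True):  # multi-hit tiers first
        by_source = defaultdict(list)
        for row in tiers[match_count]:
            by_source[row["source"]].append(row)
        # One queue per source, best rarity_score first (assumed direction).
        queues = [
            sorted(group, key=lambda r: r["rarity_score"], reverse=True)
            for _, group in sorted(by_source.items())
        ]
        # Take one row from each source per round until all queues are empty.
        for round_rows in zip_longest(*queues):
            ranked.extend(r for r in round_rows if r is not None)
    return ranked
```

Interleaving keeps a large source from filling the top of a tier by itself, while the tier boundary guarantees that a multi-hit sentence always outranks a single-hit one.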
Performance¶
| Operation | Notes |
|---|---|
| Word lookup | D1 primary-key lookup |
| Property filter | SQL WHERE with indexed columns |
| Similarity search | Cold start loads the precomputed phoneme norms and dot products; full-vocab scan with scalar math |
| Edge query | D1 indexed lookup on source column |
| Text analysis | Batch D1 lookup + percentile aggregation |
| Sentence retrieval | D1 CTE pipeline (include/exclude/bound CTEs + INNER-JOIN; tens of ms typical) |
Deployment¶
Cloudflare Workers¶
cd packages/web/workers
npx wrangler deploy
# Apply seed to remote D1 (CI does this from the LFS-tracked seed)
for f in scripts/d1-chunks/chunk_*.sql; do
npx wrangler d1 execute phonolex --remote --file "$f"
done
Development¶
The D1 seed SQL is LFS-tracked; with Git LFS installed, it is pulled automatically on clone.
# One-time: chunk the seed + apply to local D1
cd packages/web/workers
uv run python scripts/chunk-seed-sql.py
for f in scripts/d1-chunks/chunk_*.sql; do
npx wrangler d1 execute phonolex --local --file "$f"
done
# Run the API
npx wrangler dev
# Frontend (React + MUI, separate terminal)
cd ../frontend && npm install && npm run dev
Technical limitations¶
Dialect coverage¶
General American English only (CMU primary pronunciations). Not supported: British English, regional dialects, pronunciation variants.
Property coverage¶
Coverage is uniform (~100%) within the canonical content-POS subset for the in-house columns. Frequency-class columns vary by source corpus (freq_age_* is sparse; freq_cyplex_* covers ~80%; frequency ~90%). Filtering by a column with NULL coverage excludes the missing rows; percentile bounds use NULL-fail semantics.
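The effect of the NULL policy on a bound can be shown with a small predicate. The function and parameter names are hypothetical; the real filters are expressed in SQL against D1.

```python
def passes_bound(value, lo=None, hi=None, null_passes=False):
    """Check lo <= value <= hi, with an explicit policy for missing values."""
    if value is None:
        # NULL-fail (the default here) drops rows with no value;
        # NULL-pass lets them through the filter.
        return null_passes
    if lo is not None and value < lo:
        return False
    if hi is not None and value > hi:
        return False
    return True
```

Under NULL-fail, a word with no percentile for a sparse column such as freq_age_2y is excluded as soon as that column is bounded, so bounding sparse columns can shrink the result set sharply.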
Syllabification¶
Rule-based English phonotactic constraints. May not optimally handle loanwords, proper nouns with unusual structures, or ambisyllabic consonants.
References¶
See Data & Methods and Citations for the full bibliography.
Key methodology papers:
- Stoel-Gammon, C. (2010). Word Complexity Measure. Clinical Linguistics & Phonetics, 24(4-5), 271-282.
- Vitevitch, M. S., & Luce, P. A. (2004). Phonotactic probability method. Behavior Research Methods, 36(3), 481-487.
- Hayes, B. (2009). Introductory Phonology (theory-assigned features as Bayesian prior).
- Marxer et al. (2016) ECCC (Bayesian evidence + perceptual confusability edges).
- Hillenbrand et al. (1995) American English vowel acoustics (Bayesian evidence for vowel posteriors).
- Levenshtein, V. I. (1966). Edit distance, the basis for the soft-Levenshtein variant used here.
- Martínez, G., Conde, J., Reviriego, P., & Brysbaert, M. (2025). Using LLMs to generate psycholinguistic norms — methodology validation for the in-house derivations.
- Brysbaert, M. (2024). Validating LLM-derived psycholinguistic ratings — companion methodology paper.
- Gierut, J. A. (1989). Maximal opposition approach. JSHD, 54(1), 9-19.
- Williams, A. L. (2000). Multiple oppositions. AJSLP, 9(4), 282-288.
- Storkel, H. L. (2022). Minimal, Maximal, or Multiple Oppositions. LSHSS, 53(3), 632-645.