
Technical Architecture

Technical details on PhonoLex's architecture and computational methods.

System overview

PhonoLex is a Hono API on Cloudflare Workers + D1 (SQLite at the edge) with a React + TypeScript + MUI frontend. Data is consolidated locally by a developer-facing Python pipeline and shipped as a single LFS-tracked d1-seed.sql artifact; CI applies the seed via wrangler d1 execute --remote. No Python pipeline runs in CI.

React frontend (TypeScript + MUI, vite + Cloudflare Pages)
 ↓ HTTPS / JSON
Hono backend (TypeScript, Cloudflare Workers)
 ↓ SQL queries
D1 (SQLite at the edge, 13 tables)
 ├── words ~125K phonology-bearing entries (is_canonical flag marks the ~47K content-POS subset)
 ├── word_properties ~47K canonical norm rows (~150 columns surfaced)
 ├── word_freq_bands ~47K raw freq-band cols (not surfaced)
 ├── word_percentiles ~47K per-property percentile ranks
 ├── edges ~1.6M word-similarity edges (Qwensim 99.8% + ECCC + WordSim)
 ├── pairs 642K precomputed minimal pairs (feature_distance + sonorant_diff)
 ├── corpus_sentences_index ~236K sentence headers (text, sources[], rarity_score, content_lemma_sig)
 ├── corpus_sentences ~2.1M (sentence_id, surface, is_content) membership rows
 ├── phonemes / phoneme_dots / components / word_syllables (similarity infrastructure)
 └── metadata key/value config

Consolidated data

Word properties (~150 columns)

Each canonical content-POS word carries properties across the following categories:

| Category | Columns | Sources |
| --- | --- | --- |
| Phonological Complexity | syllable_count, phoneme_count, wcm_score, cv_shape | CMU dict + Stoel-Gammon (2010) WCM |
| Phonotactic Probability | phono_prob_avg, positional_prob_avg, stressed variants, neighborhood_density | Method: Vitevitch & Luce (2004); computed locally from CMU |
| Lexical Frequency | frequency, log_frequency, contextual_diversity | PhonoLex derivation from FineWeb-Edu |
| Developmental Frequency | freq_age_2y / 5y / 8y / 12y (child PRODUCTION), freq_age_all (alias for frequency) | PhonoLex from CHILDES + PhonBank |
| Child-Corpus Frequency | freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13 | CYP-LEX (Korochkina et al., 2024, CC BY 4.0) |
| Lexical Timing | aoa | PhonoLex in-house gpt-4.1-mini cloze, 1-7 age-banded (Spearman 0.868 vs Glasgow) |
| Semantic | concreteness, familiarity, imageability, boi, iconicity, socialness, semantic_diversity, semd_* | PhonoLex (in-house gpt-4.1-mini cloze) |
| Affective | valence, arousal | PhonoLex (in-house, Warriner-scale anchor) |
| Morphological | morpheme_count, n_prefixes, n_suffixes, is_monomorphemic | Algorithmic + MorphyNet (CC BY-SA 3.0) |

The in-house norm columns are derived locally via gpt-4.1-mini cloze-prompt ratings. The original-author papers cited above are scale anchors — values served are PhonoLex's, validated against held-out oracles. See Data & Methods for full per-column provenance.

Word similarity graph (1.6M edges)

| Source | Edges | What it measures |
| --- | --- | --- |
| Qwensim | ~1.63M | PhonoLex: Qwen3-Embedding-0.6B cosine over FineWeb-Edu. Bulk of the graph. |
| ECCC | ~2.5K | Perceptual confusability in noise (Marxer et al., 2016, CC BY 4.0) |
| WordSim-353 | ~351 | Human-rated semantic relatedness (Finkelstein et al., 2001) |

This is neural-embedding similarity, not free-association norm data (the USF / SWOW / MEN / SPP / SimLex datasets were retired in the licensing audit).

Phonological feature vectors

The 41-phoneme CMU inventory is represented in phonemes / phoneme_dots / components tables as 26-d Bayesian posterior vectors learned from theory-assigned priors (Hayes 2009) + ECCC perceptual confusion evidence + Hillenbrand vowel acoustic measurements. Posteriors achieve r=0.987 cosine correlation vs theory-assigned features at convergence — see packages/features/.

Curated corpus (~236K sentences)

The Sentences tool's source corpus: CoLA + UD English-EWT + GUM + Tatoeba + OpenSubtitles, run through a multi-stage build pipeline (vocab coverage, profanity filter, V/A + AFINN content-safety gate, PROPN cap, parataxis rejection, contraction-stem glue, cross-source identical-text merge with aggressive normalization, coverage-aware rarity dedup). CHILDES + PhonBank conversational transcripts were retired 2026-05-25 — they remain in use only for the developmental-frequency derivations above.

Hono backend (Cloudflare Workers + D1)

Build pipeline (developer-local)

  1. build_lexical_database in packages/data/src/phonolex_data/pipeline/ reads source TSVs (CMU + in-house phonolex_*.tsv)
  2. derived.py computes per-property percentiles (frequency-class columns treat 0 as NULL), phoneme dot products, syllable components, minimal pairs
  3. emit_parquet.py writes data/runtime/{words,pairs,corpus_sentences,corpus_sentences_index}.parquet (gitignored local cache)
  4. emit_d1_sql.py writes the LFS-tracked packages/web/workers/scripts/d1-seed.sql
  5. chunk-seed-sql.py splits the ~550 MB seed into 40 MB chunks for wrangler d1 execute --file (D1 statement size cap)
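The chunking in step 5 can be pictured with a minimal sketch. This is not the contents of chunk-seed-sql.py; the function name, the statement-per-string input, and the byte accounting are assumptions made for illustration. It groups whole SQL statements so that no chunk exceeds a byte budget unless a single statement is itself larger than the budget.

```python
from typing import Iterable, Iterator


def chunk_statements(statements: Iterable[str], max_bytes: int) -> Iterator[list[str]]:
    """Group whole SQL statements into chunks of at most max_bytes (UTF-8).

    Hypothetical helper: a statement larger than max_bytes is emitted
    alone in its own oversized chunk rather than being split.
    """
    chunk: list[str] = []
    size = 0
    for stmt in statements:
        stmt_size = len(stmt.encode("utf-8")) + 1  # +1 for the joining newline
        # Close the current chunk when this statement would push it over the cap.
        if chunk and size + stmt_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(stmt)
        size += stmt_size
    if chunk:
        yield chunk


demo = [
    "INSERT INTO words VALUES ('cat');",
    "INSERT INTO words VALUES ('dog');",
    "INSERT INTO words VALUES ('fish');",
]
chunks = list(chunk_statements(demo, max_bytes=70))
```

Splitting only at statement boundaries keeps every chunk file independently executable, which is what a per-file wrangler d1 execute loop needs.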

Cold start

The similarity engine loads phoneme norms (~45 entries), pairwise phoneme dot products (~990), and component phoneme sequences (~600) into Worker isolate memory on cold start. All other routes query D1 directly.

API endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/health | Health check |
| GET | /api/stats | Vocabulary statistics |
| GET | /api/property-metadata | Property definitions, categories, labels (drives the frontend filter UI) |
| GET | /api/edge-types | Word-similarity edge type definitions |
| GET | /api/words/{word} | Word detail with full property profile |
| POST | /api/words/search | Unified search (patterns + filters + cv_shape + similar_to + bounds + pagination) |
| POST | /api/words/word-list | Flat word list for a constraint set |
| POST | /api/words/batch | Batch word lookup (up to 1000 words) |
| POST | /api/similarity/search | Phonological similarity with adjustable weights |
| POST | /api/sentences | Sentence retrieval (pattern + cv_shape + bound + contrastive constraints; returns match_count + rarity_score + per-word highlights) |
| GET | /api/associations/{word} | Word-similarity neighbors by edge type |
| GET | /api/associations/{word}/confusability | ECCC perceptual confusability |
| GET | /api/phonemes, /{ipa}, POST /compare, POST /search | Phoneme inventory + feature lookups |
| POST | /api/contrastive/{minimal-pairs,maximal-opposition/*,multiple-opposition/*} | Pair-graph predicates served from pairs table |
| POST | /api/text/analyze | Per-passage analysis with percentile aggregation |
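Because POST /api/words/batch accepts at most 1000 words, a client with a longer list has to split it. The sketch below shows one way to build the request bodies; the "words" JSON key is an assumption, since the request schema is not documented here, and no network call is made.

```python
import json
from typing import Iterator

BATCH_LIMIT = 1000  # cap stated for POST /api/words/batch


def batch_payloads(words: list[str], limit: int = BATCH_LIMIT) -> Iterator[str]:
    """Yield JSON bodies that each carry at most `limit` words."""
    for start in range(0, len(words), limit):
        # The "words" key is hypothetical; check the real request schema.
        yield json.dumps({"words": words[start:start + limit]})


vocab = [f"w{i}" for i in range(2500)]
payloads = list(batch_payloads(vocab))
sizes = [len(json.loads(p)["words"]) for p in payloads]
```

Each yielded string can be sent as the body of one request by whatever HTTP client the caller already uses.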

Property metadata system

Property definitions are served via GET /api/property-metadata, providing per-column ID, label, short label, category grouping, source, scale description, interpretation, display format, and filter configuration (slider step, log-scale flag, integer flag). The frontend loads this once at startup and drives all UI (filter sliders, table columns, bound pickers) from the metadata — no hardcoded property lists in TypeScript.

Phonological similarity

PhonoLex uses two-level soft Levenshtein with precomputed phoneme-level dot products.

Phoneme-level precomputation

At build time, the pipeline computes:

  • Norm squares (||v||²) for each of the 41 CMU phonemes
  • Pairwise dot products between all phoneme pairs (~820 pairs)
  • Component phoneme sequences (e.g., onset /kr/) preserving cluster structure — no vector averaging
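As a rough illustration of the first two items, the sketch below precomputes norm squares and pairwise dot products for a toy inventory, then recovers cosine similarity from those two tables alone. The 3-d vectors and the dictionary layout are invented for the example; the real vectors are 26-d and the production code runs in the TypeScript Worker. For n phonemes there are n(n-1)/2 unordered pairs, which gives 820 for n = 41.

```python
from itertools import combinations
from math import sqrt

# Toy 3-d feature vectors (invented values, not PhonoLex data).
VECTORS = {
    "p": [1.0, 0.0, 0.0],
    "b": [1.0, 1.0, 0.0],
    "m": [0.0, 1.0, 1.0],
}


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Norm squares ||v||^2, one entry per phoneme.
NORM_SQ = {p: dot(v, v) for p, v in VECTORS.items()}

# One dot product per unordered pair of distinct phonemes, keyed in sorted order.
DOTS = {
    tuple(sorted((a, b))): dot(VECTORS[a], VECTORS[b])
    for a, b in combinations(VECTORS, 2)
}


def cosine(a: str, b: str) -> float:
    """Cosine similarity from the precomputed tables only."""
    if a == b:
        return 1.0
    return DOTS[tuple(sorted((a, b)))] / sqrt(NORM_SQ[a] * NORM_SQ[b])
```

With both tables in memory, a substitution cost needs one lookup, one multiplication, and one square root, and never touches the raw vectors.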

Runtime similarity

Level 1 — Component similarity (within onset, nucleus, or coda):

Soft Levenshtein DP on phoneme sequences. Substitution cost = 1 - cosine(phoneme_a, phoneme_b) using precomputed norms and dot products. Cluster length penalties (e.g., /kr/ vs /k/ onset) emerge naturally from the DP.
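A minimal sketch of Level 1, under stated assumptions: the cosine table is a toy stand-in, the normalization `1 - distance / max(len)` mirrors the word-level formula rather than anything documented for components, and two empty components (for example, two syllables without a coda) are treated as identical.

```python
def soft_levenshtein(seq_a, seq_b, sub_cost):
    """Edit distance with unit insert/delete cost and a graded substitution cost."""
    n, m = len(seq_a), len(seq_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)  # delete everything in seq_a
    for j in range(1, m + 1):
        dp[0][j] = float(j)  # insert everything in seq_b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,
                dp[i][j - 1] + 1.0,
                dp[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]),
            )
    return dp[n][m]


# Invented cosine value for illustration; unlisted pairs default to 0.0.
TOY_COSINE = {("g", "k"): 0.9}


def phoneme_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return 1.0 - TOY_COSINE.get(tuple(sorted((a, b))), 0.0)


def comp_sim(a: list[str], b: list[str]) -> float:
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty components match exactly
    return 1.0 - soft_levenshtein(a, b, phoneme_cost) / longest
```

In this sketch, comparing the onsets /kr/ and /k/ costs one deletion, so the similarity is 0.5 with no separate cluster-length term.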

Level 2 — Word similarity (across syllable sequences):

Syllable similarity = weighted average of component similarities:

syl_sim = (w_o · compSim(o1, o2) + w_n · compSim(n1, n2) + w_c · compSim(c1, c2))
 / (w_o + w_n + w_c)

Soft Levenshtein DP on syllable sequences with substitution cost = 1 - syl_sim:

DP[i][j] = min(
 DP[i-1][j] + 1.0, # deletion
 DP[i][j-1] + 1.0, # insertion
 DP[i-1][j-1] + (1 - syl_sim(s1_i, s2_j)) # substitution
)

word_similarity = 1 - (DP[n][m] / max(n, m))
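The recurrence above omits its base cases; the sketch below fills them in with the usual edit-distance initialization (`DP[i][0] = i`, `DP[0][j] = j`), which is an assumption. To stay self-contained it uses an exact-match stand-in for component similarity, and syllables are represented as hypothetical `(onset, nucleus, coda)` tuples of phoneme tuples.

```python
def exact_comp_sim(a, b) -> float:
    # Stand-in for Level 1: 1.0 on exact match, else 0.0.
    return 1.0 if a == b else 0.0


def syl_sim(s1, s2, weights, comp_sim=exact_comp_sim) -> float:
    """Weighted average of onset, nucleus, and coda similarity."""
    w_o, w_n, w_c = weights
    weighted = (
        w_o * comp_sim(s1[0], s2[0])
        + w_n * comp_sim(s1[1], s2[1])
        + w_c * comp_sim(s1[2], s2[2])
    )
    return weighted / (w_o + w_n + w_c)


def word_similarity(syls_a, syls_b, weights, comp_sim=exact_comp_sim) -> float:
    """Soft Levenshtein over syllables; assumes each word has at least one syllable."""
    n, m = len(syls_a), len(syls_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1.0 - syl_sim(syls_a[i - 1], syls_b[j - 1], weights, comp_sim)
            dp[i][j] = min(dp[i - 1][j] + 1.0, dp[i][j - 1] + 1.0, dp[i - 1][j - 1] + sub)
    return 1.0 - dp[n][m] / max(n, m)


CAT = [(("k",), ("æ",), ("t",))]
BAT = [(("b",), ("æ",), ("t",))]
rhyme_score = word_similarity(CAT, BAT, (0.0, 0.5, 0.5))
balanced_score = word_similarity(CAT, BAT, (0.33, 0.33, 0.33))
```

With onset weight 0.0 the differing onsets drop out and "cat"/"bat" score 1.0; with equal weights the same pair scores about 0.67 under the exact-match stand-in.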

Weight presets

| Preset | Onset | Nucleus | Coda |
| --- | --- | --- | --- |
| Balanced | 0.33 | 0.33 | 0.33 |
| Rhymes | 0.0 | 0.5 | 0.5 |
| Alliteration | 1.0 | 0.0 | 0.0 |
| Assonance | 0.0 | 1.0 | 0.0 |
| Consonance | 0.5 | 0.0 | 0.5 |

Contrast sets

Minimal pairs

Precomputed at build time. Two words differing by exactly one phoneme at the same position. The pairs D1 table carries 642K rows with feature_distance (continuous L2 over learned vectors) + sonorant_diff (whether the contrast crosses the sonorant class) + is_canonical (whether both members are content-POS).
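The minimal-pair definition and the feature_distance column can be sketched as follows. The function names and the list-of-phonemes representation are assumptions for illustration, not the pipeline's actual code.

```python
from math import sqrt
from typing import Optional, Sequence


def minimal_pair_position(a: Sequence[str], b: Sequence[str]) -> Optional[int]:
    """Return the index of the single differing phoneme, or None if not a minimal pair."""
    if len(a) != len(b):
        return None  # same-position contrast requires equal length
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def feature_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean (L2) distance between two phoneme feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


cat = ["k", "æ", "t"]
bat = ["b", "æ", "t"]
position = minimal_pair_position(cat, bat)
```

Here `position` is 0, the onset contrast /k/ vs /b/; the vectors of those two phonemes would then feed `feature_distance`.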

Maximal opposition (Gierut, 1989)

Filters pairs on sonorant_diff >= threshold. The major-class crossing predicts broader generalization across the phonological system.

Multiple opposition (Williams, 2000)

A substitute phoneme contrasted with N target phonemes. Surfaced at the word level (Contrast Sets tool). The /api/sentences contrastive_multopp constraint exists but is not surfaced in the Sentences UI — a single sentence almost never witnesses one substitute against ≥2 distinct target phonemes.
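One way to picture a multiple-opposition lookup over minimal-pair rows is sketched below. The `(word_a, word_b, phoneme_a, phoneme_b)` row shape and the function name are hypothetical, and the grouping ignores word-frame constraints that a clinical tool may impose.

```python
from collections import defaultdict

Row = tuple[str, str, str, str]  # hypothetical: (word_a, word_b, phoneme_a, phoneme_b)


def multiple_opposition_sets(pairs: list[Row], substitute: str, min_targets: int = 2):
    """Group pairs by target phoneme for one substitute; empty if too few targets."""
    targets: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for word_a, word_b, ph_a, ph_b in pairs:
        if ph_a == substitute:
            targets[ph_b].append((word_a, word_b))
        elif ph_b == substitute:
            # Normalize so the substitute word always comes first.
            targets[ph_a].append((word_b, word_a))
    return dict(targets) if len(targets) >= min_targets else {}


rows = [
    ("tip", "kip", "t", "k"),
    ("sip", "tip", "s", "t"),
    ("tip", "chip", "t", "tʃ"),
    ("bat", "pat", "b", "p"),
]
t_sets = multiple_opposition_sets(rows, "t")
```

In this toy data /t/ contrasts with three targets (/k/, /s/, /tʃ/), while /b/ has only one and so yields no set.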

Sentences route ranking

Tiered globally by per-query match_count DESC (multi-hit > single-hit, regardless of source), then source-interleaved within tier by static rarity_score. Bound rules use NULL-fail semantics for percentile columns, NULL-pass for raw norms. Contrastive constraints (minpair / maxopp / multopp) apply via self-join through pairs — the sentence must contain BOTH members of at least one witness pair. Per-result highlights payload returns the surfaces that earned the slot, used by the frontend overlay.
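The two-stage ordering can be sketched in a few lines. Several details here are assumptions: each sentence is reduced to a single `source` (the index stores a sources[] list), higher rarity_score is taken to rank first within a source, and interleaving is a plain round-robin.

```python
from collections import defaultdict
from itertools import zip_longest


def rank(sentences: list[dict]) -> list[dict]:
    """Tier by match_count DESC, then round-robin across sources within each tier."""
    tiers: dict[int, list[dict]] = defaultdict(list)
    for s in sentences:
        tiers[s["match_count"]].append(s)

    ranked: list[dict] = []
    for count in sorted(tiers, reverse=True):
        by_source: dict[str, list[dict]] = defaultdict(list)
        # Assumption: higher rarity_score comes first within a source.
        for s in sorted(tiers[count], key=lambda s: s["rarity_score"], reverse=True):
            by_source[s["source"]].append(s)
        # Take one sentence from each source in turn until all are used.
        for row in zip_longest(*by_source.values()):
            ranked.extend(s for s in row if s is not None)
    return ranked


demo = [
    {"id": 1, "source": "CoLA", "match_count": 1, "rarity_score": 0.9},
    {"id": 2, "source": "CoLA", "match_count": 2, "rarity_score": 0.1},
    {"id": 3, "source": "Tatoeba", "match_count": 1, "rarity_score": 0.5},
    {"id": 4, "source": "CoLA", "match_count": 1, "rarity_score": 0.4},
]
order = [s["id"] for s in rank(demo)]
```

The multi-hit sentence (id 2) leads despite its low rarity score, and the single-hit tier alternates CoLA, Tatoeba, CoLA.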

Performance

| Operation | Notes |
| --- | --- |
| Word lookup | D1 primary-key lookup |
| Property filter | SQL WHERE with indexed columns |
| Similarity search | Cold start loads ~990 phoneme dot products; full-vocab scan with scalar math |
| Edge query | D1 indexed lookup on source column |
| Text analysis | Batch D1 lookup + percentile aggregation |
| Sentence retrieval | D1 CTE pipeline (include/exclude/bound CTEs + INNER-JOIN; tens of ms typical) |

Deployment

Cloudflare Workers

cd packages/web/workers
npx wrangler deploy

# Apply seed to remote D1 (CI does this from the LFS-tracked seed)
for f in scripts/d1-chunks/chunk_*.sql; do
 npx wrangler d1 execute phonolex --remote --file "$f"
done

Development

The D1 seed SQL is LFS-tracked and is pulled automatically on clone when Git LFS is installed.

# One-time: chunk the seed + apply to local D1
cd packages/web/workers
uv run python scripts/chunk-seed-sql.py
for f in scripts/d1-chunks/chunk_*.sql; do
 npx wrangler d1 execute phonolex --local --file "$f"
done

# Run the API
npx wrangler dev

# Frontend (React + MUI, separate terminal)
cd ../frontend && npm install && npm run dev

Technical limitations

Dialect coverage

General American English only (CMU primary pronunciations). Not supported: British English, regional dialects, pronunciation variants.

Property coverage

Coverage is uniform (~100%) within the canonical content-POS subset for the in-house columns. Frequency-class columns vary by source corpus (freq_age_* is sparse; freq_cyplex_* covers ~80%; frequency ~90%). Filtering by a column with NULL coverage excludes the missing rows; percentile bounds use NULL-fail semantics.
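The difference between NULL-fail and NULL-pass bounds comes down to one branch. The sketch below is illustrative only: the function name and the inclusive bounds are assumptions, and the real filtering happens in SQL.

```python
from typing import Optional


def passes_bound(value: Optional[float], lo: float, hi: float, null_passes: bool) -> bool:
    """Check lo <= value <= hi, with an explicit policy for missing values."""
    if value is None:
        # NULL-fail drops the row; NULL-pass keeps it.
        return null_passes
    return lo <= value <= hi


# Percentile bound (NULL-fail): a word with no percentile is excluded.
percentile_missing = passes_bound(None, 0.0, 50.0, null_passes=False)
# Raw-norm bound in the Sentences route (NULL-pass): the same gap is tolerated.
raw_norm_missing = passes_bound(None, 0.0, 50.0, null_passes=True)
```

Under NULL-fail, sparse columns such as freq_age_* shrink the result set as soon as any bound is placed on them.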

Syllabification

Rule-based English phonotactic constraints. May not optimally handle loanwords, proper nouns with unusual structures, or ambisyllabic consonants.

References

See Data & Methods and Citations for the full bibliography.

Key methodology papers:

  • Stoel-Gammon, C. (2010). Word Complexity Measure. Clinical Linguistics & Phonetics, 24(4-5), 271-282.
  • Vitevitch, M. S., & Luce, P. A. (2004). Phonotactic probability method. Behavior Research Methods, 36(3), 481-487.
  • Hayes, B. (2009). Introductory Phonology (theory-assigned features as Bayesian prior).
  • Marxer et al. (2016) ECCC (Bayesian evidence + perceptual confusability edges).
  • Hillenbrand et al. (1995) American English vowel acoustics (Bayesian evidence for vowel posteriors).
  • Levenshtein, V. I. (1966). Edit-distance algorithm underlying the soft Levenshtein variant.
  • Martínez, G., Conde, J., Reviriego, P., & Brysbaert, M. (2025). Using LLMs to generate psycholinguistic norms — methodology validation for the in-house derivations.
  • Brysbaert, M. (2024). Validating LLM-derived psycholinguistic ratings — companion methodology paper.
  • Gierut, J. A. (1989). Maximal opposition approach. JSHD, 54(1), 9-19.
  • Williams, A. L. (2000). Multiple oppositions. AJSLP, 9(4), 282-288.
  • Storkel, H. L. (2022). Minimal, Maximal, or Multiple Oppositions. LSHSS, 53(3), 632-645.