Technical Architecture

Technical details on PhonoLex's architecture and computational methods.

System overview

PhonoLex is a Hono API on Cloudflare Workers + D1 (SQLite at the edge) with a React + TypeScript + MUI frontend. Data is consolidated locally by a developer-facing Python pipeline and shipped as a single LFS-tracked d1-seed.sql artifact; CI applies the seed via wrangler d1 execute --remote. No Python pipeline runs in CI.

React frontend (TypeScript + MUI, Vite + Cloudflare Pages)
 ↓ HTTPS / JSON
Hono backend (TypeScript, Cloudflare Workers)
 ↓ SQL queries
D1 (SQLite at the edge, 14 tables)
 ├── words ~125K phonology-bearing entries (is_canonical flag marks the ~47K content-POS subset; has_image flag)
 ├── word_properties ~125K rows (in-house norms populated for the ~47K canonical subset)
 ├── word_freq_bands ~125K rows (15 FineWeb-Edu grade-band cols; not surfaced)
 ├── word_percentiles ~125K per-property percentile ranks
 ├── word_images ~1.7K word→image rows (Mulberry Symbols + OpenMoji, CC BY-SA)
 ├── edges ~1.6M word-similarity edges (Qwensim 99.8% + ECCC + WordSim)
 ├── pairs 642K precomputed minimal pairs (feature_distance + sonorant_diff)
 ├── corpus_sentences_index ~236K sentence headers (text, sources[], rarity_score, content_lemma_sig)
 ├── corpus_sentences ~2.1M (sentence_id, surface, is_content) membership rows
 ├── phonemes / phoneme_dots / components / word_syllables (similarity infrastructure)
 └── metadata key/value config

Consolidated data

Word properties (~150 columns)

Each canonical content-POS word carries properties across the following categories:

Category	Columns	Sources
Phonological Complexity	`syllable_count`, `phoneme_count`, `wcm_score`, `cv_shape`	CMU dict + Stoel-Gammon (2010) WCM
Phonotactic Probability	`phono_prob_avg`, `positional_prob_avg`, stressed variants, `neighborhood_density`	Method: Vitevitch & Luce (2004); computed locally from CMU
Lexical Frequency	`frequency`, `log_frequency`, `contextual_diversity`	PhonoLex derivation from FineWeb-Edu
Child-Corpus Frequency	`freq_cyplex_7_9`, `freq_cyplex_10_12`, `freq_cyplex_13`	CYP-LEX (Korochkina et al., 2024, CC BY 4.0)
Lexical Timing	`aoa`	PhonoLex in-house gpt-4.1-mini cloze, 1-7 age-banded (Spearman 0.868 vs Glasgow)
Semantic	`concreteness`, `familiarity`, `imageability`, `boi`, `iconicity`, `socialness`, `semantic_diversity`, semd_*	PhonoLex (in-house gpt-4.1-mini cloze)
Affective	`valence`, `arousal`	PhonoLex (in-house, Warriner-scale anchor)
Morphological	`morpheme_count`, `n_prefixes`, `n_suffixes`, `is_monomorphemic`	Algorithmic + MorphyNet (CC BY-SA 3.0)

The in-house norm columns are derived locally via gpt-4.1-mini cloze-prompt ratings. The original-author papers cited above are scale anchors — values served are PhonoLex's, validated against held-out oracles. See Data & Methods for full per-column provenance.

Word similarity graph (1.6M edges)

Source	Edges	What it measures
Qwensim	~1.63M	PhonoLex — Qwen3-Embedding-0.6B cosine over FineWeb-Edu. Bulk of the graph.
ECCC	~2.5K	Perceptual confusability in noise (Marxer et al., 2016, CC BY 4.0)
WordSim-353	~351	Human-rated semantic relatedness (Finkelstein et al., 2001)

This is neural-embedding similarity, not free-association norm data (the USF / SWOW / MEN / SPP / SimLex datasets were retired in the licensing audit).

Phonological feature vectors

The 41-phoneme CMU inventory is represented in phonemes / phoneme_dots / components tables as 26-d Bayesian posterior vectors learned from theory-assigned priors (Hayes 2009) + ECCC perceptual confusion evidence + Hillenbrand vowel acoustic measurements. Posteriors achieve r=0.987 cosine correlation vs theory-assigned features at convergence — see packages/features/.

Curated corpus (~236K sentences)

The Sentences tool's source corpus: CoLA + UD English-EWT + GUM + Tatoeba + OpenSubtitles, run through a multi-stage build pipeline (vocab coverage, profanity filter, V/A + AFINN content-safety gate, PROPN cap, parataxis rejection, contraction-stem glue, cross-source identical-text merge with aggressive normalization, coverage-aware rarity dedup). CHILDES + PhonBank conversational transcripts were retired 2026-05-25 — they remain in use only for the developmental-frequency derivations above.

Hono backend (Cloudflare Workers + D1)

Build pipeline (developer-local)

build_lexical_database in packages/data/src/phonolex_data/pipeline/ reads source TSVs (CMU + in-house phonolex_*.tsv)
derived.py computes per-property percentiles (frequency-class columns treat 0 as NULL), phoneme dot products, syllable components, minimal pairs
emit_parquet.py writes data/runtime/{words,pairs,corpus_sentences,corpus_sentences_index}.parquet (gitignored local cache)
emit_d1_sql.py writes the LFS-tracked packages/web/workers/scripts/d1-seed.sql
chunk-seed-sql.py splits the ~550 MB seed into 40 MB chunks for wrangler d1 execute --file (D1 statement size cap)
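
The percentile step in derived.py is described only briefly above, so the following is a minimal sketch of one way it could work, not the pipeline's actual code. The function name `percentile_ranks`, the `zero_is_null` flag, and the "share of non-NULL values less than or equal to this one" definition of a percentile are all assumptions made for illustration.

```python
from bisect import bisect_right


def percentile_ranks(values, zero_is_null=False):
    """Return a 0-100 percentile rank per value, or None where the value is NULL.

    Assumed definition: the percentage of usable values that are <= the value.
    With zero_is_null=True a 0 is treated like NULL, mirroring the
    "frequency-class columns treat 0 as NULL" rule described above.
    """

    def usable(v):
        return v is not None and not (zero_is_null and v == 0)

    population = sorted(v for v in values if usable(v))
    if not population:
        return [None] * len(values)
    n = len(population)
    return [
        100.0 * bisect_right(population, v) / n if usable(v) else None
        for v in values
    ]


# Toy frequency column: the 0 and the None both drop out of the ranking.
freq = [0, 5, None, 10, 5, 20]
freq_pct = percentile_ranks(freq, zero_is_null=True)
```

Excluding zeros from the ranked population keeps a large block of unattested words from compressing every attested word into the top of the scale.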

Cold start

The similarity engine loads phoneme norms (~45 entries), pairwise phoneme dot products (~990), and component phoneme sequences (~600) into Worker isolate memory on cold start. All other routes query D1 directly.

API endpoints

Method	Path	Description
`GET`	`/api/health`	Health check
`GET`	`/api/stats`	Vocabulary statistics
`GET`	`/api/property-metadata`	Property definitions, categories, labels (drives the frontend filter UI)
`GET`	`/api/edge-types`	Word-similarity edge type definitions
`GET`	`/api/words/{word}`	Word detail with full property profile
`POST`	`/api/words/search`	Unified search (patterns + filters + cv_shape + similar_to + bounds + pagination)
`POST`	`/api/words/word-list`	Flat word list for a constraint set
`POST`	`/api/words/batch`	Batch word lookup (up to 1000 words)
`POST`	`/api/similarity/search`	Phonological similarity with adjustable weights
`POST`	`/api/sentences`	Sentence retrieval (pattern + cv_shape + bound + contrastive constraints; returns `match_count` + `rarity_score` + per-word `highlights`)
`GET`	`/api/associations/{word}`	Word-similarity neighbors by edge type
`GET`	`/api/associations/{word}/confusability`	ECCC perceptual confusability
`GET`	`/api/phonemes`, `/{ipa}`, `POST /compare`, `POST /search`	Phoneme inventory + feature lookups
`POST`	`/api/contrastive/{minimal-pairs,maximal-opposition/,multiple-opposition/}`	Pair-graph predicates served from `pairs` table
`POST`	`/api/text/analyze`	Per-passage analysis with percentile aggregation

Property metadata system

Property definitions are served via GET /api/property-metadata, providing per-column ID, label, short label, category grouping, source, scale description, interpretation, display format, and filter configuration (slider step, log-scale flag, integer flag). The frontend loads this once at startup and drives all UI (filter sliders, table columns, bound pickers) from the metadata — no hardcoded property lists in TypeScript.

Phonological similarity

PhonoLex computes phonological similarity with a two-level soft Levenshtein distance built on precomputed phoneme-level dot products.

Phoneme-level precomputation

At build time, the pipeline computes:

Norm squares (||v||²) for each of the 41 CMU phonemes
Pairwise dot products between all phoneme pairs (~820 pairs)
Component phoneme sequences (e.g., onset /kr/) preserving cluster structure — no vector averaging
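
As a rough illustration of the first two items, the sketch below precomputes squared norms and unordered pairwise dot products, then recovers a cosine from them without touching the vectors again. The function names and the 3-d toy vectors are hypothetical; the real vectors are the learned feature vectors described earlier.

```python
import math
from itertools import combinations


def precompute(vectors):
    """Return (norm_sq, dots) for a phoneme -> feature-vector mapping."""
    norm_sq = {p: sum(x * x for x in v) for p, v in vectors.items()}
    dots = {}
    # One entry per unordered pair, keyed with the smaller label first.
    for a, b in combinations(sorted(vectors), 2):
        dots[(a, b)] = sum(x * y for x, y in zip(vectors[a], vectors[b]))
    return norm_sq, dots


def cosine(a, b, norm_sq, dots):
    """Cosine similarity using only the precomputed scalars."""
    if a == b:
        return 1.0
    key = (a, b) if a < b else (b, a)
    denom = math.sqrt(norm_sq[a] * norm_sq[b])
    return dots[key] / denom if denom else 0.0


# Toy vectors for illustration only.
toy = {"p": [1.0, 0.0, 0.0], "b": [1.0, 1.0, 0.0], "m": [0.0, 1.0, 1.0]}
norm_sq, dots = precompute(toy)
pair_count_41 = len(list(combinations(range(41), 2)))  # unordered pairs of 41 items
```

Storing one scalar per unordered pair is what keeps the runtime lookup small: 41 phonemes give 41 · 40 / 2 = 820 pairs.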

Runtime similarity

Level 1 — Component similarity (within onset, nucleus, or coda):

Soft Levenshtein DP on phoneme sequences. Substitution cost = 1 - cosine(phoneme_a, phoneme_b) using precomputed norms and dot products. Cluster length penalties (e.g., /kr/ vs /k/ onset) emerge naturally from the DP.
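
A minimal sketch of this component-level step follows. The text above specifies only the substitution cost, so three things here are assumptions: an insertion/deletion cost of 1.0, normalization by the longer sequence length, and treating two empty components as identical. `toy_cos` is a stand-in for the cosine derived from the precomputed norms and dot products.

```python
def soft_levenshtein(seq_a, seq_b, sub_cost, indel=1.0):
    """Edit distance with a real-valued substitution cost."""
    n, m = len(seq_a), len(seq_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel
    for j in range(1, m + 1):
        dp[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + indel,  # deletion
                dp[i][j - 1] + indel,  # insertion
                dp[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]),
            )
    return dp[n][m]


def comp_sim(seq_a, seq_b, cos):
    """Similarity in [0, 1] between two onsets, nuclei, or codas."""
    if not seq_a and not seq_b:
        return 1.0  # assumption: two empty components match perfectly
    dist = soft_levenshtein(seq_a, seq_b, lambda a, b: 1.0 - cos(a, b))
    return 1.0 - dist / max(len(seq_a), len(seq_b))


# Hypothetical cosine table: /k/ and /g/ are close, everything else unrelated.
TOY_COS = {("g", "k"): 0.8}


def toy_cos(a, b):
    if a == b:
        return 1.0
    return TOY_COS.get(tuple(sorted((a, b))), 0.0)


cluster_vs_singleton = comp_sim(["k", "r"], ["k"], toy_cos)
voicing_contrast = comp_sim(["k"], ["g"], toy_cos)
```

Under these assumptions the /kr/ vs /k/ onset scores 0.5 purely because of the one unmatched phoneme, which is the cluster length penalty the text says falls out of the DP.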

Level 2 — Word similarity (across syllable sequences):

Syllable similarity = weighted average of component similarities:

syl_sim = (w_o · compSim(o1, o2) + w_n · compSim(n1, n2) + w_c · compSim(c1, c2))
 / (w_o + w_n + w_c)

Soft Levenshtein DP on syllable sequences with substitution cost = 1 - syl_sim:

DP[i][j] = min(
 DP[i-1][j] + 1.0, # deletion
 DP[i][j-1] + 1.0, # insertion
 DP[i-1][j-1] + (1 - syl_sim(s1_i, s2_j)) # substitution
)

word_similarity = 1 - (DP[n][m] / max(n, m))
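
The recurrence above translates fairly directly into Python. This is a sketch rather than the Worker's implementation: the syllable representation as an (onset, nucleus, coda) triple, the function names, and `exact_comp_sim` (a crude stand-in for the Level 1 component similarity) are assumptions made so the example runs on its own.

```python
def syl_sim(s1, s2, comp_sim, weights):
    """Weighted average of onset, nucleus, and coda similarities."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(w * comp_sim(c1, c2) for w, c1, c2 in zip(weights, s1, s2)) / total


def word_similarity(word1, word2, comp_sim, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Soft Levenshtein over syllables, normalized by the longer word."""
    n, m = len(word1), len(word2)
    if n == 0 and m == 0:
        return 1.0
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1.0 - syl_sim(word1[i - 1], word2[j - 1], comp_sim, weights)
            dp[i][j] = min(dp[i - 1][j] + 1.0, dp[i][j - 1] + 1.0, dp[i - 1][j - 1] + sub)
    return 1.0 - dp[n][m] / max(n, m)


def exact_comp_sim(a, b):
    """Stand-in for Level 1: 1.0 for identical components, else 0.0."""
    return 1.0 if a == b else 0.0


# Syllables as (onset, nucleus, coda) tuples of ARPAbet symbols.
cat = [(("K",), ("AE",), ("T",))]
bat = [(("B",), ("AE",), ("T",))]

balanced = word_similarity(cat, bat, exact_comp_sim)
rhyme = word_similarity(cat, bat, exact_comp_sim, weights=(0.0, 0.5, 0.5))
```

With equal weights the differing onset costs one third of the syllable; with the onset weight set to zero the two words are treated as identical.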

Weight presets

Preset	Onset	Nucleus	Coda
Balanced	0.33	0.33	0.33
Rhymes	0.0	0.5	0.5
Alliteration	1.0	0.0	0.0
Assonance	0.0	1.0	0.0
Consonance	0.5	0.0	0.5

Contrast sets

Minimal pairs

A minimal pair is two words that differ by exactly one phoneme at the same position. Pairs are precomputed at build time: the pairs D1 table carries 642K rows with feature_distance (continuous L2 over learned vectors), sonorant_diff (whether the contrast crosses the sonorant class), and is_canonical (whether both members are content-POS).
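
One common way to enumerate such pairs without comparing every word against every other is to bucket words by the frame left after removing one position. The sketch below shows that idea on a toy lexicon; it is an illustration under that assumption, and the function name and output tuple layout are hypothetical rather than taken from the pipeline.

```python
from collections import defaultdict
from itertools import combinations


def minimal_pairs(lexicon):
    """Yield (word1, word2, position, phoneme1, phoneme2) rows.

    lexicon maps a word to its phoneme tuple. Two words land in the same
    bucket when deleting position i leaves the same remaining phonemes,
    so they have equal length and differ only at position i.
    """
    buckets = defaultdict(list)
    for word, phones in lexicon.items():
        for i in range(len(phones)):
            frame = (i, phones[:i] + phones[i + 1:])
            buckets[frame].append((word, phones[i]))

    pairs = []
    for (pos, _rest), members in buckets.items():
        for (w1, p1), (w2, p2) in combinations(sorted(members), 2):
            if p1 != p2:  # skip homophones
                pairs.append((w1, w2, pos, p1, p2))
    return sorted(pairs)


toy_lexicon = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cab": ("K", "AE", "B"),
    "cot": ("K", "AA", "T"),
    "at": ("AE", "T"),
}
toy_pairs = minimal_pairs(toy_lexicon)
```

Note that "at" pairs with nothing: it differs from "cat" by a deletion, not by a substitution at a shared position.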

Maximal opposition (Gierut, 1989)

Filters pairs on sonorant_diff >= threshold. The major-class crossing predicts broader generalization across the phonological system.

Multiple opposition (Williams, 2000)

A substitute phoneme contrasted with N target phonemes. Surfaced at the word level (Contrast Sets tool). The /api/sentences contrastive_multopp constraint exists but is not surfaced in the Sentences UI — a single sentence almost never witnesses one substitute against ≥2 distinct target phonemes.
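
At the word level, a multiple-opposition set can be pictured as one frame in which the substitute phoneme and several target phonemes all yield real words. The following sketch builds such sets from a toy lexicon; the function name, the `min_targets` parameter, and the choice to keep one word per phoneme per frame are assumptions for illustration, and the production route is described as reading from the pairs table instead.

```python
from collections import defaultdict


def multiple_opposition_sets(lexicon, substitute, targets, min_targets=2):
    """Return (substitute_word, position, {target_phoneme: word}) tuples."""
    frames = defaultdict(dict)  # (position, remaining phonemes) -> {phoneme: word}
    for word, phones in sorted(lexicon.items()):
        for i, phoneme in enumerate(phones):
            frame = (i, phones[:i] + phones[i + 1:])
            frames[frame].setdefault(phoneme, word)

    results = []
    for (pos, _rest), by_phoneme in frames.items():
        if substitute not in by_phoneme:
            continue
        hits = {t: by_phoneme[t] for t in targets if t in by_phoneme}
        if len(hits) >= min_targets:
            results.append((by_phoneme[substitute], pos, hits))
    return results


toy_lexicon = {
    "tip": ("T", "IH", "P"),
    "sip": ("S", "IH", "P"),
    "ship": ("SH", "IH", "P"),
    "chip": ("CH", "IH", "P"),
    "tap": ("T", "AE", "P"),
    "sap": ("S", "AE", "P"),
}
sets_for_t = multiple_opposition_sets(toy_lexicon, "T", ["S", "SH", "CH"])
```

Here "tip" contrasts with three targets and qualifies, while "tap" has only "sap" and is dropped. The same logic explains why the sentence-level constraint rarely fires: a sentence would need "tip" plus at least two of its targets.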

Sentences route ranking

Tiered globally by per-query match_count DESC (multi-hit > single-hit, regardless of source), then source-interleaved within tier by static rarity_score. Bound rules use NULL-fail semantics for percentile columns, NULL-pass for raw norms. Contrastive constraints (minpair / maxopp / multopp) apply via self-join through pairs — the sentence must contain BOTH members of at least one witness pair. Per-result highlights payload returns the surfaces that earned the slot, used by the frontend overlay.
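
The tier-then-interleave ordering can be sketched in a few lines. Several details are assumptions, since the text does not pin them down: each row is reduced to a single `source` label (the index table stores `sources[]`), sources are visited in alphabetical order, and lower `rarity_score` sorts first within a source.

```python
from collections import defaultdict
from itertools import zip_longest


def rank_sentences(rows):
    """Order rows by match_count tier, round-robin across sources within a tier."""
    tiers = defaultdict(list)
    for row in rows:
        tiers[row["match_count"]].append(row)

    ranked = []
    for count in sorted(tiers, reverse=True):  # multi-hit tiers first
        by_source = defaultdict(list)
        for row in tiers[count]:
            by_source[row["source"]].append(row)
        queues = [
            sorted(group, key=lambda r: r["rarity_score"])
            for _source, group in sorted(by_source.items())
        ]
        # Take one row from each source in turn until all queues are empty.
        for layer in zip_longest(*queues):
            ranked.extend(r for r in layer if r is not None)
    return ranked


rows = [
    {"id": "a", "source": "cola", "match_count": 1, "rarity_score": 0.2},
    {"id": "b", "source": "cola", "match_count": 2, "rarity_score": 0.9},
    {"id": "c", "source": "tatoeba", "match_count": 1, "rarity_score": 0.1},
    {"id": "d", "source": "cola", "match_count": 1, "rarity_score": 0.5},
    {"id": "e", "source": "tatoeba", "match_count": 2, "rarity_score": 0.3},
]
order = [r["id"] for r in rank_sentences(rows)]
```

Both two-hit sentences come first regardless of source, and inside the one-hit tier the sources alternate so that no single corpus fills the top of the list.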

Performance

Operation	Notes
Word lookup	D1 primary-key lookup
Property filter	SQL WHERE with indexed columns
Similarity search	Cold start loads ~990 phoneme dot products; full-vocab scan with scalar math
Edge query	D1 indexed lookup on `source` column
Text analysis	Batch D1 lookup + percentile aggregation
Sentence retrieval	D1 CTE pipeline (include/exclude/bound CTEs + INNER-JOIN; tens of ms typical)

Deployment

Cloudflare Workers

cd packages/web/workers
npx wrangler deploy

# Apply seed to remote D1 (CI does this from the LFS-tracked seed)
for f in scripts/d1-chunks/chunk_*.sql; do
 npx wrangler d1 execute phonolex --remote --file "$f"
done

Development

The D1 seed SQL is LFS-tracked, so it is pulled automatically on clone when Git LFS is installed.

# One-time: chunk the seed + apply to local D1
cd packages/web/workers
uv run python scripts/chunk-seed-sql.py
for f in scripts/d1-chunks/chunk_*.sql; do
 npx wrangler d1 execute phonolex --local --file "$f"
done

# Run the API
npx wrangler dev

# Frontend (React + MUI, separate terminal)
cd ../frontend && npm install && npm run dev

Technical limitations

Dialect coverage

General American English only (CMU primary pronunciations). Not supported: British English, regional dialects, pronunciation variants.

Property coverage

Coverage is uniform (~100%) within the canonical content-POS subset for the in-house columns. Frequency-class columns vary by source corpus (freq_cyplex_* covers ~80%; frequency ~90%). Filtering on a column that contains NULLs excludes the rows with missing values; percentile bounds use NULL-fail semantics.
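
The difference between NULL-fail and NULL-pass (the latter is what the Sentences route applies to raw norms) comes down to what a bound check returns for a missing value. A minimal sketch, assuming inclusive bounds and a hypothetical `passes_bound` helper:

```python
def passes_bound(value, lo=None, hi=None, *, null_passes):
    """Check lo <= value <= hi, with an explicit policy for missing values."""
    if value is None:
        return null_passes  # NULL-pass keeps the row, NULL-fail drops it
    if lo is not None and value < lo:
        return False
    if hi is not None and value > hi:
        return False
    return True


# A word with no percentile is excluded by a percentile bound (NULL-fail)...
percentile_check = passes_bound(None, lo=25, hi=75, null_passes=False)
# ...but the same missing value would survive a NULL-pass bound.
raw_norm_check = passes_bound(None, lo=2.0, null_passes=True)
```

In practice this means a tight percentile bound on a partially covered column, such as one of the freq_cyplex_* columns, also silently removes every word that has no value there.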

Syllabification

Rule-based English phonotactic constraints. May not optimally handle loanwords, proper nouns with unusual structures, or ambisyllabic consonants.

References

See Data & Methods and Citations for the full bibliography.

Key methodology papers:

Stoel-Gammon, C. (2010). Word Complexity Measure. Clinical Linguistics & Phonetics, 24(4-5), 271-282.
Vitevitch, M. S., & Luce, P. A. (2004). Phonotactic probability method. Behavior Research Methods, 36(3), 481-487.
Hayes, B. (2009). Introductory Phonology (theory-assigned features as Bayesian prior).
Marxer et al. (2016). ECCC (Bayesian evidence + perceptual confusability edges).
Hillenbrand et al. (1995). American English vowel acoustics (Bayesian evidence for vowel posteriors).
Levenshtein, V. I. (1966). Edit distance (basis of the soft Levenshtein algorithm).
Martínez, G., Conde, J., Reviriego, P., & Brysbaert, M. (2025). Using LLMs to generate psycholinguistic norms (methodology validation for the in-house derivations).
Brysbaert, M. (2024). Validating LLM-derived psycholinguistic ratings (companion methodology paper).
Gierut, J. A. (1989). Maximal opposition approach. JSHD, 54(1), 9-19.
Williams, A. L. (2000). Multiple oppositions. AJSLP, 9(4), 282-288.
Storkel, H. L. (2022). Minimal, Maximal, or Multiple Oppositions. LSHSS, 53(3), 632-645.