Technical Architecture¶
Technical details on PhonoLex's architecture and computational methods.
System Overview¶
PhonoLex v4.0 uses a Hono backend on Cloudflare Workers with a D1 database (SQLite at the edge), serving all queries via a REST API. The React frontend communicates with the API. Data is consolidated from 15 research datasets and exported to D1 via a build-time pipeline.
React Frontend (TypeScript + MUI)
↓ HTTP API calls
Hono Backend (TypeScript, Cloudflare Workers)
↓ SQL queries
D1 Database (SQLite at the edge)
├── 44,011 word rows (35 filterable properties + 35 percentile columns)
├── 1,012,327 edge rows (7 relationship types)
├── Phoneme-level dot products (for similarity)
├── Precomputed minimal pairs
└── Phoneme features (26 learned articulatory features)
Consolidated Dataset¶
PhonoLex consolidates 15 research datasets into a single queryable platform, covering word properties, cognitive associations, and phonological features.
Word Properties¶
Each word carries properties across 9 categories (35 are filterable via the API; additional derived fields like log_frequency and is_monomorphemic are included in responses):
| Category | Properties | Sources |
|---|---|---|
| Phonological Complexity | syllable_count, phoneme_count, wcm_score | CMU Dict, Stoel-Gammon (2010) |
| Phonotactic Probability | phono_prob_avg, positional_prob_avg | Vitevitch & Luce (2004) |
| Lexical | frequency, log_frequency, contextual_diversity, prevalence, aoa, aoa_kuperman, elp_lexical_decision_rt | SUBTLEX-US, Kuperman et al., ELP |
| Semantic | imageability, familiarity, concreteness, size | Glasgow Norms, Brysbaert et al. |
| Affective | valence, arousal, dominance | Warriner et al. (2013) |
| Cognitive / Embodied | iconicity, boi, socialness | Winter et al., Tillotson et al., Diveica et al. |
| Sensorimotor — Perceptual | auditory, visual, haptic, gustatory, olfactory, interoceptive | Lancaster Norms |
| Sensorimotor — Action | hand_arm, foot_leg, head, mouth, torso | Lancaster Norms |
| Morphological | morpheme_count, is_monomorphemic, n_prefixes, n_suffixes | MorphoLex |
Edge Types (6 relationship types)¶
| Edge Type | Edges | Source | Description |
|---|---|---|---|
| USF | 62,923 | University of South Florida | Forward/backward association norms |
| MEN | 3,000 | MEN Dataset | Human-judged semantic relatedness |
| ECCC | 2,456 | Edinburgh Confusability | Perceptual confusability in noise |
| SPP | 1,546 | Semantic Priming Project | Semantic priming (short/long SOA) |
| SimLex | 998 | SimLex-999 | Human-judged semantic similarity |
| WordSim | 351 | WordSim-353 | Human-judged word relatedness |
Phonological Data¶
The platform also stores:
- Learned feature vectors: 26-dimensional articulatory feature vectors via Bayesian inference
- Phoneme features: 26 learned articulatory features per phoneme
- Syllable structures: Onset-nucleus-coda decompositions for all words
Vocabulary Filtering¶
The full dataset has 245,393 entries. The build process filters this down to 44,011 words; a word is kept only if it meets all of the following criteria:
- Must have IPA transcription
- Must have frequency data (SUBTLEX-US)
- Must have at least one psycholinguistic norm value
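The three criteria can be read as a single predicate applied to every raw entry. The sketch below is illustrative only: the `RawEntry` shape and its field names are assumptions, not the pipeline's actual types.

```typescript
// Hypothetical shape of a raw entry; field names are illustrative assumptions.
interface RawEntry {
  word: string;
  ipa?: string; // IPA transcription, if available
  frequency?: number; // SUBTLEX-US frequency, if available
  norms: Record<string, number | null>; // psycholinguistic norms keyed by property
}

// Keep an entry only when all three criteria hold.
function keepEntry(e: RawEntry): boolean {
  const hasIpa = typeof e.ipa === "string" && e.ipa.length > 0;
  const hasFrequency = typeof e.frequency === "number";
  const hasNorm = Object.values(e.norms).some((v) => v !== null);
  return hasIpa && hasFrequency && hasNorm;
}

function filterVocabulary(entries: RawEntry[]): RawEntry[] {
  return entries.filter(keepEntry);
}
```

An entry with a transcription and a frequency but only `null` norm values would be dropped under this reading.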
Hono Backend (Cloudflare Workers + D1)¶
Data Pipeline¶
The build pipeline reads the consolidated research datasets and:
- Filters vocabulary (245K to 44K words)
- Computes WCM scores (Stoel-Gammon 2010)
- Computes percentile ranks for all numeric properties
- Extracts syllable components and precomputes phoneme-level similarity data
- Precomputes minimal pairs
- Computes property ranges (min/max) for filter UI
- Outputs the database seed
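As an illustration of the percentile step, here is a minimal sketch of a percentile-rank computation over one numeric property. The "share of non-null values less than or equal to x" definition is an assumption; the actual pipeline may handle ties or missing values differently.

```typescript
// Percentile rank of each non-null value: share of non-null values <= it, scaled to 0-100.
// The tie handling here is an assumption, not the pipeline's documented behavior.
function percentileRanks(values: Array<number | null>): Array<number | null> {
  const present = values
    .filter((v): v is number => v !== null)
    .sort((a, b) => a - b);
  const n = present.length;

  // Binary search: how many sorted values are <= x.
  const countAtMost = (x: number): number => {
    let lo = 0;
    let hi = n;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (present[mid]! <= x) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  };

  // Missing values stay missing; they do not count toward n.
  return values.map((v) => (v === null ? null : (100 * countAtMost(v)) / n));
}
```

Words without a value for a property keep a `null` percentile, which matches the note elsewhere that coverage varies by dataset.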
Cold Start¶
On cold start, the similarity engine loads phoneme norms, pairwise phoneme similarity data, and component phoneme sequences into memory. All other routes query D1 directly.
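One common way to get this "load once, reuse afterwards" behavior is to memoize the loader's promise at module scope. The helper below is a generic sketch of that pattern; the `once` name and the loader it wraps are hypothetical, and the real backend may organize its cache differently.

```typescript
// Wraps an async loader so it runs at most once; later and concurrent callers
// share the same promise. Note: a rejected load stays cached in this sketch.
function once<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) cached = load();
    return cached;
  };
}

// Hypothetical usage: loadSimilarityData would query D1 for norms, dot products,
// and component sequences. It is replaced by a stub here.
let loads = 0;
const getSimilarityData = once(async () => {
  loads += 1;
  return { phonemeCount: 0 }; // placeholder payload
});
```

A similarity route would `await getSimilarityData()` on each request, paying the load cost only on the first call.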
API Endpoints¶
| Method | Path | Description |
|---|---|---|
| GET | /api/health | Health check with vocabulary stats |
| GET | /api/stats | Full statistics |
| GET | /api/property-ranges | Min/max for all numeric properties |
| GET | /api/property-metadata | Property definitions, categories, labels |
| GET | /api/edge-types | Edge type definitions |
| Words | | |
| GET | /api/words | Browse vocabulary with pagination and sorting |
| GET | /api/words/{word} | Word details with all properties |
| POST | /api/words/search | Unified search (patterns + filters + exclusions + sorting + pagination) |
| POST | /api/words/batch | Batch word lookup (up to 1000 words) |
| Similarity | | |
| POST | /api/similarity/search | Phonological similarity with adjustable weights |
| Associations | | |
| GET | /api/associations/{word} | Cognitive associations by edge type |
| GET | /api/associations/{word}/confusability | ECCC perceptual confusability |
| GET | /api/associations/compare | Shared associations with Jaccard score |
| Phonemes | | |
| GET | /api/phonemes | All phonemes with features |
| GET | /api/phonemes/{ipa} | Single phoneme features (auto-normalizes ASCII "g" to IPA "ɡ") |
| POST | /api/phonemes/compare | Feature-by-feature comparison |
| POST | /api/phonemes/search | Search by feature values |
| Contrastive | | |
| POST | /api/contrastive/minimal-pairs | Minimal pairs for phoneme contrast |
| POST | /api/contrastive/maximal-opposition/pairs | Maximal opposition pairs |
| POST | /api/contrastive/maximal-opposition/word-lists | Maximal opposition word lists |
| POST | /api/contrastive/multiple-opposition/targets | Representative target selection |
| POST | /api/contrastive/multiple-opposition/sets | Multiple opposition sets |
| Text | | |
| POST | /api/text/analyze | Text analysis with percentiles |
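For path parameters such as {word} and {ipa}, a client needs to URL-encode the value, and the ASCII-to-IPA "g" substitution mentioned for /api/phonemes/{ipa} is a one-character replacement. The helpers below are a client-side sketch; the function names and the base URL in the usage note are placeholders, and the server's own normalization may cover more cases.

```typescript
const ASCII_G = "g"; // U+0067
const IPA_G = "\u0261"; // LATIN SMALL LETTER SCRIPT G, the IPA voiced velar plosive

// Mirror of the documented normalization: replace ASCII "g" with IPA "ɡ".
function normalizeIpa(symbol: string): string {
  return symbol.split(ASCII_G).join(IPA_G);
}

// Build the URL for GET /api/words/{word}, encoding the path segment.
function wordUrl(baseUrl: string, word: string): string {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/api/words/${encodeURIComponent(word)}`;
}

// Build the URL for GET /api/phonemes/{ipa}.
function phonemeUrl(baseUrl: string, ipa: string): string {
  const base = baseUrl.replace(/\/+$/, "");
  return `${base}/api/phonemes/${encodeURIComponent(normalizeIpa(ipa))}`;
}
```

With a placeholder host, `fetch(wordUrl("https://example.test", "cat"))` would request the word-details route.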
Property Metadata System¶
Property definitions are served via GET /api/property-metadata, providing:
- Property ID, label, short label
- Category grouping
- Source dataset
- Scale description and interpretation
- Display format (decimal places, suffix)
- Filter configuration (slider step, log scale, integer)
The frontend loads this once at startup and drives all UI (filter sliders, table columns, text analysis features) from the metadata — no hardcoded property lists in TypeScript.
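A metadata-driven UI of this kind might look roughly like the sketch below. The `PropertyMeta` interface is an assumption modeled on the fields listed above, not the actual response schema of /api/property-metadata.

```typescript
// Hypothetical metadata record; field names are assumptions, not the real schema.
interface PropertyMeta {
  id: string;
  label: string;
  category: string;
  decimals: number; // display format: decimal places
  filter: { step: number; logScale: boolean; integer: boolean };
}

// Group properties by category so a filter panel can render one section per category.
function groupByCategory(props: PropertyMeta[]): Map<string, PropertyMeta[]> {
  const groups = new Map<string, PropertyMeta[]>();
  for (const p of props) {
    const bucket = groups.get(p.category) ?? [];
    bucket.push(p);
    groups.set(p.category, bucket);
  }
  return groups;
}

// Format a value using the display settings carried by the metadata.
function formatValue(meta: PropertyMeta, value: number): string {
  return meta.filter.integer ? String(Math.round(value)) : value.toFixed(meta.decimals);
}
```

Adding a property then only requires a new metadata record; the grouping and formatting code stays unchanged.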
Phonological Similarity¶
PhonoLex uses two-level soft Levenshtein distance with precomputed phoneme-level dot products.
Learned Feature Vectors¶
Each phoneme is represented as a 26-dimensional feature vector learned via Bayesian inference from empirical evidence (acoustic data, confusion corpora, morphological patterns). These replace the original PHOIBLE binary feature assignments with continuous values that better predict perceptual confusion.
Phoneme-Level Precomputation¶
At build time, the pipeline computes:
- Norm squares (||v||^2) for each of the 45 phonemes (40 monophthongs + 5 diphthongs)
- Pairwise dot products between all phoneme pairs (~990 pairs)
- Component phoneme sequences (e.g., onset /kr/) preserving cluster structure
Feature vectors are never averaged: consonant clusters and diphthongs are preserved as phoneme sequences.
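The precomputation can be sketched as follows, assuming feature vectors are available as a map from phoneme symbol to number array. The function names and the pair-key format are illustrative, not the pipeline's actual ones. For n phonemes there are n(n-1)/2 unordered pairs, which gives 990 for n = 45.

```typescript
type Vec = number[];

const dot = (a: Vec, b: Vec): number =>
  a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);

// Order-independent key for an unordered phoneme pair (format is an assumption).
const pairKey = (p: string, q: string): string => (p < q ? `${p}|${q}` : `${q}|${p}`);

interface Precomputed {
  normSq: Map<string, number>; // ||v||^2 per phoneme
  dots: Map<string, number>; // dot product per unordered pair
}

function precompute(features: Map<string, Vec>): Precomputed {
  const normSq = new Map<string, number>();
  const dots = new Map<string, number>();
  const phonemes = Array.from(features.keys());
  for (const p of phonemes) normSq.set(p, dot(features.get(p)!, features.get(p)!));
  for (let i = 0; i < phonemes.length; i++) {
    for (let j = i + 1; j < phonemes.length; j++) {
      const p = phonemes[i]!;
      const q = phonemes[j]!;
      dots.set(pairKey(p, q), dot(features.get(p)!, features.get(q)!));
    }
  }
  return { normSq, dots };
}

// Cosine similarity recovered from the stored scalars only.
function cosine(p: string, q: string, pre: Precomputed): number {
  if (p === q) return 1;
  const denom = Math.sqrt((pre.normSq.get(p) ?? 0) * (pre.normSq.get(q) ?? 0));
  return denom === 0 ? 0 : (pre.dots.get(pairKey(p, q)) ?? 0) / denom;
}
```

Storing scalars instead of vectors means the runtime never touches the 26-dimensional features directly.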
Runtime Similarity¶
At runtime, similarity is computed with two nested levels of soft Levenshtein distance:
Level 1 — Component similarity (within onset, nucleus, or coda):
Soft Levenshtein DP on phoneme sequences. Substitution cost = 1 - cosine(phoneme_a, phoneme_b) using precomputed norms and dot products. This preserves cluster length penalties (e.g., /kr/ vs /k/ onset → cost includes an insertion penalty).
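A minimal sketch of the Level 1 computation follows. Several details are assumptions rather than documented behavior: insertions and deletions cost 1.0, the distance is converted to a similarity by dividing by the longer sequence length, and two empty components count as a perfect match. The `cos` callback stands in for a lookup over the precomputed norms and dot products.

```typescript
// cos(p, q) is assumed to return cosine similarity from precomputed norms and dot products.
type Cosine = (p: string, q: string) => number;

function componentSimilarity(a: string[], b: string[], cos: Cosine): number {
  const n = a.length;
  const m = b.length;
  if (n === 0 && m === 0) return 1; // assumption: two empty components match perfectly

  // prev[j] = cost of aligning the first i-1 phonemes of a with the first j of b.
  let prev = Array.from({ length: m + 1 }, (_, j) => j);
  for (let i = 1; i <= n; i++) {
    const cur = [i]; // aligning i phonemes of a with nothing costs i deletions
    for (let j = 1; j <= m; j++) {
      const substitution = prev[j - 1]! + (1 - cos(a[i - 1]!, b[j - 1]!));
      const deletion = prev[j]! + 1;
      const insertion = cur[j - 1]! + 1;
      cur.push(Math.min(deletion, insertion, substitution));
    }
    prev = cur;
  }
  // Assumed normalization: distance divided by the longer sequence length.
  return 1 - prev[m]! / Math.max(n, m);
}
```

Under these assumptions, comparing onset /kr/ with /k/ yields 0.5: the /k/ matches at zero cost and the extra /r/ costs one unit over a maximum length of two.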
Level 2 — Word similarity (across syllable sequences): Syllable similarity = weighted average of component similarities:
syl_sim = (w_o * compSim(o1,o2) + w_n * compSim(n1,n2) + w_c * compSim(c1,c2)) / (w_o + w_n + w_c)
Soft Levenshtein DP on syllable sequences with substitution cost = 1 - syl_sim:
DP[i][j] = min(
DP[i-1][j] + 1.0, # deletion
DP[i][j-1] + 1.0, # insertion
DP[i-1][j-1] + (1 - syl_sim(s1_i, s2_j)) # substitution
)
word_similarity = 1 - (DP[n][m] / max(n, m))
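The recurrence above translates fairly directly into code. The sketch below follows the stated formulas; the `Syllable` and `Weights` types and the `compSim` callback (standing in for the Level 1 component similarity) are illustrative names, and the code assumes each word has at least one syllable and the weights do not sum to zero.

```typescript
interface Syllable { onset: string[]; nucleus: string[]; coda: string[] }
interface Weights { onset: number; nucleus: number; coda: number }
type CompSim = (a: string[], b: string[]) => number;

// Weighted average of onset, nucleus, and coda similarities.
function syllableSimilarity(s1: Syllable, s2: Syllable, w: Weights, compSim: CompSim): number {
  const total = w.onset + w.nucleus + w.coda; // assumed non-zero
  return (
    (w.onset * compSim(s1.onset, s2.onset) +
      w.nucleus * compSim(s1.nucleus, s2.nucleus) +
      w.coda * compSim(s1.coda, s2.coda)) / total
  );
}

// Soft Levenshtein over syllable sequences; insertion and deletion cost 1.0.
function wordSimilarity(w1: Syllable[], w2: Syllable[], w: Weights, compSim: CompSim): number {
  const n = w1.length;
  const m = w2.length;
  // First row and column hold pure insertion/deletion costs.
  const dp: number[][] = Array.from({ length: n + 1 }, (_, i) =>
    Array.from({ length: m + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const sub = dp[i - 1]![j - 1]! + (1 - syllableSimilarity(w1[i - 1]!, w2[j - 1]!, w, compSim));
      dp[i]![j] = Math.min(dp[i - 1]![j]! + 1, dp[i]![j - 1]! + 1, sub);
    }
  }
  return 1 - dp[n]![m]! / Math.max(n, m);
}
```

With an exact-match `compSim` and equal weights, "cat" and "bat" share two of three components, so the similarity comes out at 2/3; with the onset weight set to zero it becomes 1.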
Weighted Component Presets¶
User-adjustable onset/nucleus/coda weights control the similarity computation. Weight presets:
- Balanced: onset=0.33, nucleus=0.33, coda=0.33
- Rhymes: onset=0.0, nucleus=0.5, coda=0.5
- Alliteration: onset=1.0, nucleus=0.5, coda=0.0
- Assonance: onset=0.0, nucleus=1.0, coda=0.0
- Consonance: onset=0.5, nucleus=0.0, coda=0.5
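The presets can be written as a plain lookup table. The object and function names below are illustrative. Because the syllable similarity formula divides by the sum of the weights, only the ratios between the three weights affect the result; the `normalize` helper makes that explicit by rescaling a preset to sum to 1.

```typescript
interface Weights { onset: number; nucleus: number; coda: number }

// Values copied from the preset list; the key names are illustrative.
const PRESETS = {
  balanced: { onset: 0.33, nucleus: 0.33, coda: 0.33 },
  rhymes: { onset: 0.0, nucleus: 0.5, coda: 0.5 },
  alliteration: { onset: 1.0, nucleus: 0.5, coda: 0.0 },
  assonance: { onset: 0.0, nucleus: 1.0, coda: 0.0 },
  consonance: { onset: 0.5, nucleus: 0.0, coda: 0.5 },
};

// Rescale weights to sum to 1; the weighted average is unchanged by this.
function normalize(w: Weights): Weights {
  const total = w.onset + w.nucleus + w.coda;
  if (total === 0) throw new Error("weights must not all be zero");
  return { onset: w.onset / total, nucleus: w.nucleus / total, coda: w.coda / total };
}
```

For example, the Alliteration preset normalizes to roughly two thirds onset and one third nucleus.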
Contrastive Sets¶
Minimal Pairs¶
Minimal pairs are precomputed at build time. A minimal pair is two words that differ by exactly one phoneme at the same position.
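The definition amounts to a simple position-wise comparison of two phoneme sequences of equal length. The helper below is a sketch of that check only; how the build step enumerates candidate pairs is not described here, and the function name is illustrative.

```typescript
// Returns the index at which two phoneme sequences differ if they differ at
// exactly one position; returns -1 otherwise (different lengths, identical
// sequences, or more than one difference).
function minimalPairPosition(a: string[], b: string[]): number {
  if (a.length !== b.length) return -1;
  let position = -1;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) {
      if (position !== -1) return -1; // second difference: not a minimal pair
      position = i;
    }
  }
  return position;
}
```

For /kæt/ and /bæt/ the function returns 0 (the initial consonant); for /kæt/ and /kæp/ it returns 2.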
Maximal Opposition (Gierut 1989-1992)¶
Computed on-the-fly. Finds phoneme pairs with maximum feature distance, prioritizing:
1. Major class difference (consonant vs. vowel features)
2. Number of differing distinctive features
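The two priorities can be expressed as a two-key sort over all phoneme pairs. The sketch below makes simplifying assumptions: major class is reduced to a single `isVowel` flag, and two continuous feature values count as "differing" when they are more than a tolerance apart. Neither choice is confirmed as the actual implementation.

```typescript
interface Phoneme { ipa: string; isVowel: boolean; features: number[] }
interface Opposition { a: string; b: string; classDiffers: boolean; differing: number }

// Count features whose values differ by more than tol (tol is an assumption).
function differingFeatures(x: number[], y: number[], tol = 0.5): number {
  let count = 0;
  for (let i = 0; i < Math.min(x.length, y.length); i++) {
    if (Math.abs(x[i]! - y[i]!) > tol) count++;
  }
  return count;
}

// Rank all unordered pairs: major class difference first, then feature count.
function rankOppositions(phonemes: Phoneme[]): Opposition[] {
  const pairs: Opposition[] = [];
  for (let i = 0; i < phonemes.length; i++) {
    for (let j = i + 1; j < phonemes.length; j++) {
      const p = phonemes[i]!;
      const q = phonemes[j]!;
      pairs.push({
        a: p.ipa,
        b: q.ipa,
        classDiffers: p.isVowel !== q.isVowel,
        differing: differingFeatures(p.features, q.features),
      });
    }
  }
  return pairs.sort(
    (x, y) => Number(y.classDiffers) - Number(x.classDiffers) || y.differing - x.differing,
  );
}
```

The first element of the returned array is the most maximally opposed pair under these two criteria.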
Multiple Opposition¶
Finds minimal sets (triplets/quadruplets) where words differ only in one phoneme position, targeting global phoneme collapse patterns.
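One way to find such sets is to index every word under each of its "frames", where a frame is the phoneme sequence with one position replaced by a wildcard. Words sharing a frame differ only at that position. This is a sketch of the grouping idea, not the platform's documented algorithm; the function name, frame format, and `minSize` default are illustrative.

```typescript
// lexicon maps each word to its phoneme sequence.
// Returns frames (e.g. "_ i") mapped to words that fill the wildcard slot
// with distinct phonemes, keeping only sets of at least minSize words.
function oppositionSets(lexicon: Map<string, string[]>, minSize = 3): Map<string, string[]> {
  // frame -> (phoneme in the wildcard slot -> first word seen with it)
  const frames = new Map<string, Map<string, string>>();
  lexicon.forEach((phones, word) => {
    phones.forEach((phone, i) => {
      const frame = phones.map((p, k) => (k === i ? "_" : p)).join(" ");
      const slot = frames.get(frame) ?? new Map<string, string>();
      if (!slot.has(phone)) slot.set(phone, word); // skip homophones
      frames.set(frame, slot);
    });
  });

  const sets = new Map<string, string[]>();
  frames.forEach((slot, frame) => {
    if (slot.size >= minSize) sets.set(frame, Array.from(slot.values()));
  });
  return sets;
}
```

For a toy lexicon containing "tea", "key", and "see", the frame "_ i" yields a triplet contrasting /t/, /k/, and /s/ in initial position.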
Performance¶
| Operation | Notes |
|---|---|
| Word lookup | D1 primary key lookup |
| Property filter | SQL WHERE with indexed columns |
| Similarity search | Cold start loads component cache; full vocabulary scan with scalar math |
| Edge query | D1 indexed lookup on source column |
| Text analysis | Batch D1 lookup + percentile aggregation |
Data Coverage¶
Vocabulary: 44,011 English words (General American English, CMU primary pronunciations)
Phoneme inventory: 39 phonemes (24 consonants + 15 vowels including diphthongs)
Property coverage varies by dataset:
- 100%: Syllables, phonemes, frequency
- 90%+: WCM, phonotactic probability
- 70-80%: AoA, concreteness
- 40-60%: Imageability, familiarity, valence, arousal, dominance
- 30-50%: Sensorimotor norms, iconicity, BOI, socialness, morphological
Deployment¶
Cloudflare Workers¶
# Deploy backend
cd workers
npx wrangler deploy
# Seed D1 database
npx wrangler d1 execute phonolex --file scripts/d1-seed.sql
Development¶
# Backend
cd workers
npm install
npx wrangler dev
# Frontend
cd webapp/frontend
npm install && npm run dev
Technical Limitations¶
Dialect Coverage¶
General American English only (CMU Dictionary primary pronunciations). Not supported: British English, regional dialects, pronunciation variants.
Property Coverage¶
Not all words have all properties. Coverage varies by dataset (see above). Filters exclude words without the relevant property.
Syllabification¶
Syllabification is rule-based, using English phonotactic constraints. It may not optimally handle loanwords, proper nouns with unusual structures, or ambisyllabic consonants.
References¶
Phonological Features:
- Hayes, B. (2009). Introductory Phonology. Wiley-Blackwell.
Similarity:
- Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707-710.
Phonological Complexity:
- Stoel-Gammon, C. (2010). The Word Complexity Measure. Clinical Linguistics & Phonetics, 24(4-5), 271-282.
Psycholinguistic Norms:
- Brysbaert, M., & New, B. (2009). SUBTLEX-US word frequencies. Behavior Research Methods, 41(4), 977-990.
- Kuperman, V., et al. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44, 978-990.
- Brysbaert, M., et al. (2014). Concreteness ratings for 40,000 word lemmas. Behavior Research Methods, 46, 904-911.
- Warriner, A. B., et al. (2013). Norms of valence, arousal, and dominance. Behavior Research Methods, 45, 1191-1207.
Cognitive Association Sources:
- De Deyne, S., et al. (2019). Small World of Words. Behavior Research Methods, 51(3), 987-1006.
- Nelson, D. L., et al. (2004). University of South Florida norms. Behavior Research Methods, 36(3), 402-407.
- Bruni, E., et al. (2014). Multimodal distributional semantics. JAIR, 49, 1-47.
- Marxer, R., et al. (2016). Edinburgh Confusability Corpus.
- Hutchison, K. A., et al. (2013). The Semantic Priming Project. Behavior Research Methods, 45(4), 1099-1114.
- Hill, F., et al. (2015). SimLex-999. Computational Linguistics, 41(4), 665-695.
- Finkelstein, L., et al. (2001). Placing search in context. WWW 2001, 406-414.
Clinical Interventions:
- Gierut, J. A. (1989). Maximal opposition approach. JSHD, 54(1), 9-19.
- Storkel, H. L. (2022). Minimal, Maximal, or Multiple Oppositions. LSHSS, 53(3), 632-645.