Technical Architecture

Technical details on PhonoLex's architecture and computational methods.

System Overview

PhonoLex v4.0 uses a Hono backend on Cloudflare Workers with a D1 database (SQLite at the edge), serving all queries via a REST API. The React frontend communicates with the API. Data is consolidated from 15 research datasets and exported to D1 via a build-time pipeline.

React Frontend (TypeScript + MUI)
    ↓ HTTP API calls
Hono Backend (TypeScript, Cloudflare Workers)
    ↓ SQL queries
D1 Database (SQLite at the edge)
    ├── 44,011 word rows (35 filterable properties + 35 percentile columns)
    ├── 1,012,327 edge rows (7 relationship types)
    ├── Phoneme-level dot products (for similarity)
    ├── Precomputed minimal pairs
    └── Phoneme features (26 learned articulatory features)

Consolidated Dataset

PhonoLex consolidates 15 research datasets into a single queryable platform, covering word properties, cognitive associations, and phonological features.

Word Properties

Each word carries properties across 9 categories (35 are filterable via the API; additional derived fields like log_frequency and is_monomorphemic are included in responses):

Category	Properties	Sources
Phonological Complexity	syllable_count, phoneme_count, wcm_score	CMU Dict, Stoel-Gammon (2010)
Phonotactic Probability	phono_prob_avg, positional_prob_avg	Vitevitch & Luce (2004)
Lexical	frequency, log_frequency, contextual_diversity, prevalence, aoa, aoa_kuperman, elp_lexical_decision_rt	SUBTLEX-US, Kuperman et al., ELP
Semantic	imageability, familiarity, concreteness, size	Glasgow Norms, Brysbaert et al.
Affective	valence, arousal, dominance	Warriner et al. (2013)
Cognitive / Embodied	iconicity, boi, socialness	Winter et al., Tillotson et al., Diveica et al.
Sensorimotor — Perceptual	auditory, visual, haptic, gustatory, olfactory, interoceptive	Lancaster Norms
Sensorimotor — Action	hand_arm, foot_leg, head, mouth, torso	Lancaster Norms
Morphological	morpheme_count, is_monomorphemic, n_prefixes, n_suffixes	MorphoLex

Edge Types

Edge Type	Edges	Source	Description
USF	62,923	University of South Florida	Forward/backward association norms
MEN	3,000	MEN Dataset	Human-judged semantic relatedness
ECCC	2,456	Edinburgh Confusability	Perceptual confusability in noise
SPP	1,546	Semantic Priming Project	Semantic priming (short/long SOA)
SimLex	998	SimLex-999	Human-judged semantic similarity
WordSim	351	WordSim-353	Human-judged word relatedness

Phonological Data

The platform also stores:
- Learned feature vectors: 26-dimensional articulatory feature vectors learned via Bayesian inference
- Phoneme features: 26 learned articulatory features per phoneme
- Syllable structures: onset-nucleus-coda decompositions for all words

Vocabulary Filtering

The full dataset has 245,393 entries. Vocabulary is filtered during the build process to 44,011 words using:

Must have IPA transcription
Must have frequency data (SUBTLEX-US)
Must have at least one psycholinguistic norm value
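The three inclusion criteria can be sketched as a single predicate. This is a minimal illustration, assuming all three criteria must hold together; the `RawEntry` shape, its field names, and the sample values are hypothetical and not taken from the PhonoLex pipeline.

```typescript
// Hypothetical shape of one consolidated entry; field names are illustrative.
interface RawEntry {
  word: string;
  ipa: string | null;                    // IPA transcription, if any
  frequency: number | null;              // SUBTLEX-US frequency, if any
  norms: Record<string, number | null>;  // psycholinguistic norm values
}

// An entry survives only if it meets all three inclusion criteria.
function keepEntry(e: RawEntry): boolean {
  const hasIpa = e.ipa !== null && e.ipa.length > 0;
  const hasFrequency = e.frequency !== null;
  const hasNorm = Object.keys(e.norms).some((k) => e.norms[k] !== null);
  return hasIpa && hasFrequency && hasNorm;
}

// Made-up sample rows, for illustration only.
const sampleEntries: RawEntry[] = [
  { word: "cat", ipa: "kæt", frequency: 60, norms: { concreteness: 4.8 } },
  { word: "zyx", ipa: null, frequency: null, norms: {} },
  { word: "the", ipa: "ðə", frequency: 29000, norms: { concreteness: null } },
];

const keptWords = sampleEntries.filter(keepEntry).map((e) => e.word);
```

In this sample only "cat" is kept: "zyx" has no transcription or frequency, and "the" has no non-null norm value.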

Hono Backend (Cloudflare Workers + D1)

Data Pipeline

The build pipeline reads the consolidated research datasets and:

Filters vocabulary (245K to 44K words)
Computes WCM scores (Stoel-Gammon 2010)
Computes percentile ranks for all numeric properties
Extracts syllable components and precomputes phoneme-level similarity data
Precomputes minimal pairs
Computes property ranges (min/max) for filter UI
Outputs the database seed
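As an illustration of the percentile step, the sketch below ranks one numeric property across the vocabulary. The exact percentile definition used by the pipeline is not stated here, so this assumes "share of non-missing values at or below the value, scaled to 0-100"; the function name and sample data are hypothetical.

```typescript
// Percentile rank of each value: the share of non-missing values that are
// less than or equal to it, scaled to 0-100. Missing values stay null.
function percentileRanks(values: Array<number | null>): Array<number | null> {
  const present: number[] = [];
  for (const v of values) {
    if (v !== null) present.push(v);
  }
  present.sort((a, b) => a - b);

  // Binary search: number of sorted entries <= x.
  const countAtOrBelow = (x: number): number => {
    let lo = 0;
    let hi = present.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (present[mid] <= x) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  };

  return values.map((v) =>
    v === null ? null : (100 * countAtOrBelow(v)) / present.length
  );
}

// Made-up values for one property; null marks a word without that norm.
const aoaSample: Array<number | null> = [3.5, null, 7.0, 5.2, 5.2];
const aoaPercentiles = percentileRanks(aoaSample);
// aoaPercentiles is [25, null, 100, 75, 75]
```

Ranking only over words that have the property keeps percentiles comparable across datasets with different coverage.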

Cold Start

On cold start, the similarity engine loads phoneme norms, pairwise phoneme similarity data, and component phoneme sequences into memory. All other routes query D1 directly.

API Endpoints

Method	Path	Description
`GET`	`/api/health`	Health check with vocabulary stats
`GET`	`/api/stats`	Full statistics
`GET`	`/api/property-ranges`	Min/max for all numeric properties
`GET`	`/api/property-metadata`	Property definitions, categories, labels
`GET`	`/api/edge-types`	Edge type definitions
Words
`GET`	`/api/words`	Browse vocabulary with pagination and sorting
`GET`	`/api/words/{word}`	Word details with all properties
`POST`	`/api/words/search`	Unified search (patterns + filters + exclusions + sorting + pagination)
`POST`	`/api/words/batch`	Batch word lookup (up to 1000 words)
Similarity
`POST`	`/api/similarity/search`	Phonological similarity with adjustable weights
Associations
`GET`	`/api/associations/{word}`	Cognitive associations by edge type
`GET`	`/api/associations/{word}/confusability`	ECCC perceptual confusability
`GET`	`/api/associations/compare`	Shared associations with Jaccard score
Phonemes
`GET`	`/api/phonemes`	All phonemes with features
`GET`	`/api/phonemes/{ipa}`	Single phoneme features (auto-normalizes ASCII "g" to IPA "ɡ")
`POST`	`/api/phonemes/compare`	Feature-by-feature comparison
`POST`	`/api/phonemes/search`	Search by feature values
Contrastive
`POST`	`/api/contrastive/minimal-pairs`	Minimal pairs for phoneme contrast
`POST`	`/api/contrastive/maximal-opposition/pairs`	Maximal opposition pairs
`POST`	`/api/contrastive/maximal-opposition/word-lists`	Maximal opposition word lists
`POST`	`/api/contrastive/multiple-opposition/targets`	Representative target selection
`POST`	`/api/contrastive/multiple-opposition/sets`	Multiple opposition sets
Text
`POST`	`/api/text/analyze`	Text analysis with percentiles
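The /api/associations/compare endpoint in the table above reports shared associations with a Jaccard score. The sketch below shows the standard Jaccard computation (intersection size over union size) on two associate lists; the function name, the return shape, and the sample associates are assumptions for illustration, not the endpoint's actual response format.

```typescript
// Jaccard similarity of two association lists, each treated as a set.
function jaccard(a: string[], b: string[]): { shared: string[]; score: number } {
  const inA: Record<string, boolean> = {};
  for (const w of a) inA[w] = true;
  const inB: Record<string, boolean> = {};
  for (const w of b) inB[w] = true;

  const shared = Object.keys(inA).filter((w) => inB[w] === true);
  const unionSize =
    Object.keys(inA).length + Object.keys(inB).length - shared.length;

  // Two empty lists share nothing; report 0 rather than dividing by zero.
  return { shared, score: unionSize === 0 ? 0 : shared.length / unionSize };
}

// Made-up associates for two words.
const overlap = jaccard(
  ["dog", "kitten", "mouse"],
  ["dog", "mouse", "cheese", "trap"]
);
// overlap.shared is ["dog", "mouse"]; overlap.score is 2 / 5 = 0.4
```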

Property Metadata System

Property definitions are served via GET /api/property-metadata, providing:
- Property ID, label, short label
- Category grouping
- Source dataset
- Scale description and interpretation
- Display format (decimal places, suffix)
- Filter configuration (slider step, log scale, integer)

The frontend loads this once at startup and drives all UI (filter sliders, table columns, text analysis features) from the metadata, so no property lists are hardcoded in TypeScript.
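A metadata-driven UI of this kind might look like the following sketch. The `PropertyMeta` interface and its field names are hypothetical, loosely mirroring the items listed above; the real response schema may differ.

```typescript
// Hypothetical metadata record; field names are assumptions for illustration.
interface PropertyMeta {
  id: string;
  label: string;
  shortLabel: string;
  category: string;
  source: string;
  display: { decimals: number; suffix: string };
  filter: { step: number; logScale: boolean; integer: boolean };
}

// Format a table cell using only the metadata.
function formatValue(meta: PropertyMeta, value: number | null): string {
  if (value === null) return "n/a"; // placeholder for missing norms (assumption)
  return value.toFixed(meta.display.decimals) + meta.display.suffix;
}

// Group property ids by category, e.g. to build filter panel sections.
function groupByCategory(props: PropertyMeta[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const p of props) {
    if (!groups[p.category]) groups[p.category] = [];
    groups[p.category].push(p.id);
  }
  return groups;
}

// Two made-up records standing in for the API response.
const sampleMeta: PropertyMeta[] = [
  {
    id: "aoa_kuperman", label: "Age of Acquisition", shortLabel: "AoA",
    category: "Lexical", source: "Kuperman et al.",
    display: { decimals: 1, suffix: " yr" },
    filter: { step: 0.1, logScale: false, integer: false },
  },
  {
    id: "syllable_count", label: "Syllable Count", shortLabel: "Syll",
    category: "Phonological Complexity", source: "CMU Dict",
    display: { decimals: 0, suffix: "" },
    filter: { step: 1, logScale: false, integer: true },
  },
];

const aoaCell = formatValue(sampleMeta[0], 5.234); // "5.2 yr"
const filterSections = groupByCategory(sampleMeta);
```

Because every column and slider is derived from the records, adding a property would only require a new metadata entry, not a frontend change.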

Phonological Similarity

PhonoLex uses two-level soft Levenshtein distance with precomputed phoneme-level dot products.

Learned Feature Vectors

Each phoneme is represented as a 26-dimensional feature vector learned via Bayesian inference from empirical evidence (acoustic data, confusion corpora, morphological patterns). These replace the original PHOIBLE binary feature assignments with continuous values that better predict perceptual confusion.

Phoneme-Level Precomputation

At build time, the pipeline computes:
- Norm squares (||v||^2) for each of the 45 phonemes (40 monophthongs + 5 diphthongs)
- Pairwise dot products between all phoneme pairs (~990 pairs)
- Component phoneme sequences (e.g., onset /kr/) preserving cluster structure

No vector averaging is performed: consonant clusters and diphthongs are preserved as phoneme sequences.
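The precomputation above can be sketched as follows. With 45 phonemes there are 45 × 44 / 2 = 990 unordered pairs, which matches the pair count given. The table layout, the key format, and all function names are assumptions for illustration; the toy vectors have 3 dimensions rather than 26.

```typescript
type FeatureVectors = Record<string, number[]>;

interface PhonemeTables {
  normSq: Record<string, number>; // ||v||^2 per phoneme
  dot: Record<string, number>;    // dot product per unordered pair, keyed "a|b"
}

function dotProduct(u: number[], v: number[]): number {
  let sum = 0;
  for (let i = 0; i < u.length; i++) sum += u[i] * v[i];
  return sum;
}

// Order-independent key for an unordered phoneme pair.
function pairKey(a: string, b: string): string {
  return a < b ? a + "|" + b : b + "|" + a;
}

// Build-time step: one norm square per phoneme, one dot product per pair.
function precomputeTables(vectors: FeatureVectors): PhonemeTables {
  const ids = Object.keys(vectors);
  const tables: PhonemeTables = { normSq: {}, dot: {} };
  for (let i = 0; i < ids.length; i++) {
    tables.normSq[ids[i]] = dotProduct(vectors[ids[i]], vectors[ids[i]]);
    for (let j = i + 1; j < ids.length; j++) {
      tables.dot[pairKey(ids[i], ids[j])] =
        dotProduct(vectors[ids[i]], vectors[ids[j]]);
    }
  }
  return tables;
}

// Runtime step: cosine from table lookups only, no vector math.
function cosineFromTables(t: PhonemeTables, a: string, b: string): number {
  if (a === b) return 1;
  const denom = Math.sqrt(t.normSq[a] * t.normSq[b]);
  return denom === 0 ? 0 : t.dot[pairKey(a, b)] / denom;
}

// Toy 3-dimensional vectors; the real ones have 26 dimensions.
const toyTables = precomputeTables({ p: [1, 0, 0], b: [1, 1, 0], a: [0, 0, 1] });
const cosPB = cosineFromTables(toyTables, "p", "b"); // 1 / sqrt(2)
```

Storing norm squares and dot products separately means the runtime cosine needs only three lookups, one multiplication, one square root, and one division per phoneme pair.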

Runtime Similarity

At runtime, similarity is computed with two levels of soft Levenshtein distance:

Level 1: Component similarity (within onset, nucleus, or coda). A soft Levenshtein DP runs over the two components' phoneme sequences, with substitution cost = 1 - cosine(phoneme_a, phoneme_b) computed from the precomputed norms and dot products. This preserves cluster length penalties (e.g., for a /kr/ vs /k/ onset, the cost includes an insertion penalty).
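A minimal sketch of the Level 1 computation follows. The text above specifies only the substitution cost, so three details here are assumptions: insertion and deletion cost 1.0 (mirroring the word-level recurrence), the distance is normalized by the longer sequence length, and two empty components count as a perfect match. The cosine function is passed in and the toy values are made up.

```typescript
type CosineFn = (a: string, b: string) => number;

// Soft Levenshtein similarity between two phoneme sequences
// (e.g. two onsets). Returns a value in [0, 1].
function componentSimilarity(x: string[], y: string[], cosine: CosineFn): number {
  const n = x.length;
  const m = y.length;
  if (n === 0 && m === 0) return 1; // assumption: empty vs empty matches

  // dp[i][j] = minimum cost to align the first i phonemes of x
  // with the first j phonemes of y.
  const dp: number[][] = [];
  for (let i = 0; i <= n; i++) {
    dp.push([]);
    for (let j = 0; j <= m; j++) {
      if (i === 0) {
        dp[i].push(j); // j insertions
      } else if (j === 0) {
        dp[i].push(i); // i deletions
      } else {
        const substitution = dp[i - 1][j - 1] + (1 - cosine(x[i - 1], y[j - 1]));
        dp[i].push(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, substitution));
      }
    }
  }
  return 1 - dp[n][m] / Math.max(n, m);
}

// Toy cosine: identical phonemes score 1, /k/ vs /g/ scores 0.8, others 0.
const toyCosine: CosineFn = (a, b) => {
  if (a === b) return 1;
  const key = a < b ? a + b : b + a;
  return key === "gk" ? 0.8 : 0;
};

const clusterVsSingle = componentSimilarity(["k", "r"], ["k"], toyCosine); // 0.5
const voicingContrast = componentSimilarity(["k"], ["g"], toyCosine);      // about 0.8
```

The /kr/ vs /k/ case pays one full deletion, so under these assumptions the cluster scores lower than the single-phoneme voicing contrast.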

Level 2: Word similarity (across syllable sequences). Syllable similarity is the weighted average of the component similarities:

syl_sim = (w_o * compSim(o1,o2) + w_n * compSim(n1,n2) + w_c * compSim(c1,c2)) / (w_o + w_n + w_c)

A second soft Levenshtein DP then runs over the two words' syllable sequences (n and m syllables long), with substitution cost = 1 - syl_sim:

DP[i][j] = min(
    DP[i-1][j] + 1.0,                       # deletion
    DP[i][j-1] + 1.0,                        # insertion
    DP[i-1][j-1] + (1 - syl_sim(s1_i, s2_j)) # substitution
)

word_similarity = 1 - (DP[n][m] / max(n, m))
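The recurrence above can be made runnable as in the sketch below. The base cases DP[i][0] = i and DP[0][j] = j are the standard Levenshtein initialization and are assumed here, as is the guard for two empty words; the syllable similarity is passed in as a function, and the toy values are made up.

```typescript
// Word similarity from the syllable-level soft Levenshtein recurrence.
// S is whatever represents a syllable; sylSim returns a value in [0, 1].
function wordSimilarity<S>(
  s1: S[],
  s2: S[],
  sylSim: (a: S, b: S) => number
): number {
  const n = s1.length;
  const m = s2.length;
  if (n === 0 && m === 0) return 1; // assumption: avoids division by zero

  // Keep only the previous DP row; prev[j] is DP[i-1][j].
  let prev: number[] = [];
  for (let j = 0; j <= m; j++) prev.push(j); // DP[0][j] = j insertions

  for (let i = 1; i <= n; i++) {
    const cur: number[] = [i]; // DP[i][0] = i deletions
    for (let j = 1; j <= m; j++) {
      cur.push(
        Math.min(
          prev[j] + 1.0,                                        // deletion
          cur[j - 1] + 1.0,                                     // insertion
          prev[j - 1] + (1 - sylSim(s1[i - 1], s2[j - 1]))      // substitution
        )
      );
    }
    prev = cur;
  }
  return 1 - prev[m] / Math.max(n, m);
}

// Toy syllable similarity: identical = 1, "ba" vs "pa" = 0.9, otherwise 0.
const toySylSim = (a: string, b: string): number => {
  if (a === b) return 1;
  const pair = a < b ? a + "/" + b : b + "/" + a;
  return pair === "ba/pa" ? 0.9 : 0;
};

const nearMatch = wordSimilarity(["ba", "na", "na"], ["pa", "na", "na"], toySylSim);
const truncated = wordSimilarity(["ba", "na"], ["ba"], toySylSim); // 0.5
```

Here `nearMatch` is 1 - 0.1 / 3, since the only cost is one soft substitution, while dropping a whole syllable costs a full 1.0.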

Weighted Component Presets

User-adjustable onset/nucleus/coda weights control the similarity computation:

Weight presets:
- Balanced: onset=0.33, nucleus=0.33, coda=0.33
- Rhymes: onset=0.0, nucleus=0.5, coda=0.5
- Alliteration: onset=1.0, nucleus=0.5, coda=0.0
- Assonance: onset=0.0, nucleus=1.0, coda=0.0
- Consonance: onset=0.5, nucleus=0.0, coda=0.5
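To show how the presets feed the syl_sim formula, the sketch below encodes them as a lookup table and applies the weighted average. The preset values are copied from the list above; the type names, the zero-weight guard, and the component similarities for the example are assumptions.

```typescript
interface ComponentValues {
  onset: number;
  nucleus: number;
  coda: number;
}

// Preset weights as listed in this section.
const PRESETS: Record<string, ComponentValues> = {
  balanced:     { onset: 0.33, nucleus: 0.33, coda: 0.33 },
  rhymes:       { onset: 0.0,  nucleus: 0.5,  coda: 0.5 },
  alliteration: { onset: 1.0,  nucleus: 0.5,  coda: 0.0 },
  assonance:    { onset: 0.0,  nucleus: 1.0,  coda: 0.0 },
  consonance:   { onset: 0.5,  nucleus: 0.0,  coda: 0.5 },
};

// syl_sim: weighted average of the three component similarities.
function syllableSimilarity(w: ComponentValues, sims: ComponentValues): number {
  const total = w.onset + w.nucleus + w.coda;
  if (total === 0) return 0; // assumption: all-zero weights yield no similarity
  return (w.onset * sims.onset + w.nucleus * sims.nucleus + w.coda * sims.coda) / total;
}

// Made-up component similarities for two syllables that share
// nucleus and coda but differ in onset (a rhyming pair).
const rhymingPair: ComponentValues = { onset: 0.3, nucleus: 1, coda: 1 };

const asRhyme = syllableSimilarity(PRESETS["rhymes"], rhymingPair);      // 1
const asBalanced = syllableSimilarity(PRESETS["balanced"], rhymingPair); // 2.3 / 3
```

Dividing by the weight sum means the presets do not need to sum to 1, which is why Balanced (0.99 in total) and Alliteration (1.5 in total) both still produce values in [0, 1].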

Contrastive Sets

Minimal Pairs

Minimal pairs are precomputed at build time. A minimal pair is two words whose transcriptions differ by exactly one phoneme at the same position.
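The minimal-pair test itself is simple to state in code. The sketch below checks one candidate pair of phoneme sequences; it assumes both words must have the same number of phonemes (so "cat" and "at" would not qualify), and the function name is hypothetical.

```typescript
// Returns the index of the single differing phoneme if the two
// transcriptions form a minimal pair, or -1 otherwise.
function minimalPairPosition(a: string[], b: string[]): number {
  if (a.length !== b.length) return -1; // substitution only, same length
  let position = -1;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) {
      if (position !== -1) return -1; // second difference: not minimal
      position = i;
    }
  }
  return position; // stays -1 when the sequences are identical
}

const catBat = minimalPairPosition(["k", "æ", "t"], ["b", "æ", "t"]); // 0
const catCut = minimalPairPosition(["k", "æ", "t"], ["k", "ʌ", "t"]); // 1
const catBad = minimalPairPosition(["k", "æ", "t"], ["b", "æ", "d"]); // -1
```

Returning the position rather than a boolean lets a caller look up which two phonemes contrast, which is what a query for a specific phoneme contrast needs.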

Maximal Opposition (Gierut 1989-1992)

Computed on the fly. Finds phoneme pairs with maximum feature distance, prioritizing:
1. Major class difference (consonant vs. vowel features)
2. Number of differing distinctive features
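The two-level prioritization can be sketched as a sort with a tie-breaker. This is an illustration under several assumptions: features are treated as discrete values (the learned vectors are continuous, so some thresholding would be needed first), which features count as major class is supplied by the caller, and the feature names and toy inventory are made up.

```typescript
type Features = Record<string, number>;

interface Opposition {
  a: string;
  b: string;
  majorClassDiffs: number; // differing features that are major class features
  featureDiffs: number;    // all differing features
}

// Rank every phoneme pair: major class differences first,
// then total number of differing features.
function rankOppositions(
  inventory: Record<string, Features>,
  majorClass: string[]
): Opposition[] {
  const ids = Object.keys(inventory);
  const pairs: Opposition[] = [];
  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      const fa = inventory[ids[i]];
      const fb = inventory[ids[j]];
      let major = 0;
      let total = 0;
      for (const f of Object.keys(fa)) {
        if (fa[f] !== fb[f]) {
          total++;
          if (majorClass.indexOf(f) !== -1) major++;
        }
      }
      pairs.push({ a: ids[i], b: ids[j], majorClassDiffs: major, featureDiffs: total });
    }
  }
  pairs.sort(
    (x, y) =>
      y.majorClassDiffs - x.majorClassDiffs || y.featureDiffs - x.featureDiffs
  );
  return pairs;
}

// Toy inventory with four binary features (+1 / -1).
const ranked = rankOppositions(
  {
    p: { consonantal: 1, sonorant: -1, voice: -1, nasal: -1 },
    b: { consonantal: 1, sonorant: -1, voice: 1, nasal: -1 },
    m: { consonantal: 1, sonorant: 1, voice: 1, nasal: 1 },
  },
  ["consonantal", "sonorant"]
);
// ranked[0] is the p/m pair: one major class difference, three differences overall
```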

Multiple Opposition

Finds minimal sets (triplets/quadruplets) where words differ only in one phoneme position, targeting global phoneme collapse patterns.
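One way to find such sets is to index every word under each "frame" obtained by blanking out one phoneme position; words sharing a frame differ only at that position. The sketch below is a simplified illustration of that idea and does not model the selection of collapse targets; the function name, the frame notation, and the toy lexicon are assumptions.

```typescript
// Group words that share every phoneme except one position.
// Keys are frames such as "_ ɪ p"; values are the words filling the blank.
function oppositionSets(
  lexicon: Record<string, string[]>,
  minSize: number
): Record<string, string[]> {
  const frames: Record<string, string[]> = {};
  for (const word of Object.keys(lexicon)) {
    const phonemes = lexicon[word];
    for (let i = 0; i < phonemes.length; i++) {
      // Replace position i with a blank to form the frame key.
      const frame = phonemes
        .slice(0, i)
        .concat(["_"], phonemes.slice(i + 1))
        .join(" ");
      if (!frames[frame]) frames[frame] = [];
      frames[frame].push(word);
    }
  }

  // Keep only frames with enough members (3 for triplets, 4 for quadruplets).
  const sets: Record<string, string[]> = {};
  for (const frame of Object.keys(frames)) {
    if (frames[frame].length >= minSize) sets[frame] = frames[frame];
  }
  return sets;
}

// Toy lexicon of word -> phoneme sequence.
const toySets = oppositionSets(
  {
    tip: ["t", "ɪ", "p"],
    sip: ["s", "ɪ", "p"],
    ship: ["ʃ", "ɪ", "p"],
    chip: ["tʃ", "ɪ", "p"],
    tap: ["t", "æ", "p"],
  },
  3
);
// toySets has one frame, "_ ɪ p", containing tip, sip, ship, chip
```

Homophones would land in the same frame with the same filler, so a fuller version would deduplicate by the phoneme occupying the blank.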

Performance

Operation	Notes
Word lookup	D1 primary key lookup
Property filter	SQL WHERE with indexed columns
Similarity search	Cold start loads component cache; full vocabulary scan with scalar math
Edge query	D1 indexed lookup on source column
Text analysis	Batch D1 lookup + percentile aggregation

Data Coverage

Vocabulary: 44,011 English words (General American English, CMU primary pronunciations)

Phoneme inventory: 39 phonemes (24 consonants + 15 vowels including diphthongs)

Property coverage varies by dataset:
- 100%: Syllables, phonemes, frequency
- 90%+: WCM, phonotactic probability
- 70-80%: AoA, concreteness
- 40-60%: Imageability, familiarity, valence, arousal, dominance
- 30-50%: Sensorimotor norms, iconicity, BOI, socialness, morphological

Deployment

Cloudflare Workers

# Deploy backend
cd workers
npx wrangler deploy

# Seed D1 database
npx wrangler d1 execute phonolex --file scripts/d1-seed.sql

Development

# Backend
cd workers
npm install
npx wrangler dev

# Frontend
cd webapp/frontend
npm install && npm run dev

Technical Limitations

Dialect Coverage

General American English only (CMU Dictionary primary pronunciations). Not supported: British English, regional dialects, pronunciation variants.

Property Coverage

Not all words have all properties. Coverage varies by dataset (see above). Filters exclude words without the relevant property.

Syllabification

Syllabification is rule-based, using English phonotactic constraints. It may not optimally handle loanwords, proper nouns with unusual structures, or ambisyllabic consonants.

References

Phonological Features:
- Hayes, B. (2009). Introductory Phonology. Wiley-Blackwell.

Similarity:
- Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707-710.

Phonological Complexity:
- Stoel-Gammon, C. (2010). The Word Complexity Measure. Clinical Linguistics & Phonetics, 24(4-5), 271-282.

Psycholinguistic Norms:
- Brysbaert, M., & New, B. (2009). SUBTLEX-US word frequencies. Behavior Research Methods, 41(4), 977-990.
- Kuperman, V., et al. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44, 978-990.
- Brysbaert, M., et al. (2014). Concreteness ratings for 40,000 word lemmas. Behavior Research Methods, 46, 904-911.
- Warriner, A. B., et al. (2013). Norms of valence, arousal, and dominance. Behavior Research Methods, 45, 1191-1207.

Cognitive Association Sources:
- De Deyne, S., et al. (2019). Small World of Words. Behavior Research Methods, 51(3), 987-1006.
- Nelson, D. L., et al. (2004). University of South Florida norms. Behavior Research Methods, 36(3), 402-407.
- Bruni, E., et al. (2014). Multimodal distributional semantics. JAIR, 49, 1-47.
- Marxer, R., et al. (2016). Edinburgh Confusability Corpus.
- Hutchison, K. A., et al. (2013). The Semantic Priming Project. Behavior Research Methods, 45(4), 1099-1114.
- Hill, F., et al. (2015). SimLex-999. Computational Linguistics, 41(4), 665-695.
- Finkelstein, L., et al. (2001). Placing search in context. WWW 2001, 406-414.

Clinical Interventions:
- Gierut, J. A. (1989). Maximal opposition approach. JSHD, 54(1), 9-19.
- Storkel, H. L. (2022). Minimal, Maximal, or Multiple Oppositions. LSHSS, 53(3), 632-645.