Technical Architecture

Technical details on PhonoLex's architecture and computational methods.

System Overview

PhonoLex v4.0 runs a Hono backend on Cloudflare Workers, backed by a D1 database (SQLite at the edge). The React frontend reaches all data through the backend's REST API. Data is consolidated from 15 research datasets and exported to D1 by a build-time pipeline.

React Frontend (TypeScript + MUI)
    ↓ HTTP API calls
Hono Backend (TypeScript, Cloudflare Workers)
    ↓ SQL queries
D1 Database (SQLite at the edge)
    ├── 44,011 word rows (35 filterable properties + 35 percentile columns)
    ├── 1,012,327 edge rows (7 relationship types)
    ├── Phoneme-level dot products (for similarity)
    ├── Precomputed minimal pairs
    └── Phoneme features (26 learned articulatory features)

Consolidated Dataset

PhonoLex consolidates 15 research datasets into a single queryable platform, covering word properties, cognitive associations, and phonological features.

Word Properties

Each word carries properties across 9 categories (35 are filterable via the API; additional derived fields like log_frequency and is_monomorphemic are included in responses):

| Category | Properties | Sources |
| --- | --- | --- |
| Phonological Complexity | syllable_count, phoneme_count, wcm_score | CMU Dict, Stoel-Gammon (2010) |
| Phonotactic Probability | phono_prob_avg, positional_prob_avg | Vitevitch & Luce (2004) |
| Lexical | frequency, log_frequency, contextual_diversity, prevalence, aoa, aoa_kuperman, elp_lexical_decision_rt | SUBTLEX-US, Kuperman et al., ELP |
| Semantic | imageability, familiarity, concreteness, size | Glasgow Norms, Brysbaert et al. |
| Affective | valence, arousal, dominance | Warriner et al. (2013) |
| Cognitive / Embodied | iconicity, boi, socialness | Winter et al., Tillotson et al., Diveica et al. |
| Sensorimotor — Perceptual | auditory, visual, haptic, gustatory, olfactory, interoceptive | Lancaster Norms |
| Sensorimotor — Action | hand_arm, foot_leg, head, mouth, torso | Lancaster Norms |
| Morphological | morpheme_count, is_monomorphemic, n_prefixes, n_suffixes | MorphoLex |

Edge Types

| Edge Type | Edges | Source | Description |
| --- | --- | --- | --- |
| USF | 62,923 | University of South Florida | Forward/backward association norms |
| MEN | 3,000 | MEN Dataset | Human-judged semantic relatedness |
| ECCC | 2,456 | Edinburgh Confusability | Perceptual confusability in noise |
| SPP | 1,546 | Semantic Priming Project | Semantic priming (short/long SOA) |
| SimLex | 998 | SimLex-999 | Human-judged semantic similarity |
| WordSim | 351 | WordSim-353 | Human-judged word relatedness |

Phonological Data

The platform also stores:

- Learned feature vectors: 26-dimensional articulatory feature vectors via Bayesian inference
- Phoneme features: 26 learned articulatory features per phoneme
- Syllable structures: Onset-nucleus-coda decompositions for all words

Vocabulary Filtering

The full dataset has 245,393 entries. The build process filters it down to the 44,011 words that meet all three criteria:

  1. Must have IPA transcription
  2. Must have frequency data (SUBTLEX-US)
  3. Must have at least one psycholinguistic norm value
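The three criteria above can be read as a single predicate applied to every entry. The following is a minimal sketch, not the project's actual pipeline code: the `RawEntry` shape, its field names, and the function names are assumptions made for illustration.

```typescript
// Hypothetical shape of one consolidated entry; all field names are assumptions.
interface RawEntry {
  word: string;
  ipa?: string;                          // IPA transcription, if available
  frequency?: number;                    // SUBTLEX-US frequency, if available
  norms: Record<string, number | null>;  // psycholinguistic norms keyed by property
}

// True only if the entry satisfies all three filtering criteria.
function keepEntry(e: RawEntry): boolean {
  const hasIpa = typeof e.ipa === "string" && e.ipa.length > 0;
  const hasFrequency = typeof e.frequency === "number" && Number.isFinite(e.frequency);
  const hasNorm = Object.values(e.norms).some((v) => v !== null && Number.isFinite(v));
  return hasIpa && hasFrequency && hasNorm;
}

// Apply the predicate to the full dataset.
function filterVocabulary(entries: RawEntry[]): RawEntry[] {
  return entries.filter(keepEntry);
}
```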

Hono Backend (Cloudflare Workers + D1)

Data Pipeline

The build pipeline reads the consolidated research datasets and:

  1. Filters vocabulary (245K to 44K words)
  2. Computes WCM scores (Stoel-Gammon 2010)
  3. Computes percentile ranks for all numeric properties
  4. Extracts syllable components and precomputes phoneme-level similarity data
  5. Precomputes minimal pairs
  6. Computes property ranges (min/max) for filter UI
  7. Outputs the database seed
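Step 3 of the pipeline can be illustrated with a small percentile-rank routine. This sketch assumes one common definition (the percentage of non-missing values that are less than or equal to the value) and leaves missing values as `null`. The actual pipeline may define percentiles or handle ties differently.

```typescript
// Percentile rank of each value among the non-missing values of one property.
// Assumed definition: percentage of non-missing values that are <= the value.
function percentileRanks(values: Array<number | null>): Array<number | null> {
  const sorted = values
    .filter((v): v is number => v !== null)
    .sort((a, b) => a - b);
  const n = sorted.length;

  // Binary search for the number of sorted entries <= x.
  const countAtMost = (x: number): number => {
    let lo = 0;
    let hi = n;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (sorted[mid] <= x) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  };

  // Missing values stay missing; everything else maps into (0, 100].
  return values.map((v) => (v === null ? null : (100 * countAtMost(v)) / n));
}
```

For the toy input `[10, null, 20, 20, 40]`, this yields `[25, null, 75, 75, 100]`: tied values share a rank and the missing entry is passed through.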

Cold Start

On cold start, the similarity engine loads phoneme norms, pairwise phoneme similarity data, and component phoneme sequences into memory. All other routes query D1 directly.

API Endpoints

| Group | Method | Path | Description |
| --- | --- | --- | --- |
| | GET | /api/health | Health check with vocabulary stats |
| | GET | /api/stats | Full statistics |
| | GET | /api/property-ranges | Min/max for all numeric properties |
| | GET | /api/property-metadata | Property definitions, categories, labels |
| | GET | /api/edge-types | Edge type definitions |
| Words | GET | /api/words | Browse vocabulary with pagination and sorting |
| Words | GET | /api/words/{word} | Word details with all properties |
| Words | POST | /api/words/search | Unified search (patterns + filters + exclusions + sorting + pagination) |
| Words | POST | /api/words/batch | Batch word lookup (up to 1000 words) |
| Similarity | POST | /api/similarity/search | Phonological similarity with adjustable weights |
| Associations | GET | /api/associations/{word} | Cognitive associations by edge type |
| Associations | GET | /api/associations/{word}/confusability | ECCC perceptual confusability |
| Associations | GET | /api/associations/compare | Shared associations with Jaccard score |
| Phonemes | GET | /api/phonemes | All phonemes with features |
| Phonemes | GET | /api/phonemes/{ipa} | Single phoneme features (auto-normalizes ASCII "g" to IPA "ɡ") |
| Phonemes | POST | /api/phonemes/compare | Feature-by-feature comparison |
| Phonemes | POST | /api/phonemes/search | Search by feature values |
| Contrastive | POST | /api/contrastive/minimal-pairs | Minimal pairs for phoneme contrast |
| Contrastive | POST | /api/contrastive/maximal-opposition/pairs | Maximal opposition pairs |
| Contrastive | POST | /api/contrastive/maximal-opposition/word-lists | Maximal opposition word lists |
| Contrastive | POST | /api/contrastive/multiple-opposition/targets | Representative target selection |
| Contrastive | POST | /api/contrastive/multiple-opposition/sets | Multiple opposition sets |
| Text | POST | /api/text/analyze | Text analysis with percentiles |
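Two behaviors named in the endpoint descriptions are simple enough to sketch: the Jaccard score reported by /api/associations/compare and the "g" normalization applied by /api/phonemes/{ipa}. The code below uses the standard Jaccard definition and a plain character replacement. The function names are hypothetical, and the backend's actual handling (for example, of empty sets) is not documented here.

```typescript
// Jaccard score between two association sets: |A ∩ B| / |A ∪ B|.
// Assumption: two empty sets score 0 rather than being undefined.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  setA.forEach((w) => {
    if (setB.has(w)) shared++;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Map ASCII "g" (U+0067) to IPA "ɡ" (U+0261) before a phoneme lookup.
function normalizeIpa(symbol: string): string {
  return symbol.replace(/g/g, "\u0261");
}
```

For example, association lists `["a", "b", "c"]` and `["b", "c", "d"]` share two of four distinct words, giving a score of 0.5.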

Property Metadata System

Property definitions are served via GET /api/property-metadata, providing:

- Property ID, label, short label
- Category grouping
- Source dataset
- Scale description and interpretation
- Display format (decimal places, suffix)
- Filter configuration (slider step, log scale, integer)

The frontend loads this once at startup and drives all UI (filter sliders, table columns, text analysis features) from the metadata — no hardcoded property lists in TypeScript.
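To make the metadata-driven approach concrete, here is a hedged sketch of how a frontend could derive slider settings and value formatting from such records. The `PropertyMeta` interface and every field name in it are assumptions modeled on the list above, not the actual response schema of /api/property-metadata.

```typescript
// Hypothetical metadata record; field names are assumptions, not the real schema.
interface PropertyMeta {
  id: string;
  label: string;
  shortLabel: string;
  category: string;
  source: string;
  scale: string;
  display: { decimals: number; suffix?: string };
  filter: { step: number; logScale: boolean; integer: boolean };
}

interface SliderConfig {
  id: string;
  label: string;
  min: number;
  max: number;
  step: number;
  logScale: boolean;
}

// Combine metadata with min/max ranges (as served by /api/property-ranges)
// to derive one slider per property that has a known range.
function buildSliders(
  meta: PropertyMeta[],
  ranges: Record<string, { min: number; max: number }>
): SliderConfig[] {
  return meta
    .filter((m) => ranges[m.id] !== undefined)
    .map((m) => ({
      id: m.id,
      label: m.label,
      min: ranges[m.id].min,
      max: ranges[m.id].max,
      step: m.filter.step,
      logScale: m.filter.logScale,
    }));
}

// Format a value using the metadata's decimal places and optional suffix.
function formatValue(m: PropertyMeta, value: number): string {
  return value.toFixed(m.display.decimals) + (m.display.suffix ?? "");
}
```

Because the UI is derived from data, adding a property to the metadata would surface a new slider and table column without touching component code.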

Phonological Similarity

PhonoLex uses two-level soft Levenshtein distance with precomputed phoneme-level dot products.

Learned Feature Vectors

Each phoneme is represented as a 26-dimensional feature vector learned via Bayesian inference from empirical evidence (acoustic data, confusion corpora, morphological patterns). These replace the original PHOIBLE binary feature assignments with continuous values that better predict perceptual confusion.

Phoneme-Level Precomputation

At build time, the pipeline computes:

- Norm squares (||v||^2) for each of the 45 phonemes (40 monophthongs + 5 diphthongs)
- Pairwise dot products between all phoneme pairs (990 unordered pairs for 45 phonemes)
- Component phoneme sequences (e.g., onset /kr/) preserving cluster structure

No vector averaging -- consonant clusters and diphthongs are preserved as phoneme sequences.
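The precomputation described above can be sketched as follows. With norm squares and pairwise dot products stored, cosine similarity between two phonemes reduces to scalar arithmetic at query time. The table layout, the `"a|b"` key format, and all names are illustrative assumptions, not the project's storage format.

```typescript
type Vec = number[];

interface PhonemeTables {
  normSq: Map<string, number>; // ||v||^2 per phoneme
  dot: Map<string, number>;    // dot product per unordered pair, keyed "a|b"
}

// Order-independent key so (a, b) and (b, a) share one entry.
const pairKey = (a: string, b: string): string => (a < b ? `${a}|${b}` : `${b}|${a}`);

function dotProduct(u: Vec, v: Vec): number {
  let s = 0;
  for (let i = 0; i < u.length; i++) s += u[i] * v[i];
  return s;
}

// Build-time step: one norm square per phoneme, one dot product per unordered pair.
function precompute(features: Record<string, Vec>): PhonemeTables {
  const ids = Object.keys(features);
  const normSq = new Map<string, number>();
  const dot = new Map<string, number>();
  for (let i = 0; i < ids.length; i++) {
    normSq.set(ids[i], dotProduct(features[ids[i]], features[ids[i]]));
    for (let j = i + 1; j < ids.length; j++) {
      dot.set(pairKey(ids[i], ids[j]), dotProduct(features[ids[i]], features[ids[j]]));
    }
  }
  return { normSq, dot };
}

// Query-time step: cosine from the stored scalars only.
function cosine(t: PhonemeTables, a: string, b: string): number {
  if (a === b) return 1;
  const d = t.dot.get(pairKey(a, b));
  const na = t.normSq.get(a);
  const nb = t.normSq.get(b);
  if (d === undefined || na === undefined || nb === undefined) {
    throw new Error(`unknown phoneme: ${a} or ${b}`);
  }
  const denom = Math.sqrt(na * nb);
  return denom === 0 ? 0 : d / denom;
}
```

An inventory of 45 phonemes produces 45 × 44 / 2 = 990 entries in `dot`, which matches the pair count stated above.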

Runtime Similarity

At runtime, similarity is computed with two nested levels of soft Levenshtein distance:

Level 1 — Component similarity (within onset, nucleus, or coda): Soft Levenshtein DP on phoneme sequences. Substitution cost = 1 - cosine(phoneme_a, phoneme_b) using precomputed norms and dot products. This preserves cluster length penalties (e.g., /kr/ vs /k/ onset → cost includes an insertion penalty).
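A minimal sketch of Level 1 follows. The text does not spell out two details, so they are assumptions here: insertions and deletions cost 1.0 (as in Level 2), and the distance is turned into a similarity by dividing by the longer sequence length, with two empty components treated as identical. The `PhonemeSim` callback stands in for the precomputed cosine lookup.

```typescript
// Phoneme similarity callback; assumed to return a cosine in [0, 1].
type PhonemeSim = (a: string, b: string) => number;

// Soft Levenshtein distance between two phoneme sequences (two-row DP).
function componentDistance(x: string[], y: string[], sim: PhonemeSim): number {
  const n = x.length;
  const m = y.length;
  let prev: number[] = Array.from({ length: m + 1 }, (_, j) => j); // row 0: j insertions
  for (let i = 1; i <= n; i++) {
    const cur: number[] = new Array<number>(m + 1);
    cur[0] = i; // i deletions
    for (let j = 1; j <= m; j++) {
      cur[j] = Math.min(
        prev[j] + 1,                                   // deletion
        cur[j - 1] + 1,                                // insertion
        prev[j - 1] + (1 - sim(x[i - 1], y[j - 1]))    // soft substitution
      );
    }
    prev = cur;
  }
  return prev[m];
}

// Assumed normalization: divide by the longer sequence; empty vs. empty is identical.
function compSim(x: string[], y: string[], sim: PhonemeSim): number {
  const longest = Math.max(x.length, y.length);
  return longest === 0 ? 1 : 1 - componentDistance(x, y, sim) / longest;
}
```

Under these assumptions, an onset /kr/ compared with /k/ has distance 1.0 (one insertion) and similarity 0.5, which is the cluster length penalty described above.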

Level 2 — Word similarity (across syllable sequences): Syllable similarity = weighted average of component similarities:

syl_sim = (w_o * compSim(o1,o2) + w_n * compSim(n1,n2) + w_c * compSim(c1,c2)) / (w_o + w_n + w_c)

Soft Levenshtein DP on syllable sequences with substitution cost = 1 - syl_sim:

DP[i][j] = min(
    DP[i-1][j] + 1.0,                       # deletion
    DP[i][j-1] + 1.0,                        # insertion
    DP[i-1][j-1] + (1 - syl_sim(s1_i, s2_j)) # substitution
)

word_similarity = 1 - (DP[n][m] / max(n, m))
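The recurrence above translates directly into code. The sketch below is generic over the syllable representation and takes the syllable similarity as a callback, so it makes no claim about how syllables are stored. Returning 1 for two empty words is an assumption added to avoid division by zero.

```typescript
type SyllableSim<S> = (a: S, b: S) => number;

// Soft Levenshtein over syllable sequences, normalized by the longer word.
function wordSimilarity<S>(s1: S[], s2: S[], sylSim: SyllableSim<S>): number {
  const n = s1.length;
  const m = s2.length;
  if (Math.max(n, m) === 0) return 1; // assumption: two empty words are identical

  // dp[i][j] = distance between the first i syllables of s1 and the first j of s2.
  const dp: number[][] = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(0));
  for (let i = 0; i <= n; i++) dp[i][0] = i;
  for (let j = 0; j <= m; j++) dp[0][j] = j;

  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1.0,                                      // deletion
        dp[i][j - 1] + 1.0,                                      // insertion
        dp[i - 1][j - 1] + (1 - sylSim(s1[i - 1], s2[j - 1]))    // substitution
      );
    }
  }
  return 1 - dp[n][m] / Math.max(n, m);
}
```

With a toy similarity that scores identical syllables 1 and all others 0.5, comparing a three-syllable word with its first two syllables costs one deletion, giving 1 - 1/3 ≈ 0.67.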

Weighted Component Presets

User-adjustable onset/nucleus/coda weights control the similarity computation:

- Balanced: onset=0.33, nucleus=0.33, coda=0.33
- Rhymes: onset=0.0, nucleus=0.5, coda=0.5
- Alliteration: onset=1.0, nucleus=0.5, coda=0.0
- Assonance: onset=0.0, nucleus=1.0, coda=0.0
- Consonance: onset=0.5, nucleus=0.0, coda=0.5
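The presets plug into the syl_sim formula given earlier. Because that formula divides by the sum of the weights, presets do not need to sum to 1 (Balanced sums to 0.99 and Alliteration to 1.5). The sketch below encodes the listed values; the type and constant names are illustrative, and rejecting all-zero weights is an assumption.

```typescript
interface Weights { onset: number; nucleus: number; coda: number; }

type PresetName = "balanced" | "rhymes" | "alliteration" | "assonance" | "consonance";

// Values copied from the preset list above.
const PRESETS: Record<PresetName, Weights> = {
  balanced:     { onset: 0.33, nucleus: 0.33, coda: 0.33 },
  rhymes:       { onset: 0.0,  nucleus: 0.5,  coda: 0.5 },
  alliteration: { onset: 1.0,  nucleus: 0.5,  coda: 0.0 },
  assonance:    { onset: 0.0,  nucleus: 1.0,  coda: 0.0 },
  consonance:   { onset: 0.5,  nucleus: 0.0,  coda: 0.5 },
};

// Component similarities for one syllable pair, each assumed to lie in [0, 1].
interface ComponentSims { onset: number; nucleus: number; coda: number; }

// Weighted average of component similarities, normalized by the weight sum.
function syllableSimilarity(c: ComponentSims, w: Weights): number {
  const total = w.onset + w.nucleus + w.coda;
  if (total <= 0) throw new Error("at least one weight must be positive");
  return (w.onset * c.onset + w.nucleus * c.nucleus + w.coda * c.coda) / total;
}
```

For a syllable pair that matches in nucleus and coda but has onset similarity 0.2, the Rhymes preset returns 1.0 while Alliteration returns about 0.47, which shows how the same pair ranks differently depending on the preset.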

Contrastive Sets

Minimal Pairs

Minimal pairs are precomputed at build time. A minimal pair is two words of equal phoneme length that differ by exactly one phoneme at the same position.
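One way to precompute such pairs without comparing every word against every other is to bucket words by "frame": the phoneme sequence with one position blanked out. Words sharing a frame differ at most in that position. This is an illustrative sketch of that technique, not a description of the actual build script. The lexicon shape and all names are assumptions.

```typescript
interface MinimalPair {
  word1: string;
  word2: string;
  position: number;
  phoneme1: string;
  phoneme2: string;
}

// lexicon maps each word to its phoneme sequence.
function findMinimalPairs(lexicon: Record<string, string[]>): MinimalPair[] {
  // Bucket words by frame: position plus the sequence with that slot blanked.
  const frames = new Map<string, { position: number; words: string[] }>();
  for (const word of Object.keys(lexicon)) {
    const ph = lexicon[word];
    for (let i = 0; i < ph.length; i++) {
      const key = `${i}:${ph.slice(0, i).join(" ")}_${ph.slice(i + 1).join(" ")}`;
      const bucket = frames.get(key);
      if (bucket) bucket.words.push(word);
      else frames.set(key, { position: i, words: [word] });
    }
  }

  // Every two words in a bucket with different slot phonemes form a minimal pair.
  const pairs: MinimalPair[] = [];
  frames.forEach(({ position, words }) => {
    for (let a = 0; a < words.length; a++) {
      for (let b = a + 1; b < words.length; b++) {
        const p1 = lexicon[words[a]][position];
        const p2 = lexicon[words[b]][position];
        if (p1 !== p2) { // skip homophones, which share the slot phoneme
          pairs.push({ word1: words[a], word2: words[b], position, phoneme1: p1, phoneme2: p2 });
        }
      }
    }
  });
  return pairs;
}
```

Words of different lengths never share a frame, so a pair such as "cat" and "cats" is excluded automatically.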

Maximal Opposition (Gierut 1989-1992)

Computed on-the-fly. Finds phoneme pairs with maximum feature distance, prioritizing:

1. Major class difference (consonant vs. vowel features)
2. Number of differing distinctive features
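The two-level priority can be expressed as a sort with a primary and a secondary key. The sketch below is a simplification under stated assumptions: feature values are treated as discrete (for example +1 and -1), whereas the learned vectors are continuous and would need a threshold or a distance measure. The set of "major class" feature names is passed in by the caller rather than fixed, since the source does not list them.

```typescript
type FeatureMap = Record<string, number>;

interface Opposition {
  a: string;
  b: string;
  majorClassDiffs: number; // differing features that belong to the major-class set
  featureDiffs: number;    // all differing features
}

// Rank all phoneme pairs: major class differences first, then total differences.
function rankOppositions(phonemes: Record<string, FeatureMap>, majorClass: string[]): Opposition[] {
  const ids = Object.keys(phonemes);
  const out: Opposition[] = [];
  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      const fa = phonemes[ids[i]];
      const fb = phonemes[ids[j]];
      const differing = Object.keys(fa).filter((f) => fa[f] !== fb[f]);
      out.push({
        a: ids[i],
        b: ids[j],
        majorClassDiffs: differing.filter((f) => majorClass.indexOf(f) !== -1).length,
        featureDiffs: differing.length,
      });
    }
  }
  // Both keys descending; the second key only breaks ties on the first.
  return out.sort(
    (x, y) => y.majorClassDiffs - x.majorClassDiffs || y.featureDiffs - x.featureDiffs
  );
}
```

The first element of the returned list is the most maximal opposition under this ordering; a caller could then look up word pairs that contrast those two phonemes.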

Multiple Opposition

Finds minimal sets (triplets/quadruplets) where words differ only in one phoneme position, targeting global phoneme collapse patterns.
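As an illustration, suppose a phoneme collapse where one substitute sound stands in for several target sounds. A multiple opposition set then contains a word with the substitute plus words with at least two of the targets, all sharing the same frame. The sketch below reuses the frame-bucketing idea. Its parameters (`substitute`, `targets`, `minTargets`) and the output shape are hypothetical and are not the API's request or response format.

```typescript
interface OppositionSet {
  position: number;
  words: Record<string, string>; // phoneme in the contrast slot -> example word
}

function multipleOppositionSets(
  lexicon: Record<string, string[]>,
  substitute: string,
  targets: string[],
  minTargets = 2
): OppositionSet[] {
  const wanted = new Set<string>([substitute, ...targets]);
  const frames = new Map<string, OppositionSet>();

  for (const word of Object.keys(lexicon)) {
    const ph = lexicon[word];
    ph.forEach((p, i) => {
      if (!wanted.has(p)) return; // only slots holding a sound of interest
      const key = `${i}:${ph.slice(0, i).join(" ")}_${ph.slice(i + 1).join(" ")}`;
      let set = frames.get(key);
      if (!set) {
        set = { position: i, words: {} };
        frames.set(key, set);
      }
      if (!(p in set.words)) set.words[p] = word; // keep the first word per phoneme
    });
  }

  // Keep frames that contain the substitute and enough distinct targets.
  const result: OppositionSet[] = [];
  frames.forEach((set) => {
    const nTargets = targets.filter((t) => t in set.words).length;
    if (substitute in set.words && nTargets >= minTargets) result.push(set);
  });
  return result;
}
```

With `minTargets = 2` the smallest result is a triplet (substitute plus two targets); three matching targets give a quadruplet.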

Performance

| Operation | Notes |
| --- | --- |
| Word lookup | D1 primary key lookup |
| Property filter | SQL WHERE with indexed columns |
| Similarity search | Cold start loads component cache; full vocabulary scan with scalar math |
| Edge query | D1 indexed lookup on source column |
| Text analysis | Batch D1 lookup + percentile aggregation |

Data Coverage

Vocabulary: 44,011 English words (General American English, CMU primary pronunciations)

Phoneme inventory: 39 phonemes (24 consonants + 15 vowels including diphthongs)

Property coverage varies by dataset:

- 100%: Syllables, phonemes, frequency
- 90%+: WCM, phonotactic probability
- 70-80%: AoA, concreteness
- 40-60%: Imageability, familiarity, valence, arousal, dominance
- 30-50%: Sensorimotor norms, iconicity, BOI, socialness, morphological

Deployment

Cloudflare Workers

# Deploy backend
cd workers
npx wrangler deploy

# Seed D1 database
npx wrangler d1 execute phonolex --file scripts/d1-seed.sql

Development

# Backend
cd workers
npm install
npx wrangler dev

# Frontend
cd webapp/frontend
npm install && npm run dev

Technical Limitations

Dialect Coverage

General American English only (CMU Dictionary primary pronunciations). Not supported: British English, regional dialects, pronunciation variants.

Property Coverage

Not all words have all properties. Coverage varies by dataset (see above). Filters exclude words without the relevant property.

Syllabification

Rule-based English phonotactic constraints. May not optimally handle loanwords, proper nouns with unusual structures, or ambisyllabic consonants.

References

Phonological Features:

- Hayes, B. (2009). Introductory Phonology. Wiley-Blackwell.

Similarity:

- Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707-710.

Phonological Complexity:

- Stoel-Gammon, C. (2010). The Word Complexity Measure. Clinical Linguistics & Phonetics, 24(4-5), 271-282.

Psycholinguistic Norms:

- Brysbaert, M., & New, B. (2009). SUBTLEX-US word frequencies. Behavior Research Methods, 41(4), 977-990.
- Kuperman, V., et al. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44, 978-990.
- Brysbaert, M., et al. (2014). Concreteness ratings for 40,000 word lemmas. Behavior Research Methods, 46, 904-911.
- Warriner, A. B., et al. (2013). Norms of valence, arousal, and dominance. Behavior Research Methods, 45, 1191-1207.

Cognitive Association Sources:

- De Deyne, S., et al. (2019). Small World of Words. Behavior Research Methods, 51(3), 987-1006.
- Nelson, D. L., et al. (2004). University of South Florida norms. Behavior Research Methods, 36(3), 402-407.
- Bruni, E., et al. (2014). Multimodal distributional semantics. JAIR, 49, 1-47.
- Marxer, R., et al. (2016). Edinburgh Confusability Corpus.
- Hutchison, K. A., et al. (2013). The Semantic Priming Project. Behavior Research Methods, 45(4), 1099-1114.
- Hill, F., et al. (2015). SimLex-999. Computational Linguistics, 41(4), 665-695.
- Finkelstein, L., et al. (2001). Placing search in context. WWW 2001, 406-414.

Clinical Interventions:

- Gierut, J. A. (1989). Maximal opposition approach. JSHD, 54(1), 9-19.
- Storkel, H. L. (2022). Minimal, Maximal, or Multiple Oppositions. LSHSS, 53(3), 632-645.