Data & Methods
PhonoLex is a phonological analysis and corpus-retrieval platform for speech-language pathologists (SLPs) working on phonological targets, with a broader REST API for phonological research and language-education tooling. This page explains what's in the database, where it comes from, and how the computational methods work.
Overview
PhonoLex provides five categories of data:
- Lexicon: ~125K CMU-phonology entries; the ~47K canonical content-POS subset (NOUN / VERB / ADJ / ADV) carries the full ~150-column norm set.
- Word properties: ~150 in-house psycholinguistic + phonological columns derived from PhonoLex pipelines, with original-author papers cited as methodological anchors for the scales we adopt.
- Word similarity graph: ~1.63M edges (~99.8% Qwensim neural-embedding cosine over FineWeb-Edu + ~2.5K ECCC perceptual confusability + ~351 WordSim-353 human-rated similarity).
- Minimal-pair lexicon: 642K precomputed minimal pairs with learned-feature distance + sonorant-difference metrics.
- Curated corpus: ~236K naturalistic English sentences (CoLA, UD-EWT, GUM, Tatoeba, OpenSubtitles) gated at build time for SLP suitability.
All data is keyed on General American English pronunciations from the CMU Pronouncing Dictionary (primary pronunciations only).
The in-house norm columns are derived locally via gpt-4.1-mini cloze-prompt ratings adapting the scales of published norm sets, validated against held-out oracles. This replaces the original-author CSVs that PhonoLex used to redistribute — those CSVs had licensing terms that didn't sit cleanly with proprietary deployment.
The LLM-cloze methodology is grounded in two validation papers showing that LLM-derived psycholinguistic ratings correlate r = .74–.95 with human raters and outperform human raters on downstream prediction tasks:
- Martínez, G., Conde, J., Reviriego, P., & Brysbaert, M. (2025). Using Large Language Models to Generate Psycholinguistic Norms. Behavior Research Methods (in press).
- Brysbaert, M. (2024). Validating LLM-derived psycholinguistic ratings for word stimuli. Memory & Cognition.
Word properties
PhonoLex assigns up to ~150 columns to each word. Coverage varies by column — the canonical content-POS subset (~47K words) carries the full set; non-canonical entries (proper nouns, function words) carry phonology only.
Phonological complexity
Structural complexity of a word's sound form — central to clinical decision-making because phonological complexity predicts production difficulty in children with speech sound disorders.
| Property | What it measures | Scale | Source |
|---|---|---|---|
| syllable_count | Number of syllables | 1–8 | Computed from CMU |
| phoneme_count | Number of speech sounds | 1–15+ | Computed from CMU |
| cv_shape | Syllable structure as a CV string (CVC, CCVCC, CV-CVC, ...) | categorical | Computed from CMU |
| wcm_score | Composite Word Complexity Measure (clusters, velars, fricatives, liquids, stress pattern, word-final consonants) | 0–20+ | Method: Stoel-Gammon (2010); computed locally |
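As a rough illustration of how the simpler columns can be read off a CMU pronunciation, the sketch below counts syllables and phonemes and builds a flat CV skeleton. It relies only on the CMUdict convention that vowels carry a stress digit. The function names are hypothetical, and the sketch does not insert the syllable-boundary hyphens that cv_shape shows (e.g., CV-CVC), since that requires a syllabifier this page does not describe.

```python
def is_vowel(phone: str) -> bool:
    """CMUdict marks every vowel with a trailing stress digit (0, 1, or 2)."""
    return phone[-1].isdigit()


def syllable_count(phones: list[str]) -> int:
    """One syllable per vowel nucleus."""
    return sum(1 for p in phones if is_vowel(p))


def phoneme_count(phones: list[str]) -> int:
    """Number of segments in the pronunciation."""
    return len(phones)


def flat_cv_string(phones: list[str]) -> str:
    """CV skeleton without syllable boundaries (assumption: no hyphens)."""
    return "".join("V" if is_vowel(p) else "C" for p in phones)


# CMUdict ARPAbet for "computer"
computer = "K AH0 M P Y UW1 T ER0".split()
print(syllable_count(computer), phoneme_count(computer), flat_cv_string(computer))
# prints: 3 8 CVCCCVCV
```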
Phonotactic probability
How common a word's sound patterns are in English. Children learn high-probability sound sequences earlier and produce them more accurately.
| Property | What it measures | Scale | Source |
|---|---|---|---|
| phono_prob_avg | Average biphone probability | 0–1 | Method: Vitevitch & Luce (2004); computed from CMU |
| positional_prob_avg | Positional segment probability (per syllable position) | 0–1 | Method: Vitevitch & Luce (2004); computed from CMU |
| str_phono_prob_avg, str_positional_prob_avg | Stressed-syllable variants | 0–1 | Same |
| neighborhood_density, str_neighborhood_density | Number of one-phoneme-substitution neighbors in the lexicon | counts | Computed from CMU |
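The neighborhood-density definition in the table (one-phoneme-substitution neighbors) can be sketched with a wildcard index, which avoids comparing every pair of words. This is a minimal illustration, not the PhonoLex pipeline: the function name and toy lexicon are hypothetical, stress digits are assumed to have been stripped, and homophones are assumed not to count as neighbors.

```python
from collections import defaultdict


def neighborhood_density(lexicon: dict[str, tuple[str, ...]]) -> dict[str, int]:
    """Count substitution-only neighbors for every word in the lexicon."""
    # Bucket every word under each of its "one position wildcarded" patterns.
    buckets: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for word, phones in lexicon.items():
        for i in range(len(phones)):
            buckets[phones[:i] + ("*",) + phones[i + 1:]].add(word)

    density = {}
    for word, phones in lexicon.items():
        neighbors = set()
        for i in range(len(phones)):
            for other in buckets[phones[:i] + ("*",) + phones[i + 1:]]:
                # Skip the word itself and homophones (zero substitutions).
                if lexicon[other] != phones:
                    neighbors.add(other)
        density[word] = len(neighbors)
    return density


# Hypothetical toy lexicon with stress digits removed.
toy_lexicon = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cap": ("K", "AE", "P"),
    "cut": ("K", "AH", "T"),
    "dog": ("D", "AO", "G"),
}
print(neighborhood_density(toy_lexicon))
```

Because only same-length pronunciations can share a wildcard pattern, additions and deletions are excluded by construction, which matches the substitution-only wording above.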
Lexical frequency
| Property | What it measures | Scale | Source |
|---|---|---|---|
| frequency | General-corpus words-per-million | 0–80,000+ | PhonoLex derivation from FineWeb-Edu (~800M tokens, ~1M docs) |
| log_frequency | Log-Zipf normalization of frequency | 0–10 | Same |
| contextual_diversity | Document-level diversity proxy | 0–1 | Same |
| freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13 | CYP-LEX child-corpus age bands (7-9, 10-12, 13+ years) | wpm | Korochkina et al. (2024), CC BY 4.0 |
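For readers unfamiliar with Zipf-style normalization, the sketch below converts a words-per-million value to the commonly used Zipf scale (log10 of frequency per million, plus 3). Whether log_frequency uses exactly this formula is an assumption; the function name and the choice to return None for unattested words are also illustrative.

```python
import math
from typing import Optional


def zipf_from_wpm(wpm: float) -> Optional[float]:
    """Zipf value = log10(words per million) + 3 (i.e., log10 per billion)."""
    if wpm <= 0:
        # Assumption: unattested words get no value rather than -infinity.
        return None
    return math.log10(wpm) + 3.0


for wpm in (0.0, 1.0, 1000.0):
    print(wpm, zipf_from_wpm(wpm))
```

Under this convention a word occurring once per million tokens sits at 3, and a word occurring 1,000 times per million sits near 6.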
Developmental frequency
Aggregated from the CHILDES and PhonBank (PB) TalkBank corpora. Since the 2026-05-25 rework, these columns are built from child production channels rather than caregiver input, so freq_age_2y answers "what do 2-year-olds actually say" rather than "what do adults say to them."
| Property | Source | Note |
|---|---|---|
| freq_age_2y | PB + CHILDES prod_12-24mo + prod_24-36mo mean | child production |
| freq_age_5y | PB + CHILDES prod_36-48mo + prod_48-72mo mean | child production |
| freq_age_8y | CHILDES prod_72-108mo | child production |
| freq_age_12y | CHILDES prod_108-144mo | child production |
| freq_age_all | Aliases frequency (FineWeb-Edu general) | general-corpus reference at the top of the developmental ladder |
Source: MacWhinney (2000) CHILDES + Rose & MacWhinney (2014) PhonBank; PhonoLex derivation.
Lexical timing
| Property | What it measures | Scale | Source |
|---|---|---|---|
| aoa | Age at which a word is typically learned | 1–7 (age-band anchored: 1≈0-2y, 4≈7-8y, 7≈13y+) | PhonoLex in-house: gpt-4.1-mini cloze with logprob expected-value extraction. Validated Spearman 0.868 vs Glasgow Norms (N=5,551), Pearson 0.816 vs Kuperman (held-out N=500). |
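The "logprob expected-value extraction" named in the table can be illustrated as follows: rather than taking the single most likely rating token, the rating is the probability-weighted mean over all in-scale rating tokens. This is a sketch under stated assumptions: the input is a plain dict of token to log-probability (however it was obtained from the model), the function name is hypothetical, and renormalizing over in-scale tokens is an illustrative choice rather than a documented PhonoLex detail.

```python
import math
from typing import Optional


def expected_rating(token_logprobs: dict[str, float], low: int = 1, high: int = 7) -> Optional[float]:
    """Probability-weighted mean rating over tokens that are valid scale points."""
    allowed = {str(r): r for r in range(low, high + 1)}
    mass: dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        rating = allowed.get(token.strip())
        if rating is not None:
            # Merge variants such as "3" and " 3" onto the same scale point.
            mass[rating] = mass.get(rating, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        return None  # no in-scale token among the candidates
    # Renormalize so off-scale tokens do not drag the estimate toward zero.
    return sum(rating * p for rating, p in mass.items()) / total


# Hypothetical candidate tokens for one cloze prompt.
example = {"2": math.log(0.6), "3": math.log(0.2), "the": math.log(0.2)}
print(expected_rating(example))  # approximately 2.25
```

The expected value yields a continuous score even though the model only ever emits integer rating tokens, which is what makes rank correlations against human norms meaningful at fine granularity.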
Semantic
All derived locally via gpt-4.1-mini cloze-prompt over ~47K non-PROPN content words. Methodology adapts the published scale referenced in each row's source.
| Property | What it measures | Scale | Source / scale anchor |
|---|---|---|---|
| concreteness | Concrete (tangible) vs. abstract | 1–5 | PhonoLex in-house; Brysbaert et al. (2014) scale |
| familiarity | How familiar the word feels | 1–7 | PhonoLex in-house |
| boi | Body-object interaction strength | 1–7 | PhonoLex in-house; Pexman et al. (2019) scale |
| iconicity | Sound-meaning correspondence | -5 to +5 | PhonoLex in-house; Winter et al. (2024) scale |
| socialness | Degree of social content | 1–7 | PhonoLex in-house; Diveica et al. (2023) scale |
| semantic_diversity, semd_topic, semd_vn, semd_h13, n_topics_for_word | Semantic diversity / topic statistics | various | PhonoLex in-house; Hoffman et al. (2013) scale |
Affective
| Property | What it measures | Scale | Source |
|---|---|---|---|
| valence | Emotional positivity (1 = negative, 9 = positive) | 1–9 | PhonoLex (gpt-4.1-mini cloze; Warriner et al. (2013) scale; Spearman 0.836 vs held-out Warriner oracle for valence on N=500 pilot) |
| arousal | Emotional intensity (1 = calm, 9 = excited) | 1–9 | Same |
The Warriner dominance (D) axis was not re-derived in the PhonoLex in-house build. For similar retirement framing, see the freq_age_adult → freq_age_all migration note in CHANGELOG v5.2.1.
Morphological
| Property | What it measures | Source |
|---|---|---|
| morpheme_count, n_prefixes, n_suffixes, is_monomorphemic | Algorithmic morpheme decomposition | In-house algorithmic morphology + MorphyNet (Batsuren et al., 2021, CC BY-SA 3.0) |
Percentiles
Every numeric column has a parallel {column}_percentile (0–100) for cross-property comparison and for the percentile-mode UI filters. Frequency-class properties (anything starting with freq_) treat value = 0 as NULL when computing percentiles — a word that never occurs in the source corpus shouldn't cluster at a misleading mid-rank.
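The zero-as-NULL rule for frequency-class columns can be sketched as below. The percentile definition used here (share of usable values less than or equal to the word's value) is an assumption, since this page does not specify how ties or ranks are computed, and the function name is hypothetical.

```python
from bisect import bisect_right
from typing import Optional


def percentile_column(values: list[Optional[float]], column: str) -> list[Optional[float]]:
    """Return a 0-100 percentile per row, or None where the value is treated as NULL."""

    def usable(value: Optional[float]) -> bool:
        if value is None:
            return False
        # Frequency-class columns treat 0 as "not attested", i.e. NULL.
        if column.startswith("freq_") and value == 0:
            return False
        return True

    ranked = sorted(v for v in values if usable(v))
    return [
        100.0 * bisect_right(ranked, v) / len(ranked) if usable(v) else None
        for v in values
    ]


raw = [0, 10, 20, 0, 40]
print(percentile_column(raw, "freq_age_2y"))  # zeros become None
print(percentile_column(raw, "aoa"))          # zeros are ranked like any value
```

With the zeros excluded, the three attested words spread across the full range instead of being pushed upward by a block of tied zeros.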
Phonological similarity
Phonological similarity in PhonoLex uses a method that respects internal syllable structure. Each word is decomposed into syllables, and each syllable into onset (initial consonants), nucleus (vowel), and coda (final consonants).
How it works
Two levels:
- Within each syllable component, compare the phoneme sequences using soft Levenshtein distance. Substitution cost between two phonemes is based on how articulatorily similar they are — phonemes sharing many features (e.g., /p/ and /b/, differing only in voicing) have low substitution cost; phonemes differing in many features (e.g., /s/ and /m/) have high cost. Consonant clusters are handled naturally — comparing a /kr/ onset to a /k/ onset incurs an appropriate length penalty.
- Across syllables, compute a weighted average of onset, nucleus, and coda similarity for each syllable pair, then run another soft Levenshtein over the syllable sequences, which handles words of different lengths gracefully.
Phoneme similarity comes from learned 26-d Bayesian feature vectors (packages/features/): theory-assigned articulatory features (Hayes 2009) serve as Dirichlet priors over feature posteriors; ECCC perceptual confusion (Marxer et al., 2016) + Hillenbrand acoustic measurements (Hillenbrand et al., 1995) serve as evidence. Validation: r=0.987 cosine correlation against the theory-assigned feature inventory at convergence.
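The within-component step described above can be sketched as a Levenshtein dynamic program whose substitution cost depends on feature overlap. The feature table here is a hypothetical four-feature toy (not the learned 26-d vectors), the cost is the fraction of differing features, and insertions and deletions are assumed to cost 1.0; all names are illustrative.

```python
# Hypothetical toy features: (voiced, nasal, continuant, labial).
FEATURES = {
    "p": (0, 0, 0, 1),
    "b": (1, 0, 0, 1),
    "m": (1, 1, 0, 1),
    "s": (0, 0, 1, 0),
    "k": (0, 0, 0, 0),
    "r": (1, 0, 1, 0),
}


def sub_cost(a: str, b: str) -> float:
    """Fraction of features on which two phonemes differ (0.0 to 1.0)."""
    if a == b:
        return 0.0
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def soft_levenshtein(seq_a: list[str], seq_b: list[str], indel: float = 1.0) -> float:
    """Edit distance with graded substitution costs."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel
    for j in range(1, cols):
        dist[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel,                                  # deletion
                dist[i][j - 1] + indel,                                  # insertion
                dist[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]),  # substitution
            )
    return dist[-1][-1]


def component_similarity(seq_a: list[str], seq_b: list[str]) -> float:
    """Map distance to a 0-1 similarity by normalizing on the longer sequence."""
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 1.0  # two empty onsets (or codas) are identical
    return 1.0 - soft_levenshtein(seq_a, seq_b) / longest


print(component_similarity(["p"], ["b"]))       # 0.75: differ only in voicing
print(component_similarity(["s"], ["m"]))       # 0.0: differ on every toy feature
print(component_similarity(["k", "r"], ["k"]))  # 0.5: length penalty for the cluster
```

The same routine could be reused at the syllable level by swapping the phoneme substitution cost for one minus a syllable-pair similarity.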
Component weights
Adjustable onset / nucleus / coda weights let the same algorithm surface different relationship types:
| Preset | Onset | Nucleus | Coda | What it finds |
|---|---|---|---|---|
| Balanced | 0.33 | 0.33 | 0.33 | Overall sound similarity |
| Rhymes | 0.0 | 0.5 | 0.5 | Rhyming words (matching vowel + ending) |
| Alliteration | 1.0 | 0.5 | 0.0 | Words with similar onsets |
| Assonance | 0.0 | 1.0 | 0.0 | Words with matching vowel sounds |
| Consonance | 0.5 | 0.0 | 0.5 | Words with matching consonant frame |
Surfaced as the "Sound Similarity" rule inside Custom Word Lists, and used by the Lookup tool's similar-words panel.
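To make the presets concrete, the sketch below combines per-component similarities for one syllable pair using the weights from the table. Dividing by the sum of the weights is an assumption made so that every preset returns a value between 0 and 1; the page does not state whether or how weights are normalized. The function and dictionary names are hypothetical.

```python
# Weights copied from the preset table: (onset, nucleus, coda).
PRESETS = {
    "balanced": (0.33, 0.33, 0.33),
    "rhymes": (0.0, 0.5, 0.5),
    "alliteration": (1.0, 0.5, 0.0),
    "assonance": (0.0, 1.0, 0.0),
    "consonance": (0.5, 0.0, 0.5),
}


def syllable_similarity(onset: float, nucleus: float, coda: float, preset: str = "balanced") -> float:
    """Weighted average of component similarities (assumption: weights renormalized)."""
    w_onset, w_nucleus, w_coda = PRESETS[preset]
    total = w_onset + w_nucleus + w_coda
    return (w_onset * onset + w_nucleus * nucleus + w_coda * coda) / total


# Illustrative component scores for a pair that shares vowel and coda
# but has different onsets.
for name in PRESETS:
    print(name, round(syllable_similarity(0.75, 1.0, 1.0, preset=name), 3))
```

With these inputs the rhymes and assonance presets score the pair as identical because they ignore the onset entirely, while the alliteration preset scores it lowest.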
Interpreting similarity scores
Scores range from 0.0 (completely different) to 1.0 (identical):
- 0.90+: near-identical (e.g., cat/bat, perfect rhymes)
- 0.75–0.85: strong resemblance (e.g., computer/commuter)
- 0.40–0.60: moderate
- 0.20–0.30: low
Word similarity graph
The edges D1 table holds ~1.6M edges from three sources:
| Source | Edges | What it measures |
|---|---|---|
| Qwensim | ~1,627,000 | Neural-embedding cosine similarity over FineWeb-Edu via Qwen3-Embedding-0.6B. Bulk of the graph; semantic similarity from a sentence-transformer, not free-association norm data. |
| ECCC | ~2,456 | Perceptual confusability in noise (Marxer et al., 2016, CC BY 4.0) |
| WordSim-353 | ~351 | Human-judged semantic relatedness (Finkelstein et al., 2001) |
The "word association" framing the older data layer used (USF, MEN, SPP, SimLex, SWOW) is retired — those datasets were removed during the licensing audit and replaced by the much larger Qwensim layer. Honest framing for SLP-facing UX: "neighboring words via Qwensim semantic similarity."
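As a small illustration of what an embedding-cosine edge is, the sketch below computes cosine similarity between toy vectors and keeps each word's nearest neighbor. The three-dimensional vectors are invented for the example (they are not Qwen3-Embedding-0.6B outputs), and the top-k selection rule is an assumption; this page does not say how Qwensim edges are thresholded or selected.

```python
import math


def cosine(u: tuple[float, ...], v: tuple[float, ...]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor_edges(embeddings: dict[str, tuple[float, ...]], k: int = 1):
    """Return (word, neighbor, similarity) for each word's k most similar words."""
    edges = []
    for word, vec in embeddings.items():
        scored = sorted(
            ((cosine(vec, other_vec), other)
             for other, other_vec in embeddings.items() if other != word),
            reverse=True,
        )
        edges.extend((word, other, round(sim, 3)) for sim, other in scored[:k])
    return edges


# Hypothetical toy embeddings.
toy_embeddings = {
    "dog": (0.9, 0.1, 0.0),
    "puppy": (0.8, 0.2, 0.1),
    "car": (0.0, 0.1, 0.9),
}
for edge in nearest_neighbor_edges(toy_embeddings):
    print(edge)
```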
Contrastive intervention data
Three evidence-based contrastive approaches. All three are available at the word level (Contrast Sets tool). Minimal pair and maximal opposition are also available at the sentence level (Sentences tool); multiple opposition is word-level only, because a single attested sentence almost never witnesses one substitute against several target phonemes at once.
Minimal pairs
Two words differing by exactly one phoneme in the same position (cat / bat, pat / pan). PhonoLex precomputes 642K minimal pairs across the lexicon. Filter by phoneme contrast and position (initial / medial / final / any). Each pair carries a feature_distance (continuous L2 over learned vectors) and a sonorant_diff (boolean — whether the contrast crosses the sonorant class).
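The minimal-pair definition and the position filter can be sketched as below. This brute-force version compares every pair of words, which is fine for a toy lexicon but would need an index to reach the scale quoted above. The function names and toy lexicon are hypothetical, stress digits are assumed stripped, and feature_distance and sonorant_diff are not computed here.

```python
from itertools import combinations


def single_contrast(a: tuple[str, ...], b: tuple[str, ...]):
    """Return (index, phoneme_a, phoneme_b) if a and b differ in exactly one slot."""
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    return i, a[i], b[i]


def position_label(index: int, length: int) -> str:
    if index == 0:
        return "initial"
    if index == length - 1:
        return "final"
    return "medial"


def minimal_pairs(lexicon: dict[str, tuple[str, ...]], position: str = "any"):
    """List (word1, word2, phoneme1, phoneme2, position) for every minimal pair."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        hit = single_contrast(p1, p2)
        if hit is None:
            continue
        index, ph1, ph2 = hit
        label = position_label(index, len(p1))
        if position in ("any", label):
            pairs.append((w1, w2, ph1, ph2, label))
    return pairs


toy_words = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "pat": ("P", "AE", "T"),
    "pan": ("P", "AE", "N"),
    "dog": ("D", "AO", "G"),
}
print(minimal_pairs(toy_words, position="final"))
# prints: [('pan', 'pat', 'N', 'T', 'final')]
```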
Maximal opposition (Gierut, 1989)
Minimal pair where the contrasting phonemes also differ in major class (typically obstruent vs. sonorant). The theory: targeting maximally different contrasts promotes broader generalization across the phonological system. PhonoLex surfaces this by filtering minimal pairs on sonorant_diff >= threshold.
Multiple opposition (Williams, 2000)
Targets phoneme collapse — when a child substitutes one sound for multiple different target sounds (e.g., /t/ for /k/, /s/, /ʃ/). Treatment selects a set of words that all contrast against the substitute at the same position, so the child must differentiate multiple contrasts simultaneously.
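Selecting a multiple-opposition set can be sketched as follows: for each word containing the substitute phoneme at the chosen position, swap in each target phoneme and keep the word only if every swapped form is also a real word. The function name and toy lexicon are hypothetical, ARPAbet symbols with stress removed are assumed, and homophones are collapsed to one spelling for simplicity.

```python
def multiple_opposition_sets(
    lexicon: dict[str, tuple[str, ...]],
    substitute: str,
    targets: list[str],
    index: int = 0,
):
    """Return (substitute_word, {target_phoneme: target_word}) for complete sets only."""
    by_phones = {phones: word for word, phones in lexicon.items()}
    sets = []
    for word, phones in sorted(lexicon.items()):
        if len(phones) <= index or phones[index] != substitute:
            continue
        partners = {}
        for target in targets:
            candidate = phones[:index] + (target,) + phones[index + 1:]
            if candidate in by_phones:
                partners[target] = by_phones[candidate]
        # Keep the set only if every target phoneme has an attested word.
        if len(partners) == len(targets):
            sets.append((word, partners))
    return sets


# Toy lexicon for a child who produces /t/ for /k/, /s/, and /ʃ/ word-initially.
toy_targets = {
    "tea": ("T", "IY"), "key": ("K", "IY"), "sea": ("S", "IY"), "she": ("SH", "IY"),
    "toe": ("T", "OW"), "so": ("S", "OW"), "show": ("SH", "OW"),
    "two": ("T", "UW"), "coo": ("K", "UW"),
}
print(multiple_opposition_sets(toy_targets, "T", ["K", "S", "SH"]))
# prints: [('tea', {'K': 'key', 'S': 'sea', 'SH': 'she'})]
```

Requiring a complete set is a strict choice made for the example; a clinical tool might instead rank partial sets by how many targets they cover.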
Corpus
The Sentences tool draws from ~236K curated naturalistic English sentences. Build pipeline gates (in order):
- Unicode normalize + char whitelist + length bounds + repetition entropy
- Punctuation balance + terminator + clause repetition + no-garbage URL/citation/HTML
- Letter-spelled-word rejection (R-o-b-a-r-d patterns)
- Vocab coverage (every alpha token must have CMU phonology)
- Profanity filter (better-profanity)
- SLP content-suitability gate (in-house V/A norms + AFINN NEG_3/4/5 buckets; drops violence/conflict)
- Verbal-filler rejection (uh) + Spanish-loanword denylist
- spaCy parse + single-ROOT verb-headed validation + parataxis rejection
- Obscure-PROPN rejection (non-canonical PROPN with FineWeb-Edu freq < 5/million)
- PROPN cap of 2 per sentence
- spaCy contraction-glue (stem + suffix → whole-word don't/won't/it's)
- Coverage-aware rarity-driven dedup + cross-source identical-text merge
Sources currently feeding the corpus: CoLA, UD English-EWT, GUM, Tatoeba, OpenSubtitles. CHILDES + PhonBank conversational transcripts were retired 2026-05-25 — the CHAT-transcript cleanup couldn't reliably distinguish locally-plausible fragments from genuine sentences. Sentences are ranked by per-query match_count first (multi-hit > single-hit), then by static rarity_score (coverage-aware sum over satisfied phoneme-position + bigram + top-50 CV-shape constraints of 1 / n_sentences[C]).
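The two-key ranking can be sketched as below: rarity_score sums 1 / n_sentences[C] over the constraints a sentence satisfies, and results sort by match_count first, then rarity_score. Sorting both keys in descending order is an assumption, as are the constraint labels, field names, and function names; the "coverage-aware" adjustment is not modeled.

```python
def rarity_score(satisfied: set[str], n_sentences: dict[str, int]) -> float:
    """Sum of 1 / n_sentences[C] over satisfied constraints C."""
    return sum(1.0 / n_sentences[c] for c in satisfied if n_sentences.get(c, 0) > 0)


def rank_sentences(candidates: list[dict]) -> list[dict]:
    """Multi-hit sentences first, then rarer constraint coverage first."""
    return sorted(candidates, key=lambda s: (-s["match_count"], -s["rarity_score"]))


# Hypothetical corpus-wide counts: how many sentences satisfy each constraint.
counts = {"k_initial": 4, "cv_shape:CVC": 2, "bigram:k-ae": 8}

candidates = [
    {"text": "A", "match_count": 1, "rarity_score": rarity_score({"cv_shape:CVC"}, counts)},
    {"text": "B", "match_count": 2, "rarity_score": rarity_score({"bigram:k-ae"}, counts)},
    {"text": "C", "match_count": 2, "rarity_score": rarity_score({"k_initial", "cv_shape:CVC"}, counts)},
]
print([s["text"] for s in rank_sentences(candidates)])
# prints: ['C', 'B', 'A']
```

Because rarity_score depends only on corpus-wide counts, it can be computed once at build time, leaving match_count as the only per-query quantity.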
Full citations
Pronunciation data
Carnegie Mellon University. (2014). The CMU Pronouncing Dictionary (~134,000 words). License: Modified BSD. http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Phonological features (priors + evidence)
Hayes, B. (2009). Introductory Phonology (theory-assigned 26-feature inventory used as prior). Wiley-Blackwell.
Marxer, R., Barker, J., Cooke, M., & Garcia Lecumberri, M. L. (2016). A corpus of noise-induced word misperceptions for English (ECCC v1.2; evidence layer for feature posteriors and perceptual confusability edges). Journal of the Acoustical Society of America, 140(5), EL458–EL463. License: CC BY 4.0. https://datashare.ed.ac.uk/handle/10283/2791
Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels (evidence layer for vowel feature posteriors). Journal of the Acoustical Society of America, 97(5 Pt 1), 3099–3111. DOI: 10.1121/1.411872
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.
Phonological complexity
Stoel-Gammon, C. (2010). The Word Complexity Measure: Description and application to developmental phonology and disorders. Clinical Linguistics & Phonetics, 24(4-5), 271–282. DOI: 10.3109/02699200903581059
Phonotactic probability
Vitevitch, M. S., & Luce, P. A. (2004). A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3), 481–487. DOI: 10.3758/BF03195594 — method origin; PhonoLex computes values directly from the CMU dict.
Lexical frequency
PhonoLex (in-house derivation). Word frequencies computed from HuggingFace FineWeb-Edu (~800M tokens, ~1M docs). License: ODC-BY 1.0. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
Korochkina, M., Marelli, M., Brysbaert, M., & Rastle, K. (2024). The Children and Young People's Books Lexicon (CYP-LEX). Quarterly Journal of Experimental Psychology, 77(11), 2197–2214. CC BY 4.0. https://osf.io/squ49/
Developmental frequency
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd ed.). Lawrence Erlbaum Associates. https://childes.talkbank.org/
Rose, Y., & MacWhinney, B. (2014). The PhonBank Project: Data and software-assisted methods for the study of phonology and phonological development. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford Handbook of Corpus Phonology. https://phonbank.talkbank.org/
Lexical timing (AoA, validation oracles)
Scott, G. G., Keitel, A., Becirspahic, M., Yao, B., & Sereno, S. C. (2019). The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51(3), 1258–1270. DOI: 10.3758/s13428-018-1099-3 — primary validation oracle (Spearman 0.868 on N=5,551).
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978–990. DOI: 10.3758/s13428-012-0210-4 — secondary cross-construct oracle (Pearson 0.816 on Glasgow-unseen N=500). Not redistributed.
Semantic (methodological anchors)
Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911. DOI: 10.3758/s13428-013-0403-5 — scale anchor for in-house concreteness.
Pexman, P. M., Muraki, E., Sidhu, D. M., Siakaluk, P. D., & Yap, M. J. (2019). Quantifying sensorimotor experience: Body-object interaction ratings for more than 9,000 English words. Behavior Research Methods, 51(2), 453–466. DOI: 10.3758/s13428-018-1171-z — scale anchor for in-house BOI.
Winter, B., Lupyan, G., Perry, L. K., Dingemanse, M., & Perlman, M. (2024). Iconicity ratings for 14,000+ English words. Behavior Research Methods, 56(3), 1640–1655. DOI: 10.3758/s13428-023-02112-6 — scale anchor for in-house iconicity.
Diveica, V., Pexman, P. M., & Binney, R. J. (2023). Quantifying social semantics: An inclusive definition of socialness and ratings for 8,388 English words. Behavior Research Methods, 55(2), 461–473. DOI: 10.3758/s13428-022-01810-x — scale anchor for in-house socialness.
Hoffman, P., Lambon Ralph, M. A., & Rogers, T. T. (2013). Semantic diversity: A measure of semantic ambiguity based on variability in the contextual usage of words. Behavior Research Methods, 45(3), 718–730. DOI: 10.3758/s13428-012-0278-x — scale anchor for in-house semantic diversity.
Affective (methodological anchor)
Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4), 1191–1207. DOI: 10.3758/s13428-012-0314-x — scale anchor for in-house valence + arousal; validated against held-out Warriner oracle.
Word similarity graph
PhonoLex (in-house derivation). Qwensim: ~1.6M word-similarity edges from Qwen3-Embedding-0.6B cosine over FineWeb-Edu. Bulk of the graph.
Marxer et al. (2016) ECCC — see Phonological features section above.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: The concept revisited (WordSim-353). Proceedings of the 10th International Conference on World Wide Web, 406–414. DOI: 10.1145/371920.372094
Morphology
Batsuren, K., Bella, G., & Giunchiglia, F. (2021). MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology. Proceedings of the 18th SIGMORPHON Workshop, 39–48. License: CC BY-SA 3.0. https://github.com/kbatsuren/MorphyNet
Sentences corpus
Penedo, G., Kydlíček, H., Lozhkov, A., et al. (2024). FineWeb-Edu: an open and high-quality dataset for educational content. License: ODC-BY 1.0. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
Nivre, J., et al. (2020). Universal Dependencies v2 (UD English-EWT + GUM corpora). License: CC BY-SA. https://universaldependencies.org/
Warstadt, A., Singh, A., & Bowman, S. R. (2019). CoLA: The Corpus of Linguistic Acceptability (positive examples). Transactions of the ACL. https://nyu-mll.github.io/CoLA/
Tatoeba contributors. (2024). Tatoeba sentence collection (English subset). License: CC BY 2.0 FR. https://tatoeba.org/
Lison, P., Tiedemann, J., & Kouylekov, M. (2018). OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora. Proceedings of LREC. http://www.opensubtitles.org/
Clinical intervention approaches
Gierut, J. A. (1989). Maximal opposition approach to phonological treatment. Journal of Speech and Hearing Disorders, 54(1), 9–19. DOI: 10.1044/jshd.5401.09
Gierut, J. A. (1990). Differential learning of phonological oppositions. Journal of Speech and Hearing Research, 33(3), 540–549. DOI: 10.1044/jshr.3303.540
Williams, A. L. (2000). Multiple oppositions: theoretical foundations for an alternative contrastive intervention approach. American Journal of Speech-Language Pathology, 9(4), 282–288. DOI: 10.1044/1058-0360.0904.282
Storkel, H. L. (2022). Minimal, Maximal, or Multiple: Which Contrastive Intervention Approach to Use With Children With Speech Sound Disorders? Language, Speech, and Hearing Services in Schools, 53(3), 632–645. DOI: 10.1044/2022_LSHSS-21-00137