Data & Methods¶

PhonoLex is a phonological analysis and corpus-retrieval platform for speech-language pathologists (SLPs) working on phonological targets, with a broader REST API for phonological research and language-education tooling. This page explains what's in the database, where it comes from, and how the computational methods work.

Overview¶

PhonoLex provides five categories of data:

Lexicon — ~125K CMU-phonology entries; the ~47K canonical content-POS subset (NOUN / VERB / ADJ / ADV) carries the full ~150-column norm set.
Word properties — ~150 in-house psycholinguistic + phonological columns derived from PhonoLex pipelines, with original-author papers cited as methodological anchors for the scales we adopt.
Word similarity graph — ~1.63M edges (~99.8% Qwensim neural-embedding cosine over FineWeb-Edu + ~2.5K ECCC perceptual confusability + ~351 WordSim-353 human-rated similarity).
Minimal-pair lexicon — 642K precomputed minimal pairs with learned-feature distance + sonorant-difference metrics.
Curated corpus — ~236K naturalistic English sentences (CoLA, UD-EWT, GUM, Tatoeba, OpenSubtitles) gated at build time for SLP suitability.

All data is keyed on General American English pronunciations from the CMU Pronouncing Dictionary (primary pronunciations only).

The in-house norm columns are derived locally via gpt-4.1-mini cloze-prompt ratings adapting the scales of published norm sets, validated against held-out oracles. This replaces the original-author CSVs that PhonoLex used to redistribute — those CSVs had licensing terms that didn't sit cleanly with proprietary deployment.

The LLM-cloze methodology is grounded in two validation papers showing that LLM-derived psycholinguistic ratings correlate r = .74–.95 with human raters and outperform human raters on downstream prediction tasks:

Martínez, G., Conde, J., Reviriego, P., & Brysbaert, M. (2025). Using Large Language Models to Generate Psycholinguistic Norms. Behavior Research Methods (in press).
Brysbaert, M. (2024). Validating LLM-derived psycholinguistic ratings for word stimuli. Memory & Cognition.

Word properties¶

PhonoLex assigns up to ~150 columns to each word. Coverage varies by column — the canonical content-POS subset (~47K words) carries the full set; non-canonical entries (proper nouns, function words) carry phonology only.

Phonological complexity¶

Structural complexity of a word's sound form — central to clinical decision-making because phonological complexity predicts production difficulty in children with speech sound disorders.

Property	What it measures	Scale	Source
`syllable_count`	Number of syllables	1–8	Computed from CMU
`phoneme_count`	Number of speech sounds	1–15+	Computed from CMU
`cv_shape`	Syllable structure as a CV string (`CVC`, `CCVCC`, `CV-CVC`, ...)	categorical	Computed from CMU
`wcm_score`	Composite Word Complexity Measure (clusters, velars, fricatives, liquids, stress pattern, word-final consonants)	0–20+	Method: Stoel-Gammon (2010); computed locally
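
As a rough illustration of how the simpler columns above can be read off a CMU entry, the sketch below counts syllables and phonemes and builds a flat CV string. It relies on the CMU convention that vowels carry a stress digit. The function name `basic_counts` and the `flat_cv` key are hypothetical, and the flat string omits the syllable boundaries (`CV-CVC`) that `cv_shape` shows, since those require a syllabifier.

```python
def is_vowel(phone: str) -> bool:
    # In CMU ARPABET, vowels end in a stress digit (0, 1 or 2); consonants do not.
    return phone[-1].isdigit()


def basic_counts(arpabet: str) -> dict:
    """Counts for one space-separated CMU pronunciation (illustrative only)."""
    phones = arpabet.split()
    return {
        "syllable_count": sum(is_vowel(p) for p in phones),  # one syllable per vowel
        "phoneme_count": len(phones),
        "flat_cv": "".join("V" if is_vowel(p) else "C" for p in phones),
    }


print(basic_counts("S T R IY1 T"))  # "street"
```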

Phonotactic probability¶

How common a word's sound patterns are in English. Children learn high-probability sound sequences earlier and produce them more accurately.

Property	What it measures	Scale	Source
`phono_prob_avg`	Average biphone probability	0–1	Method: Vitevitch & Luce (2004); computed from CMU
`positional_prob_avg`	Positional segment probability (per syllable position)	0–1	Method: Vitevitch & Luce (2004); computed from CMU
`str_phono_prob_avg`, `str_positional_prob_avg`	Stressed-syllable variants	0–1	Same
`neighborhood_density`, `str_neighborhood_density`	Number of one-phoneme-substitution neighbors in the lexicon	counts	Computed from CMU
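
Neighborhood density as defined in the table (one-phoneme-substitution neighbors) can be sketched with a wildcard-bucket index. This is a minimal sketch over a toy lexicon, not the PhonoLex pipeline; the function name and the choice to ignore homophones are assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple

Phones = Tuple[str, ...]


def neighborhood_density(lexicon: Dict[str, Phones]) -> Dict[str, int]:
    """Count words reachable by substituting exactly one phoneme."""
    # Index every word under each pattern obtained by masking one slot.
    buckets = defaultdict(set)
    for word, phones in lexicon.items():
        for i in range(len(phones)):
            buckets[phones[:i] + ("_",) + phones[i + 1:]].add(word)

    density = {}
    for word, phones in lexicon.items():
        neighbors = set()
        for i in range(len(phones)):
            neighbors |= buckets[phones[:i] + ("_",) + phones[i + 1:]]
        # Drop the word itself and any homophones (identical phoneme strings).
        density[word] = sum(1 for w in neighbors if lexicon[w] != phones)
    return density


toy = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cut": ("K", "AH", "T"),
    "cab": ("K", "AE", "B"),
    "dog": ("D", "AO", "G"),
}
print(neighborhood_density(toy))
```

Bucketing by masked pattern avoids comparing every pair of words, which matters at the scale of a full pronouncing dictionary.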

Lexical frequency¶

Property	What it measures	Scale	Source
`frequency`	General-corpus words-per-million	0–80,000+	PhonoLex derivation from FineWeb-Edu (~800M tokens, ~1M docs)
`log_frequency`	Log-Zipf normalization of `frequency`	0–10	Same
`contextual_diversity`	Document-level diversity proxy	0–1	Same
`freq_cyplex_7_9`, `freq_cyplex_10_12`, `freq_cyplex_13`	CYP-LEX child-corpus age bands (7-9, 10-12, 13+ years)	wpm	Korochkina et al. (2024), CC BY 4.0
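
The page does not spell out the formula behind `log_frequency`. The sketch below shows the standard Zipf transform (log10 of frequency per billion words, i.e. log10 of words-per-million plus 3) as one plausible reading; treating it as the PhonoLex formula is an assumption, as is returning no value for unattested words.

```python
import math
from typing import Optional


def zipf_from_wpm(words_per_million: float) -> Optional[float]:
    """Standard Zipf value: log10(frequency per billion) = log10(wpm) + 3."""
    if words_per_million <= 0:
        return None  # log frequency is undefined for words with zero count
    return math.log10(words_per_million) + 3.0


print(zipf_from_wpm(1.0))
print(zipf_from_wpm(250.0))
```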

Developmental frequency¶

Aggregated from the CHILDES and PhonBank (TalkBank) corpora. Since the 2026-05-25 rework, these columns are built from child production channels rather than caregiver input, so freq_age_2y answers "what do 2-year-olds actually say" rather than "what do adults say to them."

Property	Source	Note
`freq_age_2y`	PB + CHILDES `prod_12-24mo` + `prod_24-36mo` mean	child production
`freq_age_5y`	PB + CHILDES `prod_36-48mo` + `prod_48-72mo` mean	child production
`freq_age_8y`	CHILDES `prod_72-108mo`	child production
`freq_age_12y`	CHILDES `prod_108-144mo`	child production
`freq_age_all`	Aliases `frequency` (FineWeb-Edu general)	general-corpus reference at the top of the developmental ladder

Source: MacWhinney (2000) CHILDES + Rose & MacWhinney (2014) PhonBank; PhonoLex derivation.

Lexical timing¶

Property	What it measures	Scale	Source
`aoa`	Age at which a word is typically learned	1–7 (age-band anchored: 1≈0-2y, 4≈7-8y, 7≈13y+)	PhonoLex in-house: gpt-4.1-mini cloze with logprob expected-value extraction. Validated Spearman 0.868 vs Glasgow Norms (N=5,551), Pearson 0.816 vs Kuperman (held-out N=500).
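
"Logprob expected-value extraction" can be read as: take the model's log-probabilities for each rating token, renormalize over the valid ratings, and report the probability-weighted mean instead of the single most likely rating. The sketch below assumes the logprobs are already available as a plain dict; it makes no API call, and the function name and input shape are hypothetical.

```python
import math
from typing import Dict, Tuple


def expected_rating(token_logprobs: Dict[str, float], scale: Tuple[int, int] = (1, 7)) -> float:
    """Probability-weighted mean rating over the valid scale points."""
    lo, hi = scale
    valid = {str(k): k for k in range(lo, hi + 1)}
    probs: Dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        key = token.strip()
        if key in valid:  # ignore any non-rating tokens the model proposed
            probs[valid[key]] = probs.get(valid[key], 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0.0:
        raise ValueError("no valid rating tokens in the logprob list")
    return sum(k * p for k, p in probs.items()) / total


# Toy distribution: the model splits its mass between ratings 2 and 3.
print(expected_rating({"2": math.log(0.6), "3": math.log(0.3), "the": math.log(0.1)}))
```

Using the expected value yields a continuous score, which is what makes rank correlations against human norm means meaningful.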

Semantic¶

All derived locally via gpt-4.1-mini cloze-prompt over ~47K non-PROPN content words. Methodology adapts the published scale referenced in each row's source.

Property	What it measures	Scale	Source / scale anchor
`concreteness`	Concrete (tangible) vs. abstract	1–5	Brysbaert et al. (2014) scale
`familiarity`	How familiar the word feels	1–7
`boi`	Body-object interaction strength	1–7	Pexman et al. (2019) scale
`iconicity`	Sound-meaning correspondence	-5 to +5	Winter et al. (2024) scale
`socialness`	Degree of social content	1–7	Diveica et al. (2023) scale
`semantic_diversity`, `semd_topic`, `semd_vn`, `semd_h13`, `n_topics_for_word`	Semantic diversity / topic statistics	various	Hoffman et al. (2013) scale

Affective¶

Property	What it measures	Scale	Source
`valence`	Emotional positivity (1 = negative, 9 = positive)	1–9	PhonoLex (gpt-4.1-mini cloze; Warriner et al. (2013) scale; Spearman 0.836 vs held-out Warriner oracle for valence on N=500 pilot)
`arousal`	Emotional intensity (1 = calm, 9 = excited)	1–9	Same

The Warriner dominance (D) axis was not re-derived in the PhonoLex in-house build; see the freq_age_adult → freq_age_all migration note in CHANGELOG v5.2.1 for a similarly framed retirement.

Morphological¶

Property	What it measures	Source
`morpheme_count`, `n_prefixes`, `n_suffixes`, `is_monomorphemic`	Algorithmic morpheme decomposition	In-house algorithmic morphology + MorphyNet (Batsuren et al., 2021, CC BY-SA 3.0)

Percentiles¶

Every numeric column has a parallel {column}_percentile (0–100) for cross-property comparison and for the percentile-mode UI filters. Frequency-class properties (anything starting with freq_) treat value = 0 as NULL when computing percentiles — a word that never occurs in the source corpus shouldn't cluster at a misleading mid-rank.
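
The zero-as-NULL rule can be sketched as follows. The mid-rank percentile definition and the function name are assumptions; only the `freq_` prefix rule comes from the description above.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, Optional


def percentile_column(values: Dict[str, Optional[float]], column: str) -> Dict[str, Optional[float]]:
    """Map each word to a 0-100 percentile, skipping NULL-like values."""

    def is_null(v: Optional[float]) -> bool:
        # Frequency-class columns treat 0 as "unattested", not as a low score.
        return v is None or (column.startswith("freq_") and v == 0)

    ranked = sorted(v for v in values.values() if not is_null(v))
    out: Dict[str, Optional[float]] = {}
    for word, v in values.items():
        if is_null(v):
            out[word] = None
            continue
        below = bisect_left(ranked, v)
        at_or_below = bisect_right(ranked, v)
        out[word] = 100.0 * (below + at_or_below) / (2 * len(ranked))  # mid-rank
    return out


print(percentile_column({"a": 0, "b": 1, "c": 2, "d": 3}, "freq_age_2y"))
```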

Phonological similarity¶

Phonological similarity in PhonoLex uses a method that respects internal syllable structure. Each word is decomposed into syllables, and each syllable into onset (initial consonants), nucleus (vowel), and coda (final consonants).

How it works¶

Two levels:

Within each syllable component, compare the phoneme sequences using soft Levenshtein distance. Substitution cost between two phonemes is based on how articulatorily similar they are — phonemes sharing many features (e.g., /p/ and /b/, differing only in voicing) have low substitution cost; phonemes differing in many features (e.g., /s/ and /m/) have high cost. Consonant clusters are handled naturally — comparing a /kr/ onset to a /k/ onset incurs an appropriate length penalty.
Across syllables, compute a weighted average of onset, nucleus, and coda similarity, then run another soft Levenshtein over syllable sequences. Handles words of different lengths gracefully.

Phoneme similarity comes from learned 26-d Bayesian feature vectors (packages/features/): theory-assigned articulatory features (Hayes 2009) serve as Dirichlet priors over feature posteriors; ECCC perceptual confusion (Marxer et al., 2016) + Hillenbrand acoustic measurements (Hillenbrand et al., 1995) serve as evidence. Validation: r=0.987 cosine correlation against the theory-assigned feature inventory at convergence.
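
The within-component comparison can be sketched as a weighted edit distance whose substitution cost shrinks as two phonemes share more features. The 4-dimensional toy vectors below (voicing, nasality, continuancy, place) are invented for illustration and are not the learned 26-d vectors; the cost scaling and the normalization by the longer sequence are also assumptions.

```python
import math
from typing import Sequence

# Hypothetical toy features: (voiced, nasal, continuant, place)
FEATURES = {
    "P": (0.0, 0.0, 0.0, 0.0),
    "B": (1.0, 0.0, 0.0, 0.0),
    "M": (1.0, 1.0, 0.0, 0.0),
    "S": (0.0, 0.0, 1.0, 0.5),
    "K": (0.0, 0.0, 0.0, 1.0),
    "R": (1.0, 0.0, 1.0, 0.5),
}


def sub_cost(a: str, b: str) -> float:
    """Substitution cost in [0, 1]: scaled L2 distance between feature vectors."""
    if a == b:
        return 0.0
    va, vb = FEATURES[a], FEATURES[b]
    return min(1.0, math.dist(va, vb) / math.sqrt(len(va)))


def soft_levenshtein(x: Sequence[str], y: Sequence[str], indel: float = 1.0) -> float:
    """Edit distance with graded substitution costs."""
    d = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        d[i][0] = i * indel
    for j in range(1, len(y) + 1):
        d[0][j] = j * indel
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                               # deletion
                d[i][j - 1] + indel,                               # insertion
                d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),    # substitution
            )
    return d[len(x)][len(y)]


def similarity(x: Sequence[str], y: Sequence[str]) -> float:
    """1.0 for identical sequences, falling toward 0.0 as they diverge."""
    if not x and not y:
        return 1.0
    return 1.0 - soft_levenshtein(x, y) / max(len(x), len(y))


print(sub_cost("P", "B"), sub_cost("S", "M"))   # voicing-only contrast vs. a multi-feature contrast
print(similarity(("K", "R"), ("K",)))           # /kr/ onset vs. /k/ onset: one deletion
```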

Component weights¶

Adjustable onset / nucleus / coda weights let the same algorithm surface different relationship types:

Preset	Onset	Nucleus	Coda	What it finds
Balanced	0.33	0.33	0.33	Overall sound similarity
Rhymes	0.0	0.5	0.5	Rhyming words (matching vowel + ending)
Alliteration	1.0	0.5	0.0	Words with similar onsets
Assonance	0.0	1.0	0.0	Words with matching vowel sounds
Consonance	0.5	0.0	0.5	Words with matching consonant frame

These presets are surfaced as the "Sound Similarity" rule inside Custom Word Lists and are used by the Lookup tool's similar-words panel.
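
To make the effect of the presets concrete, the sketch below combines per-component similarities for one syllable pair using the weights from the table. Dividing by the weight total is an assumption (the table's weights do not all sum to 1), and the function name is hypothetical.

```python
PRESETS = {
    # preset: (onset, nucleus, coda) weights, copied from the table above
    "Balanced": (0.33, 0.33, 0.33),
    "Rhymes": (0.0, 0.5, 0.5),
    "Alliteration": (1.0, 0.5, 0.0),
    "Assonance": (0.0, 1.0, 0.0),
    "Consonance": (0.5, 0.0, 0.5),
}


def syllable_similarity(onset: float, nucleus: float, coda: float, preset: str = "Balanced") -> float:
    """Weighted average of component similarities, normalized by total weight."""
    w_on, w_nu, w_co = PRESETS[preset]
    total = w_on + w_nu + w_co
    return (w_on * onset + w_nu * nucleus + w_co * coda) / total


# A cat / bat style pair: onsets differ (toy similarity 0.5), nucleus and coda match.
for name in PRESETS:
    print(name, round(syllable_similarity(0.5, 1.0, 1.0, name), 3))
```

With these toy inputs the Rhymes and Assonance presets score the pair as identical, while Alliteration penalizes it most, which is the behavior the table describes.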

Interpreting similarity scores¶

Scores range 0.0 (completely different) to 1.0 (identical):

0.90+ — near-identical (e.g., cat / bat, perfect rhymes)
0.75–0.85 — strong resemblance (e.g., computer / commuter)
0.40–0.60 — moderate
0.20–0.30 — low

Word similarity graph¶

The edges D1 table holds ~1.6M edges from three sources:

Source	Edges	What it measures
Qwensim	~1,627,000	Neural-embedding cosine similarity over FineWeb-Edu via Qwen3-Embedding-0.6B. Bulk of the graph; semantic similarity from a sentence-transformer, not free-association norm data.
ECCC	~2,456	Perceptual confusability in noise (Marxer et al., 2016, CC BY 4.0)
WordSim-353	~351	Human-judged semantic relatedness (Finkelstein et al., 2001)

The "word association" framing the older data layer used (USF, MEN, SPP, SimLex, SWOW) is retired — those datasets were removed during the licensing audit and replaced by the much larger Qwensim layer. Honest framing for SLP-facing UX: "neighboring words via Qwensim semantic similarity."
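
An embedding-cosine edge list of the kind the Qwensim layer describes can be sketched in plain Python. The 2-d vectors stand in for real model embeddings, and the `k` and `min_sim` cutoffs are hypothetical; the page does not state how PhonoLex selects which edges to keep.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_edges(emb: Dict[str, Sequence[float]], k: int = 1, min_sim: float = 0.5) -> List[Tuple[str, str, float]]:
    """Keep each word's top-k neighbors whose cosine similarity clears min_sim."""
    edges = []
    for word in sorted(emb):
        scored = sorted(
            ((cosine(emb[word], emb[other]), other) for other in emb if other != word),
            reverse=True,
        )
        edges.extend((word, other, sim) for sim, other in scored[:k] if sim >= min_sim)
    return edges


toy_embeddings = {"dog": (1.0, 0.0), "puppy": (0.9, 0.1), "car": (0.0, 1.0)}
print(build_edges(toy_embeddings))
```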

Contrastive intervention data¶

Three evidence-based contrastive approaches. All three are available at the word level (Contrast Sets tool). Minimal pair and maximal opposition are also available at the sentence level (Sentences tool); multiple opposition is word-level only, because a single attested sentence almost never witnesses one substitute against several target phonemes at once.

Minimal pairs¶

Two words differing by exactly one phoneme in the same position (cat / bat, pat / pan). PhonoLex precomputes 642K minimal pairs across the lexicon. Filter by phoneme contrast and position (initial / medial / final / any). Each pair carries a feature_distance (continuous L2 over learned vectors) and a sonorant_diff (boolean — whether the contrast crosses the sonorant class).
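
The definition above translates directly into a position check. This is a minimal sketch over phoneme tuples; the function name and the handling of one-phoneme words (reported as "initial") are assumptions.

```python
from typing import Optional, Sequence


def minimal_pair_position(a: Sequence[str], b: Sequence[str]) -> Optional[str]:
    """Return 'initial', 'medial' or 'final' if a and b differ in exactly one slot."""
    if len(a) != len(b):
        return None  # insertions and deletions are not minimal pairs here
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    if i == 0:
        return "initial"
    if i == len(a) - 1:
        return "final"
    return "medial"


print(minimal_pair_position(("K", "AE", "T"), ("B", "AE", "T")))  # cat / bat
print(minimal_pair_position(("P", "AE", "T"), ("P", "AE", "N")))  # pat / pan
```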

Maximal opposition (Gierut, 1989)¶

A minimal pair whose contrasting phonemes also differ in major class (typically obstruent vs. sonorant). The theory: targeting maximally different contrasts promotes broader generalization across the phonological system. PhonoLex surfaces this by filtering minimal pairs on sonorant_diff >= threshold.
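
A minimal sketch of the filter, assuming the boolean reading of sonorant_diff given under Minimal pairs and a consonant-only toy inventory. The `MinimalPair` class and the sonorant set are illustrative, not the PhonoLex schema.

```python
from dataclasses import dataclass
from typing import List

# ARPABET sonorant consonants: nasals, liquids, glides (vowels omitted in this toy).
SONORANTS = {"M", "N", "NG", "L", "R", "W", "Y"}


@dataclass
class MinimalPair:
    word_a: str
    word_b: str
    phone_a: str
    phone_b: str

    @property
    def sonorant_diff(self) -> bool:
        # True when exactly one side of the contrast is a sonorant.
        return (self.phone_a in SONORANTS) != (self.phone_b in SONORANTS)


def maximal_oppositions(pairs: List[MinimalPair]) -> List[MinimalPair]:
    return [p for p in pairs if p.sonorant_diff]


pairs = [
    MinimalPair("cat", "bat", "K", "B"),   # obstruent vs. obstruent
    MinimalPair("sat", "mat", "S", "M"),   # obstruent vs. sonorant
    MinimalPair("lap", "rap", "L", "R"),   # sonorant vs. sonorant
]
print([(p.word_a, p.word_b) for p in maximal_oppositions(pairs)])
```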

Multiple opposition (Williams, 2000)¶

Targets phoneme collapse — when a child substitutes one sound for multiple different target sounds (e.g., /t/ for /k/, /s/, /ʃ/). Treatment selects a set of words that all contrast against the substitute at the same position, so the child must differentiate multiple contrasts simultaneously.
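
Set selection for multiple opposition can be sketched as a lookup: for each word containing the substitute at the chosen position, check whether swapping in every target phoneme also yields a real word. The function name and toy lexicon are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Phones = Tuple[str, ...]


def multiple_opposition_sets(
    lexicon: Dict[str, Phones], substitute: str, targets: Sequence[str], position: int = 0
) -> List[Tuple[str, Dict[str, str]]]:
    """Find (substitute word, {target phoneme: contrast word}) sets at one position."""
    by_phones = {phones: word for word, phones in lexicon.items()}
    sets = []
    for word, phones in lexicon.items():
        if len(phones) <= position or phones[position] != substitute:
            continue
        contrasts = {}
        for target in targets:
            candidate = phones[:position] + (target,) + phones[position + 1:]
            if candidate in by_phones:
                contrasts[target] = by_phones[candidate]
        if len(contrasts) == len(targets):  # keep only complete sets
            sets.append((word, contrasts))
    return sets


toy_lexicon = {
    "tea": ("T", "IY"),
    "key": ("K", "IY"),
    "sea": ("S", "IY"),
    "she": ("SH", "IY"),
    "toe": ("T", "OW"),
    "so": ("S", "OW"),
}
print(multiple_opposition_sets(toy_lexicon, "T", ["K", "S", "SH"]))
```

Here "toe" is rejected because the toy lexicon has no /k/ or /ʃ/ counterpart, which mirrors why complete sets are much rarer than single minimal pairs.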

Corpus¶

The Sentences tool draws from ~236K curated naturalistic English sentences. The build pipeline applies these gates in order:

Unicode normalize + char whitelist + length bounds + repetition entropy
Punctuation balance + terminator + clause repetition + no-garbage URL/citation/HTML
Letter-spelled-word rejection (R-o-b-a-r-d patterns)
Vocab coverage (every alpha token must have CMU phonology)
Profanity filter (better-profanity)
SLP content-suitability gate (in-house V/A norms + AFINN NEG_3/4/5 buckets — drops violence/conflict)
Verbal-filler rejection (uh) + Spanish-loanword denylist
spaCy parse + single-ROOT verb-headed validation + parataxis rejection
Obscure-PROPN rejection (non-canonical PROPN with FineWeb-Edu freq < 5/million)
PROPN cap of 2 per sentence
spaCy contraction-glue (stem + suffix → whole-word don't / won't / it's)
Coverage-aware rarity-driven dedup + cross-source identical-text merge
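
The ordered-gate idea can be sketched as a list of named predicates where the first failure rejects the sentence. Only four of the simpler gates are shown, and every threshold, regex, and name here is a hypothetical stand-in rather than the actual build rule.

```python
import re
from typing import Callable, List, Optional, Set, Tuple

# Matches letter-by-letter spellings such as R-o-b-a-r-d.
SPELLED_OUT = re.compile(r"\b(?:[A-Za-z]-){2,}[A-Za-z]\b")


def length_ok(sentence: str, vocab: Set[str]) -> bool:
    return 3 <= len(sentence.split()) <= 25  # hypothetical bounds


def has_terminator(sentence: str, vocab: Set[str]) -> bool:
    return sentence.rstrip().endswith((".", "?", "!"))


def not_letter_spelled(sentence: str, vocab: Set[str]) -> bool:
    return SPELLED_OUT.search(sentence) is None


def vocab_covered(sentence: str, vocab: Set[str]) -> bool:
    # Every alphabetic token must have a pronunciation entry.
    return all(tok in vocab for tok in re.findall(r"[a-z']+", sentence.lower()))


GATES: List[Tuple[str, Callable[[str, Set[str]], bool]]] = [
    ("length", length_ok),
    ("terminator", has_terminator),
    ("letter_spelled", not_letter_spelled),
    ("vocab_coverage", vocab_covered),
]


def first_failed_gate(sentence: str, vocab: Set[str]) -> Optional[str]:
    """Return the name of the first gate that rejects the sentence, or None."""
    for name, gate in GATES:
        if not gate(sentence, vocab):
            return name
    return None


vocab = {"the", "cat", "sat", "my", "name", "is"}
print(first_failed_gate("The cat sat.", vocab))
print(first_failed_gate("My name is R-o-b-a-r-d.", vocab))
print(first_failed_gate("The zyzzyva sat.", vocab))
```

Running cheap string checks before expensive ones (such as a dependency parse) keeps the pipeline fast, which is one reason gate order matters.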

Sources currently feeding the corpus: CoLA, UD English-EWT, GUM, Tatoeba, OpenSubtitles. CHILDES + PhonBank conversational transcripts were retired on 2026-05-25 because the CHAT-transcript cleanup couldn't reliably distinguish locally plausible fragments from genuine sentences. Sentences are ranked first by per-query match_count (multi-hit before single-hit), then by a static rarity_score: a coverage-aware sum of 1 / n_sentences[C] over the constraints C the sentence satisfies (phoneme-position, bigram, and top-50 CV-shape constraints).
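
The two-key ranking can be sketched as below. The constraint labels and counts are made up, and the "coverage-aware" adjustment is not modeled because the page does not define it; this shows only the inverse-count sum and the sort order.

```python
from typing import Dict, Iterable, List, Tuple


def rarity_score(satisfied: Iterable[str], n_sentences: Dict[str, int]) -> float:
    """Sum of 1 / n_sentences[C] over the constraints C a sentence satisfies."""
    return sum(1.0 / n_sentences[c] for c in satisfied if n_sentences.get(c, 0) > 0)


def rank(candidates: List[Tuple[str, int, float]]) -> List[str]:
    """Order by match_count (descending), then rarity_score (descending)."""
    ordered = sorted(candidates, key=lambda c: (-c[1], -c[2]))
    return [sentence for sentence, _, _ in ordered]


# Hypothetical constraint labels with the number of corpus sentences satisfying each.
counts = {"k_initial": 4, "zh_medial": 1}

candidates = [
    ("A", 1, rarity_score(["k_initial"], counts)),
    ("B", 2, rarity_score(["k_initial"], counts)),
    ("C", 1, rarity_score(["k_initial", "zh_medial"], counts)),
]
print(rank(candidates))
```

Because each term is an inverse count, a sentence that satisfies a constraint few others satisfy rises among sentences with the same match_count.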

Full citations¶

Pronunciation data¶

Carnegie Mellon University. (2014). The CMU Pronouncing Dictionary (~134,000 words). License: Modified BSD. http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Phonological features (priors + evidence)¶

Hayes, B. (2009). Introductory Phonology (theory-assigned 26-feature inventory used as prior). Wiley-Blackwell.

Marxer, R., Barker, J., Cooke, M., & Garcia Lecumberri, M. L. (2016). A corpus of noise-induced word misperceptions for English (ECCC v1.2; evidence layer for feature posteriors and perceptual confusability edges). Journal of the Acoustical Society of America, 140(5). License: CC BY 4.0. https://datashare.ed.ac.uk/handle/10283/2791

Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels (evidence layer for vowel feature posteriors). Journal of the Acoustical Society of America, 97(5 Pt 1), 3099–3111. DOI: 10.1121/1.411872

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.

Phonological complexity¶

Stoel-Gammon, C. (2010). The Word Complexity Measure: Description and application to developmental phonology and disorders. Clinical Linguistics & Phonetics, 24(4-5), 271–282. DOI: 10.3109/02699200903581059

Phonotactic probability¶

Vitevitch, M. S., & Luce, P. A. (2004). A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3), 481–487. DOI: 10.3758/BF03195594 — method origin; PhonoLex computes values directly from the CMU dict.

Lexical frequency¶

PhonoLex (in-house derivation). Derived from HuggingFace FineWeb-Edu (~800M tokens, ~1M docs). License: ODC-BY 1.0. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

Korochkina, M., Marelli, M., Brysbaert, M., & Rastle, K. (2024). The Children and Young People's Books Lexicon (CYP-LEX). Quarterly Journal of Experimental Psychology, 77(11), 2197–2214. CC BY 4.0. https://osf.io/squ49/

Developmental frequency¶

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd ed.). Lawrence Erlbaum Associates. https://childes.talkbank.org/

Rose, Y., & MacWhinney, B. (2014). The PhonBank Project: Data and software-assisted methods for the study of phonology and phonological development. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford Handbook of Corpus Phonology. https://phonbank.talkbank.org/

Lexical timing (AoA — validation oracles)¶

Scott, G. G., Keitel, A., Becirspahic, M., Yao, B., & Sereno, S. C. (2019). The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51(3), 1258–1270. DOI: 10.3758/s13428-018-1099-3 — primary validation oracle (Spearman 0.868 on N=5,551).

Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978–990. DOI: 10.3758/s13428-012-0210-4 — secondary cross-construct oracle (Pearson 0.816 on Glasgow-unseen N=500). Not redistributed.

Semantic (methodological anchors)¶

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911. DOI: 10.3758/s13428-013-0403-5 — scale anchor for in-house concreteness.

Pexman, P. M., Muraki, E., Sidhu, D. M., Siakaluk, P. D., & Yap, M. J. (2019). Quantifying sensorimotor experience: Body-object interaction ratings for more than 9,000 English words. Behavior Research Methods, 51(2), 453–466. DOI: 10.3758/s13428-018-1171-z — scale anchor for in-house BOI.

Winter, B., Lupyan, G., Perry, L. K., Dingemanse, M., & Perlman, M. (2024). Iconicity ratings for 14,000+ English words. Behavior Research Methods, 56(3), 1640–1655. DOI: 10.3758/s13428-023-02112-6 — scale anchor for in-house iconicity.

Diveica, V., Pexman, P. M., & Binney, R. J. (2023). Quantifying social semantics: An inclusive definition of socialness and ratings for 8,388 English words. Behavior Research Methods, 55(2), 461–473. DOI: 10.3758/s13428-022-01810-x — scale anchor for in-house socialness.

Hoffman, P., Lambon Ralph, M. A., & Rogers, T. T. (2013). Semantic diversity: A measure of semantic ambiguity based on variability in the contextual usage of words. Behavior Research Methods, 45(3), 718–730. DOI: 10.3758/s13428-012-0278-x — scale anchor for in-house semantic diversity.

Affective (methodological anchor)¶

Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4), 1191–1207. DOI: 10.3758/s13428-012-0314-x — scale anchor for in-house valence + arousal; validated against held-out Warriner oracle.

Word similarity graph¶

PhonoLex (in-house derivation). Qwensim: ~1.6M word-similarity edges from Qwen3-Embedding-0.6B cosine over FineWeb-Edu. Bulk of the graph.

Marxer et al. (2016) ECCC — see Phonological features section above.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: The concept revisited (WordSim-353). Proceedings of the 10th International Conference on World Wide Web, 406–414. DOI: 10.1145/371920.372094

Morphology¶

Batsuren, K., Bella, G., & Giunchiglia, F. (2021). MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology. Proceedings of the 18th SIGMORPHON Workshop, 39–48. License: CC BY-SA 3.0. https://github.com/kbatsuren/MorphyNet

Sentences corpus¶

Penedo, G., Kydlíček, H., Lozhkov, A., et al. (2024). FineWeb-Edu: an open and high-quality dataset for educational content. License: ODC-BY 1.0. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

Nivre, J., et al. (2020). Universal Dependencies v2 (UD English-EWT + GUM corpora). License: CC BY-SA. https://universaldependencies.org/

Warstadt, A., Singh, A., & Bowman, S. R. (2019). CoLA: The Corpus of Linguistic Acceptability (positive examples). Transactions of the ACL. https://nyu-mll.github.io/CoLA/

Tatoeba contributors. (2024). Tatoeba sentence collection (English subset). License: CC BY 2.0 FR. https://tatoeba.org/

Lison, P., Tiedemann, J., & Kouylekov, M. (2018). OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora. Proceedings of LREC 2018. http://www.opensubtitles.org/

Clinical intervention approaches¶

Gierut, J. A. (1989). Maximal opposition approach to phonological treatment. Journal of Speech and Hearing Disorders, 54(1), 9–19. DOI: 10.1044/jshd.5401.09

Gierut, J. A. (1990). Differential learning of phonological oppositions. Journal of Speech and Hearing Research, 33(3), 540–549. DOI: 10.1044/jshr.3303.540

Williams, A. L. (2000). Multiple oppositions: theoretical foundations for an alternative contrastive intervention approach. American Journal of Speech-Language Pathology, 9(4), 282–288. DOI: 10.1044/1058-0360.0904.282

Storkel, H. L. (2022). Minimal, Maximal, or Multiple: Which Contrastive Intervention Approach to Use With Children With Speech Sound Disorders? Language, Speech, and Hearing Services in Schools, 53(3), 632–645. DOI: 10.1044/2022_LSHSS-21-00137