Data & Methods

PhonoLex is a phonological analysis and corpus-retrieval platform for speech-language pathologists (SLPs) working on phonological targets, with a broader REST API for phonological research and language-education tooling. This page explains what's in the database, where it comes from, and how the computational methods work.


Overview

PhonoLex provides five categories of data:

  1. Lexicon — ~125K CMU-phonology entries; the ~47K canonical content-POS subset (NOUN / VERB / ADJ / ADV) carries the full ~150-column norm set.
  2. Word properties — ~150 in-house psycholinguistic + phonological columns derived from PhonoLex pipelines, with original-author papers cited as methodological anchors for the scales we adopt.
  3. Word similarity graph — ~1.63M edges (~99.8% Qwensim neural-embedding cosine over FineWeb-Edu + ~2.5K ECCC perceptual confusability + ~351 WordSim-353 human-rated similarity).
  4. Minimal-pair lexicon — 642K precomputed minimal pairs with learned-feature distance + sonorant-difference metrics.
  5. Curated corpus — ~236K naturalistic English sentences (CoLA, UD-EWT, GUM, Tatoeba, OpenSubtitles) gated at build time for SLP suitability.

All data is keyed on General American English pronunciations from the CMU Pronouncing Dictionary (primary pronunciations only).

The in-house norm columns are derived locally via gpt-4.1-mini cloze-prompt ratings adapting the scales of published norm sets, validated against held-out oracles. This replaces the original-author CSVs that PhonoLex used to redistribute — those CSVs had licensing terms that didn't sit cleanly with proprietary deployment.

The LLM-cloze methodology is grounded in two validation papers showing that LLM-derived psycholinguistic ratings correlate r = .74–.95 with human raters and outperform human raters on downstream prediction tasks:

  • Martínez, G., Conde, J., Reviriego, P., & Brysbaert, M. (2025). Using Large Language Models to Generate Psycholinguistic Norms. Behavior Research Methods (in press).
  • Brysbaert, M. (2024). Validating LLM-derived psycholinguistic ratings for word stimuli. Memory & Cognition.

Word properties

PhonoLex assigns up to ~150 columns to each word. Coverage varies by column — the canonical content-POS subset (~47K words) carries the full set; non-canonical entries (proper nouns, function words) carry phonology only.

Phonological complexity

Structural complexity of a word's sound form — central to clinical decision-making because phonological complexity predicts production difficulty in children with speech sound disorders.

| Property | What it measures | Scale | Source |
| --- | --- | --- | --- |
| syllable_count | Number of syllables | 1–8 | Computed from CMU |
| phoneme_count | Number of speech sounds | 1–15+ | Computed from CMU |
| cv_shape | Syllable structure as a CV string (CVC, CCVCC, CV-CVC, ...) | categorical | Computed from CMU |
| wcm_score | Composite Word Complexity Measure (clusters, velars, fricatives, liquids, stress pattern, word-final consonants) | 0–20+ | Method: Stoel-Gammon (2010); computed locally |
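
The simplest of these columns can be illustrated directly from a CMU-style pronunciation string. The sketch below is not the PhonoLex pipeline: the function name basic_shape and the cv_flat field are hypothetical, and the flat C/V string omits the syllable boundaries that cv_shape carries (CV-CVC), since syllabification is a separate step.

```python
import re

# ARPAbet vowels as used in the CMU Pronouncing Dictionary (stress digits stripped).
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def strip_stress(phone: str) -> str:
    """Remove the trailing stress digit CMU attaches to vowels ('AE1' -> 'AE')."""
    return re.sub(r"\d$", "", phone)


def basic_shape(pronunciation: str) -> dict:
    """Return syllable count, phoneme count, and a flat C/V string (hypothetical helper)."""
    phones = [strip_stress(p) for p in pronunciation.split()]
    cv = "".join("V" if p in ARPABET_VOWELS else "C" for p in phones)
    return {
        "syllable_count": cv.count("V"),  # one syllable per vowel nucleus
        "phoneme_count": len(phones),
        "cv_flat": cv,  # no syllable boundaries, unlike cv_shape
    }


print(basic_shape("S T R IY1 T"))  # "street"
```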

Phonotactic probability

How common a word's sound patterns are in English. Children learn high-probability sound sequences earlier and produce them more accurately.

| Property | What it measures | Scale | Source |
| --- | --- | --- | --- |
| phono_prob_avg | Average biphone probability | 0–1 | Method: Vitevitch & Luce (2004); computed from CMU |
| positional_prob_avg | Positional segment probability (per syllable position) | 0–1 | Method: Vitevitch & Luce (2004); computed from CMU |
| str_phono_prob_avg, str_positional_prob_avg | Stressed-syllable variants | 0–1 | Same |
| neighborhood_density, str_neighborhood_density | Number of one-phoneme-substitution neighbors in the lexicon | counts | Computed from CMU |
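
As a rough illustration of the two ideas in this table, the sketch below counts one-phoneme-substitution neighbors and averages position-specific biphone probabilities over a toy lexicon. It is a simplification under stated assumptions: the lexicon is invented, positions are counted from the word start rather than per syllable, and the probabilities are unweighted type counts, whereas Vitevitch & Luce (2004) weight by log frequency.

```python
from collections import Counter

# Toy lexicon: word -> tuple of phonemes (stress removed). Not PhonoLex data.
LEXICON = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cap": ("K", "AE", "P"),
    "cot": ("K", "AA", "T"),
    "at": ("AE", "T"),
    "dog": ("D", "AO", "G"),
}


def is_substitution_neighbor(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def neighborhood_density(word):
    """Count lexicon entries one substitution away from the given word."""
    target = LEXICON[word]
    return sum(
        is_substitution_neighbor(target, other)
        for w, other in LEXICON.items()
        if w != word
    )


def biphone_counts():
    """Count (position, phoneme, next phoneme) triples over the lexicon."""
    counts, totals = Counter(), Counter()
    for phones in LEXICON.values():
        for i in range(len(phones) - 1):
            counts[(i, phones[i], phones[i + 1])] += 1
            totals[i] += 1
    return counts, totals


def avg_biphone_prob(word):
    """Mean of the position-specific biphone probabilities of a word."""
    counts, totals = biphone_counts()
    phones = LEXICON[word]
    probs = [
        counts[(i, phones[i], phones[i + 1])] / totals[i]
        for i in range(len(phones) - 1)
    ]
    return sum(probs) / len(probs)


print(neighborhood_density("cat"), round(avg_biphone_prob("cat"), 3))
```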

Lexical frequency

| Property | What it measures | Scale | Source |
| --- | --- | --- | --- |
| frequency | General-corpus words-per-million | 0–80,000+ | PhonoLex derivation from FineWeb-Edu (~800M tokens, ~1M docs) |
| log_frequency | Log-Zipf normalization of frequency | 0–10 | Same |
| contextual_diversity | Document-level diversity proxy | 0–1 | Same |
| freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13 | CYP-LEX child-corpus age bands (7-9, 10-12, 13+ years) | wpm | Korochkina et al. (2024), CC BY 4.0 |
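
For orientation, the standard Zipf scale is the base-10 log of frequency per million words plus 3 (equivalently, log10 of frequency per billion). The sketch below applies that definition; whether log_frequency uses exactly this formula, and how it treats unseen words, is an assumption here, and the function name is hypothetical.

```python
import math
from typing import Optional


def zipf_from_wpm(words_per_million: float) -> Optional[float]:
    """Standard Zipf value: log10(frequency per million words) + 3.

    Returns None for non-positive input, since the log is undefined there.
    (Treating unseen words as missing is an assumption of this sketch.)
    """
    if words_per_million <= 0:
        return None
    return math.log10(words_per_million) + 3.0


for wpm in (0.1, 1.0, 100.0):
    print(wpm, zipf_from_wpm(wpm))
```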

Developmental frequency

Aggregated from the CHILDES and PhonBank TalkBank corpora. Since the 2026-05-25 rework, these columns count child production channels (not caregiver input), so freq_age_2y answers "what do 2-year-olds actually say" rather than "what do adults say to them."

| Property | Source | Note |
| --- | --- | --- |
| freq_age_2y | PB + CHILDES prod_12-24mo + prod_24-36mo mean | child production |
| freq_age_5y | PB + CHILDES prod_36-48mo + prod_48-72mo mean | child production |
| freq_age_8y | CHILDES prod_72-108mo | child production |
| freq_age_12y | CHILDES prod_108-144mo | child production |
| freq_age_all | Aliases frequency (FineWeb-Edu general) | general-corpus reference at the top of the developmental ladder |

Source: MacWhinney (2000) CHILDES + Rose & MacWhinney (2014) PhonBank; PhonoLex derivation.

Lexical timing

| Property | What it measures | Scale | Source |
| --- | --- | --- | --- |
| aoa | Age at which a word is typically learned | 1–7 (age-band anchored: 1≈0-2y, 4≈7-8y, 7≈13y+) | PhonoLex in-house: gpt-4.1-mini cloze with logprob expected-value extraction. Validated Spearman 0.868 vs Glasgow Norms (N=5,551), Pearson 0.816 vs Kuperman (held-out N=500). |
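
"Logprob expected-value extraction" can be read as taking the probability-weighted mean of the rating tokens instead of the single most likely token. The sketch below shows that reading under stated assumptions: the input is a plain mapping from candidate token text to log-probability (as a top-logprobs response would provide), probabilities are renormalised over the rating tokens only, and the function name is hypothetical. It makes no API calls and is not the PhonoLex prompt or pipeline.

```python
import math
from typing import Mapping


def expected_rating(token_logprobs: Mapping[str, float], low: int = 1, high: int = 7) -> float:
    """Probability-weighted mean rating over the integer scale low..high."""
    labels = {str(r): r for r in range(low, high + 1)}
    mass = {}
    for token, logprob in token_logprobs.items():
        text = token.strip()
        if text in labels:  # ignore non-rating tokens such as "the"
            rating = labels[text]
            mass[rating] = mass.get(rating, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError("no rating tokens among the candidates")
    # Renormalise over rating tokens only, then take the expectation.
    return sum(rating * p for rating, p in mass.items()) / total


example = {"3": math.log(0.5), "2": math.log(0.25), " 4": math.log(0.2), "the": math.log(0.05)}
print(round(expected_rating(example), 3))
```

The result is a continuous value between the scale endpoints, which is what makes rank correlations against human norms meaningful even when many words share the same most likely rating token.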

Semantic

All derived locally via gpt-4.1-mini cloze-prompt over ~47K non-PROPN content words. Methodology adapts the published scale referenced in each row's source.

| Property | What it measures | Scale | Source / scale anchor |
| --- | --- | --- | --- |
| concreteness | Concrete (tangible) vs. abstract | 1–5 | Brysbaert et al. (2014) scale |
| familiarity | How familiar the word feels | 1–7 | |
| boi | Body-object interaction strength | 1–7 | Pexman et al. (2019) scale |
| iconicity | Sound-meaning correspondence | -5 to +5 | Winter et al. (2024) scale |
| socialness | Degree of social content | 1–7 | Diveica et al. (2023) scale |
| semantic_diversity, semd_topic, semd_vn, semd_h13, n_topics_for_word | Semantic diversity / topic statistics | various | Hoffman et al. (2013) scale |

Affective

| Property | What it measures | Scale | Source |
| --- | --- | --- | --- |
| valence | Emotional positivity (1 = negative, 9 = positive) | 1–9 | PhonoLex (gpt-4.1-mini cloze; Warriner et al. (2013) scale; Spearman 0.836 vs held-out Warriner oracle for valence on N=500 pilot) |
| arousal | Emotional intensity (1 = calm, 9 = excited) | 1–9 | Same |

The Warriner D (dominance) axis was not re-derived in the PhonoLex in-house build; see the freq_age_adult → freq_age_all migration note in CHANGELOG v5.2.1 for similar retirement framing.

Morphological

| Property | What it measures | Source |
| --- | --- | --- |
| morpheme_count, n_prefixes, n_suffixes, is_monomorphemic | Algorithmic morpheme decomposition | In-house algorithmic morphology + MorphyNet (Batsuren et al., 2021, CC BY-SA 3.0) |

Percentiles

Every numeric column has a parallel {column}_percentile (0–100) for cross-property comparison and for the percentile-mode UI filters. Frequency-class properties (anything starting with freq_) treat value = 0 as NULL when computing percentiles — a word that never occurs in the source corpus shouldn't cluster at a misleading mid-rank.
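
The zero-as-NULL rule can be sketched as follows. The mid-rank percentile definition used here (half of the tied values count as below) is an assumption, as is the function name; the source states only the 0–100 range and the treatment of zeros for freq_ columns.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(values, zero_is_null=False):
    """Map each value to a 0-100 mid-rank percentile; missing values stay None.

    With zero_is_null=True (the freq_ rule described above), zeros are excluded
    from the ranking population and receive None.
    """
    def missing(v):
        return v is None or (zero_is_null and v == 0)

    ranked = sorted(v for v in values if not missing(v))
    n = len(ranked)
    out = []
    for v in values:
        if missing(v):
            out.append(None)
            continue
        below = bisect_left(ranked, v)
        equal = bisect_right(ranked, v) - below
        out.append(100.0 * (below + 0.5 * equal) / n)
    return out


freqs = [0, 10, 20, 30, 0]
print(percentile_ranks(freqs, zero_is_null=True))
print(percentile_ranks(freqs, zero_is_null=False))
```

Without the rule, the two zero-frequency words share a rank of their own and push every attested word upward; with it, they drop out of the population entirely.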


Phonological similarity

Phonological similarity in PhonoLex uses a method that respects internal syllable structure. Each word is decomposed into syllables, and each syllable into onset (initial consonants), nucleus (vowel), and coda (final consonants).

How it works

Two levels:

  • Within each syllable component, compare the phoneme sequences using soft Levenshtein distance. Substitution cost between two phonemes is based on how articulatorily similar they are — phonemes sharing many features (e.g., /p/ and /b/, differing only in voicing) have low substitution cost; phonemes differing in many features (e.g., /s/ and /m/) have high cost. Consonant clusters are handled naturally — comparing a /kr/ onset to a /k/ onset incurs an appropriate length penalty.
  • Across syllables, compute a weighted average of onset, nucleus, and coda similarity, then run another soft Levenshtein over syllable sequences. Handles words of different lengths gracefully.

Phoneme similarity comes from learned 26-d Bayesian feature vectors (packages/features/): theory-assigned articulatory features (Hayes 2009) serve as Dirichlet priors over feature posteriors; ECCC perceptual confusion (Marxer et al., 2016) + Hillenbrand acoustic measurements (Hillenbrand et al., 1995) serve as evidence. Validation: r=0.987 cosine correlation against the theory-assigned feature inventory at convergence.
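
The within-component comparison can be sketched as a Levenshtein distance whose substitution cost comes from feature-vector similarity. Everything specific below is an assumption for illustration: the 4-d ±1 vectors (voice, nasal, continuant, labial) stand in for the learned 26-d vectors, the cost mapping (1 − cosine) / 2 and the insertion/deletion cost of 1.0 are choices made for this sketch, and the function names are hypothetical.

```python
import math

# Hypothetical +1/-1 features: (voice, nasal, continuant, labial).
FEATURES = {
    "p": (-1.0, -1.0, -1.0, 1.0),
    "b": (1.0, -1.0, -1.0, 1.0),
    "m": (1.0, 1.0, -1.0, 1.0),
    "s": (-1.0, -1.0, 1.0, -1.0),
    "k": (-1.0, -1.0, -1.0, -1.0),
    "r": (1.0, -1.0, 1.0, -1.0),
}


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def sub_cost(a, b):
    """0 for identical phonemes, otherwise (1 - cosine) / 2, which lies in [0, 1]."""
    if a == b:
        return 0.0
    return (1.0 - cosine(FEATURES[a], FEATURES[b])) / 2.0


def soft_levenshtein(seq_a, seq_b, indel=1.0):
    """Edit distance with graded substitution costs (row-by-row dynamic programme)."""
    prev = [j * indel for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        curr = [i * indel]
        for j, b in enumerate(seq_b, start=1):
            curr.append(min(
                prev[j] + indel,               # delete a
                curr[j - 1] + indel,           # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a -> b
            ))
        prev = curr
    return prev[-1]


def component_similarity(seq_a, seq_b):
    """Distance normalised by the longer sequence and flipped into a similarity."""
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 1.0  # two empty components (e.g. no coda on either side) match
    return 1.0 - soft_levenshtein(seq_a, seq_b) / longest


print(component_similarity(["p"], ["b"]))       # voicing-only difference
print(component_similarity(["s"], ["m"]))       # differs on every toy feature
print(component_similarity(["k", "r"], ["k"]))  # cluster vs. singleton onset
```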

Component weights

Adjustable onset / nucleus / coda weights let the same algorithm surface different relationship types:

| Preset | Onset | Nucleus | Coda | What it finds |
| --- | --- | --- | --- | --- |
| Balanced | 0.33 | 0.33 | 0.33 | Overall sound similarity |
| Rhymes | 0.0 | 0.5 | 0.5 | Rhyming words (matching vowel + ending) |
| Alliteration | 1.0 | 0.5 | 0.0 | Words with similar onsets |
| Assonance | 0.0 | 1.0 | 0.0 | Words with matching vowel sounds |
| Consonance | 0.5 | 0.0 | 0.5 | Words with matching consonant frame |

Surfaced as the "Sound Similarity" rule inside Custom Word Lists, and used by the Lookup tool's similar-words panel.
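
A sketch of how the presets could combine per-component scores into one syllable-level score. The preset values are taken from the table above; dividing by the weight sum is an assumption made here so that presets whose weights do not total 1 (Balanced, Alliteration) still yield a score in 0–1. The function name and the example component scores are hypothetical.

```python
# (onset, nucleus, coda) weights from the presets table.
PRESETS = {
    "balanced": (0.33, 0.33, 0.33),
    "rhymes": (0.0, 0.5, 0.5),
    "alliteration": (1.0, 0.5, 0.0),
    "assonance": (0.0, 1.0, 0.0),
    "consonance": (0.5, 0.0, 0.5),
}


def syllable_similarity(onset, nucleus, coda, preset="balanced"):
    """Weighted average of component similarities, normalised by the weight sum."""
    weights = PRESETS[preset]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, (onset, nucleus, coda))) / total


# Hypothetical component scores for a cat / bat style pair:
# onsets differ slightly, nucleus and coda are identical.
for name in PRESETS:
    print(name, round(syllable_similarity(0.75, 1.0, 1.0, name), 3))
```

Under the rhymes preset the onset mismatch is ignored entirely, which is why the same pair can score as a perfect match there and lower under alliteration.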

Interpreting similarity scores

Scores range 0.0 (completely different) to 1.0 (identical):

  • 0.90+ — near-identical (e.g., cat / bat, perfect rhymes)
  • 0.75–0.85 — strong resemblance (e.g., computer / commuter)
  • 0.40–0.60 — moderate
  • 0.20–0.30 — low

Word similarity graph

The ~1.6M edges in the edges D1 table:

| Source | Edges | What it measures |
| --- | --- | --- |
| Qwensim | ~1,627,000 | Neural-embedding cosine similarity over FineWeb-Edu via Qwen3-Embedding-0.6B. Bulk of the graph; semantic similarity from a sentence-transformer, not free-association norm data. |
| ECCC | ~2,456 | Perceptual confusability in noise (Marxer et al., 2016, CC BY 4.0) |
| WordSim-353 | ~351 | Human-judged semantic relatedness (Finkelstein et al., 2001) |

The "word association" framing the older data layer used (USF, MEN, SPP, SimLex, SWOW) is retired — those datasets were removed during the licensing audit and replaced by the much larger Qwensim layer. Honest framing for SLP-facing UX: "neighboring words via Qwensim semantic similarity."


Contrastive intervention data

Three evidence-based contrastive approaches. All three are available at the word level (Contrast Sets tool). Minimal pair and maximal opposition are also available at the sentence level (Sentences tool); multiple opposition is word-level only, because a single attested sentence almost never witnesses one substitute against several target phonemes at once.

Minimal pairs

Two words differing by exactly one phoneme in the same position (cat / bat, pat / pan). PhonoLex precomputes 642K minimal pairs across the lexicon. Filter by phoneme contrast and position (initial / medial / final / any). Each pair carries a feature_distance (continuous L2 over learned vectors) and a sonorant_diff (boolean — whether the contrast crosses the sonorant class).
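
The definition above translates into a short check. This is a sketch rather than the precomputation PhonoLex runs: the function names are hypothetical, the position labels follow the initial / medial / final filter described above, and feature_distance is shown as a plain Euclidean (L2) distance over whatever vectors are supplied.

```python
import math


def minimal_pair_contrast(a, b):
    """Return (index, phoneme_a, phoneme_b, position) for a minimal pair, else None."""
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) != 1:
        return None  # identical, or differing in more than one position
    i = diffs[0]
    if i == 0:
        position = "initial"
    elif i == len(a) - 1:
        position = "final"
    else:
        position = "medial"
    return i, a[i], b[i], position


def feature_distance(u, v):
    """Continuous L2 distance between two feature vectors."""
    return math.dist(u, v)


print(minimal_pair_contrast(("K", "AE", "T"), ("B", "AE", "T")))  # cat / bat
print(minimal_pair_contrast(("P", "AE", "T"), ("P", "AE", "N")))  # pat / pan
```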

Maximal opposition (Gierut, 1989)

Minimal pair where the contrasting phonemes also differ in major class (typically obstruent vs. sonorant). The theory: targeting maximally different contrasts promotes broader generalization across the phonological system. PhonoLex surfaces this by filtering minimal pairs on sonorant_diff, the boolean that marks a contrast crossing the sonorant class.
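
A minimal sketch of the major-class filter, assuming a sonorant check over ARPAbet symbols. The sonorant set below (vowels, nasals, liquids, glides) is a conventional classification rather than the PhonoLex feature table; in particular, HH is treated here as non-sonorant, which is a choice made for this sketch. The tuple layout and function names are hypothetical.

```python
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}
SONORANT_CONSONANTS = {"M", "N", "NG", "L", "R", "W", "Y"}
SONORANTS = VOWELS | SONORANT_CONSONANTS


def crosses_sonorant_class(p, q):
    """True when exactly one of the two phonemes is a sonorant."""
    return (p in SONORANTS) != (q in SONORANTS)


def maximal_oppositions(pairs):
    """Keep minimal pairs whose contrast crosses the sonorant class.

    Each pair is (word_a, word_b, phoneme_a, phoneme_b).
    """
    return [pair for pair in pairs if crosses_sonorant_class(pair[2], pair[3])]


pairs = [
    ("sat", "mat", "S", "M"),      # obstruent vs. sonorant
    ("cat", "bat", "K", "B"),      # both obstruents
    ("chain", "rain", "CH", "R"),  # obstruent vs. sonorant
]
print(maximal_oppositions(pairs))
```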

Multiple opposition (Williams, 2000)

Targets phoneme collapse — when a child substitutes one sound for multiple different target sounds (e.g., /t/ for /k/, /s/, /ʃ/). Treatment selects a set of words that all contrast against the substitute at the same position, so the child must differentiate multiple contrasts simultaneously.
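
One way to assemble such a set is to start from each word containing the substitute at the chosen position and look up the words formed by swapping in each target. The sketch below does this over a toy lexicon; the function name, the lexicon, and the requirement that every target be attested are assumptions for illustration. Homophones would collide in the reverse lookup, which a real implementation would need to handle.

```python
def multiple_opposition_sets(lexicon, substitute, targets, index=0):
    """Find words with `substitute` at `index` that contrast with every target there.

    lexicon maps word -> tuple of phonemes. Returns a list of
    (substitute_word, {target_phoneme: contrasting_word}).
    """
    by_phones = {phones: word for word, phones in lexicon.items()}
    sets = []
    for word, phones in lexicon.items():
        if len(phones) <= index or phones[index] != substitute:
            continue
        contrasts = {}
        for target in targets:
            candidate = phones[:index] + (target,) + phones[index + 1:]
            if candidate in by_phones:
                contrasts[target] = by_phones[candidate]
        if len(contrasts) == len(targets):  # keep only fully attested sets
            sets.append((word, contrasts))
    return sets


# Toy lexicon (not PhonoLex data).
LEXICON = {
    "tea": ("T", "IY"), "key": ("K", "IY"), "sea": ("S", "IY"), "she": ("SH", "IY"),
    "two": ("T", "UW"), "sue": ("S", "UW"), "shoe": ("SH", "UW"),
}
print(multiple_opposition_sets(LEXICON, "T", ("K", "S", "SH")))
```

In the toy data, "two" is dropped because no word pairs it with /k/ in initial position, which mirrors the point that full sets are much scarcer than single minimal pairs.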


Corpus

The Sentences tool draws from ~236K curated naturalistic English sentences. Build pipeline gates (in order):

  1. Unicode normalize + char whitelist + length bounds + repetition entropy
  2. Punctuation balance + terminator + clause repetition + no-garbage URL/citation/HTML
  3. Letter-spelled-word rejection (R-o-b-a-r-d patterns)
  4. Vocab coverage (every alpha token must have CMU phonology)
  5. Profanity filter (better-profanity)
  6. SLP content-suitability gate (in-house V/A norms + AFINN NEG_3/4/5 buckets — drops violence/conflict)
  7. Verbal-filler rejection (uh) + Spanish-loanword denylist
  8. spaCy parse + single-ROOT verb-headed validation + parataxis rejection
  9. Obscure-PROPN rejection (non-canonical PROPN with FineWeb-Edu freq < 5/million)
  10. PROPN cap of 2 per sentence
  11. spaCy contraction-glue (stem + suffix → whole-word don't / won't / it's)
  12. Coverage-aware rarity-driven dedup + cross-source identical-text merge
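
A few of the early surface gates can be sketched with the standard library alone. This is an illustration of the gate style, not the PhonoLex build: the character whitelist, the length bounds, and the spelled-out-word pattern are hypothetical choices, and the later gates (phonology coverage, profanity, content suitability, spaCy parsing) are omitted.

```python
import re
import unicodedata

# Hypothetical whitelist: letters, digits, spaces, and common punctuation.
ALLOWED = re.compile("^[A-Za-z0-9 ,.;:!?'\"()-]+$")
# Three or more single letters joined by hyphens, e.g. R-o-b-a-r-d.
SPELLED_OUT = re.compile(r"\b[A-Za-z](?:-[A-Za-z]){2,}\b")
TERMINATORS = (".", "!", "?")


def passes_surface_gates(text, min_chars=10, max_chars=200):
    """Apply normalisation, length, whitelist, terminator, balance, and spelling gates."""
    text = unicodedata.normalize("NFKC", text).strip()
    if not (min_chars <= len(text) <= max_chars):
        return False
    if not ALLOWED.match(text):  # also rejects URLs and HTML via '/', '<', '>'
        return False
    if not text.endswith(TERMINATORS):
        return False
    if text.count("(") != text.count(")") or text.count('"') % 2:
        return False
    if SPELLED_OUT.search(text):
        return False
    return True


for sentence in (
    "The cat sat on the mat.",
    "His name is R-o-b-a-r-d.",
    "Visit http://example.com now.",
):
    print(passes_surface_gates(sentence), sentence)
```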

Sources currently feeding the corpus: CoLA, UD English-EWT, GUM, Tatoeba, OpenSubtitles. CHILDES + PhonBank conversational transcripts were retired 2026-05-25 because the CHAT-transcript cleanup couldn't reliably distinguish locally-plausible fragments from genuine sentences.

Sentences are ranked by per-query match_count first (multi-hit > single-hit), then by a static rarity_score. The rarity_score is a coverage-aware sum of 1 / n_sentences[C] over every constraint C the sentence satisfies (phoneme-position, bigram, and top-50 CV-shape constraints), so sentences that satisfy rarely covered constraints rank higher.
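
The ranking rule can be sketched in a few lines. The assumptions here: n_sentences[C] is read as the number of corpus sentences satisfying constraint C, constraints are represented as opaque string keys, and the sentence ids, constraint names, and function names are all hypothetical.

```python
from collections import Counter


def rarity_scores(sentence_constraints):
    """sentence_constraints maps sentence id -> set of constraint keys it satisfies.

    Each satisfied constraint C contributes 1 / n_sentences[C], so constraints
    that few sentences cover contribute more.
    """
    n_sentences = Counter(c for cs in sentence_constraints.values() for c in cs)
    return {
        sid: sum(1.0 / n_sentences[c] for c in cs)
        for sid, cs in sentence_constraints.items()
    }


def rank(sentence_ids, match_count, rarity):
    """Sort by per-query match_count (descending), then rarity_score (descending)."""
    return sorted(sentence_ids, key=lambda s: (-match_count[s], -rarity[s]))


constraints = {
    "s1": {"k_initial", "cv_CVC"},
    "s2": {"k_initial"},
    "s3": {"zh_medial", "cv_CVC"},
}
rarity = rarity_scores(constraints)
match_count = {"s1": 2, "s2": 1, "s3": 1}
print(rarity)
print(rank(constraints, match_count, rarity))
```

In the toy data, s3 outranks s2 despite the same match_count because it is the only sentence covering the zh_medial constraint.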


Full citations

Pronunciation data

Carnegie Mellon University. (2014). The CMU Pronouncing Dictionary (~134,000 words). License: Modified BSD. http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Phonological features (priors + evidence)

Hayes, B. (2009). Introductory Phonology (theory-assigned 26-feature inventory used as prior). Wiley-Blackwell.

Marxer, R., Barker, J., Martin, N., & Coleman, J. (2016). Modelling speech intelligibility in adverse conditions: a corpus study (ECCC v1.2; evidence layer for feature posteriors and perceptual confusability edges). License: CC BY 4.0. https://datashare.ed.ac.uk/handle/10283/2791

Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels (evidence layer for vowel feature posteriors). Journal of the Acoustical Society of America, 97(5 Pt 1), 3099–3111. DOI: 10.1121/1.411872

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.

Phonological complexity

Stoel-Gammon, C. (2010). The Word Complexity Measure: Description and application to developmental phonology and disorders. Clinical Linguistics & Phonetics, 24(4-5), 271–282. DOI: 10.3109/02699200903581059

Phonotactic probability

Vitevitch, M. S., & Luce, P. A. (2004). A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3), 481–487. DOI: 10.3758/BF03195594 — method origin; PhonoLex computes values directly from the CMU dict.

Lexical frequency

PhonoLex (in-house derivation). Frequencies derived from HuggingFace FineWeb-Edu (~800M tokens, ~1M docs). License: ODC-BY 1.0. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

Korochkina, M., Marelli, M., Brysbaert, M., & Rastle, K. (2024). The Children and Young People's Books Lexicon (CYP-LEX). Quarterly Journal of Experimental Psychology, 77(11), 2197–2214. CC BY 4.0. https://osf.io/squ49/

Developmental frequency

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd ed.). Lawrence Erlbaum Associates. https://childes.talkbank.org/

Rose, Y., & MacWhinney, B. (2014). The PhonBank Project: Data and software-assisted methods for the study of phonology and phonological development. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford Handbook of Corpus Phonology. https://phonbank.talkbank.org/

Lexical timing (AoA — validation oracles)

Scott, G. G., Keitel, A., Becirspahic, M., Yao, B., & Sereno, S. C. (2019). The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51(3), 1258–1270. DOI: 10.3758/s13428-018-1099-3 — primary validation oracle (Spearman 0.868 on N=5,551).

Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978–990. DOI: 10.3758/s13428-012-0210-4 — secondary cross-construct oracle (Pearson 0.816 on Glasgow-unseen N=500). Not redistributed.

Semantic (methodological anchors)

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911. DOI: 10.3758/s13428-013-0403-5 — scale anchor for in-house concreteness.

Pexman, P. M., Muraki, E., Sidhu, D. M., Siakaluk, P. D., & Yap, M. J. (2019). Quantifying sensorimotor experience: Body-object interaction ratings for more than 9,000 English words. Behavior Research Methods, 51(2), 453–466. DOI: 10.3758/s13428-018-1171-z — scale anchor for in-house BOI.

Winter, B., Lupyan, G., Perry, L. K., Dingemanse, M., & Perlman, M. (2024). Iconicity ratings for 14,000+ English words. Behavior Research Methods, 56(3), 1640–1655. DOI: 10.3758/s13428-023-02112-6 — scale anchor for in-house iconicity.

Diveica, V., Pexman, P. M., & Binney, R. J. (2023). Quantifying social semantics: An inclusive definition of socialness and ratings for 8,388 English words. Behavior Research Methods, 55(2), 461–473. DOI: 10.3758/s13428-022-01810-x — scale anchor for in-house socialness.

Hoffman, P., Lambon Ralph, M. A., & Rogers, T. T. (2013). Semantic diversity: A measure of semantic ambiguity based on variability in the contextual usage of words. Behavior Research Methods, 45(3), 718–730. DOI: 10.3758/s13428-012-0278-x — scale anchor for in-house semantic diversity.

Affective (methodological anchor)

Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4), 1191–1207. DOI: 10.3758/s13428-012-0314-x — scale anchor for in-house valence + arousal; validated against held-out Warriner oracle.

Word similarity graph

PhonoLex (in-house derivation). Qwensim: ~1.6M word-similarity edges from Qwen3-Embedding-0.6B cosine over FineWeb-Edu. Bulk of the graph.

Marxer et al. (2016) ECCC — see Phonological features section above.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: The concept revisited (WordSim-353). Proceedings of the 10th International Conference on World Wide Web, 406–414. DOI: 10.1145/371920.372094

Morphology

Batsuren, K., Bella, G., & Giunchiglia, F. (2021). MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology. Proceedings of the 18th SIGMORPHON Workshop, 39–48. License: CC BY-SA 3.0. https://github.com/kbatsuren/MorphyNet

Sentences corpus

Penedo, G., Kydlíček, H., Lozhkov, A., et al. (2024). FineWeb-Edu: an open and high-quality dataset for educational content. License: ODC-BY 1.0. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

Nivre, J., et al. (2020). Universal Dependencies v2 (UD English-EWT + GUM corpora). License: CC BY-SA. https://universaldependencies.org/

Warstadt, A., Singh, A., & Bowman, S. R. (2019). CoLA: The Corpus of Linguistic Acceptability (positive examples). Transactions of the ACL. https://nyu-mll.github.io/CoLA/

Tatoeba contributors. (2024). Tatoeba sentence collection (English subset). License: CC BY 2.0 FR. https://tatoeba.org/

Lison, P., Tiedemann, J., & Kouylekov, M. (2018). OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora. Proceedings of LREC. http://www.opensubtitles.org/

Clinical intervention approaches

Gierut, J. A. (1989). Maximal opposition approach to phonological treatment. Journal of Speech and Hearing Disorders, 54(1), 9–19. DOI: 10.1044/jshd.5401.09

Gierut, J. A. (1990). Differential learning of phonological oppositions. Journal of Speech and Hearing Research, 33(3), 540–549. DOI: 10.1044/jshr.3303.540

Williams, A. L. (2000). Multiple oppositions: theoretical foundations for an alternative contrastive intervention approach. American Journal of Speech-Language Pathology, 9(4), 282–288. DOI: 10.1044/1058-0360.0904.282

Storkel, H. L. (2022). Minimal, Maximal, or Multiple: Which Contrastive Intervention Approach to Use With Children With Speech Sound Disorders? Language, Speech, and Hearing Services in Schools, 53(3), 632–645. DOI: 10.1044/2022_LSHSS-21-00137