Data & Methods

PhonoLex is a psycholinguistic data platform for speech-language pathologists and researchers. It consolidates 15 peer-reviewed research datasets covering 44,011 English words, providing a single interface for word property lookup, phonological analysis, similarity search, and intervention planning.

This page explains what PhonoLex measures, where the data comes from, and how the computational methods work.


Overview

PhonoLex provides four categories of data:

  1. Word Properties -- 35 filterable psycholinguistic and phonological properties drawn from 15 research datasets, organized into 9 categories
  2. Phonological Similarity -- a phoneme-based similarity algorithm using learned articulatory feature vectors
  3. Cognitive Association Data -- over 1 million word-to-word relationships from 7 human-subjects datasets (free association, semantic relatedness, perceptual confusability, semantic priming, similarity judgments)
  4. Contrastive Intervention Data -- minimal pairs, maximal opposition pairs, and multiple opposition sets grounded in the clinical intervention literature

All data is based on General American English pronunciations from the CMU Pronouncing Dictionary (primary pronunciations only). The vocabulary of 44,011 words is filtered to include only words that have an IPA transcription, word frequency data, and at least one psycholinguistic norm value.


Word Properties

PhonoLex assigns up to 35 filterable properties to each word, organized into 9 categories. Not all words have values for every property -- coverage varies by dataset (30--100%). When filtering by a property, words without that property value are excluded.
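The exclusion rule for missing values can be illustrated with a minimal sketch. The record layout, the property names, the function `filter_words`, and the numeric values below are all hypothetical and invented for illustration; they are not PhonoLex's actual schema or data.

```python
# Toy records: property names and values are invented for illustration only.
WORDS = {
    "cat": {"syllable_count": 1, "concreteness": 4.9, "iconicity": None},
    "idea": {"syllable_count": 3, "concreteness": 1.6, "iconicity": 0.4},
    "zing": {"syllable_count": 1, "concreteness": None, "iconicity": 2.1},
}

def filter_words(words, prop, lo=None, hi=None):
    """Return the words whose `prop` value lies in [lo, hi].

    A word with no value for `prop` is dropped, even when no bounds are given.
    """
    kept = []
    for word, props in words.items():
        value = props.get(prop)
        if value is None:
            continue  # no norm available for this word: excluded from the result
        if lo is not None and value < lo:
            continue
        if hi is not None and value > hi:
            continue
        kept.append(word)
    return kept

print(filter_words(WORDS, "concreteness", lo=1.0))  # "zing" has no rating, so it is dropped
```

Under this reading, adding a filter on a sparsely covered property shrinks the result set even when the bounds themselves are permissive.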

Phonological Complexity

These properties describe the structural complexity of a word's sound form. They are central to clinical decision-making because phonological complexity predicts production difficulty in children with speech sound disorders.

| Property | What It Measures | Scale | Source |
| --- | --- | --- | --- |
| Syllable Count | Number of syllables | 1--8 | CMU Pronouncing Dictionary |
| Phoneme Count | Number of speech sounds (phonemes) | 1--15+ | CMU Pronouncing Dictionary |
| Word Complexity Measure (WCM) | Composite complexity from 8 phonological parameters (clusters, velars, fricatives, liquids, stress pattern, word-final consonants) | 0--20+ | Stoel-Gammon (2010) |

Why it matters for SLPs: Early intervention typically targets simple words (1 syllable, low WCM). As treatment progresses, clinicians systematically increase complexity. PhonoLex lets you filter for words at specific complexity levels while controlling other variables like frequency and age of acquisition.
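As a rough illustration of how a WCM-style score accumulates, the sketch below awards one point per complexity parameter over a CMU-style (ARPAbet) transcription. It is a simplified reading of Stoel-Gammon (2010), not PhonoLex's implementation: the function name `wcm` is hypothetical, /h/ (`HH`) is assumed to count as a fricative, and a cluster is approximated as any run of two or more adjacent consonants without regard to syllable boundaries.

```python
# ARPAbet symbol classes (CMU Pronouncing Dictionary notation).
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
VELARS = {"K", "G", "NG"}
LIQUIDS = {"L", "R", "ER"}  # ER is treated as a rhotic vowel
FRICATIVES_AFFRICATES = {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH", "CH", "JH"}
VOICED_FRICATIVES_AFFRICATES = {"V", "DH", "Z", "ZH", "JH"}

def wcm(phones):
    """Score a non-empty transcription such as ['K', 'AE1', 'T'] (stress digits on vowels)."""
    base = [p.rstrip("012") for p in phones]
    is_vowel = [b in VOWELS for b in base]
    stresses = [p[-1] for p in phones if p[-1] in "012"]

    score = 0
    if len(stresses) > 2:
        score += 1  # more than two syllables
    if "1" in stresses and stresses[0] != "1":
        score += 1  # primary stress on a non-initial syllable
    if not is_vowel[-1]:
        score += 1  # word-final consonant

    # One point per consonant run of length >= 2 (simplified cluster detection).
    run = 0
    for vowel in is_vowel + [True]:  # the sentinel closes a word-final run
        if vowel:
            if run >= 2:
                score += 1
            run = 0
        else:
            run += 1

    score += sum(b in VELARS for b in base)
    score += sum(b in LIQUIDS for b in base)
    score += sum(b in FRICATIVES_AFFRICATES for b in base)
    score += sum(b in VOICED_FRICATIVES_AFFRICATES for b in base)  # extra point for voicing
    return score

print(wcm(["K", "AE1", "T"]))               # cat
print(wcm(["S", "T", "R", "IH1", "NG"]))    # string
```

With these rules "cat" earns points for its final consonant and its velar, while "string" adds a cluster, a liquid, and a fricative on top of those.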

Phonotactic Probability

Phonotactic probability measures how common a word's sound patterns are in English. Children learn high-probability sound sequences earlier and produce them more accurately.

| Property | What It Measures | Scale | Source |
| --- | --- | --- | --- |
| Biphone Probability (Avg) | How common adjacent phoneme pairs are in English | 0--1 | Vitevitch & Luce (2004) |
| Positional Segment Probability (Avg) | How common each phoneme is in its syllable position (onset, nucleus, or coda) | 0--1 | Vitevitch & Luce (2004) |

Why it matters for SLPs: High phonotactic probability facilitates word learning. When selecting treatment words, you may want high-probability words for easier targets or low-probability words to challenge generalization.
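The averaging behind a measure like Biphone Probability (Avg) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the probability table is a toy with invented values, and the lookup ignores position, whereas the Vitevitch & Luce (2004) calculator uses position-specific counts.

```python
def average_biphone_probability(phones, biphone_prob):
    """Mean probability over adjacent phoneme pairs; None for one-phoneme words."""
    pairs = list(zip(phones, phones[1:]))
    if not pairs:
        return None
    # Pairs missing from the table are treated as probability 0.0 (an assumption).
    return sum(biphone_prob.get(pair, 0.0) for pair in pairs) / len(pairs)

# Invented values, for illustration only.
TOY_BIPHONE_PROB = {("K", "AE"): 0.004, ("AE", "T"): 0.006}

print(average_biphone_probability(["K", "AE", "T"], TOY_BIPHONE_PROB))
```

Averaging rather than summing keeps the value comparable across words of different lengths, which matters when the property is used as a filter.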

Lexical Properties

Lexical properties describe how words are used and learned in everyday language.

| Property | What It Measures | Scale | Source |
| --- | --- | --- | --- |
| Word Frequency | How often the word appears in spoken language (film subtitles) | 0--1000+ per million | SUBTLEX-US (Brysbaert & New, 2009) |
| Contextual Diversity | Percentage of films containing the word | 0--100 | SUBTLEX-US (Brysbaert & New, 2009) |
| Word Prevalence | Proportion of people who know the word | 0--1 | Brysbaert et al. (2019) |
| Age of Acquisition (Glasgow) | Rated age when the word was learned (1 = earliest, 7 = latest) | 1--7 | Glasgow Norms (Scott et al., 2019) |
| Age of Acquisition (years) | Estimated age of acquisition in years | 1--25 | Kuperman et al. (2012) |
| Lexical Decision RT | Mean reaction time to recognize the word (milliseconds) | 400--1200 | English Lexicon Project (Balota et al., 2007) |

Why it matters for SLPs: Frequency and age of acquisition are essential for selecting developmentally appropriate treatment words. High-frequency, early-acquired words are more functional for daily communication. Lexical decision RT provides an objective measure of word recognition difficulty.

Semantic Properties

Semantic properties describe aspects of word meaning that affect processing and learning.

| Property | What It Measures | Scale | Source |
| --- | --- | --- | --- |
| Imageability | Ease of forming a mental image | 1--7 | Glasgow Norms (Scott et al., 2019) |
| Familiarity | How familiar the word feels | 1--7 | Glasgow Norms (Scott et al., 2019) |
| Concreteness | Concrete (tangible) vs. abstract | 1--5 | Brysbaert et al. (2014) |
| Size | Perceived size of the referent | 1--7 | Glasgow Norms (Scott et al., 2019) |

Why it matters for SLPs: High-imageability and high-concreteness words are easier to teach because they can be demonstrated, pictured, and physically interacted with. These properties help you select treatment stimuli that are semantically accessible to your client.

Affective Properties

Affective properties capture the emotional dimensions of word meaning.

| Property | What It Measures | Scale | Source |
| --- | --- | --- | --- |
| Valence | Emotional positivity/negativity (1 = very negative, 9 = very positive) | 1--9 | Warriner et al. (2013) |
| Arousal | Emotional intensity (1 = very calm, 9 = very excited) | 1--9 | Warriner et al. (2013) |
| Dominance | Sense of control (1 = very weak, 9 = very powerful) | 1--9 | Warriner et al. (2013) |

Why it matters for SLPs: Affective properties are relevant for social-emotional language therapy, narrative intervention, and ensuring that treatment words are emotionally appropriate for the client.

Cognitive / Embodied Properties

These properties capture how words relate to physical experience and social interaction.

| Property | What It Measures | Scale | Source |
| --- | --- | --- | --- |
| Iconicity | Sound-meaning correspondence (does the word sound like what it means?) | -5 to +5 | Winter et al. (2023) |
| Body-Object Interaction (BOI) | Ease of physical interaction with the referent | 1--7 | Pexman et al. (2019) |
| Socialness | Degree of social content | 1--7 | Diveica et al. (2021) |

Why it matters for SLPs: Iconic words (where sound maps to meaning) may be easier to learn. High-BOI words can be paired with physical activities in therapy. Socialness helps select words relevant to social communication goals.

Sensorimotor Properties -- Perceptual

Perceptual strength ratings measure how strongly a word is associated with each sensory modality.

| Property | What It Measures | Scale | Source |
| --- | --- | --- | --- |
| Auditory | Association with hearing | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Visual | Association with seeing | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Haptic | Association with touch | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Gustatory | Association with taste | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Olfactory | Association with smell | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Interoceptive | Association with internal bodily sensation | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |

Why it matters for SLPs: Sensorimotor properties help select words that connect to specific sensory experiences, which can support multisensory therapy approaches and vocabulary instruction tied to real-world perception.

Sensorimotor Properties -- Action

Action strength ratings measure how strongly a word is associated with actions performed by different body parts.

| Property | What It Measures | Scale | Source |
| --- | --- | --- | --- |
| Hand / Arm | Association with hand/arm actions | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Foot / Leg | Association with foot/leg actions | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Head (excl. mouth) | Association with head movements | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Mouth / Throat | Association with mouth/throat actions | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Torso | Association with torso actions | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |

Why it matters for SLPs: Action strength ratings support embodied learning approaches. Words with high mouth/throat action strength, for example, may pair naturally with oral-motor activities.

Morphological Properties

Morphological properties describe the internal structure of words.

| Property | What It Measures | Scale | Source |
| --- | --- | --- | --- |
| Morpheme Count | Number of meaningful parts (morphemes) | 1--6+ | MorphoLex-en (Sanchez-Gutierrez et al., 2018) |
| Prefix Count | Number of prefixes | 0--4 | MorphoLex-en (Sanchez-Gutierrez et al., 2018) |
| Suffix Count | Number of suffixes | 0--4 | MorphoLex-en (Sanchez-Gutierrez et al., 2018) |

Why it matters for SLPs: Morphological complexity affects word learning and reading development. Monomorphemic words (morpheme count = 1) are structurally simpler. Filtering by morpheme count lets you control this variable when building word lists.


Phonological Similarity

PhonoLex computes phonological similarity between words using a method that respects the internal structure of syllables. Rather than treating words as flat sequences of sounds, the algorithm decomposes each word into syllables, and each syllable into three components: onset (initial consonants), nucleus (vowel), and coda (final consonants).

How It Works

The similarity computation has two levels:

Within each syllable component, the algorithm compares the phoneme sequences of two words using a soft edit distance. Instead of treating all phoneme substitutions as equally costly, the cost of substituting one phoneme for another is based on how articulatorily similar they are. Phonemes that share many articulatory features (e.g., /p/ and /b/, which differ only in voicing) have a low substitution cost, while phonemes that differ in many features (e.g., /s/ and /m/) have a high cost. Because an edit distance also charges for insertions and deletions, consonant clusters are handled naturally -- comparing a /kr/ onset to a /k/ onset incurs a length penalty for the unmatched /r/.

Across syllables, the algorithm computes a weighted average of onset, nucleus, and coda similarity, then uses another soft edit distance to compare syllable sequences. This handles words of different lengths (e.g., "cat" vs. "catalog") gracefully.
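The soft edit distance described here can be sketched as a standard Levenshtein recurrence with a graded substitution cost. This is a minimal illustration, not PhonoLex's code: the function names, the fixed insertion/deletion cost of 1.0, the normalization by the longer sequence, and the toy cost function are all assumptions.

```python
def soft_edit_distance(a, b, sub_cost, indel_cost=1.0):
    """Levenshtein distance where substituting x for y costs sub_cost(x, y) in [0, 1]."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                       # deletion
                d[i][j - 1] + indel_cost,                       # insertion
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1])  # graded substitution
            )
    return d[n][m]

def sequence_similarity(a, b, sub_cost):
    """Map the distance to a 0..1 similarity by dividing by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty components (e.g., no onset in either word) match
    return 1.0 - soft_edit_distance(a, b, sub_cost) / longest

def toy_cost(x, y):
    """Invented costs: identical = 0, the voicing pair p/b = 0.2, anything else = 1."""
    if x == y:
        return 0.0
    if {x, y} == {"p", "b"}:
        return 0.2
    return 1.0

print(sequence_similarity(["k", "r"], ["k"], toy_cost))  # /kr/ vs /k/: one unmatched phoneme
print(sequence_similarity(["p"], ["b"], toy_cost))       # close articulatory neighbours
```

The same routine could plausibly be reused at the syllable level by passing whole syllables as sequence items and a cost of one minus the weighted syllable similarity, though the source does not spell out that detail.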

The articulatory similarity between phonemes is derived from learned feature vectors (38 articulatory features, with values estimated via Bayesian inference). Each phoneme is represented as a continuous feature vector, and cosine similarity between vectors determines how similar two phonemes are.
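A minimal sketch of the cosine comparison follows. The four-dimensional vectors are toy stand-ins (voice, nasal, labial, continuant) with hand-set values, not the learned PhonoLex vectors, and turning similarity into a substitution cost by clamping at zero is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector carries no direction
    return dot / (norm_u * norm_v)

def substitution_cost(u, v):
    """Assumed mapping: cost 0 for identical direction, 1 for orthogonal or opposed."""
    return 1.0 - max(0.0, cosine_similarity(u, v))

# Toy vectors in the order [voice, nasal, labial, continuant]; values are illustrative.
TOY_VECTORS = {
    "p": [-1.0, -1.0, 1.0, -1.0],
    "b": [1.0, -1.0, 1.0, -1.0],
    "m": [1.0, 1.0, 1.0, -1.0],
    "s": [-1.0, -1.0, -1.0, 1.0],
}

print(cosine_similarity(TOY_VECTORS["p"], TOY_VECTORS["b"]))  # differ only in voicing
print(cosine_similarity(TOY_VECTORS["s"], TOY_VECTORS["m"]))  # differ in every toy feature
```

Even with these toy values the ordering matches the examples in the text: /p/ and /b/ come out far closer than /s/ and /m/.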

Component Weights

The relative importance of onset, nucleus, and coda in the similarity computation is controlled by adjustable weights. This lets you find different types of phonological relationships:

| Preset | Onset | Nucleus | Coda | What It Finds |
| --- | --- | --- | --- | --- |
| Balanced | 0.33 | 0.33 | 0.33 | Overall sound similarity, weighting all positions equally |
| Rhymes | 0.0 | 0.5 | 0.5 | Words that rhyme (matching vowel + ending) |
| Alliteration | 1.0 | 0.0 | 0.0 | Words that start with similar sounds |
| Assonance | 0.0 | 1.0 | 0.0 | Words with matching vowel sounds |
| Consonance | 0.5 | 0.0 | 0.5 | Words with matching consonant patterns |
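
How the weights combine the three component scores can be shown with a short sketch. The function name and the component similarities passed in are hypothetical, only four of the presets are reproduced, and dividing by the sum of the weights is an assumption made so that presets such as Balanced (which sums to 0.99) still yield scores up to 1.0.

```python
# (onset, nucleus, coda) weights copied from four of the presets in the table.
PRESETS = {
    "balanced": (0.33, 0.33, 0.33),
    "rhymes": (0.0, 0.5, 0.5),
    "assonance": (0.0, 1.0, 0.0),
    "consonance": (0.5, 0.0, 0.5),
}

def syllable_similarity(onset_sim, nucleus_sim, coda_sim, weights):
    """Weighted average of the three component similarities (each in 0..1)."""
    w_onset, w_nucleus, w_coda = weights
    total = w_onset + w_nucleus + w_coda
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return (w_onset * onset_sim + w_nucleus * nucleus_sim + w_coda * coda_sim) / total

# Hypothetical component scores for "cat" vs. "bat": different onset, same vowel and coda.
components = (0.2, 1.0, 1.0)
for name, weights in PRESETS.items():
    print(name, round(syllable_similarity(*components, weights), 3))
```

With the onset weighted at zero, the rhyme preset ignores the /k/ vs. /b/ difference entirely, while the consonance preset is pulled down by it.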

Why it matters for SLPs: Different clinical goals call for different types of phonological similarity. Finding rhymes supports phonological awareness goals. Finding words with similar onsets supports work on initial consonant production. The weight system lets you tailor similarity search to your specific clinical question.

Interpreting Similarity Scores

Similarity scores range from 0.0 (completely different) to 1.0 (identical). Typical ranges:

  • 0.90+: Near-identical sound patterns (e.g., "cat" and "bat" -- perfect rhymes)
  • 0.75--0.85: Strong resemblance (e.g., "computer" and "commuter")
  • 0.40--0.60: Moderate similarity (some shared sounds)
  • 0.20--0.30: Low similarity (e.g., "cat" and "dog" -- unrelated sound patterns)

Cognitive Association Data

PhonoLex includes over 1 million word-to-word relationships from 7 research datasets. These capture how words are connected in human cognition -- through meaning, association, perception, and priming.

Free Association

| Dataset | Relationships | What It Measures | Source |
| --- | --- | --- | --- |
| University of South Florida (USF) | 72,172 | Forward and backward word association norms collected over 20 years from university students. | Nelson et al. (2004) |

Why it matters for SLPs: Free association data reveals how words are connected in the mental lexicon. High-association word pairs (e.g., "dog" and "cat") can be used for semantic therapy, word retrieval exercises, and understanding a client's associative network.

Semantic Relatedness and Similarity

| Dataset | Relationships | What It Measures | Source |
| --- | --- | --- | --- |
| MEN | 3,000 | Human judgments of semantic relatedness. Raters assessed how related two words are (regardless of the type of relationship). | Bruni et al. (2014) |
| SimLex-999 | 999 | Human judgments of semantic similarity specifically (not just relatedness). "Car" and "bicycle" are similar; "car" and "road" are related but not similar. | Hill et al. (2015) |
| WordSim-353 | 351 | Human judgments of semantic relatedness, a widely-used benchmark. | Finkelstein et al. (2001) |

Why it matters for SLPs: Semantic relatedness and similarity data help identify word pairs for semantic feature analysis, category sorting tasks, and understanding the structure of a client's semantic knowledge.

Perceptual Confusability

| Dataset | Relationships | What It Measures | Source |
| --- | --- | --- | --- |
| Edinburgh Closed-set Confusability Corpus (ECCC) | 2,489 | Which words do listeners confuse with each other when hearing speech in noise? Based on listening experiments where participants identified words in degraded conditions. | Marxer et al. (2016) |

Why it matters for SLPs: Confusability data is directly relevant to auditory processing and speech perception. If a client confuses certain words, ECCC data can help you identify other words likely to be confusable and design targeted listening exercises.

Semantic Priming

| Dataset | Relationships | What It Measures | Source |
| --- | --- | --- | --- |
| Semantic Priming Project (SPP) | 1,661 | How much does seeing one word speed up recognition of a related word? Measured via lexical decision reaction times at short and long stimulus onset asynchronies. | Hutchison et al. (2013) |

Why it matters for SLPs: Priming data shows how strongly two words activate each other during language processing. Strong priming pairs can support word retrieval therapy and semantic network activation.


Contrastive Intervention Research

PhonoLex implements three evidence-based contrastive approaches used in phonological intervention for speech sound disorders.

Minimal Pairs

Minimal pairs are two words that differ by exactly one phoneme in the same position (e.g., "cat" vs. "bat," "pat" vs. "pan"). PhonoLex has precomputed all minimal pairs in the vocabulary, allowing you to search for pairs that contrast any two phonemes.

You can filter by:

  • Phoneme contrast (e.g., /k/ vs. /t/)
  • Position (word-initial, word-medial, word-final)
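
The minimal-pair definition above translates directly into a small check. The sketch below is illustrative only: the function names, the toy lexicon, and the brute-force pairwise scan are assumptions, and a precomputation over a 44,011-word vocabulary would presumably need an index rather than comparing every pair.

```python
from itertools import combinations

def minimal_pair_contrast(a, b):
    """Return (index, phoneme_a, phoneme_b) if a and b differ in exactly one
    phoneme at the same position; otherwise None."""
    if len(a) != len(b):
        return None
    diffs = [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None

def position_label(index, length):
    """Classify the contrast position within the word."""
    if index == 0:
        return "initial"
    if index == length - 1:
        return "final"
    return "medial"

def find_minimal_pairs(lexicon):
    """Brute-force scan of all word pairs in a {word: phoneme tuple} lexicon."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        contrast = minimal_pair_contrast(p1, p2)
        if contrast is not None:
            index, x, y = contrast
            pairs.append((w1, w2, x, y, position_label(index, len(p1))))
    return pairs

# Toy lexicon with IPA transcriptions.
TOY_LEXICON = {
    "cat": ("k", "æ", "t"),
    "bat": ("b", "æ", "t"),
    "pat": ("p", "æ", "t"),
    "pan": ("p", "æ", "n"),
    "dog": ("d", "ɔ", "ɡ"),
}

for pair in find_minimal_pairs(TOY_LEXICON):
    print(pair)
```

Filtering by contrast or position then reduces to selecting tuples by their phoneme fields or their position label.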

Maximal Opposition (Gierut, 1989)

Maximal opposition therapy selects phoneme pairs that are maximally different in their articulatory features -- differing in major class, place, manner, and voicing. The theory is that targeting maximally different sound contrasts promotes broader generalization across the phonological system.

PhonoLex computes feature distance between phoneme pairs using learned articulatory features, then finds word pairs that contrast those maximally opposed phonemes. The algorithm prioritizes:

  1. Major class differences (e.g., an obstruent vs. a sonorant)
  2. Maximum number of differing distinctive features
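
The two-step priority above can be sketched as a sort key. This is a simplified illustration: PhonoLex uses learned continuous features, whereas the sketch counts differences over a hand-written toy table of binary values, treats [sonorant] as the only major class feature, and uses hypothetical names throughout.

```python
from itertools import combinations

FEATURE_NAMES = ("sonorant", "continuant", "voice", "labial", "coronal")
MAJOR_CLASS = {"sonorant"}  # assumption: obstruent vs. sonorant is the major class split

# Toy feature table (+1 / -1), in the order given by FEATURE_NAMES.
TOY_FEATURES = {
    "p": (-1, -1, -1, 1, -1),
    "b": (-1, -1, 1, 1, -1),
    "s": (-1, 1, -1, -1, 1),
    "m": (1, -1, 1, 1, -1),
}

def opposition_key(a, b, features=TOY_FEATURES):
    """Return (major class differs?, number of differing features) for a phoneme pair."""
    differing = [name for name, x, y in zip(FEATURE_NAMES, features[a], features[b]) if x != y]
    return (any(name in MAJOR_CLASS for name in differing), len(differing))

def rank_oppositions(features=TOY_FEATURES):
    """Order phoneme pairs from most to least opposed: major class first, then count."""
    pairs = combinations(sorted(features), 2)
    return sorted(pairs, key=lambda pair: opposition_key(*pair, features), reverse=True)

for pair in rank_oppositions():
    print(pair, opposition_key(*pair))
```

Word pairs for therapy would then be drawn from minimal pairs whose contrasting phonemes sit near the top of this ranking.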

Reference: Gierut, J. A. (1989). Maximal opposition approach to phonological treatment. Journal of Speech and Hearing Disorders, 54(1), 9--19.

Multiple Opposition (Storkel, 2022)

Multiple opposition therapy targets phoneme collapse -- when a child substitutes one sound for multiple different target sounds (e.g., using /t/ for /k/, /s/, and /tʃ/). Treatment selects a set of words that all share the same rime but differ in the collapsed phoneme position, so the child must learn to differentiate multiple contrasts simultaneously.

PhonoLex identifies representative target phonemes for a given collapse pattern and finds word sets (triplets, quadruplets, or larger) where all words differ only in the target phoneme position.
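
A minimal sketch of the set-finding step, restricted to word-initial collapses, groups words by everything after the first consonant and keeps groups that cover the substitute plus enough targets. The function names, the toy lexicon, and the small IPA vowel set are assumptions; choosing representative targets, which PhonoLex also does, is not modelled here.

```python
from collections import defaultdict

TOY_VOWELS = {"i", "ɪ", "u", "æ"}  # only the vowels used in the toy lexicon

def split_onset_rest(phones):
    """Split a transcription into (initial consonants, remainder from the first vowel)."""
    for i, phone in enumerate(phones):
        if phone in TOY_VOWELS:
            return tuple(phones[:i]), tuple(phones[i:])
    return tuple(phones), ()

def multiple_opposition_sets(lexicon, substitute, targets, min_size=3):
    """Find word sets that differ only in a single initial consonant drawn from
    the substitute and its collapsed targets."""
    wanted = {substitute, *targets}
    by_rest = defaultdict(dict)
    for word, phones in lexicon.items():
        onset, rest = split_onset_rest(phones)
        if len(onset) == 1 and onset[0] in wanted:
            by_rest[rest].setdefault(onset[0], word)  # keep the first word per onset
    return [
        sorted(words.values())
        for words in by_rest.values()
        if substitute in words and len(words) >= min_size
    ]

TOY_LEXICON = {
    "tea": ("t", "i"), "key": ("k", "i"), "sea": ("s", "i"),
    "tip": ("t", "ɪ", "p"), "sip": ("s", "ɪ", "p"), "chip": ("tʃ", "ɪ", "p"),
    "two": ("t", "u"), "chew": ("tʃ", "u"),
}

# Collapse pattern from the text: /t/ produced for /k/, /s/, and /tʃ/.
print(multiple_opposition_sets(TOY_LEXICON, "t", ["k", "s", "tʃ"]))
```

The "two"/"chew" group is dropped because it offers only one contrast, which would make it a minimal pair rather than a multiple opposition set.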

Reference: Storkel, H. L. (2022). Minimal, Maximal, or Multiple: Which Contrastive Intervention Approach to Use With Children With Speech Sound Disorders? Language, Speech, and Hearing Services in Schools, 53(3), 632--645.


Phonological Features

Phonological similarity in PhonoLex is grounded in learned Bayesian feature vectors (packages/features/), which replace PHOIBLE's binary features with continuous values learned from empirical evidence.

Each of the 39 English phonemes in PhonoLex (24 consonants and 15 vowels, including diphthongs) is represented using 38 distinctive features from the Hayes (2009) and Moisik & Esling (2011) feature systems. These features describe the articulatory properties of each sound:

  • Manner features: Whether the sound involves full closure (stops), partial closure (fricatives), nasal airflow (nasals), etc.
  • Place features: Where in the vocal tract the sound is produced (lips, teeth, alveolar ridge, palate, velum, glottis)
  • Laryngeal features: Voicing, aspiration, and glottal state
  • Vowel features: Height, backness, rounding, tenseness

Each feature has three possible values: present (+), absent (-), or not applicable (0). These ternary features are converted to continuous values (+1.0, -1.0, 0.0) for similarity computation. Each feature is encoded as a two-dimensional start/end pair to model articulation dynamics, yielding a 76-dimensional feature vector per phoneme. Diphthongs use 152 dimensions (76 for each vowel endpoint).
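
The dimension arithmetic (38 features × 2 = 76, and 2 × 76 = 152 for diphthongs) can be made concrete with a short sketch. The function names, the interleaved start/end layout, and the placeholder feature strings are assumptions; the strings are not real values for any phoneme.

```python
TERNARY_TO_FLOAT = {"+": 1.0, "-": -1.0, "0": 0.0}
N_FEATURES = 38

def encode_segment(start, end=None):
    """Encode 38 ternary symbols as a 76-dimensional vector of (start, end) pairs.

    If `end` is omitted the segment is treated as static (end equals start).
    """
    if end is None:
        end = start
    if len(start) != N_FEATURES or len(end) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} feature symbols")
    vector = []
    for s, e in zip(start, end):
        vector.extend((TERNARY_TO_FLOAT[s], TERNARY_TO_FLOAT[e]))  # interleaved layout (assumed)
    return vector

def encode_diphthong(first_vowel, second_vowel):
    """Concatenate the 76-dimensional encodings of the two vowel endpoints."""
    return encode_segment(first_vowel) + encode_segment(second_vowel)

# Placeholder symbol strings of length 38, for checking dimensions only.
PLACEHOLDER_A = "+" * 10 + "-" * 10 + "0" * 18
PLACEHOLDER_B = "-" * 10 + "+" * 10 + "0" * 18

print(len(encode_segment(PLACEHOLDER_A)))                   # one segment
print(len(encode_diphthong(PLACEHOLDER_A, PLACEHOLDER_B)))  # two vowel endpoints
```

Because diphthong vectors are twice as long, any comparison between a diphthong and a monophthong needs some alignment convention, which the source does not describe.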

Why it matters for SLPs: The feature system is what makes PhonoLex's similarity algorithm phonologically grounded rather than arbitrary. When PhonoLex says two words sound similar, that judgment is based on the same articulatory features that SLPs use to describe speech sounds -- place, manner, and voicing. The maximal opposition tool also uses these features to find maximally contrastive phoneme pairs.


Full Citations

Pronunciation Data

Carnegie Mellon University. (2014). The CMU Pronouncing Dictionary (125,764 words). http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Phonological Features

Hayes, B. (2009). Introductory Phonology. Wiley-Blackwell.

Moisik, S. R., & Esling, J. H. (2011). The 'whole larynx' approach to laryngeal features. Proceedings of the International Congress of Phonetic Sciences XVII, 1406--1409.

Phonological Complexity

Stoel-Gammon, C. (2010). The Word Complexity Measure: Description and application to developmental phonology and disorders. Clinical Linguistics & Phonetics, 24(4-5), 271--282. DOI: 10.3109/02699200903581059

Phonotactic Probability

Vitevitch, M. S., & Luce, P. A. (2004). A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3), 481--487. DOI: 10.3758/BF03195586

Word Frequency and Contextual Diversity

Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977--990. DOI: 10.3758/BRM.41.4.977

Word Prevalence

Brysbaert, M., Mandera, P., McCormick, S. F., & Keuleers, E. (2019). Word prevalence norms for 62,000 English lemmas. Behavior Research Methods, 51, 467--479. DOI: 10.3758/s13428-018-1077-9

Age of Acquisition

Scott, G. G., Keitel, A., Becirspahic, M., Yao, B., & Sereno, S. C. (2019). The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51, 1258--1270. DOI: 10.3758/s13428-018-1099-3

Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978--990. DOI: 10.3758/s13428-012-0210-4

Lexical Decision

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... & Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39(3), 445--459. DOI: 10.3758/BF03193014

Concreteness

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904--911. DOI: 10.3758/s13428-013-0403-5

Affective Norms (Valence, Arousal, Dominance)

Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4), 1191--1207. DOI: 10.3758/s13428-012-0314-x

Iconicity

Winter, B., Perlman, M., Perry, L. K., & Lupyan, G. (2023). Iconicity ratings for 14,000+ English words. Behavior Research Methods, 55, 1640--1655. DOI: 10.3758/s13428-022-01847-0

Body-Object Interaction

Pexman, P. M., Muraki, E., Sidhu, D. M., Siakaluk, P. D., & Yap, M. J. (2019). Quantifying sensorimotor experience: Body-object interaction ratings for more than 9,000 English words. Behavior Research Methods, 51(2), 453--466. DOI: 10.3758/s13428-018-1171-z

Socialness

Diveica, V., Pexman, P. M., & Binney, R. J. (2021). Quantifying social semantics: An inclusive definition of socialness and ratings for 8,388 English words. Behavior Research Methods, 55, 461--479. DOI: 10.3758/s13428-022-01810-z

Sensorimotor Norms

Lynott, D., Connell, L., Brysbaert, M., Brand, J., & Carney, J. (2020). The Lancaster Sensorimotor Norms: Multidimensional measures of perceptual and action strength for 40,000 English word lemmas. Behavior Research Methods, 52(3), 1271--1291. DOI: 10.3758/s13428-019-01316-z

Morphological Data

Sanchez-Gutierrez, C. H., Mailhot, H., Deacon, S. H., & Wilson, M. A. (2018). MorphoLex: A derivational morphological database for 70,000 English words. Behavior Research Methods, 50(4), 1568--1580. DOI: 10.3758/s13428-017-0981-8

Cognitive Association Datasets

De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., & Storms, G. (2019). The Small World of Words English word association norms for over 12,000 cue words. Behavior Research Methods, 51(3), 987--1006. DOI: 10.3758/s13428-018-1115-7

Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3), 402--407. DOI: 10.3758/BF03195588

Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49, 1--47. DOI: 10.1613/jair.4135

Marxer, R., Barker, J., Martin, N., & Coleman, J. (2016). Modelling speech intelligibility in adverse conditions: A corpus study. Edinburgh DataShare. https://datashare.ed.ac.uk/handle/10283/2791

Hutchison, K. A., Balota, D. A., Neely, J. H., Cortese, M. J., Cohen-Shikora, E. R., Tse, C.-S., et al. (2013). The Semantic Priming Project. Behavior Research Methods, 45(4), 1099--1114. DOI: 10.3758/s13428-012-0304-z

Hill, F., Reichart, R., & Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665--695. DOI: 10.1162/COLI_a_00237

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: The concept revisited. Proceedings of the 10th International Conference on World Wide Web, 406--414. DOI: 10.1145/371920.372094

Clinical Intervention Approaches

Gierut, J. A. (1989). Maximal opposition approach to phonological treatment. Journal of Speech and Hearing Disorders, 54(1), 9--19. DOI: 10.1044/jshd.5401.09

Gierut, J. A. (1990). Differential learning of phonological oppositions. Journal of Speech and Hearing Research, 33(3), 540--549. DOI: 10.1044/jshr.3303.540

Gierut, J. A., & Neumann, H. J. (1992). Teaching and learning /th/: A non-confound. Clinical Linguistics & Phonetics, 6(3), 191--200. DOI: 10.3109/02699209208985533

Storkel, H. L. (2022). Minimal, Maximal, or Multiple: Which Contrastive Intervention Approach to Use With Children With Speech Sound Disorders? Language, Speech, and Hearing Services in Schools, 53(3), 632--645. DOI: 10.1044/2022_LSHSS-21-00137

Phonological Similarity Method

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707--710.