Data & Methods¶
PhonoLex is a psycholinguistic data platform for speech-language pathologists and researchers. It consolidates 15 peer-reviewed research datasets covering 44,011 English words, providing a single interface for word property lookup, phonological analysis, similarity search, and intervention planning.
This page explains what PhonoLex measures, where the data comes from, and how the computational methods work.
Overview¶
PhonoLex provides four categories of data:
- Word Properties -- 35 filterable psycholinguistic and phonological properties drawn from 15 research datasets, organized into 9 categories
- Phonological Similarity -- a phoneme-based similarity algorithm using learned articulatory feature vectors
- Cognitive Association Data -- 68,000+ word-to-word relationships from 6 human-subjects datasets (free association, semantic relatedness, perceptual confusability, semantic priming, similarity judgments)
- Contrastive Intervention Data -- minimal pairs, maximal opposition pairs, and multiple opposition sets grounded in the clinical intervention literature
All data is based on General American English pronunciations from the CMU Pronouncing Dictionary (primary pronunciations only). The vocabulary of 44,011 words is filtered to include only words that have an IPA transcription, word frequency data, and at least one psycholinguistic norm value.
Word Properties¶
PhonoLex assigns up to 35 filterable properties to each word, organized into 9 categories. Not all words have values for every property -- coverage varies by dataset (30--100%). When filtering by a property, words without that property value are excluded.
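The exclusion rule for missing values can be illustrated with a minimal sketch. The record layout, the property names, and the numeric values below are hypothetical and chosen only for illustration; they are not PhonoLex's actual schema or real norm values.

```python
# Hypothetical word records; None marks a property with no norm available.
WORDS = [
    {"word": "cat", "syllable_count": 1, "imageability": 6.8},
    {"word": "idea", "syllable_count": 3, "imageability": 3.1},
    {"word": "zax", "syllable_count": 1, "imageability": None},
]


def filter_by_range(words, prop, lo, hi):
    """Return words whose `prop` lies in [lo, hi]; words lacking a value are dropped."""
    kept = []
    for record in words:
        value = record.get(prop)
        if value is None:
            continue  # no value for this property, so the word is excluded
        if lo <= value <= hi:
            kept.append(record["word"])
    return kept


print(filter_by_range(WORDS, "imageability", 5.0, 7.0))
```

Note that "zax" is dropped even when the range spans the whole scale, because it has no imageability value at all.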
Phonological Complexity¶
These properties describe the structural complexity of a word's sound form. They are central to clinical decision-making because phonological complexity predicts production difficulty in children with speech sound disorders.
| Property | What It Measures | Scale | Source |
|---|---|---|---|
| Syllable Count | Number of syllables | 1--8 | CMU Pronouncing Dictionary |
| Phoneme Count | Number of speech sounds (phonemes) | 1--15+ | CMU Pronouncing Dictionary |
| Word Complexity Measure (WCM) | Composite complexity from 8 phonological parameters (including clusters, velars, fricatives, liquids, stress pattern, and word-final consonants) | 0--20+ | Stoel-Gammon (2010) |
Why it matters for SLPs: Early intervention typically targets simple words (1 syllable, low WCM). As treatment progresses, clinicians systematically increase complexity. PhonoLex lets you filter for words at specific complexity levels while controlling other variables like frequency and age of acquisition.
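To make the WCM idea concrete, here is a rough sketch that scores ARPAbet transcriptions of the kind found in the CMU Pronouncing Dictionary. The scoring rules are a paraphrase of the eight parameters and should be checked against Stoel-Gammon (2010). The function name `wcm`, the treatment of /h/ as a fricative, and the cluster shortcut (any run of two or more adjacent consonants, ignoring syllable boundaries) are assumptions of this sketch, not PhonoLex's implementation.

```python
# Simplified Word Complexity Measure over ARPAbet phonemes (illustrative only).
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
VELARS = {"K", "G", "NG"}
LIQUIDS_RHOTICS = {"L", "R", "ER"}  # ER stands in for rhotic vowels
FRICATIVES_AFFRICATES = {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH", "CH", "JH"}
VOICED_FRICATIVES_AFFRICATES = {"V", "DH", "Z", "ZH", "JH"}


def wcm(phones):
    """Score a word given as ARPAbet symbols, e.g. ["K", "AE1", "T"]."""
    base = [p.rstrip("012") for p in phones]  # strip stress digits
    vowel_positions = [i for i, b in enumerate(base) if b in VOWELS]
    score = 0

    # Word patterns: more than two syllables; primary stress after the first syllable.
    if len(vowel_positions) > 2:
        score += 1
    stress = [phones[i][-1] for i in vowel_positions]
    if "1" in stress and stress.index("1") > 0:
        score += 1

    # Syllable structures: word-final consonant; one point per consonant cluster.
    if base and base[-1] not in VOWELS:
        score += 1
    run = 0
    for b in base + ["AA"]:  # sentinel vowel flushes a word-final consonant run
        if b in VOWELS:
            score += 1 if run >= 2 else 0
            run = 0
        else:
            run += 1

    # Sound classes: one point per occurrence, plus one more if voiced fricative/affricate.
    for b in base:
        score += int(b in VELARS)
        score += int(b in LIQUIDS_RHOTICS)
        score += int(b in FRICATIVES_AFFRICATES)
        score += int(b in VOICED_FRICATIVES_AFFRICATES)
    return score


print(wcm(["K", "AE1", "T"]), wcm(["S", "T", "R", "IY1", "T"]))
```

Under these rules "cat" earns points for its velar and its final consonant, while "street" adds a cluster, a liquid, and a fricative.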
Phonotactic Probability¶
Phonotactic probability measures how common a word's sound patterns are in English. Children learn high-probability sound sequences earlier and produce them more accurately.
| Property | What It Measures | Scale | Source |
|---|---|---|---|
| Biphone Probability (Avg) | How common adjacent phoneme pairs are in English | 0--1 | Vitevitch & Luce (2004) |
| Positional Segment Probability (Avg) | How common each phoneme is in its syllable position (onset, nucleus, or coda) | 0--1 | Vitevitch & Luce (2004) |
Why it matters for SLPs: High phonotactic probability facilitates word learning. When selecting treatment words, you may want high-probability words for easier targets or low-probability words to challenge generalization.
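As a simplified illustration of average biphone probability, the sketch below estimates biphone frequencies from a four-word toy lexicon and averages them over a word. The Vitevitch & Luce (2004) calculator uses position-specific, log-frequency-weighted counts over a full lexicon, so this unweighted version is an assumption-laden approximation; `LEXICON` and `avg_biphone_probability` are hypothetical names.

```python
from collections import Counter

# Toy lexicon of ARPAbet transcriptions (illustrative only).
LEXICON = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "cab": ["K", "AE", "B"],
    "tack": ["T", "AE", "K"],
}


def biphones(phones):
    """Adjacent phoneme pairs, e.g. [K, AE, T] -> [(K, AE), (AE, T)]."""
    return list(zip(phones, phones[1:]))


# Relative frequency of each biphone across the toy lexicon.
counts = Counter(bp for phones in LEXICON.values() for bp in biphones(phones))
total = sum(counts.values())


def avg_biphone_probability(phones):
    """Mean relative frequency of the word's biphones (0.0 if it has none)."""
    pairs = biphones(phones)
    if not pairs:
        return 0.0
    return sum(counts[bp] / total for bp in pairs) / len(pairs)


print(avg_biphone_probability(LEXICON["cat"]), avg_biphone_probability(LEXICON["tack"]))
```

In this toy lexicon "cat" scores higher than "tack" because its biphones /k ae/ and /ae t/ each occur twice.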
Lexical Properties¶
Lexical properties describe how words are used and learned in everyday language.
| Property | What It Measures | Scale | Source |
|---|---|---|---|
| Word Frequency | How often the word appears in spoken language (film subtitles) | 0--1000+ per million | SUBTLEX-US (Brysbaert & New, 2009) |
| Contextual Diversity | Percentage of films containing the word | 0--100 | SUBTLEX-US (Brysbaert & New, 2009) |
| Word Prevalence | Proportion of people who know the word | 0--1 | Brysbaert et al. (2019) |
| Age of Acquisition (Glasgow) | Rated age when the word was learned (1 = earliest, 7 = latest) | 1--7 | Glasgow Norms (Scott et al., 2019) |
| Age of Acquisition (years) | Estimated age of acquisition in years | 1--25 | Kuperman et al. (2012) |
| Lexical Decision RT | Mean reaction time to recognize the word (milliseconds) | 400--1200 | English Lexicon Project (Balota et al., 2007) |
Why it matters for SLPs: Frequency and age of acquisition are essential for selecting developmentally appropriate treatment words. High-frequency, early-acquired words are more functional for daily communication. Lexical decision RT provides an objective measure of word recognition difficulty.
Semantic Properties¶
Semantic properties describe aspects of word meaning that affect processing and learning.
| Property | What It Measures | Scale | Source |
|---|---|---|---|
| Imageability | Ease of forming a mental image | 1--7 | Glasgow Norms (Scott et al., 2019) |
| Familiarity | How familiar the word feels | 1--7 | Glasgow Norms (Scott et al., 2019) |
| Concreteness | Concrete (tangible) vs. abstract | 1--5 | Brysbaert et al. (2014) |
| Size | Perceived size of the referent | 1--7 | Glasgow Norms (Scott et al., 2019) |
Why it matters for SLPs: High-imageability and high-concreteness words are easier to teach because they can be demonstrated, pictured, and physically interacted with. These properties help you select treatment stimuli that are semantically accessible to your client.
Affective Properties¶
Affective properties capture the emotional dimensions of word meaning.
| Property | What It Measures | Scale | Source |
|---|---|---|---|
| Valence | Emotional positivity/negativity (1 = very negative, 9 = very positive) | 1--9 | Warriner et al. (2013) |
| Arousal | Emotional intensity (1 = very calm, 9 = very excited) | 1--9 | Warriner et al. (2013) |
| Dominance | Sense of control (1 = very weak, 9 = very powerful) | 1--9 | Warriner et al. (2013) |
Why it matters for SLPs: Affective properties are relevant for social-emotional language therapy, narrative intervention, and ensuring that treatment words are emotionally appropriate for the client.
Cognitive / Embodied Properties¶
These properties capture how words relate to physical experience and social interaction.
| Property | What It Measures | Scale | Source |
|---|---|---|---|
| Iconicity | Sound-meaning correspondence (does the word sound like what it means?) | 1--7 | Winter et al. (2023) |
| Body-Object Interaction (BOI) | Ease of physical interaction with the referent | 1--7 | Pexman et al. (2019) |
| Socialness | Degree of social content | 1--7 | Diveica et al. (2021) |
Why it matters for SLPs: Iconic words (where sound maps to meaning) may be easier to learn. High-BOI words can be paired with physical activities in therapy. Socialness helps select words relevant to social communication goals.
Sensorimotor Properties -- Perceptual¶
Perceptual strength ratings measure how strongly a word is associated with each sensory modality.
| Property | What It Measures | Scale | Source |
|---|---|---|---|
| Auditory | Association with hearing | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Visual | Association with seeing | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Haptic | Association with touch | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Gustatory | Association with taste | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Olfactory | Association with smell | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Interoceptive | Association with internal bodily sensation | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
Why it matters for SLPs: Sensorimotor properties help select words that connect to specific sensory experiences, which can support multisensory therapy approaches and vocabulary instruction tied to real-world perception.
Sensorimotor Properties -- Action¶
Action strength ratings measure how strongly a word is associated with actions performed by different body parts.
| Property | What It Measures | Scale | Source |
|---|---|---|---|
| Hand / Arm | Association with hand/arm actions | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Foot / Leg | Association with foot/leg actions | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Head (excl. mouth) | Association with head movements | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Mouth / Throat | Association with mouth/throat actions | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
| Torso | Association with torso actions | 0--5 | Lancaster Sensorimotor Norms (Lynott et al., 2020) |
Why it matters for SLPs: Action strength ratings support embodied learning approaches. Words with high mouth/throat action strength, for example, may pair naturally with oral-motor activities.
Morphological Properties¶
Morphological properties describe the internal structure of words.
| Property | What It Measures | Scale | Source |
|---|---|---|---|
| Morpheme Count | Number of meaningful parts (morphemes) | 1--6+ | MorphoLex-en (Sanchez-Gutierrez et al., 2018) |
| Prefix Count | Number of prefixes | 0--4 | MorphoLex-en (Sanchez-Gutierrez et al., 2018) |
| Suffix Count | Number of suffixes | 0--4 | MorphoLex-en (Sanchez-Gutierrez et al., 2018) |
Why it matters for SLPs: Morphological complexity affects word learning and reading development. Monomorphemic words (morpheme count = 1) are structurally simpler. Filtering by morpheme count lets you control this variable when building word lists.
Phonological Similarity¶
PhonoLex computes phonological similarity between words using a method that respects the internal structure of syllables. Rather than treating words as flat sequences of sounds, the algorithm decomposes each word into syllables, and each syllable into three components: onset (initial consonants), nucleus (vowel), and coda (final consonants).
How It Works¶
The similarity computation has two levels:
Within each syllable component, the algorithm compares the phoneme sequences of two words using a soft edit distance. Instead of treating all phoneme substitutions as equally costly, the cost of substituting one phoneme for another is based on how articulatorily similar they are. Phonemes that share many articulatory features (e.g., /p/ and /b/, which differ only in voicing) have a low substitution cost, while phonemes that differ in many features (e.g., /s/ and /m/) have a high cost. This means consonant clusters are handled naturally -- comparing a /kr/ onset to a /k/ onset incurs an appropriate length penalty.
Across syllables, the algorithm computes a weighted average of onset, nucleus, and coda similarity, then uses another soft edit distance to compare syllable sequences. This handles words of different lengths (e.g., "cat" vs. "catalog") gracefully.
The articulatory similarity between phonemes is derived from learned feature vectors (38 articulatory features, with values learned via Bayesian inference). Each phoneme is represented as a continuous feature vector, and the cosine similarity between two vectors determines how similar the phonemes are judged to sound.
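The within-component comparison can be sketched as a standard edit-distance recurrence with a feature-based substitution cost. Everything specific here is an assumption made for illustration: the five-dimensional toy vectors (PhonoLex's learned vectors are much larger), the mapping from cosine similarity to cost, the fixed insertion/deletion cost of 1.0, and the normalization by the longer sequence.

```python
import math

# Toy feature vectors: [labial, coronal, voiced, nasal, continuant] (hypothetical values).
FEATURES = {
    "p": [1.0, -1.0, -1.0, -1.0, -1.0],
    "b": [1.0, -1.0, 1.0, -1.0, -1.0],
    "m": [1.0, -1.0, 1.0, 1.0, -1.0],
    "s": [-1.0, 1.0, -1.0, -1.0, 1.0],
    "k": [-1.0, -1.0, -1.0, -1.0, -1.0],
    "r": [-1.0, 1.0, 1.0, -1.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def sub_cost(x, y):
    """Substitution cost in [0, 1]: 0 for identical phonemes, 1 for opposite vectors."""
    if x == y:
        return 0.0
    return (1.0 - cosine(FEATURES[x], FEATURES[y])) / 2.0


def soft_edit_distance(a, b, indel=1.0):
    """Edit distance with graded substitution costs (dynamic programming)."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel
    for j in range(1, n + 1):
        d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                            # delete from a
                d[i][j - 1] + indel,                            # insert into a
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitute
            )
    return d[m][n]


def component_similarity(a, b):
    """Map distance to a 0..1 similarity, normalizing by the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty components (e.g. no coda) count as identical
    return 1.0 - soft_edit_distance(a, b) / longest


print(component_similarity(["p"], ["b"]), component_similarity(["k", "r"], ["k"]))
```

With these toy vectors, /p/ vs. /b/ costs little, /s/ vs. /m/ costs the maximum, and a /kr/ onset compared with /k/ pays one deletion as its length penalty.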
Component Weights¶
The relative importance of onset, nucleus, and coda in the similarity computation is controlled by adjustable weights. This lets you find different types of phonological relationships:
| Preset | Onset | Nucleus | Coda | What It Finds |
|---|---|---|---|---|
| Balanced | 0.33 | 0.33 | 0.33 | Overall sound similarity, weighting all positions equally |
| Rhymes | 0.0 | 0.5 | 0.5 | Words that rhyme (matching vowel + ending) |
| Alliteration | 1.0 | 0.0 | 0.0 | Words that start with similar sounds |
| Assonance | 0.0 | 1.0 | 0.0 | Words with matching vowel sounds |
| Consonance | 0.5 | 0.0 | 0.5 | Words with matching consonant patterns |
Why it matters for SLPs: Different clinical goals call for different types of phonological similarity. Finding rhymes supports phonological awareness goals. Finding words with similar onsets supports work on initial consonant production. The weight system lets you tailor similarity search to your specific clinical question.
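A weighted average over the three components might look like the sketch below. Dividing by the sum of the weights is an assumption added here so that presets whose weights do not sum exactly to 1 still yield a score between 0 and 1. The component similarities for "cat" vs. "bat" are made-up inputs, and only a few presets from the table are shown.

```python
# (onset, nucleus, coda) weights for some of the presets in the table above.
PRESETS = {
    "balanced": (0.33, 0.33, 0.33),
    "rhymes": (0.0, 0.5, 0.5),
    "assonance": (0.0, 1.0, 0.0),
    "consonance": (0.5, 0.0, 0.5),
}


def syllable_similarity(onset, nucleus, coda, weights):
    """Weighted average of component similarities, normalized by total weight."""
    w_onset, w_nucleus, w_coda = weights
    total = w_onset + w_nucleus + w_coda
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return (w_onset * onset + w_nucleus * nucleus + w_coda * coda) / total


# Hypothetical component similarities for "cat" vs. "bat":
# different onsets (/k/ vs. /b/), identical nucleus and coda.
cat_bat = {"onset": 0.6, "nucleus": 1.0, "coda": 1.0}

for name, weights in PRESETS.items():
    print(name, round(syllable_similarity(weights=weights, **cat_bat), 3))
```

With the rhymes preset the onset mismatch is ignored entirely, so the pair scores 1.0. The balanced and consonance presets both penalize it.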
Interpreting Similarity Scores¶
Similarity scores range from 0.0 (completely different) to 1.0 (identical). Typical ranges:
- 0.90+: Near-identical sound patterns (e.g., "cat" and "bat" -- perfect rhymes)
- 0.75--0.85: Strong resemblance (e.g., "computer" and "commuter")
- 0.40--0.60: Moderate similarity (some shared sounds)
- 0.20--0.30: Low similarity (e.g., "cat" and "dog" -- unrelated sound patterns)
Cognitive Association Data¶
PhonoLex includes 68,000+ word-to-word relationships from 6 human-subjects research datasets. These capture how words are connected in human cognition -- through meaning, association, perception, and priming.
Free Association¶
| Dataset | Relationships | What It Measures | Source |
|---|---|---|---|
| University of South Florida (USF) | 72,172 | Forward and backward word association norms collected over 20 years from university students. | Nelson et al. (2004) |
Why it matters for SLPs: Free association data reveals how words are connected in the mental lexicon. High-association word pairs (e.g., "dog" and "cat") can be used for semantic therapy, word retrieval exercises, and understanding a client's associative network.
Semantic Relatedness and Similarity¶
| Dataset | Relationships | What It Measures | Source |
|---|---|---|---|
| MEN | 3,000 | Human judgments of semantic relatedness. Raters assessed how related two words are (regardless of the type of relationship). | Bruni et al. (2014) |
| SimLex-999 | 999 | Human judgments of semantic similarity specifically (not just relatedness). "Car" and "bicycle" are similar; "car" and "road" are related but not similar. | Hill et al. (2015) |
| WordSim-353 | 351 | Human judgments of semantic relatedness, a widely-used benchmark. | Finkelstein et al. (2001) |
Why it matters for SLPs: Semantic relatedness and similarity data help identify word pairs for semantic feature analysis, category sorting tasks, and understanding the structure of a client's semantic knowledge.
Perceptual Confusability¶
| Dataset | Relationships | What It Measures | Source |
|---|---|---|---|
| Edinburgh Closed-set Confusability Corpus (ECCC) | 2,489 | Which words do listeners confuse with each other when hearing speech in noise? Based on listening experiments where participants identified words in degraded conditions. | Marxer et al. (2016) |
Why it matters for SLPs: Confusability data is directly relevant to auditory processing and speech perception. If a client confuses certain words, ECCC data can help you identify other words likely to be confusable and design targeted listening exercises.
Semantic Priming¶
| Dataset | Relationships | What It Measures | Source |
|---|---|---|---|
| Semantic Priming Project (SPP) | 1,661 | How much does seeing one word speed up recognition of a related word? Measured via lexical decision reaction times at short and long stimulus onset asynchronies. | Hutchison et al. (2013) |
Why it matters for SLPs: Priming data shows how strongly two words activate each other during language processing. Strong priming pairs can support word retrieval therapy and semantic network activation.
Contrastive Intervention Research¶
PhonoLex implements three evidence-based contrastive approaches used in phonological intervention for speech sound disorders.
Minimal Pairs¶
Minimal pairs are two words that differ by exactly one phoneme in the same position (e.g., "cat" vs. "bat," "pat" vs. "pan"). PhonoLex has precomputed all minimal pairs in the vocabulary, allowing you to search for pairs that contrast any two phonemes.
You can filter by:
- Phoneme contrast (e.g., /k/ vs. /t/)
- Position (word-initial, word-medial, word-final)
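The definition and the two filters can be sketched as a brute-force pairwise search. The five-word lexicon and the function names are hypothetical. A full vocabulary would likely need an index rather than comparing every pair, and how PhonoLex precomputes its pairs is not described here.

```python
from itertools import combinations

# Toy lexicon of ARPAbet transcriptions (illustrative only).
LEXICON = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "pat": ("P", "AE", "T"),
    "pan": ("P", "AE", "N"),
    "cot": ("K", "AA", "T"),
}


def minimal_pair_contrast(a, b):
    """Return (phoneme_a, phoneme_b, position) if a and b differ in exactly one slot."""
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    if i == 0:
        position = "initial"
    elif i == len(a) - 1:
        position = "final"
    else:
        position = "medial"
    return (a[i], b[i], position)


def find_minimal_pairs(lexicon):
    """Compare every pair of words (quadratic; fine for a small illustration)."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        contrast = minimal_pair_contrast(p1, p2)
        if contrast is not None:
            pairs.append((w1, w2) + contrast)
    return pairs


# Filter by position, as in the list above.
final_pairs = [p for p in find_minimal_pairs(LEXICON) if p[4] == "final"]
print(final_pairs)
```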
Maximal Opposition (Gierut, 1989)¶
Maximal opposition therapy selects phoneme pairs that are maximally different in their articulatory features -- differing in major class, place, manner, and voicing. The theory is that targeting maximally different sound contrasts promotes broader generalization across the phonological system.
PhonoLex computes feature distance between phoneme pairs using learned articulatory features, then finds word pairs that contrast those maximally opposed phonemes. The algorithm prioritizes:
- Major class differences (e.g., an obstruent such as /s/ vs. a sonorant such as /m/)
- Maximum number of differing distinctive features
Reference: Gierut, J. A. (1989). Maximal opposition approach to phonological treatment. Journal of Speech and Hearing Disorders, 54(1), 9--19.
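The two priorities can be expressed as a sort key: first whether the pair crosses a major class boundary, then how many features differ. The binary feature table below is a hypothetical stand-in for the learned features, and treating [sonorant] as the only major class feature is a simplifying assumption of this sketch.

```python
from itertools import combinations

# Hypothetical binary features (+1 / -1) for a handful of consonants.
FEATURES = {
    "p": {"sonorant": -1, "continuant": -1, "voiced": -1, "labial": 1, "coronal": -1, "nasal": -1},
    "b": {"sonorant": -1, "continuant": -1, "voiced": 1, "labial": 1, "coronal": -1, "nasal": -1},
    "t": {"sonorant": -1, "continuant": -1, "voiced": -1, "labial": -1, "coronal": 1, "nasal": -1},
    "s": {"sonorant": -1, "continuant": 1, "voiced": -1, "labial": -1, "coronal": 1, "nasal": -1},
    "m": {"sonorant": 1, "continuant": -1, "voiced": 1, "labial": 1, "coronal": -1, "nasal": 1},
}
MAJOR_CLASS_FEATURES = ("sonorant",)  # assumption: obstruent vs. sonorant only


def opposition_key(a, b):
    """(crosses a major class boundary?, number of differing features)."""
    fa, fb = FEATURES[a], FEATURES[b]
    differing = [name for name in fa if fa[name] != fb[name]]
    crosses_major_class = any(name in MAJOR_CLASS_FEATURES for name in differing)
    return (crosses_major_class, len(differing))


def rank_oppositions(phonemes):
    """Phoneme pairs ordered from most to least opposed."""
    pairs = combinations(sorted(phonemes), 2)
    return sorted(pairs, key=lambda pair: opposition_key(*pair), reverse=True)


ranking = rank_oppositions(FEATURES)
print(ranking[0], opposition_key(*ranking[0]))
```

In this toy table /m/ vs. /s/ ranks first because the pair differs in every listed feature, including major class, whereas /p/ vs. /b/ differs only in voicing.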
Multiple Opposition (Storkel, 2022)¶
Multiple opposition therapy targets phoneme collapse -- when a child substitutes one sound for multiple different target sounds (e.g., using /t/ for /k/, /s/, /tʃ/). Treatment selects a set of words that all share the same rime but differ in the collapsed phoneme position, so the child must learn to differentiate multiple contrasts simultaneously.
PhonoLex identifies representative target phonemes for a given collapse pattern and finds word sets (triplets, quadruplets, or larger) where all words differ only in the target phoneme position.
Reference: Storkel, H. L. (2022). Minimal, Maximal, or Multiple: Which Contrastive Intervention Approach to Use With Children With Speech Sound Disorders? Language, Speech, and Hearing Services in Schools, 53(3), 632--645.
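One way to find such word sets is to group words by a "frame" in which the target position is replaced with a placeholder, then keep frames that contain the substituted sound plus at least two of its targets. The lexicon, the function name `opposition_sets`, and the choice of word-initial position are assumptions for illustration only.

```python
from collections import defaultdict

# Toy lexicon of ARPAbet transcriptions (illustrative only).
LEXICON = {
    "tea": ("T", "IY"), "key": ("K", "IY"), "sea": ("S", "IY"),
    "two": ("T", "UW"), "sue": ("S", "UW"), "chew": ("CH", "UW"),
    "toe": ("T", "OW"), "so": ("S", "OW"),
}


def opposition_sets(lexicon, substitute, targets, position=0):
    """Word sets that differ only at `position`: the substitute plus 2+ target sounds."""
    frames = defaultdict(dict)
    for word, phones in lexicon.items():
        if position >= len(phones):
            continue
        # Frame = the word with the contrasted slot blanked out.
        frame = phones[:position] + ("_",) + phones[position + 1:]
        frames[frame][phones[position]] = word
    sets = []
    for by_phoneme in frames.values():
        if substitute not in by_phoneme:
            continue
        contrasts = [by_phoneme[t] for t in targets if t in by_phoneme]
        if len(contrasts) >= 2:  # triplets or larger, counting the substitute word
            sets.append([by_phoneme[substitute]] + contrasts)
    return sets


# Collapse pattern from the example above: /t/ used for /k/, /s/, and /tʃ/ (CH).
print(opposition_sets(LEXICON, "T", ["K", "S", "CH"]))
```

The "toe"/"so" frame is dropped because it offers only one contrast, which would be an ordinary minimal pair rather than a multiple opposition set.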
Phonological Features¶
Phonological similarity in PhonoLex is grounded in learned Bayesian feature vectors (packages/features/), which replace PHOIBLE's binary features with continuous values learned from empirical evidence.
Each of the 39 English phonemes in PhonoLex (24 consonants and 15 vowels, including diphthongs) is represented using 38 distinctive features from the Hayes (2009) and Moisik & Esling (2011) feature systems. These features describe the articulatory properties of each sound:
- Manner features: Whether the sound involves full closure (stops), partial closure (fricatives), nasal airflow (nasals), etc.
- Place features: Where in the vocal tract the sound is produced (lips, teeth, alveolar ridge, palate, velum, glottis)
- Laryngeal features: Voicing, aspiration, and glottal state
- Vowel features: Height, backness, rounding, tenseness
Each feature has three possible values: present (+), absent (-), or not applicable (0). These ternary features are converted to continuous values (+1.0, -1.0, 0.0) for similarity computation. Each feature is encoded as a two-dimensional start/end pair to model articulation dynamics, yielding a 76-dimensional feature vector per phoneme. Diphthongs use 152 dimensions (76 for each vowel endpoint).
Why it matters for SLPs: The feature system is what makes PhonoLex's similarity algorithm phonologically grounded rather than arbitrary. When PhonoLex says two words sound similar, that judgment is based on the same articulatory features that SLPs use to describe speech sounds -- place, manner, and voicing. The maximal opposition tool also uses these features to find maximally contrastive phoneme pairs.
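The ternary-to-continuous encoding described above can be sketched as follows. The interleaved start/end layout, the convention that a steady-state phoneme has identical start and end values, and the function names are assumptions of this sketch. The learned vectors in PhonoLex hold continuous values rather than these fixed ones.

```python
NUM_FEATURES = 38
TERNARY = {"+": 1.0, "-": -1.0, "0": 0.0}


def encode_segment(start, end=None):
    """Encode 38 ternary symbols as a 76-dimensional vector of (start, end) pairs.

    `start` and `end` are strings over '+', '-', '0'. If `end` is omitted the
    segment is treated as steady-state (end equals start).
    """
    end = start if end is None else end
    if len(start) != NUM_FEATURES or len(end) != NUM_FEATURES:
        raise ValueError("expected exactly 38 feature symbols")
    vector = []
    for s, e in zip(start, end):
        vector.extend((TERNARY[s], TERNARY[e]))
    return vector


def encode_diphthong(first, second):
    """Concatenate the encodings of the two vowel endpoints (152 dimensions)."""
    return encode_segment(first) + encode_segment(second)


# Placeholder feature strings; real phonemes would use their actual specifications.
monophthong = encode_segment("+-0" + "0" * 35)
diphthong = encode_diphthong("+" * 38, "-" * 38)
print(len(monophthong), len(diphthong))
```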
Full Citations¶
Pronunciation Data¶
Carnegie Mellon University. (2014). The CMU Pronouncing Dictionary (125,764 words). http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Phonological Features¶
Hayes, B. (2009). Introductory Phonology. Wiley-Blackwell.
Moisik, S. R., & Esling, J. H. (2011). The 'whole larynx' approach to laryngeal features. Proceedings of the International Congress of Phonetic Sciences XVII, 1406--1409.
Phonological Complexity¶
Stoel-Gammon, C. (2010). The Word Complexity Measure: Description and application to developmental phonology and disorders. Clinical Linguistics & Phonetics, 24(4-5), 271--282. DOI: 10.3109/02699200903581059
Phonotactic Probability¶
Vitevitch, M. S., & Luce, P. A. (2004). A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3), 481--487. DOI: 10.3758/BF03195586
Word Frequency and Contextual Diversity¶
Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977--990. DOI: 10.3758/BRM.41.4.977
Word Prevalence¶
Brysbaert, M., Mandera, P., McCormick, S. F., & Keuleers, E. (2019). Word prevalence norms for 62,000 English lemmas. Behavior Research Methods, 51, 467--479. DOI: 10.3758/s13428-018-1077-9
Age of Acquisition¶
Scott, G. G., Keitel, A., Becirspahic, M., Yao, B., & Sereno, S. C. (2019). The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51, 1258--1270. DOI: 10.3758/s13428-018-1099-3
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978--990. DOI: 10.3758/s13428-012-0210-4
Lexical Decision¶
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... & Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39(3), 445--459. DOI: 10.3758/BF03193014
Concreteness¶
Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904--911. DOI: 10.3758/s13428-013-0403-5
Affective Norms (Valence, Arousal, Dominance)¶
Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4), 1191--1207. DOI: 10.3758/s13428-012-0314-x
Iconicity¶
Winter, B., Perlman, M., Perry, L. K., & Lupyan, G. (2023). Iconicity ratings for 14,000+ English words. Behavior Research Methods, 55, 1640--1655. DOI: 10.3758/s13428-022-01847-0
Body-Object Interaction¶
Pexman, P. M., Muraki, E., Sidhu, D. M., Siakaluk, P. D., & Yap, M. J. (2019). Quantifying sensorimotor experience: Body-object interaction ratings for more than 9,000 English words. Behavior Research Methods, 51(2), 453--466. DOI: 10.3758/s13428-018-1171-z
Socialness¶
Diveica, V., Pexman, P. M., & Binney, R. J. (2021). Quantifying social semantics: An inclusive definition of socialness and ratings for 8,388 English words. Behavior Research Methods, 55, 461--479. DOI: 10.3758/s13428-022-01810-z
Sensorimotor Norms¶
Lynott, D., Connell, L., Brysbaert, M., Brand, J., & Carney, J. (2020). The Lancaster Sensorimotor Norms: Multidimensional measures of perceptual and action strength for 40,000 English word lemmas. Behavior Research Methods, 52(3), 1271--1291. DOI: 10.3758/s13428-019-01316-z
Morphological Data¶
Sanchez-Gutierrez, C. H., Mailhot, H., Deacon, S. H., & Wilson, M. A. (2018). MorphoLex: A derivational morphological database for 70,000 English words. Behavior Research Methods, 50(4), 1568--1580. DOI: 10.3758/s13428-017-0981-8
Cognitive Association Datasets¶
De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., & Storms, G. (2019). The Small World of Words English word association norms for over 12,000 cue words. Behavior Research Methods, 51(3), 987--1006. DOI: 10.3758/s13428-018-1115-7
Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3), 402--407. DOI: 10.3758/BF03195588
Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49, 1--47. DOI: 10.1613/jair.4135
Marxer, R., Barker, J., Martin, N., & Coleman, J. (2016). Modelling speech intelligibility in adverse conditions: A corpus study. Edinburgh DataShare. https://datashare.ed.ac.uk/handle/10283/2791
Hutchison, K. A., Balota, D. A., Neely, J. H., Cortese, M. J., Cohen-Shikora, E. R., Tse, C.-S., et al. (2013). The Semantic Priming Project. Behavior Research Methods, 45(4), 1099--1114. DOI: 10.3758/s13428-012-0304-z
Hill, F., Reichart, R., & Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665--695. DOI: 10.1162/COLI_a_00237
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: The concept revisited. Proceedings of the 10th International Conference on World Wide Web, 406--414. DOI: 10.1145/371920.372094
Clinical Intervention Approaches¶
Gierut, J. A. (1989). Maximal opposition approach to phonological treatment. Journal of Speech and Hearing Disorders, 54(1), 9--19. DOI: 10.1044/jshd.5401.09
Gierut, J. A. (1990). Differential learning of phonological oppositions. Journal of Speech and Hearing Research, 33(3), 540--549. DOI: 10.1044/jshr.3303.540
Gierut, J. A., & Neumann, H. J. (1992). Teaching and learning /th/: A non-confound. Clinical Linguistics & Phonetics, 6(3), 191--200. DOI: 10.3109/02699209208985533
Storkel, H. L. (2022). Minimal, Maximal, or Multiple: Which Contrastive Intervention Approach to Use With Children With Speech Sound Disorders? Language, Speech, and Hearing Services in Schools, 53(3), 632--645. DOI: 10.1044/2022_LSHSS-21-00137
Phonological Similarity Method¶
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707--710.