Custom Word Lists¶

The Custom Word Lists tool is the most powerful feature in PhonoLex, allowing you to build targeted word lists using multiple criteria across phonological, lexical, semantic, and affective domains.

Overview¶

Build word lists by combining:
- Phoneme patterns (STARTS_WITH, ENDS_WITH, CONTAINS, CONTAINS_MEDIAL)
- Property filters (35 filterable properties across 9 categories)
- Phoneme exclusions (exclude words containing specific phonemes)
- AND logic (words must match ALL criteria)

Vocabulary size: ~47K canonical content-POS English words. The full ~125K-entry CMU-phonology lexicon is available for similarity and lookup.

Basic Usage¶

Add a pattern: Click "Add Pattern" and select a pattern type
Choose a phoneme: Use the IPA keyboard or type directly
Add filters (optional): Set ranges for frequency, imageability, etc.
Generate: Click "Generate List" to see results
Export: Download as CSV or copy individual words

Pattern Types¶

Pattern Matching Algorithm¶

Patterns use IPA transcriptions to match phoneme sequences:

STARTS_WITH /k/

Matches: cat /kæt/, king /kɪŋ/, crest /kɹɛst/
Does not match: back /bæk/, attack /ətæk/

ENDS_WITH /t/

Matches: cat /kæt/, fight /faɪt/, rest /ɹɛst/
Does not match: cats /kæts/ (ends with /s/)

CONTAINS /s/

Matches: sit /sɪt/, pass /pæs/, outside /aʊtsaɪd/
Matches any position: initial, medial, or final

CONTAINS_MEDIAL /s/

Matches: missile /mɪsəl/ (medial /s/)
Does not match: sit /sɪt/ (initial), pass /pæs/ (final)
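
The four pattern types can be sketched in a few lines. This is a minimal illustration, not the PhonoLex implementation: it assumes words are already tokenized into lists of phonemes (so a diphthong such as /aɪ/ is one token), and it compares token lists instead of running a regular expression over an IPA string. The function name `matches` is hypothetical. It also assumes that CONTAINS_MEDIAL accepts a word as long as at least one occurrence is medial.

```python
def matches(phonemes, pattern_type, target):
    """Return True if a tokenized word satisfies the pattern (illustrative sketch)."""
    n, m = len(phonemes), len(target)
    if m == 0 or m > n:
        return False
    # Every start index at which the target phoneme sequence occurs.
    hits = [i for i in range(n - m + 1) if phonemes[i:i + m] == target]
    if pattern_type == "STARTS_WITH":
        return 0 in hits
    if pattern_type == "ENDS_WITH":
        return (n - m) in hits
    if pattern_type == "CONTAINS":
        return bool(hits)
    if pattern_type == "CONTAINS_MEDIAL":
        # Medial: starts after the first phoneme and ends before the last.
        return any(i > 0 and i + m < n for i in hits)
    raise ValueError(f"unknown pattern type: {pattern_type}")


missile = ["m", "ɪ", "s", "ə", "l"]
print(matches(missile, "CONTAINS_MEDIAL", ["s"]))  # medial /s/
```

Because each phoneme is its own token, /s/ can never match /ʃ/, and /i/ can never match /ɪ/, which mirrors the exact-boundary behaviour described in this section.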

Technical Details¶

Implementation:
- Uses regular expression matching on IPA strings
- Case-sensitive IPA matching (e.g., /i/ ≠ /ɪ/)
- Matches exact phoneme boundaries (e.g., /s/ won't match /ʃ/)
- Diphthongs treated as single units (e.g., /aɪ/ is one phoneme)

Performance: ~10-50ms for pattern search across full vocabulary

Limitations:
- Cannot match phoneme features directly (use Lookup tool for feature-based search)
- Cannot use wildcards or phonological classes (e.g., cannot search "any fricative")

Property Filters¶

Complete Property Reference¶

Phonological Complexity (4 properties)¶

Property	Range	Source	Description	Coverage
Syllables	1-5	CMU Dictionary	Number of syllables	100%
Phonemes	1-10+	CMU Dictionary	Number of phonemes (IPA segments)	100%
WCM	0-15	Stoel-Gammon (2010)	Word Complexity Measure (8 parameters)	~95%
MSH	1-6	Motor Speech Hierarchy	Mean Syllable Height (motor complexity)	~95%

Syllables:
- Counted from syllabification algorithm
- Example: "cat" = 1, "window" = 2, "computer" = 3

Phonemes:
- Counted from IPA transcription
- Diphthongs count as 1 phoneme (e.g., /aɪ/ in "time")
- Example: "cat" /kæt/ = 3, "spray" /spɹeɪ/ = 4

WCM (Word Complexity Measure):

8 parameters from Stoel-Gammon (2010):
1. More than 2 syllables: +1
2. Non-initial stress: +1
3. Word-final consonant: +1
4. Consonant cluster: +1 per cluster
5. Velar (k, g, ŋ): +1 per occurrence
6. Liquid/rhotic (l, ɹ): +1 per occurrence
7. Fricative/affricate (f, v, θ, ð, s, z, ʃ, ʒ, h, tʃ, dʒ): +1 per occurrence
8. Voiced fricative/affricate: +1 additional

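Examples (scored with the 8 parameters above):
- "cat" /kæt/ = 2 (velar /k/, final consonant)
- "spray" /spɹeɪ/ = 3 (cluster, fricative /s/, liquid /ɹ/)
- "strength" /stɹɛŋkθ/ = 8 (final consonant, 2 clusters, velars /ŋ/ and /k/, liquid /ɹ/, fricatives /s/ and /θ/; high complexity)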
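
The eight WCM parameters listed above translate directly into a scoring function. The sketch below is an illustration under stated assumptions, not PhonoLex's code: the name `wcm`, the `CONSONANTS` inventory, and the `initial_stress` flag are hypothetical, the syllable count is passed in by the caller, and a cluster is taken to be any run of two or more adjacent consonants anywhere in the word (the tool may restrict clusters to a single syllable).

```python
VELARS = {"k", "g", "ŋ"}
LIQUIDS = {"l", "ɹ"}
FRICATIVES = {"f", "v", "θ", "ð", "s", "z", "ʃ", "ʒ", "h", "tʃ", "dʒ"}
VOICED_FRICATIVES = {"v", "ð", "z", "ʒ", "dʒ"}
# Assumed consonant inventory; any token not listed here is treated as a vowel.
CONSONANTS = VELARS | LIQUIDS | FRICATIVES | {"p", "b", "t", "d", "m", "n", "w", "j"}


def wcm(phonemes, syllables, initial_stress=True):
    """Score a tokenized word with the eight parameters listed above."""
    score = 0
    if syllables > 2:                      # 1. more than 2 syllables
        score += 1
    if not initial_stress:                 # 2. non-initial stress
        score += 1
    if phonemes and phonemes[-1] in CONSONANTS:
        score += 1                         # 3. word-final consonant
    run = 0
    for p in list(phonemes) + [None]:      # None flushes the last run
        if p in CONSONANTS:
            run += 1
        else:
            if run >= 2:                   # 4. one point per cluster
                score += 1
            run = 0
    for p in phonemes:
        score += p in VELARS               # 5. velars
        score += p in LIQUIDS              # 6. liquids/rhotics
        score += p in FRICATIVES           # 7. fricatives/affricates
        score += p in VOICED_FRICATIVES    # 8. extra point if voiced
    return score


print(wcm(["k", "æ", "t"], syllables=1))  # 2: velar /k/ + final consonant
```

Affricates are passed as single tokens (`"tʃ"`, `"dʒ"`), so they count as one consonant and do not form a cluster on their own.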

MSH (Mean Syllable Height):

Stages from Motor Speech Hierarchy (Namasivayam et al., 2021):
- Stage I-II: Vowels, /h/
- Stage III: Bilabials (p, b, m), nasals (n, ŋ)
- Stage IV: Stops/glides (t, d, k, g, w, j)
- Stage V: Fricatives (f, v, s, z, θ, ð, ʃ, ʒ)
- Stage VI: Liquids/affricates (l, ɹ, ʧ, ʤ)

Calculation: Average stage across all syllables

Examples:
- "cat" = 4.0 (stops only)
- "fish" = 5.0 (fricatives)
- "splash" = 6.0 (liquid in cluster)
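
The three examples are consistent with one simple reading of the calculation: take the highest stage reached in each syllable, then average across syllables. The sketch below implements that reading only. It is an assumption inferred from the examples, not a confirmed description of PhonoLex. The names `STAGE` and `msh` are hypothetical, vowels and /h/ are collapsed to a single value of 2 for "Stage I-II", and the input is assumed to be already syllabified.

```python
# Stage assignments copied from the list above; both affricate spellings are accepted.
STAGE = {}
for stage, phones in [
    (3, "p b m n ŋ"),
    (4, "t d k g w j"),
    (5, "f v s z θ ð ʃ ʒ"),
    (6, "l ɹ tʃ dʒ ʧ ʤ"),
]:
    for phone in phones.split():
        STAGE[phone] = stage

VOWEL_OR_H_STAGE = 2  # assumption: "Stage I-II" collapsed to one number


def msh(syllables):
    """Mean over syllables of the highest stage in each syllable (assumed reading)."""
    heights = [
        max(STAGE.get(phone, VOWEL_OR_H_STAGE) for phone in syllable)
        for syllable in syllables
    ]
    return sum(heights) / len(heights)


print(msh([["k", "æ", "t"]]))            # 4.0
print(msh([["f", "ɪ", "ʃ"]]))            # 5.0
print(msh([["s", "p", "l", "æ", "ʃ"]]))  # 6.0
```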

Phonotactic Probability (3 properties)¶

Property	Range	Source	Description	Coverage
Biphone Probability	0-1	Method: Vitevitch & Luce (2004); computed from CMU dict	Mean probability of phoneme sequences (higher = more typical)	~100%
Sum Log Probability	-10 to 0	Method: Vitevitch & Luce (2004); computed from CMU dict	Sum of log biphone probabilities (less negative = more typical)	~100%
Positional Probability	0-1	Method: Vitevitch & Luce (2004); computed from CMU dict	Mean probability of phonemes in their syllable positions	~100%

Biphone Probability:
- Measures how typical the sound sequences are in English
- Computed on full CMU Pronouncing Dictionary (117K words) for unbiased estimates
- Higher values = more phonotactically "legal" or common sequences

Interpretation:
- 0.00-0.02: Very low (unusual sequences like "strengths")
- 0.02-0.05: Low-moderate (e.g., "splash", "squid")
- 0.05-0.10: Moderate-high (e.g., "cat", "dog", "jump")
- 0.10+: Very high (very typical sequences like "mama", "see")

Sum Log Probability:
- Standard metric originating with Vitevitch & Luce (2004); PhonoLex computes the value directly from the CMU Pronouncing Dictionary
- Negative values (more negative = less typical sequences)
- Useful for replicating published research studies

Positional Probability:
- Measures individual phoneme frequencies in onset/nucleus/coda positions
- Independent of sequence probability (biphone)
- Higher values = phonemes that occur frequently in their positions

Clinical/Research use:
- High phonotactic probability correlates with faster word learning
- Children acquire high-probability patterns before low-probability patterns
- Useful for controlling word learning difficulty in intervention or research
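
To make the relationship between the mean biphone probability and the sum of log probabilities concrete, here is a deliberately simplified sketch. It is not the Vitevitch & Luce procedure and not PhonoLex's computation: it estimates each biphone's probability as its share of all biphone tokens in a four-word toy lexicon, ignores position and frequency weighting, and assumes base-10 logarithms. The names `train`, `score`, and `LEXICON` are hypothetical.

```python
import math
from collections import Counter


def biphones(phonemes):
    """Adjacent phoneme pairs, e.g. k-æ-t -> (k, æ), (æ, t)."""
    return list(zip(phonemes, phonemes[1:]))


def train(lexicon):
    """Estimate each biphone's probability as its share of all biphone tokens."""
    counts = Counter(bp for word in lexicon for bp in biphones(word))
    total = sum(counts.values())
    return {bp: n / total for bp, n in counts.items()}


def score(phonemes, probs):
    """Return (mean biphone probability, sum of log10 probabilities), or None."""
    ps = [probs.get(bp, 0.0) for bp in biphones(phonemes)]
    if not ps or min(ps) == 0.0:
        return None  # no biphones, or a biphone unseen in the lexicon
    return sum(ps) / len(ps), sum(math.log10(p) for p in ps)


# Toy lexicon for illustration only: cat, cap, bat, sit.
LEXICON = [["k", "æ", "t"], ["k", "æ", "p"], ["b", "æ", "t"], ["s", "ɪ", "t"]]
probs = train(LEXICON)
mean_prob, sum_log = score(["k", "æ", "t"], probs)
print(mean_prob)  # 0.25: both biphones account for 2 of the 8 biphone tokens
```

The mean stays between 0 and 1 regardless of word length, while the sum of logs becomes more negative with every additional biphone, which is why the two measures are reported separately.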

Lexical Properties (2 properties)¶

Property	Range	Source	Description	Coverage
Frequency	0-1000+	SUBTLEX-US (Brysbaert & New, 2009)	Occurrences per million words in film subtitles	~99%
Age of Acquisition (AoA)	1-7 (age-banded: 1≈0-2y, 7≈13y+)	PhonoLex in-house gpt-4.1-mini cloze	Age band at which a word is typically learned (1=earliest, 7=latest). Validated Spearman 0.868 vs Glasgow Norms	~100% canonical

Frequency:
- Based on 51 million words from film and television subtitles
- More representative of spoken language than written corpora
- Log-transformed for UI (actual values are log10 per million)

Interpretation:
- 0-5: Very rare words
- 5-20: Uncommon words
- 20-100: Common words
- 100+: Very high frequency words

Age of Acquisition:
- Age band at which a word is typically learned, estimated with the PhonoLex in-house gpt-4.1-mini cloze method (validated against the Glasgow Norms, Spearman 0.868)
- Scale: 1 (earliest, roughly 0-2 years) to 7 (latest, roughly 13 years and older)
- Correlates with processing speed and naming accuracy

Interpretation:
- 1-2: Early childhood words (mommy, cat, eat)
- 3-4: Elementary school words (book, teacher, happy)
- 5-6: Middle/high school words (concept, analyze, determine)
- 7: Late acquisition words (arcane, ephemeral, juxtapose)

Semantic Properties (3 properties)¶

Property	Range	Source	Description	Coverage
Imageability	1-7	PhonoLex (Glasgow-scale anchor)	Ease of mental imagery (1=hard to imagine, 7=easy)	~100% canonical
Familiarity	1-7	PhonoLex (Glasgow-scale anchor)	Word familiarity (1=unfamiliar, 7=very familiar)	~100% canonical
Concreteness	1-5	PhonoLex (Brysbaert-scale anchor)	Concrete vs. abstract (1=abstract, 5=concrete)	~60%

Imageability:
- Measures how easily a word evokes a mental image
- High imageability: cat, tree, house (tangible objects)
- Low imageability: truth, concept, democracy (abstract ideas)

Familiarity:
- Self-reported familiarity ratings from adults
- Distinct from frequency (can be familiar but rarely used)
- Example: "elephant" = high familiarity, moderate frequency

Concreteness:
- Measures how concrete (physical) vs. abstract a concept is
- Anchored to the Brysbaert et al. (2014) ratings of about 40,000 English words
- High concreteness: table, water, run
- Low concreteness: truth, love, think

Affective Properties (2 properties)¶

Property	Range	Source	Description	Coverage
Valence	1-9	PhonoLex (Warriner-scale anchor)	Emotional valence (1=very negative, 9=very positive)	~50%
Arousal	1-9	PhonoLex (Warriner-scale anchor)	Emotional arousal (1=calm, 9=excited/intense)	~50%

Valence:
- Emotional positivity/negativity
- Negative (1-3): war, death, hate, fear
- Neutral (4-6): table, walk, window
- Positive (7-9): love, happy, success, joy

Arousal:
- Emotional intensity/activation
- Low arousal (1-3): calm, sleep, relax, quiet
- Medium arousal (4-6): walk, think, read
- High arousal (7-9): excited, angry, panic, thrill

Filter Logic¶

AND Logic: All filters must be satisfied

Example query:

Pattern: STARTS_WITH /s/
Filter: Frequency ≥ 20
Filter: Syllables = 1
Filter: Concreteness ≥ 4.0

Result: Words that start with /s/ AND are high-frequency
 AND are monosyllabic AND are concrete

Matches: sun, sea, sock, snow
Does not match: sad (concreteness too low),
 seven (two syllables),
 sieve (frequency too low)

Missing Data: Words without a property are excluded when filtering by that property

Example:

Filter: Imageability ≥ 5.0

Only words with imageability data are considered.
Words without imageability ratings are excluded from results.
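
The AND logic and the missing-data rule can be sketched together. This is an illustration, not the tool's code: the `Word` record, the `in_range` helper, and every property value below are hypothetical placeholders, not PhonoLex data. The key point is that a missing value (`None`) fails any filter on that property, so the word is dropped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    phonemes: list
    syllables: int
    frequency: Optional[float] = None
    concreteness: Optional[float] = None


def in_range(value, lo=None, hi=None):
    """A missing value never satisfies a filter; bounds are inclusive."""
    if value is None:
        return False
    if lo is not None and value < lo:
        return False
    if hi is not None and value > hi:
        return False
    return True


def run_query(words):
    """STARTS_WITH /s/ AND frequency >= 20 AND syllables = 1 AND concreteness >= 4.0."""
    return [
        w.text
        for w in words
        if w.phonemes[0] == "s"
        and in_range(w.frequency, lo=20)
        and in_range(w.syllables, lo=1, hi=1)
        and in_range(w.concreteness, lo=4.0)
    ]


# Placeholder values chosen only to exercise each branch.
words = [
    Word("sun", ["s", "ʌ", "n"], 1, frequency=100.0, concreteness=4.8),
    Word("sad", ["s", "æ", "d"], 1, frequency=90.0, concreteness=2.0),
    Word("seven", ["s", "ɛ", "v", "ə", "n"], 2, frequency=80.0, concreteness=4.1),
    Word("soot", ["s", "ʊ", "t"], 1, frequency=25.0, concreteness=None),
]
print(run_query(words))  # ['sun']
```

Here "soot" passes every other filter but is excluded because its concreteness is missing, which is the behaviour described under Missing Data.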

Phoneme Exclusions¶

Purpose: Exclude words containing specific phonemes

Use cases:
- Avoiding error sounds (e.g., exclude /s/ when child substitutes s→θ)
- Creating phoneme-specific lists (e.g., /k/ words without /g/)
- Controlling phonological context

Example:

Pattern: STARTS_WITH /k/
Exclude: /g/
Exclude: /s/

Result: /k/ words without /g/ or /s/
Matches: cat, car, cut, candy
Does not match: cats (contains /s/), keg (contains /g/)

Technical details:
- Exclusions apply to entire IPA transcription
- Multiple exclusions create additional AND conditions
- Case-sensitive (e.g., excluding /i/ won't exclude /ɪ/)
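
Exclusions amount to a set-intersection test over the whole transcription. The sketch below assumes tokenized phoneme lists and uses the hypothetical helper name `passes_exclusions`; the tiny lexicon is illustrative only.

```python
def passes_exclusions(phonemes, excluded):
    """True when the word contains none of the excluded phonemes."""
    return not (set(phonemes) & set(excluded))


lexicon = {
    "cat": ["k", "æ", "t"],
    "cats": ["k", "æ", "t", "s"],
    "keg": ["k", "ɛ", "g"],
    "candy": ["k", "æ", "n", "d", "i"],
}

# STARTS_WITH /k/, excluding /g/ and /s/ anywhere in the word.
result = [
    word
    for word, phonemes in lexicon.items()
    if phonemes[0] == "k" and passes_exclusions(phonemes, ["g", "s"])
]
print(result)  # ['cat', 'candy']
```

Since the comparison is on exact tokens, excluding /i/ leaves words containing /ɪ/ untouched.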

Export Formats¶

CSV Export¶

Format:

word,ipa,syllables,phonemes,wcm,msh,phono_prob_avg,phono_prob_sum_log,positional_prob_avg,frequency,aoa,imageability,familiarity,concreteness,valence,arousal,dominance
cat,kæt,1,3,2,4.0,0.062,-2.46,0.081,182.5,2.1,6.8,6.9,4.93,7.2,3.8,5.2
dog,dɔg,1,3,2,4.0,0.054,-2.89,0.073,245.3,1.8,6.9,7.0,5.0,7.5,4.2,5.5

Details:
- Header row with property names
- One word per row
- Empty cells for missing properties
- IPA in standard Unicode characters
- Numbers use decimal notation

File size: roughly 100 bytes per word (about 1 KB per 10 words), judging by the sample rows above
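
A file in this shape can be produced with Python's standard `csv` module. The sketch below is not the tool's exporter: `to_csv` and `FIELDS` are hypothetical names, and only a subset of the header columns is used to keep the example short. Missing properties (`None`) are written as empty cells.

```python
import csv
import io

# Subset of the export columns, for brevity.
FIELDS = ["word", "ipa", "syllables", "phonemes", "wcm", "msh", "frequency", "concreteness"]


def to_csv(rows):
    """Serialize word records to CSV text, leaving missing properties empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Replace None with an empty string so the cell is written as blank.
        writer.writerow({key: ("" if value is None else value) for key, value in row.items()})
    return buffer.getvalue()


rows = [
    {"word": "cat", "ipa": "kæt", "syllables": 1, "phonemes": 3,
     "wcm": 2, "msh": 4.0, "frequency": 182.5, "concreteness": None},
]
print(to_csv(rows))
```

When reading such a file back, an empty cell should be treated as "no data" and not as zero, so that filters exclude the word instead of treating it as a low score.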

Copy Individual Words¶

Format: Plain text, one word per line

cat
dog
house

Use case: Quick copying for clinical materials

Advanced Query Examples¶

See Practical Examples for detailed walkthroughs, including:

Example 1: Simple CVC words for early intervention
Example 2: Initial /s/ words with semantic scaffolding
Example 3: Late-developing sounds in simple contexts
Example 4: Negative valence words for emotional language
Example 5: Excluding problematic phonemes

Performance Characteristics¶

Operation	Time	Notes
Pattern matching	10-50 ms	Full vocabulary regex scan
Property filtering	5-20 ms	In-memory array filter
Combined query	15-70 ms	Pattern + multiple filters
Export to CSV	~50 ms	Serialization + download
Copy to clipboard	~10 ms	Direct text copy

Factors affecting speed:
- Number of filters (more filters = slightly slower)
- Result set size (larger results = slower export)
- Browser performance (Chrome/Edge slightly faster)

Data Coverage & Limitations¶

Coverage by Property¶

Category	Properties	Average Coverage
Phonological	Syllables, Phonemes, WCM, MSH	98%
Phonotactic	Biphone Prob, Sum Log Prob, Positional Prob	100%
Lexical	Frequency, AoA	87%
Semantic	Imageability, Familiarity, Concreteness	47%
Affective	Valence, Arousal	~100% canonical

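Important: Filtering by a property with incomplete coverage reduces the result set, sometimes significantly, because words with no value for that property are excluded. Check the Coverage column in the property tables above before adding such a filter.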

Vocabulary Limitations¶

Included:
- Primary pronunciations only (no variants)
- General American English dialect
- Common words with established psycholinguistic norms

Excluded:
- Pronunciation variants (CMU entries with (1), (2), etc.)
- Proper nouns
- Words without any psycholinguistic properties
- Non-English loanwords without standard pronunciations

Total vocabulary: ~125K CMU-phonology entries (~47K canonical content-POS subset carries the full norm set)

Technical Limitations¶

Cannot filter by:
- Phoneme features directly (e.g., "all fricatives"); use pattern matching or the Lookup tool instead
- Orthographic properties (spelling patterns)
- Grammatical category (noun, verb, etc.)
- Morphological complexity (prefixes, suffixes)

Pattern matching:
- Exact phoneme matching only (no fuzzy matching)
- Cannot match phonological classes (e.g., "any stop")
- Cannot use regular expressions directly

Tips & Best Practices¶

Getting Started¶

Start simple: Begin with one pattern, then add filters incrementally
Check the count: Preview shows how many words match before generating
Iterate: Adjust filters to get desired word count (aim for 20-50 words for clinical use)

Optimization¶

Use frequency filters: Ensures functional, commonly-used words
Combine complexity measures: Use WCM + MSH for precise developmental targeting
Consider coverage: Properties with lower coverage (see the Coverage column in the property tables) will reduce results more

Clinical Applications¶

Early intervention: Low WCM + high frequency + high imageability
Phoneme-specific practice: Pattern matching + exclusions + frequency
Semantic therapy: High imageability + concreteness + valence/arousal filters
Literacy support: Monosyllabic + high frequency + moderate AoA

Research Applications¶

Stimulus control: Match WCM, frequency, AoA across conditions
Semantic variables: Control concreteness, imageability, familiarity
Affective content: Select words by valence, arousal, dominance
Phonological complexity: Systematic manipulation of WCM, MSH, syllable/phoneme counts

References¶

Phonological Complexity:
- Stoel-Gammon, C. (2010). The Word Complexity Measure: Description and application to developmental phonology and disorders. Clinical Linguistics & Phonetics, 24(4-5), 271-282.
- Namasivayam, A. K., et al. (2021). Milestones of speech production in children. Journal of Speech, Language, and Hearing Research.

Phonotactic Probability:
- Vitevitch, M. S., & Luce, P. A. (2004). A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3), 481-487. (Method origin; PhonoLex computes the values directly from the CMU Pronouncing Dictionary.)

Psycholinguistic Norms:
- Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis. Behavior Research Methods, 41(4), 977-990.
- Brysbaert, M., et al. (2014). Concreteness ratings for 40 thousand English words. Behavior Research Methods, 46, 904-911.
- Scott, G. G., et al. (2019). The Glasgow Norms: Ratings of 5,500 words. Behavior Research Methods, 51, 1258-1270.
- Warriner, A. B., et al. (2013). Norms of valence, arousal, and dominance. Behavior Research Methods, 45, 1191-1207.