Custom Word Lists¶

The Custom Word Lists tool is the most powerful feature in PhonoLex, allowing you to build targeted word lists using multiple criteria across phonological, lexical, semantic, and affective domains.

Overview¶

Build word lists by combining:

- Phoneme patterns (STARTS_WITH, ENDS_WITH, CONTAINS, CONTAINS_MEDIAL)
- Property filters (35 filterable properties across 9 categories)
- Phoneme exclusions (exclude words containing specific phonemes)
- AND logic (words must match ALL criteria)

Vocabulary size: 44,011 English words from the CMU Pronouncing Dictionary

Basic Usage¶

Add a pattern: Click "Add Pattern" and select a pattern type
Choose a phoneme: Use the IPA keyboard or type directly
Add filters (optional): Set ranges for frequency, imageability, etc.
Generate: Click "Generate List" to see results
Export: Download as CSV or copy individual words

Pattern Types¶

Pattern Matching Algorithm¶

Patterns use IPA transcriptions to match phoneme sequences:

STARTS_WITH /k/

Matches: cat /kæt/, king /kɪŋ/, crest /kɹɛst/
Does not match: back /bæk/, attack /ətæk/

ENDS_WITH /t/

Matches: cat /kæt/, fight /faɪt/, rest /ɹɛst/
Does not match: cats /kæts/ (ends with /s/)

CONTAINS /s/

Matches: sit /sɪt/, pass /pæs/, outside /aʊtsaɪd/
Matches any position: initial, medial, or final

CONTAINS_MEDIAL /s/

Matches: missile /mɪsəl/ (medial /s/)
Does not match: sit /sɪt/ (initial), pass /pæs/ (final)

Technical Details¶

Implementation:

- Uses regular expression matching on IPA strings
- Exact IPA symbol matching: similar-looking symbols are distinct phonemes (e.g., /i/ ≠ /ɪ/)
- Matches exact phoneme boundaries (e.g., /s/ won't match /ʃ/)
- Diphthongs treated as single units (e.g., /aɪ/ is one phoneme)

Performance: ~10-50 ms for a pattern search across the full vocabulary

Limitations:

- Cannot match phoneme features directly (use the Lookup tool for feature-based search)
- Cannot use wildcards or phonological classes (e.g., cannot search "any fricative")
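The four pattern types can be illustrated with a minimal Python sketch. It is not the PhonoLex implementation, which is described above as regex-based: this version tokenizes the IPA string first and compares whole phonemes. The `MULTI_CHAR` inventory and the `tokenize` and `matches` names are assumptions made for illustration.

```python
# Hypothetical inventory of phonemes written with two IPA characters.
MULTI_CHAR = ("aɪ", "aʊ", "ɔɪ", "eɪ", "oʊ", "tʃ", "dʒ")


def tokenize(ipa: str) -> list[str]:
    """Split an IPA string into phonemes, keeping diphthongs and affricates whole."""
    tokens = []
    i = 0
    while i < len(ipa):
        # Prefer a two-character unit at this position, else take one character.
        unit = next((m for m in MULTI_CHAR if ipa.startswith(m, i)), ipa[i])
        tokens.append(unit)
        i += len(unit)
    return tokens


def matches(ipa: str, pattern: str, phoneme: str) -> bool:
    """Check one pattern type against a word's IPA transcription."""
    tokens = tokenize(ipa)
    if pattern == "STARTS_WITH":
        return tokens[:1] == [phoneme]
    if pattern == "ENDS_WITH":
        return tokens[-1:] == [phoneme]
    if pattern == "CONTAINS":
        return phoneme in tokens
    if pattern == "CONTAINS_MEDIAL":
        return phoneme in tokens[1:-1]  # neither first nor last position
    raise ValueError(f"unknown pattern type: {pattern}")


print(matches("kæt", "STARTS_WITH", "k"))        # True
print(matches("kæts", "ENDS_WITH", "t"))         # False
print(matches("mɪsəl", "CONTAINS_MEDIAL", "s"))  # True
```

Comparing tokens rather than raw characters is what keeps /t/ from matching inside /tʃ/ and treats /aɪ/ as one unit.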

Property Filters¶

Complete Property Reference¶

Phonological Complexity (4 properties)¶

Property	Range	Source	Description	Coverage
Syllables	1-5	CMU Dictionary	Number of syllables	100%
Phonemes	1-10+	CMU Dictionary	Number of phonemes (IPA segments)	100%
WCM	0-15	Stoel-Gammon (2010)	Word Complexity Measure (8 parameters)	~95%
MSH	1-6	Motor Speech Hierarchy	Mean Syllable Height (motor complexity)	~95%

Syllables:

- Counted by the syllabification algorithm
- Example: "cat" = 1, "window" = 2, "computer" = 3

Phonemes:

- Counted from IPA transcription
- Diphthongs count as 1 phoneme (e.g., /aɪ/ in "time")
- Example: "cat" /kæt/ = 3, "spray" /spɹeɪ/ = 4

WCM (Word Complexity Measure):

8 parameters from Stoel-Gammon (2010):

1. More than 2 syllables: +1
2. Non-initial stress: +1
3. Word-final consonant: +1
4. Consonant cluster: +1 per cluster
5. Velar (k, g, ŋ): +1 per occurrence
6. Liquid/rhotic (l, ɹ): +1 per occurrence
7. Fricative/affricate (f, v, θ, ð, s, z, ʃ, ʒ, h, tʃ, dʒ): +1 per occurrence
8. Voiced fricative/affricate: +1 additional

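Examples (scored with the eight parameters as listed):

- "cat" /kæt/ = 2 (velar /k/, final consonant)
- "spray" /spɹeɪ/ = 3 (cluster, fricative /s/, liquid /ɹ/)
- "strength" /stɹɛŋkθ/ = 8 (final consonant, two clusters, velars /ŋ/ and /k/, liquid /ɹ/, fricatives /s/ and /θ/)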
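The eight parameters can be turned into a small scoring function. The following is a sketch under stated assumptions, not the tool's own code: words arrive already syllabified as lists of phonemes, a cluster is counted as any run of two or more consonants inside one syllable, and the vowel inventory and function names are hypothetical.

```python
# Assumed vowel inventory (monophthongs plus diphthong tokens).
VOWELS = set("iɪeɛæaɑɔoʊuʌəɝɚ") | {"eɪ", "aɪ", "aʊ", "ɔɪ", "oʊ"}
VELARS = {"k", "g", "ŋ"}
LIQUIDS = {"l", "ɹ"}
FRICATIVES = {"f", "v", "θ", "ð", "s", "z", "ʃ", "ʒ", "h", "tʃ", "dʒ"}
VOICED_FRICATIVES = {"v", "ð", "z", "ʒ", "dʒ"}


def count_clusters(syllable: list[str]) -> int:
    """Count runs of two or more adjacent consonants within one syllable."""
    clusters, run = 0, 0
    for phoneme in syllable + [""]:  # empty sentinel closes a final run
        if phoneme and phoneme not in VOWELS:
            run += 1
        else:
            if run >= 2:
                clusters += 1
            run = 0
    return clusters


def wcm(syllables: list[list[str]], stressed: int = 0) -> int:
    """Score a syllabified word; `stressed` is the index of the stressed syllable."""
    phonemes = [p for syllable in syllables for p in syllable]
    score = 0
    if len(syllables) > 2:          # 1. more than 2 syllables
        score += 1
    if stressed != 0:               # 2. non-initial stress
        score += 1
    if phonemes[-1] not in VOWELS:  # 3. word-final consonant
        score += 1
    score += sum(count_clusters(s) for s in syllables)      # 4. clusters
    score += sum(p in VELARS for p in phonemes)             # 5. velars
    score += sum(p in LIQUIDS for p in phonemes)            # 6. liquids
    score += sum(p in FRICATIVES for p in phonemes)         # 7. fricatives/affricates
    score += sum(p in VOICED_FRICATIVES for p in phonemes)  # 8. voiced bonus
    return score


print(wcm([["k", "æ", "t"]]))  # 2
```

Each parameter is added independently, so a single phoneme can contribute more than once (for example, /z/ scores under both 7 and 8).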

MSH (Mean Syllable Height):

Stages from Motor Speech Hierarchy (Namasivayam et al., 2021):

- Stage I-II: Vowels, /h/
- Stage III: Bilabials (p, b, m), nasals (n, ŋ)
- Stage IV: Stops/glides (t, d, k, g, w, j)
- Stage V: Fricatives (f, v, s, z, θ, ð, ʃ, ʒ)
- Stage VI: Liquids/affricates (l, ɹ, tʃ, dʒ)

Calculation: Each syllable is scored by its highest-stage phoneme, and MSH is the mean of these syllable scores

Examples:

- "cat" = 4.0 (stops only)
- "fish" = 5.0 (fricatives)
- "splash" = 6.0 (liquid in cluster)
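A short sketch of the MSH calculation follows. How a single syllable is scored is an inference: the three examples are all consistent with taking the highest stage found in each syllable, so the code assumes that. Scoring vowels and /h/ (Stage I-II) as 2, and the names `STAGE`, `VOWEL_STAGE`, and `msh`, are likewise assumptions.

```python
# Stage assignments taken from the stage list; symbols are space-separated.
STAGE = {}
for stage, symbols in {
    3: "p b m n ŋ",
    4: "t d k g w j",
    5: "f v s z θ ð ʃ ʒ",
    6: "l ɹ tʃ dʒ",
}.items():
    for symbol in symbols.split():
        STAGE[symbol] = stage

VOWEL_STAGE = 2  # assumption: vowels and /h/ (Stage I-II) are scored as 2


def msh(syllables: list[list[str]]) -> float:
    """Mean over syllables of each syllable's highest-stage phoneme."""
    heights = [
        max(STAGE.get(phoneme, VOWEL_STAGE) for phoneme in syllable)
        for syllable in syllables
    ]
    return sum(heights) / len(heights)


print(msh([["k", "æ", "t"]]))            # 4.0
print(msh([["f", "ɪ", "ʃ"]]))            # 5.0
print(msh([["s", "p", "l", "æ", "ʃ"]]))  # 6.0
```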

Phonotactic Probability (3 properties)¶

Property	Range	Source	Description	Coverage
Biphone Probability	0-1	Vitevitch & Luce (2004)	Mean probability of phoneme sequences (higher = more typical)	~100%
Sum Log Probability	-10 to 0	Vitevitch & Luce (2004)	Sum of log biphone probabilities (less negative = more typical)	~100%
Positional Probability	0-1	Vitevitch & Luce (2004)	Mean probability of phonemes in their syllable positions	~100%

Biphone Probability:

- Measures how typical the sound sequences are in English
- Computed on full CMU Pronouncing Dictionary (117K words) for unbiased estimates
- Higher values = more phonotactically "legal" or common sequences

Interpretation:

- 0.00-0.02: Very low (unusual sequences like "strengths")
- 0.02-0.05: Low-moderate (e.g., "splash", "squid")
- 0.05-0.10: Moderate-high (e.g., "cat", "dog", "jump")
- 0.10+: Very high (very typical sequences like "mama", "see")

Sum Log Probability:

- Standard metric from Vitevitch & Luce (2004)
- Negative values (more negative = less typical sequences)
- Useful for replicating published research studies

Positional Probability:

- Measures individual phoneme frequencies in onset/nucleus/coda positions
- Independent of sequence probability (biphone)
- Higher values = phonemes that occur frequently in their positions
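To make the two biphone metrics concrete, here is a deliberately simplified sketch. It estimates each biphone's probability as its share of all biphones in a toy three-word lexicon, then reports the mean and the summed log for one word. The published Vitevitch & Luce method is position-specific and frequency-weighted, which this sketch omits; the base-10 logarithm and all function names are assumptions.

```python
import math
from collections import Counter


def biphones(phonemes: list[str]) -> list[tuple[str, str]]:
    """Adjacent phoneme pairs, e.g. k-æ and æ-t for /kæt/."""
    return list(zip(phonemes, phonemes[1:]))


def build_table(lexicon: list[list[str]]) -> dict:
    """Relative frequency of every biphone observed in the lexicon."""
    counts = Counter(bp for word in lexicon for bp in biphones(word))
    total = sum(counts.values())
    return {bp: n / total for bp, n in counts.items()}


def biphone_scores(phonemes: list[str], table: dict) -> tuple[float, float]:
    """Return (mean biphone probability, sum of log10 probabilities)."""
    pairs = biphones(phonemes)
    if not pairs:
        raise ValueError("need at least two phonemes")
    probs = [table[bp] for bp in pairs]  # KeyError for unseen biphones
    return sum(probs) / len(probs), sum(math.log10(p) for p in probs)


# Toy lexicon: cat, bat, cab
table = build_table([["k", "æ", "t"], ["b", "æ", "t"], ["k", "æ", "b"]])
mean, sum_log = biphone_scores(["k", "æ", "t"], table)
print(round(mean, 3), round(sum_log, 3))  # 0.333 -0.954
```

The mean stays between 0 and 1 while the summed log grows more negative with every added biphone, which is why longer words tend to have lower sum-log values.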

Clinical/Research use:

- High phonotactic probability correlates with faster word learning
- Children acquire high-probability patterns before low-probability patterns
- Useful for controlling word learning difficulty in intervention or research

Lexical Properties (2 properties)¶

Property	Range	Source	Description	Coverage
Frequency	0-1000+	SUBTLEX-US (Brysbaert & New, 2009)	Occurrences per million words in film subtitles	~99%
Age of Acquisition (AoA)	1-7	Glasgow Norms (Scott et al., 2019)	Age when word is typically learned (1=earliest, 7=latest)	~75%

Frequency:

- Based on 51 million words from film and television subtitles
- More representative of spoken language than written corpora
- Log-transformed for UI (actual values are log10 per million)

Interpretation:

- 0-5: Very rare words
- 5-20: Uncommon words
- 20-100: Common words
- 100+: Very high frequency words

Age of Acquisition:

- Based on adult ratings of when they learned each word
- Scale: 1 (very early, <3 years) to 7 (late, adult years)
- Correlates with processing speed and naming accuracy

Interpretation:

- 1-2: Early childhood words (mommy, cat, eat)
- 3-4: Elementary school words (book, teacher, happy)
- 5-6: Middle/high school words (concept, analyze, determine)
- 7: Late acquisition words (arcane, ephemeral, juxtapose)

Semantic Properties (3 properties)¶

Property	Range	Source	Description	Coverage
Imageability	1-7	Glasgow Norms	Ease of mental imagery (1=hard to imagine, 7=easy)	~40%
Familiarity	1-7	Glasgow Norms	Word familiarity (1=unfamiliar, 7=very familiar)	~40%
Concreteness	1-5	Brysbaert et al. (2014)	Concrete vs. abstract (1=abstract, 5=concrete)	~60%

Imageability:

- Measures how easily a word evokes a mental image
- High imageability: cat, tree, house (tangible objects)
- Low imageability: truth, concept, democracy (abstract ideas)

Familiarity:

- Self-reported familiarity ratings from adults
- Distinct from frequency (a word can be familiar but rarely used)
- Example: "elephant" = high familiarity, moderate frequency

Concreteness:

- Measures how concrete (physical) vs. abstract a concept is
- Based on ratings of roughly 40,000 English words
- High concreteness: table, water, run
- Low concreteness: truth, love, think

Affective Properties (3 properties)¶

Property	Range	Source	Description	Coverage
Valence	1-9	Warriner et al. (2013)	Emotional valence (1=very negative, 9=very positive)	~50%
Arousal	1-9	Warriner et al. (2013)	Emotional arousal (1=calm, 9=excited/intense)	~50%
Dominance	1-9	Warriner et al. (2013)	Sense of control (1=weak/submissive, 9=powerful/in-control)	~50%

Valence:

- Emotional positivity/negativity
- Negative (1-3): war, death, hate, fear
- Neutral (4-6): table, walk, window
- Positive (7-9): love, happy, success, joy

Arousal:

- Emotional intensity/activation
- Low arousal (1-3): calm, sleep, relax, quiet
- Medium arousal (4-6): walk, think, read
- High arousal (7-9): excited, angry, panic, thrill

Dominance:

- Sense of power or control
- Low dominance (1-3): helpless, weak, afraid, victim
- Medium dominance (4-6): walk, see, think
- High dominance (7-9): powerful, boss, control, leader

Filter Logic¶

AND Logic: All filters must be satisfied

Example query:

Pattern: STARTS_WITH /s/
Filter: Frequency ≥ 20
Filter: Syllables = 1
Filter: Concreteness ≥ 4.0

Result: Words that start with /s/ AND are high-frequency
        AND are monosyllabic AND are concrete

Matches: sun, sea, sock, snow
Does not match: sad (concreteness too low),
                 seven (two syllables),
                 silt (frequency too low)

Missing Data: Words without a property are excluded when filtering by that property

Example:

Filter: Imageability ≥ 5.0

Only words with imageability data are considered.
Words without imageability ratings are excluded from results.
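The AND logic and the missing-data rule can be sketched together in a few lines of Python. The record layout, the `passes` function, and the `(low, high)` range format are assumptions for illustration, and the four records use placeholder names with invented values rather than real norms.

```python
def passes(word: dict, filters: dict) -> bool:
    """True only if the word satisfies every filter (AND logic)."""
    for prop, (low, high) in filters.items():
        value = word.get(prop)
        if value is None:  # missing data: excluded when filtering on this property
            return False
        if not low <= value <= high:
            return False
    return True


# Placeholder records with invented values (not real norms).
words = [
    {"word": "word_a", "frequency": 85.0, "syllables": 1, "concreteness": 4.8},
    {"word": "word_b", "frequency": 60.0, "syllables": 2, "concreteness": 4.5},
    {"word": "word_c", "frequency": 3.0, "syllables": 1, "concreteness": 4.9},
    {"word": "word_d", "frequency": 40.0, "syllables": 1, "concreteness": None},
]

filters = {
    "frequency": (20, float("inf")),  # Frequency ≥ 20
    "syllables": (1, 1),              # Syllables = 1
    "concreteness": (4.0, 5.0),       # Concreteness ≥ 4.0
}

selected = [w["word"] for w in words if passes(w, filters)]
print(selected)  # ['word_a']
```

Here `word_d` meets the frequency and syllable criteria but is dropped because it has no concreteness rating, mirroring the missing-data rule.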

Phoneme Exclusions¶

Purpose: Exclude words containing specific phonemes

Use cases:

- Avoiding error sounds (e.g., exclude /s/ when child substitutes s→θ)
- Creating phoneme-specific lists (e.g., /k/ words without /g/)
- Controlling phonological context

Example:

Pattern: STARTS_WITH /k/
Exclude: /g/
Exclude: /s/

Result: /k/ words without /g/ or /s/
Matches: cat, car, cut, candy
Does not match: cats (has /s/), keg (has /g/)

Technical details:

- Exclusions apply to the entire IPA transcription
- Multiple exclusions create additional AND conditions
- Exact symbol matching (e.g., excluding /i/ won't exclude /ɪ/)
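A minimal sketch of how exclusions combine with a pattern, assuming each word is already split into phoneme tokens. The `violates_exclusions` helper and the small candidate dictionary are hypothetical.

```python
def violates_exclusions(phonemes: list[str], excluded: set[str]) -> bool:
    """True if any phoneme anywhere in the word is on the exclusion list."""
    return any(p in excluded for p in phonemes)


# Hypothetical candidates, pre-tokenized into phonemes.
candidates = {
    "cat": ["k", "æ", "t"],
    "cats": ["k", "æ", "t", "s"],
    "keg": ["k", "ɛ", "g"],
    "car": ["k", "ɑ", "ɹ"],
}

excluded = {"g", "s"}
kept = [
    word
    for word, phonemes in candidates.items()
    if phonemes[0] == "k" and not violates_exclusions(phonemes, excluded)
]
print(kept)  # ['cat', 'car']
```

Because the check compares whole tokens, excluding /t/ in this scheme would not remove a word containing only the affricate /tʃ/.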

Export Formats¶

CSV Export¶

Format:

word,ipa,syllables,phonemes,wcm,msh,phono_prob_avg,phono_prob_sum_log,positional_prob_avg,frequency,aoa,imageability,familiarity,concreteness,valence,arousal,dominance
cat,kæt,1,3,2,4.0,0.062,-2.46,0.081,182.5,2.1,6.8,6.9,4.93,7.2,3.8,5.2
dog,dɔg,1,3,1,4.0,0.054,-2.89,0.073,245.3,1.8,6.9,7.0,5.0,7.5,4.2,5.5

Details:

- Header row with property names
- One word per row
- Empty cells for missing properties
- IPA in standard Unicode characters
- Numbers use decimal notation

File size: ~0.1 KB per word (a row like those above is under 100 bytes)
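An exported file can be read back with Python's standard `csv` module. The sketch below uses a shortened, hypothetical five-column sample (the second row's imageability is blanked to illustrate a missing value) and converts empty cells to `None`; the `load_rows` name is an assumption.

```python
import csv
import io

# Shortened sample; a real export has the full header shown above.
SAMPLE = (
    "word,ipa,syllables,frequency,imageability\n"
    "cat,kæt,1,182.5,6.8\n"
    "dog,dɔg,1,245.3,\n"
)

TEXT_COLUMNS = {"word", "ipa"}


def load_rows(text: str) -> list[dict]:
    """Parse export text; numeric cells become floats, empty cells become None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for key, value in raw.items():
            if key in TEXT_COLUMNS:
                row[key] = value
            else:
                row[key] = float(value) if value else None
        rows.append(row)
    return rows


rows = load_rows(SAMPLE)
print(rows[0]["frequency"])     # 182.5
print(rows[1]["imageability"])  # None
```

For a downloaded file, the same reader works on `open(path, newline="", encoding="utf-8")` so that the IPA characters decode correctly.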

Copy Individual Words¶

Format: Plain text, one word per line

cat
dog
house

Use case: Quick copying for clinical materials

Advanced Query Examples¶

See Practical Examples for detailed walkthroughs, including:

Example 1: Simple CVC words for early intervention
Example 2: Initial /s/ words with semantic scaffolding
Example 3: Late-developing sounds in simple contexts
Example 4: Negative valence words for emotional language
Example 5: Excluding problematic phonemes

Performance Characteristics¶

Operation	Time	Notes
Pattern matching	10-50 ms	Full vocabulary regex scan
Property filtering	5-20 ms	In-memory array filter
Combined query	15-70 ms	Pattern + multiple filters
Export to CSV	~50 ms	Serialization + download
Copy to clipboard	~10 ms	Direct text copy

Factors affecting speed:

- Number of filters (more filters = slightly slower)
- Result set size (larger results = slower export)
- Browser performance (Chrome/Edge slightly faster)

Data Coverage & Limitations¶

Coverage by Property¶

Category	Properties	Average Coverage
Phonological	Syllables, Phonemes, WCM, MSH	98%
Phonotactic	Biphone Prob, Sum Log Prob, Positional Prob	100%
Lexical	Frequency, AoA	87%
Semantic	Imageability, Familiarity, Concreteness	47%
Affective	Valence, Arousal, Dominance	50%

Important: Filtering by properties with lower coverage (e.g., imageability) will reduce result set size significantly.
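The category averages follow from the approximate per-property coverage figures in the property tables, and the same figures give a rough ceiling on result size. The arithmetic below is a sketch using those approximate percentages; the variable names are illustrative.

```python
# Approximate per-property coverage (%) from the property tables.
coverage = {
    "Phonological": [100, 100, 95, 95],  # Syllables, Phonemes, WCM, MSH
    "Lexical": [99, 75],                 # Frequency, AoA
    "Semantic": [40, 40, 60],            # Imageability, Familiarity, Concreteness
}

averages = {
    category: round(sum(values) / len(values))
    for category, values in coverage.items()
}
print(averages)  # {'Phonological': 98, 'Lexical': 87, 'Semantic': 47}

# Upper bound on words that can survive an imageability filter (~40% coverage).
VOCABULARY_SIZE = 44_011
max_with_imageability = round(VOCABULARY_SIZE * 0.40)
print(max_with_imageability)  # 17604
```

Any imageability filter therefore starts from at most about 17,600 candidate words before its threshold, patterns, or other filters are applied.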

Vocabulary Limitations¶

Included:

- Primary pronunciations only (no variants)
- General American English dialect
- Common words with established psycholinguistic norms

Excluded:

- Pronunciation variants (CMU entries with (1), (2), etc.)
- Proper nouns
- Words without any psycholinguistic properties
- Non-English loanwords without standard pronunciations

Total vocabulary: 44,011 words

Technical Limitations¶

Cannot filter by:

- Phoneme features directly (e.g., "all fricatives"); use the Lookup tool for feature-based search
- Orthographic properties (spelling patterns)
- Grammatical category (noun, verb, etc.)
- Morphological complexity (prefixes, suffixes)

Pattern matching:

- Exact phoneme matching only (no fuzzy matching)
- Cannot match phonological classes (e.g., "any stop")
- Cannot use regular expressions directly

Tips & Best Practices¶

Getting Started¶

Start simple: Begin with one pattern, then add filters incrementally
Check the count: Preview shows how many words match before generating
Iterate: Adjust filters to get desired word count (aim for 20-50 words for clinical use)

Optimization¶

Use frequency filters: Ensures functional, commonly-used words
Combine complexity measures: Use WCM + MSH for precise developmental targeting
Consider coverage: Properties with lower coverage (imageability, familiarity) will reduce results more

Clinical Applications¶

Early intervention: Low WCM + high frequency + high imageability
Phoneme-specific practice: Pattern matching + exclusions + frequency
Semantic therapy: High imageability + concreteness + valence/arousal filters
Literacy support: Monosyllabic + high frequency + moderate AoA

Research Applications¶

Stimulus control: Match WCM, frequency, AoA across conditions
Semantic variables: Control concreteness, imageability, familiarity
Affective content: Select words by valence, arousal, dominance
Phonological complexity: Systematic manipulation of WCM, MSH, syllable/phoneme counts

References¶

Phonological Complexity:

- Stoel-Gammon, C. (2010). The Word Complexity Measure: Description and application to developmental phonology and disorders. Clinical Linguistics & Phonetics, 24(4-5), 271-282.
- Namasivayam, A. K., et al. (2021). Milestones of speech production in children. Journal of Speech, Language, and Hearing Research.

Phonotactic Probability:

- Vitevitch, M. S., & Luce, P. A. (2004). A Web-based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36(3), 481-487.

Psycholinguistic Norms:

- Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis. Behavior Research Methods, 41(4), 977-990.
- Brysbaert, M., et al. (2014). Concreteness ratings for 40 thousand English words. Behavior Research Methods, 46, 904-911.
- Scott, G. G., et al. (2019). The Glasgow Norms: Ratings of 5,500 words. Behavior Research Methods, 51, 1258-1270.
- Warriner, A. B., et al. (2013). Norms of valence, arousal, and dominance. Behavior Research Methods, 45, 1191-1207.