PhonoLex Data Derivation Manifest
Last updated: 2026-05-12
Scope: Every per-word column shipped in data/runtime/words.parquet and routed to the D1 words / word_properties / word_percentiles / word_freq_bands tables.
Why this exists: PhonoLex's clinical/research credibility rests on traceable derivation — for every norm we ship, a clinician must be able to point at this document and find (a) the method by which the column was produced, (b) the validation result that gates inclusion, and (c) the methodology citation that supports the inference. The audit pattern: if the column has no row here, it should not ship.
Anchor methodology citations (for LLM-derived columns):
- Martínez, G., Conde, J., Reviriego, P., & Brysbaert, M. (2025). Using Large Language Models to Generate Psycholinguistic Norms. Behavior Research Methods (in press).
- Brysbaert, M. (2024). Validating LLM-derived psycholinguistic ratings for word stimuli. Memory & Cognition.
The methodology is logprob expected-value extraction over a 1-N rating scale: the model is given a cloze-rating prompt with anchor examples, the next-token logprob distribution is read, the probabilities of the rating tokens (1..N) are renormalized to sum to 1, and the expected value E[r] = Σ r·p(r) is taken as the fine-grained rating. Validated across family + + this audit to correlate with human behavioral norms at Spearman 0.56-0.90 on abstract-semantic axes (concreteness, valence, familiarity, BOI, iconicity, AoA, socialness). It does NOT transfer to embodied-perception axes (see research/2026-05-12-sensorimotor-pilot/ for the Lancaster NO-GO finding), so run a small pilot as a methodology pre-check before applying it to a new axis.
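
The expected-value step described above can be sketched as follows. This is a minimal illustration, not the pipeline's build code: the function name `expected_rating`, the pooling of whitespace variants of a rating token, and the toy distribution are all assumptions made for the example.

```python
import math


def expected_rating(top_logprobs: dict[str, float], scale_max: int) -> float:
    """Return E[r] = sum(r * p(r)) over the rating tokens "1".."scale_max"."""
    valid = {str(r): r for r in range(1, scale_max + 1)}
    mass: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        rating = valid.get(token.strip())
        if rating is not None:
            # Variants such as "4" and " 4" are pooled (an assumption of this sketch).
            mass[rating] = mass.get(rating, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError("no rating tokens found in the returned distribution")
    # Renormalize over the rating tokens only, then take the expectation.
    return sum(r * p for r, p in mass.items()) / total


# Toy next-token distribution for a 1-5 scale (made-up numbers).
toy_logprobs = {
    "4": math.log(0.60),
    "5": math.log(0.30),
    "3": math.log(0.05),
    " the": math.log(0.05),  # non-rating token, ignored
}
rating = expected_rating(toy_logprobs, scale_max=5)
print(round(rating, 3))  # 4.263
```

Because the non-rating mass is discarded before renormalizing, the result always lies between 1 and N, which is what makes it usable as a fine-grained rating.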
Method buckets

| Method | What | Anchor authority |
| --- | --- | --- |
| A. Computed (algorithmic) | Derived from CMU pronunciations + linguistic theory; no third-party empirical data input. | Hayes (2009) phonology, Aronoff & Fudeman (2011) morphology. |
| B. Corpus-derived (FineWeb-Edu) | Token / lemma counts from FineWeb-Edu 1.27M docs / 1.9B tokens (ODC-BY 1.0). PhonoLex per-word statistics, not per-row redistribution. | HuggingFace (2024) FineWeb-Edu corpus card. |
| C. Corpus-derived (developmental) | Token counts from TalkBank corpora (CHILDES Eng-NA + Eng-UK + PhonBank) + FineWeb-Edu reading-grade bands. Per-word age-banded statistics, not per-utterance redistribution. | MacWhinney (2000) CHILDES; Rose & MacWhinney (2014) PhonBank. |
| D. LLM-cloze (pattern) | gpt-4.1-mini logprob expected-value over a validated rating prompt. Per-word PhonoLex-owned LLM output, oracles kept at data/norms/_oracles/ for validation only. | Martínez et al. (2025), Brysbaert (2024). |
| E. Embedding-derived | Qwen3-Embedding-4B representations + UMAP + HDBSCAN clustering, per-word entropy or covariance metrics over FineWeb-Edu chunks. PhonoLex aggregate statistics. | Hoffman et al. (2013) semantic diversity theory; Qwen3 model card. |

Shipped columns
A. Computed (algorithmic)

| Column | Method | Source / authority | Notes |
| --- | --- | --- | --- |
| syllable_count | CMU pronunciation → syllabification via maximal-onset rule | Hayes (2009) syllabification | Per-word integer. |
| phoneme_count | CMU pronunciation length | CMU Pronouncing Dictionary v0.7b (modified BSD) | Per-word integer. |
| neighborhood_density | Levenshtein-1 phonological neighbors over the CMU vocabulary | Standard psycholinguistic metric (Luce & Pisoni 1998) | Computed in pipeline. |
| str_neighborhood_density | Edit-1 stressed-phoneme neighbors | Same | Stress-sensitive variant. |
| phono_prob_avg, positional_prob_avg, str_phono_prob_avg, str_positional_prob_avg | Phonotactic probability over CMU + FineWeb-Edu frequency weights | Vitevitch & Luce (2004) | Pipeline-computed. |
| wcm_score | Word Complexity Measure | Stoel-Gammon (2010) | Pipeline-computed from CMU. |
| morpheme_count, n_prefixes, n_suffixes | In-house algorithmic morphology analyzer (Aronoff & Fudeman affix tables + Hayes 2009 prior) | PhonoLex packages/data/src/phonolex_data/morphology/ | Replacement of MorphyNet (CC BY-SA 3.0; share-alike incompatible with proprietary). 32 unit tests validate SLP-relevant cases. |
| is_monomorphemic | Derived from morpheme_count == 1 | Same | Boolean. |
| pos, pos_alt, pos_dominant_freq, all_pos, all_freqs | spaCy en_core_web_trf UPOS tags + frequency distribution | spaCy (MIT) over FineWeb-Edu | Replaces SUBTLEX-US POS (CC BY-NC-SA 4.0). |

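
As an illustration of the neighborhood_density row, the sketch below counts Levenshtein-1 phonological neighbors (one phoneme substituted, inserted, or deleted) over a toy lexicon. The function names, the naive all-pairs loop, and the stress-stripped ARPAbet symbols are assumptions for the example; the pipeline's own implementation is not shown in this manifest.

```python
Pron = tuple[str, ...]


def is_neighbor(a: Pron, b: Pron) -> bool:
    """True if b differs from a by exactly one phoneme edit."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False  # homophones are not counted as neighbors in this sketch
    if len(a) == len(b):
        # Same length: exactly one substituted phoneme.
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # Lengths differ by one: skip the first mismatch in the longer form.
    i = 0
    while i < len(short) and short[i] == long_[i]:
        i += 1
    return short[i:] == long_[i + 1:]


def neighborhood_density(lexicon: dict[str, Pron]) -> dict[str, int]:
    """Count edit-distance-1 neighbors for every word (O(n^2), fine for a toy)."""
    return {
        word: sum(
            is_neighbor(pron, other)
            for other_word, other in lexicon.items()
            if other_word != word
        )
        for word, pron in lexicon.items()
    }


toy_lexicon = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cap": ("K", "AE", "P"),
    "at": ("AE", "T"),
    "cats": ("K", "AE", "T", "S"),
    "dog": ("D", "AO", "G"),
}
density = neighborhood_density(toy_lexicon)
print(density["cat"])  # 4
```

A stress-sensitive variant such as str_neighborhood_density would presumably keep the stress digits on vowels (AE1 vs AE0) so that a stress change counts as a substitution; that reading is an assumption, not something the table states.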
B. Corpus-derived (FineWeb-Edu base)

| Column | Method | Source / authority | Validation |
| --- | --- | --- | --- |
| frequency | Token frequency in FineWeb-Edu (~800M tokens, 760K word types) | FineWeb-Edu ODC-BY 1.0, spaCy POS pass | Replaces SUBTLEX-US (no posted license). Caveat: FineWeb-Edu has educational-register skew; see memory/feedback_corpus_register_match.md. |
| log_frequency | log10(frequency) | Same | Same caveat. |
| contextual_diversity | Document-count divided by frequency (CD index, Adelman et al. 2006) | Same | Same caveat. |
| lemma_frequency, lemma_log_frequency | en_core_web_sm lemma counts over the same FineWeb-Edu pass | | Same caveat. |
| lemma_frequency_b1..b5 | Lemma freq within 5 reading-grade quantile bins | | Educational-register caveat; reading-grade bands derived via Flesch-Kincaid quantiles over FineWeb-Edu docs. |

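
The three base columns can be sketched from tokenized documents as below. This is only an illustration of the formulas in the table: the function name is hypothetical, frequency is taken to be a raw token count (the manifest does not say whether it is normalized), and the documents are toy data.

```python
import math
from collections import Counter


def frequency_columns(docs: list[list[str]]) -> dict[str, dict[str, float]]:
    """Per-word frequency, log_frequency, and contextual_diversity."""
    token_counts: Counter = Counter()
    doc_counts: Counter = Counter()
    for tokens in docs:
        token_counts.update(tokens)     # every occurrence
        doc_counts.update(set(tokens))  # at most once per document
    return {
        word: {
            "frequency": count,
            "log_frequency": math.log10(count),
            # Document count divided by token frequency, as stated in the table.
            "contextual_diversity": doc_counts[word] / count,
        }
        for word, count in token_counts.items()
    }


toy_docs = [["the", "cat", "sat"], ["the", "the", "dog"]]
cols = frequency_columns(toy_docs)
print(cols["the"])
```

Only words that occur at least once appear in the output, so log10 is never taken of zero.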
C. Corpus-derived (developmental frequency)

| Column | Method | Source / authority | Validation |
| --- | --- | --- | --- |
| freq_age_2y | Child PRODUCTION wpm at 12-36mo, mean across CHILDES + PhonBank prod channels | aggregated headline; TalkBank corpora | Reframed 2026-05-25 from caregiver INPUT to child PRODUCTION: input aggregates surfaced adult vocabulary as a "2y filter"; fixed in v5.2.1. |
| freq_age_5y | Child PRODUCTION wpm at 36-72mo (CHILDES + PhonBank prod channels) | Same | Same reframing. |
| freq_age_8y | Child PRODUCTION wpm at 72-108mo (CHILDES prod channel only) | | CYP-LEX reading-grade fallback dropped 2026-05-25 because it was not production data. |
| freq_age_12y | Child PRODUCTION wpm at 108-144mo (CHILDES prod channel only) | Same | Same. |
| freq_age_all | Alias of frequency: the FineWeb-Edu-derived frequency, surfaced at the top of the developmental ladder as the general-corpus reference | | Replaces the legacy freq_age_adult, which mistakenly aggregated CYP-LEX reading bands. |
| freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13 | CYP-LEX child-corpus age-banded frequency | Korochkina, Marelli, Brysbaert & Rastle (2024) CYP-LEX, CC BY 4.0 (OSF) | Independent corpus shape from CHILDES/PhonBank; complementary age coverage. |

Percentile semantics for frequency-class properties: when computing percentiles for freq_* columns, value=0 is treated as NULL — a word that never occurred in the source corpus shouldn't cluster at a misleading mid-rank (which is where tied zeros used to land). Rating-scale norms (AoA, concreteness, valence, ...) keep zero as a valid data point.
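
The zero-as-NULL rule can be illustrated with a small percentile routine. This is a sketch under stated assumptions: the function name and the mid-rank percentile definition are chosen for the example, and the manifest does not specify which percentile formula the pipeline uses.

```python
from bisect import bisect_left, bisect_right
from typing import Optional


def percentile_ranks(
    values: dict[str, Optional[float]], zero_is_null: bool
) -> dict[str, Optional[float]]:
    """Mid-rank percentiles; zeros are dropped first when zero_is_null is set."""

    def missing(v: Optional[float]) -> bool:
        return v is None or (zero_is_null and v == 0)

    present = sorted(v for v in values.values() if not missing(v))
    ranks: dict[str, Optional[float]] = {}
    for word, v in values.items():
        if missing(v):
            ranks[word] = None  # never occurred: no percentile at all
            continue
        below = bisect_left(present, v)
        ties = bisect_right(present, v) - below
        ranks[word] = 100.0 * (below + 0.5 * ties) / len(present)
    return ranks


# Frequency-class column: the zero is excluded from the ranking.
toy_freq = {"the": 500.0, "cat": 20.0, "dog": 20.0, "zyzzyva": 0.0}
freq_pct = percentile_ranks(toy_freq, zero_is_null=True)

# Rating-scale column: zero stays a valid data point.
toy_rating = {"a": 0.0, "b": 1.0}
rating_pct = percentile_ranks(toy_rating, zero_is_null=False)
```

With the flag set, "zyzzyva" receives no percentile and the remaining three words are ranked among themselves, which avoids the block of tied zeros landing at a mid-rank.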
D. LLM-cloze (family + + this audit)
All built via gpt-4.1-mini cloze-prompt with top_logprobs=20 over ~47K non-PROPN content words. Per-word PhonoLex-owned output; oracle CSVs/XLSXs moved to data/norms/_oracles/. Build scripts at research/2026-04-30-llm-word-features/build_*.py.

| Column | Build script | Oracle | Full-vocab Spearman | Notes |
| --- | --- | --- | --- | --- |
| concreteness | build_concreteness.py | Brysbaert et al. (2014), _oracles/concreteness_brysbaert2014.txt | 0.878 (N=24,109 overlap) | Closeout 2026-05-03. |
| valence | build_valence.py | Warriner et al. (2013), _oracles/Ratings_VAD_WarrinerEtAl.csv | 0.853 (N=12,537) | Closeout 2026-05-03. |
| arousal | build_arousal.py | Same Warriner CSV | 0.626 (N=12,537; ceiling-bound construct) | Closeout 2026-05-03. Ceiling effect on this axis; ρ is still above the 0.50 convergent-validity floor. |
| familiarity | build_familiarity.py | Glasgow Norms FAM column, _oracles/GlasgowNorms.xlsx | 0.786 (N=4,401, rank-order) | Closeout 2026-05-03; replaces Brysbaert Word Prevalence 2019. |
| aoa | build_aoa.py | Glasgow Norms AoA column (primary) + Kuperman 2012 (cross-construct sanity), _oracles/{GlasgowNorms,kuperman_aoa}.xlsx | 0.898 (N=4,399 Glasgow) / 0.829 (N=17,572 Kuperman-Glasgow-unseen) | Closeout 2026-05-12. See research/2026-05-11-phon-115-aoa-pilot/report.md. Retires Kuperman + Glasgow as pipeline sources; both kept as eval oracles. Also retires the orphaned imageability + size columns. |
| boi | build_boi.py | Pexman et al. (2019), _oracles/boi_pexman2019.xlsx | 0.820 (N=7,940) | Closeout 2026-05-02. |
| iconicity | build_iconicity.py | Winter et al. (2024), _oracles/iconicity_winter2024.csv | 0.564 (N=13,062 full-vocab; 0.594 N=500 held-out) | Closeout 2026-05-02. Single-oracle ρ in Winter's own inter-rater band; reproduces 4/4 published convergent-validity patterns (POS, AoA, concreteness, reduplication). See research/2026-04-30-llm-word-features/iconicity_convergent_validity.md. |
| socialness | build_socialness.py | Diveica, Pexman & Binney (2021), _oracles/SocialnessNorms_DiveicaPexmanBinney2021.csv | 0.820 (N=7,850 full overlap; pilot 0.865 N=200) | This audit 2026-05-12. See research/2026-05-12-socialness-pilot/full_build_report.md. Replaces Diveica as pipeline source; Diveica kept as eval oracle. |

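
The "Full-vocab Spearman" column reports a rank correlation between the LLM-derived ratings and an oracle over the words both cover. A dependency-free sketch of that comparison follows; the function names and the toy ratings are assumptions for illustration and are not taken from the build scripts or the oracle files.

```python
import math


def average_ranks(xs: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(xs: list[float], ys: list[float]) -> float:
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_on_overlap(
    llm: dict[str, float], oracle: dict[str, float]
) -> tuple[float, int]:
    """Spearman rho over the shared vocabulary, plus the overlap size N."""
    shared = sorted(set(llm) & set(oracle))
    rho = pearson(
        average_ranks([llm[w] for w in shared]),
        average_ranks([oracle[w] for w in shared]),
    )
    return rho, len(shared)


# Toy ratings (made up); "zebra" and "river" fall outside the overlap.
toy_llm = {"apple": 4.8, "chair": 4.6, "idea": 1.6, "justice": 1.4, "zebra": 4.9}
toy_oracle = {"apple": 5.0, "chair": 4.58, "idea": 1.3, "justice": 1.45, "river": 4.9}
rho, n_overlap = spearman_on_overlap(toy_llm, toy_oracle)
print(round(rho, 3), n_overlap)  # 0.8 4
```

Reporting N alongside ρ, as the table does, matters because the overlap differs by oracle (for example 4,399 words for Glasgow versus 24,109 for Brysbaert).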
E. Embedding-derived (Semantic Diversity)
Built via Qwen3-Embedding-4B (Apache 2.0) + UMAP + HDBSCAN clustering over FineWeb-Edu chunks. Per-word topic-distribution and context-covariance statistics.

| Column | Method | Validation |
| --- | --- | --- |
| semd_topic | Entropy of P(topic\|word) over HDBSCAN clusters | Primary SemD metric; full ELP behavioral battery validated. |
| semd_vn | Von Neumann entropy of context-vector covariance | Information-geometric variant. |
| semd_h13 | -log(mean off-diagonal cosine), Hoffman 2013 recipe applied to PhonoLex chunks | Direct correspondence with Hoffman 2013 published values; Spearman 0.74 vs Hoffman 30K oracle. |
| n_topics_for_word | Count of distinct non-noise HDBSCAN topics the word participates in | Topic-richness metric. |
| semantic_diversity | Alias of semd_topic (backward-compat) | Same. |

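
A minimal sketch of the semd_topic and n_topics_for_word rows, given the cluster label of each chunk a word appears in. The function name, the base-2 logarithm, and the exclusion of noise chunks from the entropy are assumptions of the example; the manifest only states that noise is excluded from the topic count.

```python
import math
from collections import Counter

NOISE_LABEL = -1  # HDBSCAN labels unclustered points as -1


def topic_entropy(chunk_topics: list[int]) -> tuple[float, int]:
    """Return (entropy of P(topic | word) in bits, number of non-noise topics)."""
    counts = Counter(t for t in chunk_topics if t != NOISE_LABEL)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0  # the word only ever appeared in noise chunks
    entropy = sum(
        (c / total) * math.log2(total / c) for c in counts.values()
    )
    return entropy, len(counts)


# A word spread evenly over two topics, plus one noise chunk.
spread_entropy, spread_topics = topic_entropy([0, 0, 1, 1, -1])
# A word confined to a single topic.
narrow_entropy, narrow_topics = topic_entropy([3, 3, 3])
print(spread_entropy, spread_topics)  # 1.0 2
```

A word concentrated in one topic scores 0, and the score grows as its occurrences spread more evenly over more topics, which is the sense in which the metric captures semantic diversity.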
Phonological pattern fields

| Column | Method | Source |
| --- | --- | --- |
| phonemes, phonemes_str, syllables, ipa, initial_phoneme, final_phoneme | Direct from CMU + ARPAbet→IPA conversion in phonolex_data.phonology | CMU Pronouncing Dictionary v0.7b (modified BSD). |
| root, is_monomorphemic, variants | In-house morphology analyzer output | Aronoff & Fudeman affix tables + algorithmic decomposer. |

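
To make the first row concrete, the sketch below derives the pattern fields from one CMU-style pronunciation string. It does not reproduce phonolex_data.phonology: the function name is hypothetical, the ARPAbet→IPA mapping is deliberately partial, and stripping the stress digits from the phonemes field is an assumption.

```python
# Partial ARPAbet -> IPA mapping, enough for the toy examples only.
ARPABET_TO_IPA = {
    "B": "b", "D": "d", "K": "k", "S": "s", "T": "t",
    "SH": "ʃ", "NG": "ŋ",
    "AE": "æ", "IY": "i",
}


def pattern_fields(cmu_pron: str) -> dict[str, object]:
    """Derive pattern fields from a pronunciation such as "K AE1 T"."""
    # CMU marks vowel stress with a trailing 0/1/2; drop it here (assumption).
    phonemes = [symbol.rstrip("012") for symbol in cmu_pron.split()]
    return {
        "phonemes": phonemes,
        "phonemes_str": " ".join(phonemes),
        # Raises KeyError for symbols outside the partial mapping above.
        "ipa": "".join(ARPABET_TO_IPA[p] for p in phonemes),
        "initial_phoneme": phonemes[0],
        "final_phoneme": phonemes[-1],
    }


cat_fields = pattern_fields("K AE1 T")
sheet_fields = pattern_fields("SH IY1 T")
print(cat_fields["ipa"], sheet_fields["ipa"])  # kæt ʃit
```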
Edges (association graph, data/runtime/edges.parquet)

| Edge field | Method | Source / authority |
| --- | --- | --- |
| qwensim | Pairwise cosine over Qwen3-Embedding-4B word vectors | PhonoLex in-house. Spearman 0.73 vs SimLex-999, 0.85 vs MEN-3000 on overlap. Replaces SimLex/MEN as pipeline edges. |
| eccc_consistency, eccc_n_instances, eccc_phoneme_distance | English Consistent Confusion Corpus (Marxer et al. 2016) | CC BY 4.0, kept as direct redistribution. |
| wordsim_relatedness | WordSim-353 (Finkelstein et al. 2002) | CC BY 4.0; surfaced in LookupTool. |

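
The qwensim edge weight is a pairwise cosine between word vectors. A dependency-free sketch over toy 2-D vectors follows; the function names and the vectors are illustrative assumptions, and real Qwen3-Embedding-4B vectors are much higher-dimensional and would normally be handled with an array library.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine_edges(
    vectors: dict[str, list[float]]
) -> dict[tuple[str, str], float]:
    """One undirected edge per unordered word pair, keyed in sorted order."""
    words = sorted(vectors)
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for i, a in enumerate(words)
        for b in words[i + 1:]
    }


toy_vectors = {"cat": [1.0, 0.0], "dog": [1.0, 1.0], "car": [0.0, 1.0]}
edges = pairwise_cosine_edges(toy_vectors)
print(sorted(edges))
```

Keying each pair in sorted order stores every undirected edge once, so (cat, dog) and (dog, cat) cannot disagree.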
Retired columns (kept for transparency)
These columns were shipped in earlier PhonoLex versions but have been retired:

| Column | Retired | Reason | Replacement |
| --- | --- | --- | --- |
| aoa_kuperman | 2026-05-12 | License audit (Kuperman 2012 no posted license) | aoa via gpt-4.1-mini cloze (Spearman 0.898 vs Glasgow oracle). |
| imageability | 2026-05-12 | Orphan post-Glasgow-relocation (no clinical consumer surfaced; failed feedback_ship_big_norms_only.md test) | None |
| size | 2026-05-12 | Same: Glasgow-sourced, no consumer | None |
| 11 Lancaster sensorimotor channels (auditory, visual, haptic, gustatory, olfactory, interoceptive, hand_arm, foot_leg, head, mouth, torso) | 2026-05-12 (data audit) | LLM cloze-rating methodology failed the 0.70 Spearman gate on every Lynott channel (best haptic 0.683, worst head 0.226). See research/2026-05-12-sensorimotor-pilot/report.md. Lynott 2020 is CC BY 4.0 (clean), but no Custom Word Lists / Lookup / Text Analyzer consumer surfaced the channels in v5.2 with a named clinical workflow. Retired per feedback_ship_big_norms_only.md. | None (Lynott 2020 retained as eval oracle at _oracles/ for future revisit.) |
| vocab_memberships | 2026-05-12 (data audit) | The AVL/Ogden/Roget/Swadesh set-membership concept had no clinical consumer surfaced in v5.2. | None (Individual vocab loaders remain available for ad-hoc research via phonolex_data.loaders.{load_ogden,load_afinn,…}.) |
| dominance | 2026-05-02 | LLM-derived signal r=0.41 vs Warriner (below 0.50 convergent-validity target); V-A-D third axis conflates agency + authority. | None (PhonoLex uses 2-axis valence-arousal, Russell circumplex.) |
| prevalence (Brysbaert 2019) | 2026-05-03 | Replaced by AI familiarity proxy | familiarity via Glasgow FAM oracle. |
| simlex_similarity, men_relatedness (edges) | 2026-05-02 | Replaced by Qwensim | qwensim edge field. |
| morpheme_count (MorphyNet source) | 2026-04-30 | MorphyNet CC BY-SA 3.0 (share-alike incompatible with proprietary) | In-house algorithmic decomposer. |
| frequency (SUBTLEX-US source) | 2026-05-03 | SUBTLEX-US no posted license | FineWeb-Edu corpus. |
| elp_lexical_decision_rt | 2026-04-30 | ELP no posted license; behavioral RT cannot be AI-estimated | Tracked in (in-house Mechanical Turk / Prolific behavioral collection, deferred). |

Open / parked
- Frequency UX (parked 2026-05-12): the ~8-10 surfaced frequency columns currently leave clinicians without a clear "which frequency do I use for age X?" guide. See memory/project_frequency_columns_audit_parked.md for the deployment / UX revisit scoped for v5.3+.
- Lancaster Sensorimotor revisit: Lynott 2020 remains at _oracles/. Future approaches that might pass the gate include (a) prompt iteration with embodied-experience anchor examples, (b) small-model fine-tuning on Lynott data, or (c) accepting Lynott CC BY 4.0 as a third-party ship if a clinical consumer materializes. v5.3+ scope.
- Behavioral norms (USF Free Association / SPP / ELP): in-house Mechanical Turk / Prolific collection track. Deferred.