
PhonoLex Data Derivation Manifest

Last updated: 2026-05-12

Scope: Every per-word column shipped in data/runtime/words.parquet and routed to the D1 words / word_properties / word_percentiles / word_freq_bands tables.

Why this exists: PhonoLex's clinical/research credibility rests on traceable derivation — for every norm we ship, a clinician must be able to point at this document and find (a) the method by which the column was produced, (b) the validation result that gates inclusion, and (c) the methodology citation that supports the inference. The audit pattern: if the column has no row here, it should not ship.

Anchor methodology citations (for LLM-derived columns):

  • Martínez, G., Conde, J., Reviriego, P., & Brysbaert, M. (2025). Using Large Language Models to Generate Psycholinguistic Norms. Behavior Research Methods (in press).
  • Brysbaert, M. (2024). Validating LLM-derived psycholinguistic ratings for word stimuli. Memory & Cognition.

The methodology is logprob expected-value extraction over a 1-N rating scale: the model is given a cloze-rating prompt with anchor examples, the next-token logprob distribution is read, the probabilities of the rating tokens (1..N) are renormalized to sum to 1, and the expected value E[r] = Σ r·p(r) is taken as the fine-grained rating. The method has been validated across family + + this audit to correlate with human behavioral norms at Spearman 0.56-0.90 on abstract-semantic axes (concreteness, valence, familiarity, BOI, iconicity, AoA, socialness). It does NOT transfer to embodied-perception axes (see research/2026-05-12-sensorimotor-pilot/ for the Lancaster NO-GO finding), so run a small pilot as a methodology pre-check before applying it to a new axis.
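The expected-value step can be sketched as follows. This is a minimal illustration, not the build scripts' code: the function name `expected_rating`, the 7-point default scale, and the token-to-logprob dictionary shape are assumptions made for the example.

```python
import math


def expected_rating(top_logprobs, scale_max=7):
    """Return E[r] = sum(r * p(r)) over the rating tokens "1".."scale_max".

    top_logprobs maps candidate next tokens to natural-log probabilities
    (an assumed shape for this sketch). Returns None when no rating token
    appears among the candidates.
    """
    valid = {str(r): r for r in range(1, scale_max + 1)}
    mass = {}
    for token, logprob in top_logprobs.items():
        rating = valid.get(token.strip())
        if rating is not None:
            # Sum variants such as "5" and " 5" into the same rating.
            mass[rating] = mass.get(rating, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        return None
    # Renormalize over rating tokens only, then take the expectation.
    return sum(r * p for r, p in mass.items()) / total


# Illustrative distribution: 12.5% of the mass sits on a non-rating token
# and is discarded before normalization.
example = {
    "5": math.log(0.5),
    "4": math.log(0.25),
    "6": math.log(0.125),
    " the": math.log(0.125),
}
rating = expected_rating(example)
```

Renormalizing over the rating tokens keeps the result on the 1..N scale even when part of the probability mass falls on non-rating tokens.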


Method buckets

| Method | What | Anchor authority |
| --- | --- | --- |
| A. Computed (algorithmic) | Derived from CMU pronunciations + linguistic theory; no third-party empirical data input. | Hayes (2009) phonology, Aronoff & Fudeman (2011) morphology. |
| B. Corpus-derived (FineWeb-Edu) | Token / lemma counts from FineWeb-Edu 1.27M docs / 1.9B tokens (ODC-BY 1.0). PhonoLex per-word statistics, not per-row redistribution. | HuggingFace (2024) FineWeb-Edu corpus card. |
| C. Corpus-derived (developmental) | Token counts from TalkBank corpora (CHILDES Eng-NA + Eng-UK + PhonBank) + FineWeb-Edu reading-grade bands. Per-word age-banded statistics, not per-utterance redistribution. | MacWhinney (2000) CHILDES; Rose & MacWhinney (2014) PhonBank. |
| D. LLM-cloze (pattern) | gpt-4.1-mini logprob expected-value over a validated rating prompt. Per-word PhonoLex-owned LLM output, oracles kept at data/norms/_oracles/ for validation only. | Martínez et al. (2025), Brysbaert (2024). |
| E. Embedding-derived | Qwen3-Embedding-4B representations + UMAP + HDBSCAN clustering, per-word entropy or covariance metrics over FineWeb-Edu chunks. PhonoLex aggregate statistics. | Hoffman et al. (2013) semantic diversity theory; Qwen3 model card. |

Shipped columns

A. Computed (algorithmic)

| Column | Method | Source / authority | Notes |
| --- | --- | --- | --- |
| syllable_count | CMU pronunciation → syllabification via maximal-onset rule | Hayes (2009) syllabification | Per-word integer. |
| phoneme_count | CMU pronunciation length | CMU Pronouncing Dictionary v0.7b (modified BSD) | Per-word integer. |
| neighborhood_density | Levenshtein-1 phonological neighbors over the CMU vocabulary | Standard psycholinguistic metric (Luce & Pisoni 1998) | Computed in pipeline. |
| str_neighborhood_density | Edit-1 stressed-phoneme neighbors | Same | Stress-sensitive variant. |
| phono_prob_avg, positional_prob_avg, str_phono_prob_avg, str_positional_prob_avg | Phonotactic probability over CMU + FineWeb-Edu frequency weights | Vitevitch & Luce (2004) | Pipeline-computed. |
| wcm_score | Word Complexity Measure | Stoel-Gammon (2010) | Pipeline-computed from CMU. |
| morpheme_count, n_prefixes, n_suffixes | In-house algorithmic morphology analyzer (Aronoff & Fudeman affix tables + Hayes 2009 prior) | PhonoLex packages/data/src/phonolex_data/morphology/ | Replacement of MorphyNet (CC BY-SA 3.0; share-alike incompatible with proprietary). 32 unit tests validate SLP-relevant cases. |
| is_monomorphemic | Derived from morpheme_count == 1 | Same | Boolean. |
| pos, pos_alt, pos_dominant_freq, all_pos, all_freqs | spaCy en_core_web_trf UPOS tags + frequency distribution | spaCy (MIT) over FineWeb-Edu | Replaces SUBTLEX-US POS (CC BY-NC-SA 4.0). |
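As an illustration of the Levenshtein-1 neighbor definition behind neighborhood_density, here is a minimal sketch. The helper names and the toy lexicon are hypothetical, stress digits are stripped for simplicity, and the pipeline's actual implementation may differ (for example in how homophones are handled).

```python
def is_neighbor(a, b):
    """True if phoneme tuples a and b differ by exactly one substitution,
    insertion, or deletion (edit distance 1)."""
    if a == b:
        return False
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one substituted phoneme.
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # Lengths differ by one: deleting one phoneme from the longer form
    # must yield the shorter form.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def neighborhood_density(word, lexicon):
    """Count lexicon entries whose pronunciation is one edit from `word`."""
    target = lexicon[word]
    return sum(
        1
        for other, phones in lexicon.items()
        if other != word and is_neighbor(target, phones)
    )


# Toy lexicon in ARPAbet without stress markers (illustrative only).
toy_lexicon = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cap": ("K", "AE", "P"),
    "at": ("AE", "T"),
    "cast": ("K", "AE", "S", "T"),
    "dog": ("D", "AO", "G"),
}
cat_density = neighborhood_density("cat", toy_lexicon)
```

In this toy lexicon "cat" has four neighbors (bat, cap, at, cast) and "dog" has none. The brute-force scan is quadratic over the vocabulary; a production pipeline would likely index pronunciations instead.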

B. Corpus-derived (FineWeb-Edu base)

| Column | Method | Source / authority | Validation |
| --- | --- | --- | --- |
| frequency | Token frequency in FineWeb-Edu (~800M tokens, 760K word types) | FineWeb-Edu ODC-BY 1.0, spaCy POS pass | Replaces SUBTLEX-US (no posted license). Caveat: FineWeb-Edu has educational-register skew; see memory/feedback_corpus_register_match.md. |
| log_frequency | log10(frequency) | Same | Same caveat. |
| contextual_diversity | Document-count divided by frequency (CD index, Adelman et al. 2006) | Same | Same caveat. |
| lemma_frequency, lemma_log_frequency | en_core_web_sm lemma counts over the same FineWeb-Edu pass | | Same caveat. |
| lemma_frequency_b1..b5 | Lemma freq within 5 reading-grade quantile bins | | Educational-register caveat; reading-grade bands derived via Flesch-Kincaid quantiles over FineWeb-Edu docs. |
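The three base columns above can be sketched from tokenized documents as follows. This is a toy illustration under stated assumptions: `frequency` is treated as a raw token count (the manifest does not say whether counts are normalized), `contextual_diversity` follows the table's wording (document count divided by frequency), and the function name is hypothetical.

```python
import math
from collections import Counter


def frequency_columns(docs):
    """Compute frequency, log_frequency, and contextual_diversity per word.

    docs is an iterable of token lists, one list per document.
    """
    token_counts = Counter()
    doc_counts = Counter()
    for tokens in docs:
        token_counts.update(tokens)
        # Each document contributes at most once per word type.
        doc_counts.update(set(tokens))
    rows = {}
    for word, freq in token_counts.items():
        rows[word] = {
            "frequency": freq,
            "log_frequency": math.log10(freq),
            "contextual_diversity": doc_counts[word] / freq,
        }
    return rows


toy_docs = [["the", "cat", "sat"], ["the", "the", "dog"], ["cat"]]
rows = frequency_columns(toy_docs)
```

Here "the" occurs three times across two documents, while "cat" occurs twice across two documents, so "cat" gets the higher document-to-token ratio.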

C. Corpus-derived (developmental frequency)

| Column | Method | Source / authority | Validation |
| --- | --- | --- | --- |
| freq_age_2y | Child PRODUCTION wpm at 12-36mo, mean across CHILDES + PhonBank prod channels | aggregated headline; TalkBank corpora | Reframed 2026-05-25 from caregiver INPUT to child PRODUCTION: input aggregates surfaced adult vocabulary as a "2y filter," fixed in v5.2.1. |
| freq_age_5y | Child PRODUCTION wpm at 36-72mo (CHILDES + PhonBank prod channels) | Same | Same reframing. |
| freq_age_8y | Child PRODUCTION wpm at 72-108mo (CHILDES prod channel only) | | CYP-LEX reading-grade fallback dropped 2026-05-25 because it wasn't production data. |
| freq_age_12y | Child PRODUCTION wpm at 108-144mo (CHILDES prod channel only) | Same | Same. |
| freq_age_all | Alias of frequency: FineWeb-Edu derived frequency, surfaced at the top of the developmental ladder as the general-corpus reference | | Replaces the legacy freq_age_adult which mistakenly aggregated CYP-LEX reading bands. |
| freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13 | CYP-LEX child-corpus age-banded frequency | Sheridan & Jakobson (2019) CYP-LEX, CC BY 4.0 (OSF) | Independent corpus shape from CHILDES/PhonBank; complementary age coverage. |
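A rough sketch of the age-banded production frequency follows. Several points are assumptions, not confirmed by this manifest: "wpm" is read as words per million tokens, age bands are treated as half-open intervals [low, high) in months, each channel is weighted equally in the mean, and the function names and utterance format are hypothetical.

```python
AGE_BANDS = {
    "freq_age_2y": (12, 36),
    "freq_age_5y": (36, 72),
    "freq_age_8y": (72, 108),
    "freq_age_12y": (108, 144),
}


def band_wpm(utterances, word, band):
    """Words-per-million for `word` among child utterances in one age band.

    utterances is a list of (age_in_months, tokens) pairs for one channel.
    Returns None when the channel has no tokens in the band.
    """
    low, high = band
    total = hits = 0
    for age_months, tokens in utterances:
        if low <= age_months < high:
            total += len(tokens)
            hits += tokens.count(word)
    return 1_000_000 * hits / total if total else None


def mean_wpm(channels, word, band):
    """Average the per-channel wpm, skipping channels with no data."""
    values = [band_wpm(utts, word, band) for utts in channels]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


# Toy production data (illustrative only).
childes = [(24, ["more", "juice"]), (30, ["more", "more", "ball", "up"]), (40, ["dog"])]
phonbank = [(20, ["ball", "more", "ball", "up"])]
more_2y = mean_wpm([childes, phonbank], "more", AGE_BANDS["freq_age_2y"])
```

Averaging per-channel rates rather than pooling tokens keeps a large corpus from dominating a small one; whether the pipeline makes the same choice is not stated here.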

Percentile semantics for frequency-class properties: when computing percentiles for freq_* columns, value=0 is treated as NULL — a word that never occurred in the source corpus shouldn't cluster at a misleading mid-rank (which is where tied zeros used to land). Rating-scale norms (AoA, concreteness, valence, ...) keep zero as a valid data point.
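The zero-as-NULL rule can be illustrated with a small percentile helper. This is a sketch only: the mid-rank percentile definition and the function name are assumptions, chosen because mid-rank scoring reproduces the tied-zeros behavior described above.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(values, zero_is_null):
    """Mid-rank percentiles (0-100); None marks values excluded as NULL.

    With zero_is_null=True (frequency-class columns), zeros are dropped
    from the ranking and reported as None. With zero_is_null=False
    (rating-scale norms), zero is ranked like any other value.
    """
    def is_null(v):
        return v is None or (zero_is_null and v == 0)

    ordered = sorted(v for v in values if not is_null(v))
    n = len(ordered)
    ranks = []
    for v in values:
        if is_null(v):
            ranks.append(None)
            continue
        below = bisect_left(ordered, v)
        equal = bisect_right(ordered, v) - below
        # Ties share the midpoint of the positions they occupy.
        ranks.append(100.0 * (below + 0.5 * equal) / n)
    return ranks


freqs = [0, 0, 0, 5, 10, 20]
as_frequency = percentile_ranks(freqs, zero_is_null=True)
as_rating = percentile_ranks(freqs, zero_is_null=False)
```

With zeros ranked, the three unattested words tie at the 25th percentile and push every attested word upward; with zeros treated as NULL they receive no percentile and the attested words are ranked only against each other.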

D. LLM-cloze (family + + this audit)

All built via gpt-4.1-mini cloze-prompt with top_logprobs=20 over ~47K non-PROPN content words. Per-word PhonoLex-owned output; oracle CSVs/XLSXs moved to data/norms/_oracles/. Build scripts at research/2026-04-30-llm-word-features/build_*.py.

| Column | Build script | Oracle | Full-vocab Spearman | Notes |
| --- | --- | --- | --- | --- |
| concreteness | build_concreteness.py | Brysbaert et al. (2014), _oracles/concreteness_brysbaert2014.txt | 0.878 (N=24,109 overlap) | closeout 2026-05-03. |
| valence | build_valence.py | Warriner et al. (2013), _oracles/Ratings_VAD_WarrinerEtAl.csv | 0.853 (N=12,537) | closeout 2026-05-03. |
| arousal | build_arousal.py | Same Warriner CSV | 0.626 (N=12,537; ceiling-bound construct) | closeout 2026-05-03. Ceiling effect on this axis; well above 0.50 convergent-validity floor. |
| familiarity | build_familiarity.py | Glasgow Norms FAM column, _oracles/GlasgowNorms.xlsx | 0.786 (N=4,401, rank-order) | closeout 2026-05-03; replaces Brysbaert Word Prevalence 2019. |
| aoa | build_aoa.py | Glasgow Norms AoA column (primary) + Kuperman 2012 (cross-construct sanity), _oracles/{GlasgowNorms,kuperman_aoa}.xlsx | 0.898 (N=4,399 Glasgow) / 0.829 (N=17,572 Kuperman-Glasgow-unseen) | closeout 2026-05-12. See research/2026-05-11-phon-115-aoa-pilot/report.md. Retires Kuperman + Glasgow as pipeline sources; both kept as eval oracles. Also retires the orphaned imageability + size columns. |
| boi | build_boi.py | Pexman et al. (2019), _oracles/boi_pexman2019.xlsx | 0.820 (N=7,940) | closeout 2026-05-02. |
| iconicity | build_iconicity.py | Winter et al. (2024), _oracles/iconicity_winter2024.csv | 0.564 (N=13,062 full-vocab; 0.594 N=500 held-out) | closeout 2026-05-02. Single-oracle ρ in Winter's own inter-rater band; reproduces 4/4 published convergent-validity patterns (POS, AoA, concreteness, reduplication). See research/2026-04-30-llm-word-features/iconicity_convergent_validity.md. |
| socialness | build_socialness.py | Diveica, Pexman & Binney (2021), _oracles/SocialnessNorms_DiveicaPexmanBinney2021.csv | 0.820 (N=7,850 full overlap; pilot 0.865 N=200) | This audit 2026-05-12. See research/2026-05-12-socialness-pilot/full_build_report.md. Replaces Diveica as pipeline source; Diveica kept as eval oracle. |
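The "Full-vocab Spearman" figures are rank correlations computed on the words shared by the LLM output and the oracle. A dependency-free sketch of that computation is below; the function names and the toy ratings are illustrative assumptions, and the actual build scripts may use a statistics library instead.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the average ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def overlap_spearman(llm, oracle):
    """Return (rho, N) over the words present in both rating dicts."""
    words = sorted(set(llm) & set(oracle))
    return spearman([llm[w] for w in words], [oracle[w] for w in words]), len(words)


# Made-up ratings for illustration; "zebra" has no oracle entry and is skipped.
llm_ratings = {"apple": 4.8, "truth": 1.6, "chair": 4.9, "idea": 1.9, "river": 4.5, "zebra": 4.7}
oracle_ratings = {"apple": 5.0, "truth": 1.9, "chair": 4.6, "idea": 1.6, "river": 4.9}
rho, n_overlap = overlap_spearman(llm_ratings, oracle_ratings)
```

Reporting N alongside ρ, as the table does, matters because the overlap with each oracle differs widely in size.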

E. Embedding-derived (Semantic Diversity)

Built via Qwen3-Embedding-4B (Apache 2.0) + UMAP + HDBSCAN clustering over FineWeb-Edu chunks. Per-word topic-distribution and context-covariance statistics.

| Column | Method | Validation |
| --- | --- | --- |
| semd_topic | Entropy of P(topic \| word) over HDBSCAN clusters | Primary SemD metric; full ELP behavioral battery validated. |
| semd_vn | Von Neumann entropy of context-vector covariance | Information-geometric variant. |
| semd_h13 | -log(mean off-diagonal cosine), Hoffman 2013 recipe applied to PhonoLex chunks | Direct correspondence with Hoffman 2013 published values; Spearman 0.74 vs Hoffman 30K oracle. |
| n_topics_for_word | Count of distinct non-noise HDBSCAN topics the word participates in | Topic-richness metric. |
| semantic_diversity | Alias of semd_topic (backward-compat) | Same. |
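Two of the metrics above reduce to short formulas, sketched here in plain Python. The assumptions are labeled: the entropy uses log base 2 (the manifest does not state the base), noise chunks carry the label -1 (the usual HDBSCAN convention), semd_h13 uses the natural log, and all function names are hypothetical. semd_vn needs an eigendecomposition and is left out.

```python
import math
from collections import Counter


def topic_entropy(topic_labels, noise_label=-1):
    """Return (entropy in bits of P(topic | word), number of non-noise topics).

    topic_labels holds the cluster label of each chunk containing the word.
    """
    counts = Counter(t for t in topic_labels if t != noise_label)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy, len(counts)


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semd_h13(context_vectors):
    """-log of the mean pairwise (off-diagonal) cosine between contexts.

    Requires at least two vectors and a positive mean cosine.
    """
    n = len(context_vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_cos = sum(cosine(context_vectors[i], context_vectors[j]) for i, j in pairs) / len(pairs)
    return -math.log(mean_cos)


# Toy inputs (illustrative only): four clustered chunks plus one noise chunk.
entropy_bits, n_topics = topic_entropy([0, 0, 1, 1, -1])
h13 = semd_h13([(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

A word spread evenly over two topics scores 1 bit; contexts that are less similar to each other lower the mean cosine and therefore raise semd_h13.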

Phonological pattern fields

| Column | Method | Source |
| --- | --- | --- |
| phonemes, phonemes_str, syllables, ipa, initial_phoneme, final_phoneme | Direct from CMU + ARPAbet→IPA conversion in phonolex_data.phonology | CMU Pronouncing Dictionary v0.7b (modified BSD). |
| root, is_monomorphemic, variants | In-house morphology analyzer output | Aronoff & Fudeman affix tables + algorithmic decomposer. |

Edges (association graph, data/runtime/edges.parquet)

| Edge field | Method | Source / authority |
| --- | --- | --- |
| qwensim | Pairwise cosine over Qwen3-Embedding-4B word vectors | PhonoLex in-house. Spearman 0.73 vs SimLex-999, 0.85 vs MEN-3000 on overlap. Replaces SimLex/MEN as pipeline edges. |
| eccc_consistency, eccc_n_instances, eccc_phoneme_distance | Edinburgh Children's Corpus Confusions (Marxer et al. 2016) | CC BY 4.0, kept as direct redistribution. |
| wordsim_relatedness | WordSim-353 (Finkelstein et al. 2002) | CC BY 4.0; surfaced in LookupTool. |

Retired columns (kept for transparency)

These columns were shipped in earlier PhonoLex versions but have been retired:

| Column | Retired | Reason | Replacement |
| --- | --- | --- | --- |
| aoa_kuperman | 2026-05-12 | License audit (Kuperman 2012 no posted license) | aoa via gpt-4.1-mini cloze (Spearman 0.898 vs Glasgow oracle). |
| imageability | 2026-05-12 | Orphan post-Glasgow-relocation (no clinical consumer surfaced; failed feedback_ship_big_norms_only.md test) | |
| size | 2026-05-12 | Same: Glasgow-sourced, no consumer | |
| 11 Lancaster sensorimotor channels (auditory, visual, haptic, gustatory, olfactory, interoceptive, hand_arm, foot_leg, head, mouth, torso) | 2026-05-12 (data audit) | LLM cloze-rating methodology failed the 0.70 Spearman gate on every Lynott channel (best haptic 0.683, worst head 0.226). See research/2026-05-12-sensorimotor-pilot/report.md. Lynott 2020 is CC BY 4.0 (clean), but no Custom Word Lists / Lookup / Text Analyzer consumer surfaced the channels in v5.2 with a named clinical workflow. Per feedback_ship_big_norms_only.md retired. | None (Lynott 2020 retained as eval oracle at _oracles/ for future revisit). |
| vocab_memberships | 2026-05-12 (data audit) | The AVL/Ogden/Roget/Swadesh set-membership concept had no clinical consumer surfaced in v5.2. | None (individual vocab loaders remain available for ad-hoc research via phonolex_data.loaders.{load_ogden,load_afinn,…}). |
| dominance | 2026-05-02 | LLM-derived signal r=0.41 vs Warriner (below 0.50 convergent-validity target); V-A-D third axis conflates agency + authority. | None (PhonoLex uses 2-axis valence-arousal, Russell circumplex). |
| prevalence (Brysbaert 2019) | 2026-05-03 | Replaced by AI familiarity proxy | familiarity via Glasgow FAM oracle. |
| simlex_similarity, men_relatedness (edges) | 2026-05-02 | Replaced by Qwensim | qwensim edge field. |
| morpheme_count (MorphyNet source) | 2026-04-30 | MorphyNet CC BY-SA 3.0 (share-alike incompatible with proprietary) | In-house algorithmic decomposer. |
| frequency (SUBTLEX-US source) | 2026-05-03 | SUBTLEX-US no posted license | FineWeb-Edu corpus. |
| elp_lexical_decision_rt | 2026-04-30 | ELP no posted license; behavioral RT cannot be AI-estimated | Tracked in (in-house Mechanical Turk / Prolific behavioral collection, deferred). |

Open / parked

  • Frequency UX (parked 2026-05-12): the ~8-10 surfaced frequency columns currently leave clinicians without a clear "which frequency do I use for age X?" guide. See memory/project_frequency_columns_audit_parked.md for the deployment / UX revisit scoped for v5.3+.
  • Lancaster Sensorimotor revisit: Lynott 2020 remains at _oracles/. Future approaches that might pass the gate include (a) prompt iteration with embodied-experience anchor examples, (b) small-model fine-tuning on Lynott data, or (c) accepting Lynott CC BY 4.0 as a third-party ship if a clinical consumer materializes. v5.3+ scope.
  • Behavioral norms (USF Free Association / SPP / ELP): in-house Mechanical Turk / Prolific collection track. Deferred.