PhonoLex Data Derivation Manifest

Last updated: 2026-07-12 (PHON-161: TalkBank developmental frequencies removed)

Scope: Every per-word column shipped in data/runtime/words.parquet and routed to the D1 words / word_properties / word_percentiles / word_freq_bands tables.

Why this exists: PhonoLex's clinical/research credibility rests on traceable derivation — for every norm we ship, a clinician must be able to point at this document and find (a) the method by which the column was produced, (b) the validation result that gates inclusion, and (c) the methodology citation that supports the inference. The audit pattern: if the column has no row here, it should not ship.

Anchor methodology citations (for LLM-derived columns):

- Martínez, G., Conde, J., Reviriego, P., & Brysbaert, M. (2025). Using Large Language Models to Generate Psycholinguistic Norms. Behavior Research Methods (in press).
- Brysbaert, M. (2024). Validating LLM-derived psycholinguistic ratings for word stimuli. Memory & Cognition.

The methodology is logprob expected-value extraction over a 1-N rating scale. The model is given a cloze-rating prompt with anchor examples, the next-token logprob distribution is read, the probabilities for the rating tokens (1..N) are normalized to sum to 1, and the expected value E[r] = Σ r·p(r) is taken as the fine-grained rating. Validated across family + + this audit to correlate with human behavioral norms at Spearman 0.56-0.90 on abstract-semantic axes (concreteness, valence, familiarity, BOI, iconicity, AoA, socialness). The methodology does NOT transfer to embodied-perception axes (see research/2026-05-12-sensorimotor-pilot/ for the Lancaster NO-GO finding), so run a small pilot as a methodology pre-check before applying it to a new axis.
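
The expected-value step can be sketched as follows. This is a minimal illustration, assuming the rating tokens are the bare digit strings "1" to "N" and that the model response has already been reduced to a token-to-logprob mapping. The function name `expected_rating` and the example logprobs are hypothetical and are not taken from the build scripts.

```python
import math


def expected_rating(top_logprobs, scale_max):
    """Return E[r] = sum(r * p(r)) over the rating tokens "1".."scale_max".

    top_logprobs maps candidate next tokens to log-probabilities.
    Tokens that are not a rating on the scale are ignored, and the
    remaining probability mass is renormalized to sum to 1.
    """
    valid = {str(r): r for r in range(1, scale_max + 1)}
    mass = {}
    for token, logprob in top_logprobs.items():
        rating = valid.get(token.strip())
        if rating is not None:
            mass[rating] = mass.get(rating, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        return None  # no rating token among the candidates
    return sum(r * p for r, p in mass.items()) / total


# Hypothetical next-token distribution for one word on a 1-5 scale.
example = {
    "4": math.log(0.60),
    "5": math.log(0.30),
    "3": math.log(0.05),
    "The": math.log(0.05),  # non-rating token, dropped before normalizing
}
rating = expected_rating(example, scale_max=5)
```

Renormalizing over the rating tokens only means that stray probability mass on non-rating tokens does not pull the estimate toward zero.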

Method buckets

Method	What	Anchor authority
A. Computed (algorithmic)	Derived from CMU pronunciations + linguistic theory; no third-party empirical data input.	Hayes (2009) phonology, Aronoff & Fudeman (2011) morphology.
B. Corpus-derived (FineWeb-Edu)	Token / lemma counts from FineWeb-Edu 1.27M docs / 1.9B tokens (ODC-BY 1.0). PhonoLex per-word statistics, not per-row redistribution.	HuggingFace (2024) FineWeb-Edu corpus card.
C. Corpus-derived (child/grade-banded)	FineWeb-Edu reading-grade band statistics + CYP-LEX child-corpus age-banded frequency. TalkBank corpora (CHILDES + PhonBank) removed 2026-07-12 (PHON-161).	HuggingFace (2024) FineWeb-Edu corpus card; Korochkina et al. (2024) CYP-LEX.
D. LLM-cloze (pattern)	gpt-4.1-mini logprob expected-value over a validated rating prompt. Per-word PhonoLex-owned LLM output, oracles kept at `data/norms/_oracles/` for validation only.	Martínez et al. (2025), Brysbaert (2024).
E. Embedding-derived	Qwen3-Embedding-4B representations + UMAP + HDBSCAN clustering, per-word entropy or covariance metrics over FineWeb-Edu chunks. PhonoLex aggregate statistics.	Hoffman et al. (2013) semantic diversity theory; Qwen3 model card.

Shipped columns

A. Computed (algorithmic)

Column	Method	Source / authority	Notes
`syllable_count`	CMU pronunciation → syllabification via maximal-onset rule	Hayes (2009) syllabification	Per-word integer.
`phoneme_count`	CMU pronunciation length	CMU Pronouncing Dictionary v0.7b (modified BSD)	Per-word integer.
`neighborhood_density`	Levenshtein-1 phonological neighbors over the CMU vocabulary	Standard psycholinguistic metric (Luce & Pisoni 1998)	Computed in pipeline.
`str_neighborhood_density`	Edit-1 stressed-phoneme neighbors	Same	Stress-sensitive variant.
`phono_prob_avg`, `positional_prob_avg`, `str_phono_prob_avg`, `str_positional_prob_avg`	Phonotactic probability over CMU + FineWeb-Edu frequency weights	Vitevitch & Luce (2004)	Pipeline-computed.
`wcm_score`	Word Complexity Measure	Stoel-Gammon (2010)	Pipeline-computed from CMU.
`morpheme_count`, `n_prefixes`, `n_suffixes`	In-house algorithmic morphology analyzer (Aronoff & Fudeman affix tables + Hayes 2009 prior)	PhonoLex `packages/data/src/phonolex_data/morphology/`	Replaces MorphyNet (CC BY-SA 3.0; share-alike is incompatible with proprietary distribution). 32 unit tests validate SLP-relevant cases.
`is_monomorphemic`	Derived from `morpheme_count == 1`	Same	Boolean.
`pos`, `pos_alt`, `pos_dominant_freq`, `all_pos`, `all_freqs`	spaCy `en_core_web_trf` UPOS tags + frequency distribution	spaCy (MIT) over FineWeb-Edu	Replaces SUBTLEX-US POS (CC BY-NC-SA 4.0).
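
As a rough illustration of the CMU-derived counts and the edit-1 neighborhood metric in the table above, the sketch below works on ARPAbet phoneme sequences. It assumes that `neighborhood_density` ignores stress digits while the `str_` variant keeps them, and it counts syllables by counting stress-marked vowels rather than running a full maximal-onset syllabifier. All function names and the toy lexicon are hypothetical.

```python
def strip_stress(phonemes):
    """Drop CMU stress digits: ('K', 'AE1', 'T') -> ('K', 'AE', 'T')."""
    return tuple(p.rstrip("012") for p in phonemes)


def phoneme_count(phonemes):
    return len(phonemes)


def syllable_count(phonemes):
    """CMU marks each vowel nucleus with a stress digit, so count those."""
    return sum(1 for p in phonemes if p[-1] in "012")


def is_edit_distance_one(a, b):
    """True if b is one substitution, insertion, or deletion away from a."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) sequence
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # single substitution at position i
    return a[i:] == b[i + 1:]  # single extra phoneme in b at position i


def neighborhood_density(word, lexicon, keep_stress=False):
    """Count lexicon entries whose pronunciation is one edit from the word's."""
    prep = tuple if keep_stress else strip_stress
    target = prep(lexicon[word])
    return sum(
        1
        for other, phones in lexicon.items()
        if other != word and is_edit_distance_one(target, prep(phones))
    )


# Hypothetical toy lexicon in CMU ARPAbet notation.
lexicon = {
    "cat": ["K", "AE1", "T"],
    "bat": ["B", "AE1", "T"],
    "cut": ["K", "AH1", "T"],
    "at": ["AE1", "T"],
    "cats": ["K", "AE1", "T", "S"],
    "dog": ["D", "AO1", "G"],
}
cat_density = neighborhood_density("cat", lexicon)
banana = ["B", "AH0", "N", "AE1", "N", "AH0"]
```

The pairwise scan is quadratic in vocabulary size, so a production pipeline would more plausibly index pronunciations by their deletion variants. That choice is outside what this manifest documents.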

B. Corpus-derived (FineWeb-Edu base)

Column	Method	Source / authority	Validation
`frequency`	Token frequency in FineWeb-Edu (~800M tokens, 760K word types)	FineWeb-Edu ODC-BY 1.0, spaCy POS pass	Replaces SUBTLEX-US (no posted license). Caveat: FineWeb-Edu has educational-register skew; see `memory/feedback_corpus_register_match.md`.
`log_frequency`	log10(frequency)	Same	Same caveat.
`contextual_diversity`	Document-count divided by frequency (CD index, Adelman et al. 2006)	Same	Same caveat.
`lemma_frequency`, `lemma_log_frequency`	en_core_web_sm lemma counts over the same FineWeb-Edu pass		Same caveat.
`lemma_frequency_b1..b5`	Lemma frequency within 5 reading-grade quantile bins		Educational-register caveat; reading-grade bands derived via Flesch-Kincaid quantiles over FineWeb-Edu docs.
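
The reading-grade banding can be pictured with the sketch below. It assumes each document already has word, sentence, and syllable counts, uses the standard Flesch-Kincaid grade formula, and assumes band 1 holds the lowest-grade documents. The function names, the quantile method (the `statistics.quantiles` default), and the toy corpus are illustrative assumptions, not the pipeline's actual implementation.

```python
import bisect
import statistics
from collections import Counter


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Standard Flesch-Kincaid grade-level formula."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def band_cutpoints(grades, n_bands=5):
    """Return the n_bands - 1 quantile cut points over document grades."""
    return statistics.quantiles(grades, n=n_bands)


def assign_band(grade, cutpoints):
    """Map a grade to a 1-based band index (assumed: 1 = lowest grades)."""
    return bisect.bisect_right(cutpoints, grade) + 1


def lemma_counts_by_band(docs, n_bands=5):
    """docs: list of (grade, lemmas) pairs. Returns {band: Counter of lemmas}."""
    cutpoints = band_cutpoints([grade for grade, _ in docs], n_bands)
    counts = {band: Counter() for band in range(1, n_bands + 1)}
    for grade, lemmas in docs:
        counts[assign_band(grade, cutpoints)].update(lemmas)
    return counts


# Hypothetical toy corpus: ten documents with grades 1.0 to 10.0.
docs = [(float(g), ["read"] if g < 6 else ["analyze"]) for g in range(1, 11)]
counts = lemma_counts_by_band(docs)
```

Quantile bins give each band roughly the same number of documents, which keeps per-band counts comparable even though the corpus skews toward an educational register.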

C. Corpus-derived (child/grade-banded frequency)

Column	Method	Source / authority	Validation
`freq_cyplex_7_9`, `freq_cyplex_10_12`, `freq_cyplex_13`	CYP-LEX child-corpus age-banded frequency	Korochkina et al. (2024) CYP-LEX, CC BY 4.0 (OSF)	Complementary age coverage to AoA.

Removed 2026-07-12 (PHON-161): the TalkBank-derived developmental frequencies — freq_age_2y/5y/8y/12y headlines, the freq_age_all alias, and the 78 CHILDES/PhonBank raw band columns (freq_pb_*, wpm_pb_*, log_freq_pb_*, freq_childes_*, wpm_childes_*, log_freq_childes_*) — no longer ship in any artifact. The FineWeb-Edu reading-grade bands (freq_b1..b5, wpm_b1..b5, log_freq_b1..b5; ODC-BY 1.0) remain in word_freq_bands. "Age-appropriate" filtering maps to AoA + CYP-LEX.

Percentile semantics for frequency-class properties: when computing percentiles for freq_-prefixed columns, value=0 is treated as NULL — a word that never occurred in the source corpus shouldn't cluster at a misleading mid-rank (which is where tied zeros used to land). Rating-scale norms (AoA, concreteness, valence, ...) keep zero as a valid data point. (With the freq_age_* headlines removed, no freq_-prefixed column currently carries a percentile; the rule remains in the pipeline for any future frequency-class percentile.)
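
The zero-as-NULL rule can be sketched as below. The percentile definition used here (mid-rank over the retained values) is an assumption chosen to match the "tied zeros land mid-rank" behavior described above, and the helper names are hypothetical.

```python
import bisect


def treats_zero_as_null(column):
    """Frequency-class columns are identified by the freq_ prefix."""
    return column.startswith("freq_")


def percentile_ranks(values, zero_is_null):
    """Return percentile ranks (0-100) aligned with values; None where missing.

    Ties share the mid-rank of their group. When zero_is_null is True,
    zeros are excluded from the ranking and receive None.
    """
    def usable(v):
        return v is not None and not (zero_is_null and v == 0)

    kept = sorted(v for v in values if usable(v))
    ranks = []
    for v in values:
        if not usable(v):
            ranks.append(None)
            continue
        below = bisect.bisect_left(kept, v)
        equal = bisect.bisect_right(kept, v) - below
        ranks.append(100.0 * (below + 0.5 * equal) / len(kept))
    return ranks


# Hypothetical column where three of five words never occur in the corpus.
raw = [0, 0, 0, 5, 10]
as_frequency = percentile_ranks(raw, zero_is_null=treats_zero_as_null("freq_example"))
as_rating = percentile_ranks(raw, zero_is_null=treats_zero_as_null("valence"))
```

With zeros kept, the three unattested words would all sit at the 30th percentile. With zeros treated as NULL they carry no percentile, and the attested words are ranked only against each other.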

D. LLM-cloze (family + + this audit)

All built via gpt-4.1-mini cloze-prompt with top_logprobs=20 over ~47K non-PROPN content words. Per-word PhonoLex-owned output; oracle CSVs/XLSXs moved to data/norms/_oracles/. Build scripts at research/2026-04-30-llm-word-features/build_*.py.

Column	Build script	Oracle	Full-vocab Spearman	Notes
`concreteness`	`build_concreteness.py`	Brysbaert et al. (2014), `_oracles/concreteness_brysbaert2014.txt`	0.878 (N=24,109 overlap)	closeout 2026-05-03.
`valence`	`build_valence.py`	Warriner et al. (2013), `_oracles/Ratings_VAD_WarrinerEtAl.csv`	0.853 (N=12,537)	closeout 2026-05-03.
`arousal`	`build_arousal.py`	Same Warriner CSV	0.626 (N=12,537; ceiling-bound construct)	closeout 2026-05-03. Ceiling effect on this axis — well above 0.50 convergent-validity floor.
`familiarity`	`build_familiarity.py`	Glasgow Norms FAM column, `_oracles/GlasgowNorms.xlsx`	0.786 (N=4,401, rank-order)	closeout 2026-05-03; replaces Brysbaert Word Prevalence 2019.
`aoa`	`build_aoa.py`	Glasgow Norms AoA column (primary) + Kuperman 2012 (cross-construct sanity), `_oracles/{GlasgowNorms,kuperman_aoa}.xlsx`	0.898 (N=4,399 Glasgow) / 0.829 (N=17,572 Kuperman-Glasgow-unseen)	closeout 2026-05-12. See `research/2026-05-11-phon-115-aoa-pilot/report.md`. Retires Kuperman + Glasgow as pipeline sources; both kept as eval oracles. Also retires the orphaned `imageability` + `size` columns.
`boi`	`build_boi.py`	Pexman et al. (2019), `_oracles/boi_pexman2019.xlsx`	0.820 (N=7,940)	closeout 2026-05-02.
`iconicity`	`build_iconicity.py`	Winter et al. (2024), `_oracles/iconicity_winter2024.csv`	0.564 (N=13,062 full-vocab; 0.594 N=500 held-out)	closeout 2026-05-02. Single-oracle ρ in Winter's own inter-rater band; reproduces 4/4 published convergent-validity patterns (POS, AoA, concreteness, reduplication). See `research/2026-04-30-llm-word-features/iconicity_convergent_validity.md`.
`socialness`	`build_socialness.py`	Diveica, Pexman & Binney (2021), `_oracles/SocialnessNorms_DiveicaPexmanBinney2021.csv`	0.820 (N=7,850 full overlap; pilot 0.865 N=200)	This audit 2026-05-12. See `research/2026-05-12-socialness-pilot/full_build_report.md`. Replaces Diveica as pipeline source; Diveica kept as eval oracle.
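
The "Full-vocab Spearman" figures above compare LLM ratings with an oracle on the words both cover. A minimal sketch of that comparison follows, assuming both sources are word-to-rating mappings and that ties receive average ranks. The function names and the gate threshold parameter are hypothetical; this manifest cites a 0.50 convergent-validity floor and a 0.70 gate in different places.

```python
import math


def average_ranks(xs):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # correlation undefined for a constant series
    return cov / math.sqrt(var_a * var_b)


def spearman_on_overlap(llm, oracle):
    """Return (rho, N) over the words present in both mappings."""
    words = sorted(set(llm) & set(oracle))
    x = average_ranks([llm[w] for w in words])
    y = average_ranks([oracle[w] for w in words])
    return pearson(x, y), len(words)


def passes_gate(rho, threshold):
    return rho >= threshold


# Hypothetical ratings; "x" and "y" fall outside the overlap.
llm = {"a": 1.2, "b": 2.5, "c": 3.1, "d": 4.8, "x": 2.0}
oracle = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 5.0, "y": 1.0}
rho, n_overlap = spearman_on_overlap(llm, oracle)
```

Reporting N alongside rho, as the table does, matters because the overlap differs by oracle and a high correlation on a small overlap is weaker evidence.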

E. Embedding-derived (Semantic Diversity)

Built via Qwen3-Embedding-4B (Apache 2.0) + UMAP + HDBSCAN clustering over FineWeb-Edu chunks. Per-word topic-distribution and context-covariance statistics.

Column	Method	Validation
`semd_topic`	Entropy of P(topic|word) over HDBSCAN clusters	Primary SemD metric; full ELP behavioral battery validated.
`semd_vn`	Von Neumann entropy of context-vector covariance	Information-geometric variant.
`semd_h13`	-log(mean off-diagonal cosine), Hoffman 2013 recipe applied to PhonoLex chunks	Direct correspondence with Hoffman 2013 published values; Spearman 0.74 vs Hoffman 30K oracle.
`n_topics_for_word`	Count of distinct non-noise HDBSCAN topics the word participates in	Topic-richness metric.
`semantic_diversity`	Alias of `semd_topic` (backward-compat)	Same.
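
Two of the metrics above reduce to short formulas once the cluster labels and context vectors for a word are in hand. The sketch below assumes HDBSCAN's conventional noise label of -1 is dropped before computing P(topic|word), and it uses natural logarithms. Both choices, along with the function names, are assumptions rather than documented pipeline behavior.

```python
import math
from collections import Counter


def topic_entropy(topic_labels, noise_label=-1):
    """Shannon entropy of P(topic|word) from per-chunk topic labels."""
    counts = Counter(t for t in topic_labels if t != noise_label)
    total = sum(counts.values())
    if total == 0:
        return None  # word only occurs in noise chunks
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def n_topics(topic_labels, noise_label=-1):
    """Number of distinct non-noise topics the word participates in."""
    return len({t for t in topic_labels if t != noise_label})


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semd_h13(context_vectors):
    """-log(mean off-diagonal cosine) over a word's context vectors."""
    n = len(context_vectors)
    sims = [
        cosine(context_vectors[i], context_vectors[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    if not sims:
        return None  # fewer than two contexts
    mean_sim = sum(sims) / len(sims)
    if mean_sim <= 0:
        return None  # log undefined for non-positive mean similarity
    return -math.log(mean_sim)


# Hypothetical labels: two chunks in topic 0, two in topic 1, one noise chunk.
labels = [0, 0, 1, 1, -1]
entropy = topic_entropy(labels)
diversity = semd_h13([[1.0, 0.0], [1.0, 1.0]])
```

A word whose contexts all point the same way has a mean cosine near 1 and a score near 0. More varied contexts lower the mean cosine and raise the score.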

Phonological pattern fields

Column	Method	Source
`phonemes`, `phonemes_str`, `syllables`, `ipa`, `initial_phoneme`, `final_phoneme`	Direct from CMU + ARPAbet→IPA conversion in `phonolex_data.phonology`	CMU Pronouncing Dictionary v0.7b (modified BSD).
`root`, `is_monomorphemic`, `variants`	In-house morphology analyzer output	Aronoff & Fudeman affix tables + algorithmic decomposer.

Edges (association graph, `data/runtime/edges.parquet`)

Edge field	Method	Source / authority
`qwensim`	Pairwise cosine over Qwen3-Embedding-4B word vectors	PhonoLex in-house. Spearman 0.73 vs SimLex-999, 0.85 vs MEN-3000 on overlap. Replaces SimLex/MEN as pipeline edges.
`eccc_consistency`, `eccc_n_instances`, `eccc_phoneme_distance`	English Consistent Confusion Corpus (ECCC; Marxer et al. 2016)	CC BY 4.0, kept as direct redistribution.
`wordsim_relatedness`	WordSim-353 (Finkelstein et al. 2002)	CC BY 4.0; surfaced in LookupTool.

Retired columns (kept for transparency)

These columns were shipped in earlier PhonoLex versions but have been retired:

Column	Retired	Reason	Replacement
`aoa_kuperman`	2026-05-12	License audit (Kuperman 2012 no posted license)	`aoa` via gpt-4.1-mini cloze (Spearman 0.898 vs Glasgow oracle).
`imageability`	2026-05-12	Orphan post-Glasgow-relocation (no clinical consumer surfaced; failed `feedback_ship_big_norms_only.md` test)	—
`size`	2026-05-12	Same — Glasgow-sourced, no consumer	—
11 Lancaster sensorimotor channels (`auditory`, `visual`, `haptic`, `gustatory`, `olfactory`, `interoceptive`, `hand_arm`, `foot_leg`, `head`, `mouth`, `torso`)	2026-05-12 (data audit)	LLM cloze-rating methodology failed the 0.70 Spearman gate on every Lynott channel (best: haptic 0.683; worst: head 0.226). See `research/2026-05-12-sensorimotor-pilot/report.md`. Lynott 2020 is CC BY 4.0 (clean), but no Custom Word Lists / Lookup / Text Analyzer consumer surfaced the channels in v5.2 with a named clinical workflow. Retired per `feedback_ship_big_norms_only.md`.	— (Lynott 2020 retained as eval oracle at `_oracles/` for future revisit.)
`vocab_memberships`	2026-05-12 (data audit)	The AVL/Ogden/Roget/Swadesh set-membership concept had no clinical consumer surfaced in v5.2.	— (Individual vocab loaders remain available for ad-hoc research via `phonolex_data.loaders.{load_ogden,load_afinn,…}`.)
`dominance`	2026-05-02	LLM-derived signal r=0.41 vs Warriner (below 0.50 convergent-validity target); V-A-D third axis conflates agency + authority.	— (PhonoLex uses 2-axis valence-arousal, Russell circumplex.)
`prevalence` (Brysbaert 2019)	2026-05-03	Replaced by AI familiarity proxy	`familiarity` via Glasgow FAM oracle.
`simlex_similarity`, `men_relatedness` (edges)	2026-05-02	Replaced by Qwensim	`qwensim` edge field.
`morpheme_count` (MorphyNet source)	2026-04-30	MorphyNet CC BY-SA 3.0 (share-alike incompatible with proprietary)	In-house algorithmic decomposer.
`frequency` (SUBTLEX-US source)	2026-05-03	SUBTLEX-US no posted license	FineWeb-Edu corpus.
`elp_lexical_decision_rt`	2026-04-30	ELP no posted license; behavioral RT cannot be AI-estimated	Tracked in (in-house Mechanical Turk / Prolific behavioral collection, deferred).

Open / parked

Frequency UX (parked 2026-05-12): the ~8-10 surfaced frequency columns currently leave clinicians without a clear "which frequency do I use for age X?" guide. See memory/project_frequency_columns_audit_parked.md for the deployment / UX revisit scoped for v5.3+.
Lancaster Sensorimotor revisit: Lynott 2020 remains at _oracles/. Future approaches that might pass the gate include (a) prompt iteration with embodied-experience anchor examples, (b) small-model fine-tuning on Lynott data, or (c) accepting Lynott CC BY 4.0 as a third-party ship if a clinical consumer materializes. v5.3+ scope.
Behavioral norms (USF Free Association / SPP / ELP): in-house Mechanical Turk / Prolific collection track. Deferred.