Data Licensing Remediation Checklist

Tracking ticket: PHON-71
Branch: feature/phon-71-data-license-remediation
Started: 2026-04-30
Goal state: All third-party data items GREEN, or at minimum every RED resolved and YELLOW count reduced to ≤ 5.

Status legend

Status	Meaning	Action
🔴 RED	Incompatible with proprietary commercial use; ships in production today	Remove or replace
🟡 YELLOW	License unposted, unverified, or carries compliance burden	Verify, replace, or formally clear
🟢 GREEN	Verified clean for proprietary commercial redistribution	No action
⚪ N/A	Not shipped in product DB / not in scope	Verify absence and document
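
The goal state from the header can be checked mechanically once each item carries one of the statuses above. A minimal sketch, assuming statuses are kept as plain strings (`RED`, `YELLOW`, `GREEN`, `NA`); the function name and the data shape are illustrative and not part of the pipeline:

```python
from collections import Counter

YELLOW_LIMIT = 5  # from the goal state: YELLOW count reduced to <= 5


def goal_state_met(statuses):
    """Return True when no item is RED and at most YELLOW_LIMIT are YELLOW.

    `statuses` maps item name -> one of "RED", "YELLOW", "GREEN", "NA".
    This encodes the "at minimum" goal; an all-GREEN list satisfies it too.
    """
    counts = Counter(statuses.values())
    return counts["RED"] == 0 and counts["YELLOW"] <= YELLOW_LIMIT


# Hypothetical snapshot using item names from this checklist.
example = {
    "SUBTLEX-US frequency": "GREEN",
    "MorphyNet": "GREEN",
    "AVL": "YELLOW",
    "Onix/LEARN stopwords": "YELLOW",
    "Hillenbrand vowels": "NA",
}
print(goal_state_met(example))
```

Items marked N/A are ignored by the check, which matches the legend's treatment of them as out of scope.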

RED — must resolve

Item	License posture	Where shipped	Target	Status
SUBTLEX-US frequency (Brysbaert & New 2009)	No posted license; default copyright. NOTICE's "per author permission" claim is unverified.	`/api/words/:word` → `frequency`, `log_frequency`, `contextual_diversity` per word; `phoneme_rates.json` in generation runtime	Replace with FineWeb-Edu-derived corpus (PHON-72)	🟢 Done 2026-05-03. PHON-72 corpus (`data/norms/phonolex_frequency.tsv`, 800M tokens, 760K word types, ODC-BY 1.0) is integrated via `load_phonolex_frequency` in `pipeline/words.py`; ships `frequency`, `log_frequency`, `contextual_diversity` in D1 `word_properties`.
SUBTLEX-US POS (Brysbaert/New/Keuleers 2012)	Explicit CC BY-NC-SA 4.0 (OSF distribution)	Not yet shipped (PHON-70 was halted before integration)	Don't integrate; replace via PHON-72's spaCy POS pass	🟢 Done 2026-05-03. PHON-72's spaCy `en_core_web_trf` UPOS tags integrated. Pipeline ships `pos`, `pos_alt`, `pos_dominant_freq`, `all_pos`, `all_freqs` in D1 `word_properties` / `words`.
MorphyNet (Batsuren et al. 2021)	CC BY-SA 3.0 (Wiktionary-derived; share-alike incompatible with proprietary)	`morpheme_count`, `n_prefixes`, `n_suffixes`, `is_monomorphemic` shipped via `word_properties`	Replace with in-house algorithmic decomposer (Hayes 2009 / Aronoff & Fudeman affix tables, no third-party data)	🟢 Done. `packages/data/src/phonolex_data/morphology/` provides `analyze()` + `compute_for_lexicon()`. Pipeline now uses our analyzer. CMU coverage: 100% of 125,756 words. SLP-relevant accuracy validated by 32 unit tests. Quality-upgrade path: license CELEX2 once funded.
NGSL (Browne, Culligan, Phillips 2013-2023)	CC BY-SA 4.0 (verified 2026-04-30; project-homepage "least restrictive CC" copy is misleading — actual license badge on the wordlists is BY-SA)	`vocab_memberships` included `gsl_new` (NEW.json)	Stripped. Replacement tracked in PHON-74 (curated in-house wordlists from FineWeb-Edu freq + AI features)	🟢 stripped
GSL West 1953	Still copyrighted (US PD 2049, UK PD 2043; West died 1973)	`vocab_memberships` included `gsl_original` (ORIGINAL.json)	Stripped. Replacement tracked in PHON-74	🟢 stripped
Hillenbrand vowels (re-evaluated)	No formal license; used only as one training input to the Bayesian inference in `packages/features/` (alongside Hayes 2009 prior + ECCC CC BY 4.0). The learned 40×26 phoneme feature matrix is not per-row redistribution: 28 of 40 segments (all consonants) have zero Hillenbrand contribution, and each vowel cell is a posterior across multiple data sources + regularization. No output cell is traceable to any single Hillenbrand measurement.	This is ML-trained transformation, not direct redistribution. Same legal posture as ML model weights generally.	⚪ Re-classified N/A. The original PHOIBLE→learned-vectors architecture choice was specifically to escape CC BY-SA contagion and works. See feedback memory `feedback_distinguish_direct_vs_trained`.

YELLOW — verify, replace, or clear

Brysbaert YELLOWs — superseded by AI-estimate path (see PHON-73)

CLLD re-source was investigated and rejected: NoRaRe is keyed to Concepticon's ~3K-concept inventory, so re-sourcing from CLLD drops coverage from ~14K-60K → ~2.3-2.7K per dataset (90%+ data loss). Script preserved at research/2026-04-30-clld-resource/ as a documented dead-end.

Replaced by AI-estimate path (PHON-73): Martínez/Brysbaert et al. (2024-2025) demonstrate that LLM-derived word feature estimates correlate r = .74-.95 with human ratings and outperform them on downstream prediction tasks. We generate our own estimates with an open-weight LLM on RunPod, vocabulary scope = full CMU dict, license = ours.

Item	Replacement	Status
Brysbaert Concreteness 2014	PhonoLex Concreteness (PHON-73) — gpt-4.1-mini cloze-prompt	🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.878 vs Brysbaert oracle (N=24,109 overlap). Brysbaert TXT moved to `data/norms/_oracles/concreteness_brysbaert2014.txt`. Pipeline ships `concreteness` repointed to PhonoLex value via `load_phonolex_concreteness`.
Warriner Valence 2013	PhonoLex Valence (PHON-73) — gpt-4.1-mini cloze-prompt	🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.853 vs Warriner oracle (N=12,537 overlap). Warriner CSV moved to `data/norms/_oracles/Ratings_VAD_WarrinerEtAl.csv`. Pipeline ships `valence` repointed via `load_phonolex_valence`.
Warriner Arousal 2013	PhonoLex Arousal (PHON-73) — gpt-4.1-mini cloze-prompt	🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.626 vs Warriner oracle (N=12,537 overlap; ceiling-bound construct, comfortably above 0.50 convergent-validity target — rank-order signal is what's load-bearing). Same Warriner CSV oracle as valence. Pipeline ships `arousal` repointed via `load_phonolex_arousal`.
Warriner Dominance 2013	DROPPED (no replacement)	🟢 Done 2026-05-02 (commits 2629b23 + 7eb3bd8). LLM-derived signal vs Warriner (r=0.41) underperformed 0.50 convergent-validity target; V-A-D third axis conflates agency + authority. PhonoLex uses 2-axis valence-arousal (Russell circumplex).
Kuperman AoA 2012	PhonoLex AoA (PHON-115) — gpt-4.1-mini cloze-prompt with logprob expected-value	🟢 Done 2026-05-12. Production build 47,724 words; full-vocab Glasgow Spearman 0.898 / Pearson 0.892 (N=4,399 overlap); Kuperman cross-construct Spearman 0.829 / Pearson 0.825 on N=17,572 Glasgow-unseen rows. Kuperman XLSX + Glasgow XLSX moved to `data/norms/_oracles/`. Pipeline ships `aoa` repointed via `load_phonolex_aoa`. `imageability` + `size` columns also retired (orphaned post-Glasgow-relocation, no consumer). Validation report at `research/2026-05-11-phon-115-aoa-pilot/report.md`.
Brysbaert Word Prevalence 2019	PhonoLex Familiarity (PHON-73 AI familiarity proxy) — gpt-4.1-mini cloze-prompt	🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.786 (rank-order) vs Glasgow FAM on N=4,401 overlap. Brysbaert prevalence directory moved to `data/norms/_oracles/prevalence_brysbaert2019/`. Pipeline ships `familiarity` repointed via `load_phonolex_familiarity` — loads AFTER Glasgow so the Glasgow FAM column for the ~5K Glasgow words is overwritten by the PhonoLex value.
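
The validation figures in the table above are Spearman correlations computed on the words shared by the PhonoLex estimates and an oracle dataset. A minimal sketch of that comparison, assuming both sides are available as `word -> score` dictionaries; the function names and the toy scores are illustrative, and the actual validation harness is not shown in this document:

```python
from math import sqrt


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman_on_overlap(estimates, oracle):
    """Spearman correlation over words present in both dicts.

    Returns (rho, n_overlap).
    """
    words = sorted(estimates.keys() & oracle.keys())
    xs = [estimates[w] for w in words]
    ys = [oracle[w] for w in words]
    return pearson(average_ranks(xs), average_ranks(ys)), len(words)


# Toy scores, invented for illustration only.
estimates = {"apple": 4.8, "idea": 1.5, "chair": 4.6, "justice": 1.9, "zyx": 3.0}
oracle = {"apple": 5.0, "idea": 1.6, "chair": 4.6, "justice": 1.5, "dog": 4.9}
rho, n = spearman_on_overlap(estimates, oracle)
print(rho, n)
```

Restricting to the overlap is what the "N=... overlap" figures in the table refer to: words rated by only one side do not enter the correlation.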

Author-contact path — collapsed via PHON-73 (LLM) + PHON-75 (in-house behavioral)

The original "11 author-contact" list collapsed substantially after we identified that 8 of 11 have AI/algorithmic replacement paths (PHON-73 for LLM-estimable features; PHON-74 for vocab lists). The remaining 3 are behavioral measurements that LLMs or embeddings cannot substitute for: they are cognitive measurements requiring real human data.

Item	Replacement track	Status
USF Free Association	PHON-75 (in-house Mechanical Turk / Prolific behavioral collection)	🟢 stripped 2026-04-30
SPP semantic priming	PHON-75	🟢 stripped 2026-04-30
ELP lexical decision RT	PHON-75	🟢 stripped 2026-04-30
SimLex-999 / MEN	PHON-81 — PhonoLex Qwensim (pairwise cosine over PHON-76 4B word vectors)	🟢 Done 2026-05-02. Validation: Spearman 0.73 vs SimLex (n=999), 0.85 vs MEN (n=2988). Source tables moved to `data/norms/_oracles/`; pipeline now ships `qwensim` (replaces simlex_similarity + men_relatedness). PROPN-filtered (54% of CMU∩freq is proper nouns; excluded from edge graph).
WordSim-353	(already shipping; CC BY 4.0 verified)	🟢 GREEN
BOI	PHON-82 — LLM-rating estimate (PHON-73 harness)	🟢 Done 2026-05-02. 47,724 non-PROPN content words rated via gpt-4.1-mini (~$5, 1.6h, 0 fails). Validation: Spearman 0.820 / Pearson 0.804 vs Pexman oracle on 7,940-word overlap. Pexman xlsx moved to `data/norms/_oracles/`. Pipeline ships `boi` field repointed to PhonoLex value.
Iconicity	PHON-83 — LLM-rating with form (IPA) + WordNet-gloss inputs	🟢 Done 2026-05-02. 47,724 non-PROPN content words rated via gpt-4.1-mini (~$5, 1.6h, 0 fails). Held-out validation Spearman 0.594 vs Winter on N=500; full-vocab Spearman 0.564 on 13,062-word overlap. Single-oracle correlation sits in Winter's own inter-rater band; the column reproduces 4/4 published convergent-validity patterns (POS, AoA, concreteness, reduplication) at magnitudes meeting or exceeding Winter on the same rows — see `research/2026-04-30-llm-word-features/iconicity_convergent_validity.md`. Winter csv moved to `data/norms/_oracles/`. Pipeline ships `iconicity` field repointed to PhonoLex value.
CYP-LEX	OSF API verified CC BY 4.0	🟢 GREEN
Hoffman SemDiv 2013	PHON-76 — Qwen3-Embedding-4B / FineWeb-Edu pipeline	🟢 Done 2026-05-02. Production v1 (0.6B) + v2 (4B) builds completed; v2 ships with 4 SemD metrics (semd_topic / semd_vn / semd_h13 / n_topics_for_word). Spearman 0.74 vs Hoffman 30K oracle; beats Hoffman on full ELP+behavioral battery. Hoffman csv moved to `data/norms/_oracles/`.
CYP-LEX	Children-corpus subset of FineWeb-Edu, or PHON-73 child-familiarity	🟡 kept in v1 (also note: now complemented by PHON-86 PhonBank + PHON-87 CHILDES age-graded freq tables shipping via PHON-88)
AVL (academic vocab)	PHON-74 Phono-Academic list	🟡 kept in v1
Onix/LEARN stopwords	PHON-74 Phono-Stopwords list	🟡 kept in v1
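
The `qwensim` entry in the table above is described as pairwise cosine over word vectors with proper nouns filtered out. A minimal sketch of that computation, assuming vectors are plain lists of floats; the function names are illustrative, and the storage format of the PHON-76 vectors and the way proper nouns are identified are not shown in this document:

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_similarity(vectors, excluded=frozenset()):
    """Score every unordered word pair, skipping excluded words.

    `excluded` stands in for the PROPN filter mentioned in the table.
    """
    words = sorted(w for w in vectors if w not in excluded)
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for i, a in enumerate(words)
        for b in words[i + 1:]
    }


# Toy two-dimensional vectors, invented for illustration.
toy = {"cat": [1.0, 0.0], "kitten": [2.0, 0.0], "london": [0.0, 1.0]}
print(pairwise_similarity(toy, excluded={"london"}))
```

Scoring all pairs is quadratic in vocabulary size, so a production build over tens of thousands of words would likely need batching or a nearest-neighbor index; that choice is outside this sketch.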

Misc (NOTICE wrong but effect minimal — fix in NOTICE rewrite)

Item	NOTICE claim	Reality	Status
ipa-dict	CC0	MIT (and sub-datasets vary — verify only English/American is used)	☐
Swadesh list	"PD (linguistic concept)"	The 100/200-word selection is a creative work; copyright applies	☐
NLTK English stopwords	Apache 2.0	No license at all (NOTICE conflated library license with corpus)	☐
FOX stopwords (Fox 1989)	(no claim)	ACM SIGIR Forum publication; ACM holds rights	☐
GPT-2 (test fixture only)	MIT	Modified MIT (clause about generated content)	☐
Ogden Basic English	"pre-1930 PD"	Published 1930; entered US PD 2026-01-01 only. UK still copyright until 2028 (Ogden died 1957)	☐
ECCC URL	datashare.ed.ac.uk/handle/10283/2791	That URL is a different corpus; correct ECCC distribution is spandh.dcs.shef.ac.uk/ECCC/	☐
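
The Ogden dates in the table above follow from two term rules, both running to the end of a calendar year. A minimal sketch of the arithmetic, assuming a 95-years-from-publication US term for a work of this era and a life-plus-70 UK term; whether those rules apply to any other item on this list is a question for the lawyer review, and the function names are illustrative:

```python
US_PUBLICATION_TERM = 95  # assumed: years from publication, for works of this era
UK_LIFE_PLUS_TERM = 70    # assumed: years after the author's death


def us_public_domain_year(publication_year):
    """First calendar year in which the work is public domain in the US.

    The term runs through December 31 of its final year, so the work
    enters the public domain on January 1 of the following year.
    """
    return publication_year + US_PUBLICATION_TERM + 1


def uk_public_domain_year(author_death_year):
    """First calendar year in which the work is public domain in the UK."""
    return author_death_year + UK_LIFE_PLUS_TERM + 1


# Ogden's Basic English: published 1930, Ogden died 1957.
print(us_public_domain_year(1930))  # 2026
print(uk_public_domain_year(1957))  # 2028
```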

NEW in-house data builds (added during remediation campaign)

These are new datasets built in-house during the PHON-71 remediation work; not on the original RED/YELLOW audit list because they didn't exist yet. Recorded here for completeness of the audit closeout.

Item	License posture	Where shipped	Status
PHON-85 FineWeb-Edu grade-banded freq (5 reading-grade quantile bins via F-K + tier overlay)	Derivative of FineWeb-Edu (ODC-BY 1.0); only per-word per-band statistics redistributed. Same posture as the existing PhonoLex frequency column.	D1 `word_properties` (15 cols, all unsurfaced).	🟢 Done 2026-05-03 (commit bfa00a3 — data shipped; PHON-88 — pipeline integration). The `freq_age_*` headline aggregation was removed 2026-07-12 (PHON-161); the 15 raw grade-band cols remain (now in `word_freq_bands`).
PHON-86 PhonBank preschool age-graded freq (4 corpora: Davis, Penney, Providence, StanfordEnglish; 5 bands × 2 channels)	Per user 2026-05-03: "use as training/eval scaffolding without redistributing." Derived per-word per-band statistics; raw .cha utterances NOT redistributed. Per-corpus author contact deferred to commercial-deploy time.	D1 `word_properties` (30 cols, all unsurfaced).	⚫ REMOVED 2026-07-12 (PHON-161) — all PhonBank-derived columns dropped from pipeline + D1; no TalkBank statistics redistributed. (Originally shipped 2026-05-03, commit 5abc98b.)
PHON-87 CHILDES Eng-NA + Eng-UK age-graded freq, MOR-tagged (50 corpora; 8 bands × 2 channels; MOR-lemma tokenization)	Same posture as PHON-86 (TalkBank consortium derived statistics).	D1 `word_properties` (48 cols, all unsurfaced).	⚫ REMOVED 2026-07-12 (PHON-161) — all CHILDES-derived columns dropped from pipeline + D1; no TalkBank statistics redistributed. (Originally shipped 2026-05-03, commit 6b1cf2b.)
PHON-88 (this ticket) — pipeline + D1 + NOTICE integration of the above three	n/a (integration work)	All-of-above. New `surfaced` flag on PropertyDef gates 93 raw band cols out of `/api/property-metadata` while keeping them in D1 / `/api/words/:word` round-trip. NOTICE block named all 54 TalkBank corpora with DOIs.	🟢 Done 2026-05-03. TalkBank portions removed 2026-07-12 (PHON-161); NOTICE block reduced to the FineWeb-Edu grade bands.
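
The `surfaced` flag in the PHON-88 row keeps raw band columns in storage and in the per-word response while hiding them from the metadata listing. A minimal sketch of that gating, assuming `PropertyDef` is a simple record; every field other than `surfaced`, both helper functions, and the `band_1_count` column name are hypothetical, since the real definitions are not shown in this document:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyDef:
    """Hypothetical stand-in for the pipeline's PropertyDef."""
    name: str
    label: str
    surfaced: bool = True  # False keeps the column stored but hides its metadata


def property_metadata(defs):
    """Metadata listing: surfaced properties only."""
    return [{"name": d.name, "label": d.label} for d in defs if d.surfaced]


def word_payload(row, defs):
    """Per-word payload: every defined column, surfaced or not."""
    return {d.name: row.get(d.name) for d in defs}


DEFS = [
    PropertyDef("frequency", "Frequency"),
    PropertyDef("band_1_count", "Grade band 1 count", surfaced=False),  # invented name
]
row = {"frequency": 12.5, "band_1_count": 3}
print(property_metadata(DEFS))
print(word_payload(row, DEFS))
```

Defaulting `surfaced` to `True` means existing property definitions would not need to change when the flag is introduced; only the raw band columns opt out.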

GREEN — verified clean (no action)

Item	License	Confidence
CMU Pronouncing Dictionary v0.7b	Modified BSD ("completely unrestricted")	High
Roget's Thesaurus 1911 (PG #22)	US Public Domain	High
Ogden's Basic English (1930)	US PD as of 2026-01-01	Medium (UK still under copyright until 2028)
AFINN sentiment	Apache 2.0	High
spaCy English stopwords	MIT (inherits library)	High
en_core_web_sm	MIT	High
g2p-en	Apache 2.0	High
GPT-2 (test fixture)	Modified MIT	High
Glasgow Norms (Scott et al. 2019)	CC BY 4.0	High
Lancaster Sensorimotor Norms (Lynott et al. 2020)	CC BY 4.0	High (retired from shipped schema 2026-05-12; retained as eval oracle at `data/norms/_oracles/`)
Socialness norms (Diveica et al. 2023)	CC BY 4.0	High (retired from shipped schema 2026-05-12 in favor of PhonoLex Socialness via gpt-4.1-mini; full-vocab Spearman 0.820 vs Diveica oracle on N=7,850 overlap. Source CSV moved to `data/norms/_oracles/`.)
WordSim-353	CC BY 4.0	High
ECCC v1.2 (Marxer et al. 2016)	CC BY 4.0	High

Independent compliance items

Item	Status
Gemma TOS propagation — phonolex.com Terms of Service must include the Gemma Prohibited Use Policy (per Gemma TOS Hosted Services clause).	☑ Done 2026-05-01 — `packages/web/frontend/src/pages/TermsOfService.tsx` has dedicated "AI-Generated Content (Governed Generation)" section: names T5Gemma 9B-2B + 2-4b-4b, links Gemma TOS + Prohibited Use Policy, adds downstream-propagation clause, adds best-effort-not-guarantee disclaimer. Pending deploy.
NOTICE rewrite — full re-pass with verified license texts, drop wrong claims, add Gemma compliance section, fix ECCC URL.	☑ Done 2026-05-01 — `NOTICE` rewritten end-to-end. Removed RED items (SUBTLEX, MorphyNet, GSL/NGSL, USF/SPP/ELP) entirely; added FineWeb-Edu (ODC-BY 1.0) + PHON-73 AI features; corrected ipa-dict (CC0→MIT), Ogden timing, NLTK stopwords (Apache→none), Swadesh, ECCC URL, GPT-2 (Modified MIT); re-classified Hillenbrand as ML-trained-transformation; added "validation oracles only" section + Gemma TOS propagation reminder.
Hillenbrand absence audit — confirm zero pipeline ship	☐
PHOIBLE remnant audit — confirm zero pipeline ship	✅ Confirmed N/A. Features CSV (`packages/features/data/phonolex_features_ipa.csv`) is hand-built from Hayes 2009 textbook with per-segment citations; explicit "This file is an original encoding by Just Semantics". Only PHOIBLE references remaining are in `CitationDialog.tsx` (attribution / historical) and `packages/features/README.md` (comparison-against-PHOIBLE narrative). Drop those mentions in NOTICE rewrite if not strictly needed.

Execution order

1. PHON-72 — build FineWeb-Edu frequency + POS replacement (RunPod GPU). Closes 🔴 SUBTLEX-US frequency + SUBTLEX-US POS in one strike, plus delivers PHON-70's intent.
2. Strip 🔴 MorphyNet + 🔴 GSL + 🔴 NGSL from pipeline.
3. Re-source 🟡 4 Brysbaert datasets from CLLD CC BY 4.0.
4. Audit 🔴 Hillenbrand + PHOIBLE absence (likely confirms ⚪ N/A).
5. NOTICE rewrite (waits until items 1-4 are resolved so the rewrite reflects final reality).
6. 🟡 Author-contact campaign (parallel; 30-day deadline; remove or replace if silent).
7. phonolex.com TOS update for Gemma.
8. Final lawyer review before merge to develop.

Audit provenance

Source: three license-research forks dispatched 2026-04-30, each verifying actual posted license at distribution point (not NOTICE's claim). Findings cross-referenced with ship-state in packages/web/workers/scripts/export-to-d1.py and packages/generation/scripts/build_runtime_data.py.
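
One way to repeat the ship-state cross-check is to scan the export scripts for identifiers of fields this checklist records as stripped or replaced. A minimal sketch, assuming a plain text scan is sufficient; the function name is illustrative, the field list is drawn from the tables above, and the sample text is invented rather than taken from the real scripts:

```python
import re

# Field names this checklist records as stripped or replaced.
STRIPPED_FIELDS = ("gsl_new", "gsl_original", "simlex_similarity", "men_relatedness")


def find_stripped_references(source_text, fields=STRIPPED_FIELDS):
    """Return {field: [line numbers]} for whole-word matches in source_text."""
    hits = {}
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for field in fields:
            # \b treats "_" as a word character, so "gsl_new_v2" does not match.
            if re.search(rf"\b{re.escape(field)}\b", line):
                hits.setdefault(field, []).append(lineno)
    return hits


# Invented sample standing in for the contents of an export script.
sample = 'COLUMNS = ["frequency", "qwensim"]\nLEGACY = ["gsl_new"]\n'
print(find_stripped_references(sample))
```

To run it against the repository, read each script named above with `pathlib.Path(...).read_text()` and pass the text in; an empty result means no stripped identifier is still referenced by name.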