Data Licensing Remediation Checklist

Tracking ticket: PHON-71
Branch: feature/phon-71-data-license-remediation
Started: 2026-04-30
Goal state: All third-party data items GREEN, or at minimum every RED resolved and YELLOW count reduced to ≤ 5.


Status legend

| Status | Meaning | Action |
| --- | --- | --- |
| 🔴 RED | Incompatible with proprietary commercial use; ships in production today | Remove or replace |
| 🟡 YELLOW | License unposted, unverified, or carries compliance burden | Verify, replace, or formally clear |
| 🟢 GREEN | Verified clean for proprietary commercial redistribution | No action |
| ⚪ N/A | Not shipped in product DB / not in scope | Verify absence and document |
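The goal state in the header reduces to a simple count over item statuses. The following is a minimal sketch of that check, assuming statuses are tracked as plain strings; the function name and the example snapshot are hypothetical and not part of the pipeline.

```python
from collections import Counter

# Status labels taken from the legend above.
RED, YELLOW, GREEN, NA = "RED", "YELLOW", "GREEN", "N/A"


def goal_state_met(statuses, max_yellow=5):
    """Return True when no RED remains and at most `max_yellow` YELLOW remain.

    "All GREEN" is the special case with zero RED and zero YELLOW, so one
    check covers both halves of the goal. N/A items are treated as neutral,
    which is an assumption of this sketch.
    """
    counts = Counter(statuses)
    return counts[RED] == 0 and counts[YELLOW] <= max_yellow


# Hypothetical snapshot of a checklist, for illustration only.
snapshot = [GREEN, GREEN, YELLOW, YELLOW, YELLOW, NA]
print(goal_state_met(snapshot))  # True: no RED, three YELLOW
```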

RED — must resolve

| Item | License posture | Where shipped | Target | Status |
| --- | --- | --- | --- | --- |
| SUBTLEX-US frequency (Brysbaert & New 2009) | No posted license; default copyright. NOTICE's "per author permission" claim is unverified. | /api/words/:word (frequency, log_frequency, contextual_diversity per word); phoneme_rates.json in generation runtime | Replace with FineWeb-Edu-derived corpus (PHON-72) | 🟢 Done 2026-05-03. PHON-72 corpus (data/norms/phonolex_frequency.tsv, 800M tokens, 760K word types, ODC-BY 1.0) is integrated via load_phonolex_frequency in pipeline/words.py; ships frequency, log_frequency, contextual_diversity in D1 word_properties. |
| SUBTLEX-US POS (Brysbaert/New/Keuleers 2012) | Explicit CC BY-NC-SA 4.0 (OSF distribution) | Not yet shipped (PHON-70 was halted before integration) | Don't integrate; replace via PHON-72's spaCy POS pass | 🟢 Done 2026-05-03. PHON-72's spaCy en_core_web_trf UPOS tags integrated. Pipeline ships pos, pos_alt, pos_dominant_freq, all_pos, all_freqs in D1 word_properties / words. |
| MorphyNet (Batsuren et al. 2021) | CC BY-SA 3.0 (Wiktionary-derived; share-alike incompatible with proprietary) | morpheme_count, n_prefixes, n_suffixes, is_monomorphemic shipped via word_properties | Replace with in-house algorithmic decomposer (Hayes 2009 / Aronoff & Fudeman affix tables, no third-party data) | 🟢 Done. packages/data/src/phonolex_data/morphology/ provides analyze() + compute_for_lexicon(). Pipeline now uses our analyzer. CMU coverage: 100% of 125,756 words. SLP-relevant accuracy validated by 32 unit tests. Quality-upgrade path: license CELEX2 once funded. |
| NGSL (Browne, Culligan, Phillips 2013-2023) | CC BY-SA 4.0 (verified 2026-04-30; project-homepage "least restrictive CC" copy is misleading; the actual license badge on the wordlists is BY-SA) | vocab_memberships included gsl_new (NEW.json) | Stripped. Replacement tracked in PHON-74 (curated in-house wordlists from FineWeb-Edu freq + AI features) | 🟢 stripped |
| GSL West 1953 | Still copyrighted (US PD 2049, UK PD 2044; West died 1973) | vocab_memberships included gsl_original (ORIGINAL.json) | Stripped. Replacement tracked in PHON-74 | 🟢 stripped |
| Hillenbrand vowels (re-evaluated) | No formal license; used only as one training input to the Bayesian inference in packages/features/ (alongside Hayes 2009 prior + ECCC CC BY 4.0). | The learned 40×26 phoneme feature matrix is not per-row redistribution: 28 of 40 segments (all consonants) have zero Hillenbrand contribution, and each vowel cell is a posterior across multiple data sources + regularization. No output cell is traceable to any single Hillenbrand measurement. | This is ML-trained transformation, not direct redistribution. Same legal posture as ML model weights generally. | ⚪ Re-classified N/A. The original PHOIBLE→learned-vectors architecture choice was specifically to escape CC BY-SA contagion and works. See feedback memory feedback_distinguish_direct_vs_trained. |

YELLOW — verify, replace, or clear

Brysbaert YELLOWs — superseded by AI-estimate path (see PHON-73)

CLLD re-source was investigated and rejected: NoRaRe is keyed to Concepticon's ~3K-concept inventory, so re-sourcing from CLLD drops coverage from ~14K-60K → ~2.3-2.7K per dataset (roughly 80-96% data loss, depending on the dataset). Script preserved at research/2026-04-30-clld-resource/ as a documented dead-end.

Replaced by AI-estimate path (PHON-73): Martínez/Brysbaert et al. (2024-2025) demonstrate that LLM-derived word feature estimates correlate r = .74-.95 with human ratings and outperform them on downstream prediction tasks. We generate our own estimates with an open-weight LLM on RunPod, vocabulary scope = full CMU dict, license = ours.
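The AoA row in the table below mentions a "logprob expected-value" readout. One common reading of that phrase is to take the model's log-probabilities over the candidate rating tokens, renormalize them, and report the probability-weighted mean instead of the single most likely token. The sketch below illustrates that idea only; the 1-7 scale, the function name, and the input shape (a token-to-logprob mapping) are assumptions, not the PHON-73 harness.

```python
import math


def expected_rating(token_logprobs, scale=range(1, 8)):
    """Probability-weighted mean rating from per-token log-probabilities.

    token_logprobs: mapping of candidate token -> log-probability.
    scale: the integer ratings considered valid (hypothetical 1-7 scale).
    Tokens that are not on the scale are ignored, and the remaining
    probability mass is renormalized to sum to 1.
    """
    valid = {str(k): k for k in scale}
    pairs = [
        (valid[token.strip()], logprob)
        for token, logprob in token_logprobs.items()
        if token.strip() in valid
    ]
    if not pairs:
        raise ValueError("no rating tokens among the candidates")
    # Subtract the largest log-probability before exponentiating, for stability.
    peak = max(logprob for _, logprob in pairs)
    weights = [(rating, math.exp(logprob - peak)) for rating, logprob in pairs]
    total = sum(weight for _, weight in weights)
    return sum(rating * weight for rating, weight in weights) / total


# Hypothetical candidates: mass split evenly between "3" and "4", plus noise.
candidates = {"3": math.log(0.25), "4": math.log(0.25), "the": math.log(0.5)}
print(expected_rating(candidates))
```

Compared with taking the top token, this yields a continuous score, which matters when the downstream metric is a rank-order correlation.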

| Item | Replacement | Status |
| --- | --- | --- |
| Brysbaert Concreteness 2014 | PhonoLex Concreteness (PHON-73), gpt-4.1-mini cloze-prompt | 🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.878 vs Brysbaert oracle (N=24,109 overlap). Brysbaert TXT moved to data/norms/_oracles/concreteness_brysbaert2014.txt. Pipeline ships concreteness repointed to PhonoLex value via load_phonolex_concreteness. |
| Warriner Valence 2013 | PhonoLex Valence (PHON-73), gpt-4.1-mini cloze-prompt | 🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.853 vs Warriner oracle (N=12,537 overlap). Warriner CSV moved to data/norms/_oracles/Ratings_VAD_WarrinerEtAl.csv. Pipeline ships valence repointed via load_phonolex_valence. |
| Warriner Arousal 2013 | PhonoLex Arousal (PHON-73), gpt-4.1-mini cloze-prompt | 🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.626 vs Warriner oracle (N=12,537 overlap; ceiling-bound construct, comfortably above the 0.50 convergent-validity target; rank-order signal is what's load-bearing). Same Warriner CSV oracle as valence. Pipeline ships arousal repointed via load_phonolex_arousal. |
| Warriner Dominance 2013 | DROPPED (no replacement) | 🟢 Done 2026-05-02 (commits 2629b23 + 7eb3bd8). LLM-derived signal vs Warriner (r=0.41) fell short of the 0.50 convergent-validity target; the third V-A-D axis conflates agency + authority. PhonoLex uses 2-axis valence-arousal (Russell circumplex). |
| Kuperman AoA 2012 | PhonoLex AoA (PHON-115), gpt-4.1-mini cloze-prompt with logprob expected-value | 🟢 Done 2026-05-12. Production build 47,724 words; full-vocab Glasgow Spearman 0.898 / Pearson 0.892 (N=4,399 overlap); Kuperman cross-construct Spearman 0.829 / Pearson 0.825 on N=17,572 Glasgow-unseen rows. Kuperman XLSX + Glasgow XLSX moved to data/norms/_oracles/. Pipeline ships aoa repointed via load_phonolex_aoa. imageability + size columns also retired (orphaned post-Glasgow-relocation, no consumer). Validation report at research/2026-05-11-phon-115-aoa-pilot/report.md. |
| Brysbaert Word Prevalence 2019 | PhonoLex Familiarity (PHON-73 AI familiarity proxy), gpt-4.1-mini cloze-prompt | 🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.786 (rank-order) vs Glasgow FAM on N=4,401 overlap. Brysbaert prevalence directory moved to data/norms/_oracles/prevalence_brysbaert2019/. Pipeline ships familiarity repointed via load_phonolex_familiarity, which loads AFTER Glasgow so the Glasgow FAM column for the ~5K Glasgow words is overwritten by the PhonoLex value. |
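Every validation figure in this table is a rank correlation computed on the words that the PhonoLex estimate and the oracle have in common. The sketch below shows one way to compute that, using only the standard library; the function names and the toy dictionaries are hypothetical, and the actual validation scripts may well use a statistics library instead.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_on_overlap(estimates, oracle):
    """Spearman rho over words present in both mappings, plus the overlap N."""
    shared = sorted(estimates.keys() & oracle.keys())
    if len(shared) < 2:
        raise ValueError("need at least two shared words")
    xs = [estimates[word] for word in shared]
    ys = [oracle[word] for word in shared]
    return pearson(average_ranks(xs), average_ranks(ys)), len(shared)


# Toy data for illustration only; "zebra" and "idea" fall outside the overlap.
estimates = {"dog": 4.8, "truth": 1.9, "river": 4.5, "zebra": 4.9}
oracle = {"dog": 4.9, "truth": 1.5, "river": 4.6, "idea": 1.6}
rho, n = spearman_on_overlap(estimates, oracle)
print(rho, n)
```

Under the 0.50 convergent-validity target quoted above, arousal (0.626) passes and dominance (0.41) does not, which matches the decision to drop the dominance column.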

Author-contact path — collapsed via PHON-73 (LLM) + PHON-75 (in-house behavioral)

The original "11 author-contact" list collapsed substantially after we identified that 8 of 11 have AI/algorithmic replacement paths (PHON-73 for LLM-estimable features; PHON-74 for vocab lists). The remaining 3 are behavioral measurements that cannot be substituted by LLM/embeddings: they are cognitive measurements requiring real human data.

| Item | Replacement track | Status |
| --- | --- | --- |
| USF Free Association | PHON-75 (in-house Mechanical Turk / Prolific behavioral collection) | 🟢 stripped 2026-04-30 |
| SPP semantic priming | PHON-75 | 🟢 stripped 2026-04-30 |
| ELP lexical decision RT | PHON-75 | 🟢 stripped 2026-04-30 |
| SimLex-999 / MEN | PHON-81: PhonoLex Qwensim (pairwise cosine over PHON-76 4B word vectors) | 🟢 Done 2026-05-02. Validation: Spearman 0.73 vs SimLex (n=999), 0.85 vs MEN (n=2988). Source tables moved to data/norms/_oracles/; pipeline now ships qwensim (replaces simlex_similarity + men_relatedness). PROPN-filtered (54% of CMU∩freq is proper nouns; excluded from edge graph). |
| WordSim-353 | (already shipping; CC BY 4.0 verified) | 🟢 GREEN |
| BOI | PHON-82: LLM-rating estimate (PHON-73 harness) | 🟢 Done 2026-05-02. 47,724 non-PROPN content words rated via gpt-4.1-mini (~$5, 1.6h, 0 fails). Validation: Spearman 0.820 / Pearson 0.804 vs Pexman oracle on 7,940-word overlap. Pexman xlsx moved to data/norms/_oracles/. Pipeline ships boi field repointed to PhonoLex value. |
| Iconicity | PHON-83: LLM-rating with form (IPA) + WordNet-gloss inputs | 🟢 Done 2026-05-02. 47,724 non-PROPN content words rated via gpt-4.1-mini (~$5, 1.6h, 0 fails). Held-out validation Spearman 0.594 vs Winter on N=500; full-vocab Spearman 0.564 on 13,062-word overlap. Single-oracle correlation sits in Winter's own inter-rater band; the column reproduces 4/4 published convergent-validity patterns (POS, AoA, concreteness, reduplication) at magnitudes meeting or exceeding Winter on the same rows; see research/2026-04-30-llm-word-features/iconicity_convergent_validity.md. Winter csv moved to data/norms/_oracles/. Pipeline ships iconicity field repointed to PhonoLex value. |
| CYP-LEX | OSF API verified CC BY 4.0 | 🟢 GREEN |
| Hoffman SemDiv 2013 | PHON-76: Qwen3-Embedding-4B / FineWeb-Edu pipeline | 🟢 Done 2026-05-02. Production v1 (0.6B) + v2 (4B) builds completed; v2 ships with 4 SemD metrics (semd_topic / semd_vn / semd_h13 / n_topics_for_word). Spearman 0.74 vs Hoffman 30K oracle; beats Hoffman on full ELP+behavioral battery. Hoffman csv moved to data/norms/_oracles/. |
| CYP-LEX | Children-corpus subset of FineWeb-Edu, or PHON-73 child-familiarity | 🟡 kept in v1 (also note: now complemented by PHON-86 PhonBank + PHON-87 CHILDES age-graded freq tables shipping via PHON-88) |
| AVL (academic vocab) | PHON-74 Phono-Academic list | 🟡 kept in v1 |
| Onix/LEARN stopwords | PHON-74 Phono-Stopwords list | 🟡 kept in v1 |
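The Qwensim entry describes pairwise cosine similarity over word vectors with proper nouns excluded from the edge graph. A minimal sketch of that computation follows, assuming vectors are plain lists of floats and POS tags use the UPOS label PROPN; the function names and toy vectors are hypothetical, and a production build over tens of thousands of words would use a vectorized library instead of a Python loop.

```python
import itertools
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_edges(vectors, pos_tags):
    """Pairwise cosine for every unordered pair of non-PROPN words."""
    words = sorted(word for word in vectors if pos_tags.get(word) != "PROPN")
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a, b in itertools.combinations(words, 2)
    }


# Toy 2-dimensional vectors for illustration only.
vectors = {"cat": [1.0, 0.0], "dog": [1.0, 0.0], "run": [0.0, 1.0], "Paris": [0.0, 1.0]}
pos_tags = {"cat": "NOUN", "dog": "NOUN", "run": "VERB", "Paris": "PROPN"}
edges = similarity_edges(vectors, pos_tags)
print(sorted(edges))
```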

Misc (NOTICE wrong but effect minimal — fix in NOTICE rewrite)

| Item | NOTICE claim | Reality | Status |
| --- | --- | --- | --- |
| ipa-dict | CC0 | MIT (and sub-datasets vary; verify only English/American is used) | |
| Swadesh list | "PD (linguistic concept)" | The 100/200-word selection is a creative work; copyright applies | |
| NLTK English stopwords | Apache 2.0 | No license at all (NOTICE conflated library license with corpus) | |
| FOX stopwords (Fox 1989) | (no claim) | ACM SIGIR Forum publication; ACM holds rights | |
| GPT-2 (test fixture only) | MIT | Modified MIT (clause about generated content) | |
| Ogden Basic English | "pre-1930 PD" | Published 1930; entered US PD 2026-01-01 only. UK still copyright until 2028 (Ogden died 1957) | |
| ECCC URL | datashare.ed.ac.uk/handle/10283/2791 | That URL is a different corpus; correct ECCC distribution is spandh.dcs.shef.ac.uk/ECCC/ | |
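The Ogden dates above follow from two simplified term rules: a US term of 95 years from publication and a UK term of the author's life plus 70 years, each running to the end of the calendar year. The sketch below encodes only those two rules as an arithmetic check; it assumes the work qualifies for the full term in each jurisdiction, and the function names are hypothetical.

```python
def us_public_domain_year(publication_year, term_years=95):
    """First calendar year in the US public domain under a 95-year term.

    The term runs through 31 December of publication_year + term_years,
    so the work enters the public domain on 1 January of the next year.
    """
    return publication_year + term_years + 1


def uk_public_domain_year(author_death_year, term_years=70):
    """First calendar year in the UK public domain under life + 70 years."""
    return author_death_year + term_years + 1


# Ogden's Basic English: published 1930, author died 1957.
print(us_public_domain_year(1930))  # 2026
print(uk_public_domain_year(1957))  # 2028
```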

NEW in-house data builds (added during remediation campaign)

These are new datasets built in-house during the PHON-71 remediation work; not on the original RED/YELLOW audit list because they didn't exist yet. Recorded here for completeness of the audit closeout.

| Item | License posture | Where shipped | Status |
| --- | --- | --- | --- |
| PHON-85 FineWeb-Edu grade-banded freq (5 reading-grade quantile bins via F-K + tier overlay) | Derivative of FineWeb-Edu (ODC-BY 1.0); only per-word per-band statistics redistributed. Same posture as the existing PhonoLex frequency column. | D1 word_properties (15 cols, all unsurfaced). Aggregated into freq_age_8y + freq_age_12y headlines (surfaced filterable). | 🟢 Done 2026-05-03 (commit bfa00a3: data shipped; PHON-88: pipeline integration). |
| PHON-86 PhonBank preschool age-graded freq (4 corpora: Davis, Penney, Providence, StanfordEnglish; 5 bands × 2 channels) | Per user 2026-05-03: "use as training/eval scaffolding without redistributing." Derived per-word per-band statistics; raw .cha utterances NOT redistributed. Per-corpus author contact deferred to commercial-deploy time. | D1 word_properties (30 cols, all unsurfaced). Aggregated into freq_age_2y + freq_age_5y headlines (surfaced filterable). | 🟢 Done 2026-05-03 (commit 5abc98b: data shipped; PHON-88: pipeline integration). |
| PHON-87 CHILDES Eng-NA + Eng-UK age-graded freq, MOR-tagged (50 corpora; 8 bands × 2 channels; MOR-lemma tokenization) | Same posture as PHON-86 (TalkBank consortium derived statistics). | D1 word_properties (48 cols, all unsurfaced). Aggregated into all 4 freq_age_* headlines (surfaced filterable). | 🟢 Done 2026-05-03 (commit 6b1cf2b: data shipped; PHON-88: pipeline integration). PHON-87 headline finding: input wpm at 24-36mo Spearman vs −Glasgow AoA = +0.751, the strongest in-vocabulary AoA proxy in PhonoLex. |
| PHON-88 (this ticket): pipeline + D1 + NOTICE integration of the above three | n/a (integration work) | All-of-above. New surfaced flag on PropertyDef gates 93 raw band cols out of /api/property-metadata while keeping them in D1 / /api/words/:word round-trip. NOTICE block names all 54 TalkBank corpora (52 with a DOI; 2 with no DOI in the TalkBank manifest as of 2026-05-03). | 🟢 Done 2026-05-03. |
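The PHON-88 row describes a surfaced flag that hides raw band columns from the metadata endpoint while still returning them in per-word responses. The sketch below illustrates that gating pattern; only the names PropertyDef and surfaced come from the row, while the fields, column names, and helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyDef:
    name: str
    surfaced: bool = True  # False keeps the column stored but unlisted


# Two headline columns plus one raw band column (the raw name is made up).
PROPERTIES = [
    PropertyDef("freq_age_2y"),
    PropertyDef("freq_age_5y"),
    PropertyDef("raw_band_example", surfaced=False),
]


def property_metadata(defs):
    """Names a metadata endpoint would list: surfaced properties only."""
    return [d.name for d in defs if d.surfaced]


def word_payload(row, defs):
    """Per-word payload: every defined property, surfaced or not."""
    return {d.name: row.get(d.name) for d in defs}


row = {"freq_age_2y": 12.5, "freq_age_5y": 30.1, "raw_band_example": 3.0}
print(property_metadata(PROPERTIES))
print(word_payload(row, PROPERTIES))
```

Keeping the flag on the property definition means the filter UI and the storage schema can diverge without maintaining two separate column lists.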

GREEN — verified clean (no action)

| Item | License | Confidence |
| --- | --- | --- |
| CMU Pronouncing Dictionary v0.7b | Modified BSD ("completely unrestricted") | High |
| Roget's Thesaurus 1911 (PG #22) | US Public Domain | High |
| Ogden's Basic English (1930) | US PD as of 2026-01-01 | Medium (UK still under copyright until 2028) |
| AFINN sentiment | Apache 2.0 | High |
| spaCy English stopwords | MIT (inherits library) | High |
| en_core_web_sm | MIT | High |
| g2p-en | Apache 2.0 | High |
| GPT-2 (test fixture) | Modified MIT | High |
| Glasgow Norms (Scott et al. 2019) | CC BY 4.0 | High |
| Lancaster Sensorimotor Norms (Lynott et al. 2020) | CC BY 4.0 | High (retired from shipped schema 2026-05-12; retained as eval oracle at data/norms/_oracles/) |
| Socialness norms (Diveica et al. 2023) | CC BY 4.0 | High (retired from shipped schema 2026-05-12 in favor of PhonoLex Socialness via gpt-4.1-mini; full-vocab Spearman 0.820 vs Diveica oracle on N=7,850 overlap. Source CSV moved to data/norms/_oracles/.) |
| WordSim-353 | CC BY 4.0 | High |
| ECCC v1.2 (Marxer et al. 2016) | CC BY 4.0 | High |

Independent compliance items

| Item | Status |
| --- | --- |
| Gemma TOS propagation: phonolex.com Terms of Service must include the Gemma Prohibited Use Policy (per Gemma TOS Hosted Services clause). | ☑ Done 2026-05-01. packages/web/frontend/src/pages/TermsOfService.tsx has dedicated "AI-Generated Content (Governed Generation)" section: names T5Gemma 9B-2B + 2-4b-4b, links Gemma TOS + Prohibited Use Policy, adds downstream-propagation clause, adds best-effort-not-guarantee disclaimer. Pending deploy. |
| NOTICE rewrite: full re-pass with verified license texts, drop wrong claims, add Gemma compliance section, fix ECCC URL. | ☑ Done 2026-05-01. NOTICE rewritten end-to-end. Removed RED items (SUBTLEX, MorphyNet, GSL/NGSL, USF/SPP/ELP) entirely; added FineWeb-Edu (ODC-BY 1.0) + PHON-73 AI features; corrected ipa-dict (CC0→MIT), Ogden timing, NLTK stopwords (Apache→none), Swadesh, ECCC URL, GPT-2 (Modified MIT); re-classified Hillenbrand as ML-trained-transformation; added "validation oracles only" section + Gemma TOS propagation reminder. |
| Hillenbrand absence audit: confirm zero pipeline ship | |
| PHOIBLE remnant audit: confirm zero pipeline ship | ✅ Confirmed N/A. Features CSV (packages/features/data/phonolex_features_ipa.csv) is hand-built from Hayes 2009 textbook with per-segment citations; explicit "This file is an original encoding by Neumann's Workshop LLC". Only PHOIBLE references remaining are in CitationDialog.tsx (attribution / historical) and packages/features/README.md (comparison-against-PHOIBLE narrative). Drop those mentions in NOTICE rewrite if not strictly needed. |

Execution order

  1. PHON-72 — build FineWeb-Edu frequency + POS replacement (RunPod GPU). Closes 🔴 SUBTLEX-US frequency + SUBTLEX-US POS in one strike, plus delivers PHON-70's intent.
  2. Strip 🔴 MorphyNet + 🔴 GSL + 🔴 NGSL from pipeline.
  3. Re-source 🟡 4 Brysbaert datasets from CLLD CC BY 4.0.
  4. Audit 🔴 Hillenbrand + PHOIBLE absence (likely confirms ⚪ N/A).
  5. NOTICE rewrite (waits until items 1-4 are resolved so the rewrite reflects final reality).
  6. 🟡 Author-contact campaign (parallel; 30-day deadline; remove or replace if silent).
  7. phonolex.com TOS update for Gemma.
  8. Final lawyer review before merge to develop.

Audit provenance

Source: three license-research forks dispatched 2026-04-30, each verifying actual posted license at distribution point (not NOTICE's claim). Findings cross-referenced with ship-state in packages/web/workers/scripts/export-to-d1.py and packages/generation/scripts/build_runtime_data.py.
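The audit method described here, checking the license posted at the distribution point against what NOTICE claims, amounts to a diff between two mappings. The sketch below shows that comparison using three entries from the Misc table; the function name, the dictionary layout, and the "example-corpus" entry are hypothetical.

```python
def find_mismatches(notice_claims, verified):
    """List (item, claimed, found) wherever NOTICE disagrees with the audit.

    None stands for "no license posted" on the verified side and for
    "NOTICE makes no claim" on the claimed side.
    """
    mismatches = []
    for item, found in sorted(verified.items()):
        claimed = notice_claims.get(item)
        if claimed != found:
            mismatches.append((item, claimed, found))
    return mismatches


notice_claims = {
    "ipa-dict": "CC0",
    "NLTK English stopwords": "Apache 2.0",
    "GPT-2": "MIT",
    "example-corpus": "CC BY 4.0",  # made-up entry that matches
}
verified = {
    "ipa-dict": "MIT",
    "NLTK English stopwords": None,
    "GPT-2": "Modified MIT",
    "example-corpus": "CC BY 4.0",
}
for item, claimed, found in find_mismatches(notice_claims, verified):
    print(f"{item}: NOTICE says {claimed!r}, distribution point says {found!r}")
```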