Tracking ticket: PHON-71
Branch: feature/phon-71-data-license-remediation
Started: 2026-04-30
Goal state: All third-party data items GREEN, or at minimum every RED resolved and YELLOW count reduced to ≤ 5.
Status legend
| Status | Meaning | Action |
|---|---|---|
| 🔴 RED | Incompatible with proprietary commercial use; ships in production today | Remove or replace |
| 🟡 YELLOW | License unposted, unverified, or carries compliance burden | Verify, replace, or formally clear |
| 🟢 GREEN | Verified clean for proprietary commercial redistribution | No action |
| ⚪ N/A | Not shipped in product DB / not in scope | Verify absence and document |
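The goal state above can be checked mechanically once each audited item carries a status label. The following is a minimal sketch, not code from the repository: the `goal_met` helper, the status strings, and the sample snapshot are all hypothetical, and treating N/A items as out of scope for the "all GREEN" test is an assumption.

```python
from collections import Counter

# Hypothetical status labels mirroring the legend.
RED, YELLOW, GREEN, NA = "RED", "YELLOW", "GREEN", "N/A"


def goal_met(statuses, max_yellow=5):
    """Return (strict, minimum) goal flags for a list of status labels.

    strict:  every in-scope item is GREEN (assumption: N/A items are ignored).
    minimum: no RED remains and at most `max_yellow` YELLOW items remain.
    """
    counts = Counter(statuses)
    in_scope = [s for s in statuses if s != NA]
    strict = all(s == GREEN for s in in_scope)
    minimum = counts[RED] == 0 and counts[YELLOW] <= max_yellow
    return strict, minimum


# Illustrative snapshot (sample data, not the real item list).
snapshot = [GREEN, GREEN, NA, YELLOW, YELLOW, YELLOW, GREEN]
strict, minimum = goal_met(snapshot)
print(strict, minimum)  # False True
```

A check like this could run in CI against whatever machine-readable form the audit eventually takes; the list-of-labels input is only the simplest shape to illustrate the two thresholds.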
RED — must resolve
| Item | License posture | Where shipped | Target | Status |
|---|---|---|---|---|
| SUBTLEX-US frequency (Brysbaert & New 2009) | No posted license; default copyright. NOTICE's "per author permission" claim is unverified. | /api/words/:word → frequency, log_frequency, contextual_diversity per word; phoneme_rates.json in generation runtime | Replace with FineWeb-Edu-derived corpus (PHON-72) | 🟢 Done 2026-05-03. PHON-72 corpus (data/norms/phonolex_frequency.tsv, 800M tokens, 760K word types, ODC-BY 1.0) is integrated via load_phonolex_frequency in pipeline/words.py; ships frequency, log_frequency, contextual_diversity in D1 word_properties. |
| SUBTLEX-US POS (Brysbaert/New/Keuleers 2012) | Explicit CC BY-NC-SA 4.0 (OSF distribution) | Not yet shipped (PHON-70 was halted before integration) | Don't integrate; replace via PHON-72's spaCy POS pass | 🟢 Done 2026-05-03. PHON-72's spaCy en_core_web_trf UPOS tags integrated. Pipeline ships pos, pos_alt, pos_dominant_freq, all_pos, all_freqs in D1 word_properties / words. |
| MorphyNet (Batsuren et al. 2021) | CC BY-SA 3.0 (Wiktionary-derived; share-alike incompatible with proprietary) | morpheme_count, n_prefixes, n_suffixes, is_monomorphemic shipped via word_properties | Replace with in-house algorithmic decomposer (Hayes 2009 / Aronoff & Fudeman affix tables, no third-party data) | 🟢 Done. packages/data/src/phonolex_data/morphology/ provides analyze() + compute_for_lexicon(). Pipeline now uses our analyzer. CMU coverage: 100% of 125,756 words. SLP-relevant accuracy validated by 32 unit tests. Quality-upgrade path: license CELEX2 once funded. |
| NGSL (Browne, Culligan, Phillips 2013-2023) | CC BY-SA 4.0 (verified 2026-04-30; project-homepage "least restrictive CC" copy is misleading; the actual license badge on the wordlists is BY-SA) | vocab_memberships included gsl_new (NEW.json) | Stripped. Replacement tracked in PHON-74 (curated in-house wordlists from FineWeb-Edu freq + AI features) | 🟢 stripped |
| GSL West 1953 | Still copyrighted (US PD 2049, UK PD 2044; West died 1973) | vocab_memberships included gsl_original (ORIGINAL.json) | Stripped. Replacement tracked in PHON-74 | 🟢 stripped |
| Hillenbrand vowels (re-evaluated) | No formal license; used only as one training input to the Bayesian inference in packages/features/ (alongside Hayes 2009 prior + ECCC CC BY 4.0). The learned 40×26 phoneme feature matrix is not per-row redistribution: 28 of 40 segments (all consonants) have zero Hillenbrand contribution, and each vowel cell is a posterior across multiple data sources + regularization. No output cell is traceable to any single Hillenbrand measurement. | This is ML-trained transformation, not direct redistribution. Same legal posture as ML model weights generally. | ⚪ Re-classified N/A. The original PHOIBLE→learned-vectors architecture choice was specifically to escape CC BY-SA contagion and works. See feedback memory feedback_distinguish_direct_vs_trained. | |
YELLOW — verify, replace, or clear
Brysbaert YELLOWs — superseded by AI-estimate path (see PHON-73)
CLLD re-source was investigated and rejected: NoRaRe is keyed to Concepticon's ~3K-concept inventory, so re-sourcing from CLLD drops coverage from ~14K-60K → ~2.3-2.7K per dataset (90%+ data loss). Script preserved at research/2026-04-30-clld-resource/ as a documented dead-end.
Replaced by AI-estimate path (PHON-73): Martínez/Brysbaert et al. (2024-2025) demonstrate that LLM-derived word feature estimates correlate r = .74-.95 with human ratings and outperform them on downstream prediction tasks. We generate our own estimates with an open-weight LLM on RunPod, vocabulary scope = full CMU dict, license = ours.
| Item | Replacement | Status |
|---|---|---|
| Brysbaert Concreteness 2014 | PhonoLex Concreteness (PHON-73), gpt-4.1-mini cloze-prompt | 🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.878 vs Brysbaert oracle (N=24,109 overlap). Brysbaert TXT moved to data/norms/_oracles/concreteness_brysbaert2014.txt. Pipeline ships concreteness repointed to PhonoLex value via load_phonolex_concreteness. |
| Warriner Valence 2013 | PhonoLex Valence (PHON-73), gpt-4.1-mini cloze-prompt | 🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.853 vs Warriner oracle (N=12,537 overlap). Warriner CSV moved to data/norms/_oracles/Ratings_VAD_WarrinerEtAl.csv. Pipeline ships valence repointed via load_phonolex_valence. |
| Warriner Arousal 2013 | PhonoLex Arousal (PHON-73), gpt-4.1-mini cloze-prompt | 🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.626 vs Warriner oracle (N=12,537 overlap; ceiling-bound construct, comfortably above 0.50 convergent-validity target; rank-order signal is what's load-bearing). Same Warriner CSV oracle as valence. Pipeline ships arousal repointed via load_phonolex_arousal. |
| Warriner Dominance 2013 | DROPPED (no replacement) | 🟢 Done 2026-05-02 (commits 2629b23 + 7eb3bd8). LLM-derived signal vs Warriner (r=0.41) underperformed 0.50 convergent-validity target; V-A-D third axis conflates agency + authority. PhonoLex uses 2-axis valence-arousal (Russell circumplex). |
| Kuperman AoA 2012 | PhonoLex AoA (PHON-115), gpt-4.1-mini cloze-prompt with logprob expected-value | 🟢 Done 2026-05-12. Production build 47,724 words; full-vocab Glasgow Spearman 0.898 / Pearson 0.892 (N=4,399 overlap); Kuperman cross-construct Spearman 0.829 / Pearson 0.825 on N=17,572 Glasgow-unseen rows. Kuperman XLSX + Glasgow XLSX moved to data/norms/_oracles/. Pipeline ships aoa repointed via load_phonolex_aoa. imageability + size columns also retired (orphaned post-Glasgow-relocation, no consumer). Validation report at research/2026-05-11-phon-115-aoa-pilot/report.md. |
| Brysbaert Word Prevalence 2019 | PhonoLex Familiarity (PHON-73 AI familiarity proxy), gpt-4.1-mini cloze-prompt | 🟢 Done 2026-05-03. Production build 47,724 words, full-vocab Spearman 0.786 (rank-order) vs Glasgow FAM on N=4,401 overlap. Brysbaert prevalence directory moved to data/norms/_oracles/prevalence_brysbaert2019/. Pipeline ships familiarity repointed via load_phonolex_familiarity, which loads AFTER Glasgow so the Glasgow FAM column for the ~5K Glasgow words is overwritten by the PhonoLex value. |
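Each row above reports a Spearman correlation between PhonoLex estimates and a human-rated oracle, computed on the words the two share. The sketch below shows that computation in plain Python under stated assumptions: the function names, the dictionary-per-dataset layout, and the toy ratings are hypothetical, and the real validation scripts may well use a statistics library instead. Only the 0.50 convergent-validity target comes from the table.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_on_overlap(estimates, oracle):
    """Spearman rho over the words present in both mappings, plus overlap size."""
    shared = sorted(set(estimates) & set(oracle))
    x = [estimates[w] for w in shared]
    y = [oracle[w] for w in shared]
    return pearson(average_ranks(x), average_ranks(y)), len(shared)


# Toy ratings (illustrative only, not real norm values).
estimates = {"apple": 4.8, "idea": 1.5, "chair": 4.6, "justice": 1.2, "river": 4.1, "zebra": 4.9}
oracle = {"apple": 5.0, "idea": 1.6, "chair": 4.6, "justice": 1.5, "river": 4.9}

rho, n = spearman_on_overlap(estimates, oracle)
print(round(rho, 3), n)  # 0.9 5
print(rho >= 0.50)       # True: meets the convergent-validity target
```

Words outside the overlap ("zebra" here) are simply ignored, which is why each row states its own N.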
The original "11 author-contact" list collapsed substantially after we identified that 9 items have AI/algorithmic replacement or clearance paths (PHON-73 for LLM-estimable features; PHON-74 for vocab lists). The remaining 3 (USF, SPP, ELP) are behavioral measurements that cannot be substituted by LLM/embeddings; they are cognitive measurements that require real human data.
| Item | Replacement track | Status |
|---|---|---|
| USF Free Association | PHON-75 (in-house Mechanical Turk / Prolific behavioral collection) | 🟢 stripped 2026-04-30 |
| SPP semantic priming | PHON-75 | 🟢 stripped 2026-04-30 |
| ELP lexical decision RT | PHON-75 | 🟢 stripped 2026-04-30 |
| SimLex-999 / MEN | PHON-81: PhonoLex Qwensim (pairwise cosine over PHON-76 4B word vectors) | 🟢 Done 2026-05-02. Validation: Spearman 0.73 vs SimLex (n=999), 0.85 vs MEN (n=2988). Source tables moved to data/norms/_oracles/; pipeline now ships qwensim (replaces simlex_similarity + men_relatedness). PROPN-filtered (54% of CMU∩freq is proper nouns; excluded from edge graph). |
| WordSim-353 | (already shipping; CC BY 4.0 verified) | 🟢 GREEN |
| BOI | PHON-82: LLM-rating estimate (PHON-73 harness) | 🟢 Done 2026-05-02. 47,724 non-PROPN content words rated via gpt-4.1-mini (~$5, 1.6h, 0 fails). Validation: Spearman 0.820 / Pearson 0.804 vs Pexman oracle on 7,940-word overlap. Pexman xlsx moved to data/norms/_oracles/. Pipeline ships boi field repointed to PhonoLex value. |
| Iconicity | PHON-83: LLM-rating with form (IPA) + WordNet-gloss inputs | 🟢 Done 2026-05-02. 47,724 non-PROPN content words rated via gpt-4.1-mini (~$5, 1.6h, 0 fails). Held-out validation Spearman 0.594 vs Winter on N=500; full-vocab Spearman 0.564 on 13,062-word overlap. Single-oracle correlation sits in Winter's own inter-rater band; the column reproduces 4/4 published convergent-validity patterns (POS, AoA, concreteness, reduplication) at magnitudes meeting or exceeding Winter on the same rows; see research/2026-04-30-llm-word-features/iconicity_convergent_validity.md. Winter csv moved to data/norms/_oracles/. Pipeline ships iconicity field repointed to PhonoLex value. |
| CYP-LEX | OSF API verified CC BY 4.0 | 🟢 GREEN |
| Hoffman SemDiv 2013 | PHON-76: Qwen3-Embedding-4B / FineWeb-Edu pipeline | 🟢 Done 2026-05-02. Production v1 (0.6B) + v2 (4B) builds completed; v2 ships with 4 SemD metrics (semd_topic / semd_vn / semd_h13 / n_topics_for_word). Spearman 0.74 vs Hoffman 30K oracle; beats Hoffman on full ELP+behavioral battery. Hoffman csv moved to data/norms/_oracles/. |
| CYP-LEX | Children-corpus subset of FineWeb-Edu, or PHON-73 child-familiarity | 🟡 kept in v1 (also note: now complemented by PHON-86 PhonBank + PHON-87 CHILDES age-graded freq tables shipping via PHON-88) |
| AVL (academic vocab) | PHON-74 Phono-Academic list | 🟡 kept in v1 |
| Onix/LEARN stopwords | PHON-74 Phono-Stopwords list | 🟡 kept in v1 |
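The Qwensim row describes its similarity score as pairwise cosine over word vectors. A minimal sketch of that operation follows, assuming small in-memory vectors; the three-dimensional toy vectors and the function names are hypothetical, and the production pipeline presumably works on much larger embedding matrices with a numerical library.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(vectors):
    """Map each unordered word pair (alphabetical order) to its cosine similarity."""
    words = sorted(vectors)
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for i, a in enumerate(words)
        for b in words[i + 1:]
    }


# Toy vectors (illustrative only; real embeddings have thousands of dimensions).
vectors = {
    "cat": [1.0, 1.0, 0.0],
    "dog": [1.0, 0.9, 0.1],
    "car": [0.0, 0.1, 1.0],
}

sims = pairwise_cosine(vectors)
for pair, score in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy data the ("cat", "dog") pair scores far higher than either pair involving "car", which is the kind of ordering the SimLex and MEN validations test against human judgments.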
Misc (NOTICE wrong but effect minimal — fix in NOTICE rewrite)
| Item | NOTICE claim | Reality | Status |
|---|---|---|---|
| ipa-dict | CC0 | MIT (and sub-datasets vary; verify only English/American is used) | ☐ |
| Swadesh list | "PD (linguistic concept)" | The 100/200-word selection is a creative work; copyright applies | ☐ |
| NLTK English stopwords | Apache 2.0 | No license at all (NOTICE conflated library license with corpus) | ☐ |
| FOX stopwords (Fox 1989) | (no claim) | ACM SIGIR Forum publication; ACM holds rights | ☐ |
| GPT-2 (test fixture only) | MIT | Modified MIT (clause about generated content) | ☐ |
| Ogden Basic English | "pre-1930 PD" | Published 1930; entered US PD 2026-01-01 only. UK still copyright until 2028 (Ogden died 1957) | ☐ |
| ECCC URL | datashare.ed.ac.uk/handle/10283/2791 | That URL is a different corpus; correct ECCC distribution is spandh.dcs.shef.ac.uk/ECCC/ | ☐ |
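The Ogden dates in the table follow from two simple term rules. The sketch below works them through, assuming a US term of 95 years from publication and a UK term of the author's life plus 70 years, each running to the end of the calendar year. The function names are illustrative, and cases such as unrenewed works, joint authorship, or corporate authorship are deliberately ignored, so this is an arithmetic aid rather than a legal determination.

```python
def us_pd_year(publication_year: int, term_years: int = 95) -> int:
    """First calendar year the work is in the US public domain.

    Assumes protection runs through 31 December of
    publication_year + term_years, so the work is free on 1 January after.
    """
    return publication_year + term_years + 1


def uk_pd_year(author_death_year: int, term_years: int = 70) -> int:
    """First calendar year the work is in the UK public domain.

    Assumes protection runs through 31 December of
    author_death_year + term_years.
    """
    return author_death_year + term_years + 1


# Ogden's Basic English: published 1930, author died 1957.
print(us_pd_year(1930))  # 2026
print(uk_pd_year(1957))  # 2028
```

The same two functions can be pointed at any other dated source in the audit, provided its publication year and author death year are known and the assumptions above actually hold for it.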
These are new datasets built in-house during the PHON-71 remediation work;
not on the original RED/YELLOW audit list because they didn't exist yet.
Recorded here for completeness of the audit closeout.
| Item | License posture | Where shipped | Status |
|---|---|---|---|
| PHON-85 FineWeb-Edu grade-banded freq (5 reading-grade quantile bins via F-K + tier overlay) | Derivative of FineWeb-Edu (ODC-BY 1.0); only per-word per-band statistics redistributed. Same posture as the existing PhonoLex frequency column. | D1 word_properties (15 cols, all unsurfaced). Aggregated into freq_age_8y + freq_age_12y headlines (surfaced filterable). | 🟢 Done 2026-05-03 (commit bfa00a3: data shipped; PHON-88: pipeline integration). |
| PHON-86 PhonBank preschool age-graded freq (4 corpora: Davis, Penney, Providence, StanfordEnglish; 5 bands × 2 channels) | Per user 2026-05-03: "use as training/eval scaffolding without redistributing." Derived per-word per-band statistics; raw .cha utterances NOT redistributed. Per-corpus author contact deferred to commercial-deploy time. | D1 word_properties (30 cols, all unsurfaced). Aggregated into freq_age_2y + freq_age_5y headlines (surfaced filterable). | 🟢 Done 2026-05-03 (commit 5abc98b: data shipped; PHON-88: pipeline integration). |
| PHON-87 CHILDES Eng-NA + Eng-UK age-graded freq, MOR-tagged (50 corpora; 8 bands × 2 channels; MOR-lemma tokenization) | Same posture as PHON-86 (TalkBank consortium derived statistics). | D1 word_properties (48 cols, all unsurfaced). Aggregated into all 4 freq_age_* headlines (surfaced filterable). | 🟢 Done 2026-05-03 (commit 6b1cf2b: data shipped; PHON-88: pipeline integration). PHON-87 headline finding: input wpm at 24-36mo Spearman vs −Glasgow AoA = +0.751, the strongest in-vocabulary AoA proxy in PhonoLex. |
| PHON-88 (this ticket): pipeline + D1 + NOTICE integration of the above three | n/a (integration work) | All-of-above. New surfaced flag on PropertyDef gates 93 raw band cols out of /api/property-metadata while keeping them in D1 / /api/words/:word round-trip. NOTICE block names all 54 TalkBank corpora with DOIs (52 with DOI; 2 with no DOI in TalkBank manifest as of 2026-05-03). | 🟢 Done 2026-05-03. |
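The PHON-88 row describes a surfaced flag that hides raw band columns from the metadata endpoint while leaving them in storage and in per-word responses. The sketch below illustrates that gating pattern only. The `PropertyDef` shape, the helper functions, and the raw column names are hypothetical stand-ins; the column counts (15, 30, 48) and the four freq_age_* headline names are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyDef:
    """Hypothetical stand-in for the pipeline's property definition."""
    name: str
    surfaced: bool = True  # False: stored and returned per word, hidden from metadata


def band_columns(prefix: str, count: int) -> list:
    """Build `count` unsurfaced raw band columns (names are illustrative)."""
    return [PropertyDef(f"{prefix}_{i:02d}", surfaced=False) for i in range(count)]


RAW_BANDS = (
    band_columns("fineweb_band", 15)     # PHON-85
    + band_columns("phonbank_band", 30)  # PHON-86
    + band_columns("childes_band", 48)   # PHON-87
)
HEADLINES = [
    PropertyDef(name)
    for name in ("freq_age_2y", "freq_age_5y", "freq_age_8y", "freq_age_12y")
]
ALL_PROPERTIES = RAW_BANDS + HEADLINES


def property_metadata(props):
    """Names exposed by the metadata listing: surfaced properties only."""
    return [p.name for p in props if p.surfaced]


def word_row_columns(props):
    """Names present in storage and per-word responses: every property."""
    return [p.name for p in props]


print(len(RAW_BANDS))                         # 93
print(property_metadata(ALL_PROPERTIES))      # the four freq_age_* headlines
print(len(word_row_columns(ALL_PROPERTIES)))  # 97
```

Keeping the flag on the definition, rather than maintaining a separate allow-list, means a new raw column is hidden or shown by a single field at the point where it is declared.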
GREEN — verified clean (no action)
| Item | License | Confidence |
|---|---|---|
| CMU Pronouncing Dictionary v0.7b | Modified BSD ("completely unrestricted") | High |
| Roget's Thesaurus 1911 (PG #22) | US Public Domain | High |
| Ogden's Basic English (1930) | US PD as of 2026-01-01 | Medium (UK still under copyright until 2028) |
| AFINN sentiment | Apache 2.0 | High |
| spaCy English stopwords | MIT (inherits library) | High |
| en_core_web_sm | MIT | High |
| g2p-en | Apache 2.0 | High |
| GPT-2 (test fixture) | Modified MIT | High |
| Glasgow Norms (Scott et al. 2019) | CC BY 4.0 | High |
| Lancaster Sensorimotor Norms (Lynott et al. 2020) | CC BY 4.0 | High (retired from shipped schema 2026-05-12; retained as eval oracle at data/norms/_oracles/) |
| Socialness norms (Diveica et al. 2023) | CC BY 4.0 | High (retired from shipped schema 2026-05-12 in favor of PhonoLex Socialness via gpt-4.1-mini; full-vocab Spearman 0.820 vs Diveica oracle on N=7,850 overlap. Source CSV moved to data/norms/_oracles/.) |
| WordSim-353 | CC BY 4.0 | High |
| ECCC v1.2 (Marxer et al. 2016) | CC BY 4.0 | High |
Independent compliance items
| Item | Status |
|---|---|
| Gemma TOS propagation: phonolex.com Terms of Service must include the Gemma Prohibited Use Policy (per Gemma TOS Hosted Services clause). | ☑ Done 2026-05-01: packages/web/frontend/src/pages/TermsOfService.tsx has dedicated "AI-Generated Content (Governed Generation)" section: names T5Gemma 9B-2B + 2-4b-4b, links Gemma TOS + Prohibited Use Policy, adds downstream-propagation clause, adds best-effort-not-guarantee disclaimer. Pending deploy. |
| NOTICE rewrite: full re-pass with verified license texts, drop wrong claims, add Gemma compliance section, fix ECCC URL. | ☑ Done 2026-05-01: NOTICE rewritten end-to-end. Removed RED items (SUBTLEX, MorphyNet, GSL/NGSL, USF/SPP/ELP) entirely; added FineWeb-Edu (ODC-BY 1.0) + PHON-73 AI features; corrected ipa-dict (CC0→MIT), Ogden timing, NLTK stopwords (Apache→none), Swadesh, ECCC URL, GPT-2 (Modified MIT); re-classified Hillenbrand as ML-trained-transformation; added "validation oracles only" section + Gemma TOS propagation reminder. |
| Hillenbrand absence audit: confirm zero pipeline ship | ☐ |
| PHOIBLE remnant audit: confirm zero pipeline ship | ✅ Confirmed N/A. Features CSV (packages/features/data/phonolex_features_ipa.csv) is hand-built from Hayes 2009 textbook with per-segment citations; explicit "This file is an original encoding by Neumann's Workshop LLC". Only PHOIBLE references remaining are in CitationDialog.tsx (attribution / historical) and packages/features/README.md (comparison-against-PHOIBLE narrative). Drop those mentions in NOTICE rewrite if not strictly needed. |
Execution order
1. PHON-72: build FineWeb-Edu frequency + POS replacement (RunPod GPU). Closes 🔴 SUBTLEX-US frequency + SUBTLEX-US POS in one strike, plus delivers PHON-70's intent.
2. Strip 🔴 MorphyNet + 🔴 GSL + 🔴 NGSL from pipeline.
3. Re-source 🟡 4 Brysbaert datasets from CLLD CC BY 4.0.
4. Audit 🔴 Hillenbrand + PHOIBLE absence (likely confirms ⚪ N/A).
5. NOTICE rewrite (waits until items 1-4 are resolved so the rewrite reflects final reality).
6. 🟡 Author-contact campaign (parallel; 30-day deadline; remove or replace if silent).
7. phonolex.com TOS update for Gemma.
8. Final lawyer review before merge to develop.
Audit provenance
Source: three license-research forks dispatched 2026-04-30, each verifying actual posted license at distribution point (not NOTICE's claim). Findings cross-referenced with ship-state in packages/web/workers/scripts/export-to-d1.py and packages/generation/scripts/build_runtime_data.py.