License Restructuring and Dataset Removal Design

Date: 2026-03-17
Status: Approved
Goal: Remove three datasets with incompatible licenses (SWOW, IPhOD2, PHOIBLE), restructure licensing to an Apache 2.0 + Proprietary split, and fix pre-existing TypeScript errors.


Motivation

PhonoLex is transitioning from an open academic tool to a commercial SLP platform (Neumann's Workshop, LLC). Three datasets carry licenses that either prohibit commercial use or impose unwanted constraints:

| Dataset | License | Problem |
|---|---|---|
| SWOW (De Deyne et al., 2019) | CC BY-NC-ND 3.0 | No commercial use, no derivatives |
| IPhOD2 (Vaden et al., 2009) | GPL v2 | Copyleft: derivative works must be open source |
| PHOIBLE (Moran & McCloy, 2019) | CC BY-SA 3.0 | Share-alike; already replaced by learned vectors, but remnants remain |

With these three removed, all remaining datasets are either permissive (BSD, CC BY 4.0, CC0) or have no explicit license (academic norms with citation requirements only). This enables a clean split-license model.


License Structure

Split Model

| Scope | License | Rationale |
|---|---|---|
| packages/data/, packages/features/, packages/web/, root configs, docs, CI | Apache 2.0 | Patent grant protection, academic credibility, grant-friendly |
| packages/governors/, packages/dashboard/ | Proprietary (All Rights Reserved) | Core IP: the constraint engine and governed chat platform |
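
The scope column can be read as a simple path rule. Below is a minimal Python sketch of that rule, assuming repo-relative POSIX paths. The function name license_for_path and the returned labels are illustrative only and are not part of the repo.

```python
from pathlib import PurePosixPath

# Directories covered by the proprietary license, per the split-model table.
PROPRIETARY_ROOTS = ("packages/governors", "packages/dashboard")


def license_for_path(path: str) -> str:
    """Return the license label governing a repo-relative path (hypothetical helper)."""
    parts = PurePosixPath(path).parts
    for root in PROPRIETARY_ROOTS:
        root_parts = PurePosixPath(root).parts
        # Compare whole path components so "packages/dashboard-extras" does not match.
        if parts[: len(root_parts)] == root_parts:
            return "Proprietary"
    # Everything else (data, features, web, root configs, docs, CI) is Apache 2.0.
    return "Apache-2.0"
```

A check like this could back a CI step that verifies each package's LICENSE file matches its scope, though the spec does not call for one.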

Files

Root LICENSE — Full Apache 2.0 text. Copyright Neumann's Workshop, LLC.

packages/governors/LICENSE — Proprietary notice:

Copyright (c) 2026 Neumann's Workshop, LLC. All rights reserved.

This software is proprietary and confidential. No part of this software
may be reproduced, distributed, or transmitted in any form without the
prior written permission of Neumann's Workshop, LLC.

packages/dashboard/LICENSE — Same proprietary notice.

NOTICE (new, at root): third-party attributions required by Apache 2.0:

  • CMU Pronouncing Dictionary: Modified BSD
  • Edinburgh Closed-set Confusability Corpus (ECCC): CC BY 4.0
  • ipa-dict: CC0 (Public Domain)
  • Academic norm datasets: references to docs/about/citations.md

docs/about/license.md — Rewrite to explain the split model.

docs/about/citations.md — Remove SWOW, IPhOD2 (Vitevitch & Luce 2004 phonotactic probability section), and PHOIBLE entries. All other citations remain.


Dataset Removal: SWOW

SWOW provides ~976K cognitive association edges out of ~1M total. Removal significantly reduces association graph density. USF Free Association (~72K edges) remains and keeps /theme functional with thinner coverage.

Code Changes

Loaders:

  • packages/data/src/phonolex_data/loaders/associations.py: delete the load_swow() function
  • packages/data/src/phonolex_data/loaders/__init__.py: remove the load_swow import and export

Pipeline:

  • packages/data/src/phonolex_data/pipeline/edges.py: remove SWOW edge assembly (the 7 SWOW edge lines in build_edges())
  • packages/data/src/phonolex_data/pipeline/schema.py: remove swow_strength: float | None from EdgeRecord

Governor:

  • packages/governors/src/phonolex_governors/thematic.py: update the build_assoc_graph() signature to remove the swow parameter and accept only usf. Update the module docstring (line 3), the build_assoc_graph() docstring, and the ThematicConstraint class docstring (line 68); all three reference "SWOW + USF".
  • packages/governors/src/phonolex_governors/boosts.py: update the comment referencing SWOW

Dashboard:

  • packages/dashboard/server/model.py: stop loading SWOW data in the build_assoc_graph() call

Web API:

  • packages/web/workers/scripts/export-to-d1.py: remove swow_strength from EDGE_COLUMNS
  • packages/web/workers/src/types.ts: remove swow_strength from EdgeRow
  • packages/web/workers/src/config/edgeTypes.ts: remove EDGE_TYPES["SWOW"]
  • packages/web/workers/scripts/config.py: remove EDGE_TYPES["SWOW"]
  • packages/web/workers/src/routes/associations.ts: remove swow_strength from rowToEdgeResponse()

Frontend:

  • packages/web/frontend/src/types/phonology.ts: remove swow_strength from edge types
  • packages/web/frontend/src/services/apiClient.ts: remove swow_strength from the EdgeResult interface
  • packages/web/frontend/src/components/tools/LookupTool.tsx: remove SWOW display references
  • packages/web/frontend/src/components/AppHeader.tsx: update the hardcoded "SWOW (976K)" string and dataset list chip
  • packages/web/frontend/public/landing/phonological-therapy-materials.html: update the "free association norms (SWOW, USF)" copy

Tests:

  • packages/web/workers/src/__tests__/api.test.ts: update the edge-types count assertion (7→6), remove the SWOW .toContain assertion and the body.SWOW accessor
  • packages/data/tests/test_pipeline.py: update EdgeRecord construction (remove the swow_strength field and "SWOW" from edge_sources)
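
The governor change above amounts to narrowing build_assoc_graph() to a single data source. The spec does not show the current signature or data shapes, so the following is only a sketch of what a USF-only version might look like. The (cue, target, strength) edge tuple and the nested-dict return type are assumptions made for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable


def build_assoc_graph(
    usf: Iterable[tuple[str, str, float]],
) -> dict[str, dict[str, float]]:
    """Build a cue -> {target: strength} graph from USF edges only.

    Assumed input shape: each edge is (cue, target, strength).
    There is no swow parameter; callers pass USF data alone.
    """
    graph: dict[str, dict[str, float]] = defaultdict(dict)
    for cue, target, strength in usf:
        # If a pair appears more than once, keep the strongest value.
        previous = graph[cue].get(target, 0.0)
        graph[cue][target] = max(previous, strength)
    return dict(graph)
```

With a signature of this kind, the dashboard call site in model.py would drop its SWOW argument and pass only the USF edges.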


Dataset Removal: IPhOD2

IPhOD2 provides 6 properties in the PHONOTACTIC_PROBABILITY category:

| Property | Column | Type |
|---|---|---|
| Biphone Probability (Avg) | phono_prob_avg | float |
| Positional Segment Probability (Avg) | positional_prob_avg | float |
| Neighborhood Density | neighborhood_density | int |
| Stressed Biphone Probability | str_phono_prob_avg | float |
| Stressed Positional Segment Probability | str_positional_prob_avg | float |
| Stressed Neighborhood Density | str_neighborhood_density | int |

Each property also has a matching _percentile column, so 12 columns in total are removed from D1.
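
The full list of 12 column names follows mechanically from the six properties. Here is a small sketch, assuming each percentile column is named by appending _percentile to its base column. The spec gives the suffix but not the full names, and columns_to_drop is a hypothetical helper.

```python
# The six IPhOD2 property columns listed in this section.
IPHOD_COLUMNS = [
    "phono_prob_avg",
    "positional_prob_avg",
    "neighborhood_density",
    "str_phono_prob_avg",
    "str_positional_prob_avg",
    "str_neighborhood_density",
]


def columns_to_drop(base: list[str] = IPHOD_COLUMNS) -> list[str]:
    """Return the base columns followed by their assumed percentile counterparts."""
    return list(base) + [f"{name}_percentile" for name in base]
```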

Code Changes

Loaders:

  • packages/data/src/phonolex_data/loaders/norms.py: delete the load_iphod() function
  • packages/data/src/phonolex_data/loaders/__init__.py: remove the load_iphod import and export

Pipeline:

  • packages/data/src/phonolex_data/pipeline/schema.py: remove the 6 IPhOD fields from WordRecord
  • packages/data/src/phonolex_data/pipeline/words.py: remove the load_iphod import, remove the IPhOD entries from _NORM_FIELD_MAP, and remove the IPhOD loader call in build_words()

Web API:

  • packages/web/workers/src/config/properties.ts: remove the entire PHONOTACTIC_PROBABILITY category definition and its entry in the categories array
  • packages/web/workers/scripts/config.py: remove the PHONOTACTIC_PROBABILITY category
  • packages/web/workers/scripts/export-to-d1.py: remove the 6 property columns and 6 percentile columns from the word table export
  • packages/web/workers/src/types.ts: remove the 6 IPhOD properties from the word type

Frontend:

  • packages/web/frontend/src/types/phonology.ts: remove the 6 IPhOD properties from word types

Dashboard (IPhOD2-derived fields in the governor pipeline):

  • packages/dashboard/server/schemas.py: remove the biphone_avg and pos_seg_avg fields from the Phono model (these are IPhOD2-derived phonotactic probability values)
  • packages/dashboard/frontend/src/types.ts: remove the corresponding fields from the frontend Phono type
  • packages/dashboard/scripts/build_lookup.py: remove the biphone_avg/pos_seg_avg fallback writes
  • packages/dashboard/server/model.py: remove IPhOD2 field population from lookup data

Tests:

  • packages/data/tests/test_new_loaders.py: remove the test_load_iphod test (lines 24-40)
  • packages/data/tests/test_datasets.py: remove the test_load_phonotactic_probability test
  • packages/dashboard/server/tests/test_schemas.py: update the Phono model test fixtures (remove biphone_avg/pos_seg_avg)


Dataset Removal: PHOIBLE Remnants

PHOIBLE's 76-dimensional feature vectors were replaced by learned Bayesian vectors (packages/features/), so the PHOIBLE loader and its data files are dead code. The load_phonotactic_probability() function in the same loader file (loaders/phoible.py) loaded a legacy Vitevitch & Luce JSON. It was superseded by load_iphod(), which is itself being removed.

Code Changes

Loaders:

  • packages/data/src/phonolex_data/loaders/phoible.py: delete the entire file
  • packages/data/src/phonolex_data/loaders/__init__.py: remove the load_phoible and load_phonotactic_probability imports/exports

Stale comments:

  • packages/web/workers/src/lib/similarity.ts: update comments from "PHOIBLE 76d vectors" to "feature vectors" (the precomputed dot products now come from learned vectors)
  • packages/web/frontend/src/components/PhonemePickerDialog.tsx: update the "Phoible features" comment
  • packages/features/src/phonolex_features/validate.py: update PHOIBLE references in docstrings
  • packages/data/src/phonolex_data/pipeline/schema.py: update the DerivedData docstring ("PHOIBLE vectors" → "feature vectors")

Tests:

  • packages/data/tests/test_datasets.py: remove the test_load_phoible and test_load_phonotactic_probability tests

Documentation:

  • CLAUDE.md: update the terminology note. "Feature vectors" is already the preferred term over "PHOIBLE vectors"; remove the "NOT embeddings" note since PHOIBLE is no longer relevant, but keep the "feature vectors, not embeddings" guidance.
  • docs/reference/phoible-features.md: remove or redirect (this entire page is about PHOIBLE features)


Pre-existing TypeScript Fix

packages/web/frontend/src/components/tools/ContrastiveInterventionTool.tsx, line 159: phoneme_count is typed number | null but is used without a null guard. Fix: add if (wordLength == null) return true; before the position filter logic. (Already implemented; needs to be committed.)


Documentation Updates

  • CLAUDE.md — update property count, dataset count, terminology, remove PHOIBLE/SWOW/IPhOD2 references
  • README.md — update license badge and data source list
  • docs/about/citations.md — remove SWOW, PHOIBLE, Vitevitch & Luce (IPhOD2) entries
  • docs/about/license.md — rewrite for Apache 2.0 + Proprietary split
  • docs/reference/phoible-features.md — remove or mark as historical

Historical specs/plans under docs/superpowers/ are left as-is — they are point-in-time records.
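
Because the removals touch many files, a quick scan for leftover mentions can help confirm nothing was missed. The sketch below is one hypothetical way to do that in Python. The file suffixes, the skip list, and the function name are assumptions, with docs/superpowers/ skipped because those files are left as-is.

```python
import re
from pathlib import Path

# Case-insensitive match on the three removed dataset names.
PATTERN = re.compile(r"swow|iphod|phoible", re.IGNORECASE)
# Assumed set of text file types worth scanning.
TEXT_SUFFIXES = {".py", ".ts", ".tsx", ".md", ".html"}
# Historical specs are intentionally left untouched; the others are noise.
SKIP_PREFIXES = ("docs/superpowers", ".git", "node_modules")


def find_residual_references(root: Path) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line text) for each leftover mention."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        rel = path.relative_to(root).as_posix()
        if any(rel == p or rel.startswith(p + "/") for p in SKIP_PREFIXES):
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((rel, lineno, line.strip()))
    return hits
```

Hits still need human review: a page kept and marked as historical, for example, would legitimately mention PHOIBLE.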


Deferred Tasks

Task F: Recalculate Phonotactic Probability from CMU Dict

The 6 IPhOD2 properties (biphone probability, positional segment probability, neighborhood density — plain + stressed) can be independently computed from the CMU Pronouncing Dictionary. This restores the PHONOTACTIC_PROBABILITY category with clean provenance and no GPL dependency.

Scope: New computation module in packages/data/, re-add properties to pipeline/schema/metadata. To be designed and implemented in a separate spec after the license cleanup merges.
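
To make the idea concrete, here is a rough Python sketch of two of the measures computed from a pronunciation lexicon. It is not the design for Task F. It assumes a word -> phoneme-list mapping, uses unweighted type counts (published calculators typically weight by word frequency), ignores stress, and treats homophones as non-neighbors. All function names are hypothetical.

```python
from collections import Counter

Lexicon = dict[str, list[str]]  # word -> phoneme sequence (assumed shape)


def positional_segment_probs(lexicon: Lexicon) -> dict[tuple[int, str], float]:
    """P(phoneme at position i) among words long enough to have position i."""
    seg_counts: Counter = Counter()
    pos_totals: Counter = Counter()
    for phones in lexicon.values():
        for i, phone in enumerate(phones):
            seg_counts[(i, phone)] += 1
            pos_totals[i] += 1
    return {key: n / pos_totals[key[0]] for key, n in seg_counts.items()}


def avg_positional_prob(phones: list[str], probs: dict[tuple[int, str], float]) -> float:
    """Mean positional segment probability across a word's phonemes."""
    if not phones:
        return 0.0
    return sum(probs.get((i, p), 0.0) for i, p in enumerate(phones)) / len(phones)


def _one_edit_apart(a: list[str], b: list[str]) -> bool:
    """True if b differs from a by one phoneme substitution, insertion, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def neighborhood_density(word: str, lexicon: Lexicon) -> int:
    """Count other words exactly one phoneme edit away from the given word."""
    target = lexicon[word]
    return sum(
        1
        for other, phones in lexicon.items()
        if other != word and _one_edit_apart(target, phones)
    )
```

Biphone probability would follow the same counting pattern over adjacent phoneme pairs, and the stressed variants would need the stress markers that the CMU dictionary attaches to vowels.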

Task G: Regenerate Pickle and D1 Seed

After the dataset removals (Tasks A-C), the pickle and D1 seed must be regenerated to reflect:

  • ~976K fewer cognitive association edges (SWOW removed)
  • 12 fewer word columns (6 IPhOD2 properties + 6 percentiles)
  • swow_strength column removed from the edges table

Must also be run again after Task F when recalculated phonotactic properties are ready.

Note: The pickle and D1 seed are gitignored, so this is a local pipeline operation — not a code change.
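
After regeneration, a quick schema check can confirm the removed columns are really gone. The sketch below assumes the seed has been loaded into a local SQLite database (D1 is SQLite-compatible) and that the tables are named words and edges. Those table names, the assumed _percentile column names, and the helper name are illustrative guesses, not taken from the repo.

```python
import sqlite3

_IPHOD = [
    "phono_prob_avg",
    "positional_prob_avg",
    "neighborhood_density",
    "str_phono_prob_avg",
    "str_positional_prob_avg",
    "str_neighborhood_density",
]

# Columns that should no longer exist, keyed by (assumed) table name.
REMOVED_COLUMNS = {
    "words": _IPHOD + [f"{c}_percentile" for c in _IPHOD],
    "edges": ["swow_strength"],
}


def leftover_columns(
    conn: sqlite3.Connection, removed: dict[str, list[str]]
) -> dict[str, list[str]]:
    """Return any removed columns that are still present, grouped by table."""
    leftovers = {}
    for table, cols in removed.items():
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        present = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        stale = sorted(present.intersection(cols))
        if stale:
            leftovers[table] = stale
    return leftovers
```

An empty result means the schema matches the expectations listed above. It says nothing about edge counts, which would need a separate row-count check.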