# Monorepo Migration Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Cannibalize diffusion-governors and constrained_chat into PhonoLex as a packages/ monorepo with a shared data layer.

**Architecture:** Create packages/{data,governors,web,dashboard}. Move existing PhonoLex web into packages/web/, copy the governor engine into packages/governors/, copy the dashboard into packages/dashboard/, and assemble a shared data layer from all three into packages/data/. Drop dead code. Fix paths.

**Tech Stack:** Python (uv workspaces), TypeScript (Hono/Cloudflare Workers, React), Git

**Spec:** docs/superpowers/specs/2026-03-13-monorepo-migration-design.md
## Chunk 1: Branch, Scaffold, Move PhonoLex Web

### Task 1: Create branch and scaffold packages/ directories
Files:
- Create: packages/data/src/phonolex_data/__init__.py
- Create: packages/data/src/phonolex_data/loaders/__init__.py
- Create: packages/data/src/phonolex_data/phonology/__init__.py
- Create: packages/data/src/phonolex_data/mappings/__init__.py
- Create: packages/data/src/phonolex_data/graph/__init__.py
- Create: packages/data/tests/__init__.py
Uses src/ layout (matching packages/governors/src/diffusion_governors/) so hatch packaging works correctly.
- [ ] Step 1: Create branch off main

```bash
git checkout main
git pull origin main
git checkout -b feat/monorepo-migration
```

- [ ] Step 2: Create packages/data/ directory structure with `__init__.py` files

```bash
mkdir -p packages/data/src/phonolex_data/{loaders,phonology,mappings,graph}
mkdir -p packages/data/tests
```

Create packages/data/src/phonolex_data/__init__.py:

```python
"""phonolex_data — shared data layer for PhonoLex platform."""
```

Create empty __init__.py in each subdirectory (loaders/, phonology/, mappings/, graph/) and in tests/.

- [ ] Step 3: Verify structure

```bash
find packages/ -type f | sort
```

Expected: 6 __init__.py files across the packages/data/ tree.

- [ ] Step 4: Commit

```bash
git add packages/
git commit -m "scaffold: create packages/data/ directory structure"
```
### Task 2: Move PhonoLex web into packages/web/
Files:
- Move: workers/ → packages/web/workers/
- Move: webapp/frontend/ → packages/web/frontend/
- [ ] Step 1: Create packages/web/ and move workers

```bash
mkdir -p packages/web
git mv workers packages/web/workers
```

- [ ] Step 2: Move webapp/frontend

```bash
git mv webapp/frontend packages/web/frontend
```

- [ ] Step 3: Remove empty webapp/ directory

```bash
rmdir webapp
```

If webapp/ has other contents, check what's there first. Only frontend/ should exist.

- [ ] Step 4: Verify structure

```bash
ls packages/web/workers/src/
ls packages/web/frontend/src/
```

Expected: existing source files in both locations.

- [ ] Step 5: Commit

```bash
git add -A
git commit -m "move: workers/ and webapp/frontend/ into packages/web/"
```
## Chunk 2: Copy Governors

### Task 3: Copy diffusion-governors engine into packages/governors/
Source: /Users/jneumann/Repos/diffusion-governors/
Files:
- Copy: src/diffusion_governors/{core,gates,boosts,cdd,constraints,lookups}.py → packages/governors/src/diffusion_governors/
- Create: packages/governors/src/diffusion_governors/__init__.py (new, sampler imports removed — see Step 3)
- Copy: tests/{conftest,test_core}.py → packages/governors/tests/
- Create: packages/governors/pyproject.toml (new, written in Step 5)
NOT copying: llada_sampler.py, mdlm_sampler.py, datasets.py (replaced by phonolex_data.loaders), data/, models/, scripts/
- [ ] Step 1: Create target directories

```bash
mkdir -p packages/governors/src/diffusion_governors
mkdir -p packages/governors/tests
```

- [ ] Step 2: Copy engine modules (minus samplers)

```bash
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/core.py packages/governors/src/diffusion_governors/
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/gates.py packages/governors/src/diffusion_governors/
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/boosts.py packages/governors/src/diffusion_governors/
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/cdd.py packages/governors/src/diffusion_governors/
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/constraints.py packages/governors/src/diffusion_governors/
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/lookups.py packages/governors/src/diffusion_governors/
```
- [ ] Step 3: Create new `__init__.py` (remove sampler imports)

Create packages/governors/src/diffusion_governors/__init__.py:

```python
"""Diffusion Governors — constraint layer for language model generation."""
from diffusion_governors.core import Governor, GovernorContext
from diffusion_governors.constraints import (
    Bound,
    Complexity,
    Density,
    Exclude,
    ExcludeInClusters,
    NormCovered,
    VocabOnly,
    ESSENTIAL_ENGLISH,
    STOP_WORDS,
)
from diffusion_governors.gates import HardGate
from diffusion_governors.boosts import LogitBoost
from diffusion_governors.cdd import CDDConstraint, CDDProjection
from diffusion_governors.lookups import (
    LookupBuilder,
    PhonoFeatures,
    Syllable,
    TokenFeatures,
)

__all__ = [
    "Governor",
    "GovernorContext",
    "Bound",
    "Complexity",
    "Density",
    "Exclude",
    "ExcludeInClusters",
    "NormCovered",
    "VocabOnly",
    "ESSENTIAL_ENGLISH",
    "STOP_WORDS",
    "HardGate",
    "LogitBoost",
    "CDDConstraint",
    "CDDProjection",
    "LookupBuilder",
    "PhonoFeatures",
    "Syllable",
    "TokenFeatures",
]
```
Note: Removed SamplerConfig, sample, LLaDASamplerConfig, llada_sample, and datasets imports. The datasets module is being replaced by phonolex_data.loaders.
- [ ] Step 4: Copy tests (minus e2e sampler tests)

```bash
cp /Users/jneumann/Repos/diffusion-governors/tests/conftest.py packages/governors/tests/
cp /Users/jneumann/Repos/diffusion-governors/tests/test_core.py packages/governors/tests/
```

NOT copying: test_e2e.py, test_llada_e2e.py (sampler tests), test_datasets.py (moves to packages/data).
- [ ] Step 5: Create pyproject.toml for governors

Create packages/governors/pyproject.toml:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "diffusion-governors"
version = "0.1.0"
description = "Constraint layer for language model generation"
license = "CC-BY-SA-3.0"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "ruff>=0.4",
]

[tool.hatch.build.targets.wheel]
packages = ["src/diffusion_governors"]

[tool.ruff]
target-version = "py310"
line-length = 100

[tool.pytest.ini_options]
testpaths = ["tests"]
markers = [
    "slow: requires model weights (deselect with '-m \"not slow\"')",
]
```
- [ ] Step 6: Verify files exist

```bash
ls packages/governors/src/diffusion_governors/
ls packages/governors/tests/
```

Expected: 7 .py files in src, 2 in tests, plus pyproject.toml.

- [ ] Step 7: Commit

```bash
git add packages/governors/
git commit -m "add: copy diffusion-governors engine into packages/governors/"
```
## Chunk 3: Copy Dashboard

### Task 4: Copy constrained_chat into packages/dashboard/
Source: /Users/jneumann/Repos/constrained_chat/
Files:
- Copy: server/ → packages/dashboard/server/
- Copy: frontend/ → packages/dashboard/frontend/
- Copy: scripts/build_lookup_phonolex.py → packages/dashboard/scripts/build_lookup.py
- Copy: scripts/generation_sweep.py → packages/dashboard/scripts/generation_sweep.py
NOT copying: phase*.py, lookups/, docs/, governor-t5-plan.md, WORKING_IMPLEMENTATIONS.md, patch_lookup_syllables.py
- [ ] Step 1: Create target directories

```bash
mkdir -p packages/dashboard/scripts
```

- [ ] Step 2: Copy server

```bash
cp -r /Users/jneumann/Repos/constrained_chat/server packages/dashboard/server
```

- [ ] Step 3: Copy frontend

```bash
cp -r /Users/jneumann/Repos/constrained_chat/frontend packages/dashboard/frontend
```

- [ ] Step 4: Copy scripts (rename build_lookup)

```bash
cp /Users/jneumann/Repos/constrained_chat/scripts/build_lookup_phonolex.py packages/dashboard/scripts/build_lookup.py
cp /Users/jneumann/Repos/constrained_chat/scripts/generation_sweep.py packages/dashboard/scripts/generation_sweep.py
```

- [ ] Step 5: Remove frontend dist/ if copied (build artifact)

```bash
rm -rf packages/dashboard/frontend/dist
```

- [ ] Step 6: Add lookups/ to .gitignore for dashboard

The dashboard generates large lookup JSON files. Add to the project .gitignore:

```gitignore
# Dashboard — generated lookup files
packages/dashboard/lookups/
```

- [ ] Step 7: Verify structure

```bash
ls packages/dashboard/server/
ls packages/dashboard/frontend/src/
ls packages/dashboard/scripts/
```

Expected: server has main.py, model.py, governor.py, schemas.py, profiles.py, sessions.py, routes/, tests/. Frontend has React source. Scripts has build_lookup.py and generation_sweep.py.

- [ ] Step 8: Commit

```bash
git add packages/dashboard/
git commit -m "add: copy constrained_chat into packages/dashboard/"
```
## Chunk 4: Assemble Shared Data Layer

### Task 5: Copy missing data files from diffusion-governors
Before creating loaders, ensure all data files exist in PhonoLex's data/ at repo root. Most norms and vocab files currently only exist in diffusion-governors.
- [ ] Step 1: Copy vocab directories

```bash
cp -r /Users/jneumann/Repos/diffusion-governors/data/vocab data/vocab
```

- [ ] Step 2: Copy all missing norms files

```bash
# Copy all norms files from diffusion-governors (-n skips files that already exist)
cp -n /Users/jneumann/Repos/diffusion-governors/data/norms/*.csv data/norms/
cp -n /Users/jneumann/Repos/diffusion-governors/data/norms/*.txt data/norms/
cp -n /Users/jneumann/Repos/diffusion-governors/data/norms/*.xlsx data/norms/
cp -rn /Users/jneumann/Repos/diffusion-governors/data/norms/swow data/norms/swow
```

This copies: Ratings_VAD_WarrinerEtAl.csv, kuperman_aoa.xlsx, subtlex_frequency.txt, Sensorimotor_norms.csv, SimLex-999.txt, free_association.txt, semantic_diversity.csv, SocialnessNorms_DiveicaPexmanBinney2021.csv, boi_pexman2019.xlsx, elp_items.csv, iconicity_ratings.csv, swow/ directory. PhonoLex already has concreteness.txt and GlasgowNorms.xlsx.

- [ ] Step 3: Copy PHOIBLE English CSV

The governor loaders use a curated English-only phoneme CSV (comma-delimited), not PhonoLex's full PHOIBLE TSV. Copy it:

```bash
cp /Users/jneumann/Repos/diffusion-governors/data/phonology/phoible-english.csv data/phoible/phoible-english.csv
```

- [ ] Step 4: Update .gitignore for data files

The *.csv glob in .gitignore will hide the new norms CSV files. Add exceptions:

```gitignore
# Data files needed by loaders
!data/norms/*.csv
!data/norms/*.txt
!data/norms/*.xlsx
!data/norms/swow/*.csv
!data/vocab/**/*.json
!data/phoible/*.csv
```

- [ ] Step 5: Commit

```bash
git add data/ .gitignore
git commit -m "add: copy missing norms, vocab, and PHOIBLE data from diffusion-governors"
```
### Task 6: Split datasets.py into packages/data/loaders/
Source: /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/datasets.py (606 LOC)
All files go under packages/data/src/phonolex_data/loaders/ (src layout matching governors).
Files:
- Create: packages/data/src/phonolex_data/loaders/_helpers.py
- Create: packages/data/src/phonolex_data/loaders/cmudict.py
- Create: packages/data/src/phonolex_data/loaders/norms.py
- Create: packages/data/src/phonolex_data/loaders/associations.py
- Create: packages/data/src/phonolex_data/loaders/phoible.py
- Create: packages/data/src/phonolex_data/loaders/vocab_lists.py
- Create: packages/data/src/phonolex_data/loaders/__init__.py (re-exports)
This is the riskiest step — actual refactoring, not just file movement. Every function gets assigned to a module, shared helpers get factored out, and all downstream `from diffusion_governors.datasets import ...` imports break simultaneously.
- [ ] Step 1: Create shared _helpers module

Create packages/data/src/phonolex_data/loaders/_helpers.py:

```python
"""Shared helpers for dataset loaders."""
from __future__ import annotations

import json
from pathlib import Path


def get_data_dir() -> Path:
    """Return the repo-root data/ directory.

    Looks for DATA_DIR env var first, then walks up from this file to find the
    repo root (identified by having a packages/ directory).
    """
    import os

    env = os.environ.get("DATA_DIR")
    if env:
        return Path(env)
    # Walk up: src/phonolex_data/loaders/ → src/phonolex_data/ → src/ → packages/data/ → packages/ → repo root
    return Path(__file__).resolve().parent.parent.parent.parent.parent.parent / "data"


def require_openpyxl():
    try:
        import openpyxl
        return openpyxl
    except ImportError:
        raise ImportError(
            "openpyxl is required for .xlsx files: pip install phonolex-data[data]"
        ) from None


def load_vocab_dir(dirpath: Path, prefix: str) -> dict[str, set[str]]:
    """Load all JSON word lists from a directory into {word: {membership, ...}}."""
    result: dict[str, set[str]] = {}
    for f in sorted(dirpath.glob("*.json")):
        list_name = f"{prefix}_{f.stem.lower()}"
        with open(f) as fh:
            words = json.load(fh)
        for word in words:
            w = word.strip().lower()
            if w:
                result.setdefault(w, set()).add(list_name)
    return result
```
Note: get_data_dir() walks 6 levels up from packages/data/src/phonolex_data/loaders/_helpers.py to reach repo root, then appends data/.
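To sanity-check the membership-merging contract of load_vocab_dir in isolation, the function body can be exercised against a throwaway directory (function copied from above; the file names here are hypothetical):

```python
import json
import tempfile
from pathlib import Path

def load_vocab_dir(dirpath: Path, prefix: str) -> dict[str, set[str]]:
    """Load all JSON word lists from a directory into {word: {membership, ...}}."""
    result: dict[str, set[str]] = {}
    for f in sorted(dirpath.glob("*.json")):
        list_name = f"{prefix}_{f.stem.lower()}"
        with open(f) as fh:
            words = json.load(fh)
        for word in words:
            w = word.strip().lower()
            if w:
                result.setdefault(w, set()).add(list_name)
    return result

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    # Two tiny hypothetical word lists; note case, whitespace, and an empty entry
    (root / "Ogden.json").write_text(json.dumps(["The", "cat ", ""]))
    (root / "gsl.json").write_text(json.dumps(["cat"]))
    vocab = load_vocab_dir(root, "vocab")

assert vocab["the"] == {"vocab_ogden"}              # lowercased word and list name
assert vocab["cat"] == {"vocab_ogden", "vocab_gsl"}  # memberships merge per word
```

Words are normalized (stripped, lowercased) and empty strings dropped, so the same surface word accumulates memberships across lists.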
- [ ] Step 2: Create cmudict.py

Create packages/data/src/phonolex_data/loaders/cmudict.py:

```python
"""CMU Pronouncing Dictionary loader."""
from __future__ import annotations

from pathlib import Path
from typing import Any

from phonolex_data.loaders._helpers import get_data_dir
from phonolex_data.mappings import load_arpa_to_ipa


def load_cmudict(path: str | Path | None = None) -> dict[str, list[str]]:
    """Load CMU Pronouncing Dictionary (0.7b).

    Returns:
        {word: [ARPAbet_phoneme, ...]}. First pronunciation only.
    """
    path = Path(path) if path else get_data_dir() / "cmu" / "cmudict-0.7b"
    result: dict[str, list[str]] = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;;"):
                continue
            parts = line.split(" ", 1)
            if len(parts) != 2:
                continue
            word = parts[0]
            if "(" in word:
                continue
            result[word.lower()] = parts[1].split()
    return result


def cmudict_to_phono(
    cmu: dict[str, list[str]] | None = None,
    arpa_map: dict[str, str] | None = None,
) -> dict[str, dict[str, Any]]:
    """Convert raw CMUdict to PhonoFeatures-compatible format.

    Returns:
        {word: {"phonemes": [ipa, ...], "ipa": "..."}}
        Compatible with LookupBuilder.add_phono().
    """
    if cmu is None:
        cmu = load_cmudict()
    if arpa_map is None:
        arpa_map = load_arpa_to_ipa()
    result: dict[str, dict[str, Any]] = {}
    for word, arpa_phones in cmu.items():
        ipa_phones = []
        for p in arpa_phones:
            ipa = arpa_map.get(p) or arpa_map.get(p.rstrip("012"), p)
            ipa_phones.append(ipa)
        result[word] = {
            "phonemes": ipa_phones,
            "ipa": "".join(ipa_phones),
        }
    return result
```
Note: Mapping loaders (load_arpa_to_ipa, load_ipa_to_arpa) live only in phonolex_data.mappings — no duplication.
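The parsing and stress-stripping logic above can be traced on a single CMUdict-style line (the three-entry arpa_map is an illustrative stub, not the bundled mapping):

```python
# One hypothetical CMUdict-format line: word, then space-separated ARPAbet
# phonemes with stress digits on vowels.
line = "CAT  K AE1 T"
parts = line.split(" ", 1)            # split word from pronunciation
word, arpa = parts[0].lower(), parts[1].split()

# Stub mapping for illustration; the real one is loaded from bundled JSON.
arpa_map = {"K": "k", "AE": "æ", "T": "t"}

# Same lookup as cmudict_to_phono: try the raw symbol first, then fall
# back to the symbol with stress digits (0/1/2) stripped.
ipa = [arpa_map.get(p) or arpa_map.get(p.rstrip("012"), p) for p in arpa]

assert word == "cat"
assert ipa == ["k", "æ", "t"]
assert "".join(ipa) == "kæt"
```

The fallback matters because ARPAbet vowels carry stress digits ("AE1") while the mapping is keyed on bare symbols ("AE").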
- [ ] Step 3: Create phoible.py

Create packages/data/src/phonolex_data/loaders/phoible.py:

```python
"""PHOIBLE phoneme feature vector loader."""
from __future__ import annotations

import csv
import json
from pathlib import Path

from phonolex_data.loaders._helpers import get_data_dir


def load_phoible(path: str | Path | None = None) -> dict[str, dict[str, str]]:
    """Load PHOIBLE English phoneme distinctive features.

    Returns:
        {phoneme: {feature: "+"/"-"/"0", ...}} with 37 binary/ternary features.
    """
    path = Path(path) if path else get_data_dir() / "phoible" / "phoible-english.csv"
    skip = {
        "InventoryID", "Glottocode", "ISO6393", "LanguageName",
        "SpecificDialect", "GlyphID", "Phoneme", "Allophones",
        "Marginal", "SegmentClass", "Source",
    }
    result: dict[str, dict[str, str]] = {}
    with open(path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            phoneme = row.get("Phoneme", "")
            if not phoneme:
                continue
            features = {k: v for k, v in row.items() if k not in skip and v}
            if phoneme not in result:
                result[phoneme] = features
    return result


def load_phonotactic_probability(
    path: str | Path | None = None,
) -> dict[str, dict[str, float]]:
    """Load phonotactic probability norms (Vitevitch & Luce 2004).

    Returns:
        {word: {phono_prob_avg, positional_prob_avg, num_biphones, num_segments, ...}}
    """
    path = Path(path) if path else get_data_dir() / "phonotactic_probability_full.json"
    with open(path) as f:
        data = json.load(f)
    return data["word_probabilities"]
```
Uses phoible-english.csv (the curated English-only CSV copied from diffusion-governors in Task 5), NOT the full PHOIBLE TSV. This matches the governor's expectations — comma-delimited, English phonemes only.
TODO (post-migration): Replace PHOIBLE feature vectors with our own. Initialize from basic articulatory data, tune with morphological/phonological datasets. Opens up licensing (PHOIBLE is CC-BY, constrains derivatives). Track in coherence pass.
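The metadata-skipping and first-occurrence-wins behavior of load_phoible can be exercised on an inline miniature CSV (hypothetical rows and a reduced skip set, not real PHOIBLE data):

```python
import csv
import io

# Reduced skip set for the demo; the real loader skips 11 metadata columns.
skip = {"Phoneme", "LanguageName"}

# Two rows for the same phoneme: only the first should be kept.
rows = csv.DictReader(io.StringIO(
    "Phoneme,LanguageName,consonantal,sonorant\n"
    "p,English,+,-\n"
    "p,English,-,+\n"
))
result: dict[str, dict[str, str]] = {}
for row in rows:
    ph = row["Phoneme"]
    # Drop metadata columns and empty feature values
    feats = {k: v for k, v in row.items() if k not in skip and v}
    if ph not in result:          # first occurrence wins
        result[ph] = feats

assert result == {"p": {"consonantal": "+", "sonorant": "-"}}
```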
- [ ] Step 4: Create norms.py

Create packages/data/src/phonolex_data/loaders/norms.py — copy all 11 norm loaders from datasets.py (lines 161-434), replacing DATA_DIR references with get_data_dir() calls. Functions: load_warriner, load_glasgow, load_concreteness, load_sensorimotor, load_kuperman, load_semantic_diversity, load_socialness, load_boi, load_subtlex, load_elp, load_iconicity.

Header:

```python
"""Psycholinguistic norm dataset loaders."""
from __future__ import annotations

import csv
from pathlib import Path

from phonolex_data.loaders._helpers import get_data_dir, require_openpyxl
```

Each function gets the same transformation: replace `DATA_DIR / "norms" / "..."` with `get_data_dir() / "norms" / "..."`.
- [ ] Step 5: Create associations.py

Create packages/data/src/phonolex_data/loaders/associations.py — copy load_swow and load_free_association from datasets.py (lines 441-495). Same DATA_DIR → get_data_dir() transformation.

Also copy load_simlex here (lines 502-518) since it's a benchmark loader, not a norm.

Header:

```python
"""Association and similarity benchmark loaders."""
from __future__ import annotations

import csv
from pathlib import Path

from phonolex_data.loaders._helpers import get_data_dir
```
- [ ] Step 6: Create vocab_lists.py

Create packages/data/src/phonolex_data/loaders/vocab_lists.py — copy all vocab loaders from datasets.py (lines 525-606). Same transformation.

Header:

```python
"""Curated vocabulary list loaders."""
from __future__ import annotations

from pathlib import Path

from phonolex_data.loaders._helpers import get_data_dir, load_vocab_dir
```

The _load_vocab_dir helper is now load_vocab_dir imported from _helpers.py. Update all calls: `_load_vocab_dir(path, prefix)` → `load_vocab_dir(path, prefix)`.
- [ ] Step 7: Update loaders/__init__.py with re-exports

Create packages/data/src/phonolex_data/loaders/__init__.py:

```python
"""Dataset loaders — single source of truth for all PhonoLex data loading."""
from phonolex_data.loaders.cmudict import load_cmudict, cmudict_to_phono
from phonolex_data.loaders.phoible import load_phoible, load_phonotactic_probability
from phonolex_data.loaders.norms import (
    load_warriner, load_glasgow, load_concreteness, load_sensorimotor,
    load_kuperman, load_semantic_diversity, load_socialness, load_boi,
    load_subtlex, load_elp, load_iconicity,
)
from phonolex_data.loaders.associations import load_swow, load_free_association, load_simlex
from phonolex_data.loaders.vocab_lists import (
    load_ogden, load_afinn, load_stop_words, load_swadesh,
    load_roget, load_gsl, load_avl, load_all_vocab,
)

__all__ = [
    "load_cmudict", "cmudict_to_phono",
    "load_phoible", "load_phonotactic_probability",
    "load_warriner", "load_glasgow", "load_concreteness", "load_sensorimotor",
    "load_kuperman", "load_semantic_diversity", "load_socialness", "load_boi",
    "load_subtlex", "load_elp", "load_iconicity",
    "load_swow", "load_free_association", "load_simlex",
    "load_ogden", "load_afinn", "load_stop_words", "load_swadesh",
    "load_roget", "load_gsl", "load_avl", "load_all_vocab",
]
```
Note: Mapping loaders (load_arpa_to_ipa, load_ipa_to_arpa) are NOT re-exported here. They live canonically in phonolex_data.mappings.
- [ ] Step 8: Copy and rewrite test_datasets.py

```bash
cp /Users/jneumann/Repos/diffusion-governors/tests/test_datasets.py packages/data/tests/test_datasets.py
```

Rewrite imports in the copied file: `from diffusion_governors.datasets import ...` → `from phonolex_data.loaders import ...`. Also update `from diffusion_governors import datasets` → `from phonolex_data import loaders`.

- [ ] Step 9: Commit

```bash
git add packages/data/src/phonolex_data/loaders/ packages/data/tests/test_datasets.py
git commit -m "add: split datasets.py into packages/data loaders modules"
```
### Task 7: Move phonology modules into packages/data/phonology/
Files:
- Move: src/phonolex/utils/syllabification.py → packages/data/src/phonolex_data/phonology/syllabification.py
- Move: workers/scripts/g2p_alignment.py → packages/data/src/phonolex_data/phonology/g2p_alignment.py
- Create: packages/data/src/phonolex_data/phonology/wcm.py
- Create: packages/data/src/phonolex_data/phonology/normalize.py
- [ ] Step 1: Move syllabification.py

```bash
cp src/phonolex/utils/syllabification.py packages/data/src/phonolex_data/phonology/syllabification.py
```

Using cp not git mv because src/phonolex/ will be deleted entirely later.

- [ ] Step 2: Move g2p_alignment.py and fix REPO_ROOT path

```bash
git mv packages/web/workers/scripts/g2p_alignment.py packages/data/src/phonolex_data/phonology/g2p_alignment.py
```

Using git mv here since the file is already tracked (it moved into packages/web/workers/scripts/ with the workers/ move in Task 2).

Important: After the move, fix the REPO_ROOT path calculation in g2p_alignment.py. The old location at workers/scripts/ needed 3 `.parent` calls to reach the repo root; the new location at packages/data/src/phonolex_data/phonology/ needs 6:

```python
# Old (from workers/scripts/):
# REPO_ROOT = Path(__file__).parent.parent.parent
# New (from packages/data/src/phonolex_data/phonology/):
REPO_ROOT = Path(__file__).resolve().parent.parent.parent.parent.parent.parent
```
Also fix export-to-d1.py if it references g2p_alignment.py — it now needs to import from phonolex_data.phonology.g2p_alignment or use the new path.
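The six-`.parent` count can be verified with pure path arithmetic, no filesystem needed (the /repo prefix is a hypothetical mount point):

```python
from pathlib import Path

# Hypothetical absolute path of the moved module
f = Path("/repo/packages/data/src/phonolex_data/phonology/g2p_alignment.py")

# One .parent per directory level: phonology → phonolex_data → src
# → packages/data → packages → repo root
root = f.parent.parent.parent.parent.parent.parent
assert root == Path("/repo")
```

The same counting argument applies to get_data_dir() in _helpers.py, which sits at the same depth under packages/data/src/.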
- [ ] Step 3: Create wcm.py (extracted from export-to-d1.py)

Create packages/data/src/phonolex_data/phonology/wcm.py:

```python
"""Word Complexity Measure (Stoel-Gammon 2010)."""
from __future__ import annotations

# WCM sound classes
VOWELS = {
    "i", "ɪ", "e", "ɛ", "æ", "ɑ", "ɔ", "o", "ʊ", "u",
    "ʌ", "ə", "ɚ", "ɝ", "eɪ", "aɪ", "ɔɪ", "aʊ", "oʊ",
}
VELARS = {"k", "g", "ŋ"}
LIQUIDS_RHOTICS = {"l", "ɹ", "r", "ɚ", "ɝ"}
FRICATIVES_AFFRICATES = {"f", "v", "θ", "ð", "s", "z", "ʃ", "ʒ", "h", "tʃ", "dʒ"}
VOICED_FRIC_AFFRIC = {"v", "ð", "z", "ʒ", "dʒ"}


def compute_wcm(phonemes: list[str], syllables: list[dict]) -> int:
    """Compute Word Complexity Measure (Stoel-Gammon 2010).

    Args:
        phonemes: List of IPA phoneme strings.
        syllables: List of syllable dicts with 'onset', 'nucleus', 'coda', 'stress' keys.

    Returns:
        Integer WCM score (higher = more complex).
    """
    if not phonemes or not syllables:
        return 0
    score = 0
    # 1. More than 2 syllables
    if len(syllables) > 2:
        score += 1
    # 2. Non-initial stress
    stress_positions = [
        i for i, syl in enumerate(syllables)
        if syl.get("stress", 0) in (1, 2)
    ]
    if stress_positions and stress_positions[0] > 0:
        score += 1
    # 3. Word-final consonant
    if phonemes and phonemes[-1] not in VOWELS:
        score += 1
    # 4. Consonant clusters
    for syl in syllables:
        if len(syl.get("onset", [])) >= 2:
            score += 1
        if len(syl.get("coda", [])) >= 2:
            score += 1
    # 5-8. Sound class counts
    for p in phonemes:
        p_base = p.replace("\u02c8", "").replace("\u02cc", "")
        if p_base in VELARS:
            score += 1
        if p_base in LIQUIDS_RHOTICS:
            score += 1
        if p_base in FRICATIVES_AFFRICATES:
            score += 1
        if p_base in VOICED_FRIC_AFFRIC:
            score += 1
    return score
```
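As a hand-worked check of the rules above, here is the tally for a hypothetical two-syllable input, /ˈbæskɪt/ ("basket"), under an assumed bæs.kɪt syllabification with initial stress:

```python
# Each variable is one WCM rule's contribution for /ˈbæskɪt/.
extra_syllables = 0      # rule 1: only 2 syllables; fires at > 2
non_initial_stress = 0   # rule 2: primary stress is on the first syllable
final_consonant = 1      # rule 3: word ends in /t/, a consonant
clusters = 0             # rule 4: every onset and coda has one consonant
velars = 1               # rule 5: /k/
liquids = 0              # rule 6: no liquids or rhotics
fricatives = 1           # rule 7: /s/
voiced_fricatives = 0    # rule 8: /s/ is voiceless

wcm = (extra_syllables + non_initial_stress + final_consonant + clusters
       + velars + liquids + fricatives + voiced_fricatives)
assert wcm == 3
```

Feeding the same phoneme list and syllable dicts to compute_wcm should produce the same total.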
- [ ] Step 4: Create normalize.py (IPA-canonical)

Create packages/data/src/phonolex_data/phonology/normalize.py:

```python
"""IPA normalization — canonical representation uses IPA codepoints.

ASCII g (U+0067) → IPA ɡ (U+0261) for the voiced velar stop.
This matches PhonoLex's web app and D1 database representation.
"""
from __future__ import annotations

# ASCII → IPA canonical mappings
_TO_IPA: dict[str, str] = {
    "g": "\u0261",  # ASCII g → IPA ɡ (voiced velar stop)
}

# Reverse mapping for interop with systems that use ASCII
_TO_ASCII: dict[str, str] = {v: k for k, v in _TO_IPA.items()}


def to_ipa(phoneme: str) -> str:
    """Normalize a phoneme string to IPA-canonical representation."""
    for ascii_char, ipa_char in _TO_IPA.items():
        phoneme = phoneme.replace(ascii_char, ipa_char)
    return phoneme


def to_ascii(phoneme: str) -> str:
    """Normalize a phoneme string to ASCII-safe representation.

    Use only for systems that cannot handle IPA codepoints.
    Prefer to_ipa() for all internal representations.
    """
    for ipa_char, ascii_char in _TO_ASCII.items():
        phoneme = phoneme.replace(ipa_char, ascii_char)
    return phoneme


def normalize_phoneme(phoneme: str) -> str:
    """Alias for to_ipa() — the canonical normalization direction."""
    return to_ipa(phoneme)


def normalize_phoneme_list(phonemes: list[str]) -> list[str]:
    """Normalize and deduplicate a list of phonemes to IPA-canonical."""
    return sorted(set(to_ipa(p) for p in phonemes))
```
- [ ] Step 5: Update phonology/__init__.py

```python
"""Phonological computation modules."""
from phonolex_data.phonology.normalize import to_ipa, to_ascii, normalize_phoneme, normalize_phoneme_list
from phonolex_data.phonology.wcm import compute_wcm

__all__ = [
    "to_ipa", "to_ascii", "normalize_phoneme", "normalize_phoneme_list",
    "compute_wcm",
]
```
Note: syllabification and g2p_alignment are not re-exported here — they have complex interfaces that consumers import directly.
- [ ] Step 6: Commit

```bash
git add packages/data/src/phonolex_data/phonology/
git commit -m "add: phonology modules (syllabification, wcm, normalize, g2p) in packages/data/"
```
### Task 8: Move graph builder, mappings, and finalize data package
Files:
- Copy: src/phonolex/build_phonological_graph.py → packages/data/src/phonolex_data/graph/build_phonological_graph.py
- Copy: data/mappings/arpa_to_ipa.json → packages/data/src/phonolex_data/mappings/arpa_to_ipa.json
- Copy: data/mappings/ipa_to_arpa.json → packages/data/src/phonolex_data/mappings/ipa_to_arpa.json
- Move: tests/test_g2p_alignment.py → packages/data/tests/test_g2p_alignment.py
- [ ] Step 1: Copy graph builder

```bash
cp src/phonolex/build_phonological_graph.py packages/data/src/phonolex_data/graph/build_phonological_graph.py
```

- [ ] Step 2: Copy mapping JSON files into package

```bash
cp data/mappings/arpa_to_ipa.json packages/data/src/phonolex_data/mappings/arpa_to_ipa.json
cp data/mappings/ipa_to_arpa.json packages/data/src/phonolex_data/mappings/ipa_to_arpa.json
```
These are bundled with the package so they're available after pip install. The canonical copies remain in data/mappings/ at repo root.
- [ ] Step 3: Create mappings/__init__.py loader

Create packages/data/src/phonolex_data/mappings/__init__.py:

```python
"""IPA/ARPAbet mapping data and loaders.

Canonical location for mapping loaders. All other modules that need
ARPA↔IPA conversion should import from here.
"""
import json
from pathlib import Path

_DIR = Path(__file__).parent


def load_arpa_to_ipa() -> dict[str, str]:
    """Load ARPAbet -> IPA mapping from bundled JSON."""
    with open(_DIR / "arpa_to_ipa.json") as f:
        return json.load(f)


def load_ipa_to_arpa() -> dict[str, str]:
    """Load IPA -> ARPAbet mapping from bundled JSON."""
    with open(_DIR / "ipa_to_arpa.json") as f:
        return json.load(f)
```
This is the one canonical location for mapping loaders. cmudict.py imports from here via from phonolex_data.mappings import load_arpa_to_ipa.
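The bundled-JSON pattern can be sketched end to end with a throwaway directory standing in for the package directory (the two-entry mapping is hypothetical, not the real bundled data):

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    pkg_dir = Path(d)  # stands in for _DIR = Path(__file__).parent
    (pkg_dir / "arpa_to_ipa.json").write_text(
        json.dumps({"AE": "æ", "K": "k"}), encoding="utf-8"
    )

    def load_arpa_to_ipa() -> dict[str, str]:
        """Load the mapping from JSON sitting next to the module."""
        with open(pkg_dir / "arpa_to_ipa.json", encoding="utf-8") as f:
            return json.load(f)

    mapping = load_arpa_to_ipa()

assert mapping == {"AE": "æ", "K": "k"}
```

Keying off `Path(__file__).parent` is what makes the data survive `pip install`: the JSON ships inside the wheel next to the module, so no repo checkout is needed at runtime.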
- [ ] Step 4: Move g2p test

```bash
git mv tests/test_g2p_alignment.py packages/data/tests/test_g2p_alignment.py
```
- [ ] Step 5: Create pyproject.toml for data package

Create packages/data/pyproject.toml:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "phonolex-data"
version = "0.1.0"
description = "Shared data layer for the PhonoLex platform"
license = "CC-BY-SA-3.0"
requires-python = ">=3.10"
dependencies = []

[project.optional-dependencies]
data = [
    "openpyxl>=3.0",
]
dev = [
    "pytest>=7.0",
    "ruff>=0.4",
    "openpyxl>=3.0",
]

[tool.hatch.build.targets.wheel]
packages = ["src/phonolex_data"]

[tool.ruff]
target-version = "py310"
line-length = 100

[tool.pytest.ini_options]
testpaths = ["tests"]
```
Uses src/ layout — `packages = ["src/phonolex_data"]` tells hatch to package the src/phonolex_data/ directory. After `pip install -e packages/data`, `from phonolex_data.loaders import ...` works.
- [ ] Step 6: Commit

```bash
git add packages/data/
git commit -m "add: graph builder, mappings, test, and pyproject.toml in packages/data/"
```
## Chunk 5: Drop Dead Code, Fix Paths

### Task 9: Drop dead code
Files:
- Delete: src/phonolex/ (entire directory)
- Delete: research/ (entire directory)
- Delete: python/ (entire directory)
- Delete: data/mappings/phoneme_mappings.py
- Delete: data/mappings/phoneme_vectorizer.py
- Delete: tests/ (now empty after g2p test moved)
- [ ] Step 1: Delete src/phonolex/

```bash
git rm -r src/phonolex/
rmdir src 2>/dev/null || true
```

This removes: embeddings/ (7 files), models/phonolex_bert.py, word_filter.py, tools/maximal_opposition.py, utils/extract_psycholinguistic_norms.py, utils/syllabification.py (already copied to packages/data), utils/__init__.py, build_phonological_graph.py (already copied), __init__.py.

- [ ] Step 2: Delete research/

```bash
git rm -r research/
```

- [ ] Step 3: Delete python/

```bash
git rm -r python/
```

- [ ] Step 4: Delete dead mapping code

```bash
git rm data/mappings/phoneme_mappings.py
git rm data/mappings/phoneme_vectorizer.py
```

- [ ] Step 5: Remove empty tests/ if present

```bash
rm -rf tests 2>/dev/null || true
```

Note: rmdir would fail if __pycache__/ exists inside. Use rm -rf since the test file was already moved out.

- [ ] Step 6: Commit

```bash
git add -A
git commit -m "drop: remove dead code (embeddings, BERT, research, old python package)"
```
### Task 10: Fix post-migration paths
Files:
- Modify: .github/workflows/deploy.yml
- Modify: .github/workflows/ci.yml
- Modify: package.json
- Modify: .gitignore
- Create: pyproject.toml (root workspace)
- [ ] Step 1: Update deploy.yml

In .github/workflows/deploy.yml, update working directories:

- `working-directory: ./workers` → `working-directory: ./packages/web/workers`
- `working-directory: ./webapp/frontend` → `working-directory: ./packages/web/frontend`
- `mkdocs build --site-dir webapp/frontend/dist/docs` → `mkdocs build --site-dir packages/web/frontend/dist/docs`
- Pages deploy: `working-directory: ./webapp/frontend` → `working-directory: ./packages/web/frontend`

- [ ] Step 2: Update ci.yml

In .github/workflows/ci.yml, update working directories:

- `working-directory: ./webapp/frontend` → `working-directory: ./packages/web/frontend`
- `working-directory: ./workers` → `working-directory: ./packages/web/workers`
- Cache key: `hashFiles('webapp/frontend/package-lock.json')` → `hashFiles('packages/web/frontend/package-lock.json')`
- Cache key: `hashFiles('workers/package-lock.json')` → `hashFiles('packages/web/workers/package-lock.json')`

- [ ] Step 3: Update root package.json
Update all --prefix paths:

```json
{
  "scripts": {
    "dev": "npm run dev --prefix packages/web/frontend",
    "build": "npm run build --prefix packages/web/frontend",
    "preview": "npm run preview --prefix packages/web/frontend",
    "test": "npm test --prefix packages/web/frontend",
    "type-check": "npm run type-check --prefix packages/web/frontend",
    "lint": "npm run lint --prefix packages/web/frontend",
    "lint:fix": "npm run lint:fix --prefix packages/web/frontend"
  }
}
```
- [ ] Step 4: Update .gitignore

Replace old paths with new:

```gitignore
# Build outputs
packages/web/frontend/dist/
packages/web/workers/.wrangler/
# Workers — D1 seed SQL (generated, large)
packages/web/workers/scripts/d1-seed.sql
# Dashboard — generated lookup files
packages/dashboard/lookups/
# G2P alignment output (generated, large)
data/g2p_alignment.json
```
Remove obsolete entries:
- webapp/frontend/dist/
- webapp/backend/dist/
- workers/scripts/d1-seed.sql
- workers/.wrangler/
- webapp/frontend/playwright-report/
- webapp/frontend/test-results/
- [ ] Step 5: Create root pyproject.toml (uv workspace)

Create pyproject.toml at repo root:

```toml
[project]
name = "phonolex"
version = "4.0.0"
description = "Phonological analysis and governed language generation platform"
license = "CC-BY-SA-3.0"
requires-python = ">=3.10"

[tool.uv.workspace]
members = [
    "packages/data",
    "packages/governors",
]

[tool.ruff]
target-version = "py310"
line-length = 100
```
- [ ] Step 6: Fix export-to-d1.py REPO_ROOT path

After moving to packages/web/workers/scripts/, the REPO_ROOT calculation in export-to-d1.py needs updating. It currently uses `Path(__file__).resolve().parent.parent.parent`, which resolved to the repo root from workers/scripts/. The new location is two directory levels deeper, so it needs two extra `.parent` calls (five total).

In packages/web/workers/scripts/export-to-d1.py, update any `Path(__file__).parent`-based repo root resolution to account for the new depth.

Also update export-to-d1.py to import compute_wcm from phonolex_data.phonology.wcm instead of defining it inline (coherence pass item; for now, the inline copy still works).
- [ ] Step 7: Flip normalization direction in build_lookup.py

In packages/dashboard/scripts/build_lookup.py, the IPA_NORMALIZE dict normalizes IPA→ASCII. Flip it to normalize ASCII→IPA to match the canonical direction:

```python
# Old (IPA → ASCII):
# IPA_NORMALIZE = {"\u0261": "g"}
# New (ASCII → IPA):
IPA_NORMALIZE = {"g": "\u0261"}
```

Update _normalize_phoneme() accordingly. This is the normalization decision from the spec (Section 4).
- [ ] Step 8: Commit

```bash
git add .github/workflows/ package.json .gitignore pyproject.toml packages/web/workers/scripts/ packages/dashboard/scripts/
git commit -m "fix: update CI, paths, normalization direction, and pyproject.toml for monorepo"
```
### Task 11: Update CLAUDE.md
Files:
- Modify: CLAUDE.md
- [ ] Step 1: Update project structure section
Replace the project structure in CLAUDE.md to reflect the new packages/ layout. Update:

- Architecture diagram: keep the same data flow, update path references
- Project Structure: reflect packages/{data,governors,web,dashboard}
- Dev Setup: update `cd workers` → `cd packages/web/workers`, `cd webapp/frontend` → `cd packages/web/frontend`
- Seeding D1: update script path
- Key Patterns: update file path references
- Gotchas: update `workers/scripts/config.py` → `packages/web/workers/scripts/config.py`, `workers/src/config/properties.ts` → `packages/web/workers/src/config/properties.ts`
- Inventio Data Contract: update references
- Add new section about packages/data and packages/governors

- [ ] Step 2: Commit
```bash
git add CLAUDE.md
git commit -m "docs: update CLAUDE.md for monorepo structure"
```
### Task 12: Verify migration
- [ ] Step 1: Check that no old paths remain in tracked files

```bash
git grep -l 'webapp/frontend' -- ':!docs/' ':!*.md' ':!CHANGELOG*'
git grep -l '"./workers"' -- ':!docs/' ':!*.md'
```

Expected: no results (or only documentation references that are acceptable).
- [ ] Step 2: Verify web workers package.json and wrangler.toml are intact

```bash
cat packages/web/workers/wrangler.toml
cat packages/web/workers/package.json | head -5
cat packages/web/frontend/package.json | head -5
```

Expected: files exist with correct content. Note wrangler.toml uses relative paths (src/index.ts), so it should work from its new location without changes.
- [ ] Step 3: Verify governor package structure

```bash
ls packages/governors/src/diffusion_governors/
```

Expected: __init__.py, core.py, gates.py, boosts.py, cdd.py, constraints.py, lookups.py
- [ ] Step 4: Verify data package structure

```bash
find packages/data/ -name '*.py' | sort
```

Expected: src/phonolex_data/{__init__}.py, src/phonolex_data/loaders/{__init__,_helpers,cmudict,norms,phoible,associations,vocab_lists}.py, src/phonolex_data/phonology/{__init__,syllabification,wcm,normalize,g2p_alignment}.py, src/phonolex_data/graph/{__init__,build_phonological_graph}.py, src/phonolex_data/mappings/{__init__}.py + 2 JSON files, tests/{__init__,test_g2p_alignment,test_datasets}.py
- [ ] Step 5: Verify no dead code remains

```bash
ls src/ 2>/dev/null && echo "ERROR: src/ still exists" || echo "OK: src/ removed"
ls research/ 2>/dev/null && echo "ERROR: research/ still exists" || echo "OK: research/ removed"
ls python/ 2>/dev/null && echo "ERROR: python/ still exists" || echo "OK: python/ removed"
ls workers/ 2>/dev/null && echo "ERROR: workers/ still exists" || echo "OK: workers/ removed"
ls webapp/ 2>/dev/null && echo "ERROR: webapp/ still exists" || echo "OK: webapp/ removed"
```

Expected: all "OK" lines.
- [ ] Step 6: Final commit if any fixups needed

```bash
git status
# If clean, no commit needed
# If fixups: git add -A && git commit -m "fix: migration cleanup"
```