Monorepo Migration Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Cannibalize diffusion-governors and constrained_chat into PhonoLex as a packages/ monorepo with a shared data layer.

Architecture: Create packages/{data,governors,web,dashboard}. Move existing PhonoLex web into packages/web/, copy governor engine into packages/governors/, copy dashboard into packages/dashboard/, assemble shared data layer from all three into packages/data/. Drop dead code. Fix paths.

Tech Stack: Python (uv workspaces), TypeScript (Hono/Cloudflare Workers, React), Git

Spec: docs/superpowers/specs/2026-03-13-monorepo-migration-design.md


Chunk 1: Branch, Scaffold, Move PhonoLex Web

Task 1: Create branch and scaffold packages/ directories

Files:
- Create: packages/data/src/phonolex_data/__init__.py
- Create: packages/data/src/phonolex_data/loaders/__init__.py
- Create: packages/data/src/phonolex_data/phonology/__init__.py
- Create: packages/data/src/phonolex_data/mappings/__init__.py
- Create: packages/data/src/phonolex_data/graph/__init__.py
- Create: packages/data/tests/__init__.py

Uses src/ layout (matching packages/governors/src/diffusion_governors/) so hatch packaging works correctly.

  • [ ] Step 1: Create branch off main
git checkout main
git pull origin main
git checkout -b feat/monorepo-migration
  • [ ] Step 2: Create packages/data/ directory structure with __init__.py files
mkdir -p packages/data/src/phonolex_data/{loaders,phonology,mappings,graph}
mkdir -p packages/data/tests

Create packages/data/src/phonolex_data/__init__.py:

"""phonolex_data — shared data layer for PhonoLex platform."""

Create empty __init__.py in each subdirectory (loaders/, phonology/, mappings/, graph/) and in tests/.

  • [ ] Step 3: Verify structure
find packages/ -type f | sort

Expected: 6 __init__.py files across packages/data/ tree.

  • [ ] Step 4: Commit
git add packages/
git commit -m "scaffold: create packages/data/ directory structure"

Task 2: Move PhonoLex web into packages/web/

Files:
- Move: workers/ → packages/web/workers/
- Move: webapp/frontend/ → packages/web/frontend/

  • [ ] Step 1: Create packages/web/ and move workers
mkdir -p packages/web
git mv workers packages/web/workers
  • [ ] Step 2: Move webapp/frontend
git mv webapp/frontend packages/web/frontend
  • [ ] Step 3: Remove empty webapp/ directory
rmdir webapp

If rmdir fails, webapp/ still has other contents — inspect them before deleting anything. Only frontend/ should have been there.

  • [ ] Step 4: Verify structure
ls packages/web/workers/src/
ls packages/web/frontend/src/

Expected: existing source files in both locations.

  • [ ] Step 5: Commit
git add -A
git commit -m "move: workers/ and webapp/frontend/ into packages/web/"

Chunk 2: Copy Governors

Task 3: Copy diffusion-governors engine into packages/governors/

Source: /Users/jneumann/Repos/diffusion-governors/

Files:
- Copy: src/diffusion_governors/{__init__,core,gates,boosts,cdd,constraints,lookups}.py → packages/governors/src/diffusion_governors/
- Copy: tests/{conftest,test_core}.py → packages/governors/tests/
- Copy: pyproject.toml → packages/governors/pyproject.toml

NOT copying: llada_sampler.py, mdlm_sampler.py, data/, models/, scripts/

  • [ ] Step 1: Create target directory
mkdir -p packages/governors/src/diffusion_governors
mkdir -p packages/governors/tests
  • [ ] Step 2: Copy engine modules (minus samplers)
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/core.py packages/governors/src/diffusion_governors/
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/gates.py packages/governors/src/diffusion_governors/
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/boosts.py packages/governors/src/diffusion_governors/
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/cdd.py packages/governors/src/diffusion_governors/
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/constraints.py packages/governors/src/diffusion_governors/
cp /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/lookups.py packages/governors/src/diffusion_governors/
  • [ ] Step 3: Create new __init__.py (remove sampler imports)

Create packages/governors/src/diffusion_governors/__init__.py:

"""Diffusion Governors — constraint layer for language model generation."""

from diffusion_governors.core import Governor, GovernorContext
from diffusion_governors.constraints import (
    Bound,
    Complexity,
    Density,
    Exclude,
    ExcludeInClusters,
    NormCovered,
    VocabOnly,
    ESSENTIAL_ENGLISH,
    STOP_WORDS,
)
from diffusion_governors.gates import HardGate
from diffusion_governors.boosts import LogitBoost
from diffusion_governors.cdd import CDDConstraint, CDDProjection
from diffusion_governors.lookups import (
    LookupBuilder,
    PhonoFeatures,
    Syllable,
    TokenFeatures,
)

__all__ = [
    "Governor",
    "GovernorContext",
    "Bound",
    "Complexity",
    "Density",
    "Exclude",
    "ExcludeInClusters",
    "NormCovered",
    "VocabOnly",
    "ESSENTIAL_ENGLISH",
    "STOP_WORDS",
    "HardGate",
    "LogitBoost",
    "CDDConstraint",
    "CDDProjection",
    "LookupBuilder",
    "PhonoFeatures",
    "Syllable",
    "TokenFeatures",
]

Note: Removed SamplerConfig, sample, LLaDASamplerConfig, llada_sample, and datasets imports. The datasets module is being replaced by phonolex_data.loaders.

  • [ ] Step 4: Copy tests (minus e2e sampler tests)
cp /Users/jneumann/Repos/diffusion-governors/tests/conftest.py packages/governors/tests/
cp /Users/jneumann/Repos/diffusion-governors/tests/test_core.py packages/governors/tests/

NOT copying: test_e2e.py, test_llada_e2e.py (sampler tests), test_datasets.py (moves to packages/data).

  • [ ] Step 5: Create pyproject.toml for governors

Create packages/governors/pyproject.toml:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "diffusion-governors"
version = "0.1.0"
description = "Constraint layer for language model generation"
license = "CC-BY-SA-3.0"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "ruff>=0.4",
]

[tool.hatch.build.targets.wheel]
packages = ["src/diffusion_governors"]

[tool.ruff]
target-version = "py310"
line-length = 100

[tool.pytest.ini_options]
testpaths = ["tests"]
markers = [
    "slow: requires model weights (deselect with '-m \"not slow\"')",
]

  • [ ] Step 6: Verify files exist
ls packages/governors/src/diffusion_governors/
ls packages/governors/tests/

Expected: 7 .py files in src, 2 in tests, plus pyproject.toml.

  • [ ] Step 7: Commit
git add packages/governors/
git commit -m "add: copy diffusion-governors engine into packages/governors/"

Chunk 3: Copy Dashboard

Task 4: Copy constrained_chat into packages/dashboard/

Source: /Users/jneumann/Repos/constrained_chat/

Files:
- Copy: server/ → packages/dashboard/server/
- Copy: frontend/ → packages/dashboard/frontend/
- Copy: scripts/build_lookup_phonolex.py → packages/dashboard/scripts/build_lookup.py
- Copy: scripts/generation_sweep.py → packages/dashboard/scripts/generation_sweep.py

NOT copying: phase*.py, lookups/, docs/, governor-t5-plan.md, WORKING_IMPLEMENTATIONS.md, patch_lookup_syllables.py

  • [ ] Step 1: Create target directories
mkdir -p packages/dashboard/scripts
  • [ ] Step 2: Copy server
cp -r /Users/jneumann/Repos/constrained_chat/server packages/dashboard/server
  • [ ] Step 3: Copy frontend
cp -r /Users/jneumann/Repos/constrained_chat/frontend packages/dashboard/frontend
  • [ ] Step 4: Copy scripts (rename build_lookup)
cp /Users/jneumann/Repos/constrained_chat/scripts/build_lookup_phonolex.py packages/dashboard/scripts/build_lookup.py
cp /Users/jneumann/Repos/constrained_chat/scripts/generation_sweep.py packages/dashboard/scripts/generation_sweep.py
  • [ ] Step 5: Remove frontend dist/ if copied (build artifact)
rm -rf packages/dashboard/frontend/dist
  • [ ] Step 6: Add lookups/ to .gitignore for dashboard

The dashboard generates large lookup JSON files. Add to the project .gitignore:

# Dashboard — generated lookup files
packages/dashboard/lookups/

  • [ ] Step 7: Verify structure
ls packages/dashboard/server/
ls packages/dashboard/frontend/src/
ls packages/dashboard/scripts/

Expected: server has main.py, model.py, governor.py, schemas.py, profiles.py, sessions.py, routes/, tests/. Frontend has React source. Scripts has build_lookup.py and generation_sweep.py.

  • [ ] Step 8: Commit
git add packages/dashboard/
git commit -m "add: copy constrained_chat into packages/dashboard/"

Chunk 4: Assemble Shared Data Layer

Task 5: Copy missing data files from diffusion-governors

Before creating loaders, ensure all data files exist in PhonoLex's data/ at repo root. Most norms and vocab files currently only exist in diffusion-governors.

  • [ ] Step 1: Copy vocab directories
cp -r /Users/jneumann/Repos/diffusion-governors/data/vocab data/vocab
  • [ ] Step 2: Copy all missing norms files
# Copy all norms files from diffusion-governors (will skip existing ones)
cp -n /Users/jneumann/Repos/diffusion-governors/data/norms/*.csv data/norms/
cp -n /Users/jneumann/Repos/diffusion-governors/data/norms/*.txt data/norms/
cp -n /Users/jneumann/Repos/diffusion-governors/data/norms/*.xlsx data/norms/
cp -rn /Users/jneumann/Repos/diffusion-governors/data/norms/swow data/norms/swow

This copies: Ratings_VAD_WarrinerEtAl.csv, kuperman_aoa.xlsx, subtlex_frequency.txt, Sensorimotor_norms.csv, SimLex-999.txt, free_association.txt, semantic_diversity.csv, SocialnessNorms_DiveicaPexmanBinney2021.csv, boi_pexman2019.xlsx, elp_items.csv, iconicity_ratings.csv, swow/ directory. PhonoLex already has concreteness.txt and GlasgowNorms.xlsx.

  • [ ] Step 3: Copy PHOIBLE English CSV

The governor loaders use a curated English-only phoneme CSV (comma-delimited), not PhonoLex's full PHOIBLE TSV. Copy it:

cp /Users/jneumann/Repos/diffusion-governors/data/phonology/phoible-english.csv data/phoible/phoible-english.csv
  • [ ] Step 4: Update .gitignore for data files

The *.csv glob in .gitignore will hide the new norms CSV files. Add exceptions:

# Data files needed by loaders
!data/norms/*.csv
!data/norms/*.txt
!data/norms/*.xlsx
!data/norms/swow/*.csv
!data/vocab/**/*.json
!data/phoible/*.csv
  • [ ] Step 5: Commit
git add data/ .gitignore
git commit -m "add: copy missing norms, vocab, and PHOIBLE data from diffusion-governors"

Task 6: Split datasets.py into packages/data/loaders/

Source: /Users/jneumann/Repos/diffusion-governors/src/diffusion_governors/datasets.py (606 LOC)

All files go under packages/data/src/phonolex_data/loaders/ (src layout matching governors).

Files:
- Create: packages/data/src/phonolex_data/loaders/_helpers.py
- Create: packages/data/src/phonolex_data/loaders/cmudict.py
- Create: packages/data/src/phonolex_data/loaders/norms.py
- Create: packages/data/src/phonolex_data/loaders/associations.py
- Create: packages/data/src/phonolex_data/loaders/phoible.py
- Create: packages/data/src/phonolex_data/loaders/vocab_lists.py
- Create: packages/data/src/phonolex_data/loaders/__init__.py (re-exports)

This is the riskiest step — actual refactoring, not just file movement. Every function gets assigned to a module, shared helpers get factored out, and all downstream from diffusion_governors.datasets import ... imports break simultaneously.

  • [ ] Step 1: Create shared _helpers module

Create packages/data/src/phonolex_data/loaders/_helpers.py:

"""Shared helpers for dataset loaders."""

from __future__ import annotations

import json
from pathlib import Path


def get_data_dir() -> Path:
    """Return the repo-root data/ directory.

    Looks for DATA_DIR env var first, then walks up from this file to find the
    repo root (identified by having a packages/ directory).
    """
    import os
    env = os.environ.get("DATA_DIR")
    if env:
        return Path(env)
    # Walk up: src/phonolex_data/loaders/ → src/phonolex_data/ → src/ → packages/data/ → packages/ → repo root
    return Path(__file__).resolve().parent.parent.parent.parent.parent.parent / "data"


def require_openpyxl():
    try:
        import openpyxl
        return openpyxl
    except ImportError:
        raise ImportError(
            "openpyxl is required for .xlsx files: pip install phonolex-data[data]"
        ) from None


def load_vocab_dir(dirpath: Path, prefix: str) -> dict[str, set[str]]:
    """Load all JSON word lists from a directory into {word: {membership, ...}}."""
    result: dict[str, set[str]] = {}
    for f in sorted(dirpath.glob("*.json")):
        list_name = f"{prefix}_{f.stem.lower()}"
        with open(f) as fh:
            words = json.load(fh)
        for word in words:
            w = word.strip().lower()
            if w:
                result.setdefault(w, set()).add(list_name)
    return result

Note: get_data_dir() walks 6 levels up from packages/data/src/phonolex_data/loaders/_helpers.py to reach repo root, then appends data/.
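The load_vocab_dir contract can be exercised end to end with throwaway JSON files. Below is a minimal restatement for illustration — the canonical helper lives in phonolex_data.loaders._helpers:

```python
import json
import tempfile
from pathlib import Path

# Minimal restatement of load_vocab_dir for illustration only.
def load_vocab_dir(dirpath: Path, prefix: str) -> dict[str, set[str]]:
    result: dict[str, set[str]] = {}
    for f in sorted(dirpath.glob("*.json")):
        list_name = f"{prefix}_{f.stem.lower()}"
        for word in json.loads(f.read_text()):
            w = word.strip().lower()
            if w:
                result.setdefault(w, set()).add(list_name)
    return result

# Exercise it on two throwaway word lists.
with tempfile.TemporaryDirectory() as d:
    Path(d, "Basic.json").write_text(json.dumps(["Cat", "dog "]))
    Path(d, "core.json").write_text(json.dumps(["dog", ""]))
    vocab = load_vocab_dir(Path(d), "ogden")

# Words are lowercased/stripped; empty entries dropped; memberships accumulate.
assert vocab == {"cat": {"ogden_basic"}, "dog": {"ogden_basic", "ogden_core"}}
```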

  • [ ] Step 2: Create cmudict.py

Create packages/data/src/phonolex_data/loaders/cmudict.py:

"""CMU Pronouncing Dictionary loader."""

from __future__ import annotations

from pathlib import Path
from typing import Any

from phonolex_data.loaders._helpers import get_data_dir
from phonolex_data.mappings import load_arpa_to_ipa


def load_cmudict(path: str | Path | None = None) -> dict[str, list[str]]:
    """Load CMU Pronouncing Dictionary (0.7b).

    Returns:
        {word: [ARPAbet_phoneme, ...]}. First pronunciation only.
    """
    path = Path(path) if path else get_data_dir() / "cmu" / "cmudict-0.7b"
    result: dict[str, list[str]] = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;;"):
                continue
            parts = line.split("  ", 1)
            if len(parts) != 2:
                continue
            word = parts[0]
            if "(" in word:
                continue
            result[word.lower()] = parts[1].split()
    return result


def cmudict_to_phono(
    cmu: dict[str, list[str]] | None = None,
    arpa_map: dict[str, str] | None = None,
) -> dict[str, dict[str, Any]]:
    """Convert raw CMUdict to PhonoFeatures-compatible format.

    Returns:
        {word: {"phonemes": [ipa, ...], "ipa": "..."}}
        Compatible with LookupBuilder.add_phono().
    """
    if cmu is None:
        cmu = load_cmudict()
    if arpa_map is None:
        arpa_map = load_arpa_to_ipa()

    result: dict[str, dict[str, Any]] = {}
    for word, arpa_phones in cmu.items():
        ipa_phones = []
        for p in arpa_phones:
            ipa = arpa_map.get(p) or arpa_map.get(p.rstrip("012"), p)
            ipa_phones.append(ipa)
        result[word] = {
            "phonemes": ipa_phones,
            "ipa": "".join(ipa_phones),
        }
    return result

Note: Mapping loaders (load_arpa_to_ipa, load_ipa_to_arpa) live only in phonolex_data.mappings — no duplication.
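The line-parsing rules in load_cmudict can be checked against inline sample lines — a sketch of the same logic, not the loader itself:

```python
# Same line-parsing rules as load_cmudict above, applied to inline samples.
def parse_cmudict_lines(lines: list[str]) -> dict[str, list[str]]:
    result: dict[str, list[str]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blanks and comment lines
        parts = line.split("  ", 1)  # word and phones separated by two spaces
        if len(parts) != 2:
            continue
        word = parts[0]
        if "(" in word:
            continue  # skip alternate pronunciations like CAT(1)
        result[word.lower()] = parts[1].split()
    return result

sample = [
    ";;; CMUdict 0.7b header comment",
    "CAT  K AE1 T",
    "CAT(1)  K AE1 T",
]
assert parse_cmudict_lines(sample) == {"cat": ["K", "AE1", "T"]}
```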

  • [ ] Step 3: Create phoible.py

Create packages/data/src/phonolex_data/loaders/phoible.py:

"""PHOIBLE phoneme feature vector loader."""

from __future__ import annotations

import csv
import json
from pathlib import Path

from phonolex_data.loaders._helpers import get_data_dir


def load_phoible(path: str | Path | None = None) -> dict[str, dict[str, str]]:
    """Load PHOIBLE English phoneme distinctive features.

    Returns:
        {phoneme: {feature: "+"/"-"/"0", ...}} with 37 binary/ternary features.
    """
    path = Path(path) if path else get_data_dir() / "phoible" / "phoible-english.csv"
    skip = {
        "InventoryID", "Glottocode", "ISO6393", "LanguageName",
        "SpecificDialect", "GlyphID", "Phoneme", "Allophones",
        "Marginal", "SegmentClass", "Source",
    }
    result: dict[str, dict[str, str]] = {}
    with open(path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            phoneme = row.get("Phoneme", "")
            if not phoneme:
                continue
            features = {k: v for k, v in row.items() if k not in skip and v}
            if phoneme not in result:
                result[phoneme] = features
    return result


def load_phonotactic_probability(
    path: str | Path | None = None,
) -> dict[str, dict[str, float]]:
    """Load phonotactic probability norms (Vitevitch & Luce 2004).

    Returns:
        {word: {phono_prob_avg, positional_prob_avg, num_biphones, num_segments, ...}}
    """
    path = Path(path) if path else get_data_dir() / "phonotactic_probability_full.json"
    with open(path) as f:
        data = json.load(f)
    return data["word_probabilities"]

Uses phoible-english.csv (the curated English-only CSV copied from diffusion-governors in Task 5), NOT the full PHOIBLE TSV. This matches the governor's expectations: comma-delimited, English phonemes only.
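A quick delimiter check guards against accidentally copying the full TSV instead of the curated CSV — a simple heuristic sketch that assumes only comma vs. tab are in play:

```python
# Heuristic check that a PHOIBLE header line is comma-delimited, not tab-delimited.
def is_comma_delimited(header_line: str) -> bool:
    return "\t" not in header_line and "," in header_line

assert is_comma_delimited("InventoryID,Glottocode,Phoneme,SegmentClass")
assert not is_comma_delimited("InventoryID\tGlottocode\tPhoneme\tSegmentClass")
```

Run it against the first line of data/phoible/phoible-english.csv after Task 5's copy.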

TODO (post-migration): Replace PHOIBLE feature vectors with our own. Initialize from basic articulatory data, tune with morphological/phonological datasets. Opens up licensing (PHOIBLE is CC-BY-SA, which constrains derivative works). Track in coherence pass.

  • [ ] Step 4: Create norms.py

Create packages/data/src/phonolex_data/loaders/norms.py — copy all 11 norm loaders from datasets.py (lines 161-434), replacing DATA_DIR references with get_data_dir() calls. Functions: load_warriner, load_glasgow, load_concreteness, load_sensorimotor, load_kuperman, load_semantic_diversity, load_socialness, load_boi, load_subtlex, load_elp, load_iconicity.

Header:

"""Psycholinguistic norm dataset loaders."""

from __future__ import annotations

import csv
from pathlib import Path

from phonolex_data.loaders._helpers import get_data_dir, require_openpyxl

Each function gets the same transformation: replace DATA_DIR / "norms" / "..." with get_data_dir() / "norms" / "...".
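For shape, one transformed loader might look like the sketch below for the concreteness norms. The column names ("Word", "Conc.M") and tab delimiter are assumptions to verify against the real file; the real function should also default its path via get_data_dir():

```python
import csv
import tempfile
from pathlib import Path

def load_concreteness(path: Path) -> dict[str, float]:
    # Sketch only: the "Word"/"Conc.M" columns and tab delimiter are
    # assumptions to verify against the real concreteness.txt. The real
    # loader should default path via get_data_dir() / "norms" / "concreteness.txt".
    result: dict[str, float] = {}
    with open(path) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            word = row["Word"].strip().lower()
            if word:
                result[word] = float(row["Conc.M"])
    return result

# Exercise the sketch on an inline sample file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Word\tConc.M\nCat\t4.68\n")
ratings = load_concreteness(Path(tmp.name))
assert ratings == {"cat": 4.68}
```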

  • [ ] Step 5: Create associations.py

Create packages/data/src/phonolex_data/loaders/associations.py — copy load_swow and load_free_association from datasets.py (lines 441-495). Same DATA_DIR → get_data_dir() transformation.

Also copy load_simlex here (lines 502-518) since it's a benchmark loader, not a norm.

Header:

"""Association and similarity benchmark loaders."""

from __future__ import annotations

import csv
from pathlib import Path

from phonolex_data.loaders._helpers import get_data_dir

  • [ ] Step 6: Create vocab_lists.py

Create packages/data/src/phonolex_data/loaders/vocab_lists.py — copy all vocab loaders from datasets.py (lines 525-606). Same transformation.

Header:

"""Curated vocabulary list loaders."""

from __future__ import annotations

from pathlib import Path

from phonolex_data.loaders._helpers import get_data_dir, load_vocab_dir

The _load_vocab_dir helper is now load_vocab_dir imported from _helpers.py. Update all calls: _load_vocab_dir(path, prefix) → load_vocab_dir(path, prefix).

  • [ ] Step 7: Update loaders/__init__.py with re-exports

Create packages/data/src/phonolex_data/loaders/__init__.py:

"""Dataset loaders — single source of truth for all PhonoLex data loading."""

from phonolex_data.loaders.cmudict import load_cmudict, cmudict_to_phono
from phonolex_data.loaders.phoible import load_phoible, load_phonotactic_probability
from phonolex_data.loaders.norms import (
    load_warriner, load_glasgow, load_concreteness, load_sensorimotor,
    load_kuperman, load_semantic_diversity, load_socialness, load_boi,
    load_subtlex, load_elp, load_iconicity,
)
from phonolex_data.loaders.associations import load_swow, load_free_association, load_simlex
from phonolex_data.loaders.vocab_lists import (
    load_ogden, load_afinn, load_stop_words, load_swadesh,
    load_roget, load_gsl, load_avl, load_all_vocab,
)

__all__ = [
    "load_cmudict", "cmudict_to_phono",
    "load_phoible", "load_phonotactic_probability",
    "load_warriner", "load_glasgow", "load_concreteness", "load_sensorimotor",
    "load_kuperman", "load_semantic_diversity", "load_socialness", "load_boi",
    "load_subtlex", "load_elp", "load_iconicity",
    "load_swow", "load_free_association", "load_simlex",
    "load_ogden", "load_afinn", "load_stop_words", "load_swadesh",
    "load_roget", "load_gsl", "load_avl", "load_all_vocab",
]

Note: Mapping loaders (load_arpa_to_ipa, load_ipa_to_arpa) are NOT re-exported here. They live canonically in phonolex_data.mappings.

  • [ ] Step 8: Copy and rewrite test_datasets.py
cp /Users/jneumann/Repos/diffusion-governors/tests/test_datasets.py packages/data/tests/test_datasets.py

Rewrite imports in the copied file: from diffusion_governors.datasets import ... → from phonolex_data.loaders import .... Also update from diffusion_governors import datasets → from phonolex_data import loaders.
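The two import rewrites can be scripted with word-boundary regexes instead of edited by hand (a sketch; if the test body also uses the bare module name, `datasets.` references need the same treatment):

```python
import re

REWRITES = [
    (r"\bdiffusion_governors\.datasets\b", "phonolex_data.loaders"),
    (r"\bfrom diffusion_governors import datasets\b",
     "from phonolex_data import loaders"),
]

def rewrite_imports(src: str) -> str:
    # Apply each pattern; the two patterns match disjoint import forms.
    for pattern, repl in REWRITES:
        src = re.sub(pattern, repl, src)
    return src

assert rewrite_imports("from diffusion_governors.datasets import load_swow") == \
    "from phonolex_data.loaders import load_swow"
assert rewrite_imports("from diffusion_governors import datasets") == \
    "from phonolex_data import loaders"
```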

  • [ ] Step 9: Commit
git add packages/data/src/phonolex_data/loaders/ packages/data/tests/test_datasets.py
git commit -m "add: split datasets.py into packages/data loaders modules"

Task 7: Move phonology modules into packages/data/phonology/

Files:
- Copy: src/phonolex/utils/syllabification.py → packages/data/src/phonolex_data/phonology/syllabification.py
- Move: packages/web/workers/scripts/g2p_alignment.py → packages/data/src/phonolex_data/phonology/g2p_alignment.py
- Create: packages/data/src/phonolex_data/phonology/wcm.py
- Create: packages/data/src/phonolex_data/phonology/normalize.py

  • [ ] Step 1: Copy syllabification.py
cp src/phonolex/utils/syllabification.py packages/data/src/phonolex_data/phonology/syllabification.py

Using cp not git mv because src/phonolex/ will be deleted entirely later.

  • [ ] Step 2: Move g2p_alignment.py and fix REPO_ROOT path
git mv packages/web/workers/scripts/g2p_alignment.py packages/data/src/phonolex_data/phonology/g2p_alignment.py

Using git mv here since the file is already tracked at this path (workers/ was moved into packages/web/ in Task 2).

Important: After the move, fix the REPO_ROOT path calculation in g2p_alignment.py. The old path was 3 parents up from workers/scripts/. From the new location at packages/data/src/phonolex_data/phonology/, the repo root is 6 parents up:

# Old (from workers/scripts/):
# REPO_ROOT = Path(__file__).parent.parent.parent
# New (from packages/data/src/phonolex_data/phonology/):
REPO_ROOT = Path(__file__).resolve().parent.parent.parent.parent.parent.parent

Also fix export-to-d1.py if it references g2p_alignment.py — it now needs to import from phonolex_data.phonology.g2p_alignment or use the new path.
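Chained .parent calls are easy to miscount; Path.parents indexing is an equivalent form that is easier to audit (a sketch using a made-up absolute path):

```python
from pathlib import Path

# parents[0] is the containing directory, so parents[5] is six levels up —
# equivalent to six chained .parent calls.
p = Path("/repo/packages/data/src/phonolex_data/phonology/g2p_alignment.py")
six_parents = p.parent.parent.parent.parent.parent.parent
assert six_parents == p.parents[5] == Path("/repo")
```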

  • [ ] Step 3: Create wcm.py (extracted from export-to-d1.py)

Create packages/data/src/phonolex_data/phonology/wcm.py:

"""Word Complexity Measure (Stoel-Gammon 2010)."""

from __future__ import annotations

# WCM sound classes
VOWELS = {
    "i", "ɪ", "e", "ɛ", "æ", "ɑ", "ɔ", "o", "ʊ", "u",
    "ʌ", "ə", "ɚ", "ɝ", "eɪ", "aɪ", "ɔɪ", "aʊ", "oʊ",
}
VELARS = {"k", "g", "ŋ"}
LIQUIDS_RHOTICS = {"l", "ɹ", "r", "ɚ", "ɝ"}
FRICATIVES_AFFRICATES = {"f", "v", "θ", "ð", "s", "z", "ʃ", "ʒ", "h", "tʃ", "dʒ"}
VOICED_FRIC_AFFRIC = {"v", "ð", "z", "ʒ", "dʒ"}


def compute_wcm(phonemes: list[str], syllables: list[dict]) -> int:
    """Compute Word Complexity Measure (Stoel-Gammon 2010).

    Args:
        phonemes: List of IPA phoneme strings.
        syllables: List of syllable dicts with 'onset', 'nucleus', 'coda', 'stress' keys.

    Returns:
        Integer WCM score (higher = more complex).
    """
    if not phonemes or not syllables:
        return 0

    score = 0

    # 1. More than 2 syllables
    if len(syllables) > 2:
        score += 1

    # 2. Non-initial stress
    stress_positions = [
        i for i, syl in enumerate(syllables)
        if syl.get("stress", 0) in (1, 2)
    ]
    if stress_positions and stress_positions[0] > 0:
        score += 1

    # 3. Word-final consonant
    if phonemes and phonemes[-1] not in VOWELS:
        score += 1

    # 4. Consonant clusters
    for syl in syllables:
        if len(syl.get("onset", [])) >= 2:
            score += 1
        if len(syl.get("coda", [])) >= 2:
            score += 1

    # 5-8. Sound class counts
    for p in phonemes:
        p_base = p.replace("\u02c8", "").replace("\u02cc", "")
        if p_base in VELARS:
            score += 1
        if p_base in LIQUIDS_RHOTICS:
            score += 1
        if p_base in FRICATIVES_AFFRICATES:
            score += 1
        if p_base in VOICED_FRIC_AFFRIC:
            score += 1

    return score
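As a sanity check on the scoring rules, here is a hand tally for "rabbit" /ˈɹæbɪt/ — a worked example, assuming the two-syllable split shown in the comments:

```python
# WCM hand-tally for "rabbit" /ˈɹæbɪt/:
#   syllables: [ɹæ] (stress 1) + [bɪt] — single-consonant onsets, one coda
points = {
    "more_than_2_syllables": 0,  # only 2 syllables
    "non_initial_stress": 0,     # primary stress on the first syllable
    "word_final_consonant": 1,   # ends in /t/
    "consonant_clusters": 0,     # no onset or coda of length >= 2
    "velars": 0,
    "liquids_rhotics": 1,        # /ɹ/
    "fricatives_affricates": 0,
    "voiced_fric_affric": 0,
}
# Matches what compute_wcm should return for this input.
assert sum(points.values()) == 2
```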

  • [ ] Step 4: Create normalize.py (IPA-canonical)

Create packages/data/src/phonolex_data/phonology/normalize.py:

"""IPA normalization — canonical representation uses IPA codepoints.

ASCII g (U+0067) → IPA ɡ (U+0261) for the voiced velar stop.
This matches PhonoLex's web app and D1 database representation.
"""

from __future__ import annotations

# ASCII → IPA canonical mappings
_TO_IPA: dict[str, str] = {
    "g": "\u0261",  # ASCII g → IPA ɡ (voiced velar stop)
}

# Reverse mapping for interop with systems that use ASCII
_TO_ASCII: dict[str, str] = {v: k for k, v in _TO_IPA.items()}


def to_ipa(phoneme: str) -> str:
    """Normalize a phoneme string to IPA-canonical representation."""
    for ascii_char, ipa_char in _TO_IPA.items():
        phoneme = phoneme.replace(ascii_char, ipa_char)
    return phoneme


def to_ascii(phoneme: str) -> str:
    """Normalize a phoneme string to ASCII-safe representation.

    Use only for systems that cannot handle IPA codepoints.
    Prefer to_ipa() for all internal representations.
    """
    for ipa_char, ascii_char in _TO_ASCII.items():
        phoneme = phoneme.replace(ipa_char, ascii_char)
    return phoneme


def normalize_phoneme(phoneme: str) -> str:
    """Alias for to_ipa() — the canonical normalization direction."""
    return to_ipa(phoneme)


def normalize_phoneme_list(phonemes: list[str]) -> list[str]:
    """Normalize and deduplicate a list of phonemes to IPA-canonical."""
    return sorted(set(to_ipa(p) for p in phonemes))
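Round-trip behavior of the normalization, with the single mapping restated inline for illustration:

```python
# Inline restatement of the ASCII→IPA mapping for a round-trip check.
_TO_IPA = {"g": "\u0261"}  # ASCII g → IPA ɡ (voiced velar stop)
_TO_ASCII = {v: k for k, v in _TO_IPA.items()}

def to_ipa(s: str) -> str:
    for ascii_char, ipa_char in _TO_IPA.items():
        s = s.replace(ascii_char, ipa_char)
    return s

def to_ascii(s: str) -> str:
    for ipa_char, ascii_char in _TO_ASCII.items():
        s = s.replace(ipa_char, ascii_char)
    return s

assert to_ipa("gæg") == "\u0261æ\u0261"   # both g's normalized
assert to_ascii(to_ipa("gæg")) == "gæg"   # round-trips cleanly
assert to_ipa("kæt") == "kæt"             # strings without g are unchanged
```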

  • [ ] Step 5: Update phonology/__init__.py
"""Phonological computation modules."""

from phonolex_data.phonology.normalize import to_ipa, to_ascii, normalize_phoneme, normalize_phoneme_list
from phonolex_data.phonology.wcm import compute_wcm

__all__ = [
    "to_ipa", "to_ascii", "normalize_phoneme", "normalize_phoneme_list",
    "compute_wcm",
]

Note: syllabification and g2p_alignment are not re-exported here — they have complex interfaces that consumers import directly.

  • [ ] Step 6: Commit
git add packages/data/src/phonolex_data/phonology/
git commit -m "add: phonology modules (syllabification, wcm, normalize, g2p) in packages/data/"

Task 8: Move graph builder, mappings, and finalize data package

Files:
- Copy: src/phonolex/build_phonological_graph.py → packages/data/src/phonolex_data/graph/build_phonological_graph.py
- Copy: data/mappings/arpa_to_ipa.json → packages/data/src/phonolex_data/mappings/arpa_to_ipa.json
- Copy: data/mappings/ipa_to_arpa.json → packages/data/src/phonolex_data/mappings/ipa_to_arpa.json
- Move: tests/test_g2p_alignment.py → packages/data/tests/test_g2p_alignment.py

  • [ ] Step 1: Copy graph builder
cp src/phonolex/build_phonological_graph.py packages/data/src/phonolex_data/graph/build_phonological_graph.py
  • [ ] Step 2: Copy mapping JSON files into package
cp data/mappings/arpa_to_ipa.json packages/data/src/phonolex_data/mappings/arpa_to_ipa.json
cp data/mappings/ipa_to_arpa.json packages/data/src/phonolex_data/mappings/ipa_to_arpa.json

These are bundled with the package so they're available after pip install. The canonical copies remain in data/mappings/ at repo root.

  • [ ] Step 3: Create mappings/__init__.py loader

Create packages/data/src/phonolex_data/mappings/__init__.py:

"""IPA/ARPAbet mapping data and loaders.

Canonical location for mapping loaders. All other modules that need
ARPA↔IPA conversion should import from here.
"""

import json
from pathlib import Path

_DIR = Path(__file__).parent


def load_arpa_to_ipa() -> dict[str, str]:
    """Load ARPAbet -> IPA mapping from bundled JSON."""
    with open(_DIR / "arpa_to_ipa.json") as f:
        return json.load(f)


def load_ipa_to_arpa() -> dict[str, str]:
    """Load IPA -> ARPAbet mapping from bundled JSON."""
    with open(_DIR / "ipa_to_arpa.json") as f:
        return json.load(f)

This is the one canonical location for mapping loaders. cmudict.py imports from here via from phonolex_data.mappings import load_arpa_to_ipa.
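For zip-safe installs, importlib.resources is an alternative to Path(__file__) for reading the bundled JSON — a sketch, assuming the JSON files are declared as package data so they ship in the wheel; the plan's Path(__file__) approach works fine for ordinary and editable installs:

```python
import json
from importlib import resources

def load_arpa_to_ipa_zipsafe() -> dict[str, str]:
    # Hypothetical alternative loader using importlib.resources (Python 3.9+).
    text = (
        resources.files("phonolex_data.mappings")
        .joinpath("arpa_to_ipa.json")
        .read_text()
    )
    return json.loads(text)

# The same API demonstrated against a stdlib package, runnable anywhere.
assert resources.files("email").joinpath("__init__.py").is_file()
```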

  • [ ] Step 4: Move g2p test
git mv tests/test_g2p_alignment.py packages/data/tests/test_g2p_alignment.py
  • [ ] Step 5: Create pyproject.toml for data package

Create packages/data/pyproject.toml:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "phonolex-data"
version = "0.1.0"
description = "Shared data layer for the PhonoLex platform"
license = "CC-BY-SA-3.0"
requires-python = ">=3.10"
dependencies = []

[project.optional-dependencies]
data = [
    "openpyxl>=3.0",
]
dev = [
    "pytest>=7.0",
    "ruff>=0.4",
    "openpyxl>=3.0",
]

[tool.hatch.build.targets.wheel]
packages = ["src/phonolex_data"]

[tool.ruff]
target-version = "py310"
line-length = 100

[tool.pytest.ini_options]
testpaths = ["tests"]

Uses src/ layout — packages = ["src/phonolex_data"] tells hatch to package the src/phonolex_data/ directory. After pip install -e packages/data, from phonolex_data.loaders import ... works.

  • [ ] Step 6: Commit
git add packages/data/
git commit -m "add: graph builder, mappings, test, and pyproject.toml in packages/data/"

Chunk 5: Drop Dead Code, Fix Paths

Task 9: Drop dead code

Files:
- Delete: src/phonolex/ (entire directory)
- Delete: research/ (entire directory)
- Delete: python/ (entire directory)
- Delete: data/mappings/phoneme_mappings.py
- Delete: data/mappings/phoneme_vectorizer.py
- Delete: tests/ (now empty after g2p test moved)

  • [ ] Step 1: Delete src/phonolex/
git rm -r src/phonolex/
rmdir src 2>/dev/null || true

This removes: embeddings/ (7 files), models/phonolex_bert.py, word_filter.py, tools/maximal_opposition.py, utils/extract_psycholinguistic_norms.py, utils/syllabification.py (already copied to packages/data), utils/__init__.py, build_phonological_graph.py (already copied), __init__.py.

  • [ ] Step 2: Delete research/
git rm -r research/
  • [ ] Step 3: Delete python/
git rm -r python/
  • [ ] Step 4: Delete dead mapping code
git rm data/mappings/phoneme_mappings.py
git rm data/mappings/phoneme_vectorizer.py
  • [ ] Step 5: Remove empty tests/ if present
rm -rf tests 2>/dev/null || true

Note: rmdir would fail if __pycache__/ exists inside. Use rm -rf since the test file was already moved out.

  • [ ] Step 6: Commit
git add -A
git commit -m "drop: remove dead code (embeddings, BERT, research, old python package)"

Task 10: Fix post-migration paths

Files:
- Modify: .github/workflows/deploy.yml
- Modify: .github/workflows/ci.yml
- Modify: package.json
- Modify: .gitignore
- Create: pyproject.toml (root workspace)

  • [ ] Step 1: Update deploy.yml

In .github/workflows/deploy.yml, update working directories:

  • working-directory: ./workers → working-directory: ./packages/web/workers
  • working-directory: ./webapp/frontend → working-directory: ./packages/web/frontend
  • mkdocs build --site-dir webapp/frontend/dist/docs → mkdocs build --site-dir packages/web/frontend/dist/docs
  • Pages deploy working-directory: ./webapp/frontend → working-directory: ./packages/web/frontend

  • [ ] Step 2: Update ci.yml

In .github/workflows/ci.yml, update working directories:

  • working-directory: ./webapp/frontend → working-directory: ./packages/web/frontend
  • working-directory: ./workers → working-directory: ./packages/web/workers
  • Cache key: hashFiles('webapp/frontend/package-lock.json') → hashFiles('packages/web/frontend/package-lock.json')
  • Cache key: hashFiles('workers/package-lock.json') → hashFiles('packages/web/workers/package-lock.json')

- [ ] Step 3: Update root package.json

Update all `--prefix` paths:

```json
{
  "scripts": {
    "dev": "npm run dev --prefix packages/web/frontend",
    "build": "npm run build --prefix packages/web/frontend",
    "preview": "npm run preview --prefix packages/web/frontend",
    "test": "npm test --prefix packages/web/frontend",
    "type-check": "npm run type-check --prefix packages/web/frontend",
    "lint": "npm run lint --prefix packages/web/frontend",
    "lint:fix": "npm run lint:fix --prefix packages/web/frontend"
  }
}
```
- [ ] Step 4: Update .gitignore

Replace old paths with new:

```
# Build outputs
packages/web/frontend/dist/
packages/web/workers/.wrangler/

# Workers — D1 seed SQL (generated, large)
packages/web/workers/scripts/d1-seed.sql

# Dashboard — generated lookup files
packages/dashboard/lookups/

# G2P alignment output (generated, large)
data/g2p_alignment.json
```

Remove obsolete entries:

- webapp/frontend/dist/
- webapp/backend/dist/
- workers/scripts/d1-seed.sql
- workers/.wrangler/
- webapp/frontend/playwright-report/
- webapp/frontend/test-results/

- [ ] Step 5: Create root pyproject.toml (uv workspace)

Create pyproject.toml at repo root:

```toml
[project]
name = "phonolex"
version = "4.0.0"
description = "Phonological analysis and governed language generation platform"
license = "CC-BY-SA-3.0"
requires-python = ">=3.10"

[tool.uv.workspace]
members = [
    "packages/data",
    "packages/governors",
]

[tool.ruff]
target-version = "py310"
line-length = 100
```
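The root workspace table only declares members; each member ships its own pyproject.toml. As a hedged sketch (the package name, version, and dependency shown are assumptions for illustration, not taken from the spec), a member such as packages/governors could depend on the shared data layer via a uv workspace source:

```toml
# Hypothetical packages/governors/pyproject.toml fragment (names/versions assumed).
[project]
name = "diffusion-governors"
version = "0.1.0"
dependencies = ["phonolex-data"]

# Resolve phonolex-data from the workspace member at packages/data
# instead of from a package index.
[tool.uv.sources]
phonolex-data = { workspace = true }
```

With this in place, `uv sync` at the repo root installs both members in editable mode against each other.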

- [ ] Step 6: Fix export-to-d1.py REPO_ROOT path

After the move to packages/web/workers/scripts/, the REPO_ROOT calculation in export-to-d1.py needs updating. It currently uses Path(__file__).resolve().parent.parent.parent, which resolved to the repo root from workers/scripts/. The script now sits two directory levels deeper, so it needs two extra .parent hops:

In packages/web/workers/scripts/export-to-d1.py, update any Path(__file__).parent based repo-root resolution to account for the new depth.

Also update export-to-d1.py to import compute_wcm from phonolex_data.phonology.wcm instead of defining it inline (coherence-pass item — for now, the inline copy still works).
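The depth change can be illustrated with pathlib (the absolute path and variable name here are hypothetical; the real code in export-to-d1.py may differ):

```python
from pathlib import PurePosixPath

# Hypothetical absolute location of the script after the move.
script = PurePosixPath("/repo/packages/web/workers/scripts/export-to-d1.py")

# Old pattern: three .parent hops reached the repo root from workers/scripts/.
# From the new, two-levels-deeper location it lands on packages/web instead:
print(script.parent.parent.parent)   # /repo/packages/web — no longer the root

# Corrected: climb five levels (parents[4]) to reach the repo root again.
REPO_ROOT = script.parents[4]
print(REPO_ROOT)                     # /repo
```

Note that `parents[N]` counts from zero (`parents[0]` is the containing directory), so `parents[4]` is equivalent to five `.parent` hops.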

- [ ] Step 7: Flip normalization direction in build_lookup.py

In packages/dashboard/scripts/build_lookup.py, the IPA_NORMALIZE dict normalizes IPA→ASCII. Flip it to normalize ASCII→IPA to match the canonical direction:

```python
# Old (IPA → ASCII):
# IPA_NORMALIZE = {"\u0261": "g"}
# New (ASCII → IPA):
IPA_NORMALIZE = {"g": "\u0261"}
```

Update _normalize_phoneme() accordingly. This is the normalization decision from the spec (Section 4).

- [ ] Step 8: Commit

```bash
git add .github/workflows/ package.json .gitignore pyproject.toml packages/web/workers/scripts/ packages/dashboard/scripts/
git commit -m "fix: update CI, paths, normalization direction, and pyproject.toml for monorepo"
```

Task 11: Update CLAUDE.md

Files: - Modify: CLAUDE.md

- [ ] Step 1: Update project structure section

Replace the project structure in CLAUDE.md to reflect the new packages/ layout. Update:

- Architecture diagram: keep the same data flow, update path references
- Project Structure: reflect packages/{data,governors,web,dashboard}
- Dev Setup: update `cd workers` → `cd packages/web/workers`, `cd webapp/frontend` → `cd packages/web/frontend`
- Seeding D1: update script path
- Key Patterns: update file path references
- Gotchas: update `workers/scripts/config.py` → `packages/web/workers/scripts/config.py`, `workers/src/config/properties.ts` → `packages/web/workers/src/config/properties.ts`
- Inventio Data Contract: update references
- Add a new section about packages/data and packages/governors

- [ ] Step 2: Commit

```bash
git add CLAUDE.md
git commit -m "docs: update CLAUDE.md for monorepo structure"
```

Task 12: Verify migration

- [ ] Step 1: Check that no old paths remain in tracked files

```bash
git grep -l 'webapp/frontend' -- ':!docs/' ':!*.md' ':!CHANGELOG*'
git grep -l '"./workers"' -- ':!docs/' ':!*.md'
```

Expected: no results (or only documentation references that are acceptable).

- [ ] Step 2: Verify web workers package.json and wrangler.toml are intact

```bash
cat packages/web/workers/wrangler.toml
head -5 packages/web/workers/package.json
head -5 packages/web/frontend/package.json
```

Expected: files exist with correct content. Note wrangler.toml uses relative paths (src/index.ts) so it should work from its new location without changes.

- [ ] Step 3: Verify governor package structure

```bash
ls packages/governors/src/diffusion_governors/
```

Expected: __init__.py, core.py, gates.py, boosts.py, cdd.py, constraints.py, lookups.py

- [ ] Step 4: Verify data package structure

```bash
find packages/data/ -name '*.py' | sort
```

Expected: src/phonolex_data/{__init__}.py, src/phonolex_data/loaders/{__init__,_helpers,cmudict,norms,phoible,associations,vocab_lists}.py, src/phonolex_data/phonology/{__init__,syllabification,wcm,normalize,g2p_alignment}.py, src/phonolex_data/graph/{__init__,build_phonological_graph}.py, src/phonolex_data/mappings/{__init__}.py + 2 JSON files, tests/{__init__,test_g2p_alignment,test_datasets}.py

- [ ] Step 5: Verify no dead code remains

```bash
ls src/ 2>/dev/null && echo "ERROR: src/ still exists" || echo "OK: src/ removed"
ls research/ 2>/dev/null && echo "ERROR: research/ still exists" || echo "OK: research/ removed"
ls python/ 2>/dev/null && echo "ERROR: python/ still exists" || echo "OK: python/ removed"
ls workers/ 2>/dev/null && echo "ERROR: workers/ still exists" || echo "OK: workers/ removed"
ls webapp/ 2>/dev/null && echo "ERROR: webapp/ still exists" || echo "OK: webapp/ removed"
```

Expected: all "OK" lines.

- [ ] Step 6: Final commit if any fixups needed

```bash
git status
# If clean, no commit needed
# If fixups: git add -A && git commit -m "fix: migration cleanup"
```