Monorepo Migration Design

2026-03-13 — Rough migration of diffusion-governors and constrained_chat into PhonoLex

Goal

Cannibalize diffusion-governors and constrained_chat into the PhonoLex repo as a packages/ monorepo. No git history migration. Drop dead code. Extract a shared data layer. Get the structure right; coherence pass follows.

Approach

Approach C: full restructure in one pass, accept that CI/CD breaks until paths are fixed (bounded follow-up work).

Branch: feat/monorepo-migration off main.

External repos (diffusion-governors, constrained_chat) stay as separate repos, frozen. Not archived or deleted — just no longer the active development location.

1. Target Structure

PhonoLex/
├── packages/
│   ├── data/                        # Shared data layer (NEW — assembled from all three repos)
│   │   ├── __init__.py              # Package name: phonolex_data
│   │   ├── loaders/                 # One set of dataset loaders
│   │   │   ├── __init__.py
│   │   │   ├── cmudict.py           # CMU dict → phonological features
│   │   │   ├── norms.py             # All psycholinguistic norm datasets (11 loaders)
│   │   │   ├── associations.py      # SWOW, free association
│   │   │   ├── phoible.py           # Phoneme feature vectors
│   │   │   └── vocab_lists.py       # Ogden, Roget, stop words, AFINN, etc.
│   │   ├── phonology/               # Phonological computation
│   │   │   ├── __init__.py
│   │   │   ├── syllabification.py   # ← from src/phonolex/utils/
│   │   │   ├── wcm.py               # Word Complexity Measure (extracted from export-to-d1.py)
│   │   │   ├── normalize.py         # IPA normalization (canonical: IPA ɡ U+0261)
│   │   │   └── g2p_alignment.py     # ← from workers/scripts/ (shared by web + dashboard)
│   │   ├── mappings/                # IPA/ARPAbet conversion
│   │   │   ├── __init__.py          # Loader helpers
│   │   │   ├── arpa_to_ipa.json
│   │   │   └── ipa_to_arpa.json
│   │   ├── graph/                   # Pickle builder
│   │   │   ├── __init__.py
│   │   │   └── build_phonological_graph.py  # ← from src/phonolex/
│   │   ├── tests/
│   │   │   └── test_g2p_alignment.py  # ← from tests/
│   │   └── pyproject.toml
│   │
│   ├── governors/                   # Constraint engine (← diffusion-governors)
│   │   ├── src/diffusion_governors/
│   │   │   ├── __init__.py
│   │   │   ├── core.py             # Governor, GovernorContext, Mechanism
│   │   │   ├── gates.py            # HardGate
│   │   │   ├── boosts.py           # LogitBoost
│   │   │   ├── cdd.py              # CDDProjection
│   │   │   ├── constraints.py      # 15 declarative constraints
│   │   │   └── lookups.py          # LookupBuilder, TokenFeatures
│   │   ├── tests/
│   │   └── pyproject.toml
│   │
│   ├── web/                         # PhonoLex web app (← workers/ + webapp/)
│   │   ├── workers/                 # Hono API + D1
│   │   │   ├── src/
│   │   │   ├── scripts/
│   │   │   │   ├── export-to-d1.py
│   │   │   │   └── config.py
│   │   │   ├── wrangler.toml       # Paths updated post-migration
│   │   │   └── package.json
│   │   └── frontend/                # React + MUI
│   │       ├── src/
│   │       └── package.json
│   │
│   └── dashboard/                   # Governed Chat (← constrained_chat)
│       ├── server/                  # FastAPI backend
│       │   ├── main.py
│       │   ├── model.py
│       │   ├── governor.py          # HFGovernorProcessor + GovernorCache stay here
│       │   ├── schemas.py
│       │   ├── profiles.py
│       │   ├── sessions.py
│       │   ├── routes/
│       │   └── tests/
│       ├── frontend/                # React + Tailwind
│       │   └── src/
│       └── scripts/
│           ├── build_lookup.py      # ← build_lookup_phonolex.py (renamed)
│           └── generation_sweep.py
│
├── data/                            # Raw data files (one copy, stays at root)
├── docs/
├── tests/                           # Cross-package integration tests
├── .github/workflows/
├── CLAUDE.md
└── pyproject.toml                   # Workspace root (uv workspaces)

2. What Gets Dropped

From src/phonolex/ (dead code)

embeddings/ (7 files) — dead approach
models/phonolex_bert.py — dead
word_filter.py — superseded by Workers API (filters.ts, patterns.ts)
tools/maximal_opposition.py — superseded by contrastive.ts route
utils/extract_psycholinguistic_norms.py — superseded by export-to-d1.py

From data/mappings/ (dead code)

phoneme_mappings.py — pre-Workers era, mapping handled by JSON files
phoneme_vectorizer.py — pre-Workers era, vectorization handled by load_phoible() and similarity.ts

From diffusion-governors (not copying)

llada_sampler.py, mdlm_sampler.py — diffusion-era samplers, not used by T5Gemma
models/mdlm-owt/ — MDLM model files
data/ directory — duplicate of PhonoLex data/
scripts/build_lookup.py — superseded by constrained_chat's version
scripts/example_usage.py — demo script

From constrained_chat (not copying)

phase0_eval*.py, phase1_*.py, phase2_*.py — research scripts
governor-t5-plan.md, WORKING_IMPLEMENTATIONS.md — superseded by product-plan.md
patch_lookup_syllables.py — interim hack
lookups/ directory — generated artifacts (52MB+), gitignored
docs/superpowers/ — planning docs from that repo

From PhonoLex root

research/ — papers already absorbed into frontend and docs
src/phonolex/ directory is removed once the surviving code has been extracted
python/ directory — superseded by root pyproject.toml

3. Shared Data Layer Assembly

packages/data/ is the one new piece — assembled from parts of all three repos. Package name: phonolex_data.

loaders/

Cannibalized from diffusion-governors/src/diffusion_governors/datasets.py (606 LOC). Split the monolithic file into focused modules. This is the riskiest step — actual refactoring, not just file movement. Every function gets assigned to a module, shared helpers get factored out, and all downstream imports break simultaneously.

cmudict.py — cmudict_to_phono() (CMU dict → IPA, phonemes, features)
norms.py — 11 loaders: load_warriner(), load_kuperman(), load_subtlex(), load_concreteness(), load_sensorimotor(), load_glasgow(), load_boi(), load_elp(), load_iconicity(), load_semantic_diversity(), load_socialness()
associations.py — load_swow(), load_free_association()
phoible.py — load_phoible() (phoneme feature vectors)
vocab_lists.py — load_all_vocab(), load_ogden(), load_roget(), load_stop_words(), load_swadesh(), load_afinn(), load_gsl(), load_avl()
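
Because every function in datasets.py has to land in exactly one loader module, a small check can catch functions that were dropped or assigned twice before the old file is deleted. The following is a minimal sketch, not part of the existing codebase: check_assignments() and the ASSIGNMENTS table are hypothetical names, the module and function names are copied from the list above, and it assumes the loaders are plain top-level def statements.

```python
from __future__ import annotations

import ast
from collections import Counter

# Hypothetical table: target module -> functions it should receive.
# Names are copied from the module list in this section.
ASSIGNMENTS: dict[str, set[str]] = {
    "cmudict": {"cmudict_to_phono"},
    "norms": {
        "load_warriner", "load_kuperman", "load_subtlex", "load_concreteness",
        "load_sensorimotor", "load_glasgow", "load_boi", "load_elp",
        "load_iconicity", "load_semantic_diversity", "load_socialness",
    },
    "associations": {"load_swow", "load_free_association"},
    "phoible": {"load_phoible"},
    "vocab_lists": {
        "load_all_vocab", "load_ogden", "load_roget", "load_stop_words",
        "load_swadesh", "load_afinn", "load_gsl", "load_avl",
    },
}


def public_functions(source: str) -> set[str]:
    """Return names of top-level, non-underscore functions in Python source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and not node.name.startswith("_")
    }


def check_assignments(source: str, assignments: dict[str, set[str]]):
    """Compare functions defined in `source` against the assignment table.

    Returns (unassigned, missing, duplicated):
      unassigned: defined in source but given to no module
      missing:    listed in the table but not defined in source
      duplicated: listed under more than one module
    """
    defined = public_functions(source)
    counts = Counter(name for names in assignments.values() for name in names)
    assigned = set(counts)
    duplicated = {name for name, n in counts.items() if n > 1}
    return defined - assigned, assigned - defined, duplicated
```

Run against the text of the old datasets.py, all three returned sets should be empty once the split is complete. Private helpers (leading underscore), classes, and module-level constants are outside what this sketch checks and still need a manual pass.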

phonology/

syllabification.py — moves from src/phonolex/utils/syllabification.py
wcm.py — Word Complexity Measure computation, extracted from inline code in export-to-d1.py
normalize.py — IPA normalization. Canonical direction: IPA ɡ (U+0261). Coalesces PhonoLex (ASCII→IPA) and governors (IPA→ASCII) around the IPA-canonical representation.
g2p_alignment.py — moves from workers/scripts/g2p_alignment.py. Shared by both web export and dashboard lookup builder.

graph/

build_phonological_graph.py — moves from src/phonolex/build_phonological_graph.py

mappings/

JSON files from data/mappings/ (arpa_to_ipa.json, ipa_to_arpa.json)
Loader helper in __init__.py
sample_vectors.json and README.md stay in data/mappings/ (reference material, not code)

tests/

test_g2p_alignment.py — moves from tests/test_g2p_alignment.py

Data file references

All loaders read from data/ at the repo root. The location can be overridden with the DATA_DIR env var; when it is unset, a get_data_dir() helper falls back to repo root detection.
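
One way get_data_dir() could behave is sketched below. The design does not say how the repo root is detected, so the marker used here (the nearest ancestor containing both pyproject.toml and a data/ directory) is an assumption, as is the optional start parameter, which exists only to make the lookup testable.

```python
import os
from pathlib import Path
from typing import Optional


def get_data_dir(start: Optional[Path] = None) -> Path:
    """Resolve the raw-data directory.

    Order: DATA_DIR env var if set, otherwise <repo root>/data, where the
    repo root is the nearest ancestor holding both pyproject.toml and data/
    (an assumed marker, not confirmed by the design).
    """
    override = os.environ.get("DATA_DIR")
    if override:
        return Path(override).expanduser()

    # Walk upward from the starting location (this file by default).
    here = (start or Path(__file__)).resolve()
    for candidate in (here, *here.parents):
        if (candidate / "pyproject.toml").is_file() and (candidate / "data").is_dir():
            return candidate / "data"

    raise FileNotFoundError("Could not locate repo root; set DATA_DIR explicitly.")
```

Requiring both markers matters in this layout: packages/data/ has its own pyproject.toml, and packages/ contains a directory named data, but only the repo root has both.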

4. Normalization Resolution

PhonoLex normalizes ASCII g → IPA ɡ (U+0261). Governor build_lookup normalizes IPA ɡ → ASCII g.

Decision: IPA is canonical. packages/data/phonology/normalize.py normalizes everything to IPA. Governor-side normalization flipped to match. Other normalization gaps may surface during migration — audit then.
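
The IPA-canonical direction can be illustrated with a minimal sketch. The function name normalize_ipa() is hypothetical, and the only mapping shown is the g case named in this section; it is meant for IPA transcription strings, not orthographic text.

```python
ASCII_G = "\u0067"  # "g", LATIN SMALL LETTER G (non-canonical in transcriptions)
IPA_G = "\u0261"    # "ɡ", LATIN SMALL LETTER SCRIPT G (canonical)


def normalize_ipa(transcription: str) -> str:
    """Map ASCII g to IPA ɡ in an IPA transcription string.

    Idempotent: input that is already canonical comes back unchanged.
    """
    return transcription.replace(ASCII_G, IPA_G)
```

Flipping the governor side then amounts to calling the same function wherever build_lookup previously converted ɡ back to ASCII g, so both pipelines key on identical strings.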

5. What Doesn't Change (Yet)

These move but don't get rewritten — logic stays the same:

Governor internals (core, gates, boosts, cdd, constraints)
Dashboard server and frontend (including HFGovernorProcessor in governor.py)
PhonoLex web workers and frontend
export-to-d1.py (import paths may change, logic doesn't)
Tests (copy alongside packages, fix imports)

The coherence pass afterward is where we discuss rewrites, including:

Promoting HFGovernorProcessor from dashboard to governors package
Renaming cdd.py → projections.py (per product plan)
Restructuring lookups.py location
Replacing PHOIBLE feature vectors with our own (initialized from basic articulatory data, tuned by morpho/phono datasets; opens up licensing)

6. Post-Migration Path Fixes (Bounded)

To get back to a working state:

wrangler.toml — update paths for packages/web/workers/ (note: changes working directory for wrangler dev, affects all relative paths)
.github/workflows/deploy.yml — update working directory
.github/workflows/ci.yml — update working directory references
vite.config.ts (both frontends) — update any path references
pyproject.toml at root — uv workspace config pointing at packages/*/
Python imports — from diffusion_governors.datasets import ... → from phonolex_data.loaders import ... (in build_lookup.py, export-to-d1.py, etc.). Note: build_lookup.py imports from both phonolex_data (loaders) and diffusion_governors (LookupBuilder). Requires both packages installed as editable (uv pip install -e packages/data -e packages/governors).
Root package.json — update script path references (e.g., npm run dev --prefix webapp/frontend)
.gitignore — update paths (e.g., webapp/frontend/dist/ → packages/web/frontend/dist/)
CLAUDE.md — update project structure, paths, dev setup instructions
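
The Python import change in the list above is mechanical enough to script. The sketch below is a hypothetical helper (rewrite_loader_imports() is not an existing tool) that handles only the from-import form shown above. It assumes phonolex_data.loaders re-exports every loader, which the target import path implies but the design does not state outright.

```python
import re

# Matches only `from diffusion_governors.datasets import ...`, keeping indentation.
# Other forms (e.g. `import diffusion_governors.datasets as ds`) are not handled.
_OLD_IMPORT = re.compile(
    r"^([ \t]*)from\s+diffusion_governors\.datasets\s+import\b",
    re.MULTILINE,
)


def rewrite_loader_imports(source: str) -> str:
    """Point dataset-loader imports at phonolex_data.loaders.

    Imports from other diffusion_governors modules are left untouched,
    since the constraint engine keeps its package name.
    """
    return _OLD_IMPORT.sub(r"\1from phonolex_data.loaders import", source)
```

Applied to build_lookup.py, this would change the loader imports while leaving its diffusion_governors import of LookupBuilder alone, which is why both packages still need to be installed as editable.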

7. Migration Sequence

Create branch feat/monorepo-migration off main
Create packages/ directory structure
Move workers/ → packages/web/workers/, webapp/frontend/ → packages/web/frontend/
Copy diffusion-governors engine into packages/governors/ (minus samplers, data/, scripts, models/)
Copy constrained_chat into packages/dashboard/ (minus research scripts, lookups/, planning docs)
Assemble packages/data/ — split datasets.py into loaders/, move syllabification + g2p_alignment, extract WCM, create normalize.py
Drop dead code (src/phonolex/ remnants, research/, python/, dead data/mappings/ code)
Fix post-migration paths (wrangler, CI, vite, imports, package.json, gitignore, CLAUDE.md)
Commit
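
Before the final commit, a quick layout check can confirm that the restructure matches sections 1 and 2. This is a sketch under stated assumptions: check_layout() and the two path lists are hypothetical, and they cover only top-level package directories and the root-level directories this document marks for removal.

```python
from __future__ import annotations

from pathlib import Path

# Directories the target structure (section 1) says must exist.
EXPECTED_DIRS = [
    "packages/data",
    "packages/governors",
    "packages/web/workers",
    "packages/web/frontend",
    "packages/dashboard",
    "data",
]

# Root-level paths section 2 says should be gone after the migration.
DROPPED_PATHS = ["src/phonolex", "research", "python"]


def check_layout(root: Path) -> list[str]:
    """Return a list of human-readable layout problems (empty means OK)."""
    problems = []
    for rel in EXPECTED_DIRS:
        if not (root / rel).is_dir():
            problems.append(f"missing directory: {rel}")
    for rel in DROPPED_PATHS:
        if (root / rel).exists():
            problems.append(f"should be removed: {rel}")
    return problems
```

An empty list only says the directories are in place; it does not verify that wrangler, CI, or import paths have been fixed.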