Monorepo Migration Design
2026-03-13 — Rough migration of diffusion-governors and constrained_chat into PhonoLex
Goal
Cannibalize diffusion-governors and constrained_chat into the PhonoLex repo as a packages/ monorepo. No git history migration. Drop dead code. Extract a shared data layer. Get the structure right; coherence pass follows.
Approach
Approach C: full restructure in one pass, accept that CI/CD breaks until paths are fixed (bounded follow-up work).
Branch: feat/monorepo-migration off main.
External repos (diffusion-governors, constrained_chat) stay as separate repos, frozen. Not archived or deleted — just no longer the active development location.
1. Target Structure
PhonoLex/
├── packages/
│ ├── data/ # Shared data layer (NEW — assembled from all three repos)
│ │ ├── __init__.py # Package name: phonolex_data
│ │ ├── loaders/ # One set of dataset loaders
│ │ │ ├── __init__.py
│ │ │ ├── cmudict.py # CMU dict → phonological features
│ │ │ ├── norms.py # All psycholinguistic norm datasets (11 loaders)
│ │ │ ├── associations.py # SWOW, free association
│ │ │ ├── phoible.py # Phoneme feature vectors
│ │ │ └── vocab_lists.py # Ogden, Roget, stop words, AFINN, etc.
│ │ ├── phonology/ # Phonological computation
│ │ │ ├── __init__.py
│ │ │ ├── syllabification.py # ← from src/phonolex/utils/
│ │ │ ├── wcm.py # Word Complexity Measure (extracted from export-to-d1.py)
│ │ │ ├── normalize.py # IPA normalization (canonical: IPA ɡ U+0261)
│ │ │ └── g2p_alignment.py # ← from workers/scripts/ (shared by web + dashboard)
│ │ ├── mappings/ # IPA/ARPAbet conversion
│ │ │ ├── __init__.py # Loader helpers
│ │ │ ├── arpa_to_ipa.json
│ │ │ └── ipa_to_arpa.json
│ │ ├── graph/ # Pickle builder
│ │ │ ├── __init__.py
│ │ │ └── build_phonological_graph.py # ← from src/phonolex/
│ │ ├── tests/
│ │ │ └── test_g2p_alignment.py # ← from tests/
│ │ └── pyproject.toml
│ │
│ ├── governors/ # Constraint engine (← diffusion-governors)
│ │ ├── src/diffusion_governors/
│ │ │ ├── __init__.py
│ │ │ ├── core.py # Governor, GovernorContext, Mechanism
│ │ │ ├── gates.py # HardGate
│ │ │ ├── boosts.py # LogitBoost
│ │ │ ├── cdd.py # CDDProjection
│ │ │ ├── constraints.py # 15 declarative constraints
│ │ │ └── lookups.py # LookupBuilder, TokenFeatures
│ │ ├── tests/
│ │ └── pyproject.toml
│ │
│ ├── web/ # PhonoLex web app (← workers/ + webapp/)
│ │ ├── workers/ # Hono API + D1
│ │ │ ├── src/
│ │ │ ├── scripts/
│ │ │ │ ├── export-to-d1.py
│ │ │ │ └── config.py
│ │ │ ├── wrangler.toml # Paths updated post-migration
│ │ │ └── package.json
│ │ └── frontend/ # React + MUI
│ │   ├── src/
│ │   └── package.json
│ │
│ └── dashboard/ # Governed Chat (← constrained_chat)
│   ├── server/ # FastAPI backend
│   │ ├── main.py
│   │ ├── model.py
│   │ ├── governor.py # HFGovernorProcessor + GovernorCache stay here
│   │ ├── schemas.py
│   │ ├── profiles.py
│   │ ├── sessions.py
│   │ ├── routes/
│   │ └── tests/
│   ├── frontend/ # React + Tailwind
│   │ └── src/
│   └── scripts/
│     ├── build_lookup.py # ← build_lookup_phonolex.py (renamed)
│     └── generation_sweep.py
│
├── data/ # Raw data files (one copy, stays at root)
├── docs/
├── tests/ # Cross-package integration tests
├── .github/workflows/
├── CLAUDE.md
└── pyproject.toml # Workspace root (uv workspaces)
2. What Gets Dropped
From src/phonolex/ (dead code)
- embeddings/ (7 files): dead approach
- models/phonolex_bert.py: dead
- word_filter.py: superseded by Workers API (filters.ts, patterns.ts)
- tools/maximal_opposition.py: superseded by contrastive.ts route
- utils/extract_psycholinguistic_norms.py: superseded by export-to-d1.py
From data/mappings/ (dead code)
- phoneme_mappings.py: pre-Workers era, mapping handled by JSON files
- phoneme_vectorizer.py: pre-Workers era, vectorization handled by load_phoible() and similarity.ts
From diffusion-governors (not copying)
- llada_sampler.py, mdlm_sampler.py: diffusion-era samplers, not used by T5Gemma
- models/mdlm-owt/: MDLM model files
- data/ directory: duplicate of PhonoLex data/
- scripts/build_lookup.py: superseded by constrained_chat's version
- scripts/example_usage.py: demo script
From constrained_chat (not copying)
- phase0_eval*.py, phase1_*.py, phase2_*.py: research scripts
- governor-t5-plan.md, WORKING_IMPLEMENTATIONS.md: superseded by product-plan.md
- patch_lookup_syllables.py: interim hack
- lookups/ directory: generated artifacts (52MB+), gitignored
- docs/superpowers/: planning docs from that repo
From PhonoLex root
- research/: papers already absorbed into frontend and docs
- src/phonolex/ directory: removed after surviving code is extracted
- python/ directory: superseded by root pyproject.toml
3. Shared Data Layer Assembly
packages/data/ is the one new piece — assembled from parts of all three repos. Package name: phonolex_data.
loaders/
Cannibalized from diffusion-governors/src/diffusion_governors/datasets.py (606 LOC). Split the monolithic file into focused modules. This is the riskiest step — actual refactoring, not just file movement. Every function gets assigned to a module, shared helpers get factored out, and all downstream imports break simultaneously.
- cmudict.py: cmudict_to_phono() (CMU dict → IPA, phonemes, features)
- norms.py: 11 loaders (load_warriner(), load_kuperman(), load_subtlex(), load_concreteness(), load_sensorimotor(), load_glasgow(), load_boi(), load_elp(), load_iconicity(), load_semantic_diversity(), load_socialness())
- associations.py: load_swow(), load_free_association()
- phoible.py: load_phoible() (phoneme feature vectors)
- vocab_lists.py: load_all_vocab(), load_ogden(), load_roget(), load_stop_words(), load_swadesh(), load_afinn(), load_gsl(), load_avl()
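Because every function in datasets.py has to land in exactly one module, the assignment above can be written down as data and checked before any file is moved. The following is a minimal sketch, not part of the existing codebase: the names LOADER_MODULES, check_assignment, and owner_of are hypothetical, and only the function-to-module mapping itself comes from the list above.

```python
from collections import Counter

# Function-to-module assignment, copied from the loaders/ list in this doc.
# The dict name and the helpers below are hypothetical illustration only.
LOADER_MODULES: dict[str, list[str]] = {
    "cmudict": ["cmudict_to_phono"],
    "norms": [
        "load_warriner", "load_kuperman", "load_subtlex", "load_concreteness",
        "load_sensorimotor", "load_glasgow", "load_boi", "load_elp",
        "load_iconicity", "load_semantic_diversity", "load_socialness",
    ],
    "associations": ["load_swow", "load_free_association"],
    "phoible": ["load_phoible"],
    "vocab_lists": [
        "load_all_vocab", "load_ogden", "load_roget", "load_stop_words",
        "load_swadesh", "load_afinn", "load_gsl", "load_avl",
    ],
}


def check_assignment(modules: dict[str, list[str]]) -> list[str]:
    """Return names assigned to more than one module (empty list means clean)."""
    counts = Counter(name for names in modules.values() for name in names)
    return sorted(name for name, n in counts.items() if n > 1)


def owner_of(function_name: str,
             modules: dict[str, list[str]] = LOADER_MODULES) -> str:
    """Return the module that owns a function; raise KeyError if unassigned."""
    for module, names in modules.items():
        if function_name in names:
            return module
    raise KeyError(function_name)
```

A check like this could run as a one-off test during the split. It does not cover shared private helpers in datasets.py, which still need to be factored out by hand.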
phonology/
- syllabification.py: moves from src/phonolex/utils/syllabification.py
- wcm.py: Word Complexity Measure computation, extracted from inline code in export-to-d1.py
- normalize.py: IPA normalization. Canonical direction: IPA ɡ (U+0261). Coalesces PhonoLex (ASCII→IPA) and governors (IPA→ASCII) around the IPA-canonical representation.
- g2p_alignment.py: moves from workers/scripts/g2p_alignment.py. Shared by both web export and dashboard lookup builder.
graph/
- build_phonological_graph.py: moves from src/phonolex/build_phonological_graph.py
mappings/
- JSON files from data/mappings/ (arpa_to_ipa.json, ipa_to_arpa.json)
- Loader helper in __init__.py
- sample_vectors.json and README.md stay in data/mappings/ (reference material, not code)
tests/
- test_g2p_alignment.py: moves from tests/test_g2p_alignment.py
Data file references
All loaders reference data/ at repo root. Configurable via DATA_DIR env var or get_data_dir() helper defaulting to repo root detection.
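One possible shape for the get_data_dir() helper is sketched below. The resolution order (DATA_DIR first, then repo root detection) follows the paragraph above. The detection rule itself is an assumption: it treats the nearest ancestor containing both a pyproject.toml and a data/ directory as the repo root. The start parameter is also hypothetical and exists mainly to make the function testable.

```python
import os
from pathlib import Path
from typing import Optional


def get_data_dir(start: Optional[Path] = None) -> Path:
    """Resolve the raw data directory.

    Order: the DATA_DIR environment variable, then the nearest ancestor of
    `start` that contains both pyproject.toml and data/ (assumed markers
    for the repo root; the real helper may use different ones).
    """
    override = os.environ.get("DATA_DIR")
    if override:
        return Path(override).expanduser().resolve()

    # Default to this file's location so detection works for editable installs.
    here = (start or Path(__file__)).resolve()
    for candidate in [here, *here.parents]:
        if (candidate / "pyproject.toml").is_file() and (candidate / "data").is_dir():
            return candidate / "data"
    raise FileNotFoundError("Could not locate data/; set DATA_DIR explicitly")
```

Requiring both markers keeps packages/data/ from matching by accident: it has its own pyproject.toml but no data/ subdirectory, so the walk continues up to the workspace root.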
4. Normalization Resolution
PhonoLex normalizes ASCII g → IPA ɡ (U+0261). Governor build_lookup normalizes IPA ɡ → ASCII g.
Decision: IPA is canonical. packages/data/phonology/normalize.py normalizes everything to IPA. Governor-side normalization flipped to match. Other normalization gaps may surface during migration — audit then.
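The IPA-canonical direction can be illustrated with a very small function. This is a sketch under the assumption that the g/ɡ pair is the only mapping needed so far; the function name normalize_ipa and the constant names are hypothetical, and the real normalize.py may cover more cases once the audit is done.

```python
ASCII_G = "g"        # U+0067 LATIN SMALL LETTER G
IPA_G = "\u0261"     # U+0261 LATIN SMALL LETTER SCRIPT G (canonical form)


def normalize_ipa(transcription: str) -> str:
    """Map ASCII g to IPA ɡ so every package agrees on one code point.

    Only the g/ɡ case from this doc is handled; other normalization gaps
    would be added here as they surface.
    """
    return transcription.replace(ASCII_G, IPA_G)
```

The function is idempotent, so applying it to data that is already IPA-canonical is harmless. That matters while the governor-side lookup still holds a mix of both forms.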
5. What Doesn't Change (Yet)
These move but don't get rewritten — logic stays the same:
- Governor internals (core, gates, boosts, cdd, constraints)
- Dashboard server and frontend (including HFGovernorProcessor in governor.py)
- PhonoLex web workers and frontend
- export-to-d1.py (import paths may change, logic doesn't)
- Tests (copy alongside packages, fix imports)
The coherence pass afterward is where we discuss rewrites, including:
- Promoting HFGovernorProcessor from dashboard to governors package
- Renaming cdd.py → projections.py (per product plan)
- Restructuring lookups.py location
- Replacing PHOIBLE feature vectors with our own (initialized from basic articulatory data, tuned by morpho/phono datasets — opens up licensing)
6. Post-Migration Path Fixes (Bounded)
To get back to a working state:
- wrangler.toml: update paths for packages/web/workers/ (note: changes working directory for wrangler dev, affects all relative paths)
- .github/workflows/deploy.yml: update working directory
- .github/workflows/ci.yml: update working directory references
- vite.config.ts (both frontends): update any path references
- pyproject.toml at root: uv workspace config pointing at packages/*/
- Python imports: from diffusion_governors.datasets import ... → from phonolex_data.loaders import ... (in build_lookup.py, export-to-d1.py, etc.). Note: build_lookup.py imports from both phonolex_data (loaders) and diffusion_governors (LookupBuilder). Requires both packages installed as editable (uv pip install -e packages/data -e packages/governors).
- Root package.json: update script path references (e.g., npm run dev --prefix webapp/frontend)
- .gitignore: update paths (e.g., webapp/frontend/dist/ → packages/web/frontend/dist/)
- CLAUDE.md: update project structure, paths, dev setup instructions
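The Python import fix is mechanical enough to script. The sketch below rewrites only the old datasets module path and leaves other diffusion_governors imports (such as LookupBuilder) alone. It assumes loaders/__init__.py re-exports every loader, so the flat from phonolex_data.loaders import form keeps working. The helper name rewrite_imports is hypothetical, and statements of the form import diffusion_governors.datasets are not handled.

```python
import re

# Matches "from diffusion_governors.datasets import" at the start of a line
# (optionally indented). Other diffusion_governors imports do not match.
_OLD_IMPORT = re.compile(
    r"^([ \t]*)from\s+diffusion_governors\.datasets\s+import\b",
    re.MULTILINE,
)


def rewrite_imports(source: str) -> str:
    """Point datasets imports at phonolex_data.loaders, preserving indentation."""
    return _OLD_IMPORT.sub(r"\1from phonolex_data.loaders import", source)
```

Running this over build_lookup.py and export-to-d1.py and reviewing the diff by hand is probably safer than a blind search-and-replace, since any import that names a private helper will still fail after the module split.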
7. Migration Sequence
- Create branch feat/monorepo-migration off main
- Create packages/ directory structure
- Move workers/ → packages/web/workers/, webapp/ → packages/web/frontend/
- Copy diffusion-governors engine into packages/governors/ (minus samplers, data/, scripts, models/)
- Copy constrained_chat into packages/dashboard/ (minus research scripts, lookups/, planning docs)
- Assemble packages/data/: split datasets.py into loaders/, move syllabification + g2p_alignment, extract WCM, create normalize.py
- Drop dead code (src/phonolex/ remnants, research/, python/, dead data/mappings/ code)
- Fix post-migration paths (wrangler, CI, vite, imports, package.json, gitignore, CLAUDE.md)
- Commit