Monorepo Migration Design

2026-03-13 — Rough migration of diffusion-governors and constrained_chat into PhonoLex


Goal

Cannibalize diffusion-governors and constrained_chat into the PhonoLex repo as a packages/ monorepo. No git history migration. Drop dead code. Extract a shared data layer. Get the structure right; coherence pass follows.

Approach

Approach C: full restructure in one pass, accept that CI/CD breaks until paths are fixed (bounded follow-up work).

Branch: feat/monorepo-migration off main.

External repos (diffusion-governors, constrained_chat) stay as separate repos, frozen. Not archived or deleted — just no longer the active development location.


1. Target Structure

PhonoLex/
├── packages/
│   ├── data/                        # Shared data layer (NEW — assembled from all three repos)
│   │   ├── __init__.py              # Package name: phonolex_data
│   │   ├── loaders/                 # One set of dataset loaders
│   │   │   ├── __init__.py
│   │   │   ├── cmudict.py           # CMU dict → phonological features
│   │   │   ├── norms.py             # All psycholinguistic norm datasets (11 loaders)
│   │   │   ├── associations.py      # SWOW, free association
│   │   │   ├── phoible.py           # Phoneme feature vectors
│   │   │   └── vocab_lists.py       # Ogden, Roget, stop words, AFINN, etc.
│   │   ├── phonology/               # Phonological computation
│   │   │   ├── __init__.py
│   │   │   ├── syllabification.py   # ← from src/phonolex/utils/
│   │   │   ├── wcm.py               # Word Complexity Measure (extracted from export-to-d1.py)
│   │   │   ├── normalize.py         # IPA normalization (canonical: IPA ɡ U+0261)
│   │   │   └── g2p_alignment.py     # ← from workers/scripts/ (shared by web + dashboard)
│   │   ├── mappings/                # IPA/ARPAbet conversion
│   │   │   ├── __init__.py          # Loader helpers
│   │   │   ├── arpa_to_ipa.json
│   │   │   └── ipa_to_arpa.json
│   │   ├── graph/                   # Pickle builder
│   │   │   ├── __init__.py
│   │   │   └── build_phonological_graph.py  # ← from src/phonolex/
│   │   ├── tests/
│   │   │   └── test_g2p_alignment.py  # ← from tests/
│   │   └── pyproject.toml
│   │
│   ├── governors/                   # Constraint engine (← diffusion-governors)
│   │   ├── src/diffusion_governors/
│   │   │   ├── __init__.py
│   │   │   ├── core.py              # Governor, GovernorContext, Mechanism
│   │   │   ├── gates.py             # HardGate
│   │   │   ├── boosts.py            # LogitBoost
│   │   │   ├── cdd.py               # CDDProjection
│   │   │   ├── constraints.py       # 15 declarative constraints
│   │   │   └── lookups.py           # LookupBuilder, TokenFeatures
│   │   ├── tests/
│   │   └── pyproject.toml
│   │
│   ├── web/                         # PhonoLex web app (← workers/ + webapp/)
│   │   ├── workers/                 # Hono API + D1
│   │   │   ├── src/
│   │   │   ├── scripts/
│   │   │   │   ├── export-to-d1.py
│   │   │   │   └── config.py
│   │   │   ├── wrangler.toml        # Paths updated post-migration
│   │   │   └── package.json
│   │   └── frontend/                # React + MUI
│   │       ├── src/
│   │       └── package.json
│   │
│   └── dashboard/                   # Governed Chat (← constrained_chat)
│       ├── server/                  # FastAPI backend
│       │   ├── main.py
│       │   ├── model.py
│       │   ├── governor.py          # HFGovernorProcessor + GovernorCache stay here
│       │   ├── schemas.py
│       │   ├── profiles.py
│       │   ├── sessions.py
│       │   ├── routes/
│       │   └── tests/
│       ├── frontend/                # React + Tailwind
│       │   └── src/
│       └── scripts/
│           ├── build_lookup.py      # ← build_lookup_phonolex.py (renamed)
│           └── generation_sweep.py
│
├── data/                            # Raw data files (one copy, stays at root)
├── docs/
├── tests/                           # Cross-package integration tests
├── .github/workflows/
├── CLAUDE.md
└── pyproject.toml                   # Workspace root (uv workspaces)
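
A quick way to confirm the restructure landed is to compare the working tree against a handful of marker files from the tree above. The following is a minimal sketch, not part of the plan itself: the helper name `missing_paths` and the choice of which files count as markers are assumptions made for illustration.

```python
from pathlib import Path

# Marker paths copied from the target structure above. Which files count
# as "markers" is an assumption made for this sketch.
EXPECTED_PATHS = [
    "packages/data/pyproject.toml",
    "packages/data/loaders/__init__.py",
    "packages/data/phonology/normalize.py",
    "packages/governors/pyproject.toml",
    "packages/governors/src/diffusion_governors/__init__.py",
    "packages/web/workers/wrangler.toml",
    "packages/web/frontend/package.json",
    "packages/dashboard/server/main.py",
    "pyproject.toml",
]


def missing_paths(repo_root: Path, expected=EXPECTED_PATHS) -> list:
    """Return the expected paths that do not exist under repo_root."""
    return [rel for rel in expected if not (repo_root / rel).exists()]
```

Calling `missing_paths(Path.cwd())` from the repo root would list whatever is still absent; an empty list means every marker file is in place.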

2. What Gets Dropped

From src/phonolex/ (dead code)

  • embeddings/ (7 files) — dead approach
  • models/phonolex_bert.py — dead
  • word_filter.py — superseded by Workers API (filters.ts, patterns.ts)
  • tools/maximal_opposition.py — superseded by contrastive.ts route
  • utils/extract_psycholinguistic_norms.py — superseded by export-to-d1.py

From data/mappings/ (dead code)

  • phoneme_mappings.py — pre-Workers era, mapping handled by JSON files
  • phoneme_vectorizer.py — pre-Workers era, vectorization handled by load_phoible() and similarity.ts

From diffusion-governors (not copying)

  • llada_sampler.py, mdlm_sampler.py — diffusion-era samplers, not used by T5Gemma
  • models/mdlm-owt/ — MDLM model files
  • data/ directory — duplicate of PhonoLex data/
  • scripts/build_lookup.py — superseded by constrained_chat's version
  • scripts/example_usage.py — demo script

From constrained_chat (not copying)

  • phase0_eval*.py, phase1_*.py, phase2_*.py — research scripts
  • governor-t5-plan.md, WORKING_IMPLEMENTATIONS.md — superseded by product-plan.md
  • patch_lookup_syllables.py — interim hack
  • lookups/ directory — generated artifacts (52MB+), gitignored
  • docs/superpowers/ — planning docs from that repo

From PhonoLex root

  • research/ — papers already absorbed into frontend and docs
  • src/phonolex/ directory removed after surviving code extracted
  • python/ directory — superseded by root pyproject.toml

3. Shared Data Layer Assembly

packages/data/ is the one new piece — assembled from parts of all three repos. Package name: phonolex_data.

loaders/

Cannibalized from diffusion-governors/src/diffusion_governors/datasets.py (606 LOC). Split the monolithic file into focused modules. This is the riskiest step — actual refactoring, not just file movement. Every function gets assigned to a module, shared helpers get factored out, and all downstream imports break simultaneously.

  • cmudict.py: cmudict_to_phono() (CMU dict → IPA, phonemes, features)
  • norms.py: 11 loaders (load_warriner(), load_kuperman(), load_subtlex(), load_concreteness(), load_sensorimotor(), load_glasgow(), load_boi(), load_elp(), load_iconicity(), load_semantic_diversity(), load_socialness())
  • associations.py: load_swow(), load_free_association()
  • phoible.py: load_phoible() (phoneme feature vectors)
  • vocab_lists.py: load_all_vocab(), load_ogden(), load_roget(), load_stop_words(), load_swadesh(), load_afinn(), load_gsl(), load_avl()
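
Because every function in datasets.py has to land in exactly one module, it may help to keep the assignment as data and check it mechanically before moving code. The sketch below copies the module/function lists from above; the helper name `check_assignment` is hypothetical, and how the list of public names is obtained from the old datasets.py is left to the caller.

```python
from collections import Counter

# Module -> functions, copied from the split described above.
LOADER_MODULES = {
    "cmudict": ["cmudict_to_phono"],
    "norms": [
        "load_warriner", "load_kuperman", "load_subtlex", "load_concreteness",
        "load_sensorimotor", "load_glasgow", "load_boi", "load_elp",
        "load_iconicity", "load_semantic_diversity", "load_socialness",
    ],
    "associations": ["load_swow", "load_free_association"],
    "phoible": ["load_phoible"],
    "vocab_lists": [
        "load_all_vocab", "load_ogden", "load_roget", "load_stop_words",
        "load_swadesh", "load_afinn", "load_gsl", "load_avl",
    ],
}


def check_assignment(public_names, modules=LOADER_MODULES):
    """Return (unassigned, duplicated) function names.

    unassigned: names in public_names that no module claims.
    duplicated: names claimed by more than one module (or listed twice).
    """
    counts = Counter(name for funcs in modules.values() for name in funcs)
    unassigned = sorted(set(public_names) - set(counts))
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return unassigned, duplicated
```

Both returned lists should be empty before the split is considered complete. Shared private helpers are not covered here and would still need to be placed by hand.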

phonology/

  • syllabification.py — moves from src/phonolex/utils/syllabification.py
  • wcm.py — Word Complexity Measure computation, extracted from inline code in export-to-d1.py
  • normalize.py — IPA normalization. Canonical direction: IPA ɡ (U+0261). Coalesces PhonoLex (ASCII→IPA) and governors (IPA→ASCII) around the IPA-canonical representation.
  • g2p_alignment.py — moves from workers/scripts/g2p_alignment.py. Shared by both web export and dashboard lookup builder.

graph/

  • build_phonological_graph.py — moves from src/phonolex/build_phonological_graph.py

mappings/

  • JSON files from data/mappings/ (arpa_to_ipa.json, ipa_to_arpa.json)
  • Loader helper in __init__.py
  • sample_vectors.json and README.md stay in data/mappings/ (reference material, not code)
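
The loader helper in mappings/__init__.py could be as small as the sketch below. The function names are hypothetical, and the internal layout of the two JSON files is not specified in this document, so the helpers simply return whatever the files contain. The optional `directory` argument is an addition for testability; without it the helper reads from the directory holding the module.

```python
import json
from pathlib import Path
from typing import Any, Optional


def _load_json(name: str, directory: Optional[Path] = None) -> Any:
    """Parse a JSON mapping file; defaults to the directory of this module."""
    base = directory if directory is not None else Path(__file__).resolve().parent
    with open(base / name, encoding="utf-8") as fh:
        return json.load(fh)


def load_arpa_to_ipa(directory: Optional[Path] = None) -> Any:
    """Hypothetical helper: contents of arpa_to_ipa.json."""
    return _load_json("arpa_to_ipa.json", directory)


def load_ipa_to_arpa(directory: Optional[Path] = None) -> Any:
    """Hypothetical helper: contents of ipa_to_arpa.json."""
    return _load_json("ipa_to_arpa.json", directory)
```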

tests/

  • test_g2p_alignment.py — moves from tests/test_g2p_alignment.py

Data file references

All loaders read from data/ at the repo root. The path is configurable through the DATA_DIR env var or a get_data_dir() helper, which defaults to repo-root detection.
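
One possible shape for get_data_dir() is sketched here. The detection rule (walk upward until a directory contains both pyproject.toml and data/) and the `start` parameter are assumptions; the plan only says the helper defaults to repo-root detection. Requiring both markers matters because packages/data/ has its own pyproject.toml and packages/ contains a directory named data.

```python
import os
from pathlib import Path
from typing import Optional


def get_data_dir(start: Optional[Path] = None) -> Path:
    """Resolve the raw-data directory.

    Order: the DATA_DIR env var if set, otherwise the first ancestor of
    `start` (default: current working directory) that holds both a
    pyproject.toml file and a data/ directory. The detection rule is an
    assumption of this sketch.
    """
    override = os.environ.get("DATA_DIR")
    if override:
        return Path(override).expanduser().resolve()

    here = (start or Path.cwd()).resolve()
    for candidate in (here, *here.parents):
        if (candidate / "pyproject.toml").is_file() and (candidate / "data").is_dir():
            return candidate / "data"
    raise FileNotFoundError(
        "Could not locate data/; set DATA_DIR or run from inside the repo."
    )
```

Inside the installed package, passing the module's own location as `start` would likely be more robust than relying on the working directory.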

4. Normalization Resolution

PhonoLex normalizes ASCII g → IPA ɡ (U+0261). Governor build_lookup normalizes IPA ɡ → ASCII g.

Decision: IPA is canonical. packages/data/phonology/normalize.py normalizes everything to IPA. Governor-side normalization flipped to match. Other normalization gaps may surface during migration — audit then.
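
As a minimal sketch of the IPA-canonical direction, the function below maps ASCII g (U+0067) to IPA ɡ (U+0261). The name `normalize_ipa` is hypothetical, and this covers only the one mismatch identified so far; anything found in the later audit would be added alongside it.

```python
ASCII_G = "\u0067"  # LATIN SMALL LETTER G
IPA_G = "\u0261"    # LATIN SMALL LETTER SCRIPT G (canonical)


def normalize_ipa(text: str) -> str:
    """Return text with ASCII g replaced by IPA script g.

    Idempotent: normalizing an already-normalized string changes nothing.
    """
    return text.replace(ASCII_G, IPA_G)
```

The function should only be applied to IPA transcriptions, not to orthographic words or ARPAbet symbols, where an ASCII g is legitimate.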

5. What Doesn't Change (Yet)

These move but don't get rewritten — logic stays the same:

  • Governor internals (core, gates, boosts, cdd, constraints)
  • Dashboard server and frontend (including HFGovernorProcessor in governor.py)
  • PhonoLex web workers and frontend
  • export-to-d1.py (import paths may change, logic doesn't)
  • Tests (copy alongside packages, fix imports)

The coherence pass afterward is where we discuss rewrites, including:

  • Promoting HFGovernorProcessor from dashboard to governors package
  • Renaming cdd.py → projections.py (per product plan)
  • Restructuring lookups.py location
  • Replacing PHOIBLE feature vectors with our own (initialized from basic articulatory data, tuned by morpho/phono datasets; opens up licensing)

6. Post-Migration Path Fixes (Bounded)

To get back to a working state:

  1. wrangler.toml — update paths for packages/web/workers/ (note: changes working directory for wrangler dev, affects all relative paths)
  2. .github/workflows/deploy.yml — update working directory
  3. .github/workflows/ci.yml — update working directory references
  4. vite.config.ts (both frontends) — update any path references
  5. pyproject.toml at root — uv workspace config pointing at packages/*/
  6. Python imports: from diffusion_governors.datasets import ... → from phonolex_data.loaders import ... (in build_lookup.py, export-to-d1.py, etc.). Note: build_lookup.py imports from both phonolex_data (loaders) and diffusion_governors (LookupBuilder), so both packages must be installed as editable (uv pip install -e packages/data -e packages/governors).
  7. Root package.json — update script path references (e.g., npm run dev --prefix webapp/frontend)
  8. .gitignore: update paths (e.g., webapp/frontend/dist/ → packages/web/frontend/dist/)
  9. CLAUDE.md — update project structure, paths, dev setup instructions
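
The Python import fix in the list above is mechanical enough to script. This sketch rewrites only the `from diffusion_governors.datasets import ...` form and assumes phonolex_data.loaders re-exports every loader at package level, as the target import line implies. Other forms, such as `import diffusion_governors.datasets as ds`, would still need manual attention. The function name is hypothetical.

```python
import re

# Matches the start of a "from diffusion_governors.datasets import" line,
# keeping any leading indentation in group 1.
OLD_IMPORT = re.compile(
    r"^([ \t]*)from[ \t]+diffusion_governors\.datasets[ \t]+import\b",
    re.MULTILINE,
)


def rewrite_imports(source: str) -> str:
    """Point datasets imports at phonolex_data.loaders; leave the rest alone."""
    return OLD_IMPORT.sub(r"\1from phonolex_data.loaders import", source)
```

Imports from other diffusion_governors modules (for example lookups) are intentionally untouched, since the governors package keeps its name.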

7. Migration Sequence

  1. Create branch feat/monorepo-migration off main
  2. Create packages/ directory structure
  3. Move workers/ → packages/web/workers/, webapp/ → packages/web/frontend/
  4. Copy diffusion-governors engine into packages/governors/ (minus samplers, data/, scripts, models/)
  5. Copy constrained_chat into packages/dashboard/ (minus research scripts, lookups/, planning docs)
  6. Assemble packages/data/ — split datasets.py into loaders/, move syllabification + g2p_alignment, extract WCM, create normalize.py
  7. Drop dead code (src/phonolex/ remnants, research/, python/, dead data/mappings/ code)
  8. Fix post-migration paths (wrangler, CI, vite, imports, package.json, gitignore, CLAUDE.md)
  9. Commit