Canonical Morphological Decomposer Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Build a recursive binary canonical morphological decomposer that produces semantically coherent morpheme forms (e.g., "happily" → "happy" + "ly").
Architecture: Decompose-and-edit paradigm — a BiLSTM character encoder feeds four prediction heads: monomorphemic classifier, split-point predictor, label classifier, and canonical boundary edit predictor. Single forward pass, no autoregressive decoding. Recursive application builds a full decomposition tree.
Tech Stack: Python, PyTorch, BiLSTM, MorphyNet (CC BY-SA 3.0), SIGMORPHON 2022 (eval only)
Spec: docs/superpowers/specs/2026-04-07-canonical-morphological-decomposer-design.md
File Structure
packages/tokenizer/
├── pyproject.toml                  # CREATE (currently missing)
├── src/phonolex_tokenizer/
│   ├── __init__.py                 # MODIFY — export new types
│   ├── model/
│   │   ├── encoder.py              # REUSE (unchanged)
│   │   ├── features.py             # REUSE (unchanged)
│   │   └── schema.py               # REUSE — MorphLabel enum
│   ├── decomposer/
│   │   ├── __init__.py             # CREATE
│   │   ├── schema.py               # CREATE — Mono, Split, MorphTree
│   │   ├── codebook.py             # CREATE — EditCodebook
│   │   ├── model.py                # CREATE — DecomposerModel (nn.Module)
│   │   └── decomposer.py           # CREATE — Decomposer wrapper
│   ├── data/
│   │   ├── canonical_loader.py     # CREATE — MorphyNet canonical pairs
│   │   └── negatives.py            # CREATE — Hard negative mining
│   └── eval/
│       ├── decomposer_metrics.py   # CREATE — Decomposition metrics
│       └── decomposer_benchmark.py # CREATE — Benchmark + SIGMORPHON regression
├── scripts/
│   ├── download_data.sh            # CREATE — Download MorphyNet + SIGMORPHON
│   └── train_decomposer.py         # CREATE — Training entry point
└── tests/
    ├── __init__.py                 # CREATE
    ├── test_decomposer_schema.py   # CREATE
    ├── test_codebook.py            # CREATE
    ├── test_decomposer_model.py    # CREATE
    ├── test_decomposer.py          # CREATE
    ├── test_canonical_loader.py    # CREATE
    ├── test_negatives.py           # CREATE
    └── test_decomposer_metrics.py  # CREATE
Task 1: Package Setup
Files:
- Create: packages/tokenizer/pyproject.toml
- Modify: pyproject.toml (workspace root — add tokenizer to members)
- Create: packages/tokenizer/tests/__init__.py
- [ ] Step 1: Create tokenizer pyproject.toml
[project]
name = "phonolex-tokenizer"
version = "2.0.0"
description = "Canonical morphological decomposer for PhonoLex"
license = "Apache-2.0"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.0",
    "phonolex-data",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/phonolex_tokenizer"]
- [ ] Step 2: Add tokenizer to workspace
In root pyproject.toml, add "packages/tokenizer" to [tool.uv.workspace] members.
- [ ] Step 3: Create packages/tokenizer/tests/__init__.py (empty file)
- [ ] Step 4: Install and verify
cd /Users/jneumann/Repos/PhonoLex && uv pip install -e packages/tokenizer
uv run python -c "from phonolex_tokenizer.model.schema import MorphLabel; print(MorphLabel.ROOT)"
Expected: MorphLabel.ROOT
- [ ] Step 5: Commit
git add packages/tokenizer/pyproject.toml packages/tokenizer/tests/__init__.py pyproject.toml
git commit -m "chore: add tokenizer package to uv workspace"
Task 2: Download Training Data
Files:
- Create: packages/tokenizer/scripts/download_data.sh
- [ ] Step 1: Create download script
#!/usr/bin/env bash
set -euo pipefail

DATA_DIR="$(cd "$(dirname "$0")/.." && pwd)/data"
mkdir -p "$DATA_DIR"

# MorphyNet English
MORPHYNET_DIR="$DATA_DIR/morphynet"
if [ ! -f "$MORPHYNET_DIR/eng.derivational.v1.tsv" ]; then
    echo "Downloading MorphyNet English..."
    mkdir -p "$MORPHYNET_DIR"
    curl -L "https://raw.githubusercontent.com/kbatsuren/MorphyNet/main/eng/eng.derivational.v1.tsv" \
        -o "$MORPHYNET_DIR/eng.derivational.v1.tsv"
    curl -L "https://raw.githubusercontent.com/kbatsuren/MorphyNet/main/eng/eng.inflectional.v1.tsv" \
        -o "$MORPHYNET_DIR/eng.inflectional.v1.tsv"
else
    echo "MorphyNet already present."
fi

# SIGMORPHON 2022 English (eval only)
SIGMORPHON_DIR="$DATA_DIR/sigmorphon2022"
if [ ! -f "$SIGMORPHON_DIR/eng.word.train.tsv" ]; then
    echo "Downloading SIGMORPHON 2022 English..."
    mkdir -p "$SIGMORPHON_DIR"
    BASE="https://raw.githubusercontent.com/sigmorphon/2022SegmentationST/main/data"
    curl -L "$BASE/eng.word.train.tsv" -o "$SIGMORPHON_DIR/eng.word.train.tsv"
    curl -L "$BASE/eng.word.dev.tsv" -o "$SIGMORPHON_DIR/eng.word.dev.tsv"
    curl -L "$BASE/eng.word.test.gold.tsv" -o "$SIGMORPHON_DIR/eng.word.test.gold.tsv"
else
    echo "SIGMORPHON 2022 already present."
fi

echo "All data ready in $DATA_DIR"
- [ ] Step 2: Run download and add data dir to .gitignore
chmod +x packages/tokenizer/scripts/download_data.sh
packages/tokenizer/scripts/download_data.sh
echo "packages/tokenizer/data/" >> .gitignore
- [ ] Step 3: Commit
git add packages/tokenizer/scripts/download_data.sh .gitignore
git commit -m "chore: add data download script for decomposer training data"
Task 3: Decomposer Schema
Files:
- Create: packages/tokenizer/src/phonolex_tokenizer/decomposer/__init__.py
- Create: packages/tokenizer/src/phonolex_tokenizer/decomposer/schema.py
- Test: packages/tokenizer/tests/test_decomposer_schema.py
- [ ] Step 1: Write failing tests
Test Mono, Split, MorphTree types and flatten_tree(). Tests cover: frozen dataclasses, .leaves() in reading order (prefix first, suffix last), recursive tree flattening for multi-step decompositions like "unfortunately" → un + fortune + ate + ly.
See spec output schema for the full type definitions.
- [ ] Step 2: Run tests, verify they fail (ModuleNotFoundError)
- [ ] Step 3: Implement Mono, Split, MorphTree, flatten_tree
Key behaviors:
- Mono.leaves() → [(word, ROOT)]
- Split.leaves() → prefix-first ordering: if label is PREFIX, affix leaves come before base leaves; otherwise base leaves come before affix leaves
- Recursion through base_tree/affix_tree when populated
- flatten_tree() delegates to .leaves()
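The key behaviors above could be sketched as follows. This is a minimal illustration, not the spec's definitions: plain strings stand in for the MorphLabel enum, and field names are assumptions until the spec's output schema is consulted.

```python
from __future__ import annotations

from dataclasses import dataclass

ROOT, PREFIX, SUFFIX = "root", "prefix", "suffix"  # stand-ins for MorphLabel

@dataclass(frozen=True)
class Mono:
    word: str

    def leaves(self) -> list[tuple[str, str]]:
        return [(self.word, ROOT)]

@dataclass(frozen=True)
class Split:
    word: str
    base: str
    affix: str
    label: str
    base_tree: Mono | Split | None = None
    affix_tree: Mono | Split | None = None

    def leaves(self) -> list[tuple[str, str]]:
        # Recurse into sub-trees when populated, else treat spans as leaves.
        base = (self.base_tree or Mono(self.base)).leaves()
        affix = self.affix_tree.leaves() if self.affix_tree else [(self.affix, self.label)]
        # Reading order: prefix leaves precede the base; suffix/inflection leaves follow it.
        return affix + base if self.label == PREFIX else base + affix

def flatten_tree(tree: Mono | Split) -> list[tuple[str, str]]:
    return tree.leaves()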
- [ ] Step 4: Run tests, verify they pass
- [ ] Step 5: Commit — feat(tokenizer): add decomposer schema — Mono, Split, MorphTree types
Task 4: Edit Codebook
Files:
- Create: packages/tokenizer/src/phonolex_tokenizer/decomposer/codebook.py
- Test: packages/tokenizer/tests/test_codebook.py
- [ ] Step 1: Write failing tests
Test BoundaryEdit (identity, strip-right, append-right, strip-and-append, strip-left) and EditCodebook (mine from pairs, lookup, identity always at index 0, roundtrip recovery).
Key test cases:
- BoundaryEdit(strip=1, append="y").apply("happi", side="right") → "happy"
- BoundaryEdit(strip=1, append="").apply("runn", side="right") → "run"
- BoundaryEdit(strip=0, append="e").apply("mak", side="right") → "make"
- BoundaryEdit(strip=1, append="").apply("es", side="left") → "s"
- Roundtrip: every training pair recoverable via its codebook entry
- [ ] Step 2: Run tests, verify they fail
- [ ] Step 3: Implement BoundaryEdit and EditCodebook
BoundaryEdit: frozen dataclass with strip: int, append: str, is_identity property, apply(surface, side) method.
EditCodebook: index 0 is always identity. from_pairs(pairs) mines the set of edits from (surface, canonical, side) triples. encode(surface, canonical, side) returns the codebook index. save()/load() for JSON persistence. _infer_edit() helper finds the minimal edit by comparing common prefix (right-side) or common suffix (left-side).
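A sketch of the edit representation and the mining helper, covering the key test cases above (infer_edit is a hypothetical name for the _infer_edit helper; the full EditCodebook with persistence is omitted):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundaryEdit:
    strip: int   # characters removed at the boundary
    append: str  # characters appended in their place

    @property
    def is_identity(self) -> bool:
        return self.strip == 0 and self.append == ""

    def apply(self, surface: str, side: str) -> str:
        if side == "right":  # edit the right edge (e.g. base before a suffix)
            return surface[: len(surface) - self.strip] + self.append
        return self.append + surface[self.strip :]  # "left": edit the left edge

def infer_edit(surface: str, canonical: str, side: str) -> BoundaryEdit:
    """Minimal edit turning a surface span into its canonical form."""
    n = min(len(surface), len(canonical))
    if side == "right":
        # Keep the longest common prefix, rewrite the rest.
        k = 0
        while k < n and surface[k] == canonical[k]:
            k += 1
        return BoundaryEdit(strip=len(surface) - k, append=canonical[k:])
    # Left side: keep the longest common suffix, rewrite the rest.
    k = 0
    while k < n and surface[-1 - k] == canonical[-1 - k]:
        k += 1
    return BoundaryEdit(strip=len(surface) - k, append=canonical[: len(canonical) - k])
```

Under this sketch, infer_edit("happi", "happy", "right") yields BoundaryEdit(strip=1, append="y"), which applied to "happi" recovers "happy" — the roundtrip property the tests require.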
- [ ] Step 4: Run tests, verify they pass
- [ ] Step 5: Commit — feat(tokenizer): add boundary edit codebook for canonical form recovery
Task 5: Canonical Data Loader
Files:
- Create: packages/tokenizer/src/phonolex_tokenizer/data/canonical_loader.py
- Test: packages/tokenizer/tests/test_canonical_loader.py
- [ ] Step 1: Write failing tests
Test parse_derivational_line() and parse_inflectional_line() for surface-faithful, allomorphic, prefix, and malformed cases. Tests use raw TSV strings, no file I/O.
Key: parse_derivational_line("happy\thappily\tJJ\tRB\tly\tsuffix") → DecompositionExample(word="happily", base="happy", affix="ly", label=SUFFIX, is_allomorphic=True). The allomorphic flag is set because "happy" + "ly" != "happily".
- [ ] Step 2: Run tests, verify they fail
- [ ] Step 3: Implement DecompositionExample, parse_derivational_line, parse_inflectional_line, load_morphynet_canonical
DecompositionExample: frozen dataclass with word, base, affix, label, is_allomorphic. Parse functions split on tabs, determine label from prefix/suffix type field, set is_allomorphic = (reconstructed != target). load_morphynet_canonical() reads both derivational and inflectional TSVs, deduplicates by (word, base, affix). Inflectional entries get label=INFLECTION.
Unlike the existing loaders.py, this loader KEEPS allomorphic entries.
- [ ] Step 4: Run tests, verify they pass
- [ ] Step 5: Commit — feat(tokenizer): add MorphyNet canonical decomposition loader
Task 6: Hard Negative Generator
Files:
- Create: packages/tokenizer/src/phonolex_tokenizer/data/negatives.py
- Test: packages/tokenizer/tests/test_negatives.py
- [ ] Step 1: Write failing tests
Test find_opaque_words() (butter/hammer are opaque, runner is not) and find_root_only_words() (MorphyNet sources that are never targets). Both return MonoExample objects.
- [ ] Step 2: Run tests, verify they fail
- [ ] Step 3: Implement MonoExample, find_opaque_words, find_root_only_words
find_opaque_words(decomposed_words, all_words): words in all_words - decomposed_words that end in common suffix patterns (-er, -ing, -ed, -ly, etc.) or start with common prefix patterns (un-, re-, etc.). These look decomposable but aren't.
find_root_only_words(sources, targets): sources - targets → confirmed monomorphemic roots.
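A sketch of both miners (returning plain words rather than MonoExample for brevity; the affix lists are an illustrative subset, not the full pattern inventory):

```python
COMMON_SUFFIXES = ("er", "ing", "ed", "ly")  # illustrative subset
COMMON_PREFIXES = ("un", "re")

def find_opaque_words(decomposed_words: set[str], all_words: set[str]) -> list[str]:
    """Words that look affixed but have no attested decomposition (butter, hammer)."""
    candidates = sorted(all_words - decomposed_words)
    return [w for w in candidates
            if w.endswith(COMMON_SUFFIXES) or w.startswith(COMMON_PREFIXES)]

def find_root_only_words(sources: set[str], targets: set[str]) -> list[str]:
    """Bases that are never themselves derived: confirmed monomorphemic roots."""
    return sorted(sources - targets)
```

Here "runner" is excluded from the opaque set because it is in decomposed_words, while "butter" and "hammer" match the -er pattern without any attested split.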
- [ ] Step 4: Run tests, verify they pass
- [ ] Step 5: Commit — feat(tokenizer): add hard negative mining for monomorphemic examples
Task 7: Decomposer Metrics
Files:
- Create: packages/tokenizer/src/phonolex_tokenizer/eval/decomposer_metrics.py
- Test: packages/tokenizer/tests/test_decomposer_metrics.py
- [ ] Step 1: Write failing tests
Test decomposition_accuracy() (correct split vs wrong split vs mono-mismatch), canonical_form_accuracy() (correct canonical vs surface-instead-of-canonical, skips Mono), hard_negative_precision() (Mono gold correctly predicted vs false positive decomposition).
- [ ] Step 2: Run tests, verify they fail
- [ ] Step 3: Implement three metric functions
All take (golds: list[MorphTree], preds: list[MorphTree]) and return float. decomposition_accuracy: matches on (base, affix, label) for Splits, or both Mono. canonical_form_accuracy: only evaluates Split-vs-Split pairs, checks base+affix strings match. hard_negative_precision: only evaluates gold-Mono pairs, checks pred is also Mono.
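The three metrics could be sketched as follows, with minimal Mono/Split stand-ins in place of the real MorphTree types:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mono:
    word: str

@dataclass(frozen=True)
class Split:
    word: str
    base: str
    affix: str
    label: str

def decomposition_accuracy(golds: list, preds: list) -> float:
    """Match on (base, affix, label) for Splits, or both sides Mono."""
    hits = 0
    for g, p in zip(golds, preds):
        if isinstance(g, Mono) and isinstance(p, Mono):
            hits += 1
        elif isinstance(g, Split) and isinstance(p, Split):
            hits += (g.base, g.affix, g.label) == (p.base, p.affix, p.label)
    return hits / len(golds) if golds else 0.0

def canonical_form_accuracy(golds: list, preds: list) -> float:
    """Only Split-vs-Split pairs count; checks canonical base+affix strings."""
    pairs = [(g, p) for g, p in zip(golds, preds)
             if isinstance(g, Split) and isinstance(p, Split)]
    if not pairs:
        return 0.0
    return sum((g.base, g.affix) == (p.base, p.affix) for g, p in pairs) / len(pairs)

def hard_negative_precision(golds: list, preds: list) -> float:
    """Of gold-Mono words, how many the model also left undecomposed."""
    pairs = [(g, p) for g, p in zip(golds, preds) if isinstance(g, Mono)]
    if not pairs:
        return 0.0
    return sum(isinstance(p, Mono) for _, p in pairs) / len(pairs)
```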
- [ ] Step 4: Run tests, verify they pass
- [ ] Step 5: Commit — feat(tokenizer): add decomposer evaluation metrics
Task 8: Decomposer Model (nn.Module)
Files:
- Create: packages/tokenizer/src/phonolex_tokenizer/decomposer/model.py
- Test: packages/tokenizer/tests/test_decomposer_model.py
- [ ] Step 1: Write failing tests
Test output shapes from forward(): mono_logit (B,1), split_logits (B,N-1), label_logits (B,3), base_edit_logits (B,codebook), affix_edit_logits (B,codebook). Test edge cases: single-char word (0 split positions), two-char word (1 split position).
- [ ] Step 2: Run tests, verify they fail
- [ ] Step 3: Implement DecomposerModel
Reuses CharEncoder from model/encoder.py. Four heads:
- Mono head: Linear(enc_dim, 1) on mean-pooled encoder output
- Split head: Linear(enc_dim*3, 1) on [h[i-1]; h[i]; h[i-1]-h[i]] for each interior position, masked to valid positions
- Label head: Linear(enc_dim, 3) for PREFIX/SUFFIX/INFLECTION
- Edit heads: Linear(enc_dim*2, codebook_size) for base and affix edits
Two forward methods:
- forward(chars, lengths) — uses mean-pooled repr for heads 3-4 (inference-time; Decomposer wrapper re-calls with split point)
- forward_with_split(chars, lengths, split_points) — uses gold/predicted split point to pool correct affix span and extract boundary-adjacent states for heads 3-4
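The split head's feature construction and masking could look like this (a sketch of the tensor mechanics only, with hypothetical helper names; the real head lives inside DecomposerModel.forward):

```python
import torch

def split_features(h: torch.Tensor) -> torch.Tensor:
    """Boundary features for every interior split position.

    h: (B, N, D) encoder states. Returns (B, N-1, 3*D) rows of
    [h[i-1]; h[i]; h[i-1] - h[i]] for i = 1..N-1.
    """
    left, right = h[:, :-1, :], h[:, 1:, :]
    return torch.cat([left, right, left - right], dim=-1)

def split_logits(feats: torch.Tensor, proj: torch.nn.Module,
                 lengths: torch.Tensor) -> torch.Tensor:
    """Score each split position, masking positions past each word's end."""
    logits = proj(feats).squeeze(-1)                      # (B, N-1)
    pos = torch.arange(logits.size(1), device=logits.device)
    mask = pos.unsqueeze(0) < (lengths - 1).unsqueeze(1)  # a length-L word has L-1 splits
    return logits.masked_fill(~mask, float("-inf"))
```

The mask makes the single-char edge case fall out naturally: a length-1 word has zero valid positions, so every logit is -inf and the split softmax is degenerate, which is why the mono head must gate it.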
- [ ] Step 4: Run tests, verify they pass
- [ ] Step 5: Commit — feat(tokenizer): add DecomposerModel — BiLSTM encoder + 4 prediction heads
Task 9: Decomposer Wrapper
Files:
- Create: packages/tokenizer/src/phonolex_tokenizer/decomposer/decomposer.py
- Test: packages/tokenizer/tests/test_decomposer.py
This is the main integration point — the high-level class users interact with.
- [ ] Step 1: Write failing tests
Test with tiny synthetic data (4 positives + 2 negatives):
- Decomposer.build(positives, negatives) succeeds, builds codebooks
- decompose("darkness") returns Mono | Split
- decompose_batch(["darkness", "cat"]) returns 2 results
- decompose("darkness", max_depth=3) returns recursive tree
- train_epoch() returns a float loss
- save()/load() roundtrip produces same output type
- [ ] Step 2: Run tests, verify they fail
- [ ] Step 3: Implement Decomposer
Key methods:
- build(positives, negatives): builds CharVocab from all words, mines edit codebooks from training data, instantiates DecomposerModel
- train_epoch(positives, negatives, optimizer): shuffles, batches, computes multi-task loss (BCE for mono + CE for split point + CE for label + CE for base edit + CE for affix edit). Non-mono losses only computed on non-mono examples.
- decompose_batch(words): single-step inference — mono check → split point → label → edit application
- decompose(word, max_depth): recursive — calls decompose_batch([word]), if Split recurses into base
- save(path) / load(path): persists char_vocab.json, base_codebook.json, affix_codebook.json, model.pt
Helper: _find_split_pos(word, base, affix, label) → surface split position. For suffixes: len(word) - len(affix). For prefixes: len(affix). Simple heuristic — autoresearch can refine.
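The split-position heuristic and the multi-task loss could be sketched as follows (hypothetical function names and tensor layout; BCE for the mono head, cross-entropy for the rest, with non-mono terms restricted to decomposable rows as described above):

```python
import torch
import torch.nn.functional as F

def find_split_pos(word: str, base: str, affix: str, label: str) -> int:
    """Surface split index: affix length from the left for prefixes,
    affix length from the right for suffixes/inflections."""
    return len(affix) if label == "prefix" else len(word) - len(affix)

def multitask_loss(out: dict, tgt: dict) -> torch.Tensor:
    """out: model logits per head; tgt: gold tensors (is_mono float, rest int64)."""
    loss = F.binary_cross_entropy_with_logits(out["mono_logit"], tgt["is_mono"])
    keep = tgt["is_mono"] == 0  # split/label/edit losses only on non-mono examples
    if keep.any():
        for head in ("split", "label", "base_edit", "affix_edit"):
            loss = loss + F.cross_entropy(out[f"{head}_logits"][keep], tgt[head][keep])
    return loss
```

The heuristic gives find_split_pos("darkness", "dark", "ness", "suffix") == 4; allomorphic pairs like happily/happy still split at a surface position ("happi" | "ly"), with the edit heads responsible for recovering the canonical forms.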
- [ ] Step 4: Run tests, verify they pass
- [ ] Step 5: Commit — feat(tokenizer): add Decomposer — train, infer, recursive decompose, save/load
Task 10: Training Script
Files:
- Create: packages/tokenizer/scripts/train_decomposer.py
- [ ] Step 1: Create training script
"""Train the canonical morphological decomposer."""
import argparse
import logging
from pathlib import Path

import torch

from phonolex_tokenizer.data.canonical_loader import load_morphynet_canonical
from phonolex_tokenizer.data.negatives import find_opaque_words, find_root_only_words
from phonolex_tokenizer.decomposer.decomposer import Decomposer

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
logger = logging.getLogger(__name__)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", type=Path, default=Path("packages/tokenizer/data/morphynet"))
    parser.add_argument("--output-dir", type=Path, default=Path("packages/tokenizer/models/decomposer"))
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--embed-dim", type=int, default=64)
    parser.add_argument("--hidden-dim", type=int, default=128)
    args = parser.parse_args()

    # Load data
    logger.info("Loading MorphyNet canonical data from %s", args.data_dir)
    positives = load_morphynet_canonical(args.data_dir)
    logger.info("Loaded %d decomposition examples", len(positives))

    # Build negatives
    targets = {ex.word for ex in positives}
    sources = {ex.base for ex in positives}
    try:
        from phonolex_data.loaders.cmu import load_cmu_dict
        cmu = load_cmu_dict()
        opaque_vocab = set(cmu.keys())
    except Exception:
        opaque_vocab = sources | targets
    opaque = find_opaque_words(targets, opaque_vocab)
    roots = find_root_only_words(sources, targets)
    negatives = opaque + roots
    logger.info("Built %d negative examples (%d opaque, %d roots)", len(negatives), len(opaque), len(roots))

    # Build model
    decomposer = Decomposer.build(
        positives, negatives,
        embed_dim=args.embed_dim, hidden_dim=args.hidden_dim,
    )
    optimizer = torch.optim.Adam(decomposer.get_parameters(), lr=args.lr)
    logger.info("Model built: %d parameters", sum(p.numel() for p in decomposer.get_parameters()))

    # Train
    for epoch in range(1, args.epochs + 1):
        loss = decomposer.train_epoch(positives, negatives, optimizer, batch_size=args.batch_size)
        logger.info("Epoch %d/%d — loss: %.4f", epoch, args.epochs, loss)

    # Save
    decomposer.save(args.output_dir)
    logger.info("Model saved to %s", args.output_dir)


if __name__ == "__main__":
    main()
- [ ] Step 2: Verify script runs (smoke test)
cd /Users/jneumann/Repos/PhonoLex
uv run python packages/tokenizer/scripts/train_decomposer.py --epochs 2 --batch-size 32
Expected: Training completes, model saved to packages/tokenizer/models/decomposer/
- [ ] Step 3: Commit
git add packages/tokenizer/scripts/train_decomposer.py
git commit -m "feat(tokenizer): add decomposer training script"
Task 11: Benchmark + SIGMORPHON Regression Suite
Files:
- Create: packages/tokenizer/src/phonolex_tokenizer/eval/decomposer_benchmark.py
- [ ] Step 1: Implement benchmark runner
run_decomposer_benchmark(decomposer, test_positives, test_negatives) → dict with decomposition_accuracy, canonical_form_accuracy, hard_negative_precision, stratified by surface_faithful vs allomorphic.
run_sigmorphon_regression(decomposer, sigmorphon_dir) → loads SIGMORPHON 2022 dev set, flattens multi-step canonical decompositions into per-step binary pairs, runs the decomposer recursively, compares trees. Returns pass/fail per entry and overall accuracy.
- [ ] Step 2: Verify benchmark runs against the trained model from Task 10
cd /Users/jneumann/Repos/PhonoLex
uv run python -c "
from phonolex_tokenizer.decomposer.decomposer import Decomposer
from phonolex_tokenizer.eval.decomposer_benchmark import run_decomposer_benchmark
import torch
d = Decomposer.load('packages/tokenizer/models/decomposer', device=torch.device('cpu'))
# Quick smoke test
tree = d.decompose('happily', max_depth=3)
print(tree)
"
- [ ] Step 3: Commit
git add packages/tokenizer/src/phonolex_tokenizer/eval/decomposer_benchmark.py
git commit -m "feat(tokenizer): add decomposer benchmark + SIGMORPHON regression suite"
Task 12: Exports and Governor Integration
Files:
- Modify: packages/tokenizer/src/phonolex_tokenizer/__init__.py
- Modify: packages/generation/server/governor.py:305-324
- [ ] Step 1: Update __init__.py exports
Add to packages/tokenizer/src/phonolex_tokenizer/__init__.py:
from phonolex_tokenizer.decomposer.schema import Mono, Split, MorphTree, flatten_tree
from phonolex_tokenizer.decomposer.decomposer import Decomposer
And add to __all__.
- [ ] Step 2: Update governor root extraction
In packages/generation/server/governor.py, update _get_segmenter() to load Decomposer if available, falling back to legacy Segmenter. Update the root extraction at line ~320 from:
root = "".join(m.text for m in seg.morphemes if m.label.value == "root")
if root in passing_roots:
    passing_words.add(word)
To:
tree = decomposer.decompose(word, max_depth=1)
if isinstance(tree, Split):
    root = tree.base
    if root in passing_roots:
        passing_words.add(word)
- [ ] Step 3: Test governor integration
cd /Users/jneumann/Repos/PhonoLex
uv run python -m pytest packages/generation/server/tests/ -v -k governor
- [ ] Step 4: Commit
git add packages/tokenizer/src/phonolex_tokenizer/__init__.py packages/generation/server/governor.py
git commit -m "feat: integrate canonical decomposer into governor root extraction"
Task 13: Autoresearch Eval Harness
Files:
- Create: packages/tokenizer/scripts/eval_decomposer.py
This is the fixed eval entry point that autoresearch calls after each training experiment.
- [ ] Step 1: Create eval script
Script that:
1. Loads the model from a given path
2. Loads MorphyNet test split + hard negatives + SIGMORPHON regression set
3. Runs all metrics (decomposition accuracy, canonical form accuracy, hard negative precision)
4. Runs SIGMORPHON regression tests
5. Outputs a JSON summary with a single score field (primary metric for autoresearch) and regression_pass: bool
uv run python packages/tokenizer/scripts/eval_decomposer.py --model-dir packages/tokenizer/models/decomposer
Expected output:
{
  "decomposition_accuracy": 0.85,
  "canonical_form_accuracy": 0.82,
  "hard_negative_precision": 0.91,
  "allomorphic_accuracy": 0.73,
  "sigmorphon_regression_pass": true,
  "score": 0.85
}
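The summary emission in step 5 could look like this (write_summary is a hypothetical helper; the only requirements from the plan are the score field mirroring the primary metric and the regression flag):

```python
import json

def write_summary(metrics: dict, regression_pass: bool) -> str:
    """Serialize the eval summary with the single score autoresearch reads."""
    summary = dict(metrics)
    summary["sigmorphon_regression_pass"] = regression_pass
    summary["score"] = metrics["decomposition_accuracy"]  # primary metric
    return json.dumps(summary, indent=2)
```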
- [ ] Step 2: Verify it runs end-to-end against the trained model
- [ ] Step 3: Commit
git add packages/tokenizer/scripts/eval_decomposer.py
git commit -m "feat(tokenizer): add autoresearch eval harness for decomposer"