Canonical Morphological Decomposer

Date: 2026-04-07
Package: packages/tokenizer/ (replaces current semi-Markov CRF segmenter)
License: Apache 2.0

Problem

The current morphological segmenter (semi-Markov CRF, boundary F1 ~0.892) operates on surface character spans. It can only output substrings of the input word, so "happily" yields ["happi", "ly"] — never ["happy", "ly"]. This is an architectural ceiling, not a data or tuning problem.

This breaks the governor integration at governor.py:320, where the extracted root "happi" fails to match "happy" in the norms DB. It also produces output that is not semantically interpretable to clinicians or researchers — "happi" is not a word.

Design

Task Definition

Recursive binary canonical decomposition. The model performs one decomposition step: given a word (or morpheme candidate), it returns either "monomorphemic" or a binary split into canonical base + canonical affix with a label.

  • Input: happily → Output: Split(base="happy", affix="ly", label=SUFFIX)
  • Input: happy → Output: Split(base="hap", affix="y", label=SUFFIX)
  • Input: hap → Output: Mono()
  • Input: butter → Output: Mono()

Recursive application builds a full decomposition tree. Each depth level is a valid stopping point. An API consumer (clinician, researcher, governor) picks their depth.

Output Schema

from __future__ import annotations  # lets Split reference MorphTree before it is defined

from dataclasses import dataclass

@dataclass(frozen=True)
class Mono:
    """Word is monomorphemic — no further decomposition."""
    word: str

@dataclass(frozen=True)
class Split:
    """Binary canonical decomposition."""
    word: str           # original input
    base: str           # canonical base form
    affix: str          # canonical affix form
    label: MorphLabel   # PREFIX, SUFFIX, or INFLECTION
    base_tree: MorphTree | None = None   # populated by recursive decomposition
    affix_tree: MorphTree | None = None  # populated by recursive decomposition

MorphTree = Mono | Split

Labels: ROOT (leaf label for monomorphemic), PREFIX, SUFFIX, INFLECTION. Same four as the current CRF.
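The schema references MorphLabel without defining it. A minimal sketch consistent with the four labels listed (the enum name comes from the schema; the string values are assumptions, chosen to match the lowercase labels in the API example):

```python
from enum import Enum

class MorphLabel(str, Enum):
    """Labels for decomposition steps. ROOT appears only on leaves."""
    ROOT = "root"
    PREFIX = "prefix"
    SUFFIX = "suffix"
    INFLECTION = "inflection"
```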

Recursive Tree Example

decompose("unfortunately")

unfortunately
├── un (PREFIX)
└── fortunately
    ├── fortunate
    │   ├── fortune (ROOT)
    │   └── ate (SUFFIX)
    └── ly (SUFFIX)

The tree is built by calling the model recursively on the base of each split until Mono() is returned. Affixes are always leaves (a single affix is monomorphemic by definition in this schema). A hard max-depth limit (default: 10) prevents infinite recursion in degenerate cases.
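The recursion described above can be sketched as a thin driver around a one-step model. Here a toy lookup table stands in for the trained model and the tree classes are simplified; only the decompose name and max_depth behavior come from the text:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mono:
    word: str

@dataclass(frozen=True)
class Split:
    word: str
    base: str
    affix: str
    label: str
    base_tree: "Mono | Split | None" = None

# Toy one-step oracle standing in for the trained model.
STEPS = {
    "unfortunately": ("fortunately", "un", "PREFIX"),
    "fortunately": ("fortunate", "ly", "SUFFIX"),
    "fortunate": ("fortune", "ate", "SUFFIX"),
}

def decompose(word: str, max_depth: int = 10) -> "Mono | Split":
    """Apply the one-step model recursively on the base until Mono.
    Affixes are leaves by definition, so only base_tree recurses."""
    step = STEPS.get(word)
    if step is None or max_depth == 0:
        return Mono(word)
    base, affix, label = step
    return Split(word, base, affix, label,
                 base_tree=decompose(base, max_depth - 1))
```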

Architecture

Decompose-and-Edit Paradigm

The model finds a split point on the surface form, classifies the affix, then predicts small boundary edits to recover the canonical spelling of each part. Single forward pass, no autoregressive decoding.

Encoder

Character-level BiLSTM. 2-layer bidirectional LSTM, character embeddings (64-dim), hidden dim 128 per direction, 256-dim contextual output per character position. LayerNorm on encoder output.

Same proven architecture as the current CRF encoder.

Four Prediction Heads

Head 1 — Monomorphemic classifier. Mean-pooled encoder output → Linear(256, 1) → sigmoid, giving the probability that the word is monomorphemic. If the score exceeds the threshold, return Mono(). No other heads fire.

Head 2 — Split-point predictor. Softmax over interior character positions 1..N-1. Each position's score is computed from the boundary representation: concat(h[i-1], h[i], h[i-1] - h[i]) → Linear(768, 1). Predicts the single position where the morpheme boundary falls in the surface form.

Head 3 — Label classifier. Mean-pooled encoder hidden states over the affix span → Linear(256, 3) → softmax over {PREFIX, SUFFIX, INFLECTION}. The base is always the non-affix side; its label at this decomposition step is implicitly ROOT (it may decompose further on recursion).

Head 2 predicts a position-only split point; Head 3's label then determines which side of the split is the affix: the left side for PREFIX, the right side otherwise. Both heads use the same encoder output and can be computed in parallel.
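Concretely, the split position and the label jointly determine the two surface sides. A minimal sketch (the function name is an assumption):

```python
def surface_sides(word: str, split_pos: int, label: str) -> tuple[str, str]:
    """Return (base_surface, affix_surface) for a boundary at split_pos
    (an interior position in 1..len(word)-1). PREFIX puts the affix on
    the left of the split; SUFFIX and INFLECTION put it on the right."""
    left, right = word[:split_pos], word[split_pos:]
    if label == "PREFIX":
        return right, left
    return left, right
```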

Head 4 — Canonical boundary edit. For each side of the split (base and affix), predict a boundary edit operation to recover the canonical form from the surface substring.

An edit operation is (strip_count, append_chars):

  • For the base: strip/append at the boundary-adjacent end (right end for suffixes, left end for prefixes)
  • For the affix: strip/append at the boundary-adjacent end (left end for suffixes, right end for prefixes)

The edit codebook is mined from training data — the set of all (strip, append) pairs observed. For English this is empirically small (~15-20 entries). Each side's edit is predicted as a classification over the codebook, using the boundary-adjacent encoder hidden states as input.
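How an edit is mined from an aligned (surface, canonical) pair, and applied back at inference time, can be sketched as follows for the suffix case, where the edit sits at the right end of the base (a simplified illustration; the prefix case mirrors it at the left end):

```python
def right_edit(surface: str, canonical: str) -> tuple[int, str]:
    """Mine (strip_count, append_chars) at the right end that rewrites
    surface into canonical, assuming the change is boundary-local."""
    k = 0  # length of the longest common prefix
    while k < min(len(surface), len(canonical)) and surface[k] == canonical[k]:
        k += 1
    return len(surface) - k, canonical[k:]

def apply_right_edit(surface: str, edit: tuple[int, str]) -> str:
    """Invert right_edit: strip then append at the right end."""
    strip, append = edit
    kept = surface[:len(surface) - strip] if strip else surface
    return kept + append
```

Running right_edit over all training pairs yields the codebook: ("happi", "happy") gives (1, "y"), ("mak", "make") gives (0, "e"), and identity pairs give (0, "").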

Examples:

| Word | Split pos | Base surface | Base edit | Base canonical | Affix surface | Affix edit | Affix canonical |
|------|-----------|--------------|-----------|----------------|---------------|------------|-----------------|
| happily | 5 | happi | strip "i", append "y" | happy | ly | identity | ly |
| running | 4 | runn | strip "n" | run | ing | identity | ing |
| making | 3 | mak | append "e" | make | ing | identity | ing |
| flies | 3 | fli | strip "i", append "y" | fly | es | strip left "e" | s |
| darkness | 4 | dark | identity | dark | ness | identity | ness |
| unhappy | 2 | (affix) un | identity | un | happy | identity | happy |

~85-90% of decompositions require identity edits on both sides. The edit head only needs to activate for the ~10-15% of cases involving spelling changes.

Parameter Estimate

~1.5-3M parameters total. Trainable on a laptop (MPS) in minutes per epoch.

Training Data

Primary: MorphyNet Derivational (~225K single-step pairs)

MorphyNet stores (source, target, affix, type) triples — exactly one derivational step. Each entry becomes one training example:

  • Input: target (the derived word)
  • Gold output: Split(base=source, affix=affix, label=type)

The ~8-15K entries currently dropped by the loader (where source + affix != target) are the allomorphic gold data. These are now the most valuable entries, not rejects.
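For these allomorphic entries, the gold surface split position for Head 2 is not given directly by the triple. One simple heuristic (an assumption, not specified in this design) anchors the split at the longest common suffix between the derived word and the affix:

```python
def surface_split_pos(target: str, affix: str) -> int:
    """Heuristic gold split position for a suffixing triple: align the
    canonical affix against the end of the derived word by longest
    common suffix; everything left of it is the surface base."""
    k = 0
    while k < min(len(target), len(affix)) and target[-1 - k] == affix[-1 - k]:
        k += 1
    return len(target) - k
```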

Secondary: MorphyNet Inflectional (~650K pairs)

Same format for inflections: run → running via ing. Provides dense coverage of -ing, -ed, -s, -er, -est — the inflectional forms that MorphoLex treated as monomorphemic.

Tertiary: UniMorph Paradigm Tables

Lemma → inflected form pairs. Generate training examples algorithmically: (inflected_form, lemma, inferred_affix). Expands inflectional coverage beyond MorphyNet.
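A naive sketch of that generation step, stripping the longest common prefix of lemma and inflected form (an illustration, not the actual loader; it yields a surface affix, which the edit head then canonicalizes):

```python
def inferred_affix(lemma: str, inflected: str) -> str:
    """Surface affix = inflected form minus the longest common prefix
    with the lemma. Allomorphic pairs yield surface (not canonical)
    affixes, e.g. run -> running gives "ning" rather than "ing"."""
    k = 0
    while k < min(len(lemma), len(inflected)) and lemma[k] == inflected[k]:
        k += 1
    return inflected[k:]
```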

Negative Examples (Monomorphemic Words)

The model must learn when NOT to decompose. Three sources of hard negatives:

  1. Morphologically opaque words. Words where an apparent suffix isn't actually one: "butter" (not "butt" + "-er"), "hammer" (not "ham" + "-mer"), "carpet" (not "car" + "-pet"), "monster", "winter". These are the critical hard negatives — they teach the model that surface pattern alone is insufficient.

  2. Known monomorphemic roots. MorphyNet source words with no further derivational parent: "run", "dark", "cat", "hap". Confirmed roots.

  3. Short function words and closed-class items. "the", "with", "very", "from" — unambiguously monomorphemic.

Not Used

  • MorphoLex — incoherent morpheme definitions. Too much is classified as root (treats all inflections as monomorphemic). Not viable for our granularity.
  • SIGMORPHON 2022 (for training) — canonical but traces full derivation chains (decomposes "happy" into "hap" + "y" in one entry, not one step at a time). Also has annotation errors. Used for evaluation only.

Estimated Training Set Size

~300-400K decomposition examples + ~100-150K monomorphemic negatives. Sufficient for a 2M-param model.

Licenses

All training data is CC BY-SA 3.0 (from Wiktionary). Model weights are a derived work — Apache 2.0 release with CC BY-SA attribution for training data sources (MorphyNet, UniMorph).

Evaluation

Metrics

  • Decomposition accuracy — correct binary split or correct monomorphemic classification, per step.
  • Canonical form accuracy — when a split is predicted, are both canonical forms correct? Isolates edit head performance.
  • Tree exact match — full recursive decomposition matches gold (strictest metric).
  • Hard negative precision — accuracy on morphologically opaque words. Measures OOV robustness.

Test Splits

  • Surface-faithful — words where canonical = surface (no spelling change). Baseline competence.
  • Allomorphic — words requiring canonical edits. The metric we exist to improve.
  • Hard negatives — morphologically opaque words. Guards against over-decomposition.
  • OOV — held-out words not seen in training. Generalization test.

Expected Trees (Autoresearch Regression Tests)

Derived from SIGMORPHON 2022 canonical decompositions. Flatten SIGMORPHON's multi-step entries into per-step binary pairs, then sample a regression suite covering surface-faithful, allomorphic, and multi-step cases. An autoresearch experiment that regresses on any expected tree is rejected regardless of aggregate metrics.

Hard negative regression tests (monomorphemic words like "butter", "hammer", "carpet") are added separately since SIGMORPHON does not cover negative examples.

Validation Against SIGMORPHON 2022

Flatten SIGMORPHON's multi-step canonical decompositions into per-step pairs. Run the model recursively and compare output trees. Provides a comparison point against the 93.84% F1 SOTA (different task formulation, but same underlying capability).

Integration

Governor (governor.py:320)

The existing root extraction code:

root = "".join(m.text for m in seg.morphemes if m.label.value == "root")

Becomes:

tree = decomposer.decompose(word, max_depth=1)
root = tree.base if isinstance(tree, Split) else word

Since the model outputs canonical forms, root will be "happy" (not "happi"), correctly matching entries in passing_roots.

API Endpoint

Expose as a PhonoLex API endpoint for clinicians and researchers:

GET /api/morphology/decompose?word=unfortunately&depth=2

{
  "word": "unfortunately",
  "type": "split",
  "affix": "un",
  "label": "prefix",
  "base": {
    "word": "fortunately",
    "type": "split",
    "affix": "ly",
    "label": "suffix",
    "base": {
      "word": "fortunate",
      "type": "mono"
    }
  }
}

Default depth: full recursive decomposition. depth=1 for one-step only.
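The response shape above can be produced from a MorphTree with a depth cutoff, rendering any node at the cutoff as "mono" (as the depth=2 example does for "fortunate"). A sketch with simplified tree classes (the serializer name is an assumption):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mono:
    word: str

@dataclass(frozen=True)
class Split:
    word: str
    affix: str
    label: str
    base_tree: "Mono | Split"

def to_payload(tree, depth=None):
    """Serialize to the endpoint's JSON shape. depth=None means full
    recursion; a node at depth 0 is rendered as mono even if it splits."""
    if isinstance(tree, Mono) or depth == 0:
        return {"word": tree.word, "type": "mono"}
    child_depth = None if depth is None else depth - 1
    return {
        "word": tree.word,
        "type": "split",
        "affix": tree.affix,
        "label": tree.label.lower(),
        "base": to_payload(tree.base_tree, child_depth),
    }
```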

Backward Compatibility

The Segmenter class API (.segment(word), .segment_batch(words)) is preserved as a compatibility wrapper that flattens the tree into a flat MorphSegmentation. New consumers use the Decomposer class and MorphTree directly.
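The wrapper's flattening step can be sketched as an in-order walk (simplified classes; note that because morphemes are canonical, concatenating them need not reproduce the surface word):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mono:
    word: str

@dataclass(frozen=True)
class Split:
    word: str
    base: str
    affix: str
    label: str
    base_tree: "Mono | Split | None" = None

def flatten(tree) -> list[tuple[str, str]]:
    """Left-to-right (text, label) morpheme list. Prefixes sort before
    the base; suffixes and inflections after it."""
    if isinstance(tree, Mono):
        return [(tree.word, "ROOT")]
    base = flatten(tree.base_tree) if tree.base_tree else [(tree.base, "ROOT")]
    affix = [(tree.affix, tree.label)]
    return affix + base if tree.label == "PREFIX" else base + affix
```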

Autoresearch Strategy

Same pattern as the CRF autoresearch (which pushed F1 from 0.756 to 0.892 in one session):

  1. Define the eval harness with metrics + expected trees as hard pass/fail gates.
  2. Let autoresearch iterate on: hidden dim, layer count, dropout, learning rate, codebook design, training schedule, head architecture details.
  3. The eval harness is the fixed point; the model evolves.

Open Questions

  • Compound words: "blackbird" → "black" + "bird" (two roots, neither is an affix). The binary split model handles the segmentation, but the label would need to be something like COMPOUND rather than PREFIX/SUFFIX/INFLECTION. Defer to autoresearch — may need a 4th label or may be handleable as ROOT + ROOT.
  • Codebook discovery: The exact boundary edit codebook is determined empirically from training data. Size and contents are an autoresearch parameter.
  • Monomorphemic threshold: The sigmoid threshold for Head 1 is a tunable hyperparameter. Autoresearch can optimize it against the hard negative precision metric.