
Pointer-Generator Canonical Morphological Decomposer

Date: 2026-04-14
Status: Approved
Replaces: Split-point + edit-codebook decomposer (packages/tokenizer/src/phonolex_tokenizer/decomposer/)

Problem Statement

The current canonical morphological decomposer (BiLSTM with 4 heads: mono classifier, split-point predictor, label classifier, boundary edit predictor) hits an ~81.5% decomposition accuracy ceiling confirmed across three independent fair shake variants at full convergence. The root cause is architectural: a shared encoder serves both a monomorphemic gate and a decomposition pipeline, and the training objectives for these two tasks pull the encoder in opposite directions. This produces a Pareto frontier between allomorphic accuracy (~0.60 max) and hard negative precision (~0.74 max) that cannot be resolved by hyperparameter tuning, loss weighting, or capacity changes.

The decompose-and-edit paradigm also makes three structural commitments that limit ceiling:

  1. Binary split — each decomposition level predicts a single cut point. No global view of morpheme structure.
  2. Discrete codebook — edit operations are a closed set of ~15-20 mined (strip, append) pairs. Unseen allomorphic patterns silently fall back to identity.
  3. Separate mono gate — the monomorphemic decision is a separate head, creating the encoder-level tension.

Solution: Character-Level Seq2Seq with Pointer-Generator

Replace the split-point + edit-codebook paradigm with a character-level sequence-to-sequence model that uses a pointer-generator mechanism to learn when to copy input characters vs. generate canonical characters.

Why This Dissolves the Pareto Tension

There is no separate mono classifier. A monomorphemic word is one where the model copies every input character and emits EOS without generating a + boundary token. The mono/poly decision is distributed across every decoding step rather than compressed into a single sigmoid threshold. The model doesn't decide "mono or poly" up front — it decides at each character whether to keep copying or to do something different.

There is no edit codebook. The decoder learns character-level mappings directly. Unseen allomorphic patterns are handled by the decoder's generative capacity, not by codebook coverage.

There is no separate split-point predictor. Morpheme boundaries emerge as + tokens in the output sequence. The model can emit multiple boundaries in a single pass (flat decomposition), eliminating recursive inference.

Literature Support

  • Mager et al. 2020 ("Tackling the Low-resource Challenge for Canonical Segmentation"): LSTM pointer-generator achieved 78% accuracy on English canonical segmentation. The pointer mechanism explicitly decides copy-vs-generate, reducing learning complexity because most of canonical segmentation is copying.
  • Makarov & Clematide 2018 / CLUZH 2022: Neural transducers with copy/insert/delete actions, trained via imitation learning. Won/shared-first in SIGMORPHON 2022 sentence-level segmentation. "Learning to apply a generic copy action enables the approach to generalize quickly."
  • Aharoni & Goldberg 2017: Hard monotonic attention for morphological inflection. Key insight: morphological tasks are dominated by copying lemma characters with localized edits.
  • DeepSPIN-3 (SIGMORPHON 2022 winner): Transformer with entmax (sparse) loss achieved 97.29% F1 on surface segmentation. Entmax produces crisper boundary decisions.

Our 81.5% on canonical segmentation is competitive with the literature (Mager 2020: 78% English, Ginn 2025: 66.59% on low-resource canonical tasks). Surface segmentation numbers (93-97%) are not directly comparable — canonical segmentation is substantially harder.

Architecture

Encoder

2-layer bidirectional LSTM over character embeddings.

  • Embedding: Embedding(vocab_size, 64) with padding_idx=0. Vocab: a-z + hyphen + apostrophe + PAD = 29 characters; the decoder's extended output vocabulary adds + and EOS.
  • LSTM: LSTM(64, 128, num_layers=2, bidirectional=True, dropout=0.1, batch_first=True)
  • Output: H ∈ (batch, seq_len, 256) — hidden states at each input position.
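The encoder spec above can be sketched directly in PyTorch. This is a minimal illustration, not the actual implementation; the class name CharEncoder is assumed here:

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """2-layer BiLSTM over character embeddings, dims as in the spec above.

    Class and attribute names are illustrative, not the real module names.
    """

    def __init__(self, vocab_size: int = 29, emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, dropout=0.1, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) -> H: (batch, seq_len, 2 * hidden)
        emb = self.embedding(char_ids)
        H, _ = self.lstm(emb)
        return H
```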

Decoder

1-layer unidirectional LSTM, autoregressive.

  • Input at each step: concatenation of previous output character embedding (64-dim) + context vector from attention (256-dim) = 320-dim.
  • LSTM: LSTM(320, 256, num_layers=1, batch_first=True)
  • Output: hidden state s_t ∈ (batch, 256) at each decoding step.

Attention

Bahdanau (additive) attention over encoder hidden states.

e_{t,i} = v^T tanh(W_h H_i + W_s s_t + b)
α_t = softmax(e_t)          # or entmax(e_t) in experiment variant
c_t = Σ α_{t,i} · H_i
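The three equations above map onto a small module like the following sketch. The attention dimension (att_dim) is an assumption not fixed by the spec; the bias b is folded into the W_s projection:

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: e_{t,i} = v^T tanh(W_h H_i + W_s s_t + b).

    Names and att_dim are illustrative, not the actual implementation.
    """

    def __init__(self, enc_dim: int = 256, dec_dim: int = 256, att_dim: int = 256):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, att_dim, bias=False)
        self.W_s = nn.Linear(dec_dim, att_dim, bias=True)  # bias here plays the role of b
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, H: torch.Tensor, s_t: torch.Tensor):
        # H: (batch, seq_len, enc_dim); s_t: (batch, dec_dim)
        scores = self.v(torch.tanh(self.W_h(H) + self.W_s(s_t).unsqueeze(1)))  # (batch, seq_len, 1)
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)                      # α_t over input positions
        c_t = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)                      # c_t = Σ α_{t,i} · H_i
        return alpha, c_t
```

The entmax variant (Experiment 1) would swap the softmax call for entmax15 from the entmax package.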

Pointer-Generator Gate

A learned scalar p_gen ∈ (0, 1) computed at each decoding step:

p_gen = σ(w_s · s_t + w_c · c_t + w_y · y_{t-1} + b)

The final output distribution over the extended vocabulary (characters, the + boundary token, and EOS):

P(w) = p_gen · P_vocab(w) + (1 - p_gen) · Σ_{i: x_i = w} α_{t,i}
  • p_gen → 0: copy mode (pointer). Model copies a character from the input at the position with highest attention weight.
  • p_gen → 1: generate mode. Model generates from the decoder vocabulary (used for + boundary tokens, EOS, and allomorphic character edits).
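A sketch of the gate and the mixture, assuming the input character ids share the decoder's vocabulary indices (so attention mass can be scattered directly into the extended distribution). All names are illustrative:

```python
import torch
import torch.nn as nn

class PointerGeneratorGate(nn.Module):
    """p_gen = sigma(w_s · s_t + w_c · c_t + w_y · y_{t-1} + b), as a single linear layer."""

    def __init__(self, dec_dim: int = 256, enc_dim: int = 256, emb_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dec_dim + enc_dim + emb_dim, 1)

    def forward(self, s_t, c_t, y_prev):
        return torch.sigmoid(self.proj(torch.cat([s_t, c_t, y_prev], dim=-1)))  # (batch, 1)

def final_distribution(p_gen, p_vocab, alpha, src_ids, vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: x_i = w} alpha_{t,i}.

    p_vocab: (batch, vocab) generator distribution; alpha: (batch, seq_len)
    attention weights; src_ids: (batch, seq_len) input character ids.
    """
    copy_dist = torch.zeros_like(p_vocab)
    copy_dist.scatter_add_(1, src_ids, alpha)  # pool attention mass per character type
    return p_gen * p_vocab + (1 - p_gen) * copy_dist
```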

Parameter Estimate

Component                              Parameters
Character embedding (shared enc/dec)   29 × 64 ≈ 1.9K
Encoder BiLSTM (2-layer)               ~594K
Decoder LSTM (1-layer)                 ~592K
Attention (W_h, W_s, v)                ~200K
Pointer gate (w_s, w_c, w_y)           ~800
Output projection (256 → vocab)        ~8K
Total                                  ~1.4M

Lighter than the current model (~2.6M) due to eliminating the 4 separate heads and the 4*enc_dim edit head input. If this proves undertrained, the 3-layer encoder experiment adds ~400K params.
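The LSTM rows can be sanity-checked with the standard PyTorch parameter-count formula (per layer and direction: 4·h·(in + h) weights plus two bias vectors of 4·h each):

```python
def lstm_params(input_size: int, hidden: int, layers: int = 1,
                bidirectional: bool = False) -> int:
    """Parameter count of a PyTorch nn.LSTM with these settings."""
    dirs = 2 if bidirectional else 1
    total = 0
    for layer in range(layers):
        # Layers after the first consume the (possibly bidirectional) output.
        in_size = input_size if layer == 0 else hidden * dirs
        total += dirs * (4 * hidden * (in_size + hidden) + 8 * hidden)
    return total

embedding = 29 * 64                                            # shared char embedding
encoder = lstm_params(64, 128, layers=2, bidirectional=True)   # ≈ 594K
decoder = lstm_params(320, 256)                                # ≈ 592K
print(embedding, encoder, decoder)
```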

Data Representation

Input-Output Format

Training pairs are character sequences. Input: the surface word's characters, space-separated. Output: the canonical morphemes joined by +, likewise as space-separated characters.

Polymorphemic (surface-faithful):

Input:  k i n d n e s s
Output: k i n d + n e s s

Polymorphemic (allomorphic):

Input:  h a p p i l y
Output: h a p p y + l y

Polymorphemic (prefix):

Input:  u n h a p p y
Output: u n + h a p p y

Multi-morphemic (flat, single pass):

Input:  u n h a p p i l y
Output: u n + h a p p y + l y

Monomorphemic (hard negative):

Input:  b u t t e r
Output: b u t t e r
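The examples above follow one mechanical rendering rule, sketched below; the helper name is illustrative:

```python
def to_training_pair(surface: str, morphemes: list[str]) -> tuple[str, str]:
    """Render a (surface word, canonical morphemes) example as the
    space-separated character sequences shown above.

    A monomorphemic word passes a single-element morpheme list, producing
    an identity copy pair with no '+' token.
    """
    src = " ".join(surface)
    tgt = " + ".join(" ".join(m) for m in morphemes)
    return src, tgt
```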

Data Sources

Same sources as the current model, reformatted:

  • MorphyNet derivational (~225K pairs): source/target/affix/type → canonical decomposition pairs. The loader derives canonical forms from the (source, target, affix) triple.
  • MorphyNet inflectional: same treatment.
  • Hard negatives — opaque words: words in CMU dictionary matching suffix/prefix patterns but absent from MorphyNet decomposed set. Reformatted as identity copy pairs.
  • Hard negatives — root-only: MorphyNet sources never appearing as targets. Identity copy pairs.

Multi-Morphemic Handling

The current model does binary recursive decomposition: split once, then recurse on each half. The seq2seq model decomposes flat in a single pass: all morphemes and boundaries are emitted left-to-right.

For words with 3+ morphemes (e.g., "unhappily" → "un + happy + ly"), the training data must provide the full flat decomposition. This requires chaining MorphyNet's derivational pairs:

  1. MorphyNet has: "happy" → "unhappy" (prefix "un") and "happy" → "happily" (suffix "ly")
  2. Chain: "unhappily" = "un" + "happy" + "ly"

If a word has multiple derivational paths in MorphyNet, use the one that produces the most morphemes (deepest decomposition). When two paths produce the same depth but different segmentations, prefer the path where all intermediate forms exist as MorphyNet entries (validated chain). If still ambiguous, take the first encountered — the model is robust to occasional noise. Words with no chain (single derivation only) produce 2-morpheme pairs as before.
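The chaining step can be sketched as a recursion over per-word derivations. The toy dict structure below stands in for the MorphyNet loader and omits the path-selection logic described above (deepest, validated chain); names are hypothetical:

```python
def flat_decomposition(word: str, derivations: dict) -> list[str]:
    """Chain 2-morpheme derivations into a flat morpheme list.

    `derivations` maps a derived word to one (root, affix, kind) triple,
    e.g. {"unhappily": ("happily", "un", "prefix")}. A word with no entry
    is treated as a base form.
    """
    if word not in derivations:
        return [word]  # base / monomorphemic
    root, affix, kind = derivations[word]
    inner = flat_decomposition(root, derivations)
    return [affix] + inner if kind == "prefix" else inner + [affix]
```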

Train/Dev/Test Split

Same MD5-hash-based 80/10/10 split from packages/tokenizer/scripts/eval_decomposer.py. Direct metric comparability with the 81.5% baseline.
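An MD5-keyed split has the shape sketched below; the exact bucketing lives in eval_decomposer.py, so this is an illustration of the idea rather than the authoritative logic:

```python
import hashlib

def assign_split(word: str) -> str:
    """Deterministic 80/10/10 train/dev/test assignment keyed on MD5(word).

    Hashing the word (not a random draw) keeps the split stable across runs
    and codebases, which is what makes the 81.5% baseline directly comparable.
    """
    bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "dev"
    return "test"
```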

Training

  • Loss: Cross-entropy on the decoder output sequence, teacher-forced. Each character position in the target (including + and EOS) contributes equally.
  • Optimizer: AdamW, lr=0.001, weight_decay=1e-5.
  • Schedule: Cosine annealing, T_max=20 (matched to epoch count), eta_min=1e-5.
  • Batch size: 64.
  • Epochs: 20 (full convergence, matching fair shake protocol).
  • Steps/epoch: ~4,000 (255K examples / 64 batch size).
  • Device: MPS (Apple Silicon).
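The optimizer and schedule wire together as below; the Linear layer is a stand-in for the pointer-generator net and the inner loop for teacher-forced training:

```python
import torch

model = torch.nn.Linear(8, 8)  # hypothetical stand-in for the seq2seq model

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=20, eta_min=1e-5)  # stepped once per epoch, matching 20 epochs

for epoch in range(20):
    loss = model(torch.randn(4, 8)).pow(2).mean()  # stand-in for the teacher-forced CE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

With T_max matched to the epoch count, the learning rate anneals exactly to eta_min on the final epoch.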

Evaluation

Metrics

Same five metrics as the current model, computed by parsing the decoder output:

  1. decomposition_accuracy — full match: correct number of morphemes, correct boundaries, correct canonical forms.
  2. canonical_form_accuracy — each predicted morpheme matches gold (averaged across morphemes).
  3. allomorphic_accuracy — decomposition_accuracy on allomorphic subset.
  4. hard_negative_precision — fraction of monomorphemic words where output contains no +.
  5. surface_faithful_accuracy — decomposition_accuracy on surface-faithful polymorphemic subset.

Output Parsing

Split the output character sequence on +, strip whitespace from each segment. Compare resulting morpheme list to gold morphemes (order-sensitive). If output contains no +, classify as monomorphemic prediction.
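The parsing and per-example classification described above amount to a few lines; the labeled-boundary variant (+s/+p/+i) would split on those tokens instead. Function names are illustrative:

```python
def parse_output(chars: str) -> list[str]:
    """Split a decoded sequence like 'h a p p y + l y' on '+' and strip spaces."""
    return [seg.replace(" ", "") for seg in chars.split("+")]

def score(pred: str, gold_morphemes: list[str]) -> dict:
    """Classify one prediction against gold (order-sensitive comparison)."""
    morphemes = parse_output(pred)
    return {
        "is_mono_prediction": len(morphemes) == 1,  # no '+' in output
        "full_match": morphemes == gold_morphemes,  # decomposition_accuracy numerator
    }
```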

Post-Training Threshold Sweep

After training, sweep the + emission probability threshold to map the Pareto frontier. At each decoding step, only emit + if P(+) > threshold. Sweep thresholds from 0.1 to 0.9 in steps of 0.05. This maps allomorphic_accuracy vs hard_negative_precision at zero additional training cost.
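The thresholded decoding step can be sketched as below; BOUNDARY_ID is a hypothetical vocabulary index for the + token:

```python
import torch

BOUNDARY_ID = 29  # hypothetical index of '+' in the extended vocabulary

def constrained_step(probs: torch.Tensor, threshold: float) -> torch.Tensor:
    """Greedy step that only emits '+' when P(+) > threshold.

    probs: (batch, vocab) output distribution at one decoding step.
    When '+' misses the threshold, its mass is zeroed so argmax falls
    back to the best non-boundary token.
    """
    probs = probs.clone()
    suppress = probs[:, BOUNDARY_ID] <= threshold
    probs[suppress, BOUNDARY_ID] = 0.0
    return probs.argmax(dim=-1)
```

Sweeping the threshold from 0.1 to 0.9 then just re-decodes the dev set with each value and records the two metrics.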

Experiment Sequence

Six experiments, each building on the last if it improves the primary metric (decomposition_accuracy), discarded if not. Same autoresearch keep/discard protocol as the prior round.

Experiment 0: Pointer-Generator LSTM Baseline

Hypothesis: The unified copy-vs-generate paradigm breaks through the 81.5% ceiling by eliminating the mono/edit architectural tension.

Config: Full architecture as described above, 20 epochs, greedy decoding.

Success criterion: decomposition_accuracy > 0.815. If this doesn't beat the current ceiling, experiments 1-5 are unlikely to close the gap and the problem formulation needs reassessment.

Experiment 1: Entmax Attention (α=1.5)

Hypothesis: Sparse attention produces crisper alignment to morpheme-relevant input positions, improving boundary precision and reducing attention smearing across irrelevant characters.

Change: Replace softmax(e_t) with entmax(e_t, alpha=1.5) in the attention computation. Requires the entmax PyTorch package.

Expected effect: Improved hard_negative_precision (crisper "don't attend to anything special" pattern for mono words) and canonical_form_accuracy (crisper alignment for edit positions).

Experiment 2: Labeled Boundaries

Hypothesis: Encoding morpheme type into the boundary token gives the decoder a richer training signal without adding a separate head, improving decomposition accuracy on prefixed words where boundary position is more ambiguous.

Change: Replace single + boundary token with three: +s (suffix boundary), +p (prefix boundary), +i (inflection boundary). Decoder vocabulary grows by 2 tokens (~30 total).

Example: "unhappily" → "u n +p h a p p y +s l y"

Experiment 3: Focal Loss (γ=2)

Hypothesis: Down-weighting easy copy steps (which dominate the loss) and focusing gradient on boundary positions and allomorphic edit characters improves performance on the hardest cases.

Change: Replace cross-entropy with focal loss: FL(p_t) = -(1 - p_t)^γ · log(p_t), γ=2.

Expected effect: Improved allomorphic_accuracy (the model spends more gradient budget on the characters where it diverges from copying).
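The focal loss swap is a small change to the per-position loss; a sketch over flattened (position, vocab) logits:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), averaged over target positions.

    logits: (N, vocab), targets: (N,). With gamma = 0 this reduces to plain
    cross-entropy; gamma = 2 down-weights confident (easy copy) positions.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    p_t = log_p_t.exp()
    return (-(1.0 - p_t) ** gamma * log_p_t).mean()
```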

Experiment 4: Curriculum — Surface-Faithful First

Hypothesis: Training epochs 1-10 on surface-faithful polymorphemic examples + negatives only, then mixing in allomorphic examples for epochs 11-20, establishes clean copy-then-boundary behavior before the model learns character-level edits.

Change: Data schedule. Epochs 1-10: exclude allomorphic training pairs. Epochs 11-20: full training set. No architecture change.

Expected effect: Better hard_negative_precision (the model's copy behavior is deeply established before allomorphic examples introduce divergence from input).

Experiment 5: 3-Layer Encoder

Hypothesis: Additional encoder depth captures longer-range orthographic patterns (prefix-root interactions, compound word structure) without increasing hidden dim or per-step cost.

Change: Encoder BiLSTM layers: 2 → 3. ~400K additional parameters.

Expected effect: Modest improvement across all metrics, especially on longer words with multiple morphemes.

Governor Integration

The governor at packages/generation/server/governor.py calls into the decomposer to extract root forms for norms lookup. The current interface is decompose(word) → MorphTree where MorphTree = Mono | Split.

The new model implements the same interface. Internally:

  1. Encode the word as a character sequence.
  2. Greedy decode to produce the output character sequence.
  3. Parse the output by splitting on + (or +s/+p/+i).
  4. Return a Decomposition(morphemes=[...], labels=[...]) or the existing MorphTree type.

The governor doesn't care how the decomposition was produced — only that it gets canonical morphemes back.

Inference Cost

Single forward pass: encoder (one pass) + decoder (greedy, ~20 steps max). No two-pass inference like the current model. Comparable or faster wall-clock time.

File Structure

New files within the existing packages/tokenizer/ package:

packages/tokenizer/
├── src/phonolex_tokenizer/
│   ├── seq2seq/                        # New subpackage
│   │   ├── __init__.py
│   │   ├── model.py                    # Encoder, Decoder, PointerGeneratorNet
│   │   ├── attention.py                # Bahdanau + entmax variant
│   │   ├── dataset.py                  # Char-sequence dataset from MorphyNet
│   │   └── decoder.py                  # Greedy/beam decode + output parsing
│   ├── data/
│   │   └── seq2seq_loader.py           # MorphyNet → input/output char sequences
│   └── eval/
│       └── seq2seq_benchmark.py        # Same 5 metrics, output parsing
├── scripts/
│   ├── train_seq2seq.py                # Training entry point
│   └── eval_seq2seq.py                 # Evaluation entry point
└── autoresearch/
    └── seq2seq/                        # Experiment harness
        ├── train_experiment.py
        └── experiments/

The existing decomposer/ subpackage remains untouched. The two implementations coexist until the seq2seq model is validated, at which point the decomposer becomes archival.