Dynamic Governor with Informed Backtracking¶
Date: 2026-04-06
Status: Implemented — phoneme exclusion working end-to-end
Branch: feature/governed-generation-ui
Problem¶
The original governor used static per-token masks built before generation starts. Phonemes are properties of words, not tokens — enforcing word-level constraints at the token level is a category error. BPE tokens are arbitrary character chunks with no inherent phonological meaning.
This isn't a data coverage problem. It's a paradigm problem.
Solution¶
Two-layer enforcement using HuggingFace's native model.generate():
1. Reranker (per-step) — a LogitsProcessor that checks top-k candidate tokens against word-level constraints using G2P, boosting compliant tokens and penalizing non-compliant ones. Steers the model proactively.
2. GUARD-style retry (post-hoc) — after generation, check every word with G2P. If violations are found, add the violating words to bad_words_ids and regenerate. Catches anything the reranker missed.
This preserves native model quality (proper KV cache, sampling, stopping) while adding constraint enforcement.
Prior Art¶
- GUARD (ICLR 2025) — rejection sampling with optimized proposals. Our reranker is the proposal optimizer; the post-hoc check is the rejection step.
- Grammar-Aligned Decoding (NeurIPS 2024) — distribution preservation when masking. Informed our reranker's boost/penalize approach over hard masking.
- IterGen (ICLR 2025) — backtracking with KV cache preservation. Informed our custom loop design (retained for future use).
No prior work applies constrained decoding to clinical phonology.
Architecture¶
G2P Integration¶
g2p_en converts words to ARPAbet, then phonolex_data.mappings.load_arpa_to_ipa() converts to IPA. Everything speaks IPA — constraints come in as IPA from the frontend, G2P output is IPA, comparison is direct set intersection. No mapping tables, works for any phoneme.
~0.06 ms per word. A per-generation cache (G2PCache) avoids redundant lookups.
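The cached lookup plus set-intersection check can be sketched without the G2P dependency. Here `to_ipa` is an illustrative stand-in for the real g2p_en → load_arpa_to_ipa() pipeline, and the class shape is assumed, not the actual G2PCache API:

```python
# Sketch: memoized word -> IPA lookup plus direct set intersection
# against the excluded phonemes. `to_ipa` stands in for the real
# g2p_en -> load_arpa_to_ipa() pipeline (assumption, not the real API).

class G2PCache:
    def __init__(self, to_ipa):
        self._to_ipa = to_ipa
        self._memo = {}

    def phonemes(self, word):
        key = word.lower()
        if key not in self._memo:
            self._memo[key] = frozenset(self._to_ipa(key))
        return self._memo[key]


def violates(word, excluded, cache):
    """True if the word contains any excluded IPA phoneme."""
    return bool(cache.phonemes(word) & excluded)
```

Because comparison is plain set intersection on IPA symbols, the same check covers any phoneme combination with no special cases.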
Reranker¶
Portable engine that takes any word-level criterion function and applies it to the model's top-k candidates at each decoding step. Plugs into HuggingFace's LogitsProcessor interface.
# Phoneme exclusion
criterion = lambda w: not any(p in word_to_phonemes(w) for p in excluded)
# AoA bound (future)
criterion = lambda w: norms.get(w, {}).get("aoa", 0) <= max_aoa
# Any word-level check
reranker = Reranker(criterion, tokenizer, g2p_cache)
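The engine's per-step behavior can be sketched without the transformers dependency. The function, the `word_of` helper, and the BOOST/PENALTY values below are illustrative assumptions, not the real interface; the actual engine plugs into transformers' LogitsProcessor:

```python
# Sketch of the boost/penalize step over the top-k candidates.
# BOOST/PENALTY are assumed additive logit adjustments, not tuned values.

BOOST, PENALTY = 2.0, -4.0


def rerank_top_k(scores, candidates, criterion, word_of):
    """Adjust logits for the top-k candidate tokens.

    scores     : dict token_id -> logit
    candidates : iterable of top-k token ids
    criterion  : word -> bool (True = compliant)
    word_of    : token_id -> the word that token would complete
    """
    adjusted = dict(scores)
    for tok in candidates:
        word = word_of(tok)
        adjusted[tok] += BOOST if criterion(word) else PENALTY
    return adjusted
```

Boosting and penalizing, rather than hard-masking, follows the Grammar-Aligned Decoding insight: the model's distribution is steered, not truncated.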
Word Checker¶
Orchestrates constraint checks on a completed word. Takes (word, CheckerConfig, G2PCache), returns CheckResult with pass/fail and violation details. Supports: phoneme exclusion, MSH stage, bounds (AoA, concreteness, etc.), vocab restriction.
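The orchestration pattern, in a minimal sketch: run every configured check against the word and collect all violation details. The dict-of-predicates shape below is an illustrative simplification of CheckerConfig, not the real structure:

```python
# Sketch: word-level check orchestration. `checks` maps a constraint
# label to a word -> bool predicate (illustrative shape, not CheckerConfig).
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    word: str
    passed: bool
    violations: list = field(default_factory=list)


def check_word(word, checks):
    """Run every check; a word passes only if all constraints hold."""
    failed = [name for name, ok in checks.items() if not ok(word)]
    return CheckResult(word, passed=not failed,
                       violations=[f"failed {name}" for name in failed])
```

Any word-level constraint (phoneme exclusion, an AoA bound, vocab restriction) slots in as one more predicate, which is what makes wiring in the remaining constraint types a matter of writing criterion functions.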
Generation Flow¶
1. Encode prompt (chat template for -it models, raw for base)
2. model.generate() with:
- RerankerProcessor (per-step constraint steering)
- PunctuationBoostProcessor (sentence structure)
- repetition_penalty=1.2
- temperature=0.7, top_p=0.9, top_k=50
- bad_words_ids (accumulated from retries)
3. Truncate to last complete sentence
4. G2P word-level compliance check
5. If violations: ban violating words, retry (up to max_retries)
6. Return best compliant text
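Steps 2–6 reduce to a retry loop that accumulates banned words across attempts. This is a dependency-free sketch; `generate` and `check_words` are assumed wrappers around model.generate() and the G2P compliance check, and the fallback behavior on exhausted retries is simplified:

```python
# Sketch of the GUARD-style post-hoc retry loop.

def governed_generate(generate, check_words, max_retries=3):
    """generate    : banned_words -> generated text (wraps model.generate)
    check_words : text -> list of violating words (G2P compliance check)
    """
    banned = set()
    last = None
    for _ in range(max_retries + 1):
        text = generate(banned)
        bad = check_words(text)
        if not bad:
            return text            # fully compliant, done
        banned |= set(bad)         # fed into bad_words_ids on the retry
        last = text
    return last                    # retries exhausted: return last attempt
```

In the real flow the banned words are tokenized into bad_words_ids before being handed back to model.generate(), so a regeneration cannot reproduce them verbatim.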
Package Structure¶
phonolex_governors/
├── checking/
│ ├── g2p.py # G2P wrapper, returns IPA, cached
│ ├── phonology.py # Exclude, MSH, clusters (IPA)
│ └── checker.py # Word-level orchestrator
├── generation/
│ ├── loop.py # Custom autoregressive loop (retained for future)
│ ├── backtrack.py # Backtrack state + intervention planning
│ ├── sampling.py # Top-k/p, repetition penalty, n-gram blocking
│ └── reranker.py # Portable reranking engine
├── boosts/ # Relocated soft mechanisms
├── constraint_types.py # Declarative constraint definitions
└── constraint_compiler.py # Compile to CheckerConfig + BoostConfig
Model¶
Current: google/t5gemma-9b-2b-ul2-it — Google's instruction-tuned T5Gemma (SFT+RLHF). 9B encoder, 2B decoder, bfloat16 on MPS (~24.6GB). Uses chat template via apply_chat_template().
Future: T5Gemma will track Gemma releases (Gemma 3, Gemma 4). Instruction-tuned variants will improve prose quality without any governor changes. Custom LoRA fine-tuning on clinical data remains an option for the base models.
Results¶
- Zero /ɹ/ + /ɝ/ + /ɚ/ violations across multiple generations
- ~10s per generation (128 tokens on MPS)
- Natural, creative prose: "Clementine longed to be an outlaw in the Wild West town of Dusty Gulch"
- Works for any IPA phoneme combination — no special cases
- Reranker checks ~24K tokens/generation, boosts ~15K, penalizes ~9K
What's Next¶
- Wire remaining constraint types (AoA bounds, MSH, complexity, vocab) into the word checker — the infrastructure supports them, they just need criterion functions
- Soft boosts (Include, Thematic, VocabBoost) as additional LogitsProcessors
- Norm/vocab data from PhonoLex database at the word level (replacing per-token governor lookup)
- Multi-draft generation (infrastructure built, default 1, configurable)
- Frontend analysis mode with G2P-based compliance highlighting