MLM Iterative Editor + Argstruc CFG Enumerator: Design (PHON-95)
Date: 2026-05-07
Status: Spec — pending user review
Branch: off release/v5.2.0 (working branch named at writing-plans handoff)
Ticket: PHON-95 (https://neumannsworkshop.atlassian.net/browse/PHON-95)
Predecessors:
- PHON-93 (runtime word-data layer + marisa-trie VocabTrie, merged PR #85, 2026-05-06) — provides WordStore + canonical Parquet artifacts
- PHON-94 (corpus DEP reannotation + selectional preference population, merged PR #86, 2026-05-07) — provides selectional.parquet (5.44M rows × 16 bands of (verb, role, filler) PPMI)
- PHON-66 (governed-generation rethink) — names this work the C1 "combinatorial" track
- PHON-92 (selectional-preference research-spike memo at packages/generation/research/2026-05-05-phon-92-selectional-preference/memo.md on research/phon-92-selectional-preference-spike) — empirical foundation
Sibling tickets: none currently active. PHON-67 (compositional fine-tune) and PHON-68 (retrieve-from-corpus) are WON'T DO; this is the chosen path.
Related cleanup: PHON-95's ticket description references findings-and-scope.md; the actual file on research/phon-92-selectional-preference-spike is FINDINGS.md at packages/generation/research/2026-05-05-phon-92-selectional-preference/diffusion-editor-probe/. The ticket's path is stale; this spec uses the actual filename.
Problem
The v6 governed-generation architecture has been deemed structurally defective by the PHON-66 rethink: it bolts a deterministic-NLG configuration surface onto a chatbot-shaped production mechanism (T5Gemma 9B-2B with token-time logit steering). The defects audited under PHON-58 are downstream symptoms, not independent bugs.
PHON-66 names three replacement tracks:
- C1 (combinatorial): enumerate spec-compliant candidates via an argument-structure CFG over WordStore-filtered slot terminals, then refine and rank via an iterative MLM editor + coherence scorer. This ticket.
- C2 (compositional fine-tune): train an encoder-decoder on a constraints↔output corpus. WON'T DO.
- C3 (retrieve-from-phono-corpus): retrieve attested in-spec sentences from a phonologically-indexed corpus. WON'T DO.
C1 was empirically validated on 2026-05-05 across 5 PHON-64 v2 failure-case seeds. The validated stack — verb-locked CFG seed + RoBERTa joint-mask + sampled trie-filtered fill + anti-repetition + best-of-N + joint-masked MLM-PLL coherence scoring — produces coherent in-spec English at sub-second per seed on Apple MPS, with 100% spec compliance.
PHON-95 is the productionization of that validated stack: lift it out of the probe scripts, wire it to PHON-93's WordStore + PHON-94's selectional.parquet, give it a stable module surface (phonolex_generators.*), and gate it with tests.
Goal: Ship the C1 generation track as a callable Python API on top of phonolex_data runtime artifacts. v1 = three modules + acceptance tests + a reproducible run script under packages/generation/research/<date>-phon-95-editor/.
Scope
In:
- New package packages/generators/ (Python, name: phonolex_generators).
- Module phonolex_generators.cfg_seed.argstruc_enumerator — verb-locked seed sentence enumeration via an argument-structure CFG + WordStore-filtered slot terminals.
- Module phonolex_generators.editor.mlm_iterative_editor — joint-mask + sampled trie-filtered fill + anti-repetition + best-of-N over an MLM (RoBERTa-large default; encoder-pluggable interface).
- Module phonolex_generators.scorer.joint_mask_pll — joint-masked MLM-PLL coherence scoring; reuses the editor's MLM in a single forward pass.
- Reuses phonolex_governors.generation.trie.VocabTrie (full 125K-word marisa-trie), retagged per-request via trie.tag(banned) where banned = all_words - spec_allowed. The package depends on phonolex_governors (deviates from the original spec's "not a dependency here" framing — change authorized 2026-05-07).
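- v1 PMI integration: boolean admit. Fillers are gated by selectional PPMI > 0 in the requested band, matching the ppmi column in selectional.parquet (PPMI is never negative, so a ≥ 0 test would admit every row). Continuous bias deferred (see Open Question 1).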
- Acceptance test suite: all 5 PHON-64 v2 failure-case seeds produce coherent in-spec English matching probe quality (sentence-level diff vs the probe's sampled_locked_dedup_output.txt golden).
- Reproducible run script packages/generation/research/2026-05-07-phon-95-editor/run.py that takes a spec ID + verb + n_seeds, returns the editor's outputs.
Out:
- Continuous PMI biasing (deferred to a measurable failure mode — see OQ1).
- Multi-clause sentence support (probes were 5–7 token toy seeds; longer-sentence scaling is OQ2).
- Subject-verb agreement / morphological correction (OQ3).
- Diversity at scale beyond best-of-8 (OQ4).
- Coherence-scorer LLM-as-judge backup (OQ5).
- Editor fine-tuning on PhonoLex CDS substrate (OQ6, candidate v2 work).
- Integration with the FastAPI generation server (separate ticket — once the API is stable we wire it into the existing /api/generate-single proxy or add a parallel route).
- v6 architecture removal (parallel-track for now; v6 stays online as the production behavior while PHON-95 is gated by feature flag).
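The run script listed under "In" takes a spec ID, a verb, and a seed count. A minimal sketch of its argument surface follows; the flag names (`--spec-id`, `--verb`, `--n-seeds`, `--band`) and the functions `build_parser` and `main` are illustrative assumptions, not names taken from the probe or the ticket.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for a hypothetical run.py; every flag name is an assumption."""
    parser = argparse.ArgumentParser(
        description="Run the PHON-95 editor for one spec and one locked verb."
    )
    parser.add_argument("--spec-id", required=True, help="spec identifier, e.g. spec1")
    parser.add_argument("--verb", required=True, help="verb locked across all edits")
    parser.add_argument("--n-seeds", type=int, default=16, help="maximum number of CFG seeds")
    parser.add_argument("--band", default="fineweb_adult", help="selectional band")
    return parser


def main(argv=None) -> str:
    """Parse arguments and describe the request.

    A real script would call the enumerator and editor here; this sketch only
    echoes what it was asked to do.
    """
    args = build_parser().parse_args(argv)
    return f"spec_id={args.spec_id} verb={args.verb} n_seeds={args.n_seeds} band={args.band}"
```

Calling `main(["--spec-id", "spec1", "--verb", "cut"])` would fall back to the default seed count and band.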
Data contracts
Inputs (consumed)
WordStore (from phonolex_data.runtime.store) — loaded once at editor-init, queried per-request:
import polars as pl
from phonolex_data.runtime.store import WordStore

store = WordStore.from_parquet(...)
# Spec-compliant lexicon for a slot (filter expression, Polars-style).
# The comparison is parenthesized because & binds tighter than <= in Python.
slot_words = store.subset(
    (pl.col("syllable_count") <= 2)
    & pl.col("phonemes_str").str.starts_with("|k|")
    & pl.col("pos").is_in(["NOUN", "VERB"])
)
# Returns a Polars DataFrame; the editor extracts the `word` column for trie construction.
selectional.parquet (from PHON-94) — per-request join for PMI gating:
import polars as pl

verb, role, band = "cut", "dobj", "fineweb_adult"
sel = pl.scan_parquet("data/runtime/selectional.parquet")
# For verb=cut, role=dobj, band=fineweb_adult: fillers with ppmi > 0
admitted_dobj = (
    sel.filter(
        (pl.col("verb") == verb)
        & (pl.col("role") == role)
        & (pl.col("band") == band)
        & (pl.col("ppmi") > 0.0)
    )
    .select("filler")
    .collect()["filler"]
    .to_list()
)
The intersection slot_words ∩ admitted_dobj is the editor's allowed lexicon for that slot. v1 takes band="fineweb_adult" as default; childes_* / phonbank_* / fineweb_b* bands are selectable per-request.
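The set arithmetic behind the allowed lexicon, and the banned set used when retagging the trie, can be sketched with plain Python sets. The function names and the toy word lists below are made up for illustration; real inputs would come from WordStore and selectional.parquet.

```python
def slot_lexicon(spec_words, admitted_fillers):
    """Words that satisfy the phonological spec AND pass the selectional gate."""
    return set(spec_words) & set(admitted_fillers)


def banned_words(all_words, allowed):
    """Complement of the allowed lexicon: the words to tag as banned in the trie."""
    return set(all_words) - set(allowed)


# Toy data, invented for this example (not drawn from the real artifacts).
all_words = {"cake", "cat", "kite", "thunder", "rope"}
spec_words = {"cake", "cat", "kite"}   # stand-in for store.subset(...)["word"]
admitted = {"cake", "rope", "kite"}    # stand-in for fillers with ppmi > 0

allowed = slot_lexicon(spec_words, admitted)
banned = banned_words(all_words, allowed)
```

Here `allowed` keeps only the words present in both inputs, and `banned` is everything else in the vocabulary.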
MLM weights — RoBERTa-large via transformers.AutoModelForMaskedLM.from_pretrained("roberta-large"). Loaded once at editor-init, kept on mps if available, else cpu. Single shared instance across editor + scorer.
Outputs (produced)
from dataclasses import dataclass
from typing import Literal


# Trajectory is defined first so the annotation on EditedSentence.trajectories resolves.
@dataclass
class Trajectory:
    traj_id: int
    history: list[tuple[str, float]]  # (sentence, coherence) pairs across iterations
    outcome: Literal["CONVERGED", "CYCLE", "TIMEOUT"]


@dataclass
class EditedSentence:
    seed: str                        # the CFG-emitted seed sentence
    spec_id: str                     # which spec produced it
    verb: str                        # verb-locked across all edits
    coherence_seed: float            # joint-masked PLL of the seed
    best: str                        # highest-coherence edit across N trajectories
    coherence_best: float            # joint-masked PLL of `best`
    unique_outputs: list[str]        # deduped set of converged trajectory outputs
    trajectories: list[Trajectory]   # full per-trajectory history (for debugging)
No persisted artifact: the editor is a side-effect-free function from (spec_id, verb, band) to a list of EditedSentence. A downstream consumer that wants to cache outputs can pickle the result itself; PHON-95 does not write to disk.
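How `best`, `coherence_best`, and `unique_outputs` could be derived from finished trajectories is sketched below. Two points are assumptions rather than statements from the probe: that `best` is chosen among final trajectory states, and that only CONVERGED trajectories contribute to `unique_outputs`. The helper name `summarize` is hypothetical.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Trajectory:
    traj_id: int
    history: list[tuple[str, float]]  # (sentence, coherence) pairs
    outcome: Literal["CONVERGED", "CYCLE", "TIMEOUT"]


def summarize(trajectories: list[Trajectory]) -> tuple[str, float, list[str]]:
    """Return (best, coherence_best, unique_outputs) for one seed.

    Raises ValueError if no trajectory has any history.
    """
    finals = [(t.history[-1], t.outcome) for t in trajectories if t.history]
    # Assumption: the winner is the final state with the highest coherence.
    best, coherence_best = max((state for state, _ in finals), key=lambda s: s[1])
    # Assumption: dedupe converged outputs only, preserving first-seen order.
    unique = list(
        dict.fromkeys(state[0] for state, outcome in finals if outcome == "CONVERGED")
    )
    return best, coherence_best, unique
```

Using `dict.fromkeys` keeps the deduplicated outputs in a stable order, which helps when comparing against a recorded golden file.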
Architecture
┌──────────────────────────────────────────────────────────────────┐
│ phonolex_generators (new package) │
│ │
│ ┌─────────────────────┐ ┌────────────────────────────┐ │
│ │ cfg_seed/ │ │ editor/ │ │
│ │ argstruc_enumerator├───→│ mlm_iterative_editor │ │
│ │ • CFG over slots │ │ • joint-mask MLM forward │ │
│ │ • WordStore subset │ │ • trie-filter top-K │ │
│ │ • verb-lock │ │ • temperature sampling │ │
│ │ → seed sentences │ │ • anti-repetition │ │
│ └─────────────────────┘ │ • best-of-N trajectories │ │
│ │ │ → EditedSentence[] │ │
│ │ └────────────────────────────┘ │
│ │ │ │
│ │ ↓ │
│ │ ┌────────────────────────────┐ │
│ │ │ scorer/ │ │
│ │ │ joint_mask_pll │ │
│ │ │ • shared MLM forward │ │
│ │ │ → coherence: float │ │
│ │ └────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
│ │
↓ ↓
┌──────────────────────────────────────────────────────────────────┐
│ phonolex_data (existing) │
│ • WordStore.subset(...) │
│ • selectional.parquet (verb, role, filler, band, ppmi) │
│ • words.parquet (PoS, phonemes, lemma, ...) │
└──────────────────────────────────────────────────────────────────┘
The three modules are siblings, not a stack: the enumerator emits seeds, the editor mutates them, the scorer ranks them. The editor and scorer share the MLM instance to avoid double-loading 1.4GB of RoBERTa weights.
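One way to guarantee a single shared instance is to cache the loader result. The sketch below uses `functools.lru_cache` with an injectable loader so it can be exercised without downloading weights; `register_loader`, `get_mlm`, and the registry are hypothetical names, and in the real package the registered loader would wrap the `transformers` calls named in the data contract.

```python
from functools import lru_cache
from typing import Callable, Tuple

# Hypothetical registry: maps a model name to a zero-argument loader.
# Tests can register a cheap fake instead of loading RoBERTa-large.
_LOADERS: dict[str, Callable[[], Tuple[object, object]]] = {}


def register_loader(name: str, loader: Callable[[], Tuple[object, object]]) -> None:
    """Associate a model name with the function that builds (model, tokenizer)."""
    _LOADERS[name] = loader


@lru_cache(maxsize=None)
def get_mlm(name: str) -> Tuple[object, object]:
    """Build (model, tokenizer) on the first call; later calls reuse the cached pair."""
    return _LOADERS[name]()
```

Both the editor and the scorer would call `get_mlm` with the same name, so the loader runs once per process.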
Module 1: argstruc_enumerator
CFG productions per spec + verb. Verb is locked at production time; agent and patient slots are filled from WordStore.subset(spec_expr) ∩ selectional_admitted_for_slot(verb, role, band).
# Pseudocode: store, SPEC_FILTERS, pmi_admit, and Seed are module-level names defined elsewhere.
import random


def enumerate_seeds(
    spec_id: str,
    verb: str,
    band: str = "fineweb_adult",
    max_seeds: int = 16,
) -> list[Seed]:
    spec_filter = SPEC_FILTERS[spec_id]  # Polars expr
    slot_words = store.subset(spec_filter)["word"].to_list()
    nsubj_admit = pmi_admit(verb, "nsubj", band)
    dobj_admit = pmi_admit(verb, "dobj", band)
    # Sorted so that sampling is reproducible under a fixed RNG seed.
    nsubj_pool = sorted(set(slot_words) & set(nsubj_admit))
    dobj_pool = sorted(set(slot_words) & set(dobj_admit))
    # CFG: NP V NP. Determiner = "the" (v1; pluralization deferred to OQ3).
    seeds = []
    for nsubj in random.sample(nsubj_pool, k=min(len(nsubj_pool), 4)):
        for dobj in random.sample(dobj_pool, k=min(len(dobj_pool), 4)):
            seeds.append(Seed(
                sentence=f"the {nsubj} {verb} the {dobj}",
                content_word_indices=(1, 4),  # nsubj, dobj
                locked_word_indices=(2,),     # verb
                spec_id=spec_id,
                note=f"CFG-emitted, ({nsubj}, {verb}, {dobj})",
            ))
            if len(seeds) >= max_seeds:
                return seeds
    return seeds
The probe used 5 hand-crafted seeds; the enumerator generates them programmatically. Both feed the same editor input shape (Seed dataclass).
Module 2: mlm_iterative_editor
Lift verbatim from probe_sampled_iterative.py. Three loops:
- Outer (per seed): spawn N=8 trajectories, return the best by coherence.
- Middle (per trajectory): iterate up to MAX_ITER=15 edits, or until CONVERGED (current sentence is its own argmax) or CYCLE (sentence revisits a previous state).
- Inner (per iteration): joint-mask all content positions, forward through the MLM, intersect top-K logits with the per-request word trie, then sample (with anti-repetition over the trajectory's history) at temperature=0.7 from the top-10 of the trie-filtered top-50.
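The middle loop and its three outcomes can be sketched independently of the MLM by passing the edit step in as a function. This is a simplified illustration, not the probe code: `run_trajectory` and `step` are hypothetical names, and CONVERGED is approximated here as "the step returns its input unchanged" rather than an explicit argmax check.

```python
from typing import Callable


def run_trajectory(seed: str, step: Callable[[str], str], max_iter: int = 15):
    """Apply `step` until a fixed point, a revisited state, or the iteration cap.

    Returns (history, outcome), where outcome is one of
    "CONVERGED", "CYCLE", or "TIMEOUT".
    """
    history = [seed]
    seen = {seed}
    current = seed
    for _ in range(max_iter):
        nxt = step(current)
        if nxt == current:
            # The edit proposed no change: treat as a fixed point.
            return history, "CONVERGED"
        history.append(nxt)
        if nxt in seen:
            # The sentence was already visited earlier in this trajectory.
            return history, "CYCLE"
        seen.add(nxt)
        current = nxt
    return history, "TIMEOUT"
```

The outer loop would call this N times with independent random state and keep the trajectory whose final sentence scores highest.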
Hyperparameters lifted from probe (validated):
TRIE_TOP_K = 50 # trie-filter pool size
SAMPLE_TOP_K = 10 # sample from top-K of trie-filtered pool
TEMPERATURE = 0.7
N_TRAJECTORIES = 8
MAX_ITER = 15
These are hyperparameters, not constants: they are exposed as module-level defaults and can be overridden per call.
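To show how these defaults interact, here is a word-level sketch of one fill decision. It reflects one reading of the description (take the top `trie_top_k` candidates by logit, keep those in the allowed lexicon, then sample from the best `sample_top_k` survivors); the probe works on model token logits and a marisa-trie, so the dict-and-set interface, the function name `sample_fill`, and the fallback when anti-repetition empties the pool are all assumptions.

```python
import math
import random


def sample_fill(logits, allowed, avoid, rng,
                trie_top_k=50, sample_top_k=10, temperature=0.7):
    """Pick one word from `logits` (a word -> score dict), or None if nothing is allowed."""
    # 1. Top-K by logit, then keep only words present in the allowed lexicon.
    top = sorted(logits, key=logits.get, reverse=True)[:trie_top_k]
    pool = [w for w in top if w in allowed]
    # 2. Anti-repetition: drop words already used, unless that would empty the pool.
    fresh = [w for w in pool if w not in avoid] or pool
    if not fresh:
        return None
    # 3. Temperature-scaled softmax over the best `sample_top_k` survivors.
    cands = fresh[:sample_top_k]
    peak = max(logits[w] for w in cands)  # subtract the max for numerical stability
    weights = [math.exp((logits[w] - peak) / temperature) for w in cands]
    return rng.choices(cands, weights=weights, k=1)[0]
```

Passing an explicit `random.Random(seed)` as `rng` keeps the draw reproducible, which is what a deterministic golden test needs.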
Module 3: joint_mask_pll
Lift from probe. Mask all content positions, forward through MLM, sum log-probability of the actual tokens at masked positions. Higher = more coherent.
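import torch


def joint_masked_coherence(
    model, tokenizer, sentence: str, content_word_indices: list[int]
) -> float:
    # Tokenize once; the unmasked ids are the targets scored at the masked positions.
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids.to(model.device)
    # Positions index into the same tokenization, special tokens included.
    word_positions = word_to_token_positions(tokenizer, sentence, content_word_indices)
    mask_positions = [p for ps in word_positions.values() for p in ps]
    if not mask_positions:
        return float("nan")
    masked = input_ids.clone()
    for ti in mask_positions:
        masked[0, ti] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(input_ids=masked).logits
    log_probs = torch.log_softmax(logits[0], dim=-1)
    # Sum the log-probability of each original token at its masked position.
    return sum(log_probs[ti, input_ids[0, ti]].item() for ti in mask_positions)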
The headline sanity test (cat chased ball > cat cat cat) holds. A wider sanity probe (N=50–100 well-formed-vs-degenerate pairs) is OQ5; v1 ships with the validated 5-seed probe set as the gate.
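The scorer relies on a `word_to_token_positions` helper that this spec does not show. A plausible core for it, assuming the tokenizer can report which word each token belongs to (fast Hugging Face tokenizers expose this as `word_ids()` on the encoding), is sketched below. The function name `positions_from_word_ids` is hypothetical, and the probe's helper may work differently.

```python
def positions_from_word_ids(word_ids, content_word_indices):
    """Map each requested word index to the token positions that spell it.

    `word_ids` has one entry per token: the index of the word that token
    belongs to, or None for special tokens such as <s> and </s>.
    """
    wanted = set(content_word_indices)
    positions = {w: [] for w in content_word_indices}
    for token_pos, word_idx in enumerate(word_ids):
        if word_idx in wanted:
            positions[word_idx].append(token_pos)
    return positions
```

A word split into several subword tokens yields several positions, all of which get masked together under joint masking.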
Methodology
Per-request flow
- Caller provides (spec_id, verb, band). v1 default: band="fineweb_adult".
- argstruc_enumerator.enumerate_seeds(...) → list[Seed] (max 16).
- For each seed: mlm_iterative_editor.edit(seed, store) → EditedSentence (best of 8 trajectories).
- Return list[EditedSentence].
Cold-start cost: ~3s for RoBERTa-large load + WordStore Parquet scan. Per-request: ~600ms per seed (8 trajectories × ~75ms each; observed in probe), so 16 seeds take ~10s. Acceptable for batch generation; future optimization tracked under OQ4.
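The latency estimate above is simple multiplication, written out here so it can be rerun with different trajectory counts. It assumes seeds and trajectories run sequentially and excludes the cold-start cost; `batch_latency_s` is an illustrative helper, not part of the package.

```python
def batch_latency_s(n_seeds, n_trajectories=8, ms_per_trajectory=75.0):
    """Estimated wall-clock seconds for one batch, assuming strictly sequential work."""
    return n_seeds * n_trajectories * ms_per_trajectory / 1000.0


estimate = batch_latency_s(16)   # 16 seeds * 8 trajectories * 75 ms
within_gate = estimate <= 30.0   # the 30 s performance gate in the acceptance criteria
```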
Module dependencies: package layout
packages/generators/
├── pyproject.toml # name: phonolex_generators
├── src/phonolex_generators/
│ ├── __init__.py
│ ├── cfg_seed/
│ │ ├── __init__.py
│ │ ├── argstruc_enumerator.py
│ │ └── spec_filters.py # SPEC_FILTERS dict
│ ├── editor/
│ │ ├── __init__.py
│ │ ├── mlm_iterative_editor.py
│ │ ├── trajectory.py # Trajectory + EditedSentence dataclasses
│ │ └── trie_filter.py # topk_in_trie_with_logits
│ ├── scorer/
│ │ ├── __init__.py
│ │ └── joint_mask_pll.py
│ └── shared/
│ ├── __init__.py
│ ├── mlm_loader.py # singleton MLM + tokenizer loader
│ └── word_to_tokens.py # word_to_token_positions helper
├── tests/
│ ├── test_argstruc_enumerator.py
│ ├── test_mlm_iterative_editor.py
│ ├── test_joint_mask_pll.py
│ └── test_acceptance_phon64v2.py # gold-standard regression
└── README.md
packages/generators/ mirrors the shape of packages/governors/. It is editable-installed via uv pip install -e packages/generators (added to the root pyproject.toml workspaces).
Acceptance criteria
- PHON-64 v2 failure-case regression. All 5 seeds from probe_sampled_iterative.py produce a best sentence with coherence_best > coherence_seed and 100% spec compliance (every content word ∈ WordStore.subset(spec_filter)). Gold output to compare against: the sampled_locked_dedup_output.txt from commit 5cae898 on research/phon-92-selectional-preference-spike.
- Module unit tests.
  - test_argstruc_enumerator.py: emits ≥ 4 seeds for (spec_id="spec1", verb="cut") with all content words in the spec lexicon AND in the PMI-admit set.
  - test_mlm_iterative_editor.py: given a fixed seed, fixed RNG seed, fixed N_TRAJECTORIES=2, the output is deterministic and matches a recorded golden.
  - test_joint_mask_pll.py: coherence("the cat chased the ball", [1,2,4]) > coherence("the cat cat the cat", [1,2,4]). (Canonical PHON-92 headline test from probe_pll_sanity.py; the all-function-word degenerate fails under joint-mask scoring because "the" is over-predictable.)
- Performance gate. A 16-seed batch on MPS completes in ≤ 30 s wall-clock (probe was ~10 s; budget allows 3× headroom for scaling).
- PMI integration. For (verb="cut", role="dobj", band="fineweb_adult"), the dobj pool contains cake (ppmi > 0 in our merged data, verified during the PHON-94 sanity check) and excludes thunder (ppmi ≤ 0).
- No regressions in phonolex_data tests. All 201 packages/data tests still pass after phonolex_generators is added to the workspace.
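The spec-compliance half of the first criterion reduces to a membership check on the content words. A minimal sketch, assuming whitespace tokenization matches the word indices used by the enumerator; `is_spec_compliant` is a hypothetical helper and the lexicon passed in stands for the `word` column of `WordStore.subset(spec_filter)`.

```python
def is_spec_compliant(sentence, content_word_indices, allowed_lexicon):
    """True when every content word of `sentence` is in the allowed lexicon."""
    words = sentence.split()
    return all(words[i] in allowed_lexicon for i in content_word_indices)
```

An acceptance test could assert this for the `best` sentence of every seed, alongside the `coherence_best > coherence_seed` comparison.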
Open implementation questions
These survive from the PHON-95 ticket description and the FINDINGS memo. v1 ships with the behavior in the "v1 default" column; the "v2 deferred" column tracks where the empirical decision lands.
| # | Question | v1 default | v2 deferred |
|---|---|---|---|
| 1 | Per-slot tag data shape: boolean PPMI > 0 admit, or continuous α·ppmi logit bias, or both? | Boolean admit. Probe outputs were already selectionally appropriate from boolean filter alone. | Continuous, gated by a measurable failure mode (e.g., a seed where boolean admits but PMI ≈ 0 produces a low-quality output). |
| 2 | Editor scaling 5–7 → 10–15 tokens — does the joint-mask + best-of-N trajectory dynamic survive longer sentences? | Locked at 5–7 tokens (the CFG only emits NP V NP). | New CFG productions for adverbial phrases / subordinate clauses + a length-stratified probe before commit. |
| 3 | Subject-verb agreement / morphology: probe output had "the cold melt the snow" (should be "melts"). | Accept and document. v1 is research-grade; downstream morph-check is a separate filter. | Either a downstream morph-check pass OR enrich the trie with morphological variants tagged to verb/agreement context. |
| 4 | Diversity at scale — best-of-8 yielded 1–4 unique outputs per seed. Production may need 10–20 distinct outputs. | Ship best-of-8. Document the diversity ceiling. | Beam search vs MCTS vs more trajectories; needs a diversity metric (BLEU? edit distance?). |
| 5 | Coherence-signal robustness — joint-mask PLL passes the headline sanity test. Wider N=50–100 well-formed-vs-degenerate pair test should harden it. | Use validated probe seeds as the gate. | Wider sanity probe + LLM-as-judge backup if PLL ranks degenerate above well-formed in any case. |
| 6 | Editor fine-tuning on PhonoLex CDS substrate (PHON-86/87) — RoBERTa-large is pre-trained on adult web text; child-directed registers may produce off-distribution outputs. | Use stock RoBERTa-large. | Fine-tune candidate v2 work; gated by band="childes_*" outputs being qualitatively bad. |
Plan handoff
Successor plan should cover:
1. Workspace bootstrap — packages/generators/pyproject.toml, root pyproject workspace entry, editable install verification.
2. Lift probe_sampled_iterative.py into phonolex_generators.editor.mlm_iterative_editor with tests.
3. Implement phonolex_generators.cfg_seed.argstruc_enumerator against WordStore + selectional.parquet.
4. Implement phonolex_generators.scorer.joint_mask_pll.
5. Acceptance test against the 5 PHON-64 v2 seeds (gold output from sampled_locked_dedup_output.txt on research/phon-92-selectional-preference-spike @ 5cae898).
6. Reproducible run script under packages/generation/research/2026-05-07-phon-95-editor/.
7. Documentation: README in packages/generators/, CLAUDE.md update naming the new package alongside phonolex_data / phonolex_governors.
8. Open follow-up tickets for OQ1–OQ6 with the failure-mode triggers each one would respond to.
Estimated effort: 3–5 working sessions. The probe is functionally complete; PHON-95 is mostly module-extraction + test-coverage + integration with PHON-93/94 artifacts. No empirical risk on the architecture itself.