MLM Iterative Editor + Argstruc CFG Enumerator: Design (PHON-95)

Date: 2026-05-07
Status: Spec (pending user review)
Branch: off release/v5.2.0 (working branch named at writing-plans handoff)
Ticket: PHON-95 (https://neumannsworkshop.atlassian.net/browse/PHON-95)

Predecessors:
- PHON-93 (runtime word-data layer + marisa-trie VocabTrie, merged PR #85, 2026-05-06): provides WordStore + canonical Parquet artifacts
- PHON-94 (corpus DEP reannotation + selectional preference population, merged PR #86, 2026-05-07): provides selectional.parquet (5.44M rows × 16 bands of (verb, role, filler) PPMI)
- PHON-66 (governed-generation rethink): names this work the C1 "combinatorial" track
- PHON-92 (selectional-preference research-spike memo at packages/generation/research/2026-05-05-phon-92-selectional-preference/memo.md on research/phon-92-selectional-preference-spike): empirical foundation

Sibling tickets: none currently active. PHON-67 (compositional fine-tune) and PHON-68 (retrieve-from-corpus) are WON'T DO; this is the chosen path.

Related cleanup: PHON-95's ticket description references findings-and-scope.md; the actual file on research/phon-92-selectional-preference-spike is FINDINGS.md at packages/generation/research/2026-05-05-phon-92-selectional-preference/diffusion-editor-probe/. The ticket's path is stale; this spec uses the actual filename.

Problem

The v6 governed-generation architecture has been deemed structurally defective by the PHON-66 rethink: it bolts a deterministic-NLG configuration surface onto a chatbot-shaped production mechanism (T5Gemma 9B-2B with token-time logit steering). The defects audited under PHON-58 are downstream symptoms, not independent bugs.

PHON-66 names three replacement tracks:
- C1 (combinatorial): enumerate spec-compliant candidates via an argument-structure CFG over WordStore-filtered slot terminals, then refine and rank via an iterative MLM editor + coherence scorer. This ticket.
- C2 (compositional fine-tune): train an encoder-decoder on a constraints↔output corpus. WON'T DO.
- C3 (retrieve-from-phono-corpus): retrieve attested in-spec sentences from a phonologically-indexed corpus. WON'T DO.

C1 was empirically validated on 2026-05-05 across 5 PHON-64 v2 failure-case seeds. The validated stack — verb-locked CFG seed + RoBERTa joint-mask + sampled trie-filtered fill + anti-repetition + best-of-N + joint-masked MLM-PLL coherence scoring — produces coherent in-spec English at sub-second per seed on Apple MPS, with 100% spec compliance.

PHON-95 is the productionization of that validated stack: lift it out of the probe scripts, wire it to PHON-93's WordStore + PHON-94's selectional.parquet, give it a stable module surface (phonolex_generators.*), and gate it with tests.

Goal: Ship the C1 generation track as a callable Python API on top of phonolex_data runtime artifacts. v1 = three modules + acceptance tests + a reproducible run script under packages/generation/research/<date>-phon-95-editor/.

Scope

In:
- New package packages/generators/ (Python, name: phonolex_generators).
- Module phonolex_generators.cfg_seed.argstruc_enumerator: verb-locked seed sentence enumeration via an argument-structure CFG + WordStore-filtered slot terminals.
- Module phonolex_generators.editor.mlm_iterative_editor: joint-mask + sampled trie-filtered fill + anti-repetition + best-of-N over an MLM (RoBERTa-large default; encoder-pluggable interface).
- Module phonolex_generators.scorer.joint_mask_pll: joint-masked MLM-PLL coherence scoring; reuses the editor's MLM in a single forward pass.
- Reuses phonolex_governors.generation.trie.VocabTrie (full 125K-word marisa-trie), retagged per-request via trie.tag(banned) where banned = all_words - spec_allowed. The package depends on phonolex_governors (deviates from the original spec's "not a dependency here" framing; change authorized 2026-05-07).
- v1 PMI integration: boolean admit, gating fillers by selectional PMI ≥ 0 in the requested band. Continuous bias deferred (see Open Question 1).
- Acceptance test suite: all 5 PHON-64 v2 failure-case seeds produce coherent in-spec English matching probe quality (sentence-level diff vs the probe's sampled_locked_dedup_output.txt golden).
- Reproducible run script packages/generation/research/2026-05-07-phon-95-editor/run.py that takes a spec ID + verb + n_seeds and returns the editor's outputs.

Out:
- Continuous PMI biasing (deferred to a measurable failure mode; see OQ1).
- Multi-clause sentence support (probes were 5–7 token toy seeds; longer-sentence scaling is OQ2).
- Subject-verb agreement / morphological correction (OQ3).
- Diversity at scale beyond best-of-8 (OQ4).
- Coherence-scorer LLM-as-judge backup (OQ5).
- Editor fine-tuning on PhonoLex CDS substrate (OQ6, candidate v2 work).
- Integration with the FastAPI generation server (separate ticket: once the API is stable we wire it into the existing /api/generate-single proxy or add a parallel route).
- v6 architecture removal (parallel-track for now; v6 stays online as the production behavior while PHON-95 is gated by feature flag).

Data contracts

Inputs (consumed)

WordStore (from phonolex_data.runtime.store) — loaded once at editor-init, queried per-request:

import polars as pl
from phonolex_data.runtime.store import WordStore

store = WordStore.from_parquet(...)

# Spec-compliant lexicon for a slot (filter expression, Polars-style).
# The comparison is parenthesized because `&` binds tighter than `<=` in Python.
slot_words = store.subset(
    (pl.col("syllable_count") <= 2)
    & pl.col("phonemes_str").str.starts_with("|k|")
    & pl.col("pos").is_in(["NOUN", "VERB"])
)

# Returns a Polars DataFrame; the editor extracts the `word` column for trie construction.

selectional.parquet (from PHON-94) — per-request join for PMI gating:

import polars as pl

sel = pl.scan_parquet("data/runtime/selectional.parquet")

# For verb=cut, role=dobj, band=fineweb_adult: fillers with PPMI > 0
verb, role, band = "cut", "dobj", "fineweb_adult"
admitted_dobj = (
    sel.filter(
        (pl.col("verb") == verb)
        & (pl.col("role") == role)
        & (pl.col("band") == band)
        & (pl.col("ppmi") > 0.0)
    )
    .select("filler")
    .collect()["filler"]
    .to_list()
)

The intersection slot_words ∩ admitted_dobj is the editor's allowed lexicon for that slot. v1 takes band="fineweb_adult" as default; childes_* / phonbank_* / fineweb_b* bands are selectable per-request.
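As a minimal sketch of that intersection, the snippet below uses plain Python lists in place of the WordStore subset and the selectional.parquet query. The helper name `allowed_lexicon` and the toy word lists are assumptions made for illustration; they are not part of the spec or the real artifacts.

```python
def allowed_lexicon(slot_words: list[str], admitted: list[str]) -> list[str]:
    """Words that pass the spec filter AND the PMI gate, sorted for stable output."""
    return sorted(set(slot_words) & set(admitted))


# Toy inputs (invented for illustration; not rows from the real artifacts).
slot_words = ["kite", "cake", "coat"]        # stand-in for the WordStore subset
admitted_dobj = ["bread", "cake", "coat"]    # stand-in for the PMI-admitted fillers

print(allowed_lexicon(slot_words, admitted_dobj))  # ['cake', 'coat']
```

Sorting the result keeps downstream sampling reproducible under a fixed RNG seed, since set iteration order is not guaranteed.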

MLM weights — RoBERTa-large via transformers.AutoModelForMaskedLM.from_pretrained("roberta-large"). Loaded once at editor-init, kept on mps if available, else cpu. Single shared instance across editor + scorer.

Outputs (produced)

from dataclasses import dataclass
from typing import Literal

# Trajectory is defined first because EditedSentence's annotations reference it.
@dataclass
class Trajectory:
    traj_id: int
    history: list[tuple[str, float]]  # (sentence, coherence) pairs across iterations
    outcome: Literal["CONVERGED", "CYCLE", "TIMEOUT"]

@dataclass
class EditedSentence:
    seed: str                  # the CFG-emitted seed sentence
    spec_id: str               # which spec produced it
    verb: str                  # verb-locked across all edits
    coherence_seed: float      # joint-masked PLL of the seed
    best: str                  # highest-coherence edit across N trajectories
    coherence_best: float      # joint-masked PLL of `best`
    unique_outputs: list[str]  # deduped set of converged trajectory outputs
    trajectories: list[Trajectory]  # full per-trajectory history (for debugging)

No persisted artifact: the editor maps (spec_id, verb, band) to a list of EditedSentence and does not write to disk. Because fills are sampled, the mapping is reproducible only under a fixed RNG seed. A downstream consumer that wants to cache outputs can pickle the result itself.
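The relationship between `trajectories`, `best`, and `unique_outputs` can be sketched as a small reduction. This is an illustration under two assumptions that the spec does not settle: `best` is picked among the final state of each trajectory, and the helper name `summarize` is hypothetical. The coherence values are invented.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Trajectory:
    traj_id: int
    history: list[tuple[str, float]]  # (sentence, coherence) per iteration
    outcome: Literal["CONVERGED", "CYCLE", "TIMEOUT"]


def summarize(trajectories: list[Trajectory]) -> tuple[str, float, list[str]]:
    """Return (best, coherence_best, unique_outputs) from non-empty trajectories."""
    finals = [(t.history[-1], t.outcome) for t in trajectories]
    # Assumption: "best" is chosen among final states only.
    best, coherence_best = max((state for state, _ in finals), key=lambda s: s[1])
    # dict.fromkeys dedupes while preserving first-seen order.
    unique = list(dict.fromkeys(
        sentence for (sentence, _), outcome in finals if outcome == "CONVERGED"
    ))
    return best, coherence_best, unique


# Coherence values below are invented for illustration.
demo = [
    Trajectory(0, [("the cold melt the snow", -9.0), ("the sun melt the snow", -6.5)], "CONVERGED"),
    Trajectory(1, [("the cold melt the snow", -9.0), ("the sun melt the snow", -6.5)], "CONVERGED"),
    Trajectory(2, [("the cold melt the snow", -9.0), ("the heat melt the snow", -7.0)], "TIMEOUT"),
]
best, coherence_best, unique_outputs = summarize(demo)
```

Two trajectories converging on the same sentence collapse to a single entry in `unique_outputs`, which matches the "deduped set of converged trajectory outputs" description of that field.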

Architecture

┌──────────────────────────────────────────────────────────────────┐
│ phonolex_generators (new package)                                │
│                                                                  │
│  ┌─────────────────────┐    ┌────────────────────────────┐      │
│  │ cfg_seed/           │    │ editor/                    │      │
│  │  argstruc_enumerator├───→│  mlm_iterative_editor      │      │
│  │  • CFG over slots   │    │  • joint-mask MLM forward  │      │
│  │  • WordStore subset │    │  • trie-filter top-K       │      │
│  │  • verb-lock        │    │  • temperature sampling    │      │
│  │  → seed sentences   │    │  • anti-repetition         │      │
│  └─────────────────────┘    │  • best-of-N trajectories  │      │
│           │                  │  → EditedSentence[]        │      │
│           │                  └────────────────────────────┘      │
│           │                          │                           │
│           │                          ↓                           │
│           │                  ┌────────────────────────────┐     │
│           │                  │ scorer/                    │     │
│           │                  │  joint_mask_pll            │     │
│           │                  │  • shared MLM forward      │     │
│           │                  │  → coherence: float        │     │
│           │                  └────────────────────────────┘     │
└──────────────────────────────────────────────────────────────────┘
            │                          │
            ↓                          ↓
┌──────────────────────────────────────────────────────────────────┐
│ phonolex_data (existing)                                         │
│  • WordStore.subset(...)                                         │
│  • selectional.parquet (verb, role, filler, band, ppmi)          │
│  • words.parquet (PoS, phonemes, lemma, ...)                     │
└──────────────────────────────────────────────────────────────────┘

The three modules are siblings, not a stack: the enumerator emits seeds, the editor mutates them, the scorer ranks them. The editor and scorer share the MLM instance to avoid double-loading 1.4GB of RoBERTa weights.
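One way to guarantee a single shared instance is a cached loader. The sketch below assumes a `get_mlm` function and uses a stub class in place of the real RoBERTa load so that it runs without model weights; both names are hypothetical and the actual shared loader may be shaped differently.

```python
from functools import lru_cache


class StubMLM:
    """Hypothetical stand-in for the loaded model + tokenizer pair."""

    loads = 0  # counts how many times the "weights" were loaded

    def __init__(self, name: str) -> None:
        StubMLM.loads += 1
        self.name = name


@lru_cache(maxsize=None)
def get_mlm(name: str) -> StubMLM:
    """Load once per model name; later calls return the cached instance."""
    return StubMLM(name)


editor_mlm = get_mlm("roberta-large")
scorer_mlm = get_mlm("roberta-large")
```

Both call sites receive the same object, so the weights are loaded once. The `name` parameter deliberately has no default: `lru_cache` keys on the call arguments as written, so mixing `get_mlm()` with `get_mlm("roberta-large")` would create two cache entries.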

Module 1: `argstruc_enumerator`

CFG productions per spec + verb. Verb is locked at production time; agent and patient slots are filled from WordStore.subset(spec_expr) ∩ selectional_admitted_for_slot(verb, role, band).

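# Pseudocode: `store`, `SPEC_FILTERS`, `pmi_admit`, and `Seed` are module-level names defined elsewhere.
import random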
def enumerate_seeds(
    spec_id: str,
    verb: str,
    band: str = "fineweb_adult",
    max_seeds: int = 16,
) -> list[Seed]:
    spec_filter = SPEC_FILTERS[spec_id]            # Polars expr
    slot_words = store.subset(spec_filter)["word"].to_list()
    nsubj_admit = pmi_admit(verb, "nsubj", band)
    dobj_admit  = pmi_admit(verb, "dobj",  band)

    nsubj_pool = sorted(set(slot_words) & set(nsubj_admit))
    dobj_pool  = sorted(set(slot_words) & set(dobj_admit))

    # CFG: NP V NP. Determiner = "the" (v1; pluralization deferred to OQ3).
    seeds = []
    for nsubj in random.sample(nsubj_pool, k=min(len(nsubj_pool), 4)):
        for dobj in random.sample(dobj_pool, k=min(len(dobj_pool), 4)):
            seeds.append(Seed(
                sentence=f"the {nsubj} {verb} the {dobj}",
                content_word_indices=(1, 4),    # nsubj, dobj
                locked_word_indices=(2,),       # verb
                spec_id=spec_id,
                note=f"CFG-emitted, ({nsubj}, {verb}, {dobj})",
            ))
            if len(seeds) >= max_seeds:
                return seeds
    return seeds

The probe used 5 hand-crafted seeds; the enumerator generates them programmatically. Both feed the same editor input shape (Seed dataclass).
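A self-contained sketch of the Seed shape and the NP V NP expansion follows. The field types are inferred from the pseudocode above, and `seeds_from_pools`, its `rng` and `per_slot` parameters, and the toy pools are assumptions introduced so the example runs without WordStore or selectional.parquet.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    sentence: str
    content_word_indices: tuple[int, ...]
    locked_word_indices: tuple[int, ...]
    spec_id: str
    note: str


def seeds_from_pools(
    spec_id: str,
    verb: str,
    nsubj_pool: list[str],
    dobj_pool: list[str],
    rng: random.Random,
    per_slot: int = 4,
    max_seeds: int = 16,
) -> list[Seed]:
    """Emit 'the NSUBJ VERB the DOBJ' seeds from already-filtered slot pools."""
    seeds: list[Seed] = []
    for nsubj in rng.sample(nsubj_pool, k=min(len(nsubj_pool), per_slot)):
        for dobj in rng.sample(dobj_pool, k=min(len(dobj_pool), per_slot)):
            seeds.append(Seed(
                sentence=f"the {nsubj} {verb} the {dobj}",
                content_word_indices=(1, 4),   # nsubj, dobj
                locked_word_indices=(2,),      # verb
                spec_id=spec_id,
                note=f"CFG-emitted, ({nsubj}, {verb}, {dobj})",
            ))
            if len(seeds) >= max_seeds:
                return seeds
    return seeds


# Toy pools (invented); a fixed RNG seed makes the enumeration repeatable.
first = seeds_from_pools("spec1", "cut", ["cook", "kid"], ["cake", "cord", "cable"], random.Random(0))
again = seeds_from_pools("spec1", "cut", ["cook", "kid"], ["cake", "cord", "cable"], random.Random(0))
```

Passing an explicit `random.Random` instead of using the module-level RNG is a design choice for testability: two calls with the same seed yield identical seed lists.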

Module 2: `mlm_iterative_editor`

Lift verbatim from probe_sampled_iterative.py. Three loops:

Outer (per seed): spawn N=8 trajectories, return the best by coherence.
Middle (per trajectory): iterate up to MAX_ITER=15 edits, stopping early on CONVERGED (current sentence is its own argmax) or CYCLE (sentence revisits a previous state); a trajectory that exhausts MAX_ITER ends as TIMEOUT.
Inner (per iteration): joint-mask all content positions, forward through MLM, intersect top-K logits with the per-request word trie, sample (with anti-repetition over the trajectory's history) at temperature=0.7 from top-10 of the trie-filtered top-50.
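The middle loop's stopping logic can be sketched independently of the MLM. In this illustration the edit step is an arbitrary callable, convergence is approximated as "the step returns the sentence unchanged" (a simplification of the argmax check), and TIMEOUT is assumed to mean the iteration budget ran out. `run_trajectory` and the toy transition tables are hypothetical.

```python
from typing import Callable, Literal

Outcome = Literal["CONVERGED", "CYCLE", "TIMEOUT"]


def run_trajectory(
    seed: str, step: Callable[[str], str], max_iter: int = 15
) -> tuple[list[str], Outcome]:
    """Apply `step` repeatedly, recording visited sentences and why the loop ended."""
    history = [seed]
    for _ in range(max_iter):
        nxt = step(history[-1])
        if nxt == history[-1]:
            return history, "CONVERGED"   # fixed point: no further edit proposed
        if nxt in history:
            return history, "CYCLE"       # revisits an earlier state
        history.append(nxt)
    return history, "TIMEOUT"             # iteration budget exhausted


# Toy transition tables standing in for the MLM edit step.
settles = {"a": "b", "b": "c", "c": "c"}
loops = {"a": "b", "b": "a"}

converged = run_trajectory("a", settles.__getitem__)
cycled = run_trajectory("a", loops.__getitem__)
timed_out = run_trajectory("a", lambda s: s + "x", max_iter=3)
```

The outer loop would call this N times per seed with independently sampled steps and keep the highest-coherence result.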

Hyperparameters lifted from probe (validated):

TRIE_TOP_K     = 50    # trie-filter pool size
SAMPLE_TOP_K   = 10    # sample from top-K of trie-filtered pool
TEMPERATURE    = 0.7
N_TRAJECTORIES = 8
MAX_ITER       = 15

These are hyperparameters, not constants: they are exposed as module-level defaults and can be overridden per call.
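A sketch of how the defaults might flow into a single sampling step, with per-call overrides, is shown below. It assumes the candidates have already been restricted to in-trie words, and it reads "anti-repetition" as skipping words already used in the trajectory; both are assumptions, and `sample_fill` is a hypothetical name rather than the probe's function.

```python
import math
import random

TRIE_TOP_K = 50
SAMPLE_TOP_K = 10
TEMPERATURE = 0.7


def sample_fill(
    scored: list[tuple[str, float]],
    seen: set[str],
    rng: random.Random,
    trie_top_k: int = TRIE_TOP_K,
    sample_top_k: int = SAMPLE_TOP_K,
    temperature: float = TEMPERATURE,
) -> str:
    """Sample one word from non-empty (word, logit) pairs that passed the trie filter."""
    pool = sorted(scored, key=lambda pair: pair[1], reverse=True)[:trie_top_k]
    # Anti-repetition (one possible reading): skip words already used in this
    # trajectory, unless that would leave nothing to sample from.
    fresh = [pair for pair in pool if pair[0] not in seen] or pool
    top = fresh[:sample_top_k]
    peak = max(logit for _, logit in top)
    # Temperature-scaled softmax weights; subtracting the peak avoids overflow.
    weights = [math.exp((logit - peak) / temperature) for _, logit in top]
    return rng.choices([word for word, _ in top], weights=weights, k=1)[0]


scored = [("cake", 3.0), ("coat", 2.0), ("kite", 1.0)]  # invented logits
greedy = sample_fill(scored, seen={"cake"}, rng=random.Random(0), sample_top_k=1)
```

Overriding `sample_top_k=1` turns the step into a greedy pick of the best unseen word, which is a convenient way to make unit tests deterministic without touching the module-level defaults.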

Module 3: `joint_mask_pll`

Lift from probe. Mask all content positions, forward through MLM, sum log-probability of the actual tokens at masked positions. Higher = more coherent.

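import torch

from phonolex_generators.shared.word_to_tokens import word_to_token_positions

def joint_masked_coherence(
    model, tokenizer, sentence: str, content_word_indices: list[int]
) -> float:
    # Tokenize once; the unmasked ids are the targets scored below.
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids.to(model.device)
    word_positions = word_to_token_positions(tokenizer, sentence, content_word_indices)
    mask_positions = [p for ps in word_positions.values() for p in ps]
    if not mask_positions:
        return float("nan")
    # Mask every content-word token jointly, then run a single forward pass.
    masked = input_ids.clone()
    for ti in mask_positions:
        masked[0, ti] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(masked).logits
    # Sum the log-probability of each original token at its masked position.
    return sum(
        torch.log_softmax(logits[0, ti], dim=-1)[input_ids[0, ti]].item()
        for ti in mask_positions
    )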

The headline sanity test (cat chased ball > cat cat cat) holds. A wider sanity probe (N=50–100 well-formed-vs-degenerate pairs) is OQ5; v1 ships with the validated 5-seed probe set as the gate.
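The arithmetic behind the score can be checked without a model. The sketch below computes the same sum of log-probabilities over hand-written logits for a three-token vocabulary; the logits, position keys, and the `joint_mask_pll` helper are invented for illustration and do not come from RoBERTa.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def joint_mask_pll(position_logits: dict[int, list[float]], actual_ids: dict[int, int]) -> float:
    """Sum log P(actual token) over every masked position."""
    return sum(
        log_softmax(position_logits[pos])[actual_ids[pos]]
        for pos in position_logits
    )


actual = {1: 0, 4: 2}  # masked position -> id of the token really there

# Invented logits: the first case favors the actual tokens, the second disfavors them.
well_formed = joint_mask_pll({1: [4.0, 0.0, 0.0], 4: [0.0, 0.0, 4.0]}, actual)
degenerate = joint_mask_pll({1: [0.0, 4.0, 0.0], 4: [4.0, 0.0, 0.0]}, actual)
uniform = joint_mask_pll({1: [0.0, 0.0, 0.0], 4: [0.0, 0.0, 0.0]}, actual)
```

Every score is at most 0, and a model with no preference (uniform logits) yields `2 * log(1/3)` for two masked positions, which gives a useful reference point when reading raw coherence values.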

Methodology

Per-request flow

1. Caller provides (spec_id, verb, band). v1 default: band="fineweb_adult".
2. argstruc_enumerator.enumerate_seeds(...) → list[Seed] (max 16).
3. For each seed: mlm_iterative_editor.edit(seed, store) → EditedSentence (best of 8 trajectories).
4. Return list[EditedSentence].

Cold-start cost: ~3s for RoBERTa-large load + WordStore Parquet scan. Per-request: ~600ms per seed (8 trajectories × ~75ms each = ~600ms; observed in probe). 16 seeds → ~10s. Acceptable for batch generation; future optimization tracked under OQ4.
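The latency estimate is simple multiplication, sketched here so it can be re-run when the hyperparameters change. It assumes trajectories run sequentially and uses the 30 s performance-gate budget from the acceptance criteria; `batch_seconds` is a hypothetical helper.

```python
def batch_seconds(n_seeds: int, n_trajectories: int, ms_per_trajectory: float) -> float:
    """Sequential wall-clock estimate: every trajectory of every seed runs in turn."""
    return n_seeds * n_trajectories * ms_per_trajectory / 1000.0


estimate = batch_seconds(n_seeds=16, n_trajectories=8, ms_per_trajectory=75)
headroom = 30.0 / estimate   # 30 s is the performance-gate budget
print(f"{estimate:.1f} s per batch, {headroom:.1f}x headroom")  # 9.6 s per batch, 3.1x headroom
```

The estimate excludes the ~3 s cold start, which is paid once per process rather than per request.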

Module dependencies: package layout

packages/generators/
├── pyproject.toml                                       # name: phonolex_generators
├── src/phonolex_generators/
│   ├── __init__.py
│   ├── cfg_seed/
│   │   ├── __init__.py
│   │   ├── argstruc_enumerator.py
│   │   └── spec_filters.py                              # SPEC_FILTERS dict
│   ├── editor/
│   │   ├── __init__.py
│   │   ├── mlm_iterative_editor.py
│   │   ├── trajectory.py                                # Trajectory + EditedSentence dataclasses
│   │   └── trie_filter.py                               # topk_in_trie_with_logits
│   ├── scorer/
│   │   ├── __init__.py
│   │   └── joint_mask_pll.py
│   └── shared/
│       ├── __init__.py
│       ├── mlm_loader.py                                # singleton MLM + tokenizer loader
│       └── word_to_tokens.py                            # word_to_token_positions helper
├── tests/
│   ├── test_argstruc_enumerator.py
│   ├── test_mlm_iterative_editor.py
│   ├── test_joint_mask_pll.py
│   └── test_acceptance_phon64v2.py                      # gold-standard regression
└── README.md

packages/generators/ mirrors the layout of packages/governors/. It is editable-installed via uv pip install -e packages/generators and added to the root pyproject.toml workspaces.

Acceptance criteria

1. PHON-64 v2 failure-case regression. All 5 seeds from probe_sampled_iterative.py produce a best sentence with coherence_best > coherence_seed and 100% spec compliance (every content word ∈ WordStore.subset(spec_filter)). Gold output to compare against: the sampled_locked_dedup_output.txt from commit 5cae898 on research/phon-92-selectional-preference-spike.
2. Module unit tests.
   - test_argstruc_enumerator.py: emits ≥ 4 seeds for (spec_id="spec1", verb="cut") with all content words in the spec lexicon AND in the PMI-admit set.
   - test_mlm_iterative_editor.py: given a fixed seed, fixed RNG seed, fixed N_TRAJECTORIES=2, the output is deterministic and matches a recorded golden.
   - test_joint_mask_pll.py: coherence("the cat chased the ball", [1,2,4]) > coherence("the cat cat the cat", [1,2,4]). (Canonical PHON-92 headline test from probe_pll_sanity.py; the all-function-word degenerate fails under joint-mask scoring because "the" is over-predictable.)
3. Performance gate. A 16-seed batch on MPS completes in ≤ 30 s wall-clock (probe was ~10 s; budget allows 3× headroom for scaling).
4. PMI integration. For (verb="cut", role="dobj", band="fineweb_adult"), the dobj pool contains cake (ppmi > 0 in our merged data, verified during PHON-94 sanity check) and excludes thunder (ppmi ≤ 0).
5. No regressions in phonolex_data tests. All 201 packages/data tests still pass after phonolex_generators is added to the workspace.
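The two conditions in the regression criterion (coherence improves, every content word is in the spec lexicon) can be expressed as one predicate. The sketch below is an assumption about how the acceptance test might check them; `passes_gate`, the toy lexicon, and the scores are invented for illustration.

```python
def content_words(sentence: str, content_word_indices: tuple[int, ...]) -> list[str]:
    """Pick the whitespace-delimited words at the given indices."""
    words = sentence.split()
    return [words[i] for i in content_word_indices]


def passes_gate(
    coherence_seed: float,
    coherence_best: float,
    best: str,
    content_word_indices: tuple[int, ...],
    spec_lexicon: set[str],
) -> bool:
    """True when the edit improved coherence and stayed fully in-spec."""
    improved = coherence_best > coherence_seed
    in_spec = all(w in spec_lexicon for w in content_words(best, content_word_indices))
    return improved and in_spec


lexicon = {"cook", "cake", "coat"}  # toy stand-in for WordStore.subset(spec_filter)
ok = passes_gate(-9.0, -6.5, "the cook cut the cake", (1, 4), lexicon)
out_of_spec = passes_gate(-9.0, -6.5, "the cook cut the bread", (1, 4), lexicon)
no_gain = passes_gate(-6.5, -6.5, "the cook cut the cake", (1, 4), lexicon)
```

Note that the strict inequality rejects a seed that is already its own best edit, which is what the criterion as written requires.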

Open implementation questions

These survive from the PHON-95 ticket description and the FINDINGS memo. v1 ships with the defaults in the v1 column; the v2 column records what is deferred and what evidence would trigger it.

| # | Question | v1 default | v2 deferred |
|---|---|---|---|
| 1 | Per-slot tag data shape: boolean PMI ≥ 0 admit, or continuous α·ppmi logit bias, or both? | Boolean admit. Probe outputs were already selectionally appropriate from boolean filter alone. | Continuous, gated by a measurable failure mode (e.g., a seed where boolean admits but PMI ≈ 0 produces a low-quality output). |
| 2 | Editor scaling 5–7 → 10–15 tokens: does the joint-mask + best-of-N trajectory dynamic survive longer sentences? | Locked at 5–7 tokens (the CFG only emits NP V NP). | New CFG productions for adverbial phrases / subordinate clauses + a length-stratified probe before commit. |
| 3 | Subject-verb agreement / morphology: probe output had `the cold melt the snow` (should be `melts`). | Accept and document. v1 is research-grade; downstream morph-check is a separate filter. | Either a downstream morph-check pass OR enrich the trie with morphological variants tagged to verb/agreement context. |
| 4 | Diversity at scale: best-of-8 yielded 1–4 unique outputs per seed. Production may need 10–20 distinct outputs. | Ship best-of-8. Document the diversity ceiling. | Beam search vs MCTS vs more trajectories; needs a diversity metric (BLEU? edit distance?). |
| 5 | Coherence-signal robustness: joint-mask PLL passes the headline sanity test. Wider N=50–100 well-formed-vs-degenerate pair test should harden it. | Use validated probe seeds as the gate. | Wider sanity probe + LLM-as-judge backup if PLL ranks degenerate above well-formed in any case. |
| 6 | Editor fine-tuning on PhonoLex CDS substrate (PHON-86/87): RoBERTa-large is pre-trained on adult web text; child-directed registers may produce off-distribution outputs. | Use stock RoBERTa-large. | Fine-tune candidate v2 work; gated by `band="childes_*"` outputs being qualitatively bad. |

Plan handoff

Successor plan should cover:
1. Workspace bootstrap: packages/generators/pyproject.toml, root pyproject workspace entry, editable install verification.
2. Lift probe_sampled_iterative.py into phonolex_generators.editor.mlm_iterative_editor with tests.
3. Implement phonolex_generators.cfg_seed.argstruc_enumerator against WordStore + selectional.parquet.
4. Implement phonolex_generators.scorer.joint_mask_pll.
5. Acceptance test against the 5 PHON-64 v2 seeds (gold output from sampled_locked_dedup_output.txt on research/phon-92-selectional-preference-spike @ 5cae898).
6. Reproducible run script under packages/generation/research/2026-05-07-phon-95-editor/.
7. Documentation: README in packages/generators/, CLAUDE.md update naming the new package alongside phonolex_data / phonolex_governors.
8. Open follow-up tickets for OQ1–OQ6 with the failure-mode triggers each one would respond to.

Estimated effort: 3–5 working sessions. The probe is functionally complete; PHON-95 is mostly module-extraction + test-coverage + integration with PHON-93/94 artifacts. No empirical risk on the architecture itself.