
PHON-95 — MLM Iterative Editor + Argstruc CFG Enumerator Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Productionize the PHON-92 validated stack (verb-locked CFG seed + RoBERTa-large joint-mask iterative editor + joint-mask PLL coherence scorer) as a callable Python API in a new packages/generators/ workspace package.

Architecture: Three sibling modules in phonolex_generators: cfg_seed.argstruc_enumerator (CFG → seed sentences), editor.mlm_iterative_editor (best-of-N joint-mask iterative refinement via prefix-walking decode), scorer.joint_mask_pll (coherence ranking). The editor and scorer share a single MLM instance via a singleton loader. The editor uses phonolex_governors.generation.trie.VocabTrie (marisa-trie, character-level), built once at editor init from the full store.df["word"] vocab (~125K) and then re-tagged per request with banned = all_words - allowed, where allowed = store.subset(SPEC_FILTERS[spec_id])["word"].

Prefix-walking decode (generalization of the PHON-92 probe). For each content slot covering mask positions p_0..p_n, the editor walks the trie left-to-right: at each position, top-K candidates are filtered to those whose decoded fragment, concatenated to the accumulated prefix, has dead_end_ratio < 1.0 (i.e., at least one compliant completion is reachable from this prefix). Sample one with temperature, accumulate, repeat. Greedy-stop when walk_to(accumulated).is_end and not is_banned_word(accumulated); the accumulated prefix is then the fill word. This mirrors phonolex_governors.generation.reranker.Reranker._steer_sequence (which walks the trie token-by-token at causal-LM generation time); the MLM editor applies the same prefix-walking pattern across mask positions of a content slot. The probe's "filter at first position only, admit if decoded token is itself a complete word" approach is the single-step special case of this walk. The generalization also reaches multi-token compliant completions that the probe could not, e.g., a 2-BPE word like "courtroom" filling one content slot as word-start " court" + continuation "room".
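The walk described above can be sketched on a toy vocabulary. This is a minimal illustration, not the editor's implementation: ToyTrie is a hypothetical stand-in for VocabTrie (its dead_end_ratio is assumed to be the banned fraction of completions under a prefix, 1.0 when none exist), the per-position candidate lists are hand-written instead of coming from MLM logits, and uniform sampling replaces temperature sampling.

```python
from __future__ import annotations

import random


class ToyTrie:
    """Hypothetical stand-in for VocabTrie: a word list plus a banned set."""

    def __init__(self, words: list[str], banned: set[str]):
        self.words = sorted(set(words))
        self.banned = set(banned)

    def dead_end_ratio(self, prefix: str) -> float:
        """Assumed semantics: banned fraction of completions (1.0 if none)."""
        completions = [w for w in self.words if w.startswith(prefix)]
        if not completions:
            return 1.0
        return sum(w in self.banned for w in completions) / len(completions)

    def is_compliant_word(self, text: str) -> bool:
        """True when text is a complete vocabulary word that is not banned."""
        return text in self.words and text not in self.banned


def fill_slot(trie: ToyTrie, candidates_per_position: list[list[str]], rng: random.Random):
    """Walk positions left to right; return a compliant word or None."""
    accumulated = ""
    for candidates in candidates_per_position:
        # Keep only fragments that leave a compliant completion reachable.
        admitted = [c for c in candidates if trie.dead_end_ratio(accumulated + c) < 1.0]
        if not admitted:
            return None
        accumulated += rng.choice(admitted)
        if trie.is_compliant_word(accumulated):
            return accumulated  # greedy stop at the first compliant complete word
    return None


trie = ToyTrie(["court", "courtroom", "cab", "cat"], banned={"court", "cab"})
word = fill_slot(trie, [["court", "cab", "x"], ["room", "s"]], random.Random(0))
print(word)
```

Here "court" is banned as a complete word but still admitted as a prefix, because "courtroom" is reachable from it; a first-position-only filter would have discarded it.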

PMI gating for CFG slot fills derives from selectional.parquet. Deps: phonolex_data, phonolex_governors, transformers/torch.

Tech Stack: Python 3.10+ · phonolex_data (WordStore + Parquet) · transformers (RoBERTa-large MLM) · torch (MPS/CPU) · polars (filtering) · pytest (TDD)

Spec: docs/superpowers/specs/2026-05-07-phon-95-mlm-editor-cfg-enumerator-design.md

Working branch: feature/phon-95-impl off release/v5.2.0. Commits target release/v5.2.0 via PR.

Source artifacts (read-only):
- Probe (verbatim lift target): research/phon-92-selectional-preference-spike @ packages/generation/research/2026-05-05-phon-92-selectional-preference/diffusion-editor-probe/probe_sampled_iterative.py
- Gold output: same branch @ commit 5cae898, file sampled_locked_dedup_output.txt


Pre-flight

Before starting Task 1, set up the working branch:

git fetch origin
git checkout release/v5.2.0
git pull --ff-only origin release/v5.2.0
git checkout -b feature/phon-95-impl

Verify the editable installs are current:

uv pip install -e packages/data
uv run python -c "from phonolex_data.runtime.store import WordStore; print('OK')"

Expected: OK.


Task 1: Workspace Bootstrap

Files:
- Create: packages/generators/pyproject.toml
- Create: packages/generators/src/phonolex_generators/__init__.py
- Create: packages/generators/src/phonolex_generators/cfg_seed/__init__.py
- Create: packages/generators/src/phonolex_generators/editor/__init__.py
- Create: packages/generators/src/phonolex_generators/scorer/__init__.py
- Create: packages/generators/src/phonolex_generators/shared/__init__.py
- Create: packages/generators/tests/__init__.py
- Create: packages/generators/tests/conftest.py
- Modify: pyproject.toml (workspace registration)

  • [ ] Step 1: Create packages/generators/pyproject.toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "phonolex-generators"
version = "0.1.0"
description = "Combinatorial generation track (C1): CFG seed + MLM iterative editor + joint-mask PLL scorer"
license = "LicenseRef-Proprietary"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.0",
    "transformers>=4.38",
    "phonolex-data",
    "phonolex-governors",
    "polars>=1.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "ruff>=0.4",
]

[tool.hatch.build.targets.wheel]
packages = ["src/phonolex_generators"]

[tool.ruff]
target-version = "py310"
line-length = 100

[tool.pytest.ini_options]
testpaths = ["tests"]
markers = [
    "slow: requires RoBERTa-large weights (~1.4GB; deselect with '-m \"not slow\"')",
    "acceptance: PHON-64 v2 regression gold test (slow)",
]
  • [ ] Step 2: Create the package skeleton (six __init__.py files, five of them empty, plus one conftest.py)

packages/generators/src/phonolex_generators/__init__.py:

"""C1 combinatorial generation track — CFG seeds + MLM iterative editor + PLL scorer."""

__version__ = "0.1.0"

packages/generators/src/phonolex_generators/cfg_seed/__init__.py, packages/generators/src/phonolex_generators/editor/__init__.py, packages/generators/src/phonolex_generators/scorer/__init__.py, packages/generators/src/phonolex_generators/shared/__init__.py — each one empty:


packages/generators/tests/__init__.py — empty:


packages/generators/tests/conftest.py:

"""Shared test fixtures.

The `slow` marker gates RoBERTa-large-loading tests behind explicit
opt-in; default `pytest -v` runs only the unit tests.
"""

import pytest


def pytest_collection_modifyitems(config, items):
    if config.getoption("-m") == "":
        skip_slow = pytest.mark.skip(reason="needs '-m slow' to run model-loading tests")
        for item in items:
            if "slow" in item.keywords:
                item.add_marker(skip_slow)
  • [ ] Step 3: Register in workspace (pyproject.toml at repo root)

Add "packages/generators" to [tool.uv.workspace].members and phonolex-generators = { workspace = true } to [tool.uv.sources]. After edits the file should read:

[tool.uv.workspace]
members = [
    "packages/data",
    "packages/governors",
    "packages/generation",
    "packages/features",
    "packages/generators",
]

[tool.uv.sources]
phonolex-data = { workspace = true }
phonolex-governors = { workspace = true }
phonolex-generation = { workspace = true }
phonolex-features = { workspace = true }
phonolex-generators = { workspace = true }
  • [ ] Step 4: Editable install + import smoke test

Run:

uv pip install -e packages/generators
uv run python -c "import phonolex_generators; print(phonolex_generators.__version__)"

Expected: 0.1.0.

  • [ ] Step 5: Run the empty test directory

Run:

uv run --package phonolex-generators pytest packages/generators/tests/ -v

Expected: no tests ran in 0.0Xs. (Confirms collection works.)

  • [ ] Step 6: Commit
git add packages/generators/ pyproject.toml
git commit -m "PHON-95 Task 1: bootstrap phonolex_generators workspace package"

Task 2: Shared MLM Loader + word_to_token_positions Helper

Files:
- Create: packages/generators/src/phonolex_generators/shared/mlm_loader.py
- Create: packages/generators/src/phonolex_generators/shared/word_to_tokens.py
- Test: packages/generators/tests/test_shared.py

The MLM is a singleton: get_mlm() loads RoBERTa-large once, subsequent calls return the cached (model, tokenizer, device) triple. The helper resolves word-indices in a space-split sentence to RoBERTa BPE token positions via offset_mapping.
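The offset-based resolution can be illustrated without loading a tokenizer. The sketch below is an assumption-laden toy: the offsets list is hand-written (including a made-up split of "chased" into two tokens) rather than real RoBERTa output, and word_spans / positions_for_words are hypothetical names used only here.

```python
from __future__ import annotations


def word_spans(sentence: str) -> list[tuple[int, int]]:
    """Character (start, end) span of each whitespace-separated word."""
    spans: list[tuple[int, int]] = []
    cursor = 0
    for word in sentence.split():
        start = sentence.find(word, cursor)
        spans.append((start, start + len(word)))
        cursor = start + len(word)
    return spans


def positions_for_words(
    offsets: list[tuple[int, int]],
    spans: list[tuple[int, int]],
    word_indices: list[int],
) -> dict[int, list[int]]:
    """Token positions whose offsets fall inside each requested word span."""
    out: dict[int, list[int]] = {}
    for wi in word_indices:
        if not 0 <= wi < len(spans):
            continue  # out-of-range indices are skipped
        w_start, w_end = spans[wi]
        out[wi] = [
            ti
            for ti, (s, e) in enumerate(offsets)
            # (0, 0) marks special tokens such as <s> and </s>
            if (s, e) != (0, 0) and s >= w_start and e <= w_end
        ]
    return out


sentence = "the cat chased the ball"
# Hand-written offsets (assumed): <s>, the, cat, ch, ased, the, ball, </s>
offsets = [(0, 0), (0, 3), (4, 7), (8, 10), (10, 14), (15, 18), (19, 23), (0, 0)]
result = positions_for_words(offsets, word_spans(sentence), [1, 2, 99])
print(result)
```

A word split into several BPE tokens maps to several positions, which is why the helper returns a list per word index.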

  • [ ] Step 1: Write the failing test for word_to_token_positions

packages/generators/tests/test_shared.py:

"""Shared helpers — token-position math test (no MLM weights needed)."""

from transformers import AutoTokenizer

from phonolex_generators.shared.word_to_tokens import word_to_token_positions


def test_word_to_token_positions_known_sentence():
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    sentence = "the cat chased the ball"
    # word indices 1 and 4 = 'cat' and 'ball'
    out = word_to_token_positions(tokenizer, sentence, [1, 4])
    assert set(out.keys()) == {1, 4}
    assert all(len(v) >= 1 for v in out.values())
    # The token at out[1][0] decodes to something that includes 'cat'
    enc = tokenizer(sentence)
    cat_token_text = tokenizer.decode([enc["input_ids"][out[1][0]]]).strip().lower()
    assert "cat" in cat_token_text


def test_word_to_token_positions_out_of_range_indices_skipped():
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    sentence = "the cat chased the ball"
    out = word_to_token_positions(tokenizer, sentence, [99])
    assert out == {}
  • [ ] Step 2: Run the test (expect FAIL)

Run:

uv run --package phonolex-generators pytest packages/generators/tests/test_shared.py -v

Expected: collection fails with ModuleNotFoundError: No module named 'phonolex_generators.shared.word_to_tokens' (the module does not exist yet).

  • [ ] Step 3: Implement word_to_token_positions

packages/generators/src/phonolex_generators/shared/word_to_tokens.py:

"""Map space-split word indices to RoBERTa BPE token-position lists.

Lifted from PHON-92 probe (`probe_sampled_iterative.py::word_to_token_positions`).
"""

from __future__ import annotations


def word_to_token_positions(
    tokenizer, sentence: str, word_indices: list[int]
) -> dict[int, list[int]]:
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc["offset_mapping"][0].tolist()
    words = sentence.split()
    char_starts: list[tuple[int, int]] = []
    cursor = 0
    for w in words:
        idx = sentence.find(w, cursor)
        char_starts.append((idx, idx + len(w)))
        cursor = idx + len(w)
    out: dict[int, list[int]] = {}
    for wi in word_indices:
        if wi >= len(char_starts):
            continue
        wstart, wend = char_starts[wi]
        positions = [
            ti
            for ti, (s, e) in enumerate(offsets)
            if (s, e) != (0, 0) and s >= wstart and e <= wend
        ]
        out[wi] = positions
    return out
  • [ ] Step 4: Run the test (expect PASS)

Run:

uv run --package phonolex-generators pytest packages/generators/tests/test_shared.py -v

Expected: 2 passed.

  • [ ] Step 5: Implement get_mlm singleton

packages/generators/src/phonolex_generators/shared/mlm_loader.py:

"""Singleton MLM loader.

The editor and scorer share a single (model, tokenizer, device) triple to
avoid double-loading 1.4GB of RoBERTa-large weights.
"""

from __future__ import annotations

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

DEFAULT_MODEL_ID = "roberta-large"

_cache: dict[str, tuple] = {}


def _resolve_device() -> str:
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def get_mlm(model_id: str = DEFAULT_MODEL_ID):
    """Load (or return cached) (model, tokenizer, device) for the given MLM."""
    if model_id in _cache:
        return _cache[model_id]
    device = _resolve_device()
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id).to(device).eval()
    _cache[model_id] = (model, tokenizer, device)
    return _cache[model_id]


def reset_mlm_cache() -> None:
    """Test hook — clear the singleton (next get_mlm call reloads weights)."""
    _cache.clear()
  • [ ] Step 6: Add a slow-marked sanity test for get_mlm (extends test_shared.py)

Append to packages/generators/tests/test_shared.py:

import pytest

from phonolex_generators.shared.mlm_loader import get_mlm, reset_mlm_cache


@pytest.mark.slow
def test_get_mlm_returns_same_triple_twice():
    reset_mlm_cache()
    m1, t1, d1 = get_mlm()
    m2, t2, d2 = get_mlm()
    assert m1 is m2
    assert t1 is t2
    assert d1 == d2
    assert d1 in {"mps", "cpu"}
  • [ ] Step 7: Run slow test once to verify

Run:

uv run --package phonolex-generators pytest packages/generators/tests/test_shared.py -v -m slow

Expected: 1 passed, 2 deselected (with -m slow the two unmarked tests are deselected, not skipped). May take ~10s on first run for model download/cache.

  • [ ] Step 8: Commit
git add packages/generators/src/phonolex_generators/shared packages/generators/tests/test_shared.py
git commit -m "PHON-95 Task 2: shared MLM loader + word_to_token_positions helper"

Task 3: Scorer — joint_mask_pll

Files:
- Create: packages/generators/src/phonolex_generators/scorer/joint_mask_pll.py
- Modify: packages/generators/src/phonolex_generators/scorer/__init__.py (export joint_masked_coherence)
- Test: packages/generators/tests/test_joint_mask_pll.py

The scorer is implemented before the editor because the editor depends on it for trajectory ranking and best-of-N selection. Joint-mask PLL = mask all content positions, forward through MLM, sum the log-probability of the actual tokens at masked positions. Higher = more coherent.
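The arithmetic behind the score can be shown with made-up numbers. In this sketch the per-position distributions are invented for illustration (they are not RoBERTa outputs), and joint_mask_score is a hypothetical helper that only reproduces the "sum of log-probabilities at masked positions" step.

```python
import math


def joint_mask_score(position_probs, actual_tokens):
    """Sum log P(actual token) over the masked positions."""
    return sum(math.log(probs[tok]) for probs, tok in zip(position_probs, actual_tokens))


# Invented distributions an MLM might assign at two masked content positions.
# Both candidate sentences share the same masked context, so they are scored
# against the same distributions.
position_probs = [
    {"cat": 0.5, "dog": 0.4, "the": 0.1},
    {"ball": 0.25, "bone": 0.7, "cat": 0.05},
]

well_formed = joint_mask_score(position_probs, ["cat", "ball"])
repeated = joint_mask_score(position_probs, ["cat", "cat"])
print(round(well_formed, 3), round(repeated, 3))
```

Because the scores are sums of log-probabilities, they are always non-positive, and the candidate closer to zero ranks as more coherent.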

  • [ ] Step 1: Write the failing sanity test (the spec's headline acceptance gate)

packages/generators/tests/test_joint_mask_pll.py:

"""joint_mask_pll — coherence ranks well-formed > repeated-content.

This is the canonical PHON-92 headline sanity test, verified empirically
by `probe_pll_sanity.py` on the spike branch (RoBERTa-large @ MPS):
  joint-mask  wf=-29.57  rp=-30.02  PASS

Note: comparing well-formed against `"the the the the the"` does NOT pass
under joint-mask scoring — the all-function-word skeleton is highly
predictable. The canonical degenerate is repeated CONTENT word ("the cat
cat the cat"), which matches the editor's actual use case (ranking
candidate fills of a shared masked context).
"""

import pytest

from phonolex_generators.scorer.joint_mask_pll import joint_masked_coherence
from phonolex_generators.shared.mlm_loader import get_mlm


@pytest.mark.slow
def test_well_formed_outranks_degenerate():
    model, tokenizer, _ = get_mlm()
    # Content words at indices 1, 2, 4 — masking 'cat chased ... ball' vs 'cat cat ... cat'
    well = joint_masked_coherence(model, tokenizer, "the cat chased the ball", [1, 2, 4])
    bad = joint_masked_coherence(model, tokenizer, "the cat cat the cat", [1, 2, 4])
    assert well > bad


@pytest.mark.slow
def test_no_content_indices_returns_nan():
    import math

    model, tokenizer, _ = get_mlm()
    nan = joint_masked_coherence(model, tokenizer, "the cat", [])
    assert math.isnan(nan)
  • [ ] Step 2: Run test (expect FAIL)

Run:

uv run --package phonolex-generators pytest packages/generators/tests/test_joint_mask_pll.py -v -m slow

Expected: collection fails with ModuleNotFoundError: No module named 'phonolex_generators.scorer.joint_mask_pll' (the module does not exist yet).

  • [ ] Step 3: Implement joint_masked_coherence

packages/generators/src/phonolex_generators/scorer/joint_mask_pll.py:

"""Joint-masked pseudo-log-likelihood coherence scorer.

Mask all content positions simultaneously, forward through the MLM, sum the
log-probability of the actual tokens at masked positions. Higher = more
coherent. Lifted from PHON-92 probe (`probe_sampled_iterative.py::joint_masked_coherence`).
"""

from __future__ import annotations

import torch

from phonolex_generators.shared.word_to_tokens import word_to_token_positions


def joint_masked_coherence(
    model,
    tokenizer,
    sentence: str,
    content_word_indices: list[int],
) -> float:
    """Sum log P(actual_token | masked_context) over all content-word token positions."""
    device = next(model.parameters()).device
    enc = tokenizer(sentence, return_tensors="pt").to(device)
    input_ids = enc["input_ids"]
    word_positions = word_to_token_positions(tokenizer, sentence, content_word_indices)
    mask_positions = [p for ps in word_positions.values() for p in ps]
    if not mask_positions:
        return float("nan")
    masked = input_ids.clone()
    for ti in mask_positions:
        masked[0, ti] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(masked).logits
    total = 0.0
    for ti in mask_positions:
        actual = input_ids[0, ti].item()
        log_probs = torch.log_softmax(logits[0, ti], dim=-1)
        total += log_probs[actual].item()
    return total
  • [ ] Step 4: Export from package

packages/generators/src/phonolex_generators/scorer/__init__.py:

from phonolex_generators.scorer.joint_mask_pll import joint_masked_coherence

__all__ = ["joint_masked_coherence"]
  • [ ] Step 5: Run test (expect PASS)

Run:

uv run --package phonolex-generators pytest packages/generators/tests/test_joint_mask_pll.py -v -m slow

Expected: 2 passed.

  • [ ] Step 6: Commit
git add packages/generators/src/phonolex_generators/scorer packages/generators/tests/test_joint_mask_pll.py
git commit -m "PHON-95 Task 3: scorer.joint_mask_pll — coherence via masked PLL"

Task 4: Editor — Per-Position Prefix-Walking Admit Helper

Files:
- Create: packages/generators/src/phonolex_generators/editor/trie_filter.py
- Test: packages/generators/tests/test_trie_filter.py

The editor's per-position helper takes the FULL tagged VocabTrie plus the prefix accumulated so far in the current content slot, and returns top-K admissible (token_id, fragment, raw_logit) tuples. Admit criterion: the decoded fragment, concatenated to the accumulated prefix, has dead_end_ratio < 1.0 in the tagged trie (i.e., at least one compliant completion reachable). At the first position of a slot (is_first=True), only word-start tokens (RoBERTa: leading-space prefix) are admitted; at subsequent positions, only continuation tokens (no leading space). This mirrors phonolex_governors.Reranker._steer_sequence flipped for MLM (forward-walking through mask positions instead of backward through emitted tokens).
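The word-start versus continuation rule can be isolated from the trie check. The following is a rough sketch under the assumption that decoded word-start tokens carry a leading space; fragment_for_position is a hypothetical helper name, not part of the planned module.

```python
from __future__ import annotations


def fragment_for_position(decoded: str, is_first: bool) -> str | None:
    """Return the lowercased fragment if the token fits this position, else None."""
    starts_word = decoded.startswith(" ")
    # First position needs a word-start token; later positions need a continuation.
    if is_first != starts_word:
        return None
    fragment = (decoded.lstrip(" ") if is_first else decoded).lower()
    # Empty, numeric, and punctuation fragments are never admitted.
    return fragment if fragment.isalpha() else None


print(fragment_for_position(" cat", is_first=True))
print(fragment_for_position("slaw", is_first=False))
print(fragment_for_position(" slaw", is_first=False))
```

Only fragments that survive this check go on to the dead_end_ratio gate against the accumulated prefix.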

  • [ ] Step 1: Write failing tests

packages/generators/tests/test_trie_filter.py:

"""topk_admit_at_position — per-mask-position prefix-walking admit filter."""

import torch
from phonolex_governors.generation.trie import VocabTrie

from phonolex_generators.editor.trie_filter import topk_admit_at_position


def _trie(words: list[str], banned: set[str] | None = None) -> VocabTrie:
    t = VocabTrie(words)
    t.tag(banned or set())
    return t


class FakeTokenizer:
    """Decoded text matches RoBERTa shape: word-start tokens have leading space."""

    def __init__(self, vocab: dict[int, str]):
        self._vocab = vocab

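    def decode(self, ids):
        # Ids outside the toy vocab decode to "" so the filter skips them
        # (top-k over the 10-wide logits also returns ids the vocab lacks).
        return "".join(self._vocab.get(i, "") for i in ids)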


def test_first_position_admits_word_starts_only():
    # vocab includes a word-start " cat" (leading space) and a continuation "cat" (no space).
    tok = FakeTokenizer({0: " cat", 1: "cat", 2: " dog"})
    trie = _trie(["cat", "dog"], banned=set())
    logits = torch.zeros((1, 1, 10))
    logits[0, 0, 0] = 5.0
    logits[0, 0, 1] = 4.0  # continuation "cat" — should be rejected at p_0
    logits[0, 0, 2] = 3.0

    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="", is_first=True, k=4,
    )
    fragments = [f for _, f, _ in out]
    assert "cat" in fragments  # from token 0 (word-start)
    assert "dog" in fragments  # from token 2 (word-start)
    assert fragments.count("cat") == 1  # token 1 (continuation) rejected at p_0


def test_continuation_position_rejects_word_starts():
    tok = FakeTokenizer({0: "slaw", 1: " slaw"})
    trie = _trie(["coleslaw"], banned=set())
    logits = torch.zeros((1, 1, 10))
    logits[0, 0, 0] = 5.0  # "slaw" continuation — admit (extends "cole" to "coleslaw")
    logits[0, 0, 1] = 4.0  # " slaw" word-start — reject at p_1

    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="cole", is_first=False, k=4,
    )
    fragments = [f for _, f, _ in out]
    assert fragments == ["slaw"]


def test_dead_end_ratio_gate_admits_partially_banned_prefix():
    # Trie has "cat" (allowed) and "cab" (banned); prefix "c" → ratio = 1/2 = 0.5 < 1.0 → admit.
    tok = FakeTokenizer({0: " c", 1: " x"})
    trie = _trie(["cat", "cab"], banned={"cab"})
    logits = torch.zeros((1, 1, 10))
    logits[0, 0, 0] = 5.0  # " c" → ratio 0.5 → admit
    logits[0, 0, 1] = 4.0  # " x" → not in trie → ratio 1.0 → reject

    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="", is_first=True, k=4,
    )
    fragments = [f for _, f, _ in out]
    assert fragments == ["c"]


def test_dead_end_ratio_gate_rejects_when_all_completions_banned():
    # All words at "c" prefix are banned → ratio 1.0 → reject.
    tok = FakeTokenizer({0: " c"})
    trie = _trie(["cat", "cab"], banned={"cat", "cab"})
    logits = torch.zeros((1, 1, 10))
    logits[0, 0, 0] = 5.0

    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="", is_first=True, k=4,
    )
    assert out == []


def test_non_alpha_fragments_rejected():
    tok = FakeTokenizer({0: " 123", 1: " cat", 2: " ,"})
    trie = _trie(["cat"], banned=set())
    logits = torch.zeros((1, 1, 10))
    logits[0, 0, 0] = 5.0
    logits[0, 0, 1] = 4.0
    logits[0, 0, 2] = 3.0

    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="", is_first=True, k=4,
    )
    assert [f for _, f, _ in out] == ["cat"]


def test_k_cap_respected():
    tok = FakeTokenizer({0: " cat", 1: " dog", 2: " ball"})
    trie = _trie(["cat", "dog", "ball"], banned=set())
    logits = torch.zeros((1, 1, 10))
    logits[0, 0, 0] = 5.0
    logits[0, 0, 1] = 4.0
    logits[0, 0, 2] = 3.0

    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="", is_first=True, k=2,
    )
    assert len(out) == 2
  • [ ] Step 2: Run tests (expect FAIL)

Run:

uv run --package phonolex-generators pytest packages/generators/tests/test_trie_filter.py -v

Expected: collection fails with ModuleNotFoundError: No module named 'phonolex_generators.editor.trie_filter' (the module does not exist yet).

  • [ ] Step 3: Implement topk_admit_at_position

packages/generators/src/phonolex_generators/editor/trie_filter.py:

"""Per-position prefix-walking admit helper for the MLM editor.

Generalizes the PHON-92 probe's `topk_in_trie_with_logits` from a single-
position complete-word filter to a per-position prefix walker. Mirrors the
`phonolex_governors.Reranker._steer_sequence` pattern (causal-LM, walks
backward through emitted tokens) flipped for MLM (forward-walks across
the mask positions of a content slot).

Admit criterion at each position:
  - is_first=True  → token must be a word-start (RoBERTa leading space)
  - is_first=False → token must be a continuation (no leading space)
  - extended = (accumulated + fragment).lower() must satisfy
    `trie.dead_end_ratio(extended) < 1.0`, i.e. at least one
    compliant completion is reachable from this prefix.
"""

from __future__ import annotations

import torch
from phonolex_governors.generation.trie import VocabTrie


def topk_admit_at_position(
    logits: torch.Tensor,
    position: int,
    tokenizer,
    trie: VocabTrie,
    accumulated: str,
    is_first: bool,
    k: int,
) -> list[tuple[int, str, float]]:
    """Return up to k admit candidates as (token_id, fragment_lower, raw_logit).

    `accumulated` is the lowercased prefix built from previous positions in
    this slot (empty string at p_0). `fragment_lower` is the candidate's
    contribution to the prefix at THIS position (without leading space).
    """
    raw_logits = logits[0, position]
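    # Over-fetch 4x to leave room for rejects, capped at the vocab size
    # (torch.topk raises if k exceeds the dimension length).
    top_logits, top_ids = torch.topk(raw_logits, min(k * 4, raw_logits.shape[-1]))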
    out: list[tuple[int, str, float]] = []
    for tid, lg in zip(top_ids.tolist(), top_logits.tolist()):
        text = tokenizer.decode([tid])
        starts_with_space = text.startswith(" ") or text.startswith("▁")
        if is_first:
            if not starts_with_space:
                continue
            fragment = text.lstrip().lstrip("▁").lower()
        else:
            if starts_with_space:
                continue
            fragment = text.lower()
        if not fragment or not fragment.isalpha():
            continue
        extended = accumulated + fragment
        if trie.dead_end_ratio(extended) >= 1.0:
            continue
        out.append((tid, fragment, lg))
        if len(out) >= k:
            break
    return out
  • [ ] Step 4: Run tests (expect PASS)

Run:

uv run --package phonolex-generators pytest packages/generators/tests/test_trie_filter.py -v

Expected: 6 passed.

  • [ ] Step 5: Commit
git add packages/generators/src/phonolex_generators/editor/trie_filter.py packages/generators/tests/test_trie_filter.py
git commit -m "PHON-95 Task 4: editor.trie_filter — per-position prefix-walking admit helper"

Task 5: Editor — Trajectory + EditedSentence + Seed Dataclasses

Files:
- Create: packages/generators/src/phonolex_generators/editor/trajectory.py
- Modify: packages/generators/src/phonolex_generators/editor/__init__.py (export dataclasses)
- Test: packages/generators/tests/test_trajectory.py

Pure data definitions — no logic, but stable types let the editor and CFG enumerator share a contract.

  • [ ] Step 1: Write the failing test

packages/generators/tests/test_trajectory.py:

"""Trajectory / EditedSentence / Seed dataclass smoke tests."""

from phonolex_generators.editor.trajectory import (
    EditedSentence,
    Seed,
    Trajectory,
)


def test_seed_is_frozen_hashable():
    s = Seed(
        sentence="the cat chased the ball",
        content_word_indices=(1, 4),
        locked_word_indices=(2,),
        spec_id="spec1",
        note="control",
    )
    assert hash(s) is not None
    assert s.sentence == "the cat chased the ball"


def test_trajectory_outcome_default_running():
    t = Trajectory(traj_id=0)
    assert t.outcome == "RUNNING"
    assert t.history == []


def test_edited_sentence_holds_aggregate():
    seed = "the cat chased the ball"
    es = EditedSentence(
        seed=seed,
        spec_id="spec1",
        verb="chased",
        coherence_seed=-9.95,
        best="the cat ate the cake",
        coherence_best=-6.41,
        unique_outputs=["the cat ate the cake", "the cat ate the cookies"],
        trajectories=[],
    )
    assert es.coherence_best > es.coherence_seed
    assert len(es.unique_outputs) == 2
  • [ ] Step 2: Run test (expect FAIL)

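Run:

uv run --package phonolex-generators pytest packages/generators/tests/test_trajectory.py -v

Expected: collection fails with ModuleNotFoundError: No module named 'phonolex_generators.editor.trajectory' (the module does not exist yet).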

  • [ ] Step 3: Implement dataclasses

packages/generators/src/phonolex_generators/editor/trajectory.py:

"""Editor data contracts — Seed / Trajectory / EditedSentence."""

from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal

Outcome = Literal["RUNNING", "CONVERGED", "CYCLE", "TIMEOUT"]


@dataclass(frozen=True)
class Seed:
    """A CFG-emitted (or hand-crafted) seed sentence + which positions are editable."""

    sentence: str
    content_word_indices: tuple[int, ...]
    locked_word_indices: tuple[int, ...]
    spec_id: str
    note: str = ""


@dataclass
class Trajectory:
    """Per-trajectory edit history."""

    traj_id: int
    history: list[tuple[str, float]] = field(default_factory=list)
    outcome: Outcome = "RUNNING"
    best_sentence: str | None = None
    best_coherence: float = float("-inf")


@dataclass
class EditedSentence:
    """Best-of-N result for a single seed."""

    seed: str
    spec_id: str
    verb: str
    coherence_seed: float
    best: str
    coherence_best: float
    unique_outputs: list[str]
    trajectories: list[Trajectory]
  • [ ] Step 4: Export from package

packages/generators/src/phonolex_generators/editor/__init__.py:

from phonolex_generators.editor.trajectory import (
    EditedSentence,
    Outcome,
    Seed,
    Trajectory,
)

__all__ = ["EditedSentence", "Outcome", "Seed", "Trajectory"]
  • [ ] Step 5: Run test (expect PASS)

Run:

uv run --package phonolex-generators pytest packages/generators/tests/test_trajectory.py -v

Expected: 3 passed.

  • [ ] Step 6: Commit
git add packages/generators/src/phonolex_generators/editor packages/generators/tests/test_trajectory.py
git commit -m "PHON-95 Task 5: editor — Seed/Trajectory/EditedSentence dataclasses"

Task 6: Editor — mlm_iterative_editor (prefix-walking decode + best-of-N)

Files:
- Create: packages/generators/src/phonolex_generators/editor/mlm_iterative_editor.py
- Modify: packages/generators/src/phonolex_generators/editor/__init__.py (export edit)
- Test: packages/generators/tests/test_mlm_iterative_editor.py

The editor generalizes the PHON-92 probe: outer loop (sentence-level edits with CONVERGED/CYCLE/TIMEOUT) and best-of-N trajectory shape are kept; the inner fill mechanic is replaced with prefix-walking decode (_fill_slot). For each content slot covering mask positions p_0..p_n: walk the trie left-to-right, sample one admitted token per position (gated by dead_end_ratio < 1.0), accumulate the prefix, greedy-stop the first time walk_to(accumulated).is_end and not is_banned_word(accumulated). Trie contract: caller passes a VocabTrie already tagged via trie.tag(banned).
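The three terminal outcomes of the outer loop can be sketched with a toy edit step. The semantics here are assumptions made for illustration (CONVERGED = the step returns the sentence unchanged, CYCLE = the step returns a sentence already visited, TIMEOUT = the iteration budget runs out); run_trajectory and the dictionary-backed step functions are hypothetical and do not reflect the probe's exact bookkeeping.

```python
from __future__ import annotations

from typing import Callable


def run_trajectory(
    seed: str, step: Callable[[str], str], max_iter: int
) -> tuple[str, list[str]]:
    """Apply step until a fixed point, a revisited sentence, or the budget ends."""
    history = [seed]
    current = seed
    for _ in range(max_iter):
        nxt = step(current)
        if nxt == current:
            return "CONVERGED", history  # the edit changed nothing
        if nxt in history:
            return "CYCLE", history  # the edit returned to an earlier sentence
        history.append(nxt)
        current = nxt
    return "TIMEOUT", history


converging = {"a": "b", "b": "b"}
cycling = {"a": "b", "b": "a"}

print(run_trajectory("a", converging.__getitem__, max_iter=15))
print(run_trajectory("a", cycling.__getitem__, max_iter=15))
print(run_trajectory("a", lambda s: s + "x", max_iter=3))
```

Best-of-N would then run several such trajectories with different RNG streams and keep the sentence with the highest coherence score.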

The deterministic test pins RNG and hyperparameters so per-position sampling is reproducible.
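Why a pinned seed makes sampling reproducible can be seen in a small standalone example. This sketch swaps in Python's random.Random for torch.Generator and uses invented logits; sample_with_temperature and draw are hypothetical names, and only the temperature value 0.7 is taken from the plan's defaults.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature) using the given RNG."""
    scaled = [lg / temperature for lg in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def draw(seed, n=20):
    """Draw n samples from a fresh RNG seeded with seed."""
    rng = random.Random(seed)
    return [sample_with_temperature([5.0, 4.0, 3.0], 0.7, rng) for _ in range(n)]


# The same seed replays the same sequence of draws.
print(draw(42) == draw(42))
```

Lower temperatures sharpen the distribution toward the top-logit candidate, so the sampled trajectories diverge less from each other.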

  • [ ] Step 1: Write the failing deterministic-output test

packages/generators/tests/test_mlm_iterative_editor.py:

"""Editor — deterministic best-of-N output with fixed RNG."""

from pathlib import Path

import pytest
from phonolex_data.runtime.store import WordStore
from phonolex_governors.generation.trie import VocabTrie

from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
from phonolex_generators.editor.mlm_iterative_editor import edit
from phonolex_generators.editor.trajectory import Seed
from phonolex_generators.shared.mlm_loader import get_mlm


@pytest.mark.slow
def test_edit_deterministic_with_n2_fixed_rng():
    """N=2 trajectories, fixed RNG seed → deterministic best output.

    Trie is the FULL 125K-word vocab tagged per-request with
    `banned = all - allowed(spec1)`. Per-position prefix-walking decode
    operates on this tagged trie.
    """
    model, tokenizer, _ = get_mlm()
    store = WordStore.from_parquet(Path("data/runtime/words.parquet"))
    all_words = [w.lower() for w in store.df["word"].to_list()]
    allowed = set(
        store.subset(SPEC_FILTERS["spec1"]).get_column("word").str.to_lowercase().to_list()
    )
    banned = set(all_words) - allowed

    trie = VocabTrie(all_words)
    trie.tag(banned)

    seed = Seed(
        sentence="the dog ate the bone",
        content_word_indices=(1, 4),
        locked_word_indices=(2,),
        spec_id="spec1",
        note="control",
    )

    result_a = edit(seed, model=model, tokenizer=tokenizer, trie=trie, n_trajectories=2, rng_base_seed=42)
    result_b = edit(seed, model=model, tokenizer=tokenizer, trie=trie, n_trajectories=2, rng_base_seed=42)

    assert result_a.best == result_b.best
    assert result_a.coherence_best == result_b.coherence_best
    # Best content words are spec-compliant: complete words in trie + not banned.
    best_tokens = result_a.best.split()
    for ci in seed.content_word_indices:
        w = best_tokens[ci].lower()
        node = trie.walk_to(w)
        assert node is not None and node.is_end, f"{w!r} not is_end"
        assert not trie.is_banned_word(w), f"{w!r} banned"
    # Verb (locked, index 2) preserved.
    assert best_tokens[2] == "ate"
    # Coherence improves (or stays equal).
    assert result_a.coherence_best >= result_a.coherence_seed
  • [ ] Step 2: Run test (expect FAIL)

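Run:

uv run --package phonolex-generators pytest packages/generators/tests/test_mlm_iterative_editor.py -v -m slow

Expected: collection fails with ModuleNotFoundError, since phonolex_generators.editor.mlm_iterative_editor does not exist yet. (If phonolex_generators.cfg_seed.spec_filters has not been created either, that import fails first.)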

  • [ ] Step 3: Implement the iterative editor

packages/generators/src/phonolex_generators/editor/mlm_iterative_editor.py:

"""MLM iterative editor — prefix-walking decode with best-of-N trajectories.

Outer loop and best-of-N shape lifted from PHON-92 probe; inner fill mechanic
is the per-content-slot prefix walk over the tagged VocabTrie. Greedy-stops
the first time the accumulated prefix is is_end-and-not-banned.

Trie contract: caller passes a `phonolex_governors.VocabTrie` already tagged
for this request via `trie.tag(banned)`.
"""

from __future__ import annotations

import torch
from phonolex_governors.generation.trie import VocabTrie

from phonolex_generators.editor.trajectory import EditedSentence, Seed, Trajectory
from phonolex_generators.editor.trie_filter import topk_admit_at_position
from phonolex_generators.scorer.joint_mask_pll import joint_masked_coherence
from phonolex_generators.shared.word_to_tokens import word_to_token_positions

# Hyperparameters — overridable per-call via `edit(...)` kwargs.
TRIE_TOP_K = 50  # admit-pool size at each mask position
SAMPLE_TOP_K = 10  # sample from top-K of admit pool
TEMPERATURE = 0.7
N_TRAJECTORIES = 8
MAX_ITER = 15


def _joint_mask_forward(
    model, tokenizer, sentence: str, mask_token_positions: list[int], device: str
):
    enc = tokenizer(sentence, return_tensors="pt").to(device)
    input_ids = enc["input_ids"]
    masked = input_ids.clone()
    for ti in mask_token_positions:
        masked[0, ti] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(masked).logits
    return input_ids, logits


def _sample_from_topk(
    admit: list[tuple[int, str, float]],
    sample_top_k: int,
    temperature: float,
    rng: torch.Generator,
    device: str,
) -> tuple[int, str, float] | None:
    if not admit:
        return None
    pool = admit[:sample_top_k]
    raw = torch.tensor([t[2] for t in pool], device=device)
    probs = torch.softmax(raw / temperature, dim=-1)
    idx = torch.multinomial(probs, num_samples=1, generator=rng).item()
    return pool[idx]


def _fill_slot(
    logits: torch.Tensor,
    slot_positions: list[int],
    tokenizer,
    trie: VocabTrie,
    rng: torch.Generator,
    device: str,
    *,
    trie_top_k: int,
    sample_top_k: int,
    temperature: float,
    forbid: set[str],
) -> str | None:
    """Walk the trie left-to-right across slot_positions; return a complete admissible word.

    Returns None if any position has no admit, or if the loop ends without
    accumulated being is_end-and-not-banned.
    """
    accumulated = ""
    for pos_idx, pos in enumerate(slot_positions):
        # Greedy early-stop: if we already have a complete admissible word, return it.
        if accumulated:
            node = trie.walk_to(accumulated)
            if (
                node is not None
                and node.is_end
                and not trie.is_banned_word(accumulated)
                and accumulated not in forbid
            ):
                return accumulated

        admit = topk_admit_at_position(
            logits, pos, tokenizer, trie,
            accumulated=accumulated,
            is_first=(pos_idx == 0),
            k=trie_top_k,
        )
        if not admit:
            return None
        # Anti-repetition across slots within an iteration: skip extensions whose
        # would-be complete word is already used in another slot.
        admit_filtered = [
            (tid, frag, lg) for (tid, frag, lg) in admit
            if (accumulated + frag) not in forbid
        ]
        if not admit_filtered:
            return None

        sampled = _sample_from_topk(
            admit_filtered, sample_top_k, temperature, rng, device
        )
        if sampled is None:
            return None
        accumulated = accumulated + sampled[1]

    # End of mask region — accept iff is_end-and-not-banned.
    node = trie.walk_to(accumulated)
    if (
        node is not None
        and node.is_end
        and not trie.is_banned_word(accumulated)
        and accumulated not in forbid
    ):
        return accumulated
    return None


def _run_trajectory(
    seed: Seed,
    model,
    tokenizer,
    trie: VocabTrie,
    traj_id: int,
    rng: torch.Generator,
    device: str,
    *,
    trie_top_k: int,
    sample_top_k: int,
    temperature: float,
    max_iter: int,
) -> Trajectory:
    traj = Trajectory(traj_id=traj_id)
    current = seed.sentence
    seen = {current}
    coh = joint_masked_coherence(
        model, tokenizer, current, list(seed.content_word_indices)
    )
    traj.history.append((current, coh))
    traj.best_sentence, traj.best_coherence = current, coh

    for _ in range(max_iter):
        word_pos = word_to_token_positions(
            tokenizer, current, list(seed.content_word_indices)
        )
        mask_token_positions = [p for ps in word_pos.values() for p in ps]
        _, logits = _joint_mask_forward(
            model, tokenizer, current, mask_token_positions, device
        )

        new_words = current.split()
        any_change = False
        already_used: set[str] = set()
        for wi in seed.content_word_indices:
            slot_positions = word_pos.get(wi, [])
            if not slot_positions:
                continue
            fill = _fill_slot(
                logits, slot_positions, tokenizer, trie, rng, device,
                trie_top_k=trie_top_k,
                sample_top_k=sample_top_k,
                temperature=temperature,
                forbid=already_used,
            )
            if fill is None:
                continue
            already_used.add(fill)
            if fill != new_words[wi].lower():
                new_words[wi] = fill
                any_change = True

        new_sentence = " ".join(new_words)
        new_coh = joint_masked_coherence(
            model, tokenizer, new_sentence, list(seed.content_word_indices)
        )
        traj.history.append((new_sentence, new_coh))
        if new_coh > traj.best_coherence:
            traj.best_sentence = new_sentence
            traj.best_coherence = new_coh

        if not any_change:
            traj.outcome = "CONVERGED"
            return traj
        if new_sentence in seen:
            traj.outcome = "CYCLE"
            return traj
        seen.add(new_sentence)
        current = new_sentence

    traj.outcome = "TIMEOUT"
    return traj


def edit(
    seed: Seed,
    model,
    tokenizer,
    trie: VocabTrie,
    *,
    verb: str | None = None,
    n_trajectories: int = N_TRAJECTORIES,
    rng_base_seed: int = 42,
    trie_top_k: int = TRIE_TOP_K,
    sample_top_k: int = SAMPLE_TOP_K,
    temperature: float = TEMPERATURE,
    max_iter: int = MAX_ITER,
) -> EditedSentence:
    """Best-of-N sampled iterative edit for a single seed."""
    device = next(model.parameters()).device
    seed_coh = joint_masked_coherence(
        model, tokenizer, seed.sentence, list(seed.content_word_indices)
    )

    if verb is None:
        # Default: pick the first locked word (probe convention: index 2 is the verb).
        words = seed.sentence.split()
        verb = words[seed.locked_word_indices[0]] if seed.locked_word_indices else ""

    rng = torch.Generator(device=str(device))
    trajectories: list[Trajectory] = []
    unique: dict[str, None] = {}
    best_sentence: str = seed.sentence
    best_coh: float = seed_coh  # seed is the baseline; also covers n_trajectories == 0

    for tid in range(n_trajectories):
        rng.manual_seed(rng_base_seed + tid)
        traj = _run_trajectory(
            seed,
            model,
            tokenizer,
            trie,
            tid,
            rng,
            str(device),
            trie_top_k=trie_top_k,
            sample_top_k=sample_top_k,
            temperature=temperature,
            max_iter=max_iter,
        )
        trajectories.append(traj)
        if traj.best_sentence is not None:
            unique[traj.best_sentence] = None
            if traj.best_coherence > best_coh:
                best_coh = traj.best_coherence
                best_sentence = traj.best_sentence

    return EditedSentence(
        seed=seed.sentence,
        spec_id=seed.spec_id,
        verb=verb,
        coherence_seed=seed_coh,
        best=best_sentence,
        coherence_best=best_coh,
        unique_outputs=list(unique.keys()),
        trajectories=trajectories,
    )
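
The outer loop in `_run_trajectory` above ends in one of three ways. The sketch below isolates that control flow with toy step functions so the three outcome labels can be checked without a model. `run_until_stable` and the lookup tables are hypothetical illustrations, not part of the package.

```python
from __future__ import annotations

from typing import Callable


def run_until_stable(
    start: str, step: Callable[[str], str], max_iter: int
) -> tuple[str, list[str]]:
    """Iterate `step` from `start`; return (outcome, history).

    Outcomes mirror the labels used by the editor:
      CONVERGED: step returned its input unchanged
      CYCLE:     step produced a state already visited
      TIMEOUT:   max_iter steps elapsed without either
    """
    current = start
    seen = {current}
    history = [current]
    for _ in range(max_iter):
        nxt = step(current)
        history.append(nxt)  # record before checking, as the editor does
        if nxt == current:
            return "CONVERGED", history
        if nxt in seen:
            return "CYCLE", history
        seen.add(nxt)
        current = nxt
    return "TIMEOUT", history


# Toy step functions: each maps a "sentence" to its "edited" form.
fixed_point = {"a": "b", "b": "c", "c": "c"}
two_cycle = {"a": "b", "b": "a"}

converged = run_until_stable("a", fixed_point.__getitem__, max_iter=15)
cycled = run_until_stable("a", two_cycle.__getitem__, max_iter=15)
timed_out = run_until_stable("x", lambda s: s + "x", max_iter=3)
```

Checking for an unchanged state before checking the visited set matters: an unchanged sentence is trivially in `seen`, so the reverse order would report every convergence as a cycle.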
  • [ ] Step 4: Export from package

Modify packages/generators/src/phonolex_generators/editor/__init__.py:

from phonolex_generators.editor.mlm_iterative_editor import edit
from phonolex_generators.editor.trajectory import (
    EditedSentence,
    Outcome,
    Seed,
    Trajectory,
)
from phonolex_generators.editor.trie_filter import topk_admit_at_position

__all__ = [
    "EditedSentence",
    "Outcome",
    "Seed",
    "Trajectory",
    "edit",
    "topk_admit_at_position",
]
  • [ ] Step 5: Run test (expect PASS)

Run: uv run --package phonolex-generators pytest packages/generators/tests/test_mlm_iterative_editor.py -v -m slow Expected: 1 passed (~5–10s wall time on MPS).

  • [ ] Step 6: Commit
git add packages/generators/src/phonolex_generators/editor packages/generators/tests/test_mlm_iterative_editor.py
git commit -m "PHON-95 Task 6: editor.mlm_iterative_editor — best-of-N sampled iterative edit"

Task 7: CFG Seed — spec_filters

Files:

  • Create: packages/generators/src/phonolex_generators/cfg_seed/spec_filters.py
  • Test: packages/generators/tests/test_spec_filters.py

The probe used SQL queries for spec1/spec6 against the D1 SQLite database. The productionized version uses Polars expressions passed to WordStore.subset(...). This module is the dictionary of those expressions, keyed by spec_id.
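
To show the registry shape without any dependencies, the sketch below swaps the Polars expressions for plain Python predicates over dict rows. This is a deliberate substitution, not the module's implementation. The rows, the pipe-delimited phoneme strings, and the names `TOY_FILTERS` and `subset` are hypothetical.

```python
from __future__ import annotations

from typing import Any, Callable

Row = dict[str, Any]

# Hypothetical toy rows; column names follow the filters used in this task.
ROWS: list[Row] = [
    {"word": "cat", "phonemes_str": "|k|a|t|", "syllable_count": 1, "pos": "NOUN"},
    {"word": "kitchen", "phonemes_str": "|k|i|ch|e|n|", "syllable_count": 2, "pos": "NOUN"},
    {"word": "calendar", "phonemes_str": "|k|a|l|e|n|d|er|", "syllable_count": 3, "pos": "NOUN"},
    {"word": "dog", "phonemes_str": "|d|o|g|", "syllable_count": 1, "pos": "NOUN"},
    {"word": "quick", "phonemes_str": "|k|w|i|k|", "syllable_count": 1, "pos": "ADJ"},
    {"word": "knife", "phonemes_str": "|n|ai|f|", "syllable_count": 1, "pos": "NOUN"},
]

# spec_id -> predicate; a stand-in for the spec_id -> pl.Expr mapping.
TOY_FILTERS: dict[str, Callable[[Row], bool]] = {
    "spec1": lambda r: (
        r["phonemes_str"].startswith("|k|")  # filter on the first phoneme, not the spelling
        and r["syllable_count"] <= 2
        and r["pos"] in {"NOUN", "VERB"}
    ),
}


def subset(rows: list[Row], spec_id: str) -> list[str]:
    """Return the words that satisfy the named spec; unknown ids raise KeyError."""
    predicate = TOY_FILTERS[spec_id]
    return [r["word"] for r in rows if predicate(r)]


spec1_words = subset(ROWS, "spec1")
```

Note that "knife" is excluded even though it is spelled with a k, because the filter looks at the phoneme string.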

  • [ ] Step 1: Write the failing test

packages/generators/tests/test_spec_filters.py:

"""Spec filters — Polars expressions yielding the probe's lexicon counts."""

from pathlib import Path

from phonolex_data.runtime.store import WordStore

from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS


def _store() -> WordStore:
    return WordStore.from_parquet(Path("data/runtime/words.parquet"))


def test_spec1_count_matches_probe():
    """spec1 = words starting /k/, syllable_count ≤ 2, POS in NOUN/VERB.

    Probe printed: 'spec spec1 VocabTrie: 1,798 words'
    """
    store = _store()
    df = store.subset(SPEC_FILTERS["spec1"])
    assert df.height == 1798


def test_spec6_count_matches_probe():
    """spec6 = syllable_count ≤ 2, POS in NOUN/VERB/ADJ, iconicity ≥ 1.8, imageability ≥ 4.5.

    Probe printed: 'spec spec6 VocabTrie: 649 words'
    """
    store = _store()
    df = store.subset(SPEC_FILTERS["spec6"])
    assert df.height == 649


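def test_unknown_spec_id_raises_keyerror():
    import pytest  # local import keeps this test self-contained

    assert "spec1" in SPEC_FILTERS
    assert "spec999" not in SPEC_FILTERS
    # The test name promises a KeyError, so exercise the lookup itself.
    with pytest.raises(KeyError):
        SPEC_FILTERS["spec999"]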
  • [ ] Step 2: Run test (expect FAIL)

Run: uv run --package phonolex-generators pytest packages/generators/tests/test_spec_filters.py -v Expected: ImportError.

  • [ ] Step 3: Implement spec filters

packages/generators/src/phonolex_generators/cfg_seed/spec_filters.py:

"""Spec ID → Polars filter expression. PHON-64 v2 failure-case lexicon.

These match the SQL queries in `probe_sampled_iterative.py::load_spec_words`.
Counts verified against the probe's runtime printout (spec1=1798, spec6=649).
"""

from __future__ import annotations

import polars as pl

SPEC_FILTERS: dict[str, pl.Expr] = {
    "spec1": (
        pl.col("phonemes_str").str.starts_with("|k|")
        & (pl.col("syllable_count") <= 2)
        & pl.col("pos").is_in(["NOUN", "VERB"])
    ),
    "spec6": (
        (pl.col("syllable_count") <= 2)
        & pl.col("pos").is_in(["NOUN", "VERB", "ADJ"])
        & (pl.col("iconicity") >= 1.8)
        & (pl.col("imageability") >= 4.5)
    ),
}
  • [ ] Step 4: Run test (expect PASS)

Run: uv run --package phonolex-generators pytest packages/generators/tests/test_spec_filters.py -v Expected: 3 passed.

  • [ ] Step 5: Commit
git add packages/generators/src/phonolex_generators/cfg_seed/spec_filters.py packages/generators/tests/test_spec_filters.py
git commit -m "PHON-95 Task 7: cfg_seed.spec_filters — PHON-64 v2 spec1/spec6 Polars exprs"

Task 8: CFG Seed — argstruc_enumerator

Files:

  • Create: packages/generators/src/phonolex_generators/cfg_seed/argstruc_enumerator.py
  • Modify: packages/generators/src/phonolex_generators/cfg_seed/__init__.py (export enumerate_seeds)
  • Test: packages/generators/tests/test_argstruc_enumerator.py

Verb-locked CFG NP V NP enumerator. Slot terminals are the intersection of (a) the spec lexicon (store.subset(spec_expr)) and (b) the per-(verb, role) PMI-admit set from selectional.parquet. v1 picks 4 nsubj × 4 dobj fills randomly to cap at 16 seeds; the determiner is "the" and pluralization/agreement is deferred (OQ3).
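
The slot-filling rule described above can be sketched on toy sets before touching WordStore. Everything here is hypothetical (`toy_enumerate`, the word sets); it only illustrates the intersect, sample, and cross-product steps under the same "the N V the N" template.

```python
from __future__ import annotations

import random


def toy_enumerate(
    spec_words: set[str],
    nsubj_admit: set[str],
    dobj_admit: set[str],
    verb: str,
    per_pool: int = 4,
    max_seeds: int = 16,
    rng_seed: int = 42,
) -> list[str]:
    """Emit up to max_seeds 'the {nsubj} {verb} the {dobj}' sentences."""
    # Sort the intersections so sampling does not depend on set iteration order.
    nsubj_pool = sorted(spec_words & nsubj_admit)
    dobj_pool = sorted(spec_words & dobj_admit)
    rng = random.Random(rng_seed)
    nsubj_pick = rng.sample(nsubj_pool, k=min(per_pool, len(nsubj_pool)))
    dobj_pick = rng.sample(dobj_pool, k=min(per_pool, len(dobj_pool)))
    sentences = [
        f"the {n} {verb} the {d}"
        for n in nsubj_pick
        for d in dobj_pick
        if n != d  # skip reflexive fills such as "the king cut the king"
    ]
    return sentences[:max_seeds]


spec = {"cook", "kid", "cake", "cord", "coat", "king", "cup"}
nsubj = {"cook", "kid", "king", "thunder"}  # "thunder" is outside the spec lexicon
dobj = {"cake", "cord", "coat", "cup", "king", "paper"}  # so is "paper"

seeds = toy_enumerate(spec, nsubj, dobj, verb="cut")
again = toy_enumerate(spec, nsubj, dobj, verb="cut")
```

With 3 eligible subjects and 4 sampled objects, the cross product yields 12 sentences, or 11 when "king" is sampled as an object and the reflexive pair is dropped.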

  • [ ] Step 1: Write the failing test

packages/generators/tests/test_argstruc_enumerator.py:

"""argstruc_enumerator — verb-locked NP V NP CFG with WordStore + PMI gating."""

from pathlib import Path

import pytest
from phonolex_data.runtime.store import WordStore

from phonolex_generators.cfg_seed.argstruc_enumerator import enumerate_seeds, pmi_admit
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS


@pytest.fixture(scope="module")
def store() -> WordStore:
    s = WordStore.from_parquet(Path("data/runtime/words.parquet"))
    s.attach_selectional(Path("data/runtime/selectional.parquet"))
    return s


def test_pmi_admit_cake_yes_thunder_no(store: WordStore):
    """Acceptance criterion 4: cake ∈ admit, thunder ∉ admit (verb=cut, dobj, fineweb_adult)."""
    admit = pmi_admit(store, verb="cut", role="dobj", band="fineweb_adult")
    assert "cake" in admit
    assert "thunder" not in admit


def test_enumerate_seeds_for_cut_spec1_emits_in_spec_seeds(store: WordStore):
    """Acceptance criterion 2.a: ≥4 seeds; all content words in spec lexicon AND admit set."""
    spec_lex = set(store.subset(SPEC_FILTERS["spec1"])["word"].str.to_lowercase().to_list())
    nsubj_admit = pmi_admit(store, verb="cut", role="nsubj", band="fineweb_adult")
    dobj_admit = pmi_admit(store, verb="cut", role="dobj", band="fineweb_adult")

    seeds = enumerate_seeds(
        store=store,
        spec_id="spec1",
        verb="cut",
        band="fineweb_adult",
        max_seeds=16,
        rng_seed=7,
    )
    assert len(seeds) >= 4
    for s in seeds:
        words = s.sentence.split()
        nsubj_word, dobj_word = words[1].lower(), words[4].lower()
        assert nsubj_word in spec_lex, f"{nsubj_word} not in spec1 lexicon"
        assert dobj_word in spec_lex, f"{dobj_word} not in spec1 lexicon"
        assert nsubj_word in nsubj_admit, f"{nsubj_word} not in nsubj admit"
        assert dobj_word in dobj_admit, f"{dobj_word} not in dobj admit"
        assert s.spec_id == "spec1"
        assert s.locked_word_indices == (2,)
        assert s.content_word_indices == (1, 4)


def test_enumerate_seeds_deterministic_with_rng_seed(store: WordStore):
    a = enumerate_seeds(store, "spec1", "cut", "fineweb_adult", max_seeds=8, rng_seed=99)
    b = enumerate_seeds(store, "spec1", "cut", "fineweb_adult", max_seeds=8, rng_seed=99)
    assert [s.sentence for s in a] == [s.sentence for s in b]
  • [ ] Step 2: Run test (expect FAIL)

Run: uv run --package phonolex-generators pytest packages/generators/tests/test_argstruc_enumerator.py -v Expected: ImportError.

  • [ ] Step 3: Implement pmi_admit and enumerate_seeds

packages/generators/src/phonolex_generators/cfg_seed/argstruc_enumerator.py:

"""Verb-locked argument-structure CFG seed enumerator.

Productions:
    S → NP V NP
    NP → "the" N

V is locked at production time. Both N slots are filled from the
intersection of (a) the spec lexicon (`WordStore.subset(spec_expr)`) and
(b) the per-(verb, role, band) PMI-admit set from `selectional.parquet`.

v1: determiner fixed to "the"; pluralization/agreement deferred (OQ3).
"""

from __future__ import annotations

import random

import polars as pl
from phonolex_data.runtime.store import WordStore

from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
from phonolex_generators.editor.trajectory import Seed


def pmi_admit(store: WordStore, verb: str, role: str, band: str) -> set[str]:
    """Per-(verb, role, band) selectional admit set: fillers with PPMI > 0."""
    if store._selectional_df is None:
        raise RuntimeError(
            "selectional.parquet not attached; call "
            "`store.attach_selectional(path)` before enumerate_seeds()"
        )
    df = store._selectional_df.filter(
        (pl.col("verb") == verb)
        & (pl.col("role") == role)
        & (pl.col("band") == band)
        & (pl.col("ppmi") > 0.0)
    )
    return set(df.get_column("filler").to_list())


def enumerate_seeds(
    store: WordStore,
    spec_id: str,
    verb: str,
    band: str = "fineweb_adult",
    *,
    max_seeds: int = 16,
    nsubj_per_pool: int = 4,
    dobj_per_pool: int = 4,
    rng_seed: int = 42,
) -> list[Seed]:
    """Emit up to `max_seeds` `the {nsubj} {verb} the {dobj}` seed sentences."""
    if spec_id not in SPEC_FILTERS:
        raise KeyError(f"unknown spec_id: {spec_id!r}; known: {list(SPEC_FILTERS)}")

    spec_words = set(
        store.subset(SPEC_FILTERS[spec_id])
        .get_column("word")
        .str.to_lowercase()
        .to_list()
    )
    nsubj_admit = pmi_admit(store, verb=verb, role="nsubj", band=band)
    dobj_admit = pmi_admit(store, verb=verb, role="dobj", band=band)

    nsubj_pool = sorted(spec_words & nsubj_admit)
    dobj_pool = sorted(spec_words & dobj_admit)

    rng = random.Random(rng_seed)
    n_pick = min(len(nsubj_pool), nsubj_per_pool)
    d_pick = min(len(dobj_pool), dobj_per_pool)
    nsubj_sample = rng.sample(nsubj_pool, k=n_pick) if n_pick else []
    dobj_sample = rng.sample(dobj_pool, k=d_pick) if d_pick else []

    seeds: list[Seed] = []
    for nsubj in nsubj_sample:
        for dobj in dobj_sample:
            if nsubj == dobj:
                continue
            seeds.append(
                Seed(
                    sentence=f"the {nsubj} {verb} the {dobj}",
                    content_word_indices=(1, 4),
                    locked_word_indices=(2,),
                    spec_id=spec_id,
                    note=f"CFG-emitted ({nsubj}, {verb}, {dobj}) band={band}",
                )
            )
            if len(seeds) >= max_seeds:
                return seeds
    return seeds
  • [ ] Step 4: Export from package

packages/generators/src/phonolex_generators/cfg_seed/__init__.py:

from phonolex_generators.cfg_seed.argstruc_enumerator import (
    enumerate_seeds,
    pmi_admit,
)
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS

__all__ = ["SPEC_FILTERS", "enumerate_seeds", "pmi_admit"]
  • [ ] Step 5: Run test (expect PASS)

Run: uv run --package phonolex-generators pytest packages/generators/tests/test_argstruc_enumerator.py -v Expected: 3 passed.

  • [ ] Step 6: Commit
git add packages/generators/src/phonolex_generators/cfg_seed packages/generators/tests/test_argstruc_enumerator.py
git commit -m "PHON-95 Task 8: cfg_seed.argstruc_enumerator — verb-locked NP V NP enumerator"

Task 9: PHON-64 v2 Acceptance Test (compliance + perf, not byte-equality)

Files:

  • Create: packages/generators/tests/test_acceptance_phon64v2.py

The PHON-64 v2 regression feeds all 5 hand-crafted probe seeds through the productionized editor + scorer, asserting per seed: (a) coherence improves, (b) verbs are locked, (c) every content word is spec-compliant under the tagged trie (walk_to(w).is_end and not is_banned_word(w)). Byte-equality with the probe's sampled_locked_dedup_output.txt is NOT a target — the prefix-walking decode generalizes the probe's single-position filter, so outputs are expected to differ (and ideally improve, since multi-token spec words become reachable). The probe gold remains in the spike branch as a smoke baseline; we don't copy it as a fixture.
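
The compliance check in (c) is a two-part predicate: the word must end at a terminal trie node, and it must not be tagged as banned. A self-contained sketch with a toy character trie follows. `ToyTrie` is a hypothetical stand-in that borrows the method names used in this plan (`tag`, `walk_to`, `is_banned_word`); it is not the real `phonolex_governors` `VocabTrie`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyNode:
    children: dict[str, "ToyNode"] = field(default_factory=dict)
    is_end: bool = False


class ToyTrie:
    """Hypothetical character trie; NOT the real VocabTrie."""

    def __init__(self, words: list[str]) -> None:
        self.root = ToyNode()
        self.banned: set[str] = set()
        for w in words:
            node = self.root
            for ch in w:
                node = node.children.setdefault(ch, ToyNode())
            node.is_end = True

    def tag(self, banned: set[str]) -> None:
        """Replace the banned set; the trie structure itself is untouched."""
        self.banned = set(banned)

    def walk_to(self, prefix: str) -> ToyNode | None:
        node = self.root
        for ch in prefix:
            nxt = node.children.get(ch)
            if nxt is None:
                return None
            node = nxt
        return node

    def is_banned_word(self, word: str) -> bool:
        return word in self.banned


def is_compliant(trie: ToyTrie, word: str) -> bool:
    """Complete word in the trie AND not banned for the current spec."""
    node = trie.walk_to(word)
    return node is not None and node.is_end and not trie.is_banned_word(word)


vocab = ["cat", "cats", "cup", "dog"]
allowed = {"cat", "cup"}
trie = ToyTrie(vocab)
trie.tag(set(vocab) - allowed)  # same "banned = all_words - allowed" convention
```

Both halves of the predicate are needed: "ca" is a valid prefix but not a word, while "dog" is a word but banned under this toy spec.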

  • [ ] Step 1: Write the failing acceptance test

packages/generators/tests/test_acceptance_phon64v2.py:

"""PHON-64 v2 acceptance: 5 probe seeds → coherent in-spec English.

Spec acceptance criterion 1: all 5 seeds produce a `best` with
coherence_best > coherence_seed and 100% spec compliance under the tagged
trie. The probe's `sampled_locked_dedup_output.txt` is a smoke baseline,
not a byte-equality target — prefix-walking decode generalizes the probe.
"""

import time
from pathlib import Path

import pytest
from phonolex_data.runtime.store import WordStore
from phonolex_governors.generation.trie import VocabTrie

from phonolex_generators.cfg_seed.argstruc_enumerator import enumerate_seeds
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
from phonolex_generators.editor.mlm_iterative_editor import edit
from phonolex_generators.editor.trajectory import Seed
from phonolex_generators.shared.mlm_loader import get_mlm

PROBE_SEEDS = [
    Seed("the puppy melt the baby", (1, 4), (2,), "spec6", "F1 case, verb locked"),
    Seed("the cat chased the ball", (1, 4), (2,), "spec1", "Well-formed control"),
    Seed("the snow filled the cup", (1, 4), (2,), "spec1", "Reasonable spec1 seed"),
    Seed("the coleslaw cut the control", (1, 4), (2,), "spec1", "F1 + multi-BPE"),
    Seed("the dog ate the bone", (1, 4), (2,), "spec1", "Well-formed control 2"),
]


@pytest.fixture(scope="module")
def store() -> WordStore:
    s = WordStore.from_parquet(Path("data/runtime/words.parquet"))
    s.attach_selectional(Path("data/runtime/selectional.parquet"))
    return s


@pytest.fixture(scope="module")
def all_words(store: WordStore) -> list[str]:
    return [w.lower() for w in store.df["word"].to_list()]


@pytest.fixture(scope="module")
def vocab_trie(all_words: list[str]) -> VocabTrie:
    """Build the trie ONCE; per-test we only retag with a different banned set."""
    return VocabTrie(all_words)


def _retag_for_spec(trie: VocabTrie, store: WordStore, all_words: list[str], spec_id: str) -> set[str]:
    allowed = set(
        store.subset(SPEC_FILTERS[spec_id]).get_column("word").str.to_lowercase().to_list()
    )
    banned = set(all_words) - allowed
    trie.tag(banned)
    return allowed


@pytest.mark.acceptance
@pytest.mark.slow
@pytest.mark.parametrize("seed", PROBE_SEEDS, ids=lambda s: s.sentence.replace(" ", "_"))
def test_probe_seed_improves_and_stays_in_spec(
    seed: Seed,
    store: WordStore,
    all_words: list[str],
    vocab_trie: VocabTrie,
):
    _retag_for_spec(vocab_trie, store, all_words, seed.spec_id)

    model, tokenizer, _ = get_mlm()
    result = edit(seed, model=model, tokenizer=tokenizer, trie=vocab_trie)

    assert result.coherence_best > result.coherence_seed, (
        f"Seed {seed.sentence!r}: edit did not improve coherence "
        f"(seed={result.coherence_seed:+.2f}, best={result.coherence_best:+.2f})"
    )
    best_words = result.best.split()
    # Verb is locked.
    seed_words = seed.sentence.split()
    for li in seed.locked_word_indices:
        assert best_words[li] == seed_words[li], (
            f"Locked word at index {li} changed: {seed_words[li]!r} → {best_words[li]!r}"
        )
    # Content words are spec-compliant under the trie.
    for ci in seed.content_word_indices:
        w = best_words[ci].lower()
        node = vocab_trie.walk_to(w)
        assert node is not None and node.is_end, (
            f"Content word at index {ci} {w!r} is not is_end in trie"
        )
        assert not vocab_trie.is_banned_word(w), (
            f"Content word at index {ci} {w!r} is banned for spec {seed.spec_id}"
        )


@pytest.mark.acceptance
@pytest.mark.slow
def test_performance_gate_16_seeds_under_30s(
    store: WordStore,
    all_words: list[str],
    vocab_trie: VocabTrie,
):
    """Acceptance criterion 3: 16-seed batch ≤ 30s wall-clock on MPS."""
    _retag_for_spec(vocab_trie, store, all_words, "spec1")

    seeds = enumerate_seeds(store, "spec1", "cut", "fineweb_adult", max_seeds=16, rng_seed=42)
    assert len(seeds) >= 8

    model, tokenizer, _ = get_mlm()
    t0 = time.perf_counter()
    for s in seeds:
        edit(s, model=model, tokenizer=tokenizer, trie=vocab_trie)
    elapsed = time.perf_counter() - t0
    assert elapsed <= 30.0, f"16-seed batch took {elapsed:.1f}s; budget is 30s"
  • [ ] Step 2: Run the acceptance test

Run: uv run --package phonolex-generators pytest packages/generators/tests/test_acceptance_phon64v2.py -v -m "slow and acceptance" Expected: 6 passed (5 seed cases + perf gate). Wall-clock: ~30–60s total (model load + 5 best-of-8 edits + 16-seed sweep). If a seed case fails, check whether the prefix-walking decode is dead-ending early. The likely culprit is a spec lexicon so restrictive that no top-K candidate at p_0 passes the dead_end_ratio < 1.0 filter, which leaves the slot with no admits.
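
When debugging a failing seed case, it can help to check by hand which first-position fragments are still live. The sketch below assumes that `dead_end_ratio` means "the fraction of vocabulary completions under a prefix that are banned", which matches this plan's reading that a ratio below 1.0 leaves at least one compliant completion reachable. The real implementation may compute it differently, and the function and word lists here are hypothetical.

```python
from __future__ import annotations


def dead_end_ratio(prefix: str, vocab: list[str], allowed: set[str]) -> float:
    """Assumed definition: banned completions / all completions under `prefix`."""
    completions = [w for w in vocab if w.startswith(prefix)]
    if not completions:
        return 1.0  # nothing reachable at all, so treat as fully dead
    banned = sum(1 for w in completions if w not in allowed)
    return banned / len(completions)


vocab = ["court", "courtroom", "cow", "dog", "door"]
allowed = {"courtroom", "cow"}

ratio_court = dead_end_ratio("court", vocab, allowed)  # "court" banned, "courtroom" allowed
ratio_do = dead_end_ratio("do", vocab, allowed)        # "dog" and "door" both banned
ratio_co = dead_end_ratio("co", vocab, allowed)        # one banned out of three
ratio_z = dead_end_ratio("z", vocab, allowed)          # no completions

# Which candidate first fragments would survive the < 1.0 filter?
first_fragments = ["do", "co", "z"]
live = [f for f in first_fragments if dead_end_ratio(f, vocab, allowed) < 1.0]
```

If `live` comes back empty for every top-K fragment at the first mask position, the slot has no admits, and the fix is on the lexicon or top-K side rather than in the walk itself.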

  • [ ] Step 3: Run the full unit-test suite to confirm no regressions

Run: uv run --package phonolex-generators pytest packages/generators/tests/ -v (default — slow tests skipped) Expected: all unmarked tests passed.

Run: uv run python -m pytest packages/data/tests/ -v (acceptance criterion 5) Expected: all 201 packages/data tests still pass.

  • [ ] Step 4: Commit
git add packages/generators/tests/test_acceptance_phon64v2.py
git commit -m "PHON-95 Task 9: acceptance — PHON-64 v2 compliance + 16-seed perf gate"

Task 10: Reproducible Run Script

Files:

  • Create: packages/generation/research/2026-05-07-phon-95-editor/run.py
  • Create: packages/generation/research/2026-05-07-phon-95-editor/README.md

A small CLI under the existing packages/generation/research/ tree (where 2026-04-29-eval-harness-v1 and similar live). Takes (spec_id, verb, n_seeds) and prints the editor's outputs in the same shape the probe printed.

  • [ ] Step 1: Implement the run script

packages/generation/research/2026-05-07-phon-95-editor/run.py:

"""PHON-95 reproducible run — CFG enumerate → MLM edit (prefix-walking decode) → coherence-rank.

Usage:
    uv run python packages/generation/research/2026-05-07-phon-95-editor/run.py \\
        --spec-id spec1 --verb cut --band fineweb_adult --n-seeds 8

Prints per-seed best output + coherence + unique-output count.
"""

from __future__ import annotations

import argparse
import time
from pathlib import Path

from phonolex_data.runtime.store import WordStore
from phonolex_governors.generation.trie import VocabTrie

from phonolex_generators.cfg_seed.argstruc_enumerator import enumerate_seeds
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
from phonolex_generators.editor.mlm_iterative_editor import edit
from phonolex_generators.shared.mlm_loader import get_mlm


def main() -> None:
    parser = argparse.ArgumentParser(description="PHON-95 editor run")
    parser.add_argument("--spec-id", required=True, choices=sorted(SPEC_FILTERS.keys()))
    parser.add_argument("--verb", required=True)
    parser.add_argument("--band", default="fineweb_adult")
    parser.add_argument("--n-seeds", type=int, default=8)
    parser.add_argument("--n-trajectories", type=int, default=8)
    parser.add_argument("--rng-seed", type=int, default=42)
    parser.add_argument(
        "--words-parquet", type=Path, default=Path("data/runtime/words.parquet")
    )
    parser.add_argument(
        "--selectional-parquet",
        type=Path,
        default=Path("data/runtime/selectional.parquet"),
    )
    args = parser.parse_args()

    print(f"Loading WordStore from {args.words_parquet} ...")
    store = WordStore.from_parquet(args.words_parquet)
    store.attach_selectional(args.selectional_parquet)

    print("Loading RoBERTa-large ...")
    model, tokenizer, device = get_mlm()
    print(f"  device = {device}")

    print("Building VocabTrie over full vocab ...")
    all_words = [w.lower() for w in store.df["word"].to_list()]
    trie = VocabTrie(all_words)

    allowed = set(
        store.subset(SPEC_FILTERS[args.spec_id])
        .get_column("word").str.to_lowercase().to_list()
    )
    banned = set(all_words) - allowed
    trie.tag(banned)
    print(f"  spec {args.spec_id}: {len(allowed):,} allowed / {len(banned):,} banned\n")

    seeds = enumerate_seeds(
        store,
        spec_id=args.spec_id,
        verb=args.verb,
        band=args.band,
        max_seeds=args.n_seeds,
        rng_seed=args.rng_seed,
    )
    if not seeds:
        print(f"No seeds emitted for spec={args.spec_id} verb={args.verb} band={args.band}")
        return

    t0 = time.perf_counter()
    for s in seeds:
        print("=" * 78)
        print(f"SEED:  {s.sentence!r}  (spec={s.spec_id})")
        result = edit(
            s,
            model=model,
            tokenizer=tokenizer,
            trie=trie,
            n_trajectories=args.n_trajectories,
            rng_base_seed=args.rng_seed,
        )
        print(f"  seed coherence  = {result.coherence_seed:+.2f}")
        print(f"  best            = {result.best!r}")
        print(f"  best coherence  = {result.coherence_best:+.2f}")
        print(f"  unique outputs  = {len(result.unique_outputs)} / {args.n_trajectories}")
    elapsed = time.perf_counter() - t0
    print("=" * 78)
    print(f"\nTotal: {elapsed:.1f}s for {len(seeds)} seeds "
          f"({elapsed / len(seeds):.1f}s/seed average)")


if __name__ == "__main__":
    main()
  • [ ] Step 2: Add a brief README for the research dir

packages/generation/research/2026-05-07-phon-95-editor/README.md:

# PHON-95 — MLM Iterative Editor + Argstruc CFG Enumerator

Reproducible run script for the productionized PHON-92 stack
(`phonolex_generators`).

## Usage

    uv run python run.py --spec-id spec1 --verb cut --band fineweb_adult --n-seeds 8

## What this is

Tiny driver around the new `phonolex_generators` package. The package
itself lives at `packages/generators/`; this directory only holds the
demo CLI + any artifacts produced from probing.

See:
- spec: `docs/superpowers/specs/2026-05-07-phon-95-mlm-editor-cfg-enumerator-design.md`
- plan: `docs/superpowers/plans/2026-05-07-phon-95-mlm-editor-cfg-enumerator.md`
- predecessor probe (gold): `research/phon-92-selectional-preference-spike` @
  `packages/generation/research/2026-05-05-phon-92-selectional-preference/diffusion-editor-probe/`
  • [ ] Step 3: Smoke-run the script

Run: uv run python packages/generation/research/2026-05-07-phon-95-editor/run.py --spec-id spec1 --verb cut --n-seeds 4 --n-trajectories 4 Expected: prints 4 seeds with best ≠ seed sentence for at least 3, no exceptions.

  • [ ] Step 4: Commit
git add packages/generation/research/2026-05-07-phon-95-editor/
git commit -m "PHON-95 Task 10: reproducible run script under packages/generation/research/"

Task 11: Documentation — README + CLAUDE.md update

Files:

  • Create: packages/generators/README.md
  • Modify: CLAUDE.md (mention the new package alongside phonolex_data / phonolex_governors)

  • [ ] Step 1: Write the package README

packages/generators/README.md:

# phonolex_generators

C1 combinatorial generation track — productionization of the PHON-92
validated stack.

## Modules

- `cfg_seed.argstruc_enumerator` — verb-locked NP V NP CFG; slot fills =
  `WordStore.subset(spec_expr) ∩ pmi_admit(verb, role, band)`.
- `editor.mlm_iterative_editor` — joint-mask + sampled trie-filtered fill
  + best-of-N trajectories over RoBERTa-large.
- `scorer.joint_mask_pll` — joint-masked pseudo-log-likelihood; shared
  MLM with the editor.

## Quick start

```python
from pathlib import Path

from phonolex_data.runtime.store import WordStore
from phonolex_governors.generation.trie import VocabTrie
from phonolex_generators.cfg_seed.argstruc_enumerator import enumerate_seeds
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
from phonolex_generators.editor.mlm_iterative_editor import edit
from phonolex_generators.shared.mlm_loader import get_mlm

store = WordStore.from_parquet(Path("data/runtime/words.parquet"))
store.attach_selectional(Path("data/runtime/selectional.parquet"))
model, tokenizer, _ = get_mlm()

all_words = [w.lower() for w in store.df["word"].to_list()]
trie = VocabTrie(all_words)
allowed = set(store.subset(SPEC_FILTERS["spec1"]).get_column("word").str.to_lowercase().to_list())
trie.tag(set(all_words) - allowed)

for seed in enumerate_seeds(store, "spec1", "cut", "fineweb_adult"):
    result = edit(seed, model=model, tokenizer=tokenizer, trie=trie)
    print(result.best, "  coh =", result.coherence_best)
```

## Tests

```bash
uv run --package phonolex-generators pytest packages/generators/tests/ -v          # unit (fast)
uv run --package phonolex-generators pytest packages/generators/tests/ -v -m slow  # + MLM-loading
uv run --package phonolex-generators pytest packages/generators/tests/ -v \
    -m "slow and acceptance"                                                      # PHON-64 v2 gold
```

## Dependencies

`phonolex_data` (WordStore + Parquet) · `phonolex_governors` (VocabTrie, re-tagged per request via `trie.tag(banned)`) · `transformers` · `torch` · `polars`.

## See

- spec: `docs/superpowers/specs/2026-05-07-phon-95-mlm-editor-cfg-enumerator-design.md`
- plan: `docs/superpowers/plans/2026-05-07-phon-95-mlm-editor-cfg-enumerator.md`

  • [ ] Step 2: Modify CLAUDE.md — Architecture and Project Structure sections

In CLAUDE.md under the Architecture section, add phonolex_generators to the list of in-house Python packages. The relevant existing block reads:

  • Governor engine: packages/governors/ — word-level checker (G2P, phonology), Reranker (penalty-only trie steering), PunctuationBoost, VocabTrie, TargetedRolloutProcessor. Package name: phonolex_governors.
  • Generation server: packages/generation/ — FastAPI + T5Gemma 9B-2B. Local dev via uvicorn, production via RunPod Serverless (scale-to-zero GPU). Cloudflare Worker proxies /api/generate-single to RunPod.

After the Governor engine line, insert:

  • Generators (C1 combinatorial): packages/generators/ — productionized PHON-92 stack (PHON-95), generalized to prefix-walking decode. CFG seed enumerator + MLM iterative editor (per-content-slot trie walk over a tagged phonolex_governors.VocabTrie) + joint-mask PLL coherence scorer. Package name: phonolex_generators. Depends on phonolex_data + phonolex_governors.

In the Project Structure tree, add a generators/ block between governors/ and web/:

│   ├── generators/                 # C1 combinatorial generation (phonolex_generators) — PHON-95
│   │   ├── src/phonolex_generators/
│   │   │   ├── cfg_seed/           # argstruc_enumerator + spec_filters
│   │   │   ├── editor/             # mlm_iterative_editor + trajectory + trie_filter
│   │   │   ├── scorer/             # joint_mask_pll
│   │   │   └── shared/             # mlm_loader + word_to_tokens
│   │   ├── tests/                  # unit + acceptance (slow markers)
│   │   └── pyproject.toml
│   │

Also update the Dev Setup section's editable-install line so phonolex_generators is added.

Change:

uv pip install -e packages/data -e packages/governors

to:

uv pip install -e packages/data -e packages/governors -e packages/generators

  • [ ] Step 3: Verify CLAUDE.md lints clean (no broken links / out-of-date references)

Run: git diff CLAUDE.md and visually scan for typos / missing newlines.
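
The visual scan can be backed by a quick existence check on the paths CLAUDE.md mentions. The following is a minimal sketch, not part of the plan: the helper name `missing_paths` and the regex (which only looks at references starting with `packages/` or `docs/`) are assumptions chosen for illustration.

```python
import re
from pathlib import Path

# Hypothetical helper, not part of the plan. Matches repo-relative references
# such as `packages/generators/` or `docs/superpowers/specs/foo.md`.
PATH_RE = re.compile(r"\b(?:packages|docs)/[\w./-]+")


def missing_paths(markdown_text: str, repo_root: Path) -> list[str]:
    """Return referenced paths that do not exist under repo_root, in first-mention order."""
    seen: dict[str, None] = {}
    for match in PATH_RE.finditer(markdown_text):
        # Drop trailing sentence punctuation that the regex may have swallowed.
        ref = match.group(0).rstrip(".,;:")
        seen.setdefault(ref, None)
    return [ref for ref in seen if not (repo_root / ref).exists()]
```

From the repo root, `missing_paths(Path("CLAUDE.md").read_text(encoding="utf-8"), Path("."))` should return an empty list once Steps 1 and 2 are done. A non-empty result is only a prompt to look closer, since the regex can pick up paths that are intentionally illustrative.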

  • [ ] Step 4: Commit

git add packages/generators/README.md CLAUDE.md
git commit -m "PHON-95 Task 11: docs — package README + CLAUDE.md update"

Task 12: Open OQ1–OQ6 Follow-Up Tickets (DRAFT — user approval before creating)

Files: none (Jira step)

The spec has 6 open implementation questions. Per user feedback (feedback_authorization_per_item), this task DRAFTS the ticket bodies in the plan document and surfaces them for explicit user authorization before creating any Jira issues. Do NOT call mcp__plugin_atlassian_atlassian__createJiraIssue until the user signs off.

  • [ ] Step 1: Verify free PHON-XX numbers

Run JQL via the Atlassian MCP:

project = PHON ORDER BY created DESC

Read off the highest existing key. Per user feedback (feedback_verify_jira_state), don't promise specific numbers ahead of time — list which 6 are next free, in order.
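
As a sketch of the "read off the highest key, list the next 6" step: given the issue keys returned by the JQL query, the proposed numbers can be derived mechanically. The function name `next_free_keys` is hypothetical, and the result is provisional because the tracker assigns keys at creation time.

```python
import re

KEY_RE = re.compile(r"^PHON-(\d+)$")


def next_free_keys(existing_keys: list[str], count: int = 6) -> list[str]:
    """Propose the `count` keys that follow the highest existing PHON key."""
    numbers = [int(m.group(1)) for key in existing_keys if (m := KEY_RE.match(key))]
    if not numbers:
        raise ValueError("no PHON-<n> keys found; re-run the JQL query")
    highest = max(numbers)
    # Provisional only: another ticket created in the meantime shifts these.
    return [f"PHON-{highest + offset}" for offset in range(1, count + 1)]
```

Because the numbers can shift between drafting and creation, the drafts should present them as proposals and re-check them in Step 4.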

  • [ ] Step 2: Draft ticket bodies (paste into the plan as a comment block; do NOT create yet)

For each OQ, the draft has shape:

Title: PHON-95 OQ<N>: <one-line summary>
Workstream: <pick one from the 10>
Body:
  Trigger: <the measurable failure mode that would activate this work>
  v1 default (current): <as-is>
  v2 candidate: <what changes>
  Predecessor: PHON-95
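
One way to keep the six drafts uniform is to hold them as structured records and render the template from them. This is a minimal sketch: the `OQDraft` class and its field names are hypothetical, chosen to mirror the template above, and the workstream value is left as a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OQDraft:
    """One draft follow-up ticket (hypothetical class mirroring the template)."""

    number: int
    summary: str
    workstream: str
    trigger: str
    v1_default: str
    v2_candidate: str
    predecessor: str = "PHON-95"

    def render(self) -> str:
        """Render the draft in the plan's ticket-body shape."""
        return "\n".join([
            f"Title: PHON-95 OQ{self.number}: {self.summary}",
            f"Workstream: {self.workstream}",
            "Body:",
            f"  Trigger: {self.trigger}",
            f"  v1 default (current): {self.v1_default}",
            f"  v2 candidate: {self.v2_candidate}",
            f"  Predecessor: {self.predecessor}",
        ])


# Example using the OQ1 wording from this plan; the workstream is still to be picked.
oq1 = OQDraft(
    number=1,
    summary="Continuous PMI biasing",
    workstream="<pick one from the 10>",
    trigger="boolean admit set yields a low-quality output and PPMI top-K differs",
    v1_default="Boolean PMI >= 0",
    v2_candidate="alpha * ppmi logit bias",
)
print(oq1.render())
```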

Six drafts:

  1. OQ1 — Continuous PMI biasing (default: Boolean PMI ≥ 0; v2: α·ppmi logit bias). Trigger: a seed where the Boolean admit set produces a low-quality output AND the PPMI-ranked top-K differs meaningfully from that admit set.
  2. OQ2 — Editor scaling to 10–15 tokens. Trigger: longer-sentence CFG productions land and best-of-N coherence stops improving over the seed.
  3. OQ3 — Subject-verb agreement / morphology. Trigger: any output flagged "morphologically wrong" by a clinician reviewer in the PHON-69 survey.
  4. OQ4 — Diversity at scale (best-of-8 → 10–20 unique). Trigger: a customer-facing batch task requires N > 4 distinct outputs.
  5. OQ5 — Coherence robustness (N=50–100 sanity). Trigger: a degenerate output ranks above a well-formed one in any acceptance run.
  6. OQ6 — Editor fine-tune on PhonoLex CDS. Trigger: band="childes_*" outputs are qualitatively bad on the regression seeds.

  • [ ] Step 3: Surface to user

Print the 6 drafts with proposed PHON-XX numbers and ask: "Approve creating these 6 OQ follow-up tickets in Jira? (yes/no/edit)"

  • [ ] Step 4: Create on user approval (only if "yes")

For each approved draft, call mcp__plugin_atlassian_atlassian__createJiraIssue with the body above. Do NOT batch-create without per-ticket confirmation if the user says "edit."
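
The approval gate in Steps 3 and 4 can be read as a small decision function. The sketch below is illustrative only: `tickets_to_create` and the `confirm_each` callback are hypothetical names, and the actual Jira call is deliberately left out.

```python
from typing import Callable


def tickets_to_create(
    answer: str,
    drafts: list[str],
    confirm_each: Callable[[str], bool],
) -> list[str]:
    """Return the drafts cleared for creation, given the user's yes/no/edit answer."""
    normalized = answer.strip().lower()
    if normalized == "yes":
        return list(drafts)
    if normalized == "edit":
        # "edit" never batch-creates: each draft needs its own confirmation.
        return [draft for draft in drafts if confirm_each(draft)]
    # "no" or anything unrecognized: create nothing.
    return []
```

Treating any unrecognized answer as "no" keeps the default on the safe side, which matches the rule that nothing is created without explicit sign-off.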

  • [ ] Step 5: No commit

This task creates Jira tickets, not code. No git operation required. The plan-doc trail is sufficient.


Self-Review

Spec coverage:

  • §Scope In: new package ✓ (T1), three modules ✓ (T3/T6/T8), per-request trie tagging on full vocab ✓ (T4/T6; see deviation note below), boolean PMI admit ✓ (T8), acceptance test ✓ (T9), reproducible run script ✓ (T10).
  • §Data contracts: WordStore.subset use ✓ (T7/T8), selectional.parquet PMI admit ✓ (T8), MLM weights singleton ✓ (T2), EditedSentence / Trajectory ✓ (T5).
  • §Architecture: three modules + shared MLM ✓ (T2/T3/T6/T8).
  • §Acceptance criteria: (1) 5-seed compliance regression ✓ (T9), (2) module unit tests ✓ (T6/T7/T8/T3/T4), (3) 16-seed ≤ 30s ✓ (T9 perf gate), (4) cake/thunder PMI ✓ (T8), (5) phonolex_data no regressions ✓ (T9 step 3).
  • §Open implementation questions: OQ1–OQ6 fan-out ✓ (T12, gated on user approval).
  • §Plan handoff: items 1–8 → tasks T1, T6, T8, T3, T9, T10, T11, T12; all covered.

Placeholder scan: none. Every step contains exact paths and complete code.

Type consistency: Seed(sentence, content_word_indices: tuple[int,...], locked_word_indices: tuple[int,...], spec_id, note) consistent across T5/T6/T8/T9. EditedSentence.unique_outputs: list[str] — note this is a list (not set) to preserve insertion order; T6 builds it via dict[str, None] to dedupe while preserving order, T9 does not assert ordering. VocabTrie (from phonolex_governors) used as the trie type in T4/T6/T9/T10/T11; per-request retag via trie.tag(banned).
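
The order-preserving dedupe mentioned for `unique_outputs` relies on Python dicts keeping insertion order (guaranteed since Python 3.7). A minimal sketch follows; the function name `unique_in_order` is hypothetical, and T6's actual code may be shaped differently.

```python
def unique_in_order(outputs: list[str]) -> list[str]:
    """Drop duplicate sentences while keeping first-seen order."""
    # dict keys are unique and iterate in insertion order, so this
    # dedupes without the reordering a set could introduce.
    seen: dict[str, None] = dict.fromkeys(outputs)
    return list(seen)
```

A plain `set` would also dedupe, but its iteration order is not tied to trajectory order, which is why the plan specifies a list.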

Deliberate deviations from the spec:

  1. Spec §Scope says "Per-request small dict-trie (~500–2K words) ... Distinct from v6's static 126K-word marisa-trie (which stays in phonolex_governors and is not a dependency here)." The plan reuses phonolex_governors.VocabTrie (full-vocab marisa-trie + per-request tag(banned)) instead of building a parallel small dict-trie. Justification: we already own that infrastructure and it's the canonical representation; building a parallel small trie would be over-engineering. phonolex_generators therefore depends on phonolex_governors.

  2. Spec §Module 2 says "Lift verbatim from probe_sampled_iterative.py. ... joint-mask all content positions, forward through MLM, intersect top-K logits with the per-request word trie, sample ... at temperature=0.7 from top-10 of the trie-filtered top-50." The plan generalizes this from a single-position complete-word filter (probe) to per-content-slot prefix-walking decode that walks the trie left-to-right across each slot's mask positions, gating per-position admits by dead_end_ratio < 1.0, and greedy-stopping once the accumulated prefix is a complete word (is_end) and is not banned. Justification: this is what the trie was designed for (mirrors phonolex_governors.Reranker._steer_sequence); the probe's filter was a single-step special case that couldn't reach multi-token compliant words. The probe's gold output (sampled_locked_dedup_output.txt) becomes a smoke baseline rather than a byte-equality target.

Both deviations were authorized in the planning conversation on 2026-05-07; the spec text should be updated in a follow-up edit.
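
The prefix-walking decode from deviation 2 can be illustrated with a toy stand-in. This is a sketch under loud assumptions: `ToyVocab` is not the `VocabTrie` API (it uses plain sets instead of a marisa-trie), `can_complete` plays the role of the `dead_end_ratio < 1.0` gate, and the per-position candidate lists stand in for the MLM's top-K fragments and logits.

```python
from __future__ import annotations

import math
import random


class ToyVocab:
    """Toy stand-in for a tagged trie: tracks allowed words and their live prefixes."""

    def __init__(self, all_words: set[str], banned: set[str]) -> None:
        self.allowed = all_words - banned
        # Every prefix (including the full word) of an allowed word is "live".
        self.live_prefixes = {w[:i] for w in self.allowed for i in range(1, len(w) + 1)}

    def can_complete(self, prefix: str) -> bool:
        """True if at least one allowed word starts with prefix (not a dead end)."""
        return prefix in self.live_prefixes

    def is_allowed_word(self, prefix: str) -> bool:
        """True if prefix is itself a complete, non-banned word."""
        return prefix in self.allowed


def walk_slot(
    candidates: list[list[tuple[str, float]]],
    vocab: ToyVocab,
    rng: random.Random,
    temperature: float = 0.7,
) -> str | None:
    """Fill one content slot by walking its mask positions left to right."""
    prefix = ""
    for position_candidates in candidates:
        # Keep only fragments that leave a compliant completion reachable.
        admitted = [
            (frag, logit)
            for frag, logit in position_candidates
            if vocab.can_complete(prefix + frag)
        ]
        if not admitted:
            return None  # dead end: no compliant word reachable from here
        # Temperature sampling over the admitted fragments.
        weights = [math.exp(logit / temperature) for _, logit in admitted]
        prefix += rng.choices([frag for frag, _ in admitted], weights=weights, k=1)[0]
        if vocab.is_allowed_word(prefix):
            return prefix  # greedy stop at the first complete, non-banned word
    return None


vocab = ToyVocab({"court", "courtroom", "cat", "cart"}, banned={"court", "cat"})
slot = [[("court", 2.0), ("cat", 1.5), ("dog", 1.0)], [("room", 1.0), ("s", 3.0)]]
print(walk_slot(slot, vocab, random.Random(0)))
```

In this toy run "court" is banned as a word but stays live as a prefix of "courtroom", so the walk continues to the second position. That is the multi-token case the probe's single-position filter could not reach.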


Execution Handoff

Plan complete and saved to docs/superpowers/plans/2026-05-07-phon-95-mlm-editor-cfg-enumerator.md. Two execution options:

1. Subagent-Driven (recommended) — dispatch a fresh subagent per task, review between tasks, fast iteration. Best fit for this plan because Tasks 2–10 each have a single concrete deliverable + tests.

2. Inline Execution — execute tasks in this session using executing-plans, batch execution with checkpoints.

Which approach?