PHON-95 — MLM Iterative Editor + Argstruc CFG Enumerator Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Productionize the PHON-92 validated stack (verb-locked CFG seed + RoBERTa-large joint-mask iterative editor + joint-mask PLL coherence scorer) as a callable Python API in a new packages/generators/ workspace package.
Architecture: Three sibling modules in phonolex_generators — cfg_seed.argstruc_enumerator (CFG → seed sentences), editor.mlm_iterative_editor (best-of-N joint-mask iterative refinement via prefix-walking decode), scorer.joint_mask_pll (coherence ranking). The editor and scorer share a single MLM instance via a singleton loader. The editor uses phonolex_governors.generation.trie.VocabTrie (marisa-trie, character-level) — built once at editor init from the full store.df["word"] vocab (~125K), then re-tagged per request with banned = all_words - allowed where allowed = store.subset(SPEC_FILTERS[spec_id])["word"].
Prefix-walking decode (generalization of the PHON-92 probe). For each content slot covering mask positions p_0..p_n, the editor walks the trie left-to-right. At each position, the top-K candidates are filtered to those whose decoded fragment, appended to the accumulated prefix, has dead_end_ratio < 1.0 (i.e., at least one compliant completion is reachable from the extended prefix). The editor samples one candidate with temperature, appends it to the prefix, and repeats. It greedy-stops as soon as walk_to(accumulated).is_end and not is_banned_word(accumulated); the accumulated string is the fill word. This mirrors phonolex_governors.generation.reranker.Reranker._steer_sequence (which walks the trie token-by-token at causal-LM generation time); the MLM editor applies the same prefix-walking pattern across the mask positions of a content slot. The probe's approach (filter at the first position only, admit a token only if it decodes to a complete word) is the single-step special case. Prefix walking additionally reaches multi-token compliant completions that the probe could not, e.g., a 2-BPE word like "courtroom" filling a two-position slot as word-start " court" + continuation "room".
PMI gating for CFG slot fills derives from selectional.parquet. Deps: phonolex_data, phonolex_governors, transformers/torch.
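The prefix-walking decode described above can be sketched on a toy vocabulary. This is a minimal illustration, not the VocabTrie implementation: `VOCAB`, `BANNED`, `dead_end_ratio`, `is_compliant_word`, and `fill_slot` are hypothetical stand-ins, the dead-end ratio is assumed to be the banned fraction of words reachable from a prefix, and uniform sampling stands in for temperature sampling over MLM logits.

```python
import random

# Toy stand-in for the tagged trie: a word list plus a banned subset.
# (Hypothetical data; the real VocabTrie API is not reproduced here.)
VOCAB = {"court", "courtroom", "courtship", "cat", "cab"}
BANNED = {"court", "courtship", "cab"}


def dead_end_ratio(prefix: str) -> float:
    """Assumed definition: banned fraction of words reachable from `prefix`.

    Returns 1.0 when nothing is reachable, so unknown prefixes are rejected.
    """
    reachable = [w for w in VOCAB if w.startswith(prefix)]
    if not reachable:
        return 1.0
    return sum(w in BANNED for w in reachable) / len(reachable)


def is_compliant_word(text: str) -> bool:
    """A complete vocabulary word that is not banned."""
    return text in VOCAB and text not in BANNED


def fill_slot(candidates_per_position, rng):
    """Walk mask positions left to right, extending the prefix one fragment at a time."""
    accumulated = ""
    for candidates in candidates_per_position:
        # Greedy stop: the prefix is already a complete, compliant word.
        if accumulated and is_compliant_word(accumulated):
            return accumulated
        # Admit only fragments that keep a compliant completion reachable.
        admitted = [f for f in candidates if dead_end_ratio(accumulated + f) < 1.0]
        if not admitted:
            return None
        accumulated += rng.choice(admitted)  # uniform stand-in for temperature sampling
    return accumulated if is_compliant_word(accumulated) else None


# "court" is banned as a word but admitted as a prefix, because "courtroom"
# is still reachable; "ship" and "yard" are rejected at the second position.
result = fill_slot([["court", "dog"], ["room", "ship", "yard"]], random.Random(0))
```

Admitting a banned word as a prefix is what separates this gate from a complete-word filter: a filter applied only at the first position would have discarded "court" outright.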
Tech Stack: Python 3.10+ · phonolex_data (WordStore + Parquet) · transformers (RoBERTa-large MLM) · torch (MPS/CPU) · polars (filtering) · pytest (TDD)
Spec: docs/superpowers/specs/2026-05-07-phon-95-mlm-editor-cfg-enumerator-design.md
Working branch: feature/phon-95-impl off release/v5.2.0. Commits target release/v5.2.0 via PR.
Source artifacts (read-only):
- Probe (verbatim lift target): research/phon-92-selectional-preference-spike @ packages/generation/research/2026-05-05-phon-92-selectional-preference/diffusion-editor-probe/probe_sampled_iterative.py
- Gold output: same branch @ commit 5cae898, file sampled_locked_dedup_output.txt
Pre-flight
Before starting Task 1, set up the working branch:
git fetch origin
git checkout release/v5.2.0
git pull --ff-only origin release/v5.2.0
git checkout -b feature/phon-95-impl
Verify the editable installs are current:
uv pip install -e packages/data
uv run python -c "from phonolex_data.runtime.store import WordStore; print('OK')"
Expected: OK.
Task 1: Workspace Bootstrap
Files:
- Create: packages/generators/pyproject.toml
- Create: packages/generators/src/phonolex_generators/__init__.py
- Create: packages/generators/src/phonolex_generators/cfg_seed/__init__.py
- Create: packages/generators/src/phonolex_generators/editor/__init__.py
- Create: packages/generators/src/phonolex_generators/scorer/__init__.py
- Create: packages/generators/src/phonolex_generators/shared/__init__.py
- Create: packages/generators/tests/__init__.py
- Create: packages/generators/tests/conftest.py
- Modify: pyproject.toml (workspace registration)
- [ ] Step 1: Create packages/generators/pyproject.toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "phonolex-generators"
version = "0.1.0"
description = "Combinatorial generation track (C1): CFG seed + MLM iterative editor + joint-mask PLL scorer"
license = "LicenseRef-Proprietary"
requires-python = ">=3.10"
dependencies = [
"torch>=2.0",
"transformers>=4.38",
"phonolex-data",
"phonolex-governors",
"polars>=1.0",
]
[project.optional-dependencies]
dev = [
"pytest>=7.0",
"ruff>=0.4",
]
[tool.hatch.build.targets.wheel]
packages = ["src/phonolex_generators"]
[tool.ruff]
target-version = "py310"
line-length = 100
[tool.pytest.ini_options]
testpaths = ["tests"]
markers = [
"slow: requires RoBERTa-large weights (~1.4GB; deselect with '-m \"not slow\"')",
"acceptance: PHON-64 v2 regression gold test (slow)",
]
- [ ] Step 2: Create the package skeleton (six __init__.py files, five of them empty, plus one conftest.py)
packages/generators/src/phonolex_generators/__init__.py:
"""C1 combinatorial generation track — CFG seeds + MLM iterative editor + PLL scorer."""
__version__ = "0.1.0"
packages/generators/src/phonolex_generators/cfg_seed/__init__.py,
packages/generators/src/phonolex_generators/editor/__init__.py,
packages/generators/src/phonolex_generators/scorer/__init__.py,
packages/generators/src/phonolex_generators/shared/__init__.py: each one empty.
packages/generators/tests/__init__.py: empty.
packages/generators/tests/conftest.py:
"""Shared test fixtures.
The `slow` marker gates RoBERTa-large-loading tests behind explicit
opt-in; default `pytest -v` runs only the unit tests.
"""
import pytest

def pytest_collection_modifyitems(config, items):
    # No -m expression given: skip slow tests unless the user opts in.
    if config.getoption("-m") == "":
        skip_slow = pytest.mark.skip(reason="needs '-m slow' to run model-loading tests")
        for item in items:
            if "slow" in item.keywords:
                item.add_marker(skip_slow)
- [ ] Step 3: Register in workspace (pyproject.toml at repo root)
Add "packages/generators" to [tool.uv.workspace].members and phonolex-generators = { workspace = true } to [tool.uv.sources]. After edits the file should read:
[tool.uv.workspace]
members = [
"packages/data",
"packages/governors",
"packages/generation",
"packages/features",
"packages/generators",
]
[tool.uv.sources]
phonolex-data = { workspace = true }
phonolex-governors = { workspace = true }
phonolex-generation = { workspace = true }
phonolex-features = { workspace = true }
phonolex-generators = { workspace = true }
- [ ] Step 4: Editable install + import smoke test
Run:
uv pip install -e packages/generators
uv run python -c "import phonolex_generators; print(phonolex_generators.__version__)"
Expected: 0.1.0.
- [ ] Step 5: Run the empty test directory
Run: uv run --package phonolex-generators pytest packages/generators/tests/ -v
Expected: no tests ran in 0.0Xs. (Confirms collection works. pytest exits with code 5 when no tests are collected, so a non-zero exit status is expected here.)
- [ ] Step 6: Commit
git add packages/generators/ pyproject.toml
git commit -m "PHON-95 Task 1: bootstrap phonolex_generators workspace package"
Task 2: Shared MLM Loader + word_to_token_positions Helper
Files:
- Create: packages/generators/src/phonolex_generators/shared/mlm_loader.py
- Create: packages/generators/src/phonolex_generators/shared/word_to_tokens.py
- Test: packages/generators/tests/test_shared.py
The MLM is a singleton: get_mlm() loads RoBERTa-large once, and subsequent calls return the cached (model, tokenizer, device) triple. The helper maps word indices in a space-split sentence to RoBERTa BPE token positions via offset_mapping, which requires a fast tokenizer.
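The word-index to token-position mapping can be illustrated without loading a tokenizer. The sketch below uses hand-written offsets in the shape a fast tokenizer's offset_mapping returns; the two-piece split of "coleslaw" is an assumption made for illustration, and `word_char_spans` and `tokens_inside` are hypothetical helper names.

```python
def word_char_spans(sentence: str) -> list[tuple[int, int]]:
    """Character (start, end) span of each space-split word."""
    spans, cursor = [], 0
    for word in sentence.split():
        start = sentence.find(word, cursor)
        spans.append((start, start + len(word)))
        cursor = start + len(word)
    return spans


def tokens_inside(span: tuple[int, int], offsets: list[tuple[int, int]]) -> list[int]:
    """Token positions whose character offsets fall entirely inside `span`."""
    w_start, w_end = span
    return [
        i
        for i, (s, e) in enumerate(offsets)
        if (s, e) != (0, 0) and s >= w_start and e <= w_end  # (0, 0) marks special tokens
    ]


sentence = "the coleslaw"
# Assumed tokenization: <s>, "the", "cole", "slaw", </s>.
offsets = [(0, 0), (0, 3), (4, 8), (8, 12), (0, 0)]

spans = word_char_spans(sentence)
first_word = tokens_inside(spans[0], offsets)   # one token
second_word = tokens_inside(spans[1], offsets)  # two tokens: a multi-position slot
```

A word that spans several token positions is exactly the case where the editor has to fill more than one mask for a single content slot.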
- [ ] Step 1: Write the failing test for word_to_token_positions
packages/generators/tests/test_shared.py:
"""Shared helpers — token-position math test (no MLM weights needed)."""
from transformers import AutoTokenizer

from phonolex_generators.shared.word_to_tokens import word_to_token_positions

def test_word_to_token_positions_known_sentence():
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    sentence = "the cat chased the ball"
    # word indices 1 and 4 = 'cat' and 'ball'
    out = word_to_token_positions(tokenizer, sentence, [1, 4])
    assert set(out.keys()) == {1, 4}
    assert all(len(v) >= 1 for v in out.values())
    # The token at out[1][0] decodes to something that includes 'cat'
    enc = tokenizer(sentence)
    cat_token_text = tokenizer.decode([enc["input_ids"][out[1][0]]]).strip().lower()
    assert "cat" in cat_token_text

def test_word_to_token_positions_out_of_range_indices_skipped():
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    sentence = "the cat chased the ball"
    out = word_to_token_positions(tokenizer, sentence, [99])
    assert out == {}
- [ ] Step 2: Run the test (expect FAIL)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_shared.py -v
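Expected: ModuleNotFoundError: No module named 'phonolex_generators.shared.word_to_tokens' (a subclass of ImportError, reported as a collection error because the module does not exist yet).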
- [ ] Step 3: Implement word_to_token_positions
packages/generators/src/phonolex_generators/shared/word_to_tokens.py:
"""Map space-split word indices to RoBERTa BPE token-position lists.
Lifted from PHON-92 probe (`probe_sampled_iterative.py::word_to_token_positions`).
"""
from __future__ import annotations

def word_to_token_positions(
    tokenizer, sentence: str, word_indices: list[int]
) -> dict[int, list[int]]:
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc["offset_mapping"][0].tolist()
    words = sentence.split()
    # Character (start, end) span of each space-split word.
    char_starts: list[tuple[int, int]] = []
    cursor = 0
    for w in words:
        idx = sentence.find(w, cursor)
        char_starts.append((idx, idx + len(w)))
        cursor = idx + len(w)
    out: dict[int, list[int]] = {}
    for wi in word_indices:
        if wi >= len(char_starts):
            continue
        wstart, wend = char_starts[wi]
        # Keep tokens fully inside the word span; (0, 0) offsets are special tokens.
        positions = [
            ti
            for ti, (s, e) in enumerate(offsets)
            if (s, e) != (0, 0) and s >= wstart and e <= wend
        ]
        out[wi] = positions
    return out
- [ ] Step 4: Run the test (expect PASS)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_shared.py -v
Expected: 2 passed.
- [ ] Step 5: Implement get_mlm singleton
packages/generators/src/phonolex_generators/shared/mlm_loader.py:
"""Singleton MLM loader.
The editor and scorer share a single (model, tokenizer, device) triple to
avoid double-loading 1.4GB of RoBERTa-large weights.
"""
from __future__ import annotations

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

DEFAULT_MODEL_ID = "roberta-large"
_cache: dict[str, tuple] = {}

def _resolve_device() -> str:
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

def get_mlm(model_id: str = DEFAULT_MODEL_ID):
    """Load (or return cached) (model, tokenizer, device) for the given MLM."""
    if model_id in _cache:
        return _cache[model_id]
    device = _resolve_device()
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id).to(device).eval()
    _cache[model_id] = (model, tokenizer, device)
    return _cache[model_id]

def reset_mlm_cache() -> None:
    """Test hook: clear the singleton (next get_mlm call reloads weights)."""
    _cache.clear()
- [ ] Step 6: Add a slow-marked sanity test for get_mlm (extends test_shared.py)
Append to packages/generators/tests/test_shared.py:
import pytest

from phonolex_generators.shared.mlm_loader import get_mlm, reset_mlm_cache

@pytest.mark.slow
def test_get_mlm_returns_same_triple_twice():
    reset_mlm_cache()
    m1, t1, d1 = get_mlm()
    m2, t2, d2 = get_mlm()
    assert m1 is m2
    assert t1 is t2
    assert d1 == d2
    assert d1 in {"mps", "cpu"}
- [ ] Step 7: Run slow test once to verify
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_shared.py -v -m slow
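Expected: 1 passed, 2 deselected (the unmarked tests do not match -m slow, so pytest deselects them rather than skipping them). May take ~10s or longer on first run while the model (~1.4GB) downloads and caches.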
- [ ] Step 8: Commit
git add packages/generators/src/phonolex_generators/shared packages/generators/tests/test_shared.py
git commit -m "PHON-95 Task 2: shared MLM loader + word_to_token_positions helper"
Task 3: Scorer — joint_mask_pll
Files:
- Create: packages/generators/src/phonolex_generators/scorer/joint_mask_pll.py
- Modify: packages/generators/src/phonolex_generators/scorer/__init__.py (export joint_masked_coherence)
- Test: packages/generators/tests/test_joint_mask_pll.py
The scorer is implemented before the editor because the editor depends on it for trajectory ranking and best-of-N selection. Joint-mask PLL = mask all content positions, forward through MLM, sum the log-probability of the actual tokens at masked positions. Higher = more coherent.
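The arithmetic of joint-mask PLL can be shown on toy logits, with no model involved. In this sketch the three-token vocabulary and the logit values are made up, and `log_softmax` and `joint_mask_score` are hypothetical names; the logits represent what the MLM would emit for one shared masked context.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over a single position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def joint_mask_score(logits_per_position, actual_ids, mask_positions) -> float:
    """Sum log P(actual token) over the masked positions; NaN if nothing is masked."""
    if not mask_positions:
        return float("nan")
    return sum(
        log_softmax(logits_per_position[p])[actual_ids[p]] for p in mask_positions
    )


# Made-up logits for a 3-position sentence over a 3-token vocabulary.
# Positions 1 and 2 are masked; position 0 is unmasked context.
logits = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # the context favours token 0 at position 1
    [0.0, 3.0, 0.0],  # the context favours token 1 at position 2
]
fill_a = [0, 0, 1]  # agrees with what the context predicts
fill_b = [0, 1, 0]  # disagrees at both masked positions

score_a = joint_mask_score(logits, fill_a, [1, 2])
score_b = joint_mask_score(logits, fill_b, [1, 2])
```

Because every content position is masked at once, two candidate fills of the same skeleton are scored against the same logits, so their scores are directly comparable.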
- [ ] Step 1: Write the failing sanity test (the spec's headline acceptance gate)
packages/generators/tests/test_joint_mask_pll.py:
"""joint_mask_pll — coherence ranks well-formed > repeated-content.
This is the canonical PHON-92 headline sanity test, verified empirically
by `probe_pll_sanity.py` on the spike branch (RoBERTa-large @ MPS):
joint-mask wf=-29.57 rp=-30.02 PASS
Note: comparing well-formed against `"the the the the the"` does NOT pass
under joint-mask scoring — the all-function-word skeleton is highly
predictable. The canonical degenerate is repeated CONTENT word ("the cat
cat the cat"), which matches the editor's actual use case (ranking
candidate fills of a shared masked context).
"""
import math

import pytest

from phonolex_generators.scorer.joint_mask_pll import joint_masked_coherence
from phonolex_generators.shared.mlm_loader import get_mlm

@pytest.mark.slow
def test_well_formed_outranks_degenerate():
    model, tokenizer, _ = get_mlm()
    # Content words at indices 1, 2, 4: masking 'cat chased ... ball' vs 'cat cat ... cat'
    well = joint_masked_coherence(model, tokenizer, "the cat chased the ball", [1, 2, 4])
    bad = joint_masked_coherence(model, tokenizer, "the cat cat the cat", [1, 2, 4])
    assert well > bad

@pytest.mark.slow
def test_no_content_indices_returns_nan():
    model, tokenizer, _ = get_mlm()
    nan = joint_masked_coherence(model, tokenizer, "the cat", [])
    assert math.isnan(nan)
- [ ] Step 2: Run test (expect FAIL)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_joint_mask_pll.py -v -m slow
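Expected: ModuleNotFoundError: No module named 'phonolex_generators.scorer.joint_mask_pll' (a subclass of ImportError, reported as a collection error because the module does not exist yet).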
- [ ] Step 3: Implement joint_masked_coherence
packages/generators/src/phonolex_generators/scorer/joint_mask_pll.py:
"""Joint-masked pseudo-log-likelihood coherence scorer.
Mask all content positions simultaneously, forward through the MLM, sum the
log-probability of the actual tokens at masked positions. Higher = more
coherent. Lifted from PHON-92 probe (`probe_sampled_iterative.py::joint_masked_coherence`).
"""
from __future__ import annotations

import torch

from phonolex_generators.shared.word_to_tokens import word_to_token_positions

def joint_masked_coherence(
    model,
    tokenizer,
    sentence: str,
    content_word_indices: list[int],
) -> float:
    """Sum log P(actual_token | masked_context) over all content-word token positions."""
    device = next(model.parameters()).device
    enc = tokenizer(sentence, return_tensors="pt").to(device)
    input_ids = enc["input_ids"]
    word_positions = word_to_token_positions(tokenizer, sentence, content_word_indices)
    mask_positions = [p for ps in word_positions.values() for p in ps]
    if not mask_positions:
        return float("nan")
    # Mask every content position at once, then run a single forward pass.
    masked = input_ids.clone()
    for ti in mask_positions:
        masked[0, ti] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(masked).logits
    total = 0.0
    for ti in mask_positions:
        actual = input_ids[0, ti].item()
        log_probs = torch.log_softmax(logits[0, ti], dim=-1)
        total += log_probs[actual].item()
    return total
- [ ] Step 4: Export from package
packages/generators/src/phonolex_generators/scorer/__init__.py:
from phonolex_generators.scorer.joint_mask_pll import joint_masked_coherence
__all__ = ["joint_masked_coherence"]
- [ ] Step 5: Run test (expect PASS)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_joint_mask_pll.py -v -m slow
Expected: 2 passed.
- [ ] Step 6: Commit
git add packages/generators/src/phonolex_generators/scorer packages/generators/tests/test_joint_mask_pll.py
git commit -m "PHON-95 Task 3: scorer.joint_mask_pll — coherence via masked PLL"
Task 4: Editor — Per-Position Prefix-Walking Admit Helper
Files:
- Create: packages/generators/src/phonolex_generators/editor/trie_filter.py
- Test: packages/generators/tests/test_trie_filter.py
The editor's per-position helper takes the FULL tagged VocabTrie plus the prefix accumulated so far in the current content slot, and returns top-K admissible (token_id, fragment, raw_logit) tuples. Admit criterion: the decoded fragment, concatenated to the accumulated prefix, has dead_end_ratio < 1.0 in the tagged trie (i.e., at least one compliant completion reachable). At the first position of a slot (is_first=True), only word-start tokens (RoBERTa: leading-space prefix) are admitted; at subsequent positions, only continuation tokens (no leading space). This mirrors phonolex_governors.Reranker._steer_sequence flipped for MLM (forward-walking through mask positions instead of backward through emitted tokens).
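The admit criterion above combines two checks, which the sketch below isolates on plain strings. It is a simplification under stated assumptions: `compliant_prefixes` and `admit` are hypothetical names, and the set of prefixes of allowed words stands in for the `dead_end_ratio < 1.0` test (assuming the ratio reaches 1.0 exactly when no compliant completion is reachable).

```python
def compliant_prefixes(allowed: set[str]) -> set[str]:
    """Every non-empty prefix of an allowed word.

    Membership here plays the role of `dead_end_ratio(prefix) < 1.0`.
    """
    return {w[:i] for w in allowed for i in range(1, len(w) + 1)}


def admit(decoded: str, accumulated: str, is_first: bool, prefixes: set[str]):
    """Return the lowercased fragment if the decoded token is admissible, else None."""
    starts_word = decoded.startswith(" ")
    # First position needs a word-start token; later positions need a continuation.
    if is_first != starts_word:
        return None
    fragment = decoded.strip().lower()
    if not fragment.isalpha():
        return None
    # The extended prefix must still lead to at least one compliant word.
    if accumulated + fragment not in prefixes:
        return None
    return fragment


prefixes = compliant_prefixes({"cat", "coleslaw"})

word_start = admit(" cat", "", True, prefixes)             # admitted at p_0
bare_at_first = admit("cat", "", True, prefixes)           # continuation at p_0: rejected
continuation = admit("slaw", "cole", False, prefixes)      # extends "cole" to "coleslaw"
restart_mid_slot = admit(" slaw", "cole", False, prefixes) # word-start at p_1: rejected
```

The word-start versus continuation check keeps a multi-position slot from being filled with two separate words, which would change the sentence's word count and shift the indices of the locked words.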
- [ ] Step 1: Write failing tests
packages/generators/tests/test_trie_filter.py:
"""topk_admit_at_position — per-mask-position prefix-walking admit filter."""
import torch
from phonolex_governors.generation.trie import VocabTrie

from phonolex_generators.editor.trie_filter import topk_admit_at_position

# topk_admit_at_position over-fetches k * 4 candidates, so the fake logits must
# cover at least 16 token ids for the k=4 tests.
VOCAB_SIZE = 16

def _trie(words: list[str], banned: set[str] | None = None) -> VocabTrie:
    t = VocabTrie(words)
    t.tag(banned or set())
    return t

class FakeTokenizer:
    """Decoded text matches RoBERTa shape: word-start tokens have leading space."""
    def __init__(self, vocab: dict[int, str]):
        self._vocab = vocab
    def decode(self, ids):
        # Ids outside the fake vocab decode to "" and are rejected by the filter.
        return "".join(self._vocab.get(i, "") for i in ids)

def test_first_position_admits_word_starts_only():
    # vocab includes a word-start " cat" (leading space) and a continuation "cat" (no space).
    tok = FakeTokenizer({0: " cat", 1: "cat", 2: " dog"})
    trie = _trie(["cat", "dog"], banned=set())
    logits = torch.zeros((1, 1, VOCAB_SIZE))
    logits[0, 0, 0] = 5.0
    logits[0, 0, 1] = 4.0  # continuation "cat": should be rejected at p_0
    logits[0, 0, 2] = 3.0
    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="", is_first=True, k=4,
    )
    fragments = [f for _, f, _ in out]
    assert "cat" in fragments  # from token 0 (word-start)
    assert "dog" in fragments  # from token 2 (word-start)
    assert fragments.count("cat") == 1  # token 1 (continuation) rejected at p_0

def test_continuation_position_rejects_word_starts():
    tok = FakeTokenizer({0: "slaw", 1: " slaw"})
    trie = _trie(["coleslaw"], banned=set())
    logits = torch.zeros((1, 1, VOCAB_SIZE))
    logits[0, 0, 0] = 5.0  # "slaw" continuation: admit (extends "cole" to "coleslaw")
    logits[0, 0, 1] = 4.0  # " slaw" word-start: reject at p_1
    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="cole", is_first=False, k=4,
    )
    fragments = [f for _, f, _ in out]
    assert fragments == ["slaw"]

def test_dead_end_ratio_gate_admits_partially_banned_prefix():
    # Trie has "cat" (allowed) and "cab" (banned); prefix "c" → ratio = 1/2 = 0.5 < 1.0 → admit.
    tok = FakeTokenizer({0: " c", 1: " x"})
    trie = _trie(["cat", "cab"], banned={"cab"})
    logits = torch.zeros((1, 1, VOCAB_SIZE))
    logits[0, 0, 0] = 5.0  # " c" → ratio 0.5 → admit
    logits[0, 0, 1] = 4.0  # " x" → not in trie → ratio 1.0 → reject
    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="", is_first=True, k=4,
    )
    fragments = [f for _, f, _ in out]
    assert fragments == ["c"]

def test_dead_end_ratio_gate_rejects_when_all_completions_banned():
    # All words at "c" prefix are banned → ratio 1.0 → reject.
    tok = FakeTokenizer({0: " c"})
    trie = _trie(["cat", "cab"], banned={"cat", "cab"})
    logits = torch.zeros((1, 1, VOCAB_SIZE))
    logits[0, 0, 0] = 5.0
    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="", is_first=True, k=4,
    )
    assert out == []

def test_non_alpha_fragments_rejected():
    tok = FakeTokenizer({0: " 123", 1: " cat", 2: " ,"})
    trie = _trie(["cat"], banned=set())
    logits = torch.zeros((1, 1, VOCAB_SIZE))
    logits[0, 0, 0] = 5.0
    logits[0, 0, 1] = 4.0
    logits[0, 0, 2] = 3.0
    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="", is_first=True, k=4,
    )
    assert [f for _, f, _ in out] == ["cat"]

def test_k_cap_respected():
    tok = FakeTokenizer({0: " cat", 1: " dog", 2: " ball"})
    trie = _trie(["cat", "dog", "ball"], banned=set())
    logits = torch.zeros((1, 1, VOCAB_SIZE))
    logits[0, 0, 0] = 5.0
    logits[0, 0, 1] = 4.0
    logits[0, 0, 2] = 3.0
    out = topk_admit_at_position(
        logits, position=0, tokenizer=tok, trie=trie,
        accumulated="", is_first=True, k=2,
    )
    assert len(out) == 2
- [ ] Step 2: Run tests (expect FAIL)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_trie_filter.py -v
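Expected: ModuleNotFoundError: No module named 'phonolex_generators.editor.trie_filter' (a subclass of ImportError, reported as a collection error because the module does not exist yet).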
- [ ] Step 3: Implement topk_admit_at_position
packages/generators/src/phonolex_generators/editor/trie_filter.py:
"""Per-position prefix-walking admit helper for the MLM editor.
Generalizes the PHON-92 probe's `topk_in_trie_with_logits` from a single-
position complete-word filter to a per-position prefix walker. Mirrors the
`phonolex_governors.Reranker._steer_sequence` pattern (causal-LM, walks
backward through emitted tokens) flipped for MLM (forward-walks across
the mask positions of a content slot).
Admit criterion at each position:
- is_first=True → token must be a word-start (RoBERTa leading space)
- is_first=False → token must be a continuation (no leading space)
- extended = (accumulated + fragment).lower() must satisfy
`trie.dead_end_ratio(extended) < 1.0`, i.e. at least one
compliant completion is reachable from this prefix.
"""
from __future__ import annotations

import torch
from phonolex_governors.generation.trie import VocabTrie

def topk_admit_at_position(
    logits: torch.Tensor,
    position: int,
    tokenizer,
    trie: VocabTrie,
    accumulated: str,
    is_first: bool,
    k: int,
) -> list[tuple[int, str, float]]:
    """Return up to k admit candidates as (token_id, fragment_lower, raw_logit).
    `accumulated` is the lowercased prefix built from previous positions in
    this slot (empty string at p_0). `fragment_lower` is the candidate's
    contribution to the prefix at THIS position (without leading space).
    """
    raw_logits = logits[0, position]
    # Over-fetch 4x so filtering can still leave k admits; clamp to the vocab
    # size because torch.topk raises if asked for more entries than exist.
    n_fetch = min(k * 4, raw_logits.shape[-1])
    top_logits, top_ids = torch.topk(raw_logits, n_fetch)
    out: list[tuple[int, str, float]] = []
    for tid, lg in zip(top_ids.tolist(), top_logits.tolist()):
        text = tokenizer.decode([tid])
        starts_with_space = text.startswith(" ") or text.startswith("▁")
        if is_first:
            if not starts_with_space:
                continue
            fragment = text.lstrip().lstrip("▁").lower()
        else:
            if starts_with_space:
                continue
            fragment = text.lower()
        if not fragment or not fragment.isalpha():
            continue
        extended = accumulated + fragment
        if trie.dead_end_ratio(extended) >= 1.0:
            continue
        out.append((tid, fragment, lg))
        if len(out) >= k:
            break
    return out
- [ ] Step 4: Run tests (expect PASS)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_trie_filter.py -v
Expected: 6 passed.
- [ ] Step 5: Commit
git add packages/generators/src/phonolex_generators/editor/trie_filter.py packages/generators/tests/test_trie_filter.py
git commit -m "PHON-95 Task 4: editor.trie_filter — per-position prefix-walking admit helper"
Task 5: Editor — Trajectory + EditedSentence + Seed Dataclasses
Files:
- Create: packages/generators/src/phonolex_generators/editor/trajectory.py
- Modify: packages/generators/src/phonolex_generators/editor/__init__.py (export dataclasses)
- Test: packages/generators/tests/test_trajectory.py
Pure data definitions — no logic, but stable types let the editor and CFG enumerator share a contract.
- [ ] Step 1: Write the failing test
packages/generators/tests/test_trajectory.py:
"""Trajectory / EditedSentence / Seed dataclass smoke tests."""
from phonolex_generators.editor.trajectory import (
    EditedSentence,
    Seed,
    Trajectory,
)

def test_seed_is_frozen_hashable():
    s = Seed(
        sentence="the cat chased the ball",
        content_word_indices=(1, 4),
        locked_word_indices=(2,),
        spec_id="spec1",
        note="control",
    )
    assert hash(s) is not None
    assert s.sentence == "the cat chased the ball"

def test_trajectory_outcome_default_running():
    t = Trajectory(traj_id=0)
    assert t.outcome == "RUNNING"
    assert t.history == []

def test_edited_sentence_holds_aggregate():
    seed = "the cat chased the ball"
    es = EditedSentence(
        seed=seed,
        spec_id="spec1",
        verb="chased",
        coherence_seed=-9.95,
        best="the cat ate the cake",
        coherence_best=-6.41,
        unique_outputs=["the cat ate the cake", "the cat ate the cookies"],
        trajectories=[],
    )
    assert es.coherence_best > es.coherence_seed
    assert len(es.unique_outputs) == 2
- [ ] Step 2: Run test (expect FAIL)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_trajectory.py -v
Expected: ImportError.
- [ ] Step 3: Implement dataclasses
packages/generators/src/phonolex_generators/editor/trajectory.py:
"""Editor data contracts — Seed / Trajectory / EditedSentence."""
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal

Outcome = Literal["RUNNING", "CONVERGED", "CYCLE", "TIMEOUT"]

@dataclass(frozen=True)
class Seed:
    """A CFG-emitted (or hand-crafted) seed sentence + which positions are editable."""
    sentence: str
    content_word_indices: tuple[int, ...]
    locked_word_indices: tuple[int, ...]
    spec_id: str
    note: str = ""

@dataclass
class Trajectory:
    """Per-trajectory edit history."""
    traj_id: int
    history: list[tuple[str, float]] = field(default_factory=list)
    outcome: Outcome = "RUNNING"
    best_sentence: str | None = None
    best_coherence: float = float("-inf")

@dataclass
class EditedSentence:
    """Best-of-N result for a single seed."""
    seed: str
    spec_id: str
    verb: str
    coherence_seed: float
    best: str
    coherence_best: float
    unique_outputs: list[str]
    trajectories: list[Trajectory]
- [ ] Step 4: Export from package
packages/generators/src/phonolex_generators/editor/__init__.py:
from phonolex_generators.editor.trajectory import (
EditedSentence,
Outcome,
Seed,
Trajectory,
)
__all__ = ["EditedSentence", "Outcome", "Seed", "Trajectory"]
- [ ] Step 5: Run test (expect PASS)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_trajectory.py -v
Expected: 3 passed.
- [ ] Step 6: Commit
git add packages/generators/src/phonolex_generators/editor packages/generators/tests/test_trajectory.py
git commit -m "PHON-95 Task 5: editor — Seed/Trajectory/EditedSentence dataclasses"
Task 6: Editor — mlm_iterative_editor (prefix-walking decode + best-of-N)
Files:
- Create: packages/generators/src/phonolex_generators/editor/mlm_iterative_editor.py
- Modify: packages/generators/src/phonolex_generators/editor/__init__.py (export edit)
- Test: packages/generators/tests/test_mlm_iterative_editor.py
The editor generalizes the PHON-92 probe: outer loop (sentence-level edits with CONVERGED/CYCLE/TIMEOUT) and best-of-N trajectory shape are kept; the inner fill mechanic is replaced with prefix-walking decode (_fill_slot). For each content slot covering mask positions p_0..p_n: walk the trie left-to-right, sample one admitted token per position (gated by dead_end_ratio < 1.0), accumulate the prefix, greedy-stop the first time walk_to(accumulated).is_end and not is_banned_word(accumulated). Trie contract: caller passes a VocabTrie already tagged via trie.tag(banned).
The deterministic test pins RNG and hyperparameters so per-position sampling is reproducible.
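The outer loop and best-of-N selection can be sketched independently of the MLM. The version below is illustrative only: the outcome semantics (CONVERGED when a step changes nothing, CYCLE when a sentence repeats, TIMEOUT at the iteration cap) and the per-trajectory seeding `base_seed + i` are assumptions, and `propose`, `score`, `FILLS`, and `SCORES` are toy stand-ins for the slot fill and the joint-mask PLL scorer.

```python
import random

FILLS = ["bone", "cake", "ball"]                      # toy candidate fills
SCORES = {"bone": -9.0, "cake": -6.0, "ball": -8.0}   # toy coherence values


def propose(sentence: str, rng: random.Random) -> str:
    """Toy edit step: resample the final word."""
    words = sentence.split()
    words[-1] = rng.choice(FILLS)
    return " ".join(words)


def score(sentence: str) -> float:
    """Toy coherence: depends only on the final word."""
    return SCORES[sentence.split()[-1]]


def run_trajectory(seed_sentence, rng, max_iter=15):
    """One trajectory: edit repeatedly and track the best-scoring sentence seen."""
    current = seed_sentence
    seen = {current}
    best, best_score = current, score(current)
    for _ in range(max_iter):
        candidate = propose(current, rng)
        if candidate == current:
            return best, best_score, "CONVERGED"  # assumed: the step changed nothing
        if candidate in seen:
            return best, best_score, "CYCLE"      # assumed: revisited a sentence
        seen.add(candidate)
        current = candidate
        if score(current) > best_score:
            best, best_score = current, score(current)
    return best, best_score, "TIMEOUT"


def best_of_n(seed_sentence, n, base_seed):
    """Run n independently seeded trajectories and keep the highest-scoring result."""
    runs = [run_trajectory(seed_sentence, random.Random(base_seed + i)) for i in range(n)]
    return max(runs, key=lambda run: run[1])


seed_sentence = "the dog ate the bone"
first = best_of_n(seed_sentence, n=2, base_seed=42)
second = best_of_n(seed_sentence, n=2, base_seed=42)
```

Each trajectory starts with the seed as its running best, so the selected result can never score below the seed; that is the property the deterministic test checks with `coherence_best >= coherence_seed`.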
- [ ] Step 1: Write the failing deterministic-output test
packages/generators/tests/test_mlm_iterative_editor.py:
"""Editor — deterministic best-of-N output with fixed RNG."""
from pathlib import Path

import pytest
from phonolex_data.runtime.store import WordStore
from phonolex_governors.generation.trie import VocabTrie

from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
from phonolex_generators.editor.mlm_iterative_editor import edit
from phonolex_generators.editor.trajectory import Seed
from phonolex_generators.shared.mlm_loader import get_mlm

@pytest.mark.slow
def test_edit_deterministic_with_n2_fixed_rng():
    """N=2 trajectories, fixed RNG seed → deterministic best output.
    Trie is the FULL 125K-word vocab tagged per-request with
    `banned = all - allowed(spec1)`. Per-position prefix-walking decode
    operates on this tagged trie.
    """
    model, tokenizer, _ = get_mlm()
    store = WordStore.from_parquet(Path("data/runtime/words.parquet"))
    all_words = [w.lower() for w in store.df["word"].to_list()]
    allowed = set(
        store.subset(SPEC_FILTERS["spec1"]).get_column("word").str.to_lowercase().to_list()
    )
    banned = set(all_words) - allowed
    trie = VocabTrie(all_words)
    trie.tag(banned)
    seed = Seed(
        sentence="the dog ate the bone",
        content_word_indices=(1, 4),
        locked_word_indices=(2,),
        spec_id="spec1",
        note="control",
    )
    result_a = edit(
        seed, model=model, tokenizer=tokenizer, trie=trie,
        n_trajectories=2, rng_base_seed=42,
    )
    result_b = edit(
        seed, model=model, tokenizer=tokenizer, trie=trie,
        n_trajectories=2, rng_base_seed=42,
    )
    assert result_a.best == result_b.best
    assert result_a.coherence_best == result_b.coherence_best
    # Best content words are spec-compliant: complete words in trie + not banned.
    best_tokens = result_a.best.split()
    for ci in seed.content_word_indices:
        w = best_tokens[ci].lower()
        node = trie.walk_to(w)
        assert node is not None and node.is_end, f"{w!r} not is_end"
        assert not trie.is_banned_word(w), f"{w!r} banned"
    # Verb (locked, index 2) preserved.
    assert best_tokens[2] == "ate"
    # Coherence improves (or stays equal).
    assert result_a.coherence_best >= result_a.coherence_seed
- [ ] Step 2: Run test (expect FAIL)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_mlm_iterative_editor.py -v -m slow
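Expected: ModuleNotFoundError at collection time. The message names phonolex_generators.editor.mlm_iterative_editor, or phonolex_generators.cfg_seed.spec_filters (imported first by the test) if that module has not been created yet.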
- [ ] Step 3: Implement the iterative editor
packages/generators/src/phonolex_generators/editor/mlm_iterative_editor.py:
"""MLM iterative editor — prefix-walking decode with best-of-N trajectories.
Outer loop and best-of-N shape lifted from PHON-92 probe; inner fill mechanic
is the per-content-slot prefix walk over the tagged VocabTrie. Greedy-stops
the first time the accumulated prefix is is_end-and-not-banned.
Trie contract: caller passes a `phonolex_governors.VocabTrie` already tagged
for this request via `trie.tag(banned)`.
"""
from __future__ import annotations

import torch
from phonolex_governors.generation.trie import VocabTrie

from phonolex_generators.editor.trajectory import EditedSentence, Seed, Trajectory
from phonolex_generators.editor.trie_filter import topk_admit_at_position
from phonolex_generators.scorer.joint_mask_pll import joint_masked_coherence
from phonolex_generators.shared.word_to_tokens import word_to_token_positions

# Hyperparameters, overridable per-call via `edit(...)` kwargs.
TRIE_TOP_K = 50  # admit-pool size at each mask position
SAMPLE_TOP_K = 10  # sample from top-K of admit pool
TEMPERATURE = 0.7
N_TRAJECTORIES = 8
MAX_ITER = 15

def _joint_mask_forward(
    model, tokenizer, sentence: str, mask_token_positions: list[int], device: str
):
    enc = tokenizer(sentence, return_tensors="pt").to(device)
    input_ids = enc["input_ids"]
    masked = input_ids.clone()
    for ti in mask_token_positions:
        masked[0, ti] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(masked).logits
    return input_ids, logits

def _sample_from_topk(
    admit: list[tuple[int, str, float]],
    sample_top_k: int,
    temperature: float,
    rng: torch.Generator,
    device: str,
) -> tuple[int, str, float] | None:
    if not admit:
        return None
    pool = admit[:sample_top_k]
    raw = torch.tensor([t[2] for t in pool], device=device)
    probs = torch.softmax(raw / temperature, dim=-1)
    idx = torch.multinomial(probs, num_samples=1, generator=rng).item()
    return pool[idx]

def _fill_slot(
    logits: torch.Tensor,
    slot_positions: list[int],
    tokenizer,
    trie: VocabTrie,
    rng: torch.Generator,
    device: str,
    *,
    trie_top_k: int,
    sample_top_k: int,
    temperature: float,
    forbid: set[str],
) -> str | None:
    """Walk the trie left-to-right across slot_positions; return a complete admissible word.
    Returns None if any position has no admit, or if the loop ends without
    accumulated being is_end-and-not-banned.
    """
    accumulated = ""
    for pos_idx, pos in enumerate(slot_positions):
        # Greedy early-stop: if we already have a complete admissible word, return it.
        if accumulated:
            node = trie.walk_to(accumulated)
            if (
                node is not None
                and node.is_end
                and not trie.is_banned_word(accumulated)
                and accumulated not in forbid
            ):
                return accumulated
        admit = topk_admit_at_position(
            logits, pos, tokenizer, trie,
            accumulated=accumulated,
            is_first=(pos_idx == 0),
            k=trie_top_k,
        )
        if not admit:
            return None
        # Anti-repetition across slots within an iteration: skip extensions whose
        # would-be complete word is already used in another slot.
        admit_filtered = [
            (tid, frag, lg) for (tid, frag, lg) in admit
            if (accumulated + frag) not in forbid
        ]
        if not admit_filtered:
            return None
        sampled = _sample_from_topk(
            admit_filtered, sample_top_k, temperature, rng, device
        )
        if sampled is None:
            return None
        accumulated = accumulated + sampled[1]
    # End of mask region: accept iff is_end-and-not-banned.
    node = trie.walk_to(accumulated)
    if (
        node is not None
        and node.is_end
        and not trie.is_banned_word(accumulated)
        and accumulated not in forbid
    ):
        return accumulated
    return None

def _run_trajectory(
    seed: Seed,
    model,
    tokenizer,
    trie: VocabTrie,
    traj_id: int,
    rng: torch.Generator,
    device: str,
    *,
    trie_top_k: int,
    sample_top_k: int,
    temperature: float,
    max_iter: int,
) -> Trajectory:
    traj = Trajectory(traj_id=traj_id)
    current = seed.sentence
    seen = {current}
    coh = joint_masked_coherence(
        model, tokenizer, current, list(seed.content_word_indices)
    )
    traj.history.append((current, coh))
    traj.best_sentence, traj.best_coherence = current, coh
    for _ in range(max_iter):
        word_pos = word_to_token_positions(
            tokenizer, current, list(seed.content_word_indices)
        )
        mask_token_positions = [p for ps in word_pos.values() for p in ps]
        _, logits = _joint_mask_forward(
            model, tokenizer, current, mask_token_positions, device
        )
        new_words = current.split()
        any_change = False
        already_used: set[str] = set()
        for wi in seed.content_word_indices:
            slot_positions = word_pos.get(wi, [])
            if not slot_positions:
                continue
            fill = _fill_slot(
logits, slot_positions, tokenizer, trie, rng, device,
trie_top_k=trie_top_k,
sample_top_k=sample_top_k,
temperature=temperature,
forbid=already_used,
)
if fill is None:
continue
already_used.add(fill)
if fill != new_words[wi].lower():
new_words[wi] = fill
any_change = True
new_sentence = " ".join(new_words)
new_coh = joint_masked_coherence(
model, tokenizer, new_sentence, list(seed.content_word_indices)
)
traj.history.append((new_sentence, new_coh))
if new_coh > traj.best_coherence:
traj.best_sentence = new_sentence
traj.best_coherence = new_coh
if not any_change:
traj.outcome = "CONVERGED"
return traj
if new_sentence in seen:
traj.outcome = "CYCLE"
return traj
seen.add(new_sentence)
current = new_sentence
traj.outcome = "TIMEOUT"
return traj
def edit(
seed: Seed,
model,
tokenizer,
trie: VocabTrie,
*,
verb: str | None = None,
n_trajectories: int = N_TRAJECTORIES,
rng_base_seed: int = 42,
trie_top_k: int = TRIE_TOP_K,
sample_top_k: int = SAMPLE_TOP_K,
temperature: float = TEMPERATURE,
max_iter: int = MAX_ITER,
) -> EditedSentence:
"""Best-of-N sampled iterative edit for a single seed."""
device = next(model.parameters()).device
seed_coh = joint_masked_coherence(
model, tokenizer, seed.sentence, list(seed.content_word_indices)
)
if verb is None:
# Default: pick the first locked word (probe convention: index 2 is the verb).
words = seed.sentence.split()
verb = words[seed.locked_word_indices[0]] if seed.locked_word_indices else ""
rng = torch.Generator(device=str(device))
trajectories: list[Trajectory] = []
unique: dict[str, None] = {}
best_sentence: str = seed.sentence
best_coh: float = float("-inf")
for tid in range(n_trajectories):
rng.manual_seed(rng_base_seed + tid)
traj = _run_trajectory(
seed,
model,
tokenizer,
trie,
tid,
rng,
str(device),
trie_top_k=trie_top_k,
sample_top_k=sample_top_k,
temperature=temperature,
max_iter=max_iter,
)
trajectories.append(traj)
if traj.best_sentence is not None:
unique[traj.best_sentence] = None
if traj.best_coherence > best_coh:
best_coh = traj.best_coherence
best_sentence = traj.best_sentence
return EditedSentence(
seed=seed.sentence,
spec_id=seed.spec_id,
verb=verb,
coherence_seed=seed_coh,
best=best_sentence,
coherence_best=best_coh,
unique_outputs=list(unique.keys()),
trajectories=trajectories,
)
- [ ] Step 4: Export from package
Modify packages/generators/src/phonolex_generators/editor/__init__.py:
from phonolex_generators.editor.mlm_iterative_editor import edit
from phonolex_generators.editor.trajectory import (
EditedSentence,
Outcome,
Seed,
Trajectory,
)
from phonolex_generators.editor.trie_filter import topk_admit_at_position
__all__ = [
"EditedSentence",
"Outcome",
"Seed",
"Trajectory",
"edit",
"topk_admit_at_position",
]
- [ ] Step 5: Run test (expect PASS)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_mlm_iterative_editor.py -v -m slow
Expected: 1 passed (~5–10s wall time on MPS).
- [ ] Step 6: Commit
git add packages/generators/src/phonolex_generators/editor packages/generators/tests/test_mlm_iterative_editor.py
git commit -m "PHON-95 Task 6: editor.mlm_iterative_editor — best-of-N sampled iterative edit"
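The sampling step in `_sample_from_topk` can be illustrated without torch. The following is a minimal stdlib-only sketch, not the editor's implementation: `sample_from_topk`, the toy admit list, and the use of `random.Random` in place of `torch.Generator` are all assumptions made for illustration. It truncates the admit pool to the first `sample_top_k` entries, rescales logits by the temperature, and draws one entry from the resulting softmax weights.

```python
import math
import random


def sample_from_topk(admit, sample_top_k, temperature, rng):
    """Draw one (token_id, fragment, logit) triple from the first sample_top_k admits."""
    if not admit:
        return None
    pool = admit[:sample_top_k]
    scaled = [logit / temperature for _, _, logit in pool]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights, so this is a softmax draw.
    return rng.choices(pool, weights=weights, k=1)[0]


# Toy admit pool (invented values), highest logit first.
admit = [(11, "cat", 5.0), (12, "cup", 4.0), (13, "key", 1.0)]
rng = random.Random(42)
picked = sample_from_topk(admit, sample_top_k=2, temperature=0.7, rng=rng)
print(picked)
```

A temperature below 1.0 sharpens the distribution toward the highest logit, while `sample_top_k` bounds how far down the admit pool a draw can reach.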
Task 7: CFG Seed — spec_filters¶
Files:
- Create: packages/generators/src/phonolex_generators/cfg_seed/spec_filters.py
- Test: packages/generators/tests/test_spec_filters.py
The probe used SQL queries for spec1/spec6 against the D1 SQLite. The productionized version uses Polars expressions against WordStore.subset(...). This module is the dictionary of those expressions, callable by spec_id.
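As a rough illustration of the "dictionary of filters keyed by spec_id" idea, the sketch below uses plain Python predicates over row dicts as a stand-in for Polars expressions. Everything here is hypothetical: `TOY_SPEC_FILTERS`, `toy_subset`, and the rows are invented, and only the column names are taken from the plan.

```python
from typing import Any, Callable

Row = dict[str, Any]

# Hypothetical stand-ins for the Polars expressions: one predicate per spec_id.
TOY_SPEC_FILTERS: dict[str, Callable[[Row], bool]] = {
    "spec1": lambda r: (
        r["phonemes_str"].startswith("|k|")
        and r["syllable_count"] <= 2
        and r["pos"] in {"NOUN", "VERB"}
    ),
}


def toy_subset(rows: list[Row], spec_id: str) -> list[str]:
    """Return the words whose row satisfies the predicate registered for spec_id."""
    predicate = TOY_SPEC_FILTERS[spec_id]  # raises KeyError for an unknown spec_id
    return [r["word"] for r in rows if predicate(r)]


# Invented rows; the column names follow the ones used in the plan.
ROWS: list[Row] = [
    {"word": "cat", "phonemes_str": "|k|ae|t|", "syllable_count": 1, "pos": "NOUN"},
    {"word": "dog", "phonemes_str": "|d|ao|g|", "syllable_count": 1, "pos": "NOUN"},
    {"word": "calendar", "phonemes_str": "|k|ae|l|ah|n|d|er|", "syllable_count": 3, "pos": "NOUN"},
    {"word": "quick", "phonemes_str": "|k|w|ih|k|", "syllable_count": 1, "pos": "ADJ"},
]

kept = toy_subset(ROWS, "spec1")
print(kept)
```

Keeping the filters in one registry means the enumerator, the trie tagging, and the tests all resolve a spec_id the same way.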
- [ ] Step 1: Write the failing test
packages/generators/tests/test_spec_filters.py:
"""Spec filters — Polars expressions yielding the probe's lexicon counts."""
from pathlib import Path

import pytest

from phonolex_data.runtime.store import WordStore
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS


def _store() -> WordStore:
    return WordStore.from_parquet(Path("data/runtime/words.parquet"))


def test_spec1_count_matches_probe():
    """spec1 = words starting /k/, syllable_count ≤ 2, POS in NOUN/VERB.

    Probe printed: 'spec spec1 VocabTrie: 1,798 words'
    """
    store = _store()
    df = store.subset(SPEC_FILTERS["spec1"])
    assert df.height == 1798


def test_spec6_count_matches_probe():
    """spec6 = syllable_count ≤ 2, POS in NOUN/VERB/ADJ, iconicity ≥ 1.8, imageability ≥ 4.5.

    Probe printed: 'spec spec6 VocabTrie: 649 words'
    """
    store = _store()
    df = store.subset(SPEC_FILTERS["spec6"])
    assert df.height == 649


def test_unknown_spec_id_raises_keyerror():
    assert "spec1" in SPEC_FILTERS
    assert "spec999" not in SPEC_FILTERS
    # The registry is a plain dict, so an unknown id fails on lookup.
    with pytest.raises(KeyError):
        SPEC_FILTERS["spec999"]
- [ ] Step 2: Run test (expect FAIL)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_spec_filters.py -v
Expected: ImportError.
- [ ] Step 3: Implement spec filters
packages/generators/src/phonolex_generators/cfg_seed/spec_filters.py:
"""Spec ID → Polars filter expression. PHON-64 v2 failure-case lexicon.
These match the SQL queries in `probe_sampled_iterative.py::load_spec_words`.
Counts verified against the probe's runtime printout (spec1=1798, spec6=649).
"""
from __future__ import annotations
import polars as pl
SPEC_FILTERS: dict[str, pl.Expr] = {
"spec1": (
pl.col("phonemes_str").str.starts_with("|k|")
& (pl.col("syllable_count") <= 2)
& pl.col("pos").is_in(["NOUN", "VERB"])
),
"spec6": (
(pl.col("syllable_count") <= 2)
& pl.col("pos").is_in(["NOUN", "VERB", "ADJ"])
& (pl.col("iconicity") >= 1.8)
& (pl.col("imageability") >= 4.5)
),
}
- [ ] Step 4: Run test (expect PASS)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_spec_filters.py -v
Expected: 3 passed.
- [ ] Step 5: Commit
git add packages/generators/src/phonolex_generators/cfg_seed/spec_filters.py packages/generators/tests/test_spec_filters.py
git commit -m "PHON-95 Task 7: cfg_seed.spec_filters — PHON-64 v2 spec1/spec6 Polars exprs"
Task 8: CFG Seed — argstruc_enumerator¶
Files:
- Create: packages/generators/src/phonolex_generators/cfg_seed/argstruc_enumerator.py
- Modify: packages/generators/src/phonolex_generators/cfg_seed/__init__.py (export enumerate_seeds)
- Test: packages/generators/tests/test_argstruc_enumerator.py
Verb-locked CFG NP V NP enumerator. Slot terminals are the intersection of (a) the spec lexicon (store.subset(spec_expr)) and (b) the per-(verb, role) PMI-admit set from selectional.parquet. v1 randomly samples up to 4 nsubj and 4 dobj fills, which caps the cross product at 16 seeds (pairs where nsubj and dobj are the same word are skipped, so fewer can be emitted). The determiner is fixed to "the"; pluralization and agreement are deferred (OQ3).
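The pool construction described above can be sketched with plain sets. This is a toy illustration under stated assumptions: `toy_enumerate` and all word sets are invented, and the real enumerator reads its pools from WordStore and selectional.parquet rather than from literals.

```python
import random


def toy_enumerate(spec_words, nsubj_admit, dobj_admit, verb,
                  *, per_pool=4, max_seeds=16, rng_seed=42):
    """Cross a sampled nsubj pool with a sampled dobj pool around a locked verb."""
    # Sort before sampling: set iteration order is not stable across runs.
    nsubj_pool = sorted(spec_words & nsubj_admit)
    dobj_pool = sorted(spec_words & dobj_admit)
    rng = random.Random(rng_seed)
    nsubj = rng.sample(nsubj_pool, k=min(per_pool, len(nsubj_pool)))
    dobj = rng.sample(dobj_pool, k=min(per_pool, len(dobj_pool)))
    seeds = [f"the {s} {verb} the {o}" for s in nsubj for o in dobj if s != o]
    return seeds[:max_seeds]


# Invented toy data: "thunder" is PMI-admitted but not in the spec lexicon.
spec_words = {"cook", "king", "cake", "cord", "card"}
nsubj_admit = {"cook", "king", "thunder"}
dobj_admit = {"cake", "cord", "card", "thunder"}

seeds = toy_enumerate(spec_words, nsubj_admit, dobj_admit, "cut")
for sentence in seeds:
    print(sentence)
```

Sorting the intersection before calling `rng.sample` is what makes a fixed `rng_seed` reproduce the same seeds from one process to the next.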
- [ ] Step 1: Write the failing test
packages/generators/tests/test_argstruc_enumerator.py:
"""argstruc_enumerator — verb-locked NP V NP CFG with WordStore + PMI gating."""
from pathlib import Path
import pytest
from phonolex_data.runtime.store import WordStore
from phonolex_generators.cfg_seed.argstruc_enumerator import enumerate_seeds, pmi_admit
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
@pytest.fixture(scope="module")
def store() -> WordStore:
s = WordStore.from_parquet(Path("data/runtime/words.parquet"))
s.attach_selectional(Path("data/runtime/selectional.parquet"))
return s
def test_pmi_admit_cake_yes_thunder_no(store: WordStore):
"""Acceptance criterion 4: cake ∈ admit, thunder ∉ admit (verb=cut, dobj, fineweb_adult)."""
admit = pmi_admit(store, verb="cut", role="dobj", band="fineweb_adult")
assert "cake" in admit
assert "thunder" not in admit
def test_enumerate_seeds_for_cut_spec1_emits_in_spec_seeds(store: WordStore):
"""Acceptance criterion 2.a: ≥4 seeds; all content words in spec lexicon AND admit set."""
spec_lex = set(store.subset(SPEC_FILTERS["spec1"])["word"].str.to_lowercase().to_list())
nsubj_admit = pmi_admit(store, verb="cut", role="nsubj", band="fineweb_adult")
dobj_admit = pmi_admit(store, verb="cut", role="dobj", band="fineweb_adult")
seeds = enumerate_seeds(
store=store,
spec_id="spec1",
verb="cut",
band="fineweb_adult",
max_seeds=16,
rng_seed=7,
)
assert len(seeds) >= 4
for s in seeds:
words = s.sentence.split()
nsubj_word, dobj_word = words[1].lower(), words[4].lower()
assert nsubj_word in spec_lex, f"{nsubj_word} not in spec1 lexicon"
assert dobj_word in spec_lex, f"{dobj_word} not in spec1 lexicon"
assert nsubj_word in nsubj_admit, f"{nsubj_word} not in nsubj admit"
assert dobj_word in dobj_admit, f"{dobj_word} not in dobj admit"
assert s.spec_id == "spec1"
assert s.locked_word_indices == (2,)
assert s.content_word_indices == (1, 4)
def test_enumerate_seeds_deterministic_with_rng_seed(store: WordStore):
a = enumerate_seeds(store, "spec1", "cut", "fineweb_adult", max_seeds=8, rng_seed=99)
b = enumerate_seeds(store, "spec1", "cut", "fineweb_adult", max_seeds=8, rng_seed=99)
assert [s.sentence for s in a] == [s.sentence for s in b]
- [ ] Step 2: Run test (expect FAIL)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_argstruc_enumerator.py -v
Expected: ImportError.
- [ ] Step 3: Implement pmi_admit and enumerate_seeds
packages/generators/src/phonolex_generators/cfg_seed/argstruc_enumerator.py:
"""Verb-locked argument-structure CFG seed enumerator.
Productions:
S → NP V NP
NP → "the" N
V is locked at production time. Both N slots are filled from the
intersection of (a) the spec lexicon (`WordStore.subset(spec_expr)`) and
(b) the per-(verb, role, band) PMI-admit set from `selectional.parquet`.
v1: determiner fixed to "the"; pluralization/agreement deferred (OQ3).
"""
from __future__ import annotations
import random
import polars as pl
from phonolex_data.runtime.store import WordStore
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
from phonolex_generators.editor.trajectory import Seed
def pmi_admit(store: WordStore, verb: str, role: str, band: str) -> set[str]:
    """Per-(verb, role, band) selectional admit set: fillers with PPMI > 0."""
    if store._selectional_df is None:
        raise RuntimeError(
            "selectional.parquet not attached; call "
            "`store.attach_selectional(path)` before enumerate_seeds()"
        )
    df = store._selectional_df.filter(
        (pl.col("verb") == verb)
        & (pl.col("role") == role)
        & (pl.col("band") == band)
        & (pl.col("ppmi") > 0.0)
    )
    return set(df.get_column("filler").to_list())
def enumerate_seeds(
store: WordStore,
spec_id: str,
verb: str,
band: str = "fineweb_adult",
*,
max_seeds: int = 16,
nsubj_per_pool: int = 4,
dobj_per_pool: int = 4,
rng_seed: int = 42,
) -> list[Seed]:
"""Emit up to `max_seeds` `the {nsubj} {verb} the {dobj}` seed sentences."""
if spec_id not in SPEC_FILTERS:
raise KeyError(f"unknown spec_id: {spec_id!r}; known: {list(SPEC_FILTERS)}")
spec_words = set(
store.subset(SPEC_FILTERS[spec_id])
.get_column("word")
.str.to_lowercase()
.to_list()
)
nsubj_admit = pmi_admit(store, verb=verb, role="nsubj", band=band)
dobj_admit = pmi_admit(store, verb=verb, role="dobj", band=band)
nsubj_pool = sorted(spec_words & nsubj_admit)
dobj_pool = sorted(spec_words & dobj_admit)
rng = random.Random(rng_seed)
n_pick = min(len(nsubj_pool), nsubj_per_pool)
d_pick = min(len(dobj_pool), dobj_per_pool)
nsubj_sample = rng.sample(nsubj_pool, k=n_pick) if n_pick else []
dobj_sample = rng.sample(dobj_pool, k=d_pick) if d_pick else []
seeds: list[Seed] = []
for nsubj in nsubj_sample:
for dobj in dobj_sample:
if nsubj == dobj:
continue
seeds.append(
Seed(
sentence=f"the {nsubj} {verb} the {dobj}",
content_word_indices=(1, 4),
locked_word_indices=(2,),
spec_id=spec_id,
note=f"CFG-emitted ({nsubj}, {verb}, {dobj}) band={band}",
)
)
if len(seeds) >= max_seeds:
return seeds
return seeds
- [ ] Step 4: Export from package
packages/generators/src/phonolex_generators/cfg_seed/__init__.py:
from phonolex_generators.cfg_seed.argstruc_enumerator import (
enumerate_seeds,
pmi_admit,
)
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
__all__ = ["SPEC_FILTERS", "enumerate_seeds", "pmi_admit"]
- [ ] Step 5: Run test (expect PASS)
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_argstruc_enumerator.py -v
Expected: 3 passed.
- [ ] Step 6: Commit
git add packages/generators/src/phonolex_generators/cfg_seed packages/generators/tests/test_argstruc_enumerator.py
git commit -m "PHON-95 Task 8: cfg_seed.argstruc_enumerator — verb-locked NP V NP enumerator"
Task 9: PHON-64 v2 Acceptance Test (compliance + perf, not byte-equality)¶
Files:
- Create: packages/generators/tests/test_acceptance_phon64v2.py
The PHON-64 v2 regression feeds all 5 hand-crafted probe seeds through the productionized editor + scorer, asserting per seed: (a) coherence improves, (b) verbs are locked, (c) every content word is spec-compliant under the tagged trie (walk_to(w).is_end and not is_banned_word(w)). Byte-equality with the probe's sampled_locked_dedup_output.txt is NOT a target — the prefix-walking decode generalizes the probe's single-position filter, so outputs are expected to differ (and ideally improve, since multi-token spec words become reachable). The probe gold remains in the spike branch as a smoke baseline; we don't copy it as a fixture.
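The compliance check in (c) can be sketched against a toy trie. The classes below are hypothetical stand-ins, not the real `VocabTrie`: only the method names `tag`, `walk_to`, `is_banned_word` and the `is_end` flag are taken from the plan, and their behavior here is an assumption.

```python
from __future__ import annotations


class Node:
    def __init__(self) -> None:
        self.children: dict[str, Node] = {}
        self.is_end = False


class ToyTrie:
    """Character-level trie over the full vocab, retagged per request."""

    def __init__(self, words):
        self.root = Node()
        self.banned: set[str] = set()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, Node())
            node.is_end = True

    def tag(self, banned):
        # Retagging replaces the banned set; the trie itself is not rebuilt.
        self.banned = set(banned)

    def walk_to(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def is_banned_word(self, word):
        return word in self.banned


def is_compliant(trie, word):
    """A word complies if it is a complete vocab entry and not banned."""
    node = trie.walk_to(word)
    return node is not None and node.is_end and not trie.is_banned_word(word)


all_words = ["cat", "cake", "dog", "cup"]
trie = ToyTrie(all_words)
allowed = {"cat", "cake", "cup"}
trie.tag(set(all_words) - allowed)  # banned = all_words - allowed
print(is_compliant(trie, "cake"), is_compliant(trie, "dog"))
```

Building the trie once and swapping only the banned set is what lets a single fixture serve every spec in the parametrized test.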
- [ ] Step 1: Write the failing acceptance test
packages/generators/tests/test_acceptance_phon64v2.py:
"""PHON-64 v2 acceptance: 5 probe seeds → coherent in-spec English.
Spec acceptance criterion 1: all 5 seeds produce a `best` with
coherence_best > coherence_seed and 100% spec compliance under the tagged
trie. The probe's `sampled_locked_dedup_output.txt` is a smoke baseline,
not a byte-equality target — prefix-walking decode generalizes the probe.
"""
import time
from pathlib import Path
import pytest
from phonolex_data.runtime.store import WordStore
from phonolex_governors.generation.trie import VocabTrie
from phonolex_generators.cfg_seed.argstruc_enumerator import enumerate_seeds
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
from phonolex_generators.editor.mlm_iterative_editor import edit
from phonolex_generators.editor.trajectory import Seed
from phonolex_generators.shared.mlm_loader import get_mlm
PROBE_SEEDS = [
Seed("the puppy melt the baby", (1, 4), (2,), "spec6", "F1 case, verb locked"),
Seed("the cat chased the ball", (1, 4), (2,), "spec1", "Well-formed control"),
Seed("the snow filled the cup", (1, 4), (2,), "spec1", "Reasonable spec1 seed"),
Seed("the coleslaw cut the control", (1, 4), (2,), "spec1", "F1 + multi-BPE"),
Seed("the dog ate the bone", (1, 4), (2,), "spec1", "Well-formed control 2"),
]
@pytest.fixture(scope="module")
def store() -> WordStore:
s = WordStore.from_parquet(Path("data/runtime/words.parquet"))
s.attach_selectional(Path("data/runtime/selectional.parquet"))
return s
@pytest.fixture(scope="module")
def all_words(store: WordStore) -> list[str]:
return [w.lower() for w in store.df["word"].to_list()]
@pytest.fixture(scope="module")
def vocab_trie(all_words: list[str]) -> VocabTrie:
"""Build the trie ONCE; per-test we only retag with a different banned set."""
return VocabTrie(all_words)
def _retag_for_spec(trie: VocabTrie, store: WordStore, all_words: list[str], spec_id: str) -> set[str]:
allowed = set(
store.subset(SPEC_FILTERS[spec_id]).get_column("word").str.to_lowercase().to_list()
)
banned = set(all_words) - allowed
trie.tag(banned)
return allowed
@pytest.mark.acceptance
@pytest.mark.slow
@pytest.mark.parametrize("seed", PROBE_SEEDS, ids=lambda s: s.sentence.replace(" ", "_"))
def test_probe_seed_improves_and_stays_in_spec(
seed: Seed,
store: WordStore,
all_words: list[str],
vocab_trie: VocabTrie,
):
_retag_for_spec(vocab_trie, store, all_words, seed.spec_id)
model, tokenizer, _ = get_mlm()
result = edit(seed, model=model, tokenizer=tokenizer, trie=vocab_trie)
assert result.coherence_best > result.coherence_seed, (
f"Seed {seed.sentence!r}: edit did not improve coherence "
f"(seed={result.coherence_seed:+.2f}, best={result.coherence_best:+.2f})"
)
best_words = result.best.split()
# Verb is locked.
seed_words = seed.sentence.split()
for li in seed.locked_word_indices:
assert best_words[li] == seed_words[li], (
f"Locked word at index {li} changed: {seed_words[li]!r} → {best_words[li]!r}"
)
# Content words are spec-compliant under the trie.
for ci in seed.content_word_indices:
w = best_words[ci].lower()
node = vocab_trie.walk_to(w)
assert node is not None and node.is_end, (
f"Content word at index {ci} {w!r} is not is_end in trie"
)
assert not vocab_trie.is_banned_word(w), (
f"Content word at index {ci} {w!r} is banned for spec {seed.spec_id}"
)
@pytest.mark.acceptance
@pytest.mark.slow
def test_performance_gate_16_seeds_under_30s(
    store: WordStore,
    all_words: list[str],
    vocab_trie: VocabTrie,
):
    """Acceptance criterion 3: 16-seed batch ≤ 30s wall-clock on MPS."""
    _retag_for_spec(vocab_trie, store, all_words, "spec1")
    seeds = enumerate_seeds(store, "spec1", "cut", "fineweb_adult", max_seeds=16, rng_seed=42)
    assert len(seeds) >= 8
    # Load the model before starting the clock so the budget covers editing only.
    model, tokenizer, _ = get_mlm()
    t0 = time.perf_counter()
    for s in seeds:
        edit(s, model=model, tokenizer=tokenizer, trie=vocab_trie)
    elapsed = time.perf_counter() - t0
    # The enumerator can emit fewer than 16 seeds, so report the actual count.
    assert elapsed <= 30.0, f"{len(seeds)}-seed batch took {elapsed:.1f}s; budget is 30s"
- [ ] Step 2: Run the acceptance test
Run: uv run --package phonolex-generators pytest packages/generators/tests/test_acceptance_phon64v2.py -v -m "slow and acceptance"
Expected: 6 passed (5 seed cases + perf gate). Wall-clock: ~30–60s total (model load + 5 best-of-8 edits + 16-seed sweep). If a seed case fails, check whether the prefix-walking decode is dead-ending early. The likely culprit is a spec lexicon so restrictive that the dead_end_ratio < 1.0 gate admits no candidates at p_0.
- [ ] Step 3: Run the full unit-test suite to confirm no regressions
Run: uv run --package phonolex-generators pytest packages/generators/tests/ -v (default — slow tests skipped)
Expected: all unmarked tests passed.
Run: uv run python -m pytest packages/data/tests/ -v (acceptance criterion 5)
Expected: all 201 packages/data tests still pass.
- [ ] Step 4: Commit
git add packages/generators/tests/test_acceptance_phon64v2.py
git commit -m "PHON-95 Task 9: acceptance — PHON-64 v2 compliance + 16-seed perf gate"
Task 10: Reproducible Run Script¶
Files:
- Create: packages/generation/research/2026-05-07-phon-95-editor/run.py
- Create: packages/generation/research/2026-05-07-phon-95-editor/README.md
A small CLI under the existing packages/generation/research/ tree (where 2026-04-29-eval-harness-v1 and similar live). Takes (spec_id, verb, n_seeds) and prints the editor's outputs in the same shape the probe printed.
- [ ] Step 1: Implement the run script
packages/generation/research/2026-05-07-phon-95-editor/run.py:
"""PHON-95 reproducible run — CFG enumerate → MLM edit (prefix-walking decode) → coherence-rank.
Usage:
uv run python packages/generation/research/2026-05-07-phon-95-editor/run.py \\
--spec-id spec1 --verb cut --band fineweb_adult --n-seeds 8
Prints per-seed best output + coherence + unique-output count.
"""
from __future__ import annotations
import argparse
import time
from pathlib import Path
from phonolex_data.runtime.store import WordStore
from phonolex_governors.generation.trie import VocabTrie
from phonolex_generators.cfg_seed.argstruc_enumerator import enumerate_seeds
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
from phonolex_generators.editor.mlm_iterative_editor import edit
from phonolex_generators.shared.mlm_loader import get_mlm
def main() -> None:
parser = argparse.ArgumentParser(description="PHON-95 editor run")
parser.add_argument("--spec-id", required=True, choices=sorted(SPEC_FILTERS.keys()))
parser.add_argument("--verb", required=True)
parser.add_argument("--band", default="fineweb_adult")
parser.add_argument("--n-seeds", type=int, default=8)
parser.add_argument("--n-trajectories", type=int, default=8)
parser.add_argument("--rng-seed", type=int, default=42)
parser.add_argument(
"--words-parquet", type=Path, default=Path("data/runtime/words.parquet")
)
parser.add_argument(
"--selectional-parquet",
type=Path,
default=Path("data/runtime/selectional.parquet"),
)
args = parser.parse_args()
print(f"Loading WordStore from {args.words_parquet} ...")
store = WordStore.from_parquet(args.words_parquet)
store.attach_selectional(args.selectional_parquet)
print("Loading RoBERTa-large ...")
model, tokenizer, device = get_mlm()
print(f" device = {device}")
print("Building VocabTrie over full vocab ...")
all_words = [w.lower() for w in store.df["word"].to_list()]
trie = VocabTrie(all_words)
allowed = set(
store.subset(SPEC_FILTERS[args.spec_id])
.get_column("word").str.to_lowercase().to_list()
)
banned = set(all_words) - allowed
trie.tag(banned)
print(f" spec {args.spec_id}: {len(allowed):,} allowed / {len(banned):,} banned\n")
seeds = enumerate_seeds(
store,
spec_id=args.spec_id,
verb=args.verb,
band=args.band,
max_seeds=args.n_seeds,
rng_seed=args.rng_seed,
)
if not seeds:
print(f"No seeds emitted for spec={args.spec_id} verb={args.verb} band={args.band}")
return
t0 = time.perf_counter()
for s in seeds:
print("=" * 78)
print(f"SEED: {s.sentence!r} (spec={s.spec_id})")
result = edit(
s,
model=model,
tokenizer=tokenizer,
trie=trie,
n_trajectories=args.n_trajectories,
rng_base_seed=args.rng_seed,
)
print(f" seed coherence = {result.coherence_seed:+.2f}")
print(f" best = {result.best!r}")
print(f" best coherence = {result.coherence_best:+.2f}")
print(f" unique outputs = {len(result.unique_outputs)} / {args.n_trajectories}")
elapsed = time.perf_counter() - t0
print("=" * 78)
print(f"\nTotal: {elapsed:.1f}s for {len(seeds)} seeds "
f"({elapsed / len(seeds):.1f}s/seed average)")
if __name__ == "__main__":
main()
- [ ] Step 2: Add a brief README for the research dir
packages/generation/research/2026-05-07-phon-95-editor/README.md:
# PHON-95 — MLM Iterative Editor + Argstruc CFG Enumerator
Reproducible run script for the productionized PHON-92 stack
(`phonolex_generators`).
## Usage
uv run python run.py --spec-id spec1 --verb cut --band fineweb_adult --n-seeds 8
## What this is
Tiny driver around the new `phonolex_generators` package. The package
itself lives at `packages/generators/`; this directory only holds the
demo CLI + any artifacts produced from probing.
See:
- spec: `docs/superpowers/specs/2026-05-07-phon-95-mlm-editor-cfg-enumerator-design.md`
- plan: `docs/superpowers/plans/2026-05-07-phon-95-mlm-editor-cfg-enumerator.md`
- predecessor probe (gold): `research/phon-92-selectional-preference-spike` @
`packages/generation/research/2026-05-05-phon-92-selectional-preference/diffusion-editor-probe/`
- [ ] Step 3: Smoke-run the script
Run: uv run python packages/generation/research/2026-05-07-phon-95-editor/run.py --spec-id spec1 --verb cut --n-seeds 4 --n-trajectories 4
Expected: prints 4 seeds with best ≠ seed sentence for at least 3, no exceptions.
- [ ] Step 4: Commit
git add packages/generation/research/2026-05-07-phon-95-editor/
git commit -m "PHON-95 Task 10: reproducible run script under packages/generation/research/"
Task 11: Documentation — README + CLAUDE.md update¶
Files:
- Create: packages/generators/README.md
- Modify: CLAUDE.md (mention the new package alongside phonolex_data / phonolex_governors)
- [ ] Step 1: Write the package README
packages/generators/README.md:
# phonolex_generators
C1 combinatorial generation track — productionization of the PHON-92
validated stack.
## Modules
- `cfg_seed.argstruc_enumerator` — verb-locked NP V NP CFG; slot fills =
`WordStore.subset(spec_expr) ∩ pmi_admit(verb, role, band)`.
- `editor.mlm_iterative_editor` — joint-mask + sampled trie-filtered fill
+ best-of-N trajectories over RoBERTa-large.
- `scorer.joint_mask_pll` — joint-masked pseudo-log-likelihood; shared
MLM with the editor.
## Quick start
```python
from pathlib import Path

from phonolex_data.runtime.store import WordStore
from phonolex_governors.generation.trie import VocabTrie
from phonolex_generators.cfg_seed.argstruc_enumerator import enumerate_seeds
from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS
from phonolex_generators.editor.mlm_iterative_editor import edit
from phonolex_generators.shared.mlm_loader import get_mlm

store = WordStore.from_parquet(Path("data/runtime/words.parquet"))
store.attach_selectional(Path("data/runtime/selectional.parquet"))
model, tokenizer, _ = get_mlm()
all_words = [w.lower() for w in store.df["word"].to_list()]
trie = VocabTrie(all_words)
allowed = set(store.subset(SPEC_FILTERS["spec1"]).get_column("word").str.to_lowercase().to_list())
trie.tag(set(all_words) - allowed)
for seed in enumerate_seeds(store, "spec1", "cut", "fineweb_adult"):
    result = edit(seed, model=model, tokenizer=tokenizer, trie=trie)
    print(result.best, " coh =", result.coherence_best)
```
## Tests
uv run --package phonolex-generators pytest packages/generators/tests/ -v # unit (fast)
uv run --package phonolex-generators pytest packages/generators/tests/ -v -m slow # + MLM-loading
uv run --package phonolex-generators pytest packages/generators/tests/ -v \
-m "slow and acceptance" # PHON-64 v2 gold
## Dependencies
`phonolex_data` (WordStore + Parquet) · `phonolex_governors` (VocabTrie,
re-tagged per request via `trie.tag(banned)`) · `transformers` · `torch` ·
`polars`.
## See
- spec: `docs/superpowers/specs/2026-05-07-phon-95-mlm-editor-cfg-enumerator-design.md`
- plan: `docs/superpowers/plans/2026-05-07-phon-95-mlm-editor-cfg-enumerator.md`
- [ ] Step 2: Modify CLAUDE.md — Architecture and Project Structure sections
In CLAUDE.md under the Architecture section, add phonolex_generators to the list of in-house Python packages. The relevant existing block reads:
- Governor engine: packages/governors/ — word-level checker (G2P, phonology), Reranker (penalty-only trie steering), PunctuationBoost, VocabTrie, TargetedRolloutProcessor. Package name: phonolex_governors.
- Generation server: packages/generation/ — FastAPI + T5Gemma 9B-2B. Local dev via uvicorn, production via RunPod Serverless (scale-to-zero GPU). Cloudflare Worker proxies /api/generate-single to RunPod.
After the Governor engine line, insert:
- Generators (C1 combinatorial): packages/generators/ — productionized PHON-92 stack (PHON-95), generalized to prefix-walking decode. CFG seed enumerator + MLM iterative editor (per-content-slot trie walk over a tagged phonolex_governors.VocabTrie) + joint-mask PLL coherence scorer. Package name: phonolex_generators. Depends on phonolex_data + phonolex_governors.
In the Project Structure tree, add a generators/ block between governors/ and web/:
│ ├── generators/ # C1 combinatorial generation (phonolex_generators) — PHON-95
│ │ ├── src/phonolex_generators/
│ │ │ ├── cfg_seed/ # argstruc_enumerator + spec_filters
│ │ │ ├── editor/ # mlm_iterative_editor + trajectory + trie_filter
│ │ │ ├── scorer/ # joint_mask_pll
│ │ │ └── shared/ # mlm_loader + word_to_tokens
│ │ ├── tests/ # unit + acceptance (slow markers)
│ │ └── pyproject.toml
│ │
Also update the Dev Setup section's editable-install line so phonolex_generators is added.
Change:
uv pip install -e packages/data -e packages/governors
to:
uv pip install -e packages/data -e packages/governors -e packages/generators
- [ ] Step 3: Verify CLAUDE.md lints clean (no broken links / out-of-date references)
Run: git diff CLAUDE.md and visually scan for typos / missing newlines.
- [ ] Step 4: Commit
git add packages/generators/README.md CLAUDE.md
git commit -m "PHON-95 Task 11: docs — package README + CLAUDE.md update"
Task 12: Open OQ1–OQ6 Follow-Up Tickets (DRAFT — user approval before creating)¶
Files: none (Jira step)
The spec has 6 open implementation questions. Per user feedback (feedback_authorization_per_item), this task DRAFTS the ticket bodies in the plan document and surfaces them for explicit user authorization before creating any Jira issues. Do NOT call mcp__plugin_atlassian_atlassian__createJiraIssue until the user signs off.
- [ ] Step 1: Verify free PHON-XX numbers
Run JQL via the Atlassian MCP:
project = PHON ORDER BY created DESC
Read off the highest existing key. Per user feedback (feedback_verify_jira_state), don't promise specific numbers ahead of time — list which 6 are next free, in order.
- [ ] Step 2: Draft ticket bodies (paste into the plan as a comment block; do NOT create yet)
For each OQ, the draft has shape:
Title: PHON-95 OQ<N>: <one-line summary>
Workstream: <pick one from the 10>
Body:
Trigger: <the measurable failure mode that would activate this work>
v1 default (current): <as-is>
v2 candidate: <what changes>
Predecessor: PHON-95
Six drafts:
- OQ1 — Continuous PMI biasing (v1 default: Boolean admit, PPMI > 0; v2: α·ppmi logit bias). Trigger: a seed where the Boolean admit set produces a low-quality output AND a PPMI-ranked top-K differs meaningfully.
- OQ2 — Editor scaling to 10–15 tokens. Trigger: longer-sentence CFG productions land and best-of-N coherence stops improving over the seed.
- OQ3 — Subject-verb agreement / morphology. Trigger: any output flagged "morphologically wrong" by a clinician reviewer in the PHON-69 survey.
- OQ4 — Diversity at scale (best-of-8 → 10–20 unique). Trigger: a customer-facing batch task requires N distinct outputs > 4.
- OQ5 — Coherence robustness (N=50–100 sanity). Trigger: a degenerate output ranks above a well-formed one in any acceptance run.
- OQ6 — Editor fine-tune on PhonoLex CDS. Trigger: band="childes_*" outputs are qualitatively bad on the regression seeds.
- [ ] Step 3: Surface to user
Print the 6 drafts with proposed PHON-XX numbers and ask: "Approve creating these 6 OQ follow-up tickets in Jira? (yes/no/edit)"
- [ ] Step 4: Create on user approval (only if "yes")
For each approved draft, call mcp__plugin_atlassian_atlassian__createJiraIssue with the body above. Do NOT batch-create without per-ticket confirmation if the user says "edit."
- [ ] Step 5: No commit
This task creates Jira tickets, not code. No git operation required. The plan-doc trail is sufficient.
Self-Review¶
Spec coverage:
- §Scope In: — new package ✓ (T1), three modules ✓ (T3/T6/T8), per-request trie tagging on full vocab ✓ (T4/T6 — see deviation note below), boolean PMI admit ✓ (T8), acceptance test ✓ (T9), reproducible run script ✓ (T10).
- §Data contracts — WordStore.subset use ✓ (T7/T8), selectional.parquet PMI admit ✓ (T8), MLM weights singleton ✓ (T2), EditedSentence / Trajectory ✓ (T5).
- §Architecture three modules + shared MLM ✓ (T2/T3/T6/T8).
- §Acceptance criteria — (1) 5-seed compliance regression ✓ (T9), (2) module unit tests ✓ (T6/T7/T8/T3/T4), (3) 16-seed ≤ 30s ✓ (T9 perf gate), (4) cake/thunder PMI ✓ (T8), (5) phonolex_data no regressions ✓ (T9 step 3).
- §Open implementation questions — OQ1–OQ6 fan-out ✓ (T12, gated on user approval).
- §Plan handoff items 1–8 → tasks T1, T6, T8, T3, T9, T10, T11, T12 — all covered.
Placeholder scan: none. Every step contains exact paths and complete code.
Type consistency: Seed → (sentence, content_word_indices: tuple[int,...], locked_word_indices: tuple[int,...], spec_id, note) consistent across T5/T6/T8/T9. EditedSentence.unique_outputs: list[str] — note this is a list (not set) to preserve insertion order; T6 builds it via dict[str, None] to dedupe while preserving order, T9 does not assert ordering. VocabTrie (from phonolex_governors) used as the trie type in T4/T6/T9/T10/T11; per-request retag via trie.tag(banned).
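The order-preserving dedupe mentioned for unique_outputs can be sketched in a few lines. This is a minimal illustration, not the T6 code: dedupe_preserving_order and the sample sentences are hypothetical, and the only behavior relied on is that Python dicts keep insertion order (guaranteed since Python 3.7).

```python
def dedupe_preserving_order(outputs: list[str]) -> list[str]:
    """Drop repeated strings, keeping the first occurrence of each in order."""
    # dict.fromkeys builds a dict[str, None]; its keys are unique and ordered.
    return list(dict.fromkeys(outputs))


# Hypothetical best-of-N samples, with one repeat.
samples = ["the dog ate cake", "the cat ate cake", "the dog ate cake"]
unique_outputs = dedupe_preserving_order(samples)
```

A set would also dedupe, but it would lose the order in which samples were produced, which is why the field is typed as a list.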
Deliberate deviations from the spec:
- Spec §Scope says "Per-request small dict-trie (~500–2K words) ... Distinct from v6's static 126K-word marisa-trie (which stays in phonolex_governors and is not a dependency here)." The plan reuses phonolex_governors.VocabTrie (full-vocab marisa-trie + per-request tag(banned)) instead of building a parallel small dict-trie. Justification: we already own that infrastructure and it's the canonical representation; building a parallel small trie was over-engineering. phonolex_generators therefore depends on phonolex_governors.
- Spec §Module 2 says "Lift verbatim from probe_sampled_iterative.py. ... joint-mask all content positions, forward through MLM, intersect top-K logits with the per-request word trie, sample ... at temperature=0.7 from top-10 of the trie-filtered top-50." The plan generalizes this from a single-position complete-word filter (probe) to per-content-slot prefix-walking decode that walks the trie left-to-right across each slot's mask positions, gating per-position admits by dead_end_ratio < 1.0, and greedy-stopping when the accumulated prefix is is_end and not banned. Justification: this is what the trie was designed for (mirrors phonolex_governors.Reranker._steer_sequence); the probe's filter was a single-step special case that couldn't reach multi-token compliant words. The probe's gold output (sampled_locked_dedup_output.txt) becomes a smoke baseline rather than a byte-equality target.
Both deviations were authorized in the planning conversation on 2026-05-07; the spec text should be updated in a follow-up edit.
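To make the second deviation concrete, here is a toy sketch of prefix-walking decode over plain Python sets. It is not the VocabTrie API: has_compliant_completion and prefix_walk are hypothetical names, the set scan stands in for the trie's dead_end_ratio < 1.0 check, and a greedy argmax replaces the temperature sampling so the result is deterministic.

```python
from typing import Optional


def has_compliant_completion(prefix: str, vocab: set[str], banned: set[str]) -> bool:
    """True if at least one non-banned vocab word starts with `prefix`.

    Stand-in for "dead_end_ratio < 1.0" on a real trie.
    """
    return any(word.startswith(prefix) for word in vocab - banned)


def prefix_walk(
    per_position_candidates: list[list[tuple[str, float]]],
    vocab: set[str],
    banned: set[str],
) -> Optional[str]:
    """Walk a slot's mask positions left to right.

    At each position keep only fragments from which a compliant word is still
    reachable, pick one, and stop as soon as the accumulated prefix is itself
    a complete, non-banned word.
    """
    accumulated = ""
    for candidates in per_position_candidates:
        viable = [
            (fragment, score)
            for fragment, score in candidates
            if has_compliant_completion(accumulated + fragment, vocab, banned)
        ]
        if not viable:
            return None  # every candidate leads to a dead end
        # Assumption: greedy pick for determinism; the plan samples instead.
        fragment, _ = max(viable, key=lambda pair: pair[1])
        accumulated += fragment
        if accumulated in vocab and accumulated not in banned:
            return accumulated  # greedy stop on the first compliant word
    return None


# Toy data: "court" is banned, so the walk must continue to "courtroom".
vocab = {"court", "courtroom", "cat", "room"}
candidates = [
    [("court", 0.9), ("cat", 0.1)],  # position p_0
    [("room", 0.8), ("s", 0.2)],     # position p_1
]
fill = prefix_walk(candidates, vocab, banned={"court"})
```

With "court" banned, a first-position-only filter would have to reject the top candidate outright, whereas the walk keeps it as a viable prefix and completes it at the next position.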
Execution Handoff¶
Plan complete and saved to docs/superpowers/plans/2026-05-07-phon-95-mlm-editor-cfg-enumerator.md. Two execution options:
1. Subagent-Driven (recommended) — dispatch a fresh subagent per task, review between tasks, fast iteration. Best fit for this plan because Tasks 2–10 each have a single concrete deliverable + tests.
2. Inline Execution — execute tasks in this session using executing-plans, batch execution with checkpoints.
Which approach?