PHON-105: CSP Hybrid PPMI + Frequency Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Add freq_<slot> = log(count_v_r_f + 1) as a new score component for xcomp and ccomp slots only. Default weight 1.0 (via existing weights fallback). Validate with teacher-distilled reranker quality scores on 7 verbal-clause probes.
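To make the new component concrete, here is a minimal sketch of the log(count + 1) transform applied to a few made-up counts. The helper name freq_score and the counts are illustrative assumptions, not values taken from selectional.parquet.

```python
import math


def freq_score(count_v_r_f: int) -> float:
    """Hypothetical helper: the log(count + 1) transform named in the Goal."""
    return math.log(count_v_r_f + 1)


# Made-up counts, chosen only to show how the transform compresses large values.
for count in (0, 1, 99, 9999):
    print(count, round(freq_score(count), 3))
```

A zero count maps to 0.0, and the score grows only logarithmically with the count, so a very frequent filler gains a bounded advantage rather than a linear one.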
Architecture: Extend the slot-fillers tuple from (slot, fillers, scores) to (slot, fillers, scores, freq_scores) throughout skeleton_csp.py. For xcomp/ccomp, populate freq_scores from count_v_r_f in selectional.parquet rows. Other slots return empty {}. The vectorized + python-fallback enumeration paths both add freq_<slot> columns/components when populated. Equivalence tests extended with verbal probes; standalone eval script records reranker quality A/B.
Tech Stack: Python 3.12, Polars 1.0+, pytest. Reranker uses LightGBM + MiniLM-L6-v2 (already wired in quality_axis.py).
Spec: docs/superpowers/specs/2026-05-08-phon-105-csp-hybrid-ppmi-frequency-design.md
File map
| File | Action |
|---|---|
| packages/generation/research/2026-05-07-sentence-generation-paradigms/skeleton_csp.py | Modify: _slot_fillers, _build_slot_filler_tables, _enumerate_vectorized, _enumerate_python_fallback, _dedup_and_assemble, _compute_cartesian_size, solve_shape (tuple type) |
| packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py | Modify: add freq tests, extend equivalence parametrization |
| packages/generation/research/2026-05-07-sentence-generation-paradigms/eval_hybrid_xcomp_ccomp.py | Create: A/B eval script using teacher-distilled reranker |
| docs/superpowers/specs/2026-05-08-phon-105-csp-hybrid-ppmi-frequency-design.md | Modify: append empirical baseline numbers + win-rate decision |
All paths in this plan are relative to repo root /Users/jneumann/Repos/PhonoLex/. The spike directory is referenced as <spike>/ for brevity:
<spike>/ = packages/generation/research/2026-05-07-sentence-generation-paradigms/.
Test command throughout:
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py -v
Task 1: Extend _slot_fillers to return freq_scores; plumb the 4-tuple through consumers
Files:
- Modify: <spike>/skeleton_csp.py
- Modify: <spike>/test_vectorized_enumeration.py
This task changes the slot_fillers tuple shape from 3-tuple (slot, fillers, scores) to 4-tuple (slot, fillers, scores, freq_scores) everywhere. For non-verbal slots, freq_scores={}. For xcomp/ccomp, freq_scores={filler: log(count_v_r_f + 1) for ...}.
This is a pure shape extension — no scoring behavior changes yet. Consumers just receive the new dict but don't use it. Existing tests still pass.
- [ ] Step 1.1: Write failing test for _slot_fillers
Append to <spike>/test_vectorized_enumeration.py:
def test_slot_fillers_xcomp_returns_freq_scores(store, sel_df):
    """xcomp slot returns log(count+1) freq_scores for each filler."""
    import math
    import skeleton_csp

    fillers, scores, freq_scores = skeleton_csp._slot_fillers(
        slot="xcomp", verb="want", band="fineweb_adult",
        sel_df=sel_df, domain_words=frozenset(),
    )
    assert fillers, "want should have xcomp candidates in fineweb_adult"
    assert set(freq_scores.keys()) == set(fillers)
    # Every freq_score must be log(count+1): strictly positive for ppmi>0 rows
    for f in fillers:
        assert freq_scores[f] > 0, f"{f}: freq_scores[{f}]={freq_scores[f]} not positive"
    # Sanity: freq value should be log(count+1), so exp(freq) - 1 ≥ 1
    for f in fillers:
        recovered_count = math.exp(freq_scores[f]) - 1
        assert recovered_count >= 1.0, (
            f"{f}: recovered count {recovered_count} < 1 (freq={freq_scores[f]})"
        )


def test_slot_fillers_nsubj_returns_empty_freq_scores(store, sel_df):
    """Nominal slots return empty freq_scores dict (preserve PPMI-only behavior)."""
    import skeleton_csp

    spec_words = frozenset(["cat", "kid", "dog"])  # arbitrary domain
    fillers, scores, freq_scores = skeleton_csp._slot_fillers(
        slot="nsubj", verb="cut", band="fineweb_adult",
        sel_df=sel_df, domain_words=spec_words,
    )
    assert freq_scores == {}, f"nominal slot freq_scores should be empty, got {freq_scores}"


def test_slot_fillers_advmod_returns_empty_freq_scores(store, sel_df):
    """Advmod slot returns empty freq_scores (advmod has its own data path)."""
    import skeleton_csp

    fillers, scores, freq_scores = skeleton_csp._slot_fillers(
        slot="advmod", verb="cut", band="fineweb_adult",
        sel_df=sel_df, domain_words=frozenset(),
    )
    assert freq_scores == {}, f"advmod slot freq_scores should be empty, got {freq_scores}"
- [ ] Step 1.2: Run tests; they should fail with a ValueError (wrong tuple length)
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_slot_fillers_xcomp_returns_freq_scores -v
Expected: ValueError (not enough values to unpack: expected 3, got 2), because _slot_fillers currently returns a 2-tuple.
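A quick way to confirm the exact exception is to reproduce the unpacking in isolation. The sketch below uses a stand-in function (old_slot_fillers, a hypothetical name) that returns a 2-tuple the way the pre-change _slot_fillers does, and captures what Python raises when a caller unpacks three names from it.

```python
def old_slot_fillers():
    """Stand-in for the pre-change function: returns (fillers, scores) only."""
    return ["go"], {"go": 2.0}


try:
    fillers, scores, freq_scores = old_slot_fillers()
except ValueError as exc:
    # Unpacking too few values raises ValueError, not TypeError.
    message = str(exc)
    print(f"ValueError: {message}")
```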
- [ ] Step 1.3: Modify _slot_fillers in <spike>/skeleton_csp.py
The current _slot_fillers function (around line 474) has this structure:
def _slot_fillers(
    slot: str,
    *,
    verb: str,
    band: str,
    sel_df: pl.DataFrame,
    domain_words: frozenset[str],
    advmod_position: str = "final",
) -> tuple[list[str], dict[str, float]]:
    pmi_role = _slot_pmi_role(slot)
    if pmi_role is not None:
        rows = sel_df.filter(...)
        all_pmi = dict(zip(rows.get_column("filler").to_list(), rows.get_column("ppmi").to_list()))
        if slot in _VERBAL_SLOTS:
            fillers = sorted(f for f in all_pmi.keys() if f != verb)
        else:
            fillers = sorted(set(all_pmi.keys()) & domain_words)
        return fillers, {f: all_pmi[f] for f in fillers}
    if slot == "advmod":
        # ... advmod logic
        return [...], {...}
    if slot == "V":
        return [verb], {}
    raise ValueError(f"unsupported slot in solver: {slot}")
Update it to:
1. Change the return type annotation to tuple[list[str], dict[str, float], dict[str, float]].
2. For xcomp/ccomp: also extract count_v_r_f from rows and build freq_scores = {f: math.log(count[f] + 1) for f in fillers}.
3. For other PMI slots (nsubj, dobj, iobj, pobj_*): return empty dict for freq_scores.
4. For advmod and V: return empty dict for freq_scores.
Add import math at the top of the file if not already present.
The replacement function:
def _slot_fillers(
    slot: str,
    *,
    verb: str,
    band: str,
    sel_df: pl.DataFrame,
    domain_words: frozenset[str],
    advmod_position: str = "final",
) -> tuple[list[str], dict[str, float], dict[str, float]]:
    """Return (filler list, ppmi score lookup, frequency score lookup) for one slot.

    `freq_scores` is non-empty only for xcomp/ccomp (verbal slots), populated
    with log(count_v_r_f + 1) per filler. Other slots return empty {} so the
    downstream tuple shape stays uniform.

    For PMI slots, fillers = PMI(verb, role) intersected with the appropriate
    domain. For nominal slots (nsubj/dobj/iobj/pobj_X), we intersect with
    `domain_words` (noun-spec + user hard-constraints). For verbal slots
    (xcomp/ccomp), we do NOT intersect with `domain_words`: the spec was
    designed for matrix nominals; embedded verbs draw from the full PMI table.

    For advmod, fillers come from per-verb advmod-PMI table (PHON-94) with
    band-fallback to top-N most-common advmods. Position-aware filtering.
    """
    pmi_role = _slot_pmi_role(slot)
    if pmi_role is not None:
        rows = sel_df.filter(
            (pl.col("verb") == verb)
            & (pl.col("role") == pmi_role)
            & (pl.col("band") == band)
            & (pl.col("ppmi") > 0.0)
        )
        all_pmi = dict(
            zip(rows.get_column("filler").to_list(), rows.get_column("ppmi").to_list())
        )
        if slot in _VERBAL_SLOTS:
            # Exclude the matrix verb from its own xcomp/ccomp filler list
            fillers = sorted(f for f in all_pmi.keys() if f != verb)
            # Build frequency scores from count_v_r_f
            count_lookup = dict(
                zip(
                    rows.get_column("filler").to_list(),
                    rows.get_column("count_v_r_f").to_list(),
                )
            )
            freq_scores = {f: math.log(count_lookup[f] + 1) for f in fillers}
            return fillers, {f: all_pmi[f] for f in fillers}, freq_scores
        else:
            fillers = sorted(set(all_pmi.keys()) & domain_words)
            return fillers, {f: all_pmi[f] for f in fillers}, {}
    if slot == "advmod":
        verb_pmi = _advmod_pmi_for_verb(verb, band)
        if verb_pmi:
            raw = sorted(verb_pmi.keys())
            fillers = _filter_advmod_by_position(raw, advmod_position)
            return fillers, {f: verb_pmi[f] for f in fillers}, {}
        fallback = _advmod_band_fallback(band)
        if fallback:
            return _filter_advmod_by_position(list(fallback), advmod_position), {}, {}
        return [], {}, {}
    if slot == "V":
        return [verb], {}, {}
    raise ValueError(f"unsupported slot in solver: {slot}")
Verify import math is at the top of skeleton_csp.py. If not present, add it to the stdlib imports block.
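As a polars-free illustration of the xcomp/ccomp branch, the sketch below builds freq_scores from plain dictionaries standing in for the filtered selectional.parquet rows. The row values and the helper name build_freq_scores are assumptions made for illustration; only the column names filler, ppmi, and count_v_r_f come from the plan.

```python
import math

# Toy rows standing in for the filtered selectional.parquet slice (values invented).
rows = [
    {"filler": "go", "ppmi": 2.0, "count_v_r_f": 99},
    {"filler": "want", "ppmi": 0.4, "count_v_r_f": 3},  # the matrix verb itself
    {"filler": "do", "ppmi": 1.5, "count_v_r_f": 24},
]


def build_freq_scores(rows, verb):
    """Hypothetical helper mirroring the verbal-slot branch on plain dicts."""
    count_lookup = {r["filler"]: r["count_v_r_f"] for r in rows}
    # Exclude the matrix verb from its own filler list, as the plan does.
    fillers = sorted(f for f in count_lookup if f != verb)
    return fillers, {f: math.log(count_lookup[f] + 1) for f in fillers}


fillers, freq_scores = build_freq_scores(rows, verb="want")
print(fillers, freq_scores)
```

Because freq_scores is built from the same sorted filler list as the PPMI lookup, the two dicts always share the same key set, which is what the Step 1.1 test asserts.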
- [ ] Step 1.4: Plumb the 4-tuple through solve_shape
In solve_shape, find the loop that builds slot_fillers (around line 605):
slot_fillers: list[tuple[str, list[str], dict[str, float]]] = []
for slot in shape.slots:
    fillers, scores = _slot_fillers(
        slot, verb=verb, band=band, sel_df=sel_df, domain_words=domain_words,
        advmod_position=advmod_pos,
    )
    if not fillers:
        return []
    slot_fillers.append((slot, fillers, scores))
Replace with:
slot_fillers: list[tuple[str, list[str], dict[str, float], dict[str, float]]] = []
for slot in shape.slots:
    fillers, scores, freq_scores = _slot_fillers(
        slot, verb=verb, band=band, sel_df=sel_df, domain_words=domain_words,
        advmod_position=advmod_pos,
    )
    if not fillers:
        return []
    slot_fillers.append((slot, fillers, scores, freq_scores))
- [ ] Step 1.5: Update consumer signatures (no behavior change yet)
The following helpers in <spike>/skeleton_csp.py accept slot_fillers as a parameter and unpack the tuples in their bodies. Update their type annotations to the new 4-tuple shape and update the unpacking pattern to ignore the new freq_scores element for now (subsequent tasks will use it):
_enumerate_python_fallback — find the parameter type annotation:
def _enumerate_python_fallback(
    shape: SkeletonShape,
    slot_fillers: list[tuple[str, list[str], dict[str, float]]],
    ...
Change to:
def _enumerate_python_fallback(
    shape: SkeletonShape,
    slot_fillers: list[tuple[str, list[str], dict[str, float], dict[str, float]]],
    ...
In its body, find the line slot, fillers, scores = slot_fillers[idx] and change to slot, fillers, scores, _freq_scores = slot_fillers[idx] (underscore prefix marks unused for now).
_build_slot_filler_tables — find the parameter:
def _build_slot_filler_tables(
    slot_fillers: list[tuple[str, list[str], dict[str, float]]],
    locked_slots: dict[str, str],
) -> dict[str, pl.DataFrame]:
Change to:
def _build_slot_filler_tables(
    slot_fillers: list[tuple[str, list[str], dict[str, float], dict[str, float]]],
    locked_slots: dict[str, str],
) -> dict[str, pl.DataFrame]:
In its body, find for slot, fillers, scores in slot_fillers: and change to for slot, fillers, scores, _freq_scores in slot_fillers:.
_enumerate_vectorized — find:
def _enumerate_vectorized(
    shape: SkeletonShape,
    slot_fillers: list[tuple[str, list[str], dict[str, float]]],
    ...
Change to:
def _enumerate_vectorized(
    shape: SkeletonShape,
    slot_fillers: list[tuple[str, list[str], dict[str, float], dict[str, float]]],
    ...
(No body unpacking change needed — _enumerate_vectorized delegates to _build_slot_filler_tables.)
_compute_cartesian_size (added in PHON-104 stats fix) — find this function (search for def _compute_cartesian_size or for the snippet non_locked_sizes = [). Update its iteration over slot_fillers to ignore the 4th element:
If you find:
non_locked_sizes = [
    len(fillers) for slot, fillers, _ in slot_fillers
    if slot not in initial_locks
]
Change to:
non_locked_sizes = [
    len(fillers) for slot, fillers, _, _ in slot_fillers
    if slot not in initial_locks
]
Similarly for the next(f for s, f, _ in slot_fillers if s == "nsubj") patterns later in the same function — change to next(f for s, f, _, _ in slot_fillers if s == "nsubj") (and same for "dobj").
If you find similar for slot, fillers, scores in slot_fillers: patterns elsewhere in the file, update them the same way.
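One way to keep these unpacking sites consistent is a shared type alias. The sketch below is an assumption about how that could look: the alias SlotFillerEntry and the function cartesian_size are hypothetical names, not code from skeleton_csp.py, and the product-of-sizes logic is only a guess at what _compute_cartesian_size does. It shows the 4-tuple being unpacked with both score dicts ignored.

```python
import math

# Hypothetical alias for the (slot, fillers, scores, freq_scores) entries.
SlotFillerEntry = tuple[str, list[str], dict[str, float], dict[str, float]]


def cartesian_size(slot_fillers: list[SlotFillerEntry], initial_locks: dict[str, str]) -> int:
    """Product of filler counts over non-locked slots; both score dicts are ignored."""
    non_locked_sizes = [
        len(fillers) for slot, fillers, _, _ in slot_fillers
        if slot not in initial_locks
    ]
    return math.prod(non_locked_sizes)


slot_fillers: list[SlotFillerEntry] = [
    ("nsubj", ["she", "he"], {"she": 1.0, "he": 0.8}, {}),
    ("V", ["want"], {}, {}),
    ("xcomp", ["go", "do", "see"], {"go": 2.0, "do": 1.5, "see": 1.1},
     {"go": 4.6, "do": 3.2, "see": 2.9}),
]
print(cartesian_size(slot_fillers, initial_locks={"V": "want"}))
```

With an alias like this, a future change to the tuple shape would need one annotation edit instead of one per consumer, although every unpacking pattern would still have to be updated by hand.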
- [ ] Step 1.6: Run tests, verify all pass
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v
Expected: 47 prior tests + 3 new = 50 passed.
- [ ] Step 1.7: Commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/skeleton_csp.py \
  packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: extend _slot_fillers to return freq_scores (4-tuple plumbing)

_slot_fillers now returns (fillers, scores, freq_scores). For xcomp/ccomp,
freq_scores is populated with log(count_v_r_f + 1) per filler. Other slots
return empty {}. The slot_fillers tuple shape extends from 3-tuple to
4-tuple throughout the consumer chain (solve_shape, _build_slot_filler_tables,
_enumerate_vectorized, _enumerate_python_fallback, _compute_cartesian_size).

No scoring behavior changes yet: freq_scores is plumbed through but
consumers ignore it. Subsequent tasks add the score column / component.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 2: _build_slot_filler_tables adds freq_<slot> column when freq_scores non-empty
Files:
- Modify: <spike>/skeleton_csp.py
- Modify: <spike>/test_vectorized_enumeration.py
- [ ] Step 2.1: Write failing tests
Append to <spike>/test_vectorized_enumeration.py:
def test_build_slot_filler_tables_adds_freq_column_for_verbal_slots():
    """When freq_scores is non-empty for a slot, the table gets a freq_<slot> column."""
    slot_fillers = [
        ("nsubj", ["she"], {"she": 1.0}, {}),  # nominal, no freq
        ("V", ["want"], {}, {}),
        ("xcomp", ["go", "do"], {"go": 2.0, "do": 1.5}, {"go": 4.6, "do": 3.2}),
    ]
    tables = skeleton_csp._build_slot_filler_tables(slot_fillers, locked_slots={"V": "want"})
    # Nominal nsubj has no freq column
    assert set(tables["nsubj"].columns) == {"nsubj", "pmi_nsubj"}
    # Locked V has no freq column (freq_scores is empty for V)
    assert set(tables["V"].columns) == {"V", "pmi_V"}
    # xcomp has freq_xcomp column
    assert set(tables["xcomp"].columns) == {"xcomp", "pmi_xcomp", "freq_xcomp"}
    # freq values aligned with fillers
    xcomp_rows = dict(zip(
        tables["xcomp"]["xcomp"].to_list(),
        tables["xcomp"]["freq_xcomp"].to_list(),
    ))
    assert xcomp_rows == {"go": 4.6, "do": 3.2}
- [ ] Step 2.2: Run test, verify it fails
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_build_slot_filler_tables_adds_freq_column_for_verbal_slots -v
Expected: AssertionError — freq_xcomp not in columns.
- [ ] Step 2.3: Add freq column to _build_slot_filler_tables
Find the function in <spike>/skeleton_csp.py. Update the unpacking from _freq_scores to freq_scores and conditionally add the freq column:
def _build_slot_filler_tables(
    slot_fillers: list[tuple[str, list[str], dict[str, float], dict[str, float]]],
    locked_slots: dict[str, str],
) -> dict[str, pl.DataFrame]:
    """Build per-slot polars frames with `<slot>` (filler) + `pmi_<slot>` columns.

    For verbal slots (xcomp/ccomp) where freq_scores is non-empty, also adds
    a `freq_<slot>` column with log(count_v_r_f + 1) values.

    Locked slots produce a 1-row frame with the locked filler. Non-locked
    slots produce a |fillers|-row frame.
    """
    tables: dict[str, pl.DataFrame] = {}
    for slot, fillers, scores, freq_scores in slot_fillers:
        if slot in locked_slots:
            w = locked_slots[slot]
            cols = {
                slot: [w],
                f"pmi_{slot}": [scores.get(w, 0.0)],
            }
            if freq_scores:
                cols[f"freq_{slot}"] = [freq_scores.get(w, 0.0)]
            tables[slot] = pl.DataFrame(cols)
        else:
            cols = {
                slot: fillers,
                f"pmi_{slot}": [scores.get(f, 0.0) for f in fillers],
            }
            if freq_scores:
                cols[f"freq_{slot}"] = [freq_scores.get(f, 0.0) for f in fillers]
            tables[slot] = pl.DataFrame(cols)
    return tables
- [ ] Step 2.4: Run tests, verify all pass
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v
Expected: 51 passed (50 prior + 1 new).
- [ ] Step 2.5: Commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/skeleton_csp.py \
  packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: _build_slot_filler_tables adds freq_<slot> column for verbal slots

When freq_scores is non-empty (xcomp/ccomp), the slot's frame gets a
freq_<slot> column populated parallel to pmi_<slot>. Locked-slot 1-row
frames also carry the freq column. Non-verbal slots stay 2-column
(slot + pmi_slot).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 3: _enumerate_vectorized and _dedup_and_assemble recognize freq_* columns
Files:
- Modify: <spike>/skeleton_csp.py
- Modify: <spike>/test_vectorized_enumeration.py
- [ ] Step 3.1: Write failing test
Append to <spike>/test_vectorized_enumeration.py:
def test_enumerate_vectorized_freq_in_total_score():
    """Verbal-slot freq_<slot> column is summed into total_score."""
    shape = skeleton_csp.SkeletonShape(
        arg_structure="nsubj,V,xcomp",
        slots=("nsubj", "V", "xcomp"),
        band_freq=0,
    )
    slot_fillers = [
        ("nsubj", ["she"], {"she": 1.0}, {}),
        ("V", ["want"], {}, {}),
        ("xcomp", ["go"], {"go": 2.0}, {"go": 4.6}),
    ]
    cart = skeleton_csp._enumerate_vectorized(
        shape=shape, slot_fillers=slot_fillers, word_axes={},
        weights=None, locked_slots={"V": "want"},
    )
    assert "freq_xcomp" in cart.columns
    assert cart["freq_xcomp"].to_list() == [4.6]
    # total_score: pmi_nsubj=1.0 + pmi_V=0.0 + pmi_xcomp=2.0 + freq_xcomp=4.6 = 7.6
    assert cart["total_score"].to_list() == [7.6]


def test_dedup_and_assemble_freq_in_components():
    """freq_<slot> survives into score_components."""
    shape = skeleton_csp.SkeletonShape(
        arg_structure="nsubj,V,xcomp",
        slots=("nsubj", "V", "xcomp"),
        band_freq=0,
    )
    slot_fillers = [
        ("nsubj", ["she"], {"she": 1.0}, {}),
        ("V", ["want"], {}, {}),
        ("xcomp", ["go"], {"go": 2.0}, {"go": 4.6}),
    ]
    cart = skeleton_csp._enumerate_vectorized(
        shape=shape, slot_fillers=slot_fillers, word_axes={},
        weights=None, locked_slots={"V": "want"},
    )
    assembled = skeleton_csp._dedup_and_assemble(
        cart, shape, {}, {"V": "want"}, top_k=1, over_fetch=1,
    )
    assert len(assembled) == 1
    total, fillers, components = assembled[0]
    assert components["freq_xcomp"] == 4.6
    assert components["pmi_xcomp"] == 2.0
- [ ] Step 3.2: Run tests, verify they fail
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_enumerate_vectorized_freq_in_total_score research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_dedup_and_assemble_freq_in_components -v
Expected: 2 fail. The first because total_score doesn't include freq_xcomp yet (score_cols filter only matches pmi_*, c in word_axes, adv_sentinel). The second because score_cols in _dedup_and_assemble doesn't pick up freq_*.
- [ ] Step 3.3: Update _enumerate_vectorized's score_cols filter
Find this block in _enumerate_vectorized in <spike>/skeleton_csp.py:
# Total score = weighted sum of all score columns
score_cols = [
    c for c in cart.columns
    if c.startswith("pmi_") or c in word_axes or c == "adv_sentinel"
]
Replace with:
# Total score = weighted sum of all score columns
score_cols = [
    c for c in cart.columns
    if c.startswith("pmi_") or c.startswith("freq_") or c in word_axes or c == "adv_sentinel"
]
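The effect of the widened filter can be illustrated without polars. In the sketch below a single cartesian row is a plain dict, and the default weight of 1.0 is modeled with weights.get(col, 1.0); that lookup is an assumption about how the existing weights fallback behaves, and total_score here is a hypothetical helper rather than the solver's implementation.

```python
from typing import Optional


def total_score(row: dict, word_axes: dict, weights: Optional[dict] = None) -> float:
    """Weighted sum over score columns, using the same column filter as Step 3.3."""
    weights = weights or {}
    score_cols = [
        c for c in row
        if c.startswith("pmi_") or c.startswith("freq_") or c in word_axes or c == "adv_sentinel"
    ]
    # Assumed fallback: any column without an explicit weight counts at 1.0.
    return sum(weights.get(c, 1.0) * row[c] for c in score_cols)


# One row mirroring the Step 3.1 test data; filler columns are skipped by the filter.
row = {
    "nsubj": "she", "V": "want", "xcomp": "go",
    "pmi_nsubj": 1.0, "pmi_V": 0.0, "pmi_xcomp": 2.0, "freq_xcomp": 4.6,
}
print(total_score(row, word_axes={}))
print(total_score(row, word_axes={}, weights={"freq_xcomp": 0.0}))
```

Under these assumptions the first call includes freq_xcomp in the sum, and the second call removes its contribution without touching the PPMI columns.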
- [ ] Step 3.4: Update _dedup_and_assemble's score_cols filter
Find this block in _dedup_and_assemble:
# Identify score columns to copy into components; must match
# _enumerate_vectorized's score_cols filter exactly so any axis column
# contributing to total_score is also reported in components.
score_cols = [
    c for c in cart.columns
    if c.startswith("pmi_") or c in word_axes or c == "adv_sentinel"
]
Replace with:
# Identify score columns to copy into components; must match
# _enumerate_vectorized's score_cols filter exactly so any axis column
# contributing to total_score is also reported in components.
score_cols = [
    c for c in cart.columns
    if c.startswith("pmi_") or c.startswith("freq_") or c in word_axes or c == "adv_sentinel"
]
Also update the per-row drop logic. Find:
for c in score_cols:
    v = float(row[c])
    # Match python path's asymmetric drop:
    # - Locked slot with score 0: pmi_<slot> NEVER added (locked branch's
    #   `if locked_score > 0` guard).
    # - Non-locked slot with score 0: pmi_<slot> IS added (yield happens
    #   before the post-loop cleanup deletes it).
    if c.startswith("pmi_") and v == 0.0:
        slot_name = c[len("pmi_"):]
        if slot_name in locked_slots:
            continue
    # Per-word axis: python path drops 0-sum entries (`if total_axis != 0.0`)
    # before adding to components. Vectorized path mirrors this.
    if c in word_axes and v == 0.0:
        continue
    components[c] = v
Replace with:
for c in score_cols:
    v = float(row[c])
    # Match python path's asymmetric drop:
    # - Locked slot with score 0: pmi_<slot> / freq_<slot> NEVER added
    #   (locked branch's `if locked_score > 0` guard).
    # - Non-locked slot with score 0: pmi_<slot> / freq_<slot> IS added
    #   (yield happens before the post-loop cleanup deletes it).
    if (c.startswith("pmi_") or c.startswith("freq_")) and v == 0.0:
        # Strip the prefix to recover the slot name
        if c.startswith("pmi_"):
            slot_name = c[len("pmi_"):]
        else:
            slot_name = c[len("freq_"):]
        if slot_name in locked_slots:
            continue
    # Per-word axis: python path drops 0-sum entries (`if total_axis != 0.0`)
    # before adding to components. Vectorized path mirrors this.
    if c in word_axes and v == 0.0:
        continue
    components[c] = v
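The drop rule can also be read as a small predicate. The following sketch restates it on plain values so the asymmetry is easy to check by hand; keep_component is a hypothetical name, and the axis name imageability in the usage lines is invented.

```python
def keep_component(col: str, value: float, locked_slots: dict, word_axes: dict) -> bool:
    """Return True if a score column should be copied into components."""
    for prefix in ("pmi_", "freq_"):
        if col.startswith(prefix) and value == 0.0:
            # Zero scores are dropped only when the slot is locked.
            if col[len(prefix):] in locked_slots:
                return False
    # Per-word axes drop zero sums regardless of locking.
    if col in word_axes and value == 0.0:
        return False
    return True


locked = {"xcomp": "go"}
print(keep_component("freq_xcomp", 0.0, locked, {}))  # locked slot, zero score
print(keep_component("freq_xcomp", 0.0, {}, {}))      # non-locked slot, zero score
print(keep_component("imageability", 0.0, {}, {"imageability": {}}))
```

A zero freq_xcomp is dropped when xcomp is locked and kept when it is not, which is the same asymmetry the comments describe for pmi_<slot>.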
- [ ] Step 3.5: Run tests, verify all pass
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v
Expected: 53 passed (51 prior + 2 new).
- [ ] Step 3.6: Commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/skeleton_csp.py \
  packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: vectorized path recognizes freq_* columns

_enumerate_vectorized's total_score sum and _dedup_and_assemble's
component-assembly score_cols both include `c.startswith("freq_")`
parallel to pmi_*. Drop-on-zero-locked logic mirrors pmi_* exactly:
locked verbal slots with freq=0 are dropped from components;
non-locked verbal slots with freq=0 are kept (matches python path's
yield-before-cleanup semantics established in PHON-104).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 4: _enumerate_python_fallback mirrors freq_ bookkeeping
Files:
- Modify: <spike>/skeleton_csp.py
The python fallback's nested enumerate_assignments generator handles pmi_<slot> via running-components increment/decrement around the yield. We need parallel handling for freq_<slot> so the python and vectorized paths produce the same components dicts under the equivalence test (Task 7).
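The increment/decrement pattern is easier to see on a toy generator. The sketch below is not the solver's enumerate_assignments; it is a simplified stand-in (the name enumerate_toy and the data are assumptions, and locked slots are left out) that writes a slot's scores into a shared dict before recursing and removes them afterwards, so each yielded snapshot holds exactly the components of the current assignment.

```python
from typing import Iterator


def enumerate_toy(slot_fillers, idx=0, partial=None, running=None) -> Iterator[tuple[dict, dict]]:
    """Yield (assignment, components) snapshots for every filler combination."""
    partial = {} if partial is None else partial
    running = {} if running is None else running
    if idx == len(slot_fillers):
        # Yield copies: the shared dicts keep mutating after the yield.
        yield dict(partial), dict(running)
        return
    slot, fillers, scores, freq_scores = slot_fillers[idx]
    for f in fillers:
        partial[slot] = f
        running[f"pmi_{slot}"] = scores.get(f, 0.0)
        if freq_scores:
            running[f"freq_{slot}"] = freq_scores.get(f, 0.0)
        yield from enumerate_toy(slot_fillers, idx + 1, partial, running)
        # Undo this slot's contribution before trying the next filler.
        del partial[slot]
        running.pop(f"pmi_{slot}", None)
        running.pop(f"freq_{slot}", None)


slot_fillers = [
    ("nsubj", ["she"], {"she": 1.0}, {}),
    ("xcomp", ["go", "do"], {"go": 2.0, "do": 1.5}, {"go": 4.6, "do": 3.2}),
]
results = list(enumerate_toy(slot_fillers))
for assignment, components in results:
    print(assignment, components)
```

The real generator additionally distinguishes locked from non-locked slots and accumulates into existing keys; the toy keeps only the add-before-yield, undo-after-yield shape that the freq_<slot> handling has to follow.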
- [ ] Step 4.1: Update the nested generator's slot iteration
Find enumerate_assignments inside _enumerate_python_fallback. The generator currently looks like (post-Task-1's underscore-prefix):
slot, fillers, scores, _freq_scores = slot_fillers[idx]
if slot in partial:
    locked_word = partial[slot]
    locked_score = scores.get(locked_word, 0.0)
    comp_key = f"pmi_{slot}"
    if locked_score > 0:
        running_components[comp_key] = running_components.get(comp_key, 0.0) + locked_score
    yield from enumerate_assignments(idx + 1, partial, running_components)
    if locked_score > 0:
        running_components[comp_key] -= locked_score
        if abs(running_components.get(comp_key, 0.0)) < 1e-12:
            running_components.pop(comp_key, None)
    return
for f in fillers:
    partial[slot] = f
    comp_key = f"pmi_{slot}"
    score = scores.get(f, 0.0)
    running_components[comp_key] = score if comp_key not in running_components else running_components[comp_key] + score
    yield from enumerate_assignments(idx + 1, partial, running_components)
    del partial[slot]
    if comp_key in running_components:
        if score == 0.0:
            del running_components[comp_key]
        else:
            running_components[comp_key] -= score
            if abs(running_components[comp_key]) < 1e-12:
                del running_components[comp_key]
Replace with the freq-aware version. Use freq_scores (drop the underscore prefix since we now USE it):
slot, fillers, scores, freq_scores = slot_fillers[idx]
if slot in partial:
    locked_word = partial[slot]
    locked_score = scores.get(locked_word, 0.0)
    locked_freq = freq_scores.get(locked_word, 0.0) if freq_scores else 0.0
    pmi_key = f"pmi_{slot}"
    freq_key = f"freq_{slot}"
    if locked_score > 0:
        running_components[pmi_key] = running_components.get(pmi_key, 0.0) + locked_score
    if locked_freq > 0:
        running_components[freq_key] = running_components.get(freq_key, 0.0) + locked_freq
    yield from enumerate_assignments(idx + 1, partial, running_components)
    if locked_score > 0:
        running_components[pmi_key] -= locked_score
        if abs(running_components.get(pmi_key, 0.0)) < 1e-12:
            running_components.pop(pmi_key, None)
    if locked_freq > 0:
        running_components[freq_key] -= locked_freq
        if abs(running_components.get(freq_key, 0.0)) < 1e-12:
            running_components.pop(freq_key, None)
    return
for f in fillers:
    partial[slot] = f
    pmi_key = f"pmi_{slot}"
    freq_key = f"freq_{slot}"
    score = scores.get(f, 0.0)
    freq_score = freq_scores.get(f, 0.0) if freq_scores else 0.0
    running_components[pmi_key] = score if pmi_key not in running_components else running_components[pmi_key] + score
    if freq_scores:
        running_components[freq_key] = freq_score if freq_key not in running_components else running_components[freq_key] + freq_score
    yield from enumerate_assignments(idx + 1, partial, running_components)
    del partial[slot]
    if pmi_key in running_components:
        if score == 0.0:
            del running_components[pmi_key]
        else:
            running_components[pmi_key] -= score
            if abs(running_components[pmi_key]) < 1e-12:
                del running_components[pmi_key]
    if freq_scores and freq_key in running_components:
        if freq_score == 0.0:
            del running_components[freq_key]
        else:
            running_components[freq_key] -= freq_score
            if abs(running_components[freq_key]) < 1e-12:
                del running_components[freq_key]
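The pattern mirrors the existing pmi_<slot> bookkeeping: the locked branch adds freq_<slot> only when locked_freq > 0, while the non-locked branch adds it unconditionally and cleans up after the yield.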
- [ ] Step 4.2: Run all tests
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v
Expected: 53 passed. The Task 7 equivalence test will exercise the python path; the existing tests still pass at this point because none of them exercises python-path freq behavior yet.
- [ ] Step 4.3: Commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/skeleton_csp.py
git commit -m "$(cat <<'EOF'
PHON-105: _enumerate_python_fallback mirrors freq_<slot> bookkeeping

The nested generator now handles freq_<slot> in parallel with pmi_<slot>:
locked-branch adds only when locked_freq > 0 (drops on zero-for-locked);
non-locked branch adds unconditionally then cleans up post-yield (keeps
zero-freq in yielded components for non-locked slots). Mirrors the
exact asymmetry the vectorized path encodes via _dedup_and_assemble.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 5: End-to-end test: freq_xcomp/freq_ccomp in solve_shape output
Files:
- Modify: <spike>/test_vectorized_enumeration.py
- [ ] Step 5.1: Write the end-to-end tests
Append to <spike>/test_vectorized_enumeration.py:
def test_xcomp_solve_shape_produces_freq_xcomp_in_components(store, sel_df):
    """End-to-end: a real xcomp probe produces freq_xcomp in components."""
    import paradigm_3_csp
    from skeleton_csp import SkeletonShape, parse_arg_structure, solve_shape

    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,xcomp", parse_arg_structure("nsubj,V,xcomp"), 0)
    top = solve_shape(
        shape, verb="want", domain_words=spec_words, sel_df=sel_df,
        band="fineweb_adult", word_axes={}, cross_axes={},
        word_df=store.df, top_k=3,
    )
    assert top, "want should have xcomp candidates"
    for c in top:
        assert "freq_xcomp" in c["score_components"], (
            f"freq_xcomp missing on {c['sentence']!r}"
        )
        # log(count+1) > 0 for any non-zero count, and the ppmi>0 filter
        # already implied count_v_r_f > 0
        assert c["score_components"]["freq_xcomp"] > 0


def test_ccomp_solve_shape_produces_freq_ccomp_in_components(store, sel_df):
    """End-to-end: a real ccomp probe produces freq_ccomp in components."""
    import paradigm_3_csp
    from skeleton_csp import SkeletonShape, parse_arg_structure, solve_shape

    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,ccomp", parse_arg_structure("nsubj,V,ccomp"), 0)
    top = solve_shape(
        shape, verb="think", domain_words=spec_words, sel_df=sel_df,
        band="fineweb_adult", word_axes={}, cross_axes={},
        word_df=store.df, top_k=3,
    )
    assert top, "think should have ccomp candidates"
    for c in top:
        assert "freq_ccomp" in c["score_components"], (
            f"freq_ccomp missing on {c['sentence']!r}"
        )
        assert c["score_components"]["freq_ccomp"] > 0


def test_nominal_slots_have_no_freq_component(store, sel_df):
    """nsubj/dobj should NOT get freq_<slot> components."""
    import paradigm_3_csp
    from skeleton_csp import SkeletonShape, parse_arg_structure, solve_shape

    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,dobj", parse_arg_structure("nsubj,V,dobj"), 0)
    top = solve_shape(
        shape, verb="cut", domain_words=spec_words, sel_df=sel_df,
        band="fineweb_adult", word_axes={}, cross_axes={},
        word_df=store.df, top_k=3,
    )
    for c in top:
        assert "freq_nsubj" not in c["score_components"]
        assert "freq_dobj" not in c["score_components"]
- [ ] Step 5.2: Run tests, verify they pass
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py -v
Expected: 56 passed (53 prior + 3 new).
- [ ] Step 5.3: Commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: end-to-end tests for freq_<slot> in components

Three integration tests exercise the full slot_fillers → table → cart →
components pipeline: xcomp produces freq_xcomp; ccomp produces freq_ccomp;
nominal slots (nsubj/dobj) have NO freq component. All scores are
strictly positive (log(count+1) > 0 for any count > 0).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 6: Test weights={"freq_xcomp": 0.0} recovers PPMI-only ranking
Files:
- Modify: <spike>/test_vectorized_enumeration.py
- [ ] Step 6.1: Write the test
Append to <spike>/test_vectorized_enumeration.py:
def test_hybrid_weight_zero_recovers_ppmi_only_ranking(store, sel_df):
    """weights={'freq_xcomp': 0.0} disables the freq blend → ranking sorts purely by pmi_xcomp."""
    import paradigm_3_csp
    from skeleton_csp import SkeletonShape, parse_arg_structure, solve_shape

    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,xcomp", parse_arg_structure("nsubj,V,xcomp"), 0)
    common = dict(
        verb="want",
        domain_words=spec_words,
        sel_df=sel_df,
        band="fineweb_adult",
        word_axes={},
        cross_axes={},
        word_df=store.df,
        top_k=8,
    )
    # PPMI-only mode: zero out freq_xcomp weight
    ppmi_only = solve_shape(shape, weights={"freq_xcomp": 0.0}, **common)
    # In PPMI-only mode, ranking should be by pmi_xcomp desc.
    pmi_scores = [c["score_components"]["pmi_xcomp"] for c in ppmi_only]
    assert pmi_scores == sorted(pmi_scores, reverse=True), (
        f"pmi-only mode should rank by pmi_xcomp desc, got {pmi_scores}"
    )
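The intent of this test can be shown on invented numbers. In the sketch below a frequent filler outranks a rarer high-PPMI filler under the default weight, and setting the freq_xcomp weight to 0.0 restores the PPMI order. The candidate values and the rank helper are illustrative assumptions, and the weights.get(key, 1.0) fallback is a guess at how the default weight is applied.

```python
# Invented candidates: only the xcomp filler and its two score components.
candidates = [
    {"xcomp": "go", "pmi_xcomp": 1.2, "freq_xcomp": 4.6},
    {"xcomp": "abdicate", "pmi_xcomp": 3.0, "freq_xcomp": 0.7},
    {"xcomp": "do", "pmi_xcomp": 0.9, "freq_xcomp": 3.2},
]


def rank(cands, weights=None):
    """Hypothetical ranking: sort by the weighted sum of pmi_*/freq_* values."""
    weights = weights or {}

    def score(c):
        return sum(
            weights.get(k, 1.0) * v
            for k, v in c.items()
            if k.startswith(("pmi_", "freq_"))
        )

    return [c["xcomp"] for c in sorted(cands, key=score, reverse=True)]


hybrid = rank(candidates)
ppmi_only = rank(candidates, {"freq_xcomp": 0.0})
print(hybrid, ppmi_only)
```

Here the hybrid order puts the two frequent fillers first, while the zero-weight order follows pmi_xcomp alone.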
- [ ] Step 6.2: Run, verify pass
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_hybrid_weight_zero_recovers_ppmi_only_ranking -v
Expected: 1 passed.
- [ ] Step 6.3: Commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: test that freq_xcomp weight=0 recovers PPMI-only ranking

Verifies the existing per-axis weights mechanism cleanly disables the
freq blend. weights={"freq_xcomp": 0.0} → pmi_xcomp is the sole xcomp
score signal → ranking is monotonically descending in pmi_xcomp.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 7: Extend test_vectorized_matches_python with verbal probes
Files:
- Modify: <spike>/test_vectorized_enumeration.py
- [ ] Step 7.1: Find and extend the parametrize decorator
Find the existing @pytest.mark.parametrize for test_vectorized_matches_python:
@pytest.mark.parametrize("verb,spec_id,arg_structure", [
    ("cut", "spec1", "nsubj,V,dobj"),
    ("cut", "spec1", "nsubj,V,dobj,advmod"),
    ("chase", "spec1", "nsubj,V,dobj,advmod"),
    ("melt", "spec6", "nsubj,V,dobj,advmod"),
    ("eat", "spec1", "nsubj,V,dobj,advmod"),
    ("fill", "spec1", "nsubj,V,dobj,advmod"),
])
Replace with:
@pytest.mark.parametrize("verb,spec_id,arg_structure", [
    ("cut", "spec1", "nsubj,V,dobj"),
    ("cut", "spec1", "nsubj,V,dobj,advmod"),
    ("chase", "spec1", "nsubj,V,dobj,advmod"),
    ("melt", "spec6", "nsubj,V,dobj,advmod"),
    ("eat", "spec1", "nsubj,V,dobj,advmod"),
    ("fill", "spec1", "nsubj,V,dobj,advmod"),
    # Verbal-clause probes (PHON-105: exercise freq_<slot> equivalence)
    ("want", "spec1", "nsubj,V,xcomp"),
    ("think", "spec1", "nsubj,V,ccomp"),
])
- [ ] Step 7.2: Run the parametrized test, verify the new probes pass
cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_vectorized_matches_python -v
Expected: all 8 parametrized cases pass (6 nominal + 2 verbal).
If a verbal case fails on score_components mismatch, the python fallback's freq bookkeeping (Task 4) needs scrutiny. The most likely cause is a missed freq_scores lookup in either the locked-branch or the non-locked branch — verify both paths handle the freq_key parallel to pmi_key.
- [ ] Step 7.3: Commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: extend equivalence test with verbal-clause probes
Adds (want, spec1, nsubj,V,xcomp) and (think, spec1, nsubj,V,ccomp) to
test_vectorized_matches_python. Asserts bit-identical sentences,
total_score (within 1e-9), and score_components dicts between vectorized
and python paths for the verbal-slot freq blend.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 8: Build eval_hybrid_xcomp_ccomp.py and record empirical baseline
Files:
- Create: <spike>/eval_hybrid_xcomp_ccomp.py
- Modify: docs/superpowers/specs/2026-05-08-phon-105-csp-hybrid-ppmi-frequency-design.md
- [ ] Step 8.1: Create the eval script
Create <spike>/eval_hybrid_xcomp_ccomp.py:
"""Eval hybrid PPMI + freq blend on verbal-slot probes — PHON-105.
For 7 canonical xcomp/ccomp probes:
  1. Generate top-K=8 candidates under PPMI-only (freq weight = 0)
  2. Generate top-K=8 candidates under hybrid (default weights, both at 1.0)
  3. Score both with the teacher-distilled reranker (PHON-95 quality_axis)
  4. Compare mean Q per probe, print each mode's top-1 sentence, and count wins

Run: uv run python research/2026-05-07-sentence-generation-paradigms/eval_hybrid_xcomp_ccomp.py
"""
from __future__ import annotations

import sys
from pathlib import Path

import polars as pl

# Make the spike-local modules importable when run as a script.
sys.path.insert(0, str(Path(__file__).parent))

import paradigm_3_csp
from phonolex_data.runtime.store import WordStore
from quality_axis import score_candidates  # PHON-95 reranker entry point
from skeleton_csp import SkeletonShape, parse_arg_structure, solve_shape

PROBES = [
    ("want", "xcomp", "nsubj,V,xcomp"),
    ("try", "xcomp", "nsubj,V,xcomp"),
    ("like", "xcomp", "nsubj,V,xcomp"),
    ("need", "xcomp", "nsubj,V,xcomp"),
    ("think", "ccomp", "nsubj,V,ccomp"),
    ("know", "ccomp", "nsubj,V,ccomp"),
    ("see", "ccomp", "nsubj,V,ccomp"),
]


def _load_data() -> tuple[WordStore, pl.DataFrame]:
    # parents[4] walks up from the spike directory to the repo root.
    repo_root = Path(__file__).resolve().parents[4]
    store = WordStore.from_parquet(repo_root / "data" / "runtime" / "words.parquet")
    sel_df = pl.read_parquet(repo_root / "data" / "runtime" / "selectional.parquet")
    return store, sel_df


def _solve_with_weights(
    verb: str, slot_kind: str, arg: str,
    store: WordStore, sel_df: pl.DataFrame,
    weights: dict[str, float] | None,
) -> list[dict]:
    # slot_kind is not used here; the caller uses it to build the weights key.
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape(arg, parse_arg_structure(arg), 0)
    return solve_shape(
        shape, verb=verb, domain_words=spec_words, sel_df=sel_df,
        band="fineweb_adult", word_axes={}, cross_axes={},
        word_df=store.df, weights=weights, top_k=8,
    )


def main() -> None:
    print("Loading WordStore + selectional.parquet…")
    store, sel_df = _load_data()
    rows: list[dict] = []
    for verb, slot_kind, arg in PROBES:
        ppmi_only_weights = {f"freq_{slot_kind}": 0.0}
        hybrid_weights = None  # default 1.0 for all axes
        ppmi_top = _solve_with_weights(verb, slot_kind, arg, store, sel_df, ppmi_only_weights)
        hybrid_top = _solve_with_weights(verb, slot_kind, arg, store, sel_df, hybrid_weights)
        ppmi_scores = score_candidates(ppmi_top)
        hybrid_scores = score_candidates(hybrid_top)
        ppmi_mean = sum(ppmi_scores) / len(ppmi_scores) if ppmi_scores else 0.0
        hybrid_mean = sum(hybrid_scores) / len(hybrid_scores) if hybrid_scores else 0.0
        rows.append({
            "probe": f"{verb} {slot_kind}",
            "ppmi_mean_q": ppmi_mean,
            "hybrid_mean_q": hybrid_mean,
            "delta": hybrid_mean - ppmi_mean,
            "ppmi_top1": ppmi_top[0]["sentence"] if ppmi_top else "(none)",
            "hybrid_top1": hybrid_top[0]["sentence"] if hybrid_top else "(none)",
        })
    print(f"\n{'Probe':<14}{'PPMI Q':>10}{'Hybrid Q':>10}{'Δ':>8}  {'PPMI top-1':<40}{'Hybrid top-1':<40}")
    print("-" * 124)
    wins = 0
    for r in rows:
        # A probe counts as a hybrid win only on a strictly positive delta.
        if r["delta"] > 0:
            wins += 1
        print(
            f"{r['probe']:<14}"
            f"{r['ppmi_mean_q']:>10.3f}"
            f"{r['hybrid_mean_q']:>10.3f}"
            f"{r['delta']:>+8.3f}  "
            f"{r['ppmi_top1']:<40}"
            f"{r['hybrid_top1']:<40}"
        )
    print("-" * 124)
    print(f"Hybrid wins: {wins}/{len(rows)}")
    if wins >= 4:
        print(f"→ Decision: ship hybrid as default (≥4/{len(rows)} wins)")
    else:
        print(f"→ Decision: ship pure PPMI as default; freq_<slot> stays as opt-in axis ({wins}/{len(rows)} wins)")


if __name__ == "__main__":
    main()
If quality_axis.py doesn't expose score_candidates, inspect that file for the actual function name and update the import. The PHON-95 spike memo references quality_axis.py and train_reranker.py — read them once to get the right entrypoint.
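One quick way to do that inspection is to list the module's public functions with their signatures. The sketch below is a generic stdlib approach and makes no claim about what quality_axis.py actually contains. `public_callables` is a hypothetical helper, and `json` is used only as a stand-in module so the snippet runs anywhere.

```python
import importlib
import inspect
from types import ModuleType


def public_callables(module: ModuleType) -> dict[str, str]:
    """Map each public function defined in `module` to its signature string."""
    found: dict[str, str] = {}
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        if name.startswith("_") or obj.__module__ != module.__name__:
            continue  # skip private helpers and functions imported from elsewhere
        found[name] = str(inspect.signature(obj))
    return found


# Stand-in demo on a stdlib module. In the spike directory, swap "json"
# for "quality_axis" to see candidate reranker entry points.
for name, sig in public_callables(importlib.import_module("json")).items():
    print(f"{name}{sig}")
```

Whichever function takes a list of candidate dicts and returns one score per candidate is the one to import in place of `score_candidates`.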
- [ ] Step 8.2: Run the eval
cd packages/generation && uv run python research/2026-05-07-sentence-generation-paradigms/eval_hybrid_xcomp_ccomp.py
Capture the output table.
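The script prints mean Q per probe. If top-1 or top-3 quality is also wanted for the record, a helper along these lines could be applied to the score lists. It assumes the reranker returns one score per candidate in the same order as the input list (rank order), which should be checked against quality_axis.py. `top_n_mean` is a hypothetical name.

```python
def top_n_mean(scores: list[float], n: int) -> float:
    """Mean of the first `n` scores; 0.0 when the list is empty."""
    head = scores[:n]
    return sum(head) / len(head) if head else 0.0


# Illustrative scores only, ordered by CSP rank.
example_scores = [1.0, 0.5, 0.0, 0.25]
top1_q = top_n_mean(example_scores, 1)
top3_q = top_n_mean(example_scores, 3)
```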
- [ ] Step 8.3: Append the empirical baseline to the spec
Open docs/superpowers/specs/2026-05-08-phon-105-csp-hybrid-ppmi-frequency-design.md and find the "Validation plan" section's empty <FILL> table near the bottom. Replace the placeholder rows with the real numbers from Step 8.2 — one row per probe — and update the "Wins" footer.
Append a one-paragraph reconciliation note immediately below the table:
**Decision (recorded 2026-05-08):** hybrid wins <X>/7 probes. Per the criterion in this spec, <ship default | ship pure PPMI as default with freq_<slot> as opt-in>.
If hybrid wins ≥ 4/7, no further code change is required — the implementation is the default. If hybrid wins < 4/7, file a follow-up note in the spec recommending callers pass weights={"freq_xcomp": 0.0, "freq_ccomp": 0.0} until the blend is re-tuned (e.g., as part of PHON-107 reranker v2). DO NOT change defaults silently inside solve_shape in this ticket — the user will decide based on the recorded numbers.
- [ ] Step 8.4: Commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/eval_hybrid_xcomp_ccomp.py \
docs/superpowers/specs/2026-05-08-phon-105-csp-hybrid-ppmi-frequency-design.md
git commit -m "$(cat <<'EOF'
PHON-105: eval hybrid + record empirical baseline
eval_hybrid_xcomp_ccomp.py runs 7 verbal-slot probes (4 xcomp + 3 ccomp)
under PPMI-only and hybrid modes, scores both with the teacher-distilled
reranker, and reports per-probe Δ. Spec updated with the empirical table
and the win-count decision.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Done
After Task 8 commits, PHON-105 closes. The hybrid blend is wired throughout the CSP enumeration, and the eval script and the updated spec together record the empirical comparison. The user then decides, using the spec's win-rate criterion, whether the recorded result justifies shipping default-hybrid or default-PPMI.
PHON-106 (CSP constraint: maxopp contrastive scorer) is unblocked. PHON-107 (CSP reranker v2) is unblocked.