PHON-105: CSP Hybrid PPMI + Frequency Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Add freq_<slot> = log(count_v_r_f + 1) as a new score component for xcomp and ccomp slots only. Default weight 1.0 (via existing weights fallback). Validate with teacher-distilled reranker quality scores on 7 verbal-clause probes.
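
As a quick illustration of the transform named in the goal, the sketch below computes log(count + 1) for a few raw counts and adds it to a PPMI value under a weight of 1.0. The names `freq_score` and `hybrid_score` and all sample numbers are illustrative assumptions, not code from skeleton_csp.py.

```python
import math


def freq_score(count: int) -> float:
    """log(count + 1): zero for a count of 0, growing slowly with count."""
    return math.log(count + 1)


def hybrid_score(ppmi: float, count: int, freq_weight: float = 1.0) -> float:
    """PPMI plus the weighted frequency term (1.0 mirrors the plan's default weight)."""
    return ppmi + freq_weight * freq_score(count)


# Made-up counts: the log compresses a 1000x count gap into a gap of roughly 6.2.
for count in (1, 10, 100, 1000):
    print(count, round(freq_score(count), 3))
```

The log keeps very frequent fillers from swamping the PPMI term while still rewarding attested combinations over rare ones.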

Architecture: Extend the slot-fillers tuple from (slot, fillers, scores) to (slot, fillers, scores, freq_scores) throughout skeleton_csp.py. For xcomp/ccomp, populate freq_scores from count_v_r_f in selectional.parquet rows. Other slots return empty {}. The vectorized + python-fallback enumeration paths both add freq_<slot> columns/components when populated. Equivalence tests extended with verbal probes; standalone eval script records reranker quality A/B.

Tech Stack: Python 3.12, Polars 1.0+, pytest. Reranker uses LightGBM + MiniLM-L6-v2 (already wired in quality_axis.py).

Spec: docs/superpowers/specs/2026-05-08-phon-105-csp-hybrid-ppmi-frequency-design.md

File map

| File | Action |
| --- | --- |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/skeleton_csp.py` | Modify: `_slot_fillers`, `_build_slot_filler_tables`, `_enumerate_vectorized`, `_enumerate_python_fallback`, `_dedup_and_assemble`, `_compute_cartesian_size`, `solve_shape` (tuple type) |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py` | Modify: add freq tests, extend equivalence parametrization |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/eval_hybrid_xcomp_ccomp.py` | Create: A/B eval script using teacher-distilled reranker |
| `docs/superpowers/specs/2026-05-08-phon-105-csp-hybrid-ppmi-frequency-design.md` | Modify: append empirical baseline numbers + win-rate decision |

All paths in this plan are relative to repo root /Users/jneumann/Repos/PhonoLex/. The spike directory is referenced as <spike>/ for brevity: <spike>/ = packages/generation/research/2026-05-07-sentence-generation-paradigms/.

Test command throughout:

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py -v

Task 1: Extend `_slot_fillers` to return freq_scores; plumb the 4-tuple through consumers

Files:
- Modify: <spike>/skeleton_csp.py
- Modify: <spike>/test_vectorized_enumeration.py

This task changes the slot_fillers tuple shape from 3-tuple (slot, fillers, scores) to 4-tuple (slot, fillers, scores, freq_scores) everywhere. For non-verbal slots, freq_scores={}. For xcomp/ccomp, freq_scores={filler: log(count_v_r_f + 1) for ...}.

This is a pure shape extension — no scoring behavior changes yet. Consumers just receive the new dict but don't use it. Existing tests still pass.
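
To make the shape change concrete, here is a small sketch of the 4-tuple with made-up fillers and scores. The `SlotEntry` alias is an illustrative convenience (the plan spells the tuple type out inline), and the underscore-prefixed names show how a consumer can accept the new element without using it.

```python
# Illustrative alias only; skeleton_csp.py writes this type inline.
SlotEntry = tuple[str, list[str], dict[str, float], dict[str, float]]

# Made-up values: only the verbal slot carries a non-empty freq_scores dict.
slot_fillers: list[SlotEntry] = [
    ("nsubj", ["she"], {"she": 1.0}, {}),
    ("V", ["want"], {}, {}),
    ("xcomp", ["do", "go"], {"do": 1.5, "go": 2.0}, {"do": 3.2, "go": 4.6}),
]

# A consumer that does not use frequency yet unpacks the fourth element
# into an underscore-prefixed name and ignores it.
filler_counts = {
    slot: len(fillers) for slot, fillers, _scores, _freq_scores in slot_fillers
}

# Slots that later tasks will give a freq_<slot> component.
slots_with_freq = [slot for slot, _fillers, _scores, freq in slot_fillers if freq]
```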

[ ] Step 1.1: Write failing test for _slot_fillers

Append to <spike>/test_vectorized_enumeration.py:

def test_slot_fillers_xcomp_returns_freq_scores(store, sel_df):
    """xcomp slot returns log(count+1) freq_scores for each filler."""
    import math
    import skeleton_csp

    fillers, scores, freq_scores = skeleton_csp._slot_fillers(
        slot="xcomp", verb="want", band="fineweb_adult",
        sel_df=sel_df, domain_words=frozenset(),
    )
    assert fillers, "want should have xcomp candidates in fineweb_adult"
    assert set(freq_scores.keys()) == set(fillers)
    # Every freq_score must be log(count+1) — strictly positive for ppmi>0 rows
    for f in fillers:
        assert freq_scores[f] > 0, f"{f}: freq_scores[{f}]={freq_scores[f]} not positive"
    # Sanity: freq is log(count+1) with count >= 1, so exp(freq) - 1 >= 1.
    # The small tolerance absorbs the exp/log floating-point round trip.
    for f in fillers:
        recovered_count = math.exp(freq_scores[f]) - 1
        assert recovered_count >= 1.0 - 1e-9, (
            f"{f}: recovered count {recovered_count} < 1 (freq={freq_scores[f]})"
        )


def test_slot_fillers_nsubj_returns_empty_freq_scores(store, sel_df):
    """Nominal slots return empty freq_scores dict (preserve PPMI-only behavior)."""
    import skeleton_csp

    spec_words = frozenset(["cat", "kid", "dog"])  # arbitrary domain
    fillers, scores, freq_scores = skeleton_csp._slot_fillers(
        slot="nsubj", verb="cut", band="fineweb_adult",
        sel_df=sel_df, domain_words=spec_words,
    )
    assert freq_scores == {}, f"nominal slot freq_scores should be empty, got {freq_scores}"


def test_slot_fillers_advmod_returns_empty_freq_scores(store, sel_df):
    """Advmod slot returns empty freq_scores (advmod has its own data path)."""
    import skeleton_csp

    fillers, scores, freq_scores = skeleton_csp._slot_fillers(
        slot="advmod", verb="cut", band="fineweb_adult",
        sel_df=sel_df, domain_words=frozenset(),
    )
    assert freq_scores == {}, f"advmod slot freq_scores should be empty, got {freq_scores}"

[ ] Step 1.2: Run tests; they should fail while unpacking the return value

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_slot_fillers_xcomp_returns_freq_scores -v

Expected: ValueError (not enough values to unpack), because _slot_fillers currently returns a 2-tuple and the test unpacks three names.

[ ] Step 1.3: Modify _slot_fillers in <spike>/skeleton_csp.py

The current _slot_fillers function (around line 474) has this structure:

def _slot_fillers(
    slot: str,
    *,
    verb: str,
    band: str,
    sel_df: pl.DataFrame,
    domain_words: frozenset[str],
    advmod_position: str = "final",
) -> tuple[list[str], dict[str, float]]:
    pmi_role = _slot_pmi_role(slot)
    if pmi_role is not None:
        rows = sel_df.filter(...)
        all_pmi = dict(zip(rows.get_column("filler").to_list(), rows.get_column("ppmi").to_list()))
        if slot in _VERBAL_SLOTS:
            fillers = sorted(f for f in all_pmi.keys() if f != verb)
        else:
            fillers = sorted(set(all_pmi.keys()) & domain_words)
        return fillers, {f: all_pmi[f] for f in fillers}
    if slot == "advmod":
        # ... advmod logic
        return [...], {...}
    if slot == "V":
        return [verb], {}
    raise ValueError(f"unsupported slot in solver: {slot}")

Update it as follows:

1. Change the return type annotation to tuple[list[str], dict[str, float], dict[str, float]].
2. For xcomp/ccomp: also extract count_v_r_f from rows and build freq_scores = {f: math.log(count[f] + 1) for f in fillers}.
3. For other PMI slots (nsubj, dobj, iobj, pobj_*): return an empty dict for freq_scores.
4. For advmod and V: return an empty dict for freq_scores.

Add import math at the top of the file if not already present.

The replacement function:

def _slot_fillers(
    slot: str,
    *,
    verb: str,
    band: str,
    sel_df: pl.DataFrame,
    domain_words: frozenset[str],
    advmod_position: str = "final",
) -> tuple[list[str], dict[str, float], dict[str, float]]:
    """Return (filler list, ppmi score lookup, frequency score lookup) for one slot.

    `freq_scores` is non-empty only for xcomp/ccomp (verbal slots) — populated
    with log(count_v_r_f + 1) per filler. Other slots return empty {} so the
    downstream tuple shape stays uniform.

    For PMI slots, fillers = PMI(verb, role) intersected with the appropriate
    domain. For nominal slots (nsubj/dobj/iobj/pobj_X), we intersect with
    `domain_words` (noun-spec + user hard-constraints). For verbal slots
    (xcomp/ccomp), we do NOT intersect with `domain_words` — the spec was
    designed for matrix nominals; embedded verbs draw from the full PMI table.

    For advmod, fillers come from per-verb advmod-PMI table (PHON-94) with
    band-fallback to top-N most-common advmods. Position-aware filtering.
    """
    pmi_role = _slot_pmi_role(slot)
    if pmi_role is not None:
        rows = sel_df.filter(
            (pl.col("verb") == verb)
            & (pl.col("role") == pmi_role)
            & (pl.col("band") == band)
            & (pl.col("ppmi") > 0.0)
        )
        all_pmi = dict(
            zip(rows.get_column("filler").to_list(), rows.get_column("ppmi").to_list())
        )
        if slot in _VERBAL_SLOTS:
            # Exclude the matrix verb from its own xcomp/ccomp filler list
            fillers = sorted(f for f in all_pmi.keys() if f != verb)
            # Build frequency scores from count_v_r_f
            count_lookup = dict(
                zip(
                    rows.get_column("filler").to_list(),
                    rows.get_column("count_v_r_f").to_list(),
                )
            )
            freq_scores = {f: math.log(count_lookup[f] + 1) for f in fillers}
            return fillers, {f: all_pmi[f] for f in fillers}, freq_scores
        else:
            fillers = sorted(set(all_pmi.keys()) & domain_words)
            return fillers, {f: all_pmi[f] for f in fillers}, {}
    if slot == "advmod":
        verb_pmi = _advmod_pmi_for_verb(verb, band)
        if verb_pmi:
            raw = sorted(verb_pmi.keys())
            fillers = _filter_advmod_by_position(raw, advmod_position)
            return fillers, {f: verb_pmi[f] for f in fillers}, {}
        fallback = _advmod_band_fallback(band)
        if fallback:
            return _filter_advmod_by_position(list(fallback), advmod_position), {}, {}
        return [], {}, {}
    if slot == "V":
        return [verb], {}, {}
    raise ValueError(f"unsupported slot in solver: {slot}")

Verify import math is at the top of skeleton_csp.py. If not present, add it to the stdlib imports block.
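
To sanity-check the verbal-slot branch without loading selectional.parquet, the following sketch reproduces its logic on plain dicts. The row layout and the helper name `verbal_fillers` are hypothetical stand-ins for the filtered Polars rows, and the counts and PPMI values are made up.

```python
import math


def verbal_fillers(
    rows: list[dict], verb: str
) -> tuple[list[str], dict[str, float], dict[str, float]]:
    """Mimic the xcomp/ccomp branch: drop the matrix verb, log-scale the counts."""
    ppmi = {r["filler"]: r["ppmi"] for r in rows}
    counts = {r["filler"]: r["count_v_r_f"] for r in rows}
    fillers = sorted(f for f in ppmi if f != verb)
    scores = {f: ppmi[f] for f in fillers}
    freq_scores = {f: math.log(counts[f] + 1) for f in fillers}
    return fillers, scores, freq_scores


rows = [
    {"filler": "go", "ppmi": 2.0, "count_v_r_f": 99},
    {"filler": "want", "ppmi": 0.4, "count_v_r_f": 3},  # matrix verb, excluded
    {"filler": "do", "ppmi": 1.5, "count_v_r_f": 24},
]
fillers, scores, freq_scores = verbal_fillers(rows, verb="want")
```

Both lookups share the same key set as `fillers`, which is the invariant the xcomp test in Step 1.1 asserts.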

[ ] Step 1.4: Plumb the 4-tuple through solve_shape

In solve_shape, find the loop that builds slot_fillers (around line 605):

    slot_fillers: list[tuple[str, list[str], dict[str, float]]] = []
    for slot in shape.slots:
        fillers, scores = _slot_fillers(
            slot, verb=verb, band=band, sel_df=sel_df, domain_words=domain_words,
            advmod_position=advmod_pos,
        )
        if not fillers:
            return []
        slot_fillers.append((slot, fillers, scores))

Replace with:

    slot_fillers: list[tuple[str, list[str], dict[str, float], dict[str, float]]] = []
    for slot in shape.slots:
        fillers, scores, freq_scores = _slot_fillers(
            slot, verb=verb, band=band, sel_df=sel_df, domain_words=domain_words,
            advmod_position=advmod_pos,
        )
        if not fillers:
            return []
        slot_fillers.append((slot, fillers, scores, freq_scores))

[ ] Step 1.5: Update consumer signatures (no behavior change yet)

The following helpers in <spike>/skeleton_csp.py accept slot_fillers as a parameter and unpack the tuples in their bodies. Update their type annotations to the new 4-tuple shape and update the unpacking pattern to ignore the new freq_scores element for now (subsequent tasks will use it):

_enumerate_python_fallback — find the parameter type annotation:

def _enumerate_python_fallback(
    shape: SkeletonShape,
    slot_fillers: list[tuple[str, list[str], dict[str, float]]],
    ...

Change to:

def _enumerate_python_fallback(
    shape: SkeletonShape,
    slot_fillers: list[tuple[str, list[str], dict[str, float], dict[str, float]]],
    ...

In its body, find the line slot, fillers, scores = slot_fillers[idx] and change to slot, fillers, scores, _freq_scores = slot_fillers[idx] (underscore prefix marks unused for now).

_build_slot_filler_tables — find the parameter:

def _build_slot_filler_tables(
    slot_fillers: list[tuple[str, list[str], dict[str, float]]],
    locked_slots: dict[str, str],
) -> dict[str, pl.DataFrame]:

Change to:

def _build_slot_filler_tables(
    slot_fillers: list[tuple[str, list[str], dict[str, float], dict[str, float]]],
    locked_slots: dict[str, str],
) -> dict[str, pl.DataFrame]:

In its body, find for slot, fillers, scores in slot_fillers: and change to for slot, fillers, scores, _freq_scores in slot_fillers:.

_enumerate_vectorized — find:

def _enumerate_vectorized(
    shape: SkeletonShape,
    slot_fillers: list[tuple[str, list[str], dict[str, float]]],
    ...

Change to:

def _enumerate_vectorized(
    shape: SkeletonShape,
    slot_fillers: list[tuple[str, list[str], dict[str, float], dict[str, float]]],
    ...

(No body unpacking change needed — _enumerate_vectorized delegates to _build_slot_filler_tables.)

_compute_cartesian_size (added in PHON-104 stats fix) — find this function (search for def _compute_cartesian_size or for the snippet non_locked_sizes = [). Update its iteration over slot_fillers to ignore the 4th element:

If you find:

            non_locked_sizes = [
                len(fillers) for slot, fillers, _ in slot_fillers
                if slot not in initial_locks
            ]

Change to:

            non_locked_sizes = [
                len(fillers) for slot, fillers, _, _ in slot_fillers
                if slot not in initial_locks
            ]

Similarly for the next(f for s, f, _ in slot_fillers if s == "nsubj") patterns later in the same function — change to next(f for s, f, _, _ in slot_fillers if s == "nsubj") (and same for "dobj").

If you find similar for slot, fillers, scores in slot_fillers: patterns elsewhere in the file, update them the same way.
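
The unpacking change can be pictured with the sketch below. `cartesian_size` is a hypothetical, simplified stand-in (the real `_compute_cartesian_size` body is not reproduced in this plan); it only shows the fourth tuple element being skipped when counting candidates.

```python
import math

SlotEntry = tuple[str, list[str], dict[str, float], dict[str, float]]


def cartesian_size(slot_fillers: list[SlotEntry], initial_locks: dict[str, str]) -> int:
    """Product of filler-list lengths over the slots that are not locked."""
    non_locked_sizes = [
        len(fillers)
        for slot, fillers, _, _ in slot_fillers  # scores and freq_scores unused here
        if slot not in initial_locks
    ]
    return math.prod(non_locked_sizes)


slot_fillers = [
    ("nsubj", ["she", "he", "they"], {}, {}),
    ("V", ["want"], {}, {}),
    ("xcomp", ["do", "go"], {"do": 1.5, "go": 2.0}, {"do": 3.2, "go": 4.6}),
]
size = cartesian_size(slot_fillers, initial_locks={"V": "want"})
```

A pattern left at three names would raise a ValueError on the first 4-tuple, so a missed call site tends to show up immediately when the tests run.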

[ ] Step 1.6: Run tests, verify all pass

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v

Expected: 47 prior tests + 3 new = 50 passed.

[ ] Step 1.7: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/skeleton_csp.py \
        packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: extend _slot_fillers to return freq_scores (4-tuple plumbing)

_slot_fillers now returns (fillers, scores, freq_scores). For xcomp/ccomp,
freq_scores is populated with log(count_v_r_f + 1) per filler. Other slots
return empty {}. The slot_fillers tuple shape extends from 3-tuple to
4-tuple throughout the consumer chain (solve_shape, _build_slot_filler_tables,
_enumerate_vectorized, _enumerate_python_fallback, _compute_cartesian_size).

No scoring behavior changes yet — freq_scores is plumbed through but
consumers ignore it. Subsequent tasks add the score column / component.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 2: `_build_slot_filler_tables` adds `freq_<slot>` column when freq_scores non-empty

Files:
- Modify: <spike>/skeleton_csp.py
- Modify: <spike>/test_vectorized_enumeration.py

[ ] Step 2.1: Write failing tests

Append to <spike>/test_vectorized_enumeration.py:

def test_build_slot_filler_tables_adds_freq_column_for_verbal_slots():
    """When freq_scores is non-empty for a slot, the table gets a freq_<slot> column."""
    slot_fillers = [
        ("nsubj", ["she"], {"she": 1.0}, {}),  # nominal, no freq
        ("V", ["want"], {}, {}),
        ("xcomp", ["go", "do"], {"go": 2.0, "do": 1.5}, {"go": 4.6, "do": 3.2}),
    ]
    tables = skeleton_csp._build_slot_filler_tables(slot_fillers, locked_slots={"V": "want"})

    # Nominal nsubj has no freq column
    assert set(tables["nsubj"].columns) == {"nsubj", "pmi_nsubj"}
    # Locked V has no freq column (freq_scores is empty for V)
    assert set(tables["V"].columns) == {"V", "pmi_V"}
    # xcomp has freq_xcomp column
    assert set(tables["xcomp"].columns) == {"xcomp", "pmi_xcomp", "freq_xcomp"}
    # freq values aligned with fillers
    xcomp_rows = dict(zip(
        tables["xcomp"]["xcomp"].to_list(),
        tables["xcomp"]["freq_xcomp"].to_list(),
    ))
    assert xcomp_rows == {"go": 4.6, "do": 3.2}

[ ] Step 2.2: Run test, verify it fails

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_build_slot_filler_tables_adds_freq_column_for_verbal_slots -v

Expected: AssertionError — freq_xcomp not in columns.

[ ] Step 2.3: Add freq column to _build_slot_filler_tables

Find the function in <spike>/skeleton_csp.py. Update the unpacking from _freq_scores to freq_scores and conditionally add the freq column:

def _build_slot_filler_tables(
    slot_fillers: list[tuple[str, list[str], dict[str, float], dict[str, float]]],
    locked_slots: dict[str, str],
) -> dict[str, pl.DataFrame]:
    """Build per-slot polars frames with `<slot>` (filler) + `pmi_<slot>` columns.
    For verbal slots (xcomp/ccomp) where freq_scores is non-empty, also adds
    a `freq_<slot>` column with log(count_v_r_f + 1) values.

    Locked slots produce a 1-row frame with the locked filler. Non-locked
    slots produce a |fillers|-row frame.
    """
    tables: dict[str, pl.DataFrame] = {}
    for slot, fillers, scores, freq_scores in slot_fillers:
        if slot in locked_slots:
            w = locked_slots[slot]
            cols = {
                slot: [w],
                f"pmi_{slot}": [scores.get(w, 0.0)],
            }
            if freq_scores:
                cols[f"freq_{slot}"] = [freq_scores.get(w, 0.0)]
            tables[slot] = pl.DataFrame(cols)
        else:
            cols = {
                slot: fillers,
                f"pmi_{slot}": [scores.get(f, 0.0) for f in fillers],
            }
            if freq_scores:
                cols[f"freq_{slot}"] = [freq_scores.get(f, 0.0) for f in fillers]
            tables[slot] = pl.DataFrame(cols)
    return tables

[ ] Step 2.4: Run tests, verify all pass

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v

Expected: 51 passed (50 prior + 1 new).

[ ] Step 2.5: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/skeleton_csp.py \
        packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: _build_slot_filler_tables adds freq_<slot> column for verbal slots

When freq_scores is non-empty (xcomp/ccomp), the slot's frame gets a
freq_<slot> column populated parallel to pmi_<slot>. Locked-slot 1-row
frames also carry the freq column. Non-verbal slots stay 2-column
(slot + pmi_slot).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 3: `_enumerate_vectorized` and `_dedup_and_assemble` recognize `freq_*` columns

Files:
- Modify: <spike>/skeleton_csp.py
- Modify: <spike>/test_vectorized_enumeration.py

[ ] Step 3.1: Write failing test

Append to <spike>/test_vectorized_enumeration.py:

def test_enumerate_vectorized_freq_in_total_score():
    """Verbal-slot freq_<slot> column is summed into total_score."""
    shape = skeleton_csp.SkeletonShape(
        arg_structure="nsubj,V,xcomp",
        slots=("nsubj", "V", "xcomp"),
        band_freq=0,
    )
    slot_fillers = [
        ("nsubj", ["she"], {"she": 1.0}, {}),
        ("V", ["want"], {}, {}),
        ("xcomp", ["go"], {"go": 2.0}, {"go": 4.6}),
    ]
    cart = skeleton_csp._enumerate_vectorized(
        shape=shape, slot_fillers=slot_fillers, word_axes={},
        weights=None, locked_slots={"V": "want"},
    )
    assert "freq_xcomp" in cart.columns
    assert cart["freq_xcomp"].to_list() == [4.6]
    # total_score: pmi_nsubj=1.0 + pmi_V=0.0 + pmi_xcomp=2.0 + freq_xcomp=4.6 = 7.6
    assert cart["total_score"].to_list() == [7.6]


def test_dedup_and_assemble_freq_in_components():
    """freq_<slot> survives into score_components."""
    shape = skeleton_csp.SkeletonShape(
        arg_structure="nsubj,V,xcomp",
        slots=("nsubj", "V", "xcomp"),
        band_freq=0,
    )
    slot_fillers = [
        ("nsubj", ["she"], {"she": 1.0}, {}),
        ("V", ["want"], {}, {}),
        ("xcomp", ["go"], {"go": 2.0}, {"go": 4.6}),
    ]
    cart = skeleton_csp._enumerate_vectorized(
        shape=shape, slot_fillers=slot_fillers, word_axes={},
        weights=None, locked_slots={"V": "want"},
    )
    assembled = skeleton_csp._dedup_and_assemble(
        cart, shape, {}, {"V": "want"}, top_k=1, over_fetch=1,
    )
    assert len(assembled) == 1
    total, fillers, components = assembled[0]
    assert components["freq_xcomp"] == 4.6
    assert components["pmi_xcomp"] == 2.0

[ ] Step 3.2: Run tests, verify they fail

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_enumerate_vectorized_freq_in_total_score research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_dedup_and_assemble_freq_in_components -v

Expected: both tests fail. The first fails because total_score does not yet include freq_xcomp (the score_cols filter matches only pmi_* columns, columns in word_axes, and adv_sentinel). The second fails because score_cols in _dedup_and_assemble does not pick up freq_* columns.

[ ] Step 3.3: Update _enumerate_vectorized's score_cols filter

Find this block in _enumerate_vectorized in <spike>/skeleton_csp.py:

    # Total score = weighted sum of all score columns
    score_cols = [
        c for c in cart.columns
        if c.startswith("pmi_") or c in word_axes or c == "adv_sentinel"
    ]

Replace with:

    # Total score = weighted sum of all score columns
    score_cols = [
        c for c in cart.columns
        if c.startswith("pmi_") or c.startswith("freq_") or c in word_axes or c == "adv_sentinel"
    ]

[ ] Step 3.4: Update _dedup_and_assemble's score_cols filter

Find this block in _dedup_and_assemble:

    # Identify score columns to copy into components — must match
    # _enumerate_vectorized's score_cols filter exactly so any axis column
    # contributing to total_score is also reported in components.
    score_cols = [
        c for c in cart.columns
        if c.startswith("pmi_") or c in word_axes or c == "adv_sentinel"
    ]

Replace with:

    # Identify score columns to copy into components — must match
    # _enumerate_vectorized's score_cols filter exactly so any axis column
    # contributing to total_score is also reported in components.
    score_cols = [
        c for c in cart.columns
        if c.startswith("pmi_") or c.startswith("freq_") or c in word_axes or c == "adv_sentinel"
    ]
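
Because the two filters must stay identical, one option is to route both through a single predicate. The helper name `is_score_column` and the axis name below are hypothetical, and this refactor is not one of the plan's required changes; it is only a sketch of how the duplication could be avoided.

```python
def is_score_column(col: str, word_axes: dict) -> bool:
    """True for any column that contributes to total_score."""
    return (
        col.startswith(("pmi_", "freq_"))
        or col in word_axes
        or col == "adv_sentinel"
    )


columns = ["nsubj", "pmi_nsubj", "xcomp", "pmi_xcomp", "freq_xcomp", "aoa", "adv_sentinel"]
word_axes = {"aoa": {}}  # made-up axis name
score_cols = [c for c in columns if is_score_column(c, word_axes)]
```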

Also update the per-row drop logic. Find:

        for c in score_cols:
            v = float(row[c])
            # Match python path's asymmetric drop:
            # - Locked slot with score 0: pmi_<slot> NEVER added (locked branch's
            #   `if locked_score > 0` guard).
            # - Non-locked slot with score 0: pmi_<slot> IS added (yield happens
            #   before the post-loop cleanup deletes it).
            if c.startswith("pmi_") and v == 0.0:
                slot_name = c[len("pmi_"):]
                if slot_name in locked_slots:
                    continue
            # Per-word axis: python path drops 0-sum entries (`if total_axis != 0.0`)
            # before adding to components. Vectorized path mirrors this.
            if c in word_axes and v == 0.0:
                continue
            components[c] = v

Replace with:

        for c in score_cols:
            v = float(row[c])
            # Match python path's asymmetric drop:
            # - Locked slot with score 0: pmi_<slot> / freq_<slot> NEVER added
            #   (locked branch's `if locked_score > 0` guard).
            # - Non-locked slot with score 0: pmi_<slot> / freq_<slot> IS added
            #   (yield happens before the post-loop cleanup deletes it).
            if (c.startswith("pmi_") or c.startswith("freq_")) and v == 0.0:
                # Strip the prefix to recover the slot name
                if c.startswith("pmi_"):
                    slot_name = c[len("pmi_"):]
                else:
                    slot_name = c[len("freq_"):]
                if slot_name in locked_slots:
                    continue
            # Per-word axis: python path drops 0-sum entries (`if total_axis != 0.0`)
            # before adding to components. Vectorized path mirrors this.
            if c in word_axes and v == 0.0:
                continue
            components[c] = v
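
The asymmetric drop is easier to see on a single row. The sketch below restates the loop as a standalone function over a plain dict; `assemble_components` is a hypothetical simplification (no dedup, no top-k) and the row values are made up.

```python
def assemble_components(
    row: dict[str, float],
    score_cols: list[str],
    locked_slots: dict[str, str],
    word_axes: dict,
) -> dict[str, float]:
    """Copy score columns into components, applying the asymmetric zero-drop."""
    components: dict[str, float] = {}
    for c in score_cols:
        v = float(row[c])
        slot_name = None
        if c.startswith("pmi_"):
            slot_name = c.removeprefix("pmi_")
        elif c.startswith("freq_"):
            slot_name = c.removeprefix("freq_")
        if v == 0.0 and slot_name in locked_slots:
            continue  # locked slot with a zero score: never reported
        if v == 0.0 and c in word_axes:
            continue  # zero-sum per-word axis: dropped
        components[c] = v
    return components


row = {"pmi_nsubj": 1.0, "pmi_xcomp": 2.0, "freq_xcomp": 0.0}
cols = list(row)
when_locked = assemble_components(row, cols, {"xcomp": "go"}, {})
when_unlocked = assemble_components(row, cols, {}, {})
```

The same zero-valued freq_xcomp is dropped when xcomp is locked and kept when it is not, which is the behavior the python fallback must reproduce.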

[ ] Step 3.5: Run tests, verify all pass

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v

Expected: 53 passed (51 prior + 2 new).

[ ] Step 3.6: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/skeleton_csp.py \
        packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: vectorized path recognizes freq_* columns

_enumerate_vectorized's total_score sum and _dedup_and_assemble's
component-assembly score_cols both include `c.startswith("freq_")`
parallel to pmi_*. Drop-on-zero-locked logic mirrors pmi_* exactly:
locked verbal slots with freq=0 are dropped from components;
non-locked verbal slots with freq=0 are kept (matches python path's
yield-before-cleanup semantics established in PHON-104).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 4: `_enumerate_python_fallback` mirrors freq_ bookkeeping

Files:
- Modify: <spike>/skeleton_csp.py

The python fallback's nested enumerate_assignments generator handles pmi_<slot> via running-components increment/decrement around the yield. We need parallel handling for freq_<slot> so the python and vectorized paths produce the same components dicts under the equivalence test (Task 6).

[ ] Step 4.1: Update the nested generator's slot iteration

Find enumerate_assignments inside _enumerate_python_fallback. After Task 1's underscore-prefix change, the generator currently looks like this:

        slot, fillers, scores, _freq_scores = slot_fillers[idx]
        if slot in partial:
            locked_word = partial[slot]
            locked_score = scores.get(locked_word, 0.0)
            comp_key = f"pmi_{slot}"
            if locked_score > 0:
                running_components[comp_key] = running_components.get(comp_key, 0.0) + locked_score
            yield from enumerate_assignments(idx + 1, partial, running_components)
            if locked_score > 0:
                running_components[comp_key] -= locked_score
                if abs(running_components.get(comp_key, 0.0)) < 1e-12:
                    running_components.pop(comp_key, None)
            return
        for f in fillers:
            partial[slot] = f
            comp_key = f"pmi_{slot}"
            score = scores.get(f, 0.0)
            running_components[comp_key] = score if comp_key not in running_components else running_components[comp_key] + score
            yield from enumerate_assignments(idx + 1, partial, running_components)
            del partial[slot]
            if comp_key in running_components:
                if score == 0.0:
                    del running_components[comp_key]
                else:
                    running_components[comp_key] -= score
                    if abs(running_components[comp_key]) < 1e-12:
                        del running_components[comp_key]

Replace with the freq-aware version. Use freq_scores (drop the underscore prefix since we now USE it):

        slot, fillers, scores, freq_scores = slot_fillers[idx]
        if slot in partial:
            locked_word = partial[slot]
            locked_score = scores.get(locked_word, 0.0)
            locked_freq = freq_scores.get(locked_word, 0.0) if freq_scores else 0.0
            pmi_key = f"pmi_{slot}"
            freq_key = f"freq_{slot}"
            if locked_score > 0:
                running_components[pmi_key] = running_components.get(pmi_key, 0.0) + locked_score
            if locked_freq > 0:
                running_components[freq_key] = running_components.get(freq_key, 0.0) + locked_freq
            yield from enumerate_assignments(idx + 1, partial, running_components)
            if locked_score > 0:
                running_components[pmi_key] -= locked_score
                if abs(running_components.get(pmi_key, 0.0)) < 1e-12:
                    running_components.pop(pmi_key, None)
            if locked_freq > 0:
                running_components[freq_key] -= locked_freq
                if abs(running_components.get(freq_key, 0.0)) < 1e-12:
                    running_components.pop(freq_key, None)
            return
        for f in fillers:
            partial[slot] = f
            pmi_key = f"pmi_{slot}"
            freq_key = f"freq_{slot}"
            score = scores.get(f, 0.0)
            freq_score = freq_scores.get(f, 0.0) if freq_scores else 0.0
            running_components[pmi_key] = score if pmi_key not in running_components else running_components[pmi_key] + score
            if freq_scores:
                running_components[freq_key] = freq_score if freq_key not in running_components else running_components[freq_key] + freq_score
            yield from enumerate_assignments(idx + 1, partial, running_components)
            del partial[slot]
            if pmi_key in running_components:
                if score == 0.0:
                    del running_components[pmi_key]
                else:
                    running_components[pmi_key] -= score
                    if abs(running_components[pmi_key]) < 1e-12:
                        del running_components[pmi_key]
            if freq_scores and freq_key in running_components:
                if freq_score == 0.0:
                    del running_components[freq_key]
                else:
                    running_components[freq_key] -= freq_score
                    if abs(running_components[freq_key]) < 1e-12:
                        del running_components[freq_key]

The pattern mirrors the existing pmi_ handling exactly: at yield time, the freq value is part of the running_components dict (so the yielded components dict carries it for non-locked slots even when freq=0 — matching the asymmetric-drop semantics established in PHON-104). After yield, cleanup deletes it for the next iteration.

[ ] Step 4.2: Run all tests

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v

Expected: 53 passed. The existing tests still pass because none of them exercises the python path's freq behavior yet; the Task 6 equivalence test will cover it.

[ ] Step 4.3: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/skeleton_csp.py
git commit -m "$(cat <<'EOF'
PHON-105: _enumerate_python_fallback mirrors freq_<slot> bookkeeping

The nested generator now handles freq_<slot> in parallel with pmi_<slot>:
locked-branch adds only when locked_freq > 0 (drops on zero-for-locked);
non-locked branch adds unconditionally then cleans up post-yield (keeps
zero-freq in yielded components for non-locked slots). Mirrors the
exact asymmetry the vectorized path encodes via _dedup_and_assemble.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 5: End-to-end test for freq_xcomp/freq_ccomp in solve_shape output

Files:
- Modify: <spike>/test_vectorized_enumeration.py

[ ] Step 5.1: Write the end-to-end tests

Append to <spike>/test_vectorized_enumeration.py:

def test_xcomp_solve_shape_produces_freq_xcomp_in_components(store, sel_df):
    """End-to-end: a real xcomp probe produces freq_xcomp in components."""
    import paradigm_3_csp
    from skeleton_csp import SkeletonShape, parse_arg_structure, solve_shape

    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,xcomp", parse_arg_structure("nsubj,V,xcomp"), 0)
    top = solve_shape(
        shape, verb="want", domain_words=spec_words, sel_df=sel_df,
        band="fineweb_adult", word_axes={}, cross_axes={},
        word_df=store.df, top_k=3,
    )
    assert top, "want should have xcomp candidates"
    for c in top:
        assert "freq_xcomp" in c["score_components"], (
            f"freq_xcomp missing on {c['sentence']!r}"
        )
        # log(count+1) > 0 for any non-zero count, and the ppmi>0 filter
        # already implied count_v_r_f > 0
        assert c["score_components"]["freq_xcomp"] > 0


def test_ccomp_solve_shape_produces_freq_ccomp_in_components(store, sel_df):
    """End-to-end: a real ccomp probe produces freq_ccomp in components."""
    import paradigm_3_csp
    from skeleton_csp import SkeletonShape, parse_arg_structure, solve_shape

    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,ccomp", parse_arg_structure("nsubj,V,ccomp"), 0)
    top = solve_shape(
        shape, verb="think", domain_words=spec_words, sel_df=sel_df,
        band="fineweb_adult", word_axes={}, cross_axes={},
        word_df=store.df, top_k=3,
    )
    assert top, "think should have ccomp candidates"
    for c in top:
        assert "freq_ccomp" in c["score_components"], (
            f"freq_ccomp missing on {c['sentence']!r}"
        )
        assert c["score_components"]["freq_ccomp"] > 0


def test_nominal_slots_have_no_freq_component(store, sel_df):
    """nsubj/dobj should NOT get freq_<slot> components."""
    import paradigm_3_csp
    from skeleton_csp import SkeletonShape, parse_arg_structure, solve_shape

    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,dobj", parse_arg_structure("nsubj,V,dobj"), 0)
    top = solve_shape(
        shape, verb="cut", domain_words=spec_words, sel_df=sel_df,
        band="fineweb_adult", word_axes={}, cross_axes={},
        word_df=store.df, top_k=3,
    )
    for c in top:
        assert "freq_nsubj" not in c["score_components"]
        assert "freq_dobj" not in c["score_components"]

[ ] Step 5.2: Run tests, verify they pass

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py -v

Expected: 56 passed (53 prior + 3 new).
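The strict-positivity assertions in the xcomp and ccomp tests rest on the fact that log(count + 1) is zero only at count = 0 and grows monotonically afterwards. A minimal sketch of that arithmetic follows, assuming the natural logarithm (the plan does not state the base); `freq_score` is a hypothetical name, not a function in skeleton_csp.py.

```python
import math


def freq_score(count_v_r_f: int) -> float:
    """Compute log(count + 1); natural log is an assumption of this sketch."""
    if count_v_r_f < 0:
        raise ValueError("counts must be non-negative")
    return math.log(count_v_r_f + 1)


# Illustrative counts only; real values come from selectional.parquet.
for count in (0, 1, 9, 99):
    print(count, round(freq_score(count), 3))
```

A count of 0 maps to exactly 0.0, so any filler that survives the ppmi > 0 filter (which implies a non-zero count) gets a strictly positive score.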

[ ] Step 5.3: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: end-to-end tests for freq_<slot> in components

Three integration tests exercise the full slot_fillers → table → cart →
components pipeline: xcomp produces freq_xcomp; ccomp produces freq_ccomp;
nominal slots (nsubj/dobj) have NO freq component. All scores are
strictly positive (log(count+1) > 0 for any count > 0).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 6: Test weights={"freq_xcomp": 0.0} recovers PPMI-only ranking¶

Files: - Modify: <spike>/test_vectorized_enumeration.py

[ ] Step 6.1: Write the test

Append to <spike>/test_vectorized_enumeration.py:

def test_hybrid_weight_zero_recovers_ppmi_only_ranking(store, sel_df):
    """weights={'freq_xcomp': 0.0} disables the freq blend → ranking sorts purely by pmi_xcomp."""
    import paradigm_3_csp
    from skeleton_csp import SkeletonShape, parse_arg_structure, solve_shape

    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,xcomp", parse_arg_structure("nsubj,V,xcomp"), 0)
    common = dict(
        verb="want",
        domain_words=spec_words,
        sel_df=sel_df,
        band="fineweb_adult",
        word_axes={},
        cross_axes={},
        word_df=store.df,
        top_k=8,
    )

    # PPMI-only mode: zero out freq_xcomp weight
    ppmi_only = solve_shape(shape, weights={"freq_xcomp": 0.0}, **common)

    # In PPMI-only mode, ranking should be by pmi_xcomp desc.
    pmi_scores = [c["score_components"]["pmi_xcomp"] for c in ppmi_only]
    assert pmi_scores == sorted(pmi_scores, reverse=True), (
        f"pmi-only mode should rank by pmi_xcomp desc, got {pmi_scores}"
    )

[ ] Step 6.2: Run, verify pass

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_hybrid_weight_zero_recovers_ppmi_only_ranking -v

Expected: 1 passed.
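The mechanism this test relies on is a per-axis weighted sum in which any axis missing from `weights` falls back to 1.0. A hedged sketch of that idea is below; `total_score` and the component values are hypothetical, and the real scoring inside solve_shape may differ in detail.

```python
from __future__ import annotations


def total_score(
    components: dict[str, float],
    weights: dict[str, float] | None = None,
) -> float:
    """Weighted sum of score components; unlisted axes default to weight 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in components.items())


# Made-up component values for one candidate.
components = {"pmi_nsubj": 1.0, "pmi_xcomp": 2.0, "freq_xcomp": 3.0}

hybrid = total_score(components)                         # every axis at 1.0
ppmi_only = total_score(components, {"freq_xcomp": 0.0})  # freq axis disabled
print(hybrid, ppmi_only)  # 6.0 3.0
```

In this sketch the total still includes every other component (for example a pmi_nsubj term), so a strict pmi_xcomp ordering holds only when those terms are equal across candidates. Whether that caveat applies to solve_shape depends on its actual scoring.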

[ ] Step 6.3: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: test that freq_xcomp weight=0 recovers PPMI-only ranking

Verifies the existing per-axis weights mechanism cleanly disables the
freq blend. weights={"freq_xcomp": 0.0} → pmi_xcomp is the sole xcomp
score signal → ranking is monotonically descending in pmi_xcomp.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 7: Extend `test_vectorized_matches_python` with verbal probes¶

Files: - Modify: <spike>/test_vectorized_enumeration.py

[ ] Step 7.1: Find and extend the parametrize decorator

Find the existing @pytest.mark.parametrize for test_vectorized_matches_python:

@pytest.mark.parametrize("verb,spec_id,arg_structure", [
    ("cut",   "spec1", "nsubj,V,dobj"),
    ("cut",   "spec1", "nsubj,V,dobj,advmod"),
    ("chase", "spec1", "nsubj,V,dobj,advmod"),
    ("melt",  "spec6", "nsubj,V,dobj,advmod"),
    ("eat",   "spec1", "nsubj,V,dobj,advmod"),
    ("fill",  "spec1", "nsubj,V,dobj,advmod"),
])

Replace with:

@pytest.mark.parametrize("verb,spec_id,arg_structure", [
    ("cut",   "spec1", "nsubj,V,dobj"),
    ("cut",   "spec1", "nsubj,V,dobj,advmod"),
    ("chase", "spec1", "nsubj,V,dobj,advmod"),
    ("melt",  "spec6", "nsubj,V,dobj,advmod"),
    ("eat",   "spec1", "nsubj,V,dobj,advmod"),
    ("fill",  "spec1", "nsubj,V,dobj,advmod"),
    # Verbal-clause probes (PHON-105 — exercise freq_<slot> equivalence)
    ("want",  "spec1", "nsubj,V,xcomp"),
    ("think", "spec1", "nsubj,V,ccomp"),
])

[ ] Step 7.2: Run the parametrized test, verify the new probes pass

cd packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py::test_vectorized_matches_python -v

Expected: 8 parameterized cases pass (6 nominal + 2 verbal).

If a verbal case fails on score_components mismatch, the python fallback's freq bookkeeping (Task 4) needs scrutiny. The most likely cause is a missed freq_scores lookup in either the locked-branch or the non-locked branch — verify both paths handle the freq_key parallel to pmi_key.
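When chasing such a mismatch, it can help to print exactly which keys differ between the two paths rather than reading the raw assertion diff. The helper below is a hypothetical debugging aid, not part of the test file; it assumes only that each candidate is a dict with a `score_components` mapping, as in the tests above.

```python
from __future__ import annotations

import math


def diff_components(vec: list[dict], py: list[dict], tol: float = 1e-9) -> list[str]:
    """List per-candidate differences in score_components between two result lists."""
    problems: list[str] = []
    if len(vec) != len(py):
        problems.append(f"length mismatch: {len(vec)} vs {len(py)}")
    for i, (v, p) in enumerate(zip(vec, py)):
        v_comp, p_comp = v["score_components"], p["score_components"]
        # Keys present on one side only (e.g. a missed freq_<slot> lookup).
        for key in sorted(v_comp.keys() ^ p_comp.keys()):
            side = "vectorized" if key in v_comp else "python"
            problems.append(f"[{i}] {key} only in {side}")
        # Shared keys whose values differ by more than the tolerance.
        for key in sorted(v_comp.keys() & p_comp.keys()):
            if not math.isclose(v_comp[key], p_comp[key], rel_tol=0.0, abs_tol=tol):
                problems.append(f"[{i}] {key}: {v_comp[key]} != {p_comp[key]}")
    return problems


# Made-up candidates: the python path forgot freq_xcomp.
vec = [{"score_components": {"pmi_xcomp": 2.0, "freq_xcomp": 1.5}}]
py = [{"score_components": {"pmi_xcomp": 2.0}}]
report = diff_components(vec, py)
print(report)  # ['[0] freq_xcomp only in vectorized']
```

A key reported as "only in vectorized" points at the python fallback, and the reverse points at the table or dedup step.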

[ ] Step 7.3: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/test_vectorized_enumeration.py
git commit -m "$(cat <<'EOF'
PHON-105: extend equivalence test with verbal-clause probes

Adds (want, spec1, nsubj,V,xcomp) and (think, spec1, nsubj,V,ccomp) to
test_vectorized_matches_python. Asserts bit-identical sentences,
total_score (within 1e-9), and score_components dicts between vectorized
and python paths for the verbal-slot freq blend.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 8: Build `eval_hybrid_xcomp_ccomp.py` and record empirical baseline¶

Files: - Create: <spike>/eval_hybrid_xcomp_ccomp.py - Modify: docs/superpowers/specs/2026-05-08-phon-105-csp-hybrid-ppmi-frequency-design.md

[ ] Step 8.1: Create the eval script

Create <spike>/eval_hybrid_xcomp_ccomp.py:

"""Eval hybrid PPMI + freq blend on verbal-slot probes — PHON-105.

For 7 canonical xcomp/ccomp probes:
  1. Generate top-K=8 candidates under PPMI-only (freq weight = 0)
  2. Generate top-K=8 candidates under hybrid (default weights, both at 1.0)
  3. Score both with the teacher-distilled reranker (PHON-95 quality_axis)
  4. Compare mean Q per probe, print each mode's top-1 sentence, and count hybrid wins

Run: uv run python research/2026-05-07-sentence-generation-paradigms/eval_hybrid_xcomp_ccomp.py
"""
from __future__ import annotations

import sys
from pathlib import Path

import polars as pl

sys.path.insert(0, str(Path(__file__).parent))

import paradigm_3_csp
from phonolex_data.runtime.store import WordStore
from quality_axis import score_candidates  # PHON-95 reranker entry point
from skeleton_csp import SkeletonShape, parse_arg_structure, solve_shape


PROBES = [
    ("want",  "xcomp", "nsubj,V,xcomp"),
    ("try",   "xcomp", "nsubj,V,xcomp"),
    ("like",  "xcomp", "nsubj,V,xcomp"),
    ("need",  "xcomp", "nsubj,V,xcomp"),
    ("think", "ccomp", "nsubj,V,ccomp"),
    ("know",  "ccomp", "nsubj,V,ccomp"),
    ("see",   "ccomp", "nsubj,V,ccomp"),
]


def _load_data() -> tuple[WordStore, pl.DataFrame]:
    repo_root = Path(__file__).resolve().parents[4]
    store = WordStore.from_parquet(repo_root / "data" / "runtime" / "words.parquet")
    sel_df = pl.read_parquet(repo_root / "data" / "runtime" / "selectional.parquet")
    return store, sel_df


def _solve_with_weights(
    verb: str, slot_kind: str, arg: str,
    store: WordStore, sel_df: pl.DataFrame,
    weights: dict[str, float] | None,
) -> list[dict]:
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape(arg, parse_arg_structure(arg), 0)
    return solve_shape(
        shape, verb=verb, domain_words=spec_words, sel_df=sel_df,
        band="fineweb_adult", word_axes={}, cross_axes={},
        word_df=store.df, weights=weights, top_k=8,
    )


def main() -> None:
    print("Loading WordStore + selectional.parquet…")
    store, sel_df = _load_data()

    rows: list[dict] = []
    for verb, slot_kind, arg in PROBES:
        ppmi_only_weights = {f"freq_{slot_kind}": 0.0}
        hybrid_weights = None  # default 1.0 for all axes

        ppmi_top = _solve_with_weights(verb, slot_kind, arg, store, sel_df, ppmi_only_weights)
        hybrid_top = _solve_with_weights(verb, slot_kind, arg, store, sel_df, hybrid_weights)

        ppmi_scores = score_candidates(ppmi_top)
        hybrid_scores = score_candidates(hybrid_top)

        ppmi_mean = sum(ppmi_scores) / len(ppmi_scores) if ppmi_scores else 0.0
        hybrid_mean = sum(hybrid_scores) / len(hybrid_scores) if hybrid_scores else 0.0

        rows.append({
            "probe": f"{verb} {slot_kind}",
            "ppmi_mean_q": ppmi_mean,
            "hybrid_mean_q": hybrid_mean,
            "delta": hybrid_mean - ppmi_mean,
            "ppmi_top1": ppmi_top[0]["sentence"] if ppmi_top else "(none)",
            "hybrid_top1": hybrid_top[0]["sentence"] if hybrid_top else "(none)",
        })

    print(f"\n{'Probe':<14}{'PPMI Q':>10}{'Hybrid Q':>10}{'Δ':>8}  {'PPMI top-1':<40}{'Hybrid top-1':<40}")
    print("-" * 124)
    wins = 0
    for r in rows:
        if r["delta"] > 0:
            wins += 1
        print(
            f"{r['probe']:<14}"
            f"{r['ppmi_mean_q']:>10.3f}"
            f"{r['hybrid_mean_q']:>10.3f}"
            f"{r['delta']:>+8.3f}  "
            f"{r['ppmi_top1']:<40}"
            f"{r['hybrid_top1']:<40}"
        )
    print("-" * 124)
    print(f"Hybrid wins: {wins}/{len(rows)}")
    if wins >= 4:
        print(f"→ Decision: ship hybrid as default (≥4/{len(rows)} wins)")
    else:
        print(f"→ Decision: ship pure PPMI as default; freq_<slot> stays as opt-in axis ({wins}/{len(rows)} wins)")


if __name__ == "__main__":
    main()

If quality_axis.py doesn't expose score_candidates, inspect that file for the actual function name and update the import. The PHON-95 spike memo references quality_axis.py and train_reranker.py — read them once to get the right entrypoint.

[ ] Step 8.2: Run the eval

cd packages/generation && uv run python research/2026-05-07-sentence-generation-paradigms/eval_hybrid_xcomp_ccomp.py

Capture the output table.
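The script as written reports mean Q per probe. If top-1 and top-3 Q are also wanted for the spec table, they could be derived from the same score list. The sketch below assumes the reranker returns a list of floats in the same rank order as the candidates, which should be confirmed against quality_axis.py; `summarize_q` is a hypothetical helper.

```python
from __future__ import annotations


def summarize_q(scores: list[float]) -> dict[str, float]:
    """Mean, top-1, and top-3 mean of reranker scores given in candidate rank order."""
    if not scores:
        return {"mean_q": 0.0, "top1_q": 0.0, "top3_q": 0.0}
    head = scores[:3]  # fewer than 3 candidates: average what is there
    return {
        "mean_q": sum(scores) / len(scores),
        "top1_q": scores[0],
        "top3_q": sum(head) / len(head),
    }


# Placeholder scores for illustration, not measured results.
summary = summarize_q([0.75, 0.5, 0.25, 0.5])
print(summary)  # {'mean_q': 0.5, 'top1_q': 0.75, 'top3_q': 0.5}
```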

[ ] Step 8.3: Append the empirical baseline to the spec

Open docs/superpowers/specs/2026-05-08-phon-105-csp-hybrid-ppmi-frequency-design.md and find the "Validation plan" section's empty <FILL> table near the bottom. Replace the placeholder rows with the real numbers from Step 8.2 — one row per probe — and update the "Wins" footer.

Append a one-paragraph reconciliation note immediately below the table:

**Decision (recorded 2026-05-08):** hybrid wins <X>/7 probes. Per the criterion in this spec, <ship default | ship pure PPMI as default with freq_<slot> as opt-in>.

If hybrid wins ≥ 4/7, no further code change is required — the implementation is the default. If hybrid wins < 4/7, file a follow-up note in the spec recommending callers pass weights={"freq_xcomp": 0.0, "freq_ccomp": 0.0} until the blend is re-tuned (e.g., as part of PHON-107 reranker v2). DO NOT change defaults silently inside solve_shape in this ticket — the user will decide based on the recorded numbers.
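The decision rule above can be restated as a small function that maps per-probe deltas to the weights a caller would pass. This is a sketch of the criterion only: `recommend_weights` is a hypothetical name, and the deltas shown are placeholders rather than results from the eval.

```python
from __future__ import annotations


def recommend_weights(
    deltas: list[float], min_wins: int = 4
) -> dict[str, float] | None:
    """Return None (keep default hybrid weights) or PPMI-only weight overrides.

    A probe counts as a hybrid win only when its delta is strictly positive,
    matching the `delta > 0` check in the eval script.
    """
    wins = sum(1 for d in deltas if d > 0)
    if wins >= min_wins:
        return None  # hybrid stays the default; no override needed
    return {"freq_xcomp": 0.0, "freq_ccomp": 0.0}


# Placeholder deltas for 7 probes (NOT measured values).
four_wins = [0.1, 0.2, 0.05, 0.3, -0.1, 0.0, -0.2]
three_wins = [0.1, 0.2, 0.05, -0.3, -0.1, 0.0, -0.2]
print(recommend_weights(four_wins))   # None
print(recommend_weights(three_wins))  # {'freq_xcomp': 0.0, 'freq_ccomp': 0.0}
```

The override dict is something callers pass explicitly; it leaves the defaults inside solve_shape untouched, in line with the instruction above.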

[ ] Step 8.4: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/eval_hybrid_xcomp_ccomp.py \
        docs/superpowers/specs/2026-05-08-phon-105-csp-hybrid-ppmi-frequency-design.md
git commit -m "$(cat <<'EOF'
PHON-105: eval hybrid + record empirical baseline

eval_hybrid_xcomp_ccomp.py runs 7 verbal-slot probes (4 xcomp + 3 ccomp)
under PPMI-only and hybrid modes, scores both with the teacher-distilled
reranker, and reports per-probe Δ. Spec updated with the empirical table
and the win-count decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Done¶

After Task 8 is committed, PHON-105 closes. The hybrid blend is wired through both CSP enumeration paths, and the eval script and the spec together record the empirical comparison. The user then decides, using the spec's win-rate criterion, whether the recorded result justifies shipping hybrid or pure PPMI as the default.

PHON-106 (CSP constraint: maxopp contrastive scorer) is unblocked. PHON-107 (CSP reranker v2) is unblocked.
