PHON-105 - CSP quality: hybrid PPMI + raw frequency for verbal slots

Date: 2026-05-08
Branch: feature/csp-iteration
Scope: spike-internal quality improvement (productionization to packages/generators/csp/ is PHON-109's scope)

Frame

solve_shape scores xcomp/ccomp fillers using pure positive PMI (PPMI) from selectional.parquet. PPMI alone over-prefers rare but strongly associated verbs. For example, for the matrix verb wish, the row (wish, xcomp, proceed, fineweb_b3, count_v_r_f=6, count_v_r_star=7171, ppmi=1.696) ranks well even though proceed was observed only 6 times in that slot. A common verbal complement like go, with a much higher count_v_r_f but a lower PPMI, is currently demoted even though it produces more natural sentences.

The PHON-95 spike already noticed this in _demo_clause_extension: want xcomp produces fillers like tame (count likely small, PPMI high) and cabal (similar), where go / do / make would feel more natural. Verbal complements have the smallest selectional samples in the corpus parse — PPMI variance is highest there.
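The rare-verb bias can be reproduced with a few lines of arithmetic. The sketch below uses the textbook PMI definition with a natural log; the source does not state the exact formula or log base behind the ppmi column, so both are assumptions, and every count in the example is hypothetical rather than taken from selectional.parquet.

```python
import math


def ppmi(count_v_r_f, count_v_r_star, count_r_f, count_r_star):
    """Positive PMI of a filler given (verb, role), from raw counts.

    count_r_f and count_r_star are hypothetical role-level marginals;
    the source only names count_v_r_f and count_v_r_star.
    """
    p_filler_given_verb_role = count_v_r_f / count_v_r_star
    p_filler_given_role = count_r_f / count_r_star
    return max(0.0, math.log(p_filler_given_verb_role / p_filler_given_role))


def freq_score(count_v_r_f):
    """The frequency component proposed in this ticket."""
    return math.log(count_v_r_f + 1)


# Hypothetical counts: a rare filler that is rare everywhere,
# and a common filler that is common everywhere.
rare = dict(count_v_r_f=6, count_v_r_star=7000, count_r_f=300, count_r_star=2_000_000)
common = dict(count_v_r_f=400, count_v_r_star=7000, count_r_f=60_000, count_r_star=2_000_000)

rare_ppmi, common_ppmi = ppmi(**rare), ppmi(**common)
rare_freq, common_freq = freq_score(6), freq_score(400)
```

Under these made-up counts the rare filler has the higher PPMI while the common filler has the higher frequency score, which is the tension the two axes are meant to expose.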

This ticket adds a complementary frequency signal to verbal-slot scoring: freq_<slot> = log(count_v_r_f + 1), exposed as a separate score component alongside the existing pmi_<slot>. Both axes go through the existing per-axis weights mechanism. Because the hybrid won only 3 of 7 probes in the empirical eval, the implementation defaults the freq_<slot> weight to 0.0, so pure PPMI remains the default ranking. Callers can enable the hybrid blend via weights={"freq_xcomp": 1.0, "freq_ccomp": 1.0}. The freq_<slot> component is always populated in score_components for xcomp/ccomp candidates, regardless of weight, so it stays available for inspection and as a reranker feature.

Goal

For seven canonical verbal-slot probes (want, try, like, need for xcomp; think, know, see for ccomp), the hybrid blend should improve mean teacher-distilled reranker quality scores on at least 4 of 7 probes vs PPMI-only.

If the hybrid wins ≥4 of 7, ship at default weights. If 3 or fewer, surface the result honestly and let the user decide whether to ship pure PPMI as default and expose freq_<slot> as an opt-in axis.

Non-goals

Per-slot weight tuning — defer to PHON-107 (reranker v2) or a follow-up if needed.
Applying the blend to nominal slots (nsubj/dobj/pobj_*) — pure PPMI is well-calibrated there because of richer corpus signal. Nominal slots stay PPMI-only in this ticket.
Marginal lemma frequency from words.parquet — out of scope. The joint count signal (already in selectional.parquet) is the cleaner first iteration.
Frequency floor / hard threshold — drops candidates the reranker might rescue, inconsistent with PHON-104's vectorize-don't-prune principle.
Productionization move to packages/generators/csp/ (PHON-109 scope).

Architecture & data flow

selectional.parquet rows: (verb, role, filler, band, count_v_r_f, count_v_r_star, ppmi)
    ↓ filter (verb, role, band, ppmi > 0)
_slot_fillers(slot, ...)   ← in skeleton_csp.py
    ↓ for slot in {xcomp, ccomp}: extract count_v_r_f alongside ppmi
    ↓ return (fillers, scores: dict[filler, ppmi], freq_scores: dict[filler, log(count+1)])
_build_slot_filler_tables  ← when freq_scores non-empty, add freq_<slot> column
    ↓
_enumerate_vectorized       ← score_cols filter recognizes freq_* prefix
    ↓ total_score = Σ weights[k]·v across {pmi_*, freq_*, soft axes, adv_sentinel}
_dedup_and_assemble         ← drop logic mirrors pmi_* (locked-then-zero drops; non-locked keeps)
    ↓
candidates with score_components {pmi_xcomp, freq_xcomp, ...}
    ↓
reranker (downstream, PHON-95 Step 15) sees freq_<slot> as a new feature

File	Change
`skeleton_csp.py:_slot_fillers`	For `xcomp`/`ccomp`, return a 3-tuple `(fillers, scores, freq_scores)` instead of `(fillers, scores)`. Other slots return empty `freq_scores={}`.
`skeleton_csp.py:solve_shape`	The `slot_fillers` list type extends from `list[tuple[str, list[str], dict[str, float]]]` to `list[tuple[str, list[str], dict[str, float], dict[str, float]]]`.
`skeleton_csp.py:_build_slot_filler_tables`	When `freq_scores` is non-empty for a slot, add a `freq_<slot>` column populated parallel to `pmi_<slot>`.
`skeleton_csp.py:_enumerate_vectorized`	`score_cols` filter extends to recognize `c.startswith("freq_")` parallel to `pmi_*`.
`skeleton_csp.py:_enumerate_python_fallback`	Mirror the `pmi_<slot>` running-components bookkeeping for `freq_<slot>`. Same yield-before-cleanup asymmetry.
`skeleton_csp.py:_dedup_and_assemble`	`score_cols` filter extends to recognize `freq_`. Same drop-on-zero-locked logic as `pmi_`.
`paragraph_csp.py:_solve_sentence`	No change — uses `solve_shape` opaquely.
`paradigm_3_csp.py:solve()`	No change — `solve_shape` returns full score_components transparently.
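As a rough illustration of the table-building and column-filter changes listed above, the sketch below uses plain dict rows in place of the real dataframe tables. The helper names build_slot_rows and score_columns are hypothetical, and the filter shows only the pmi_/freq_ prefixes, not the soft axes or adv_sentinel that the real score_cols also covers.

```python
def build_slot_rows(slot, fillers, scores, freq_scores):
    """One row per filler; freq_<slot> is added only when freq_scores is non-empty."""
    rows = []
    for filler in fillers:
        row = {slot: filler, f"pmi_{slot}": scores[filler]}
        if freq_scores:  # empty dict means "no freq column for this slot"
            row[f"freq_{slot}"] = freq_scores[filler]
        rows.append(row)
    return rows


def score_columns(columns):
    """Keep the columns that feed total_score (prefix part of the filter only)."""
    return [c for c in columns if c.startswith(("pmi_", "freq_"))]


# Illustrative values, not corpus data.
xcomp_rows = build_slot_rows(
    "xcomp", ["go", "proceed"],
    scores={"go": 0.9, "proceed": 1.7},
    freq_scores={"go": 6.0, "proceed": 1.9},
)
nsubj_rows = build_slot_rows("nsubj", ["clown"], scores={"clown": 2.1}, freq_scores={})
xcomp_score_cols = score_columns(list(xcomp_rows[0]))
```

Keying the behavior on an empty freq_scores dict keeps the nominal-slot tables byte-for-byte as they were, which is why the nominal rows above carry no freq_nsubj key at all.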

Blend formula

import math

import polars as pl

# In _slot_fillers, for slot in {"xcomp", "ccomp"}:
rows = sel_df.filter(
    (pl.col("verb") == verb)
    & (pl.col("role") == pmi_role)
    & (pl.col("band") == band)
    & (pl.col("ppmi") > 0.0)
)

ppmi_lookup = dict(zip(
    rows.get_column("filler").to_list(),
    rows.get_column("ppmi").to_list(),
))
count_lookup = dict(zip(
    rows.get_column("filler").to_list(),
    rows.get_column("count_v_r_f").to_list(),
))

# Existing self-loop exclusion: drop the matrix verb from xcomp/ccomp filler list
fillers = sorted(f for f in ppmi_lookup.keys() if f != verb)

scores = {f: ppmi_lookup[f] for f in fillers}
freq_scores = {f: math.log(count_lookup[f] + 1) for f in fillers}
return fillers, scores, freq_scores

For non-verbal slots: return (fillers, scores, {}). The empty freq_scores signals "no freq column for this slot" downstream.

total_score for a candidate is then:

total = α·pmi_<slot> + β·freq_<slot> + (other axes...)

where α defaults to 1.0 (PPMI), β defaults to 0.0 (freq is opt-in). Callers enable the hybrid blend via weights={"freq_xcomp": 1.0, ...}.
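The weighted sum can be sketched as follows. This is a simplified stand-in, not the code in _enumerate_vectorized: total_score and default_weight are hypothetical helpers, and treating every non-freq axis as weight 1.0 is an assumption made to keep the example short. The component values are illustrative.

```python
def default_weight(key):
    """freq_* axes are opt-in (0.0); other axes are assumed to default to 1.0 here."""
    return 0.0 if key.startswith("freq_") else 1.0


def total_score(components, weights=None):
    """Sum of weight * value over all score components of one candidate."""
    weights = weights or {}
    return sum(weights.get(k, default_weight(k)) * v for k, v in components.items())


# Illustrative components for a rare and a common xcomp filler.
rare = {"pmi_xcomp": 1.7, "freq_xcomp": 1.9}
common = {"pmi_xcomp": 0.9, "freq_xcomp": 6.0}

hybrid = {"freq_xcomp": 1.0}
default_order = sorted([("rare", rare), ("common", common)],
                       key=lambda kv: total_score(kv[1]), reverse=True)
hybrid_order = sorted([("rare", rare), ("common", common)],
                      key=lambda kv: total_score(kv[1], hybrid), reverse=True)
```

With default weights the rare filler ranks first on PPMI alone; passing the hybrid weights flips the order, because the frequency component then contributes to the sum.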

Why log(count+1) and not log(count): the +1 smoothing term avoids log(0) for fillers with count=0. The ppmi > 0 filter already implies count > 0 in the data, so this is defensive, but it matches standard log-frequency conventions.
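A small check of the smoothing behavior, using only the standard library. math.log1p is shown as an alternative spelling of log(count + 1); it is not what the spec's snippet uses, and the two are not guaranteed to agree to the last bit, so any implementation should pick one and use it in every code path.

```python
import math


def freq_score(count_v_r_f: int) -> float:
    """log(count + 1), written with log1p; defined for count == 0."""
    return math.log1p(count_v_r_f)


zero_count_score = freq_score(0)  # 0.0, no exception

try:
    math.log(0)  # unsmoothed log has no value at zero
    unsmoothed_fails = False
except ValueError:
    unsmoothed_fails = True
```

Note that a plain log(count) would also map count=1 to a score of 0, making single observations indistinguishable from no evidence; the +1 keeps them strictly positive.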

Magnitude calibration check, typical xcomp ranges:

- PPMI: 0.0–~5.0 (typical "good" candidates: 1.5–3.5)
- count_v_r_f: 1–~3000 (typical: 5–100)
- log(count+1): ~0.7–~8 (typical: 1.8–4.6)

The two signals are within ~2× of each other in typical ranges. With the hybrid setting α=β=1.0, freq gets slightly more weight than PPMI in raw magnitude. Per-request reweighting via weights={"freq_xcomp": 0.5} rebalances if freq dominates.
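The frequency side of the calibration check can be recomputed directly from the count endpoints quoted above. The sketch only covers log(count+1); the PPMI ranges come from the data and are not re-derived here.

```python
import math

# Endpoints of the overall and "typical" count ranges quoted in the check.
typical_counts = [1, 5, 100, 3000]
freq_values = [round(math.log(c + 1), 2) for c in typical_counts]
# freq_values == [0.69, 1.79, 4.62, 8.01]

# Ratio of typical freq to typical "good" PPMI at both ends of the range.
low_ratio = math.log(5 + 1) / 1.5
high_ratio = math.log(100 + 1) / 3.5
```

Both ratios come out a little above 1, which matches the statement that the frequency component is somewhat larger than PPMI in raw magnitude at equal weights but stays within a factor of two.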

Validation plan

A new research script eval_hybrid_xcomp_ccomp.py runs the comparison once and records numbers.

PROBES = [
    ("want",  "xcomp", "nsubj,V,xcomp"),
    ("try",   "xcomp", "nsubj,V,xcomp"),
    ("like",  "xcomp", "nsubj,V,xcomp"),
    ("need",  "xcomp", "nsubj,V,xcomp"),
    ("think", "ccomp", "nsubj,V,ccomp"),
    ("know",  "ccomp", "nsubj,V,ccomp"),
    ("see",   "ccomp", "nsubj,V,ccomp"),
]

# For each probe:
#   1. Generate top-K=8 under PPMI-only (weights={"freq_xcomp": 0.0, "freq_ccomp": 0.0})
#   2. Generate top-K=8 under hybrid (explicit weights={"freq_xcomp": 1.0, "freq_ccomp": 1.0}; freq defaults to 0.0)
#   3. Score both candidate sets with the existing teacher-distilled reranker
#   4. Record: mean Q, top-1 Q, top-3 Q, sentence diff count

The reranker code lives in train_reranker.py and quality_axis.py from the PHON-95 spike. The eval script imports score_candidates(candidates) -> list[float] and applies it to both candidate lists per probe.
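The per-probe comparison might look roughly like the sketch below. It is not the contents of eval_hybrid_xcomp_ccomp.py: generate is a hypothetical wrapper around solve_shape, the "sentence" key on candidates is assumed, "top-3 Q" is read as the mean over the first three candidates, and "sentence diff count" is read as the size of the symmetric difference between the two sentence sets.

```python
from statistics import mean

PPMI_ONLY = {"freq_xcomp": 0.0, "freq_ccomp": 0.0}
HYBRID = {"freq_xcomp": 1.0, "freq_ccomp": 1.0}


def compare_probe(generate, score_candidates, verb, slot, arg_structure, top_k=8):
    """Run one probe under both weightings and summarize reranker quality.

    generate(verb, arg_structure, weights, top_k) -> list of candidate dicts
    score_candidates(candidates) -> list[float], one Q per candidate
    Assumes each mode returns at least one candidate.
    """
    result = {"probe": f"{verb} {slot}"}
    for mode, weights in (("ppmi", PPMI_ONLY), ("hybrid", HYBRID)):
        candidates = generate(verb, arg_structure, weights, top_k)
        q = score_candidates(candidates)
        result[mode] = {
            "mean_q": mean(q),
            "top1_q": q[0],
            "top3_q": mean(q[:3]),
            "sentences": [c["sentence"] for c in candidates],
        }
    ppmi_set = set(result["ppmi"]["sentences"])
    hybrid_set = set(result["hybrid"]["sentences"])
    result["sentence_diff_count"] = len(ppmi_set ^ hybrid_set)
    return result
```

Passing both weight dicts explicitly keeps the A/B comparison independent of whatever the library default happens to be at the time the script is run.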

Output table (appended to spec under "Empirical baseline"):

Probe	PPMI-only mean Q	Hybrid mean Q	Δ (Hybrid − PPMI)	PPMI top-1	Hybrid top-1
want xcomp	2.335	1.813	−0.522	The comrade wants to belabor.	The comrade wants to know.
try xcomp	1.922	1.820	−0.102	The clown tries to americanize.	The clown tries to figure.
like xcomp	1.695	1.722	+0.028	The coder likes to squish.	The coder likes to reuse.
need xcomp	1.748	1.852	+0.103	The culvert needs to biopsie.	The culvert needs to know.
think ccomp	1.835	1.343	−0.492	The comrade thinks that the color compliments the course.	The comrade thinks that the calyx has the clockwork.
know ccomp	1.385	1.404	+0.019	The caveman knows that the clock tolls.	The caveman knows that the calyx has the clockwork.
see ccomp	2.933	1.526	−1.408	The cadre sees that the command coils the cable.	The cadre sees that the cursing happens the cause.
Wins (Hybrid > PPMI)			3/7

Decision (recorded 2026-05-08): hybrid wins 3/7 probes. Per the criterion in this spec, ship pure PPMI as default; freq_xcomp and freq_ccomp stay as opt-in axes. Callers that want the frequency blend can pass weights={"freq_xcomp": 1.0, "freq_ccomp": 1.0} (or any positive value) — the score components are already present in score_components for all xcomp/ccomp candidates regardless of ranking mode. The hybrid underperformed on 4 of 7 probes: the frequency signal preferentially promotes high-count verbs (know, figure, reuse) that are grammatically generic but contextually weak in narrow-domain (SLP) settings where PPMI-strong rare verbs may actually be more appropriate. PHON-107 (reranker v2) is the right venue for re-examining blend calibration with per-slot weight tuning.

Decision criterion: if hybrid mean-Q is higher on ≥4 of 7 probes, commit at default weights. If 3 or fewer, surface honestly and ship pure PPMI as default with freq_<slot> exposed as opt-in via the weights dict.
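Applying the criterion to the mean-Q columns of the table above is mechanical. The sketch below copies those seven pairs from the table; the decide helper and its return shape are hypothetical.

```python
# (PPMI-only mean Q, hybrid mean Q) per probe, copied from the results table.
RESULTS = {
    "want xcomp": (2.335, 1.813),
    "try xcomp": (1.922, 1.820),
    "like xcomp": (1.695, 1.722),
    "need xcomp": (1.748, 1.852),
    "think ccomp": (1.835, 1.343),
    "know ccomp": (1.385, 1.404),
    "see ccomp": (2.933, 1.526),
}


def decide(results, min_wins=4):
    """Return the probes where hybrid beats PPMI and the resulting default mode."""
    wins = [probe for probe, (ppmi_q, hybrid_q) in results.items() if hybrid_q > ppmi_q]
    default = "hybrid" if len(wins) >= min_wins else "ppmi"
    return wins, default


wins, default = decide(RESULTS)
# wins == ["like xcomp", "need xcomp", "know ccomp"], default == "ppmi"
```

The three wins fall short of the four-probe threshold, so the criterion selects pure PPMI as the default, matching the recorded decision.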

Edge cases

Case	Behavior
Slot is `xcomp`/`ccomp` with empty filler list	`_slot_fillers` returns `(fillers=[], scores={}, freq_scores={})`. solve_shape returns `[]`.
Slot is `nsubj`/`dobj`/`pobj_*`	`(fillers, scores, {})` — empty `freq_scores`. `_build_slot_filler_tables` skips the freq column. No behavior change.
Locked verbal slot (paragraph composition with locked verbal complement)	Locked-branch logic mirrors `pmi_<slot>`: scores.get/freq_scores.get default to 0; locked-with-zero drops both columns from components. Same asymmetric drop as PHON-104 fix.
Recursive `_resolve_embedded_clause` for ccomp	Inner solve_shape call gets the freq blend on its own xcomp/ccomp slots too. Consistent.
User explicitly sets `weights={"freq_xcomp": 0.0, "freq_ccomp": 0.0}`	Equivalent to PPMI-only mode. The eval script uses this exact pattern for the A/B comparison.
`count_v_r_f = 0` (impossible given `ppmi > 0` filter, but defensive)	`log(0 + 1) = 0`, so `freq_<slot>` = 0. Under non-locked-keep semantics it stays in components at 0; under locked-and-zero semantics it is dropped (consistent with `pmi_<slot>` behavior).
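One reading of the keep/drop rules in the table above is sketched below. It is an interpretation, not the logic in _dedup_and_assemble: the slot_components helper is hypothetical, and it assumes the drop is decided per column, so a locked filler that is missing from both lookups loses both columns.

```python
def slot_components(slot, filler, scores, freq_scores, locked):
    """Score components contributed by one slot for one filler."""
    components = {}
    for prefix, table in (("pmi", scores), ("freq", freq_scores)):
        if prefix == "freq" and not table:
            continue  # nominal slot: no freq column exists at all
        value = table.get(filler, 0.0)  # unseen filler defaults to 0
        if locked and value == 0.0:
            continue  # locked filler with a zero score: drop the column
        components[f"{prefix}_{slot}"] = value  # non-locked: keep, even at 0
    return components


# Illustrative lookups, not corpus data.
scores = {"go": 0.9}
freq_scores = {"go": 6.0}

free_seen = slot_components("xcomp", "go", scores, freq_scores, locked=False)
locked_unseen = slot_components("xcomp", "zzz", scores, freq_scores, locked=True)
nominal = slot_components("dobj", "cable", {"cable": 1.2}, {}, locked=False)
free_zero = slot_components("xcomp", "go", scores, {"go": 0.0}, locked=False)
```

The four calls correspond to the ordinary verbal case, a locked complement with no corpus evidence, a nominal slot, and the defensive zero-count row.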

Testing

In test_vectorized_enumeration.py:

def test_xcomp_freq_score_present_in_components(store, sel_df):
    """xcomp slot produces freq_xcomp in score_components."""
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,xcomp", parse_arg_structure("nsubj,V,xcomp"), 0)
    top = solve_shape(shape, verb="want", domain_words=spec_words, sel_df=sel_df,
                      band="fineweb_adult", word_axes={}, cross_axes={},
                      word_df=store.df, top_k=3)
    assert top, "want should have xcomp candidates"
    for c in top:
        assert "freq_xcomp" in c["score_components"]
        assert c["score_components"]["freq_xcomp"] > 0


def test_nominal_slots_have_no_freq_component(store, sel_df):
    """nsubj/dobj should NOT get freq_<slot> components."""
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,dobj", parse_arg_structure("nsubj,V,dobj"), 0)
    top = solve_shape(shape, verb="cut", domain_words=spec_words, sel_df=sel_df,
                      band="fineweb_adult", word_axes={}, cross_axes={},
                      word_df=store.df, top_k=3)
    for c in top:
        assert "freq_nsubj" not in c["score_components"]
        assert "freq_dobj" not in c["score_components"]


def test_hybrid_weight_zero_recovers_ppmi_only_ranking(store, sel_df):
    """weights={'freq_xcomp': 0.0} disables freq → ranking sorts purely by PPMI."""
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,xcomp", parse_arg_structure("nsubj,V,xcomp"), 0)
    common = dict(verb="want", domain_words=spec_words, sel_df=sel_df,
                  band="fineweb_adult", word_axes={}, cross_axes={},
                  word_df=store.df, top_k=8)

    ppmi_only = solve_shape(shape, weights={"freq_xcomp": 0.0}, **common)

    pmi_scores = [c["score_components"]["pmi_xcomp"] for c in ppmi_only]
    assert pmi_scores == sorted(pmi_scores, reverse=True), (
        f"pmi-only mode should rank by PPMI desc, got {pmi_scores}"
    )

Extend test_vectorized_matches_python with two verbal-clause probes:

@pytest.mark.parametrize("verb,spec_id,arg_structure", [
    # ... existing 6 nominal probes ...
    ("want",  "spec1", "nsubj,V,xcomp"),
    ("think", "spec1", "nsubj,V,ccomp"),
])

The vectorized and Python fallback paths must produce bit-identical sentences, score_components, and total_score for verbal probes, just like nominal ones.
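A parity check of this kind could be expressed with a small helper like the one below. The helper name and the "sentence" and "total_score" keys are assumptions about the candidate dict layout; only "score_components" appears in the tests above. Plain == on floats is used as the exactness check, which rejects any rounding difference between the two paths.

```python
def assert_paths_match(vectorized, fallback):
    """Fail if the two enumeration paths disagree on any candidate field."""
    assert len(vectorized) == len(fallback), (
        f"candidate counts differ: {len(vectorized)} != {len(fallback)}"
    )
    for i, (v, p) in enumerate(zip(vectorized, fallback)):
        for key in ("sentence", "score_components", "total_score"):
            assert v[key] == p[key], (
                f"candidate {i}: {key} differs: {v[key]!r} != {p[key]!r}"
            )


# Illustrative candidate, not real solver output.
example = [{
    "sentence": "The comrade wants to know.",
    "score_components": {"pmi_xcomp": 1.0, "freq_xcomp": 2.0},
    "total_score": 1.0,
}]
assert_paths_match(example, [dict(example[0])])
```

Because freq_<slot> is a float derived from a logarithm, both paths need to compute it with the same function and in the same order for this comparison to hold.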

Open questions

None.

References

PHON-94 (corpus parse → selectional.parquet) — origin of count_v_r_f.
PHON-95 acceptance probes — _demo_clause_extension exposes the rare-PPMI failure mode.
PHON-104 vectorized enumeration — freq_<slot> is a new column following the same patterns established for pmi_<slot> and the per-word axes.
PHON-95 reranker (Step 15, teacher-distilled, Spearman 0.633) — the quality oracle used in the eval script.
Spike code: packages/generation/research/2026-05-07-sentence-generation-paradigms/.