PHON-105 — CSP quality: hybrid PPMI + raw frequency for verbal slots

Date: 2026-05-08
Branch: feature/csp-iteration
Scope: spike-internal quality improvement (productionization to packages/generators/csp/ is PHON-109's scope)

Frame

solve_shape scores xcomp/ccomp fillers using pure positive PMI (PPMI) from selectional.parquet. PPMI alone over-prefers rare-but-strongly-associated verbs: e.g., for matrix verb wish the row (wish, xcomp, proceed, fineweb_b3, count_v_r_f=6, count_v_r_star=7171, ppmi=1.696) ranks well — but proceed was observed only 6 times. A common verbal complement like go with much higher count_v_r_f but lower PPMI is currently demoted, even though it produces more natural sentences.

The PHON-95 spike already noticed this in _demo_clause_extension: want xcomp produces fillers like tame (count likely small, PPMI high) and cabal (similar), where go / do / make would feel more natural. Verbal complements have the smallest selectional samples in the corpus parse — PPMI variance is highest there.

This ticket adds a complementary frequency signal to the verbal-slot scoring: freq_<slot> = log(count_v_r_f + 1), exposed as a separate score component alongside the existing pmi_<slot>. Both axes are controlled through the existing per-axis weights mechanism. Because the hybrid won only 3 of 7 probes in the empirical eval, the implementation defaults the freq_<slot> weight to 0.0, so pure PPMI is the default ranking. Callers can enable the hybrid blend via weights={"freq_xcomp": 1.0, "freq_ccomp": 1.0}. The freq_<slot> column is always populated in score_components for inspection and reranker features, regardless of weight.

Goal

For seven canonical verbal-slot probes (want, try, like, need for xcomp; think, know, see for ccomp), the hybrid blend should improve mean teacher-distilled reranker quality scores on at least 4 of 7 probes vs PPMI-only.

If the hybrid wins ≥4 of 7, ship the hybrid blend as the default (both freq weights at 1.0). If 3 or fewer, surface the result honestly and let the user decide whether to ship pure PPMI as default and expose freq_<slot> as an opt-in axis.

Non-goals

  • Per-slot weight tuning — defer to PHON-107 (reranker v2) or a follow-up if needed.
  • Applying the blend to nominal slots (nsubj/dobj/pobj_*) — pure PPMI is well-calibrated there because of richer corpus signal. Nominal slots stay PPMI-only in this ticket.
  • Marginal lemma frequency from words.parquet — out of scope. The joint count signal (already in selectional.parquet) is the cleaner first iteration.
  • Frequency floor / hard threshold — drops candidates the reranker might rescue, inconsistent with PHON-104's vectorize-don't-prune principle.
  • Productionization move to packages/generators/csp/ (PHON-109 scope).

Architecture & data flow

selectional.parquet rows: (verb, role, filler, band, count_v_r_f, count_v_r_star, ppmi)
    ↓ filter (verb, role, band, ppmi > 0)
_slot_fillers(slot, ...)   ← in skeleton_csp.py
    ↓ for slot in {xcomp, ccomp}: extract count_v_r_f alongside ppmi
    ↓ return (fillers, scores: dict[filler, ppmi], freq_scores: dict[filler, log(count+1)])
_build_slot_filler_tables  ← when freq_scores non-empty, add freq_<slot> column
    ↓
_enumerate_vectorized       ← score_cols filter recognizes freq_* prefix
    ↓ total_score = Σ weights[k]·v across {pmi_*, freq_*, soft axes, adv_sentinel}
_dedup_and_assemble         ← drop logic mirrors pmi_* (locked-then-zero drops; non-locked keeps)
    ↓
candidates with score_components {pmi_xcomp, freq_xcomp, ...}
    ↓
reranker (downstream, PHON-95 Step 15) sees freq_<slot> as a new feature

| File | Change |
| --- | --- |
| skeleton_csp.py:_slot_fillers | For xcomp/ccomp, return a 3-tuple (fillers, scores, freq_scores) instead of (fillers, scores). Other slots return empty freq_scores={}. |
| skeleton_csp.py:solve_shape | The slot_fillers list type extends from list[tuple[str, list[str], dict[str, float]]] to list[tuple[str, list[str], dict[str, float], dict[str, float]]]. |
| skeleton_csp.py:_build_slot_filler_tables | When freq_scores is non-empty for a slot, add a freq_<slot> column populated parallel to pmi_<slot>. |
| skeleton_csp.py:_enumerate_vectorized | score_cols filter extends to recognize c.startswith("freq_") parallel to pmi_*. |
| skeleton_csp.py:_enumerate_python_fallback | Mirror the pmi_<slot> running-components bookkeeping for freq_<slot>. Same yield-before-cleanup asymmetry. |
| skeleton_csp.py:_dedup_and_assemble | score_cols filter extends to recognize freq_*. Same drop-on-zero-locked logic as pmi_*. |
| paragraph_csp.py:_solve_sentence | No change; uses solve_shape opaquely. |
| paradigm_3_csp.py:solve() | No change; solve_shape returns full score_components transparently. |
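
The prefix-based column selection described for _enumerate_vectorized and _dedup_and_assemble can be sketched as follows. This is a minimal illustration rather than the spike's code: the helper name select_score_cols, the SCORE_PREFIXES constant, and the example column names are hypothetical, and soft axes are assumed to be selected by separate logic.

```python
# Hypothetical sketch: pick out per-slot score columns by prefix.
SCORE_PREFIXES = ("pmi_", "freq_")  # assumption: soft axes are handled elsewhere


def select_score_cols(columns: list[str]) -> list[str]:
    """Return the columns that carry a per-slot score, preserving input order."""
    # str.startswith accepts a tuple, so one call covers both prefixes.
    return [c for c in columns if c.startswith(SCORE_PREFIXES)]


# Example column layout for an "nsubj,V,xcomp" shape (names are illustrative).
cols = ["nsubj", "xcomp", "pmi_nsubj", "pmi_xcomp", "freq_xcomp"]
print(select_score_cols(cols))  # ['pmi_nsubj', 'pmi_xcomp', 'freq_xcomp']
```

Because nominal slots never produce a freq_<slot> column, the same filter leaves their tables unchanged.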

Blend formula

import math

# In _slot_fillers, for slot in {"xcomp", "ccomp"}:
rows = sel_df.filter(
    (pl.col("verb") == verb)
    & (pl.col("role") == pmi_role)
    & (pl.col("band") == band)
    & (pl.col("ppmi") > 0.0)
)

ppmi_lookup = dict(zip(
    rows.get_column("filler").to_list(),
    rows.get_column("ppmi").to_list(),
))
count_lookup = dict(zip(
    rows.get_column("filler").to_list(),
    rows.get_column("count_v_r_f").to_list(),
))

# Existing self-loop exclusion: drop the matrix verb from xcomp/ccomp filler list
fillers = sorted(f for f in ppmi_lookup.keys() if f != verb)

scores = {f: ppmi_lookup[f] for f in fillers}
freq_scores = {f: math.log(count_lookup[f] + 1) for f in fillers}
return fillers, scores, freq_scores

For non-verbal slots: return (fillers, scores, {}). The empty freq_scores signals "no freq column for this slot" downstream.

total_score for a candidate is then:

total = α·pmi_<slot> + β·freq_<slot> + (other axes...)
where α defaults to 1.0 (PPMI), β defaults to 0.0 (freq is opt-in). Callers enable the hybrid blend via weights={"freq_xcomp": 1.0, ...}.
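
A minimal sketch of the weighted sum, showing how the opt-in weight changes a ranking. The function total_score below is a hypothetical stand-in, not the spike's implementation; it assumes every non-freq component defaults to weight 1.0. The proceed values come from the wish/xcomp row quoted in the Frame section; the second filler and its numbers are invented purely for illustration.

```python
import math
from typing import Optional


def total_score(components: dict[str, float],
                weights: Optional[dict[str, float]] = None) -> float:
    """Weighted sum of score components; freq_* defaults to 0.0, others to 1.0."""
    weights = weights or {}
    total = 0.0
    for name, value in components.items():
        default = 0.0 if name.startswith("freq_") else 1.0  # assumed defaults
        total += weights.get(name, default) * value
    return total


# From the spec: (wish, xcomp, proceed) has ppmi=1.696 and count_v_r_f=6.
proceed = {"pmi_xcomp": 1.696, "freq_xcomp": math.log(6 + 1)}
# Hypothetical common filler: lower PPMI, much higher joint count.
common = {"pmi_xcomp": 1.2, "freq_xcomp": math.log(400 + 1)}

hybrid = {"freq_xcomp": 1.0}
print(total_score(proceed) > total_score(common))                   # True: PPMI-only favors the rare verb
print(total_score(proceed, hybrid) > total_score(common, hybrid))   # False: the blend favors the common one
```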

Why log(count+1) and not log(count): the +1 smoother avoids log(0) for fillers with count=0. Since the ppmi > 0 filter already implies count > 0 in the data, this is defensive — but matches standard log-frequency conventions.

Magnitude calibration check, typical xcomp ranges:

  • PPMI: 0.0–~5.0 (typical "good" candidates: 1.5–3.5)
  • count_v_r_f: 1–~3000 (typical: 5–100)
  • log(count+1): ~0.7–~8 (typical: 1.8–4.6)

The two signals are within ~2× of each other in typical ranges. With α=β=1.0 (the hybrid setting used in the eval), freq carries slightly more weight than PPMI in raw magnitude. Per-request reweighting, e.g. weights={"freq_xcomp": 0.5}, rebalances the blend if freq dominates.
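
The log(count+1) figures in the calibration check can be recomputed directly from the quoted count range. The helper name freq_score is hypothetical; it only restates the formula freq_<slot> = log(count_v_r_f + 1).

```python
import math


def freq_score(count_v_r_f: int) -> float:
    """Log-frequency signal as defined in this spec: log(count + 1)."""
    return math.log(count_v_r_f + 1)


# Endpoints of the quoted count ranges: overall 1 to ~3000, typical 5 to 100.
for count in (1, 5, 100, 3000):
    print(count, round(freq_score(count), 2))
# 1 0.69
# 5 1.79
# 100 4.62
# 3000 8.01
```

Since a row that passes the ppmi > 0 filter has count_v_r_f of at least 1, the smallest value the signal takes in practice is log(2) ≈ 0.69.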

Validation plan

A new research script eval_hybrid_xcomp_ccomp.py runs the comparison once and records numbers.

PROBES = [
    ("want",  "xcomp", "nsubj,V,xcomp"),
    ("try",   "xcomp", "nsubj,V,xcomp"),
    ("like",  "xcomp", "nsubj,V,xcomp"),
    ("need",  "xcomp", "nsubj,V,xcomp"),
    ("think", "ccomp", "nsubj,V,ccomp"),
    ("know",  "ccomp", "nsubj,V,ccomp"),
    ("see",   "ccomp", "nsubj,V,ccomp"),
]

# For each probe:
#   1. Generate top-K=8 under PPMI-only (weights={"freq_xcomp": 0.0, "freq_ccomp": 0.0})
#   2. Generate top-K=8 under hybrid (weights={"freq_xcomp": 1.0, "freq_ccomp": 1.0})
#   3. Score both candidate sets with the existing teacher-distilled reranker
#   4. Record: mean Q, top-1 Q, top-3 Q, sentence diff count

The reranker code lives in train_reranker.py and quality_axis.py from the PHON-95 spike. The eval script imports score_candidates(candidates) -> list[float] and applies it to both candidate lists per probe.
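
The per-probe metrics in step 4 could be computed along these lines. This is a sketch under stated assumptions, not the eval script itself: summarize and sentence_diff_count are hypothetical helpers, "top-3 Q" is assumed to mean the mean Q of the first three candidates in CSP rank order, and "sentence diff count" is assumed to mean the number of hybrid sentences absent from the PPMI-only set. The scores and sentence labels in the example are made up.

```python
from statistics import mean


def summarize(q_scores: list[float]) -> dict[str, float]:
    """Summarize reranker scores given in CSP rank order (best-ranked first)."""
    if not q_scores:
        raise ValueError("probe produced no candidates")
    return {
        "mean_q": mean(q_scores),
        "top1_q": q_scores[0],
        "top3_q": mean(q_scores[:3]),  # assumption: mean over the first three
    }


def sentence_diff_count(ppmi_sentences: list[str], hybrid_sentences: list[str]) -> int:
    """Count hybrid sentences that do not appear in the PPMI-only set (assumed definition)."""
    return len(set(hybrid_sentences) - set(ppmi_sentences))


# Illustrative values only, not measured results.
print(summarize([2.0, 1.0, 3.0, 0.0]))
print(sentence_diff_count(["a", "b", "c"], ["b", "c", "d"]))  # 1
```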

Output table (appended to spec under "Empirical baseline"):

| Probe | PPMI-only mean Q | Hybrid mean Q | Δ | PPMI top-1 | Hybrid top-1 |
| --- | --- | --- | --- | --- | --- |
| want xcomp | 2.335 | 1.813 | −0.522 | The comrade wants to belabor. | The comrade wants to know. |
| try xcomp | 1.922 | 1.820 | −0.102 | The clown tries to americanize. | The clown tries to figure. |
| like xcomp | 1.695 | 1.722 | +0.028 | The coder likes to squish. | The coder likes to reuse. |
| need xcomp | 1.748 | 1.852 | +0.103 | The culvert needs to biopsie. | The culvert needs to know. |
| think ccomp | 1.835 | 1.343 | −0.492 | The comrade thinks that the color compliments the course. | The comrade thinks that the calyx has the clockwork. |
| know ccomp | 1.385 | 1.404 | +0.019 | The caveman knows that the clock tolls. | The caveman knows that the calyx has the clockwork. |
| see ccomp | 2.933 | 1.526 | −1.408 | The cadre sees that the command coils the cable. | The cadre sees that the cursing happens the cause. |

Wins (Hybrid > PPMI): 3/7

Decision (recorded 2026-05-08): hybrid wins 3/7 probes. Per the criterion in this spec, ship pure PPMI as default; freq_xcomp and freq_ccomp stay as opt-in axes. Callers that want the frequency blend can pass weights={"freq_xcomp": 1.0, "freq_ccomp": 1.0} (or any positive value) — the score components are already present in score_components for all xcomp/ccomp candidates regardless of ranking mode. The hybrid underperformed on 4 of 7 probes: the frequency signal preferentially promotes high-count verbs (know, figure, reuse) that are grammatically generic but contextually weak in narrow-domain (SLP) settings where PPMI-strong rare verbs may actually be more appropriate. PHON-107 (reranker v2) is the right venue for re-examining blend calibration with per-slot weight tuning.

Decision criterion: if hybrid mean-Q is higher on ≥4 of 7 probes, commit the hybrid blend as the default (both freq weights at 1.0). If 3 or fewer, report the result as-is and ship pure PPMI as default with freq_<slot> exposed as opt-in via the weights dict.
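
The criterion can be checked mechanically against the recorded mean-Q values. The numbers below are copied from the output table; the variable names and the MIN_WINS constant are illustrative.

```python
# (PPMI-only mean Q, hybrid mean Q) per probe, from the output table.
MEAN_Q = {
    "want xcomp": (2.335, 1.813),
    "try xcomp": (1.922, 1.820),
    "like xcomp": (1.695, 1.722),
    "need xcomp": (1.748, 1.852),
    "think ccomp": (1.835, 1.343),
    "know ccomp": (1.385, 1.404),
    "see ccomp": (2.933, 1.526),
}
MIN_WINS = 4  # threshold from the decision criterion: at least 4 of 7

# A probe counts as a hybrid win when its hybrid mean Q is strictly higher.
wins = [probe for probe, (ppmi_q, hybrid_q) in MEAN_Q.items() if hybrid_q > ppmi_q]
ship_hybrid_default = len(wins) >= MIN_WINS

print(len(wins), wins)       # 3 ['like xcomp', 'need xcomp', 'know ccomp']
print(ship_hybrid_default)   # False
```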

Edge cases

| Case | Behavior |
| --- | --- |
| Slot is xcomp/ccomp with empty filler list | _slot_fillers returns (fillers=[], scores={}, freq_scores={}). solve_shape returns []. |
| Slot is nsubj/dobj/pobj_* | Returns (fillers, scores, {}) with empty freq_scores. _build_slot_filler_tables skips the freq column. No behavior change. |
| Locked verbal slot (paragraph composition with locked verbal complement) | Locked-branch logic mirrors pmi_<slot>: scores.get/freq_scores.get default to 0; locked-with-zero drops both columns from components. Same asymmetric drop as PHON-104 fix. |
| Recursive _resolve_embedded_clause for ccomp | Inner solve_shape call gets the freq blend on its own xcomp/ccomp slots too. Consistent. |
| User explicitly sets weights={"freq_xcomp": 0.0, "freq_ccomp": 0.0} | Equivalent to PPMI-only mode. The eval script uses this exact pattern for the A/B comparison. |
| count_v_r_f = 0 (impossible given ppmi > 0 filter, but defensive) | log(0 + 1) = 0, so freq_<slot> = 0. Under non-locked-keep semantics, it stays in components at 0. Under locked-and-zero semantics, it gets dropped (consistent with pmi_<slot> behavior). |

Testing

In test_vectorized_enumeration.py:

def test_xcomp_freq_score_present_in_components(store, sel_df):
    """xcomp slot produces freq_xcomp in score_components."""
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,xcomp", parse_arg_structure("nsubj,V,xcomp"), 0)
    top = solve_shape(shape, verb="want", domain_words=spec_words, sel_df=sel_df,
                      band="fineweb_adult", word_axes={}, cross_axes={},
                      word_df=store.df, top_k=3)
    assert top, "want should have xcomp candidates"
    for c in top:
        assert "freq_xcomp" in c["score_components"]
        assert c["score_components"]["freq_xcomp"] > 0


def test_nominal_slots_have_no_freq_component(store, sel_df):
    """nsubj/dobj should NOT get freq_<slot> components."""
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,dobj", parse_arg_structure("nsubj,V,dobj"), 0)
    top = solve_shape(shape, verb="cut", domain_words=spec_words, sel_df=sel_df,
                      band="fineweb_adult", word_axes={}, cross_axes={},
                      word_df=store.df, top_k=3)
    for c in top:
        assert "freq_nsubj" not in c["score_components"]
        assert "freq_dobj" not in c["score_components"]


def test_hybrid_weight_zero_recovers_ppmi_only_ranking(store, sel_df):
    """weights={'freq_xcomp': 0.0} disables freq → ranking sorts purely by PPMI."""
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,xcomp", parse_arg_structure("nsubj,V,xcomp"), 0)
    common = dict(verb="want", domain_words=spec_words, sel_df=sel_df,
                  band="fineweb_adult", word_axes={}, cross_axes={},
                  word_df=store.df, top_k=8)

    ppmi_only = solve_shape(shape, weights={"freq_xcomp": 0.0}, **common)

    pmi_scores = [c["score_components"]["pmi_xcomp"] for c in ppmi_only]
    assert pmi_scores == sorted(pmi_scores, reverse=True), (
        f"pmi-only mode should rank by PPMI desc, got {pmi_scores}"
    )

Extend test_vectorized_matches_python with two verbal-clause probes:

@pytest.mark.parametrize("verb,spec_id,arg_structure", [
    # ... existing 6 nominal probes ...
    ("want",  "spec1", "nsubj,V,xcomp"),
    ("think", "spec1", "nsubj,V,ccomp"),
])

The vectorized and python paths must produce bit-identical sentences/score_components/total_score for verbal probes, just like nominal ones.

Open questions

None.

References

  • PHON-94 (corpus parse → selectional.parquet) — origin of count_v_r_f.
  • PHON-95 acceptance probes — _demo_clause_extension exposes the rare-PPMI failure mode.
  • PHON-104 vectorized enumeration — freq_<slot> is a new column following the same patterns established for pmi_<slot> and the per-word axes.
  • PHON-95 reranker (Step 15, teacher-distilled, Spearman 0.633) — the quality oracle used in the eval script.
  • Spike code: packages/generation/research/2026-05-07-sentence-generation-paradigms/.