PHON-105 - CSP quality: hybrid PPMI + raw frequency for verbal slots
Date: 2026-05-08
Branch: feature/csp-iteration
Scope: spike-internal quality improvement (productionization to packages/generators/csp/ is PHON-109's scope)
Frame
solve_shape scores xcomp/ccomp fillers using pure positive PMI (PPMI) from selectional.parquet. PPMI alone over-prefers rare-but-strongly-associated verbs: e.g., for matrix verb wish the row (wish, xcomp, proceed, fineweb_b3, count_v_r_f=6, count_v_r_star=7171, ppmi=1.696) ranks well — but proceed was observed only 6 times. A common verbal complement like go with much higher count_v_r_f but lower PPMI is currently demoted, even though it produces more natural sentences.
The PHON-95 spike already noticed this in _demo_clause_extension: want xcomp produces fillers like tame (count likely small, PPMI high) and cabal (similar), where go / do / make would feel more natural. Verbal complements have the smallest selectional samples in the corpus parse — PPMI variance is highest there.
This ticket adds a complementary frequency signal to the verbal-slot scoring: freq_<slot> = log(count_v_r_f + 1), exposed as a separate score component alongside the existing pmi_<slot>. Both axes are exposed via the existing per-axis weights mechanism. After the empirical eval (3/7 hybrid wins), the implementation defaults the freq_<slot> weight to 0.0, so pure PPMI is the default ranking. Callers can enable the hybrid blend via weights={"freq_xcomp": 1.0, "freq_ccomp": 1.0}. The freq_<slot> components are present in score_components for all xcomp/ccomp candidates regardless of ranking mode.
Goal
For seven canonical verbal-slot probes (want, try, like, need for xcomp; think, know, see for ccomp), the hybrid blend should improve mean teacher-distilled reranker quality scores on at least 4 of 7 probes vs PPMI-only.
If the hybrid wins ≥4 of 7, ship at default weights. If 3 or fewer, surface the result honestly and let the user decide whether to ship pure PPMI as default and expose freq_<slot> as an opt-in axis.
Non-goals
- Per-slot weight tuning: defer to PHON-107 (reranker v2) or a follow-up if needed.
- Applying the blend to nominal slots (nsubj/dobj/pobj_*): pure PPMI is well-calibrated there because of richer corpus signal. Nominal slots stay PPMI-only in this ticket.
- Marginal lemma frequency from words.parquet: out of scope. The joint count signal (already in selectional.parquet) is the cleaner first iteration.
- Frequency floor / hard threshold: drops candidates the reranker might rescue, inconsistent with PHON-104's vectorize-don't-prune principle.
- Productionization move to packages/generators/csp/ (PHON-109 scope).
Architecture & data flow
selectional.parquet rows: (verb, role, filler, band, count_v_r_f, count_v_r_star, ppmi)
↓ filter (verb, role, band, ppmi > 0)
_slot_fillers(slot, ...) ← in skeleton_csp.py
↓ for slot in {xcomp, ccomp}: extract count_v_r_f alongside ppmi
↓ return (fillers, scores: dict[filler, ppmi], freq_scores: dict[filler, log(count+1)])
_build_slot_filler_tables ← when freq_scores non-empty, add freq_<slot> column
↓
_enumerate_vectorized ← score_cols filter recognizes freq_* prefix
↓ total_score = Σ weights[k]·v across {pmi_*, freq_*, soft axes, adv_sentinel}
_dedup_and_assemble ← drop logic mirrors pmi_* (locked-then-zero drops; non-locked keeps)
↓
candidates with score_components {pmi_xcomp, freq_xcomp, ...}
↓
reranker (downstream, PHON-95 Step 15) sees freq_<slot> as a new feature
| File | Change |
|---|---|
| skeleton_csp.py:_slot_fillers | For xcomp/ccomp, return a 3-tuple (fillers, scores, freq_scores) instead of (fillers, scores). Other slots return empty freq_scores={}. |
| skeleton_csp.py:solve_shape | The slot_fillers list type extends from list[tuple[str, list[str], dict[str, float]]] to list[tuple[str, list[str], dict[str, float], dict[str, float]]]. |
| skeleton_csp.py:_build_slot_filler_tables | When freq_scores is non-empty for a slot, add a freq_<slot> column populated parallel to pmi_<slot>. |
| skeleton_csp.py:_enumerate_vectorized | score_cols filter extends to recognize c.startswith("freq_") parallel to pmi_*. |
| skeleton_csp.py:_enumerate_python_fallback | Mirror the pmi_<slot> running-components bookkeeping for freq_<slot>. Same yield-before-cleanup asymmetry. |
| skeleton_csp.py:_dedup_and_assemble | score_cols filter extends to recognize freq_*. Same drop-on-zero-locked logic as pmi_*. |
| paragraph_csp.py:_solve_sentence | No change: uses solve_shape opaquely. |
| paradigm_3_csp.py:solve() | No change: solve_shape returns full score_components transparently. |
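The score_cols change listed for _enumerate_vectorized and _dedup_and_assemble can be sketched as a prefix filter. This is a minimal illustration under assumptions, not the spike code: the helper name `score_columns` and the example column list are hypothetical, and the soft axes and adv_sentinel that also feed total_score are left out for brevity.

```python
def score_columns(columns: list[str]) -> list[str]:
    """Select per-slot score columns by prefix.

    Hypothetical helper: freq_* is recognized in parallel to pmi_*,
    as the table describes. Other score axes are omitted here.
    """
    return [c for c in columns if c.startswith(("pmi_", "freq_"))]


# Illustrative column names; only the pmi_/freq_ prefixes come from the spec.
cols = ["nsubj", "xcomp", "pmi_nsubj", "pmi_xcomp", "freq_xcomp"]
selected = score_columns(cols)
```

Passing a tuple to `str.startswith` keeps the two prefixes in one place, so a later axis family would be a one-word change.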
Blend formula
import math

import polars as pl

# In _slot_fillers, for slot in {"xcomp", "ccomp"}:
rows = sel_df.filter(
    (pl.col("verb") == verb)
    & (pl.col("role") == pmi_role)
    & (pl.col("band") == band)
    & (pl.col("ppmi") > 0.0)
)
# Two lookups keyed by filler, built from the same filtered rows
ppmi_lookup = dict(zip(
    rows.get_column("filler").to_list(),
    rows.get_column("ppmi").to_list(),
))
count_lookup = dict(zip(
    rows.get_column("filler").to_list(),
    rows.get_column("count_v_r_f").to_list(),
))
# Existing self-loop exclusion: drop the matrix verb from xcomp/ccomp filler list
fillers = sorted(f for f in ppmi_lookup if f != verb)
scores = {f: ppmi_lookup[f] for f in fillers}
# Log-frequency component, parallel to the PPMI scores
freq_scores = {f: math.log(count_lookup[f] + 1) for f in fillers}
return fillers, scores, freq_scores
For non-verbal slots: return (fillers, scores, {}). The empty freq_scores signals "no freq column for this slot" downstream.
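total_score for a candidate is then:
total = α·pmi_<slot> + β·freq_<slot> + (other axes...)
where α and β are the per-axis entries of the weights dict. β defaults to 0.0 after the eval; callers enable the blend via weights={"freq_xcomp": 1.0, ...}.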
Why log(count+1) and not log(count): the +1 smoother avoids log(0) for fillers with count=0. Since the ppmi > 0 filter already implies count > 0 in the data, this is defensive — but matches standard log-frequency conventions.
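The effect of the +1 smoother can be checked directly. The sketch below uses only the standard library; `freq_score` is a hypothetical name for the log(count_v_r_f + 1) expression, and the count of 6 is the wish/proceed row quoted in the Frame section.

```python
import math


def freq_score(count_v_r_f: int) -> float:
    """Log-frequency component as defined in the spec: log(count + 1)."""
    return math.log(count_v_r_f + 1)


# With the +1 smoother, a zero count is safe and maps to exactly 0.0.
zero_score = freq_score(0)

# A bare log(0) raises ValueError in Python.
try:
    math.log(0)
    bare_log_failed = False
except ValueError:
    bare_log_failed = True

# The wish/proceed row has count_v_r_f = 6.
proceed_score = freq_score(6)
```

`math.log1p(count)` computes the same quantity and could be swapped in if preferred.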
Magnitude calibration check, typical xcomp ranges:
- PPMI: 0.0–~5.0 (typical "good" candidates: 1.5–3.5)
- count_v_r_f: 1–~3000 (typical: 5–100)
- log(count+1): ~0.7–~8 (typical: 1.8–4.6)
The two signals are within ~2× of each other in typical ranges. With the hybrid enabled at α=β=1.0, freq carries slightly more weight than PPMI in raw magnitude. Per-request reweighting via weights={"freq_xcomp": 0.5} rebalances if the eval shows freq dominating.
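A small worked example shows how the blend can reorder two fillers. It is a sketch under stated assumptions: `total_score` and `rank` are hypothetical helpers, the rule that unspecified axes default to 1.0 (and freq_* to 0.0) is an assumption about the weights mechanism, and the numbers for "go" are invented for illustration. Only the "proceed" values come from the Frame section.

```python
import math


def total_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum over score components.

    Assumption: axes absent from `weights` default to 1.0, except freq_*
    axes, which default to 0.0 (the post-eval default in this spec).
    """
    total = 0.0
    for axis, value in components.items():
        default = 0.0 if axis.startswith("freq_") else 1.0
        total += weights.get(axis, default) * value
    return total


# "proceed" uses ppmi=1.696 and count=6 from the spec; "go" is a hypothetical
# common verb with made-up numbers, included only to show the reordering.
candidates = {
    "proceed": {"pmi_xcomp": 1.696, "freq_xcomp": math.log(6 + 1)},
    "go": {"pmi_xcomp": 0.9, "freq_xcomp": math.log(400 + 1)},
}


def rank(weights: dict[str, float]) -> list[str]:
    """Order fillers by descending total score under the given weights."""
    return sorted(
        candidates,
        key=lambda f: total_score(candidates[f], weights),
        reverse=True,
    )


ppmi_only = rank({})                  # freq weight falls back to 0.0
hybrid = rank({"freq_xcomp": 1.0})    # opt-in blend
```

Under PPMI-only the rare verb ranks first; with the frequency axis enabled the common verb overtakes it, which is the behavior the eval set out to measure.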
Validation plan
A new research script eval_hybrid_xcomp_ccomp.py runs the comparison once and records numbers.
PROBES = [
    ("want", "xcomp", "nsubj,V,xcomp"),
    ("try", "xcomp", "nsubj,V,xcomp"),
    ("like", "xcomp", "nsubj,V,xcomp"),
    ("need", "xcomp", "nsubj,V,xcomp"),
    ("think", "ccomp", "nsubj,V,ccomp"),
    ("know", "ccomp", "nsubj,V,ccomp"),
    ("see", "ccomp", "nsubj,V,ccomp"),
]

# For each probe:
#   1. Generate top-K=8 under PPMI-only (weights={"freq_xcomp": 0.0, "freq_ccomp": 0.0})
#   2. Generate top-K=8 under hybrid (weights={"freq_xcomp": 1.0, "freq_ccomp": 1.0})
#   3. Score both candidate sets with the existing teacher-distilled reranker
#   4. Record: mean Q, top-1 Q, top-3 Q, sentence diff count
The reranker code lives in train_reranker.py and quality_axis.py from the PHON-95 spike. The eval script imports score_candidates(candidates) -> list[float] and applies it to both candidate lists per probe.
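The per-probe comparison could look roughly like the sketch below. It is not the eval script: `compare_probe` is a hypothetical helper, the generator and scorer are passed in so the block runs standalone, and the candidate key `"sentence"` is an assumed field name. Only the two weight settings and the score_candidates signature come from this spec.

```python
from statistics import mean
from typing import Callable

Candidate = dict  # assumed shape: {"sentence": str, ...}


def compare_probe(
    generate: Callable[[dict[str, float]], list[Candidate]],
    score_candidates: Callable[[list[Candidate]], list[float]],
) -> dict[str, float]:
    """Run one PPMI-only vs hybrid comparison for a single probe."""
    ppmi_weights = {"freq_xcomp": 0.0, "freq_ccomp": 0.0}
    hybrid_weights = {"freq_xcomp": 1.0, "freq_ccomp": 1.0}

    ppmi_cands = generate(ppmi_weights)
    hybrid_cands = generate(hybrid_weights)
    ppmi_q = score_candidates(ppmi_cands)
    hybrid_q = score_candidates(hybrid_cands)

    ppmi_sentences = {c["sentence"] for c in ppmi_cands}
    return {
        "ppmi_mean_q": mean(ppmi_q),
        "hybrid_mean_q": mean(hybrid_q),
        "delta": mean(hybrid_q) - mean(ppmi_q),
        "ppmi_top1_q": ppmi_q[0],
        "hybrid_top1_q": hybrid_q[0],
        # Hybrid sentences that do not appear in the PPMI-only set
        "sentence_diff_count": sum(
            c["sentence"] not in ppmi_sentences for c in hybrid_cands
        ),
    }


# Stand-ins for solve_shape and the reranker, for illustration only.
def fake_generate(weights: dict[str, float]) -> list[Candidate]:
    if weights["freq_xcomp"] > 0:
        return [{"sentence": "A"}, {"sentence": "C"}]
    return [{"sentence": "A"}, {"sentence": "B"}]


def fake_score(cands: list[Candidate]) -> list[float]:
    table = {"A": 2.0, "B": 1.0, "C": 3.0}
    return [table[c["sentence"]] for c in cands]


result = compare_probe(fake_generate, fake_score)
```

A probe with an empty candidate list would make `mean` raise, so a real script would need to handle that case explicitly.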
Output table (appended to spec under "Empirical baseline"):
| Probe | PPMI-only mean Q | Hybrid mean Q | Δ | PPMI top-1 | Hybrid top-1 |
|---|---|---|---|---|---|
| want xcomp | 2.335 | 1.813 | −0.522 | The comrade wants to belabor. | The comrade wants to know. |
| try xcomp | 1.922 | 1.820 | −0.102 | The clown tries to americanize. | The clown tries to figure. |
| like xcomp | 1.695 | 1.722 | +0.028 | The coder likes to squish. | The coder likes to reuse. |
| need xcomp | 1.748 | 1.852 | +0.103 | The culvert needs to biopsie. | The culvert needs to know. |
| think ccomp | 1.835 | 1.343 | −0.492 | The comrade thinks that the color compliments the course. | The comrade thinks that the calyx has the clockwork. |
| know ccomp | 1.385 | 1.404 | +0.019 | The caveman knows that the clock tolls. | The caveman knows that the calyx has the clockwork. |
| see ccomp | 2.933 | 1.526 | −1.408 | The cadre sees that the command coils the cable. | The cadre sees that the cursing happens the cause. |
| Wins (Hybrid > PPMI) | 3/7 |
Decision (recorded 2026-05-08): hybrid wins 3/7 probes. Per the criterion in this spec, ship pure PPMI as default; freq_xcomp and freq_ccomp stay as opt-in axes. Callers that want the frequency blend can pass weights={"freq_xcomp": 1.0, "freq_ccomp": 1.0} (or any positive value) — the score components are already present in score_components for all xcomp/ccomp candidates regardless of ranking mode. The hybrid underperformed on 4 of 7 probes: the frequency signal preferentially promotes high-count verbs (know, figure, reuse) that are grammatically generic but contextually weak in narrow-domain (SLP) settings where PPMI-strong rare verbs may actually be more appropriate. PHON-107 (reranker v2) is the right venue for re-examining blend calibration with per-slot weight tuning.
Decision criterion: if hybrid mean-Q is higher on ≥4 of 7 probes, commit at default weights. If 3 or fewer, surface honestly and ship pure PPMI as default with freq_<slot> exposed as opt-in via the weights dict.
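The win count can be reproduced from the mean Q columns of the table. The snippet below is a sketch with hypothetical names (`MEAN_Q`, `hybrid_wins`); the numbers are copied from the table and the threshold of 4 is the one stated in the criterion.

```python
# Mean Q per probe, copied from the table: (PPMI-only, hybrid).
MEAN_Q = {
    "want xcomp": (2.335, 1.813),
    "try xcomp": (1.922, 1.820),
    "like xcomp": (1.695, 1.722),
    "need xcomp": (1.748, 1.852),
    "think ccomp": (1.835, 1.343),
    "know ccomp": (1.385, 1.404),
    "see ccomp": (2.933, 1.526),
}

WIN_THRESHOLD = 4  # "at least 4 of 7" from the decision criterion


def hybrid_wins(mean_q: dict[str, tuple[float, float]]) -> list[str]:
    """Return the probes where hybrid mean Q exceeds PPMI-only mean Q."""
    return [probe for probe, (ppmi, hybrid) in mean_q.items() if hybrid > ppmi]


wins = hybrid_wins(MEAN_Q)
ship_hybrid_by_default = len(wins) >= WIN_THRESHOLD
```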
Edge cases
| Case | Behavior |
|---|---|
| Slot is xcomp/ccomp with empty filler list | _slot_fillers returns (fillers=[], scores={}, freq_scores={}). solve_shape returns []. |
| Slot is nsubj/dobj/pobj_* | (fillers, scores, {}) with empty freq_scores. _build_slot_filler_tables skips the freq column. No behavior change. |
| Locked verbal slot (paragraph composition with locked verbal complement) | Locked-branch logic mirrors pmi_<slot>: scores.get/freq_scores.get default to 0; locked-with-zero drops both columns from components. Same asymmetric drop as PHON-104 fix. |
| Recursive _resolve_embedded_clause for ccomp | Inner solve_shape call gets the freq blend on its own xcomp/ccomp slots too. Consistent. |
| User explicitly sets weights={"freq_xcomp": 0.0, "freq_ccomp": 0.0} | Equivalent to PPMI-only mode. The eval script uses this exact pattern for the A/B comparison. |
| count_v_r_f = 0 (impossible given ppmi > 0 filter, but defensive) | log(0 + 1) = 0. freq_<slot> is then 0 and is handled like a zero pmi_<slot> (mirrors pmi_<slot> behavior). |
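One possible reading of the locked-slot rule in the table is sketched below. This is an interpretation, not the spike code: `slot_components` is a hypothetical helper, and the per-column drop rule is an assumption. It drops both columns for a locked filler that is missing from the lookups, because both values then default to 0.

```python
def slot_components(
    slot: str,
    filler: str,
    scores: dict[str, float],
    freq_scores: dict[str, float],
    locked: bool,
) -> dict[str, float]:
    """Build the pmi_/freq_ components for one slot (assumed drop rule)."""
    pmi = scores.get(filler, 0.0)
    freq = freq_scores.get(filler, 0.0)

    components: dict[str, float] = {}
    # Locked slot with a zero value: drop the column. Non-locked: keep it.
    if not (locked and pmi == 0.0):
        components[f"pmi_{slot}"] = pmi
    # An empty freq_scores means "no freq column for this slot" (nominal slots).
    if freq_scores and not (locked and freq == 0.0):
        components[f"freq_{slot}"] = freq
    return components


# Illustrative lookups; the values are made up.
scores = {"go": 1.2}
freq_scores = {"go": 3.0}

locked_unknown = slot_components("xcomp", "tame", scores, freq_scores, locked=True)
free_known = slot_components("xcomp", "go", scores, freq_scores, locked=False)
nominal = slot_components("dobj", "cake", {"cake": 2.0}, {}, locked=False)
```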
Testing
In test_vectorized_enumeration.py:
def test_xcomp_freq_score_present_in_components(store, sel_df):
    """xcomp slot produces freq_xcomp in score_components."""
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,xcomp", parse_arg_structure("nsubj,V,xcomp"), 0)
    top = solve_shape(shape, verb="want", domain_words=spec_words, sel_df=sel_df,
                      band="fineweb_adult", word_axes={}, cross_axes={},
                      word_df=store.df, top_k=3)
    assert top, "want should have xcomp candidates"
    for c in top:
        assert "freq_xcomp" in c["score_components"]
        assert c["score_components"]["freq_xcomp"] > 0


def test_nominal_slots_have_no_freq_component(store, sel_df):
    """nsubj/dobj should NOT get freq_<slot> components."""
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,dobj", parse_arg_structure("nsubj,V,dobj"), 0)
    top = solve_shape(shape, verb="cut", domain_words=spec_words, sel_df=sel_df,
                      band="fineweb_adult", word_axes={}, cross_axes={},
                      word_df=store.df, top_k=3)
    for c in top:
        assert "freq_nsubj" not in c["score_components"]
        assert "freq_dobj" not in c["score_components"]


def test_hybrid_weight_zero_recovers_ppmi_only_ranking(store, sel_df):
    """weights={'freq_xcomp': 0.0} disables freq → ranking sorts purely by PPMI."""
    spec_words = paradigm_3_csp.spec_lexicon(store, "spec1")
    shape = SkeletonShape("nsubj,V,xcomp", parse_arg_structure("nsubj,V,xcomp"), 0)
    common = dict(verb="want", domain_words=spec_words, sel_df=sel_df,
                  band="fineweb_adult", word_axes={}, cross_axes={},
                  word_df=store.df, top_k=8)
    ppmi_only = solve_shape(shape, weights={"freq_xcomp": 0.0}, **common)
    pmi_scores = [c["score_components"]["pmi_xcomp"] for c in ppmi_only]
    assert pmi_scores == sorted(pmi_scores, reverse=True), (
        f"pmi-only mode should rank by PPMI desc, got {pmi_scores}"
    )
Extend test_vectorized_matches_python with two verbal-clause probes:
@pytest.mark.parametrize("verb,spec_id,arg_structure", [
    # ... existing 6 nominal probes ...
    ("want", "spec1", "nsubj,V,xcomp"),
    ("think", "spec1", "nsubj,V,ccomp"),
])
The vectorized and python paths must produce bit-identical sentences/score_components/total_score for verbal probes, just like nominal ones.
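The parity requirement amounts to exact equality on three fields. A minimal sketch of such a check follows; `assert_paths_match` is a hypothetical helper, and the candidate keys `"sentence"` and `"total_score"` are assumed names (only `"score_components"` appears in the tests above).

```python
def assert_paths_match(vectorized: list[dict], python_fallback: list[dict]) -> None:
    """Require identical candidates, in the same order, from both paths."""
    assert len(vectorized) == len(python_fallback), "candidate counts differ"
    for v, p in zip(vectorized, python_fallback):
        for key in ("sentence", "score_components", "total_score"):
            # Exact equality on purpose: the spec asks for bit-identical output.
            assert v[key] == p[key], f"{key} differs: {v[key]!r} != {p[key]!r}"


# Illustrative candidate with made-up values.
candidate = {
    "sentence": "The coder likes to reuse.",
    "score_components": {"pmi_xcomp": 1.5, "freq_xcomp": 2.0},
    "total_score": 3.5,
}
assert_paths_match([candidate], [dict(candidate)])
```

Exact comparison is stricter than a tolerance check such as `pytest.approx`, so it will flag any difference in floating-point summation order between the two paths.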
Open questions
None.
References
- PHON-94 (corpus parse → selectional.parquet): origin of count_v_r_f.
- PHON-95 acceptance probes: _demo_clause_extension exposes the rare-PPMI failure mode.
- PHON-104 vectorized enumeration: freq_<slot> is a new column following the same patterns established for pmi_<slot> and the per-word axes.
- PHON-95 reranker (Step 15, teacher-distilled, Spearman 0.633): the quality oracle used in the eval script.
- Spike code: packages/generation/research/2026-05-07-sentence-generation-paradigms/.