PHON-117: v5.2 realizer + solver rewrite

Status: SPEC (awaiting sign-off)
Owner: TBD
Filed: 2026-05-13
Branch: feature/phon-116-naturalness-scorer (work continues on the existing PHON-116 branch since the reranker retrain is part of this scope)
Blocks: PR #96 (release/v5.2.0 → develop); v5.2 cannot ship until this lands.

Why this exists

v5.2 Governed Generation is producing user-visible nonsense in the governed-generation UI:

98% of corpus-attested pos_templates fail to render. The realizer's _LEAKY_CONTENT_POS = {NOUN, PROPN, ADJ, NUM, INTJ} guard rejects any template with a content POS at a non-slot position. In a random sample of 500 fineweb_adult nsubj,V,dobj templates, 490 fail and 10 succeed. Surviving templates skew DET-NOUN-VERB-DET-NOUN, so SLP-visible output reads as monotonous regardless of band or constraint.
Selectional preferences (PPMI from FineWeb-Edu DEP parse) drive the candidate set. When the parse misattributes a verb-preposition pair ("build IN [location]" → pobj_in for "build"), the solver enumerates "the peasant builds in a brick" with high PPMI. The selectional layer is an intermediate heuristic that masquerades as a quality signal.
The PHON-116 naturalness scorer was trained on output from the broken realizer. Its notion of "natural CSP" is calibrated to the DET-NOUN-VERB-DET-NOUN-heavy distribution. Retraining on the new distribution is necessary.
PROPN leaks through skeleton templates ("The Eiffel verbs the cake"). PROPN was supposed to be purged in v5.2 but still appears in pos_templates as scaffolding tokens.
Function words slip past phoneme constraints. Resolved in PHON-117's earlier work today: corpus_sentences.parquet now stores CMU phonemes for all tokens (content + function + clitics). This rewrite preserves that.

Design (locked, modulo open Q)

1. Solver: drop selectional from the hot path

The solver no longer joins selectional.parquet. Candidate enumeration:

1. Skeleton inventory (loaded once at server cold-start):
   - Read skeletons.parquet.
   - Filter out any pos_template containing PROPN at any position.
   - Filter to canonical arg_structure (existing CANONICAL_ARG_STRUCTURES set).
   - Cap pos_template length to ≤ N tokens (tunable; start N=8 or 10). Long templates produce wordy output with proportionally more chances for the realizer or the reranker to misfire. The cap is a quality lever.
   - Dedupe by pos_template (within an arg_structure × band), keep one representative per unique pos_template.
2. Per-request constraint-filtered lexicon pools. Filter words.parquet once by hard_filter_expr. Bin survivors by POS into {NOUN: [...], VERB: [...], ADJ: [...], ADV: [...]}. The non-slot content positions in pos_template pull from the matching POS bin.
3. Combinatoric sampling. For each (skeleton, ...) combination, draw uniformly at random from its POS pools. Cap aggregate candidate set at max_candidates (default 5000). Surface-dedup as the last step before reranking.
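
The inventory steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the production loader: each skeleton row is assumed to be a dict with `pos_template` (a list of POS tags), `arg_structure`, and `band` keys, and `build_inventory`, `MAX_TEMPLATE_LEN`, and the sample contents of `CANONICAL_ARG_STRUCTURES` are hypothetical stand-ins.

```python
from typing import Iterable

# Assumed stand-in; the real CANONICAL_ARG_STRUCTURES set lives in the codebase.
CANONICAL_ARG_STRUCTURES = {"nsubj,V,dobj"}
MAX_TEMPLATE_LEN = 8  # tunable cap N (8 or 10 are the proposed starting points)


def build_inventory(rows: Iterable[dict], max_len: int = MAX_TEMPLATE_LEN) -> list:
    """Filter and dedupe skeleton rows (hypothetical row schema)."""
    seen = set()
    kept = []
    for row in rows:
        template = tuple(row["pos_template"])
        if "PROPN" in template:
            continue  # PROPN at any position disqualifies the template
        if row["arg_structure"] not in CANONICAL_ARG_STRUCTURES:
            continue  # non-canonical argument structure
        if len(template) > max_len:
            continue  # over the length cap
        key = (row["arg_structure"], row["band"], template)
        if key in seen:
            continue  # keep only the first representative per unique template
        seen.add(key)
        kept.append(row)
    return kept
```

Because the filter runs once at cold-start, the per-request path only ever sees templates that are already PROPN-free and within the cap.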

No verb-filler co-occurrence prior. No PPMI sort. Every constraint-passing word is admissible in every POS-matching position. The reranker is the only quality gate.
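
A minimal sketch of pool binning and uniform sampling, assuming word rows are dicts with `word` and `pos` keys and that the hard filter is available as a Python predicate (in the real system it is `hard_filter_expr` applied to words.parquet). The names `build_pools`, `sample_candidates`, and `draws_per_template` are hypothetical. Function-word positions are left as POS tags here, and deduplication is done on the pick tuple as a stand-in for surface-dedup.

```python
import random
from collections import defaultdict

CONTENT_POS = ("NOUN", "VERB", "ADJ", "ADV")


def build_pools(words, passes_filter):
    """Apply the hard filter once, then bin surviving words by POS."""
    pools = defaultdict(list)
    for row in words:
        if row["pos"] in CONTENT_POS and passes_filter(row):
            pools[row["pos"]].append(row["word"])
    return dict(pools)


def sample_candidates(templates, pools, max_candidates=5000, seed=0,
                      draws_per_template=50):
    """Draw uniformly from the POS pools; no PPMI sort, no co-occurrence prior."""
    rng = random.Random(seed)  # seeded so a request is reproducible
    seen = set()
    out = []
    for template in templates:
        needed = [pos for pos in template if pos in CONTENT_POS]
        if any(not pools.get(pos) for pos in needed):
            continue  # an empty pool means this template cannot be filled
        for _ in range(draws_per_template):
            if len(out) >= max_candidates:
                return out
            picks = tuple(
                rng.choice(pools[pos]) if pos in CONTENT_POS else pos
                for pos in template
            )
            if picks not in seen:  # dedupe before handing to the reranker
                seen.add(picks)
                out.append(picks)
    return out
```

Every constraint-passing word is equally likely at every POS-matching position, which is the point of the design: quality judgment is deferred entirely to the reranker.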

2. Realizer: pos_template walker, full coverage

Replace the _LEAKY_CONTENT_POS guard with per-position substitution:

Position type	Resolution
Slot match (POS ∈ slot's admissible set, slot remaining in queue)	Solver filler (existing logic)
Non-slot NOUN	Random word from `pools["NOUN"]`
Non-slot ADJ	Random word from `pools["ADJ"]`
Non-slot ADV	Random word from `pools["ADV"]`
Non-slot NUM	"two" / "three" / "four" hardcoded (no NUM in v5.2 lexicon)
Non-slot INTJ	Drop position
Non-slot PROPN	Impossible — skeleton inventory filtered these out at load
Function POS (DET/ADP/AUX/PART/CCONJ/SCONJ/PRON)	Existing synthesis from `_render_function_pos` (the dead-code branches inside `_is_mass_noun` get moved back to their parent function as part of this work)
PUNCT / SPACE / SYM / X	Drop

No example string consulted. No content leakage from corpus parse. PROPN cannot appear in any output.

Realization is now a deterministic function of (pos_template, slots, fillers, non-slot-pool-pick-seed). Combined with the solver's seeded RNG, output is reproducible per request.
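
The substitution table above could be walked along these lines. This is a sketch, not the actual realizer: `realize` and `render_function_word` are hypothetical names, slots are assumed to arrive as a list of (admissible POS set, filler) pairs in template order, and the function-word stub stands in for the existing `_render_function_pos` synthesis. The table does not list a rule for non-slot VERB, so the sketch treats any unlisted POS as a render failure.

```python
import random
from collections import deque

FUNCTION_POS = {"DET", "ADP", "AUX", "PART", "CCONJ", "SCONJ", "PRON"}
DROP_POS = {"PUNCT", "SPACE", "SYM", "X", "INTJ"}
POOL_POS = {"NOUN", "ADJ", "ADV"}
NUM_WORDS = ("two", "three", "four")  # hardcoded: no NUM in the v5.2 lexicon


def render_function_word(pos):
    # Placeholder for the existing function-word synthesis (assumption).
    return {"DET": "the", "ADP": "in", "PRON": "it"}.get(pos, pos.lower())


def realize(pos_template, slots, pools, seed):
    """Return the rendered string, or None if a position cannot be resolved."""
    rng = random.Random(seed)  # the non-slot pool-pick seed
    queue = deque(slots)
    tokens = []
    for pos in pos_template:
        if pos == "PROPN":
            raise ValueError("PROPN should have been filtered at inventory load")
        if queue and pos in queue[0][0]:
            tokens.append(queue.popleft()[1])  # slot match: use solver filler
        elif pos in POOL_POS:
            if not pools.get(pos):
                return None  # empty pool: cannot fill this position
            tokens.append(rng.choice(pools[pos]))
        elif pos == "NUM":
            tokens.append(rng.choice(NUM_WORDS))
        elif pos in FUNCTION_POS:
            tokens.append(render_function_word(pos))
        elif pos in DROP_POS:
            continue  # position contributes no token
        else:
            return None  # POS not covered by the table
    return " ".join(tokens)
```

With the same template, slots, pools, and seed, the function returns the same string on every call, which is what makes per-request output reproducible.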

3. Function-word phoneme check

Function-POS synthesis ("the", "a/an", "in", "to", "that") must respect phoneme-level constraints. Implementation: when synthesizing a function word, check the synthesized form's CMU phonemes against hard_filter_expr. If the synthesized word violates an Exclude constraint, drop the template.

This matches the corpus path's behavior (already in place from PHON-117's earlier work today — corpus_sentences.parquet stores phonemes for all tokens including function words, and match_corpus enforces Exclude across all rows).
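
As a rough illustration of the check described above, assume an Exclude constraint reduces to a set of banned phonemes and that function-word pronunciations are available as ARPAbet lists. The tiny `FUNCTION_PHONEMES` table and both function names are hypothetical; in practice the phonemes would come from the CMU data, and treating an unknown word as a violation is an assumption of this sketch.

```python
# Tiny illustrative pronunciation table (ARPAbet, stress markers omitted).
FUNCTION_PHONEMES = {
    "the": ["DH", "AH"],
    "a": ["AH"],
    "an": ["AE", "N"],
    "in": ["IH", "N"],
    "to": ["T", "UW"],
    "that": ["DH", "AE", "T"],
}


def violates_exclude(word, excluded):
    """True if any phoneme of the synthesized word is in the excluded set."""
    phones = FUNCTION_PHONEMES.get(word)
    if phones is None:
        return True  # assumption: unknown pronunciation is treated as unsafe
    return any(p in excluded for p in phones)


def template_passes(function_words, excluded):
    """Keep the template only if every synthesized function word is clean."""
    return not any(violates_exclude(w, excluded) for w in function_words)
```

For example, a constraint excluding /DH/ would drop any template whose synthesis needs "the" or "that", which is the pool-starvation scenario listed under Risks.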

4. Reranker retrain

The PHON-116 head was trained on output from the broken realizer (DET-NOUN-VERB-DET-NOUN-heavy CSP nonce). Retrain is mandatory because:

The candidate distribution changes substantively (PRON-rich templates now render, ADJ/ADV scaffolding appears, function-word variety expands).
The previously-dominant "the X verbs a Y" pattern is no longer the overwhelming majority.

Retrain process (same harness as PHON-116):

Regenerate the CSP nonce pool with the new realizer (~5K sentences).
Re-sample curated corpus (~5K) and CoLA unacceptable (~2K) — these don't change.
Re-label via the teacher LLM (see Open Question 1 below).
Retrain bge-base + MLP head via existing train_head.py.
Verify Spearman lift on held-out 15%.
Drop new head into data/runtime/naturalness_scorer_head.pt.
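
The held-out verification step could look something like the following. It is a dependency-free sketch, not the existing harness: `split_held_out`, `average_ranks`, and `spearman` are hypothetical helpers, and the actual evaluation in train_head.py may differ. Average ranks are used for ties, since teacher labels can cluster on a single value.

```python
import random


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (vx * vy) ** 0.5


def split_held_out(items, frac=0.15, seed=0):
    """Shuffle with a fixed seed and return (train, held_out)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * frac))
    return shuffled[cut:], shuffled[:cut]
```

Typical use would be to score the held-out items with the retrained head and compare `spearman(head_scores, teacher_labels)` against the previous head on the same split.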

5. Decommissioning

Remove from hot path (keep on disk for now; v5.3 cleanup):

selectional.parquet reads in solver.solve().
_load_pairs_for_request's join with selectional (pairs.parquet stays for the contrastive path; selectional drops from the join).
subcat_profile / role_fillability consumers in WordStore.
filter_subcat_noise (no longer needed; nothing consumes selectional).
The example field in skeletons.parquet — no longer consulted.
_realize_template and _realize_legacy (replaced by the new unified path).

Open questions

OQ1: Teacher LLM for reranker retrain

gpt-4.1-mini's ceiling on PHON-116 was Spearman 0.69 with mode-collapse on negatives (28% of CSP labels at exactly 2.0). For a fluency/grammaticality task — different shape from PHON-73/82/83 concept ratings — a stronger model is likely needed.

Candidates:

Model	Approx cost / 12K labels	Notes
gpt-4.1 (full, not -mini)	~$10-15	10x mini cost, much sharper
Claude Sonnet 4	~$15-20	Strong English judgment; logprob availability should be confirmed before relying on it
GPT-4o	~$10-15	Comparable to gpt-4.1
Pilot two models, pick higher held-out Spearman	~$20-30	Pilots first, then committed

Recommendation: pilot 200 sentences each with gpt-4.1 and Claude Sonnet 4, pick the one with higher Spearman against author-rated gold (~30 sentences, ~10 minutes of author time). Commit to the winner for the full 12K relabel.
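
Since mode-collapse was the failure seen with gpt-4.1-mini, the pilot could also report how concentrated each teacher's labels are, alongside Spearman. A small sketch, where `modal_label_share` is a hypothetical helper and labels are assumed to be plain numeric scores:

```python
from collections import Counter


def modal_label_share(labels):
    """Return (most common label, fraction of labels equal to it)."""
    if not labels:
        raise ValueError("no labels to summarize")
    value, count = Counter(labels).most_common(1)[0]
    return value, count / len(labels)
```

A teacher whose pilot labels pile up on one value (as with the 28% at exactly 2.0 noted above) would show a high share here even if its Spearman against gold looks acceptable.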

OQ2: pos_template length cap

Start at N=8 or N=10 tokens. Empirical pass after the solver lands: generate 1000 sentences with each cap and eyeball the variety/quality trade-off. Commit to one number for v5.2 ship.

OQ3: Pre-gen catalog

If per-request latency exceeds ~10s warm (current PHON-116 latency baseline is ~7s warm), pre-gen is required for v5.2. Otherwise punt to v5.2.1 perf pass.

Storage if pre-gen lands in v5.2:
- D1: out (size limits).
- LFS: out (large blob).
- Local SQLite shipped separately: viable, ~hundreds of MB per common constraint matrix.
- R2 / Cloudflare KV: requires infrastructure work.

Decision deferred until latency benchmark.
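
One way the latency benchmark might be structured is sketched below. `warm_latency` and `pregen_required` are hypothetical names, the request is assumed to be callable as a zero-argument function, and using the median of a few warm runs is a choice of this sketch rather than something the spec prescribes.

```python
import statistics
import time

LATENCY_GATE_S = 10.0  # warm per-request gate


def warm_latency(request_fn, warmup=1, runs=5):
    """Median wall-clock seconds per call, after discarding warm-up calls."""
    for _ in range(warmup):
        request_fn()  # populate caches so the timed runs are warm
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        request_fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def pregen_required(median_s, gate_s=LATENCY_GATE_S):
    """Pre-gen is needed for v5.2 only if warm latency misses the gate."""
    return median_s >= gate_s
```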

Eval gates

Gate	Threshold	Notes
Realize success rate on random 500 templates	≥ 80%	vs current 2%
Pool template diversity (unique pos_templates in top-1000 sample)	≥ 50	vs current ~5
PROPN appearance in any output	0	Hard guarantee
Function-word constraint enforcement on synthetic	0 violations on test set	Matches corpus path
Reranker Spearman vs new teacher on held-out 15%	≥ 0.75 OR substantial lift over current 0.69 baseline	If 0.75 not achievable, document the actual ceiling
Per-request latency, warm, 5K candidates	< 10s	Pre-gen decision gate
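
The numeric gates in the table could be encoded as a simple checker, sketched here. The metric key names and `failed_gates` are hypothetical, and the reranker Spearman gate is left out because its "substantial lift" alternative has no numeric definition in this spec.

```python
# Thresholds copied from the gate table; key names are illustrative.
GATES = {
    "realize_success_rate": lambda v: v >= 0.80,
    "unique_pos_templates_top1000": lambda v: v >= 50,
    "propn_outputs": lambda v: v == 0,
    "function_word_violations": lambda v: v == 0,
    "warm_latency_s": lambda v: v < 10.0,
}


def failed_gates(metrics):
    """Return the names of gates that are missing or not met."""
    return [
        name
        for name, passes in GATES.items()
        if name not in metrics or not passes(metrics[name])
    ]
```

A missing metric counts as a failure, so a run cannot pass simply by not measuring something.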

Non-goals

Storing semantic compatibility data (selectional or otherwise). The reranker is the only semantic plausibility judge.
Determiner statistics per (verb, role, filler). Hardcoded "the / a/an by leading sound" synthesis is sufficient; corpus-derived DET stats are a v5.3+ refinement.
Multi-model ensemble teacher. Single teacher with strong agreement against author gold is the target.
Making the pre-gen catalog a v5.2 ship requirement. It becomes required only if latency fails the 10s gate.

Implementation plan (high-level)

Skeleton inventory at load: PROPN-strip + length-cap + pos_template dedupe. Verify counts.
Realizer rewrite: pool-substitution at non-slot content POS positions. Move the dead-code branches currently inside _is_mass_noun back into _render_function_pos. Unit tests for each POS path.
Solver rewrite: drop selectional, sample uniformly from POS pools.
Function-word constraint check at synthesis time.
Regen CSP nonce pool for reranker retrain.
Teacher pilot (OQ1 resolved here).
Re-label full 12K + retrain head.
End-to-end smoke: rerun the probe constraints used earlier in the PHON-117 iteration (exclude rhotic, ends-with /z/, minpair b-d) and verify the variety and quality gates.
Latency benchmark → pre-gen decision.
Update CLAUDE.md "Generation Runtime Data Contract" + decommission selectional from the documented hot path.
Update PR #96 description, un-draft.

Risks

Reranker doesn't lift beyond 0.69 with a new teacher. Mitigation: fall back to an ensemble teacher (which would reopen the single-teacher non-goal) or accept and document the actual ceiling.
Brute-force candidate set exceeds memory at max_candidates=5000. Mitigation: cap-before-realize sampling at the lexicon-pool stage, not just the join stage.
Pre-gen turns out to be required and isn't done in time for v5.2 ship. Mitigation: prioritize OQ3 evaluation immediately after the solver+realizer land.
Function-word constraint check rejects so many templates the pool starves. Mitigation: empirical eval; if needed, soften to "function words pass if not contraindicated by an explicit constraint."