PHON-117 — v5.2 realizer + solver rewrite
Status: SPEC (awaiting sign-off)
Owner: TBD
Filed: 2026-05-13
Branch: feature/phon-116-naturalness-scorer (work continues on the
existing PHON-116 branch since the reranker retrain is part of this scope)
Blocks: PR #96 (release/v5.2.0 → develop) — v5.2 cannot ship until
this lands.
Why this exists
v5.2 Governed Generation is producing user-visible nonsense in the
governed-generation UI:
- 98% of corpus-attested pos_templates fail to render. The realizer's `_LEAKY_CONTENT_POS = {NOUN, PROPN, ADJ, NUM, INTJ}` guard rejects any template with a content POS at a non-slot position. Random sample of 500 fineweb_adult nsubj,V,dobj templates: 490 fail, 10 succeed. Surviving templates skew DET-NOUN-VERB-DET-NOUN; SLP-visible output reads monotonous regardless of band or constraint.
- Selectional preferences (PPMI from FineWeb-Edu DEP parse) drive the candidate set. When the parse misattributes a verb-preposition pair ("build IN [location]" → pobj_in for "build"), the solver enumerates "the peasant builds in a brick" with high PPMI. The selectional layer is an intermediate heuristic that masquerades as a quality signal.
- The PHON-116 naturalness scorer was trained on output from the broken realizer. Its notion of "natural CSP" is calibrated to the DET-NOUN-VERB-DET-NOUN-heavy distribution, so it must be retrained on the new distribution.
- PROPN leaks through skeleton templates ("The Eiffel verbs the cake"). PROPN was supposed to be purged in v5.2 but still appears in pos_templates as scaffolding tokens.
- Function words slip past phoneme constraints. Resolved in PHON-117's earlier work today: corpus_sentences.parquet now stores CMU phonemes for all tokens (content + function + clitics). This rewrite preserves that.
Design (locked, modulo open Q)
1. Solver — drop selectional from the hot path
The solver no longer joins selectional.parquet. Candidate enumeration:
- Skeleton inventory (loaded once at server cold-start):
    - Read `skeletons.parquet`.
    - Filter out any pos_template containing `PROPN` at any position.
    - Filter to canonical `arg_structure` (existing `CANONICAL_ARG_STRUCTURES` set).
    - Cap pos_template length to ≤ N tokens (tunable; start N=8 or 10). Long templates produce wordy output with proportionally more chances for the realizer or the reranker to misfire. The cap is a quality lever.
    - Dedupe by pos_template (within an arg_structure × band), keep one representative per unique pos_template.
- Per-request constraint-filtered lexicon pools. Filter `words.parquet` once by `hard_filter_expr`. Bin survivors by POS into `{NOUN: [...], VERB: [...], ADJ: [...], ADV: [...]}`. The non-slot content positions in pos_template pull from the matching POS bin.
- Combinatoric sampling. For each (skeleton, ...) combination, draw uniformly at random from its POS pools. Cap aggregate candidate set at `max_candidates` (default 5000). Surface-dedup as the last step before reranking.
No verb-filler co-occurrence prior. No PPMI sort. Every constraint-passing word is admissible in every POS-matching position. The reranker is the only quality gate.
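The enumeration steps above can be sketched roughly as follows. This is a minimal illustration, not the actual solver: `Skeleton`, `build_inventory`, `bin_by_pos`, and `sample_candidates` are hypothetical names, the rows are plain Python objects rather than parquet reads, every content position is drawn from a pool (slot fillers are not modelled), and dedup is done inline on the filled token tuple as a stand-in for surface-dedup.

```python
import random
from dataclasses import dataclass

# POS bins named in the spec; everything else is left as a POS placeholder here.
CONTENT_POOL_POS = ("NOUN", "VERB", "ADJ", "ADV")


@dataclass(frozen=True)
class Skeleton:  # hypothetical stand-in for one skeletons.parquet row
    arg_structure: str
    band: str
    pos_template: tuple


def build_inventory(rows, canonical, max_len):
    """PROPN-strip, canonical filter, length cap, dedupe per (arg_structure, band)."""
    seen, kept = set(), []
    for sk in rows:
        if "PROPN" in sk.pos_template:
            continue
        if sk.arg_structure not in canonical:
            continue
        if len(sk.pos_template) > max_len:
            continue
        key = (sk.arg_structure, sk.band, sk.pos_template)
        if key not in seen:
            seen.add(key)
            kept.append(sk)
    return kept


def bin_by_pos(words):
    """words: (surface, pos) pairs assumed to have already passed hard_filter_expr."""
    pools = {pos: [] for pos in CONTENT_POOL_POS}
    for surface, pos in words:
        if pos in pools:
            pools[pos].append(surface)
    return pools


def sample_candidates(inventory, pools, max_candidates, seed):
    """Uniform draws from the POS pools, at most max_candidates attempts, seeded."""
    rng = random.Random(seed)
    seen, out = set(), []
    if not inventory:
        return out
    for _ in range(max_candidates):
        sk = rng.choice(inventory)
        if any(p in pools and not pools[p] for p in sk.pos_template):
            continue  # a required pool is empty under this constraint
        filled = tuple(
            rng.choice(pools[p]) if p in pools else p for p in sk.pos_template
        )
        if filled not in seen:  # inline dedup on the filled tuple
            seen.add(filled)
            out.append(filled)
    return out
```

Because the number of draws is bounded by `max_candidates`, the candidate set can never exceed the cap, and a fixed seed gives the same candidates for the same inventory and pools.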
2. Realizer — pos_template walker, full coverage
Replace _LEAKY_CONTENT_POS guard with substitution:
| Position type | Resolution |
|---|---|
| Slot match (POS ∈ slot's admissible set, slot remaining in queue) | Solver filler (existing logic) |
| Non-slot NOUN | Random word from pools["NOUN"] |
| Non-slot ADJ | Random word from pools["ADJ"] |
| Non-slot ADV | Random word from pools["ADV"] |
| Non-slot NUM | "two" / "three" / "four" hardcoded (no NUM in v5.2 lexicon) |
| Non-slot INTJ | Drop position |
| Non-slot PROPN | Impossible — skeleton inventory filtered these out at load |
| Function POS (DET/ADP/AUX/PART/CCONJ/SCONJ/PRON) | Existing synthesis from _render_function_pos (the dead-code branches inside _is_mass_noun get moved back to their parent function as part of this work) |
| PUNCT / SPACE / SYM / X | Drop |
No example string consulted. No content leakage from corpus parse. PROPN cannot appear in any output.
Realize is now a deterministic function of (pos_template, slots, fillers, non-slot-pool-pick-seed). Combined with the solver's seeded RNG, output is reproducible per request.
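A minimal sketch of the walker described by the table, under several assumptions: slots are matched against the head of the queue only, `render_function_pos` is a hypothetical one-argument stand-in for the existing `_render_function_pos`, and any POS the table does not list (for example a non-slot VERB) makes the template fail rather than guessing a resolution.

```python
import random
from collections import deque

FUNCTION_POS = {"DET", "ADP", "AUX", "PART", "CCONJ", "SCONJ", "PRON"}
DROP_POS = {"PUNCT", "SPACE", "SYM", "X", "INTJ"}
POOL_POS = {"NOUN", "ADJ", "ADV"}
NUM_WORDS = ("two", "three", "four")  # hardcoded per the table


def realize(pos_template, slots, fillers, pools, seed, render_function_pos):
    """Walk pos_template left to right and return a surface string, or None.

    slots:   list of (slot_name, admissible_pos_set) in template order
    fillers: dict slot_name -> word chosen by the solver
    pools:   dict POS -> list of constraint-passing words
    """
    rng = random.Random(seed)  # the non-slot pool-pick seed
    queue = deque(slots)
    out = []
    for pos in pos_template:
        if queue and pos in queue[0][1]:
            name, _ = queue.popleft()
            out.append(fillers[name])
        elif pos in POOL_POS:
            pool = pools.get(pos)
            if not pool:
                return None  # nothing admissible for this position
            out.append(rng.choice(pool))
        elif pos == "NUM":
            out.append(rng.choice(NUM_WORDS))
        elif pos in FUNCTION_POS:
            out.append(render_function_pos(pos))
        elif pos in DROP_POS:
            continue
        elif pos == "PROPN":
            raise ValueError("PROPN should have been filtered at inventory load")
        else:
            return None  # POS not covered by the table (assumption: fail)
    if queue:
        return None  # a slot never found its position
    return " ".join(out)
```

For example, with a stand-in `render_function_pos` that always returns "the", the template `DET ADJ NOUN VERB DET NOUN PUNCT` with fillers cat / sees / dog and a one-word ADJ pool containing "red" yields "the red cat sees the dog", and repeating the call with the same seed gives the same string.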
3. Function-word phoneme check
Function-POS synthesis ("the", "a/an", "in", "to", "that") must respect
phoneme-level constraints. Implementation: when synthesizing a function
word, check the synthesized form's CMU phonemes against hard_filter_expr.
If the synthesized word violates an Exclude constraint, drop the template.
This matches the corpus path's behavior (already in place from PHON-117's
earlier work today — corpus_sentences.parquet stores phonemes for all
tokens including function words, and match_corpus enforces Exclude
across all rows).
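The check can be sketched as below. This assumes an Exclude constraint can be reduced to a set of banned phonemes, which may be simpler than what `hard_filter_expr` actually encodes. The small lookup table is hand-written for illustration (stress markers stripped, one pronunciation per word) and is not read from `corpus_sentences.parquet`; `violates_exclude` and `template_passes` are hypothetical names.

```python
# Illustrative, stress-stripped CMU-style transcriptions for a few function words.
# Assumption: one pronunciation per word; real entries may list variants.
FUNCTION_WORD_PHONEMES = {
    "the": ("DH", "AH"),
    "a": ("AH",),
    "an": ("AE", "N"),
    "in": ("IH", "N"),
    "to": ("T", "UW"),
    "that": ("DH", "AE", "T"),
}


def violates_exclude(word, excluded, lexicon=FUNCTION_WORD_PHONEMES):
    """True if the synthesized word contains any excluded phoneme."""
    phones = lexicon.get(word)
    if phones is None:
        # Assumption: a word with no known phonemes is rejected conservatively.
        return True
    return any(p in excluded for p in phones)


def template_passes(function_words, excluded):
    """Drop the template as soon as one synthesized function word violates."""
    return not any(violates_exclude(w, excluded) for w in function_words)
```

With `excluded = {"DH"}`, a template that synthesizes "the" is dropped, while one that only synthesizes "a" and "in" survives.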
4. Reranker retrain
The PHON-116 head was trained on output from the broken realizer (DET-NOUN-VERB-DET-NOUN-heavy CSP nonce). Retrain is mandatory because:
- The candidate distribution changes substantively (PRON-rich templates now render, ADJ/ADV scaffolding appears, function-word variety expands).
- The previously-dominant "the X verbs a Y" pattern is no longer the overwhelming majority.
Retrain process (same harness as PHON-116):
- Regenerate the CSP nonce pool with the new realizer (~5K sentences).
- Re-sample curated corpus (~5K) and CoLA unacceptable (~2K) — these don't change.
- Re-label via the teacher LLM (see Open Question 1 below).
- Retrain bge-base + MLP head via existing `train_head.py`.
- Verify Spearman lift on held-out 15%.
- Drop new head into `data/runtime/naturalness_scorer_head.pt`.
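The held-out verification step could look something like the sketch below. It is a dependency-free illustration and is not taken from `train_head.py`; `held_out_split`, `average_ranks`, and `spearman` are hypothetical helper names. Ties receive average ranks, which matters when teacher labels cluster on a single value.

```python
import random


def held_out_split(items, frac=0.15, seed=0):
    """Shuffle deterministically and return (train, held_out)."""
    rng = random.Random(seed)
    idx = list(range(len(items)))
    rng.shuffle(idx)
    cut = int(round(len(items) * frac))
    held = [items[i] for i in idx[:cut]]
    train = [items[i] for i in idx[cut:]]
    return train, held


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Usage would be along the lines of `spearman(head_scores, teacher_labels)` computed on the held-out portion only, compared against the same statistic for the current head.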
5. Decommissioning
Remove from hot path (keep on disk for now; v5.3 cleanup):
- `selectional.parquet` reads in `solver.solve()`.
- `_load_pairs_for_request`'s join with selectional (pairs.parquet stays for the contrastive path; selectional drops from the join).
- `subcat_profile` / `role_fillability` consumers in WordStore.
- `filter_subcat_noise` (no longer needed; nothing consumes selectional).
- The `example` field in skeletons.parquet — no longer consulted.
- `_realize_template` and `_realize_legacy` (replaced by the new unified path).
Open questions
OQ1: Teacher LLM for reranker retrain
gpt-4.1-mini's ceiling on PHON-116 was Spearman 0.69 with mode-collapse on negatives (28% of CSP labels at exactly 2.0). For a fluency/grammaticality task — different shape from PHON-73/82/83 concept ratings — a stronger model is likely needed.
Candidates:
| Model | Approx cost / 12K labels | Notes |
|---|---|---|
| gpt-4.1 (full, not -mini) | ~$10-15 | 10x mini cost, much sharper |
| Claude Sonnet 4 | ~$15-20 | Strong English judgment; confirm logprob availability in the API before the pilot |
| GPT-4o | ~$10-15 | Comparable to gpt-4.1 |
| Pilot two models, pick higher held-out Spearman | ~$20-30 | Pilots first, then committed |
Recommendation: pilot 200 sentences each with gpt-4.1 and Claude Sonnet 4, pick the one with higher Spearman against author-rated gold (~30 sentences, ~10 minutes of author time). Commit to the winner for the full 12K relabel.
OQ2: pos_template length cap
Start at N=8 or N=10 tokens. Empirical pass after the solver lands: generate 1000 sentences with each cap and compare the variety/quality trade-off by inspection. Commit to one number for the v5.2 ship.
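Before the manual pass, a quick count of how many unique templates each cap admits can frame the comparison. A minimal sketch, where `surviving_templates` is a hypothetical helper and templates are tuples of POS tags:

```python
def surviving_templates(templates, caps=(8, 10)):
    """Count unique pos_templates whose length is within each candidate cap."""
    unique = set(templates)
    return {cap: sum(1 for t in unique if len(t) <= cap) for cap in caps}
```

This only measures inventory size per cap; output quality still needs the by-eye review described above.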
OQ3: Pre-gen catalog
If per-request latency exceeds ~10s warm (current PHON-116 latency baseline is ~7s warm), pre-gen is required for v5.2. Otherwise punt to v5.2.1 perf pass.
Storage if pre-gen lands in v5.2:
- D1: out (size limits).
- LFS: out (large blob).
- Local SQLite shipped separately: viable, ~hundreds of MB per common constraint matrix.
- R2 / Cloudflare KV: requires infrastructure work.
Decision deferred until latency benchmark.
Eval gates
| Gate | Threshold | Notes |
|---|---|---|
| Realize success rate on random 500 templates | ≥ 80% | vs current 2% |
| Pool template diversity (unique pos_templates in top-1000 sample) | ≥ 50 | vs current ~5 |
| PROPN appearance in any output | 0 | Hard guarantee |
| Function-word constraint enforcement on synthetic | 0 violations on test set | Matches corpus path |
| Reranker Spearman vs new teacher on held-out 15% | ≥ 0.75 OR substantial lift over current 0.69 baseline | If 0.75 not achievable, document the actual ceiling |
| Per-request latency, warm, 5K candidates | < 10s | Pre-gen decision gate |
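The numeric thresholds in the table could be encoded as a simple checker, sketched below. `Metrics` and `check_gates` are hypothetical names, and the metric values would have to come from the benchmark runs. The "substantial lift over 0.69" alternative is a judgment call and is deliberately not encoded, so a failing `reranker_spearman` entry means "document the actual ceiling" rather than an automatic block.

```python
from dataclasses import dataclass


@dataclass
class Metrics:  # hypothetical container for one evaluation run
    realize_success_rate: float    # fraction of 500 random templates that render
    unique_templates_top1000: int  # unique pos_templates in a top-1000 sample
    propn_outputs: int             # outputs containing any PROPN
    function_word_violations: int  # violations on the synthetic test set
    heldout_spearman: float        # vs new teacher on held-out 15%
    warm_latency_s: float          # per-request, warm, 5K candidates


def check_gates(m):
    """Return a dict of gate name -> pass/fail using the table's thresholds."""
    return {
        "realize_success": m.realize_success_rate >= 0.80,
        "template_diversity": m.unique_templates_top1000 >= 50,
        "no_propn": m.propn_outputs == 0,
        "function_word_enforcement": m.function_word_violations == 0,
        "reranker_spearman": m.heldout_spearman >= 0.75,
        "latency": m.warm_latency_s < 10.0,
    }
```

Running it on the current figures quoted in the table (2% realize success, about 5 unique templates, Spearman 0.69, about 7s warm) would pass only the latency gate among those four.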
Non-goals
- Storing semantic compatibility data (selectional or otherwise). The reranker is the only semantic plausibility judge.
- Determiner statistics per (verb, role, filler). Hardcoded "the / a/an by leading sound" synthesis is sufficient; corpus-derived DET stats are a v5.3+ refinement.
- Multi-model ensemble teacher. Single teacher with strong agreement against author gold is the target.
- Pre-gen catalog as required for v5.2 ship — only required if latency fails the 10s gate.
Implementation plan (high-level)
- Skeleton inventory at load: PROPN-strip + length-cap + pos_template dedupe. Verify counts.
- Realizer rewrite: pool-substitution at non-slot content POS positions. Move the dead `_render_function_pos` branches (currently inside `_is_mass_noun`) back into `_render_function_pos`. Unit tests for each POS path.
- Solver rewrite: drop selectional, sample uniformly from POS pools.
- Function-word constraint check at synthesis time.
- Regen CSP nonce pool for reranker retrain.
- Teacher pilot (OQ1 resolved here).
- Re-label full 12K + retrain head.
- End-to-end smoke: probe constraints from earlier in PHON-117 iteration (exclude rhotic, ends-with /z/, minpair b-d) — verify variety + quality gates.
- Latency benchmark → pre-gen decision.
- Update CLAUDE.md "Generation Runtime Data Contract" + decommission selectional from the documented hot path.
- Update PR #96 description, un-draft.
Risks
- Reranker doesn't lift beyond 0.69 with a new teacher. Mitigation: fall back to ensemble or accept the ceiling honestly.
- Brute-force candidate set exceeds memory at max_candidates=5000. Mitigation: cap-before-realize sampling at the lexicon-pool stage, not just the join stage.
- Pre-gen turns out to be required and isn't done in time for v5.2 ship. Mitigation: prioritize OQ3 evaluation immediately after the solver+realizer land.
- Function-word constraint check rejects so many templates the pool starves. Mitigation: empirical eval; if needed, soften to "function words pass if not contraindicated by an explicit constraint."