
PHON-117 — v5.2 realizer + solver rewrite

Status: SPEC (awaiting sign-off)
Owner: TBD
Filed: 2026-05-13
Branch: feature/phon-116-naturalness-scorer (work continues on the existing PHON-116 branch since the reranker retrain is part of this scope)
Blocks: PR #96 (release/v5.2.0 → develop); v5.2 cannot ship until this lands.

Why this exists

v5.2 Governed Generation is producing user-visible nonsense in the governed-generation UI:

  • 98% of corpus-attested pos_templates fail to render. The realizer's _LEAKY_CONTENT_POS = {NOUN, PROPN, ADJ, NUM, INTJ} guard rejects any template with a content POS at a non-slot position. Random sample of 500 fineweb_adult nsubj,V,dobj templates: 490 fail, 10 succeed. Surviving templates skew DET-NOUN-VERB-DET-NOUN; SLP-visible output reads monotonous regardless of band or constraint.
  • Selectional preferences (PPMI from FineWeb-Edu DEP parse) drive the candidate set. When the parse misattributes a verb-preposition pair ("build IN [location]" → pobj_in for "build"), the solver enumerates "the peasant builds in a brick" with high PPMI. The selectional layer is an intermediate heuristic that masquerades as a quality signal.
  • The PHON-116 naturalness scorer was trained on output from the broken realizer. Its notion of "natural CSP" is calibrated to the DET-NOUN-VERB-DET-NOUN-heavy distribution, so it must be retrained on the new distribution.
  • PROPN leaks through skeleton templates ("The Eiffel verbs the cake"). PROPN was supposed to be purged in v5.2 but still appears in pos_templates as scaffolding tokens.
  • Function words slip past phoneme constraints. Resolved in PHON-117's earlier work today: corpus_sentences.parquet now stores CMU phonemes for all tokens (content + function + clitics). This rewrite preserves that.

Design (locked, modulo open Q)

1. Solver — drop selectional from the hot path

The solver no longer joins selectional.parquet. Candidate enumeration:

  • Skeleton inventory (loaded once at server cold-start):
      • Read skeletons.parquet.
      • Filter out any pos_template containing PROPN at any position.
      • Filter to canonical arg_structure (existing CANONICAL_ARG_STRUCTURES set).
      • Cap pos_template length at N tokens (tunable; start with N=8 or N=10). Long templates produce wordy output and give the realizer and the reranker proportionally more chances to misfire. The cap is a quality lever.
      • Dedupe by pos_template within each arg_structure × band group, keeping one representative per unique pos_template.
  • Per-request constraint-filtered lexicon pools. Filter words.parquet once by hard_filter_expr. Bin survivors by POS into {NOUN: [...], VERB: [...], ADJ: [...], ADV: [...]}. The non-slot content positions in pos_template pull from the matching POS bin.
  • Combinatoric sampling. For each (skeleton, ...) combination, draw uniformly at random from its POS pools. Cap aggregate candidate set at max_candidates (default 5000). Surface-dedup as the last step before reranking.
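The skeleton-inventory steps in the list above can be sketched as a single pass over the rows. This is a minimal illustration, not the project's loader: it assumes the rows of skeletons.parquet are already in memory as dicts, that pos_template is a list of POS tags, and that each row carries arg_structure and band fields. The contents of CANONICAL_ARG_STRUCTURES shown here are a stand-in for the existing set.

```python
from typing import Iterable

# Stand-in for the existing CANONICAL_ARG_STRUCTURES set (contents assumed).
CANONICAL_ARG_STRUCTURES = {"nsubj,V,dobj"}
MAX_TEMPLATE_LEN = 8  # the tunable N from the spec (8 or 10)


def build_inventory(rows: Iterable[dict], max_len: int = MAX_TEMPLATE_LEN) -> list:
    """Apply the load-time filters and keep the first row seen for each
    unique pos_template within an (arg_structure, band) group."""
    seen = set()
    kept = []
    for row in rows:
        template = tuple(row["pos_template"])
        if "PROPN" in template:
            continue  # PROPN at any position disqualifies the template
        if row["arg_structure"] not in CANONICAL_ARG_STRUCTURES:
            continue
        if len(template) > max_len:
            continue
        key = (row["arg_structure"], row["band"], template)
        if key in seen:
            continue  # duplicate pos_template in this group
        seen.add(key)
        kept.append(row)
    return kept
```

Running this once at cold start and logging how many rows each filter removes would also cover the "verify counts" step in the implementation plan.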

No verb-filler co-occurrence prior. No PPMI sort. Every constraint-passing word is admissible in every POS-matching position. The reranker is the only quality gate.
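With no co-occurrence prior, enumeration reduces to binning constraint-passing words by POS and drawing uniformly with a seeded RNG. The sketch below is illustrative only. It assumes words are dicts with word and pos keys, treats every content-POS position as a pool draw (the real solver distinguishes slot fillers from non-slot positions), and dedupes on the (template, picks) pair rather than on the realized surface string.

```python
import random
from collections import defaultdict

CONTENT_POS = ("NOUN", "VERB", "ADJ", "ADV")


def bin_by_pos(words):
    """Group constraint-passing lexicon rows into per-POS pools."""
    pools = defaultdict(list)
    for w in words:
        if w["pos"] in CONTENT_POS:
            pools[w["pos"]].append(w["word"])
    return dict(pools)


def sample_candidates(templates, pools, max_candidates=5000, seed=0, max_draws=None):
    """Draw (template, picks) pairs uniformly at random until the cap or the
    draw budget is reached. The same seed always yields the same list."""
    rng = random.Random(seed)
    max_draws = max_draws or max_candidates * 4  # draw budget is an assumption
    seen, out = set(), []
    if not templates:
        return out
    for _ in range(max_draws):
        if len(out) >= max_candidates:
            break
        template = rng.choice(templates)
        needed = [pos for pos in template if pos in CONTENT_POS]
        if any(not pools.get(pos) for pos in needed):
            continue  # a required pool is empty under this constraint
        picks = tuple(rng.choice(pools[pos]) for pos in needed)
        key = (tuple(template), picks)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out
```

Bounding the number of draws keeps the loop finite when the constraint leaves fewer unique combinations than max_candidates.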

2. Realizer — pos_template walker, full coverage

Replace the _LEAKY_CONTENT_POS guard with per-position substitution:

| Position type | Resolution |
| --- | --- |
| Slot match (POS ∈ slot's admissible set, slot remaining in queue) | Solver filler (existing logic) |
| Non-slot NOUN | Random word from pools["NOUN"] |
| Non-slot ADJ | Random word from pools["ADJ"] |
| Non-slot ADV | Random word from pools["ADV"] |
| Non-slot NUM | "two" / "three" / "four" hardcoded (no NUM in v5.2 lexicon) |
| Non-slot INTJ | Drop position |
| Non-slot PROPN | Impossible: skeleton inventory filtered these out at load |
| Function POS (DET/ADP/AUX/PART/CCONJ/SCONJ/PRON) | Existing synthesis from _render_function_pos (the dead-code branches inside _is_mass_noun get moved back to their parent function as part of this work) |
| PUNCT / SPACE / SYM / X | Drop |

No example string consulted. No content leakage from corpus parse. PROPN cannot appear in any output.

Realize is now a deterministic function of (pos_template, slots, fillers, non-slot-pool-pick-seed). Combined with the solver's seeded RNG, output is reproducible per request.
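A minimal sketch of the walker, to make the resolution rules and the seed dependence concrete. Several details are assumptions: slots are modelled as an ordered queue of (role, admissible POS set) pairs matched at the head of the queue, the function-word renderer is passed in as a callable because the real _render_function_pos signature is not given here, and POS tags the resolution rules do not list (for example a non-slot VERB) raise rather than guess.

```python
import random

POOL_POS = {"NOUN", "ADJ", "ADV"}
NUM_WORDS = ("two", "three", "four")
DROP_POS = {"INTJ", "PUNCT", "SPACE", "SYM", "X"}
FUNCTION_POS = {"DET", "ADP", "AUX", "PART", "CCONJ", "SCONJ", "PRON"}


def realize(pos_template, slots, fillers, pools, seed, render_function_pos):
    """Walk pos_template left to right and resolve each position.

    slots:   ordered list of (role, admissible_pos_set), assumed shape
    fillers: dict mapping role -> word chosen by the solver
    seed:    seeds the non-slot pool picks, so output is reproducible
    """
    rng = random.Random(seed)
    queue = list(slots)
    tokens = []
    for pos in pos_template:
        if queue and pos in queue[0][1]:
            role, _ = queue.pop(0)
            tokens.append(fillers[role])
        elif pos in POOL_POS:
            tokens.append(rng.choice(pools[pos]))
        elif pos == "NUM":
            tokens.append(rng.choice(NUM_WORDS))
        elif pos in DROP_POS:
            continue
        elif pos in FUNCTION_POS:
            tokens.append(render_function_pos(pos))
        elif pos == "PROPN":
            raise ValueError("PROPN should have been filtered at inventory load")
        else:
            raise ValueError(f"unhandled POS: {pos}")
    return " ".join(tokens)


# Example with a stub renderer that always returns "the".
sentence = realize(
    ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN", "PUNCT"],
    slots=[("nsubj", {"NOUN", "PRON"}), ("V", {"VERB"}), ("dobj", {"NOUN"})],
    fillers={"nsubj": "cat", "V": "sees", "dobj": "ball"},
    pools={"ADJ": ["small"]},
    seed=0,
    render_function_pos=lambda pos: "the",
)
```

Because the only randomness comes from the RNG built from seed, calling realize twice with the same arguments returns the same string.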

3. Function-word phoneme check

Function-POS synthesis ("the", "a/an", "in", "to", "that") must respect phoneme-level constraints. Implementation: when synthesizing a function word, check the synthesized form's CMU phonemes against hard_filter_expr. If the synthesized word violates an Exclude constraint, drop the template.

This matches the corpus path's behavior (already in place from PHON-117's earlier work today — corpus_sentences.parquet stores phonemes for all tokens including function words, and match_corpus enforces Exclude across all rows).
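As a rough illustration of the check, the sketch below models an Exclude constraint as a set of ARPAbet phonemes and rejects a template as soon as any synthesized function word contains one. The pronunciation table is a hand-written miniature with stress digits stripped, not the project's phoneme store, and the set-based constraint is an assumption about what hard_filter_expr expresses.

```python
# Miniature pronunciation table (ARPAbet, stress digits stripped).
# Hand-written for illustration; the real data lives in the project's parquet files.
FUNCTION_PHONEMES = {
    "the": ["DH", "AH"],
    "a": ["AH"],
    "an": ["AE", "N"],
    "in": ["IH", "N"],
    "to": ["T", "UW"],
    "that": ["DH", "AE", "T"],
}


def violates_exclude(word, excluded):
    """True if the word contains any excluded phoneme.
    Raises KeyError for words missing from the table."""
    return any(p in excluded for p in FUNCTION_PHONEMES[word])


def template_survives(function_words, excluded):
    """A template is dropped if any synthesized function word violates Exclude."""
    return not any(violates_exclude(w, excluded) for w in function_words)


# Excluding /DH/ rules out "the" and "that" but not "a" or "in".
ok = template_survives(["a", "in"], excluded={"DH"})
dropped = not template_survives(["the", "in"], excluded={"DH"})
```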

4. Reranker retrain

The PHON-116 head was trained on output from the broken realizer (DET-NOUN-VERB-DET-NOUN-heavy CSP nonce). Retrain is mandatory because:

  • The candidate distribution changes substantively (PRON-rich templates now render, ADJ/ADV scaffolding appears, function-word variety expands).
  • The previously-dominant "the X verbs a Y" pattern is no longer the overwhelming majority.

Retrain process (same harness as PHON-116):

  1. Regenerate the CSP nonce pool with the new realizer (~5K sentences).
  2. Re-sample curated corpus (~5K) and CoLA unacceptable (~2K) — these don't change.
  3. Re-label via the teacher LLM (see Open Question 1 below).
  4. Retrain bge-base + MLP head via existing train_head.py.
  5. Verify Spearman lift on held-out 15%.
  6. Drop new head into data/runtime/naturalness_scorer_head.pt.
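Step 5 can be sketched without external dependencies. The code below is not the PHON-116 harness: it is a plain-Python Spearman correlation (Pearson on average ranks, which handles tied labels) plus a seeded 15% held-out split. The function names and the split procedure are assumptions.

```python
import random


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def held_out_indices(n, fraction=0.15, seed=0):
    """Seeded selection of the held-out rows, so the split is reproducible."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return sorted(idx[: max(1, round(n * fraction))])
```

In use, the head's scores and the teacher labels at held_out_indices(len(dataset)) would be passed to spearman.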

5. Decommissioning

Remove from hot path (keep on disk for now; v5.3 cleanup):

  • selectional.parquet reads in solver.solve().
  • _load_pairs_for_request's join with selectional (pairs.parquet stays for the contrastive path; selectional drops from the join).
  • subcat_profile / role_fillability consumers in WordStore.
  • filter_subcat_noise (no longer needed; nothing consumes selectional).
  • The example field in skeletons.parquet — no longer consulted.
  • _realize_template and _realize_legacy (replaced by the new unified path).

Open questions

OQ1: Teacher LLM for reranker retrain

gpt-4.1-mini's ceiling on PHON-116 was Spearman 0.69 with mode-collapse on negatives (28% of CSP labels at exactly 2.0). For a fluency/grammaticality task — different shape from PHON-73/82/83 concept ratings — a stronger model is likely needed.
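The mode-collapse figure quoted above is the share of labels sitting on the single most common value. A small helper like the following (the name is hypothetical) would let the pilot report the same statistic for each candidate teacher.

```python
from collections import Counter


def modal_share(labels):
    """Return (most common label, fraction of labels equal to it)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    value, count = Counter(labels).most_common(1)[0]
    return value, count / len(labels)


# Synthetic labels: 28 of 100 sit at exactly 2.0.
labels = [2.0] * 28 + [1.0] * 20 + [3.0] * 20 + [4.0] * 20 + [5.0] * 12
value, share = modal_share(labels)
```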

Candidates:

| Model | Approx. cost / 12K labels | Notes |
| --- | --- | --- |
| gpt-4.1 (full, not -mini) | ~$10-15 | 10x mini cost, much sharper |
| Claude Sonnet 4 | ~$15-20 | Strong English judgment; logprob API supported |
| GPT-4o | ~$10-15 | Comparable to gpt-4.1 |
| Pilot two models, pick higher held-out Spearman | ~$20-30 | Pilots first, then committed |

Recommendation: pilot 200 sentences each with gpt-4.1 and Claude Sonnet 4, pick the one with higher Spearman against author-rated gold (~30 sentences, ~10 minutes of author time). Commit to the winner for the full 12K relabel.

OQ2: pos_template length cap

Start at N=8 or N=10 tokens. Empirical pass after the solver lands: generate 1000 sentences with each cap and compare the variety/quality trade-off by inspection. Commit to one number for v5.2 ship.

OQ3: Pre-gen catalog

If per-request latency exceeds ~10s warm (current PHON-116 latency baseline is ~7s warm), pre-gen is required for v5.2. Otherwise punt to v5.2.1 perf pass.

Storage if pre-gen lands in v5.2:

  • D1: out (size limits).
  • LFS: out (large blob).
  • Local SQLite shipped separately: viable, ~hundreds of MB per common constraint matrix.
  • R2 / Cloudflare KV: requires infrastructure work.

Decision deferred until latency benchmark.
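One possible shape for the latency benchmark that drives this decision is sketched below. It assumes "warm" means at least one untimed call before measuring, and it reports the median of several timed runs. Both choices are assumptions, as is the callable standing in for a full request.

```python
import statistics
import time


def warm_latency(fn, warmup=1, runs=5):
    """Median wall-clock seconds per call after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def pregen_required(latency_s, gate_s=10.0):
    """Pre-gen is needed for v5.2 when warm latency is not under the gate."""
    return latency_s >= gate_s
```

For example, pregen_required(7.0) is False for the current ~7s warm baseline, while a 10.5s measurement would trigger pre-gen.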

Eval gates

| Gate | Threshold | Notes |
| --- | --- | --- |
| Realize success rate on random 500 templates | ≥ 80% | vs current 2% |
| Pool template diversity (unique pos_templates in top-1000 sample) | ≥ 50 | vs current ~5 |
| PROPN appearance in any output | 0 | Hard guarantee |
| Function-word constraint enforcement on synthetic | 0 violations on test set | Matches corpus path |
| Reranker Spearman vs new teacher on held-out 15% | ≥ 0.75 OR substantial lift over current 0.69 baseline | If 0.75 not achievable, document the actual ceiling |
| Per-request latency, warm, 5K candidates | < 10s | Pre-gen decision gate |
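The numeric gates above lend themselves to an automated check. The sketch below covers only the gates with a crisp threshold. The Spearman gate is left out because "substantial lift" is not quantified here. Metric key names are hypothetical.

```python
# One predicate per gate; thresholds copied from the eval gates above.
GATES = {
    "realize_success_rate": lambda v: v >= 0.80,
    "unique_pos_templates": lambda v: v >= 50,
    "propn_outputs": lambda v: v == 0,
    "function_word_violations": lambda v: v == 0,
    "warm_latency_s": lambda v: v < 10.0,
}


def check_gates(metrics):
    """Return {gate name: passed?} for every gate; missing metrics raise KeyError."""
    return {name: check(metrics[name]) for name, check in GATES.items()}


# The "current" figures quoted in the gates (2% success, ~5 templates, ~7s warm).
current = {
    "realize_success_rate": 0.02,
    "unique_pos_templates": 5,
    "propn_outputs": 0,
    "function_word_violations": 0,
    "warm_latency_s": 7.0,
}
report = check_gates(current)
```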

Non-goals

  • Storing semantic compatibility data (selectional or otherwise). The reranker is the only semantic plausibility judge.
  • Determiner statistics per (verb, role, filler). Hardcoded "the / a/an by leading sound" synthesis is sufficient; corpus-derived DET stats are a v5.3+ refinement.
  • Multi-model ensemble teacher. Single teacher with strong agreement against author gold is the target.
  • Pre-gen catalog as required for v5.2 ship — only required if latency fails the 10s gate.
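The hardcoded "a/an by leading sound" rule from the determiner bullet above can be sketched with the ARPAbet vowel set. This assumes the following word's CMU-style phonemes are available (with or without stress digits). The function name is hypothetical.

```python
# The 15 ARPAbet vowel symbols used by the CMU Pronouncing Dictionary.
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def indefinite_article(next_word_phonemes):
    """Choose "an" before a vowel sound and "a" otherwise."""
    first = next_word_phonemes[0].rstrip("012")  # drop the stress digit, if any
    return "an" if first in ARPABET_VOWELS else "a"
```

Keying on sound rather than spelling gives "an hour" (AW1 ER0) and "a unicorn" (Y UW1 ...), which a letter-based rule would get wrong.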

Implementation plan (high-level)

  1. Skeleton inventory at load: PROPN-strip + length-cap + pos_template dedupe. Verify counts.
  2. Realizer rewrite: pool-substitution at non-slot content POS positions. Move the dead-code branches currently inside _is_mass_noun back into _render_function_pos. Unit tests for each POS path.
  3. Solver rewrite: drop selectional, sample uniformly from POS pools.
  4. Function-word constraint check at synthesis time.
  5. Regen CSP nonce pool for reranker retrain.
  6. Teacher pilot (OQ1 resolved here).
  7. Re-label full 12K + retrain head.
  8. End-to-end smoke: rerun the probe constraints from earlier in the PHON-117 iteration (exclude rhotic, ends-with /z/, minpair b-d) and verify the variety and quality gates.
  9. Latency benchmark → pre-gen decision.
  10. Update CLAUDE.md "Generation Runtime Data Contract" + decommission selectional from the documented hot path.
  11. Update PR #96 description, un-draft.

Risks

  • Reranker doesn't lift beyond 0.69 with a new teacher. Mitigation: fall back to an ensemble teacher (currently listed under Non-goals) or accept the ceiling and document it.
  • Brute-force candidate set exceeds memory at max_candidates=5000. Mitigation: cap-before-realize sampling at the lexicon-pool stage, not just the join stage.
  • Pre-gen turns out to be required and isn't done in time for v5.2 ship. Mitigation: prioritize OQ3 evaluation immediately after the solver+realizer land.
  • Function-word constraint check rejects so many templates the pool starves. Mitigation: empirical eval; if needed, soften to "function words pass if not contraindicated by an explicit constraint."