PHON-116: Naturalness scorer (real reranking model, not bi-encoder cosine)

Status: SPEC
Owner: TBD
Filed: 2026-05-13
Parent workstream: PHON-56 (Generation Quality Overhaul)
Branch (when implementation greenlit): feature/phon-116-naturalness-scorer off release/v5.2.0

TL;DR

Ship a real trained naturalness scorer to replace the bi-encoder Qwen-cosine path that PHON-110 substituted for PHON-107's 4-axis LightGBM. Single axis: naturalness. Teacher: gpt-4.1-mini (cloze-rating family parity with PHON-73/82/83/115). Training data: ~12K labeled sentences sourced naturally (~13K if the optional ~1K supplement is used): CSP-generated nonce (~5K) + curated-corpus stratified sample (~5K) + CoLA unacceptable (~2K). Architecture: encoder + regression head; pick the encoder via bake-off.

This is v5.2 scope because PHON-110 (2026-05-10) ripped out the 4-axis LightGBM stack and shipped bi-encoder cosine under the name "reranker" without flagging the substitution. What ships in packages/generators/src/phonolex_generators/csp/reranker/predict.py today computes cosine distance to a 23K-sentence reference matrix — not a reranker by any standard sense (a reranker scores (query, candidate) pairs; this scores absolute candidate quality without a query, against a fixed reference distribution).

User direction 2026-05-13: that's shipping the wrong feature. Build the real thing for v5.2.

Why v5.2, not v5.3

project_reranker_simplification.md originally scoped this work as v5.3+ with explicit "don't rip out during release" guidance. PHON-110 ripped it out anyway, against that guidance, and didn't surface the substitution. Two consequences:

1. v5.2 currently ships scoring that is fundamentally different from what users, docs, and memory describe as a "reranker". The feature is wrong both in label and in substance.
2. The bi-encoder cosine path is also a weaker signal than a trained scorer. Cosine-to-corpus measures "is this sentence shaped like sentences in the reference corpus?", which conflates surface register with grammaticality and disproportionately rewards verbose or formal phrasings. A trained scorer learns the per-sentence naturalness judgment directly.

Both reasons justify pulling the work forward.

Goals

Drop-in replacement for _cached_model + cosine pathway in csp/reranker/predict.py. Same call shape (predict_axes_batch(candidates) -> [{naturalness: float, ...}]); single axis populated.
Eval gate: Spearman ≥ 0.75 vs gpt-4.1-mini on a held-out 15% split of the labeled data.
Latency budget: ≤ 100 ms per candidate, batched, on the Cloudflare Container's 1 vCPU (standard-2 tier). Today's Qwen3-Embedding-0.6B cosine runs at ~1.4 ms / candidate batched; that is the baseline, and any regression from it should be measured and deliberate, not silent.
Image size budget: model weights ≤ 200 MB on disk. Today's Qwen path is ~1.2 GB; that frees ~1 GB of image size for headroom on future container upgrades.

Non-goals

No multi-axis breakdown. PHON-107's 4-axis (naturalness / grammaticality / age_appropriate / coherence) is explicitly retired. Single scalar is what callers actually consume and what the UI surfaces; the others were either intrinsic to the realizer (grammaticality) or redundant in practice (coherence ≈ naturalness on single-sentence inputs; age_appropriate is poorly defined on CSP nonce and easier to enforce via constraints).
No cross-encoder. The task is absolute scoring (one input → one score), not (query, candidate) pair ranking. Cross-encoder pattern doesn't fit.
No PLL / pseudo-log-likelihood baseline. PHON-100 hit the canonical-degenerate failure mode ("the the the the the" outranks "the cat chased the ball" because "the" is over-predictable in masked context). Hard skip.
No manufactured surface-chaos negatives (shuffle, random substitution, truncation). CSP nonce is the natural negative generator — diverse failure modes that reflect production. Manufactured chaos teaches a useless distinction (well-formed vs gibberish) that doesn't reflect what the scorer needs to distinguish at runtime.
No retention of bi-encoder cosine as fallback. Single code path post-cutover.

Design

Architecture

candidate sentence (str)
    │
    ├─ encoder (frozen): sentence → vector
    │     candidates: bge-small-en-v1.5 (33M, 2023)
    │                 snowflake-arctic-embed-s (33M, 2024)
    │                 potion-retrieval-32M (Model2Vec distilled static)
    │                 one ~100M+ ceiling option (TBD by bake-off)
    │
    ├─ regression head (trained)
    │     Linear(d_embed → 128) → ReLU → Dropout(0.1) → Linear(128 → 1)
    │     Output: predicted naturalness in [1, 5]
    │
    └─ scalar score → reranker sort key

The encoder is frozen during training (only the head is learned). This keeps training cheap and image size predictable, and gives a clean apples-to-apples comparison across encoder candidates.
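A minimal PyTorch sketch of the regression head in the diagram above, assuming a 384-dimensional embedding (the output width of bge-small-en-v1.5). The class name `NaturalnessHead` is hypothetical, and the clamp to [1, 5] is an assumption about how the unbounded linear output would be mapped onto the rating scale.

```python
import torch
from torch import nn


class NaturalnessHead(nn.Module):
    """Regression head over a frozen sentence embedding (hypothetical name)."""

    def __init__(self, d_embed: int, hidden: int = 128, dropout: float = 0.1):
        super().__init__()
        # Linear(d_embed -> 128) -> ReLU -> Dropout(0.1) -> Linear(128 -> 1)
        self.net = nn.Sequential(
            nn.Linear(d_embed, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, d_embed) -> (batch,) raw scores; unbounded until clamped.
        return self.net(embeddings).squeeze(-1)


head = NaturalnessHead(d_embed=384)
head.eval()  # disables dropout for inference
with torch.no_grad():
    fake_embeddings = torch.randn(4, 384)  # stand-in for frozen encoder output
    scores = head(fake_embeddings).clamp(1.0, 5.0)  # assumed mapping to [1, 5]

# Only the head's parameters would be trained; the encoder stays frozen.
n_trainable = sum(p.numel() for p in head.parameters())
```

Because the encoder is frozen, embeddings can be computed once per encoder candidate and cached, so each bake-off run only trains these head parameters.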

Encoder bake-off (Phase 1)

Train one regression head per encoder candidate on the same training data + same loss + same hyperparameters. Hold out 15% for eval. Decision criteria, in priority order:

1. Spearman vs teacher labels on held-out (must clear 0.75)
2. Image MB (encoder weights on disk)
3. Inference latency (ms per candidate, batched n=64, on standard-2 vCPU)
4. Tie-break: prefer smaller model if Spearman within 0.02
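The decision criteria above can be read as a small selection routine. This is one possible reading, not a settled rule: candidates must clear the Spearman gate, anything within 0.02 of the best Spearman counts as tied, and ties go to the smaller image, then the lower latency. `BakeoffResult` and `pick_winner` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

SPEARMAN_GATE = 0.75  # hard gate from the spec
TIE_MARGIN = 0.02     # "prefer smaller model if Spearman within 0.02"


@dataclass(frozen=True)
class BakeoffResult:
    name: str
    spearman: float    # vs teacher labels on the held-out split
    image_mb: float    # encoder weights on disk
    latency_ms: float  # per candidate, batched n=64


def pick_winner(results: list[BakeoffResult]) -> Optional[BakeoffResult]:
    """Return the winning encoder, or None if nothing clears the gate."""
    passing = [r for r in results if r.spearman >= SPEARMAN_GATE]
    if not passing:
        return None  # signal to try the ~100M ceiling option
    best = max(r.spearman for r in passing)
    # Small epsilon guards against float noise right at the 0.02 margin.
    tied = [r for r in passing if best - r.spearman <= TIE_MARGIN + 1e-9]
    # Among near-ties, prefer the smaller image, then the lower latency.
    return min(tied, key=lambda r: (r.image_mb, r.latency_ms))
```

A `None` result corresponds to the fallback path in the table below: train the ceiling encoder before re-scoping.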

Bake-off models:

| Candidate | Params | Released | Notes |
| --- | --- | --- | --- |
| `BAAI/bge-small-en-v1.5` | 33M | Sept 2023 | Battle-tested baseline; user has prior experience |
| `Snowflake/snowflake-arctic-embed-s` | 33M | Apr 2024 | Newer with stronger MTEB scores at small size |
| `minishlab/potion-retrieval-32M` | 32M (static) | 2024 | Model2Vec distillation; no transformer at inference (~100× faster) |
| TBD ceiling | ~100M | n/a | If small models all fail Spearman gate, fall back to one ~100M option (`bge-base-en-v1.5` or similar) before re-scoping |

Bake-off output: a decision memo + the winning encoder + saved head weights. Total bake-off compute: ~1-2 hours on user's MPS / CPU per encoder × 4 = ~4-8 hours.

Training data composition (~12K labeled, ~13K with the optional supplement)

| Source | Count | Why |
| --- | --- | --- |
| CSP-generated nonce | ~5,000 | The natural negative generator. Stratified across diverse constraint sets so the scorer sees the failure modes it'll encounter at runtime: weak selectional fillers, awkward agreement, ungrammatical determiner choices, semantically odd subject-verb-object triples. |
| Curated-corpus sample | ~5,000 | Stratified across the 6 sources (CoLA / UD-EWT / GUM / CHILDES / PhonBank / Tatoeba). Real-world "bad" surfaces appear naturally: Tatoeba awkward translations, CHILDES disfluencies, PhonBank fragments, GUM register variance. |
| CoLA unacceptable | ~2,000 | Warstadt 2019 linguistic-acceptability gold labels. Anchors the low end with published-research-grade negatives. |
| Optional supplement | ~1,000 | LLM-generated "make slightly worse" rewrites. Only if held-out shows the model can't distinguish near-misses. Don't manufacture this preemptively. |

No manufactured surface-chaos. The CSP solver enumerates broadly precisely because the reranker is the quality gate — diverse CSP output IS the negative distribution the scorer needs to learn.
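The stratified sampling named in the table could be sketched as below. It assumes equal allocation per stratum and rows represented as dicts with a `source` field; both are illustrative assumptions, since the spec does not state the allocation rule or the schema of corpus_sentences_index.parquet.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, n_total, seed=0):
    """Draw roughly n_total rows, split evenly across the values of `key`.

    Equal allocation is an assumption; proportional allocation would be an
    equally valid reading of "stratified".
    """
    rng = random.Random(seed)  # seeded so the labeled set is reproducible
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[key]].append(row)
    if not by_stratum:
        return []
    quota = n_total // len(by_stratum)
    sample = []
    for stratum in sorted(by_stratum):  # sorted for deterministic order
        pool = by_stratum[stratum]
        # A stratum smaller than its quota contributes everything it has.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```

The same helper would apply to the CSP nonce by stratifying on a constraint-set identifier instead of the corpus source.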

Teacher: gpt-4.1-mini

Cloze-rating prompt scores 1-5 naturalness per sentence. Single label, not multi-label. Prompt template parity with PHON-73/82/83/115 (uses the same logprob-expected-value extraction).
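The PHON-73/82/83/115 harness itself is not reproduced here. As a hedged illustration of what logprob-expected-value extraction typically means, the sketch below takes the log-probabilities returned for the rating token and computes a probability-weighted mean over the tokens "1" through "5". The function name and the renormalization over rating tokens are assumptions.

```python
import math

RATING_TOKENS = {"1", "2", "3", "4", "5"}


def expected_rating(top_logprobs: dict[str, float]) -> float:
    """Probability-weighted mean rating from per-token log-probabilities.

    `top_logprobs` maps candidate tokens at the rating position to their
    log-probabilities. Non-rating tokens are ignored and the remaining mass
    is renormalized (an assumption, not necessarily the harness's behavior).
    """
    probs: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token in RATING_TOKENS:
            rating = int(token)
            probs[rating] = probs.get(rating, 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0.0:
        raise ValueError("no rating tokens among the returned logprobs")
    return sum(rating * p for rating, p in probs.items()) / total


# 60% on "4", 30% on "5", 10% on "3" gives a soft label between 4 and 5.
label = expected_rating(
    {"4": math.log(0.6), "5": math.log(0.3), "3": math.log(0.1)}
)
```

The appeal of this style of label is that it is continuous, which suits a regression head better than a hard 1-5 integer.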

Cost estimate: ~13K sentences × ~200 input tokens × ~$0.15/1M = ~$0.40 input + ~$0.40 output. Total budget for teacher labeling: ≤ $2 (~2.5× headroom over the ~$0.80 estimate, for retries / re-labeling).

Loss

Huber regression on 1-5 anchors. Robust to outliers in teacher labels (~5% expected disagreement). Optional pairwise hinge term if direct ranking signal underperforms Huber alone on the held-out Spearman gate; don't ship if not needed.

loss = torch.nn.HuberLoss(delta=0.5)(predicted_score, teacher_label)
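To make the outlier robustness concrete, here is a dependency-free sketch of the per-example Huber loss with delta = 0.5, using the same piecewise definition as `torch.nn.HuberLoss`. Errors within delta are penalized quadratically; larger errors grow only linearly, so a handful of noisy teacher labels cannot dominate the gradient.

```python
def huber(error: float, delta: float = 0.5) -> float:
    """Per-example Huber loss for error = predicted_score - teacher_label."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error * error          # quadratic near zero
    return delta * (abs_err - 0.5 * delta)  # linear in the tails


small = huber(0.2)  # inside the quadratic zone: 0.5 * 0.2**2
large = huber(2.0)  # outside: 0.5 * (2.0 - 0.25); squared error would give 2.0
```

With delta = 0.5 on a 1-5 scale, any prediction more than half a rating point off is already in the linear regime.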

Deployment

1. Train head, save weights to data/runtime/naturalness_scorer.pt (LFS-tracked; expected ~1-2 MB for the regression head).
2. Drop-in replace csp/reranker/predict.py:
   - _cached_model() → loads encoder + head weights
   - predict_axes_batch(candidates) → head(encoder.encode(batch)), returning [{naturalness: float}]
3. Pre-bake corpus scores: keep for v5.2. The scorer runs at build time on corpus_sentences_index.parquet; the precomputed naturalness_score column already exists from the bi-encoder path and is simply regenerated by the new scorer. v5.3 can simplify to runtime-only once latency is empirically verified.
4. Retire naturalness_reference.npy + naturalness_reference_meta.jsonl from LFS once the new scorer ships. Keep on disk briefly (1 release cycle) for recovery.
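A rough sketch of the post-cutover call shape. To keep it runnable without model weights, the encoder and head are injected as callables; in the real predict.py they would presumably come from `_cached_model()`, so the extra parameters here are purely illustrative. The clamp to [1, 5] is also an assumption.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def predict_axes_batch(
    candidates: Sequence[str],
    encode: Callable[[Sequence[str]], Sequence[Vector]],
    head: Callable[[Sequence[Vector]], Sequence[float]],
) -> list[dict[str, float]]:
    """Score each candidate; only the naturalness axis is populated."""
    if not candidates:
        return []
    vectors = encode(candidates)  # frozen encoder: sentence -> vector
    raw_scores = head(vectors)    # trained head: vector -> scalar
    # Clamp to the 1-5 rating scale (assumed; the final Linear is unbounded).
    return [{"naturalness": min(5.0, max(1.0, float(s)))} for s in raw_scores]


# Toy stand-ins so the sketch runs without any model weights.
def toy_encode(batch: Sequence[str]) -> list[list[float]]:
    return [[float(len(s))] for s in batch]


def toy_head(vectors: Sequence[Vector]) -> list[float]:
    return [v[0] / 2.0 for v in vectors]


out = predict_axes_batch(["a", "abcdefgh", "abcdefghijklmnop"], toy_encode, toy_head)
```

Callers that sort on `naturalness` should need no changes, since the return shape matches the existing contract with a single axis filled in.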

Eval gates

| Gate | Threshold | Why |
| --- | --- | --- |
| Spearman vs teacher on held-out 15% | ≥ 0.75 | PHON-107 v2 hit 0.805 on naturalness with LightGBM + 4 axes; 0.75 is the floor for a single-axis encoder + regression head. |
| Latency, batched n=64, standard-2 vCPU | ≤ 100 ms / candidate | ~70× headroom over today's 1.4 ms (Qwen cosine). Anything slower risks user-perceived regression. |
| Image MB | ≤ 200 MB encoder weights | ~6× smaller than today's 1.2 GB Qwen. Frees container headroom. |
| Behavioral smoke (qualitative) | 20 hand-curated good/bad pairs all score correctly | The "the the the the" / "natural English" sanity check. Catches PLL-style degenerate-text failures. |
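The Spearman gate and the behavioral smoke check could be expressed as below. This is a dependency-free sketch (Spearman computed as Pearson correlation over average ranks, so ties are handled); in practice a library routine such as SciPy's `spearmanr` would likely be used instead. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(pred, teacher):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(pred), average_ranks(teacher)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def passes_spearman_gate(pred, teacher, threshold=0.75):
    return spearman(pred, teacher) >= threshold


def smoke_pairs_ok(score, pairs):
    """Every hand-curated (good, bad) pair must rank good above bad."""
    return all(score(good) > score(bad) for good, bad in pairs)
```

Keeping the smoke check as a strict all-pairs assertion means a single degenerate failure, such as a repeated-token string outranking a natural sentence, blocks the release.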

Risks

Encoder bake-off doesn't clear 0.75 with small models. Mitigation: fall back to ~100M ceiling encoder. If even that doesn't clear, the gate itself may be wrong — re-examine teacher label noise vs model capacity.
Teacher disagreement on CSP nonce. gpt-4.1-mini may rate CSP nonce inconsistently because the failure modes don't match its training distribution of natural-English-quality assessments. Mitigation: spot-check 100 random teacher labels manually before training; if disagreement > 20%, refine prompt.
Distribution shift between training-time CSP nonce and post-PHON-110-iteration nonce. The CSP solver may emit different surface distributions in 6 months. Mitigation: schedule periodic retraining (v5.3 ticket) when CSP solver or selectional.parquet change substantially.
Implementation timeline overruns v5.2 release pressure. Mitigation: spec includes hard scope boundaries (single axis, no cross-encoder, no manufactured negatives). If bake-off + training + eval doesn't fit, hold the bi-encoder cosine for the v5.2 release and file PHON-116 for v5.3 — DO NOT ship a half-trained scorer.

Implementation plan (sketch; to be detailed in plan doc once spec is signed off)

1. Generate training data (~3 hours)
   - 5K CSP nonce via existing phonolex_generators.csp.solver with stratified constraint sets
   - 5K curated-corpus stratified sample from corpus_sentences_index.parquet
   - 2K CoLA unacceptable subset (already in repo)
   - Label all via gpt-4.1-mini batch API (parity with PHON-115 harness)
2. Encoder bake-off (~4-8 hours)
   - Implement train_head.py (PyTorch, single-script, scriptable across encoder candidates)
   - Run for each candidate; collect metrics
   - Write decision memo
3. Train winner + eval (~1 hour)
4. Deployment swap (~2 hours)
   - Replace predict.py internals
   - Regenerate corpus index naturalness scores
   - Update CLAUDE.md "Naturalness reranker" section + Data Contract
5. Smoke + PR (~1 hour)

Total estimate: ~11-15 hours (sum of the step estimates above).

Linked precedents

FineWeb-Edu quality classifier (HuggingFace 2024): the cleanest precedent. A large LLM scores documents, and those scores are distilled into a small encoder-based classifier for runtime use. This work is the direct sentence-level analog.
CoLA (Warstadt 2019): sentence acceptability dataset, public, used as the linguistic anchor.
COMET (Rei 2020): MT quality estimation, regression-head-over-encoder at sentence-pair level (we adapt to single-sentence absolute scoring).
PHON-107: 4-axis LightGBM precedent. Same teacher family; same loss family; different architecture (LGB → encoder+head) and scope (4 axes → 1).

Open items pending implementation greenlight

Spec sign-off from user
Final encoder ceiling-option pick (bge-base-en-v1.5 vs other; decide during bake-off if small encoders fail the gate)
Ticket: PHON-116 to be filed in Jira after spec sign-off (parent: PHON-56)

No code changes will land until user explicitly greenlights implementation per feedback_scope_per_authorization.