PHON-116 — Naturalness scorer (real reranking model, not bi-encoder cosine)

Status: SPEC
Owner: TBD
Filed: 2026-05-13
Parent workstream: PHON-56 (Generation Quality Overhaul)
Branch (when implementation greenlit): feature/phon-116-naturalness-scorer off release/v5.2.0

TL;DR

Ship a real trained naturalness scorer to replace the bi-encoder Qwen-cosine path that PHON-110 substituted for PHON-107's 4-axis LightGBM. Single axis: naturalness. Teacher: gpt-4.1-mini (cloze-rating family parity with PHON-73/82/83/115). Training data: ~13K labeled sentences, of which ~12K are sourced naturally (CSP-generated nonce ~5K + curated-corpus stratified sample ~5K + CoLA unacceptable ~2K) and ~1K is an optional supplement. Architecture: encoder + regression head; pick the encoder via bake-off.

This is v5.2 scope because PHON-110 (2026-05-10) ripped out the 4-axis LightGBM stack and shipped bi-encoder cosine under the name "reranker" without flagging the substitution. What ships in packages/generators/src/phonolex_generators/csp/reranker/predict.py today computes cosine distance to a 23K-sentence reference matrix — not a reranker by any standard sense (a reranker scores (query, candidate) pairs; this scores absolute candidate quality without a query, against a fixed reference distribution).

User direction 2026-05-13: that's shipping the wrong feature. Build the real thing for v5.2.

Why v5.2, not v5.3

project_reranker_simplification.md originally scoped this work as v5.3+ with explicit "don't rip out during release" guidance. PHON-110 ripped it out anyway, against that guidance, and didn't surface the substitution. Two consequences:

  1. v5.2 currently ships scoring that's fundamentally different from what users / docs / memory describe ("reranker"). That's the wrong feature, in label and in substance.
  2. The bi-encoder cosine path is also a weaker signal than a trained scorer. Cosine-to-corpus measures "is this sentence shaped like sentences in the reference corpus?", which conflates surface register with grammaticality and disproportionately rewards verbose / formal phrasings. A trained scorer learns the per-sentence naturalness judgment directly.

Both reasons justify pulling the work forward.

Goals

  • Drop-in replacement for _cached_model + cosine pathway in csp/reranker/predict.py. Same call shape (predict_axes_batch(candidates) -> [{naturalness: float, ...}]); single axis populated.
  • Eval gate: Spearman ≥ 0.75 vs gpt-4.1-mini on a held-out 15% split of the labeled data.
  • Latency budget: ≤ 100 ms per candidate, batched, on the Cloudflare Container's 1 vCPU (standard-2 tier). Today's Qwen3-Embedding-0.6B cosine is ~1.4 ms / candidate batched; that's the baseline, and any regression from it should be measured and deliberate, not silent.
  • Image size budget: model weights ≤ 200 MB on disk. Today's Qwen path is ~1.2 GB; that frees ~1 GB of image size for headroom on future container upgrades.
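
A minimal sketch of how the latency goal above could be checked, assuming the scorer is exposed as a function that takes a batch of strings. The names ms_per_candidate and dummy_scorer are hypothetical, and the dummy scorer is only a stand-in for the real model; the numbers it produces say nothing about actual scorer latency.

```python
import time
from typing import Callable, Sequence

LATENCY_BUDGET_MS = 100.0  # per-candidate budget from the Goals list


def ms_per_candidate(
    score_batch: Callable[[Sequence[str]], Sequence[float]],
    candidates: Sequence[str],
    batch_size: int = 64,
    repeats: int = 5,
) -> float:
    """Return best-of-`repeats` wall-clock milliseconds per candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Feed the scorer fixed-size batches, as production would.
        for i in range(0, len(candidates), batch_size):
            score_batch(candidates[i : i + batch_size])
        best = min(best, time.perf_counter() - start)
    return best * 1000.0 / len(candidates)


def dummy_scorer(batch: Sequence[str]) -> list[float]:
    # Hypothetical stand-in: returns a constant score for every sentence.
    return [3.0 for _ in batch]


sentences = ["the cat chased the ball"] * 256
measured_ms = ms_per_candidate(dummy_scorer, sentences)
within_budget = measured_ms <= LATENCY_BUDGET_MS
```

Taking the best of several repeats reduces noise from a shared 1 vCPU; a real gate run would happen on the standard-2 container itself, not a developer machine.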

Non-goals

  • No multi-axis breakdown. PHON-107's 4-axis (naturalness / grammaticality / age_appropriate / coherence) is explicitly retired. Single scalar is what callers actually consume and what the UI surfaces; the others were either intrinsic to the realizer (grammaticality) or redundant in practice (coherence ≈ naturalness on single-sentence inputs; age_appropriate is poorly defined on CSP nonce and easier to enforce via constraints).
  • No cross-encoder. The task is absolute scoring (one input → one score), not (query, candidate) pair ranking. Cross-encoder pattern doesn't fit.
  • No PLL / pseudo-log-likelihood baseline. PHON-100 hit the canonical-degenerate failure mode ("the the the the the" outranks "the cat chased the ball" because "the" is over-predictable in masked context). Hard skip.
  • No manufactured surface-chaos negatives (shuffle, random substitution, truncation). CSP nonce is the natural negative generator — diverse failure modes that reflect production. Manufactured chaos teaches a useless distinction (well-formed vs gibberish) that doesn't reflect what the scorer needs to distinguish at runtime.
  • No retention of bi-encoder cosine as fallback. Single code path post-cutover.

Design

Architecture

candidate sentence (str)
    │
    ├─ encoder (frozen): sentence → vector
    │     candidates: bge-small-en-v1.5 (33M, 2023)
    │                 snowflake-arctic-embed-s (33M, 2024)
    │                 potion-retrieval-32M (Model2Vec distilled static)
    │                 one ~100M+ ceiling option (TBD by bake-off)
    │
    ├─ regression head (trained)
    │     Linear(d_embed → 128) → ReLU → Dropout(0.1) → Linear(128 → 1)
    │     Output: predicted naturalness in [1, 5]
    │
    └─ scalar score → reranker sort key

The encoder is frozen during training (only the head is learned). This keeps training cheap, image size predictable, and gives clean apples-to-apples comparison across encoder candidates.
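
The head in the diagram can be sketched in PyTorch as below. This is an illustration, not the final train_head.py: the class name NaturalnessHead is hypothetical, the 384-dimensional input is an assumption about the small transformer candidates, and clamping to [1, 5] is one assumed way to honor the stated output range (the spec does not say how the range is enforced).

```python
import torch
from torch import nn


class NaturalnessHead(nn.Module):
    """Linear(d_embed -> 128) -> ReLU -> Dropout(0.1) -> Linear(128 -> 1)."""

    def __init__(self, d_embed: int, hidden: int = 128, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_embed, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, d_embed) -> (batch,)
        return self.net(embeddings).squeeze(-1)


D_EMBED = 384  # assumption: embedding width of the small encoder candidates
head = NaturalnessHead(D_EMBED)
n_params = sum(p.numel() for p in head.parameters())

head.eval()  # disables dropout for inference
with torch.no_grad():
    fake_embeddings = torch.randn(4, D_EMBED)  # stand-in for frozen-encoder output
    scores = head(fake_embeddings).clamp(1.0, 5.0)  # assumed range enforcement
```

Because the encoder is frozen, its embeddings can be computed once per candidate encoder and cached, so only these few tens of thousands of parameters are trained per bake-off run.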

Encoder bake-off (Phase 1)

Train one regression head per encoder candidate on the same training data + same loss + same hyperparameters. Hold out 15% for eval. Decision criteria, in priority order:

  1. Spearman vs teacher labels on held-out (must clear 0.75)
  2. Image MB (encoder weights on disk)
  3. Inference latency (ms per candidate, batched n=64, on standard-2 vCPU)
  4. Tie-break: prefer smaller model if Spearman within 0.02
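
One possible reading of the decision criteria above, as a sketch: drop candidates below the Spearman gate, then among those within the tie margin of the best Spearman prefer the smallest image, then the lowest latency. BakeoffResult and pick_winner are hypothetical names, and the figures in the example are made up for illustration, not bake-off results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BakeoffResult:
    name: str
    spearman: float    # vs teacher labels on the held-out split
    image_mb: float    # encoder weights on disk
    latency_ms: float  # per candidate, batched n=64


def pick_winner(
    results: Sequence[BakeoffResult],
    gate: float = 0.75,
    tie_margin: float = 0.02,
) -> Optional[BakeoffResult]:
    """Return the winning encoder, or None if nothing clears the gate."""
    passing = [r for r in results if r.spearman >= gate]
    if not passing:
        return None  # signal to try the ~100M ceiling option or re-scope
    best = max(r.spearman for r in passing)
    # Candidates close enough to the best are treated as tied on quality.
    near_best = [r for r in passing if best - r.spearman <= tie_margin]
    return min(near_best, key=lambda r: (r.image_mb, r.latency_ms))


# Illustrative numbers only.
example = [
    BakeoffResult("encoder_a", spearman=0.80, image_mb=130.0, latency_ms=20.0),
    BakeoffResult("encoder_b", spearman=0.79, image_mb=60.0, latency_ms=2.0),
    BakeoffResult("encoder_c", spearman=0.70, image_mb=30.0, latency_ms=1.0),
]
winner = pick_winner(example)
```

Here encoder_c is excluded by the gate despite being smallest, and encoder_b wins over encoder_a because it is within the margin and smaller.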

Bake-off models:

| Candidate | Params | Released | Notes |
| --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | 33M | Sept 2023 | Battle-tested baseline; user has prior experience |
| Snowflake/snowflake-arctic-embed-s | 33M | Apr 2024 | Newer with stronger MTEB scores at small size |
| minishlab/potion-retrieval-32M | 32M (static) | 2024 | Model2Vec distillation; no transformer at inference (~100× faster) |
| TBD ceiling | ~100M | n/a | If small models all fail Spearman gate, fall back to one ~100M option (bge-base-en-v1.5 or similar) before re-scoping |

Bake-off output: a decision memo + the winning encoder + saved head weights. Total bake-off compute: ~1-2 hours on user's MPS / CPU per encoder × 4 = ~4-8 hours.

Training data composition (~13K labeled)

| Source | Count | Why |
| --- | --- | --- |
| CSP-generated nonce | ~5,000 | The natural negative generator. Stratified across diverse constraint sets so the scorer sees the failure modes it'll encounter at runtime: weak selectional fillers, awkward agreement, ungrammatical determiner choices, semantically odd subject-verb-object triples. |
| Curated-corpus sample | ~5,000 | Stratified across the 6 sources (CoLA / UD-EWT / GUM / CHILDES / PhonBank / Tatoeba). Real-world "bad" surfaces appear naturally: Tatoeba awkward translations, CHILDES disfluencies, PhonBank fragments, GUM register variance. |
| CoLA unacceptable | ~2,000 | Warstadt 2019 linguistic-acceptability gold labels. Anchors the low end with published-research-grade negatives. |
| Optional supplement | ~1,000 | LLM-generated "make slightly worse" rewrites. Only if held-out shows the model can't distinguish near-misses. Don't manufacture this preemptively. |

No manufactured surface-chaos. The CSP solver enumerates broadly precisely because the reranker is the quality gate — diverse CSP output IS the negative distribution the scorer needs to learn.
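
A sketch of the stratified draw for the curated-corpus sample, assuming rows carry a source field. The spec says "stratified" without fixing the allocation, so the equal split per source here is an assumption; stratified_sample and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, total, seed=0):
    """Draw about `total` rows, split as evenly as possible across strata."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[key]].append(row)
    if not by_stratum:
        return []
    strata = sorted(by_stratum)
    base, extra = divmod(total, len(strata))
    sample = []
    for i, name in enumerate(strata):
        quota = base + (1 if i < extra else 0)
        pool = by_stratum[name]
        # A stratum smaller than its quota contributes everything it has.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


SOURCES = ["CoLA", "UD-EWT", "GUM", "CHILDES", "PhonBank", "Tatoeba"]
toy_rows = [
    {"source": s, "text": f"{s} sentence {i}"} for s in SOURCES for i in range(50)
]
sample = stratified_sample(toy_rows, key="source", total=12)
per_source = Counter(row["source"] for row in sample)
```

The same helper could stratify CSP nonce by constraint set, given a column that identifies the constraint set each sentence came from.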

Teacher: gpt-4.1-mini

Cloze-rating prompt scores 1-5 naturalness per sentence. Single label, not multi-label. Prompt template parity with PHON-73/82/83/115 (uses the same logprob-expected-value extraction).
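
The logprob-expected-value idea can be sketched as follows. This is the generic technique, not the PHON-73/82/83/115 harness itself: it assumes the teacher response exposes log-probabilities for the candidate rating tokens "1" through "5" at the answer position, and expected_rating is a hypothetical name.

```python
import math


def expected_rating(token_logprobs, labels=("1", "2", "3", "4", "5")):
    """Probability-weighted mean rating over the rating tokens present."""
    # Keep only rating tokens; other tokens (whitespace, words) are ignored.
    probs = {t: math.exp(lp) for t, lp in token_logprobs.items() if t in labels}
    total = sum(probs.values())
    if total == 0.0:
        raise ValueError("no rating tokens among the returned logprobs")
    # Renormalize so the ratings' probabilities sum to 1.
    return sum(int(t) * p for t, p in probs.items()) / total


# Made-up logprobs for one sentence: mass mostly on "4", some on "5" and "3".
example_logprobs = {
    "4": math.log(0.6),
    "5": math.log(0.3),
    "3": math.log(0.1),
}
label = expected_rating(example_logprobs)  # about 4.2
```

Compared with taking the single most likely rating, the expected value yields a continuous label, which gives the regression head a finer-grained target.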

Cost estimate: ~13K sentences × ~200 input tokens ≈ 2.6M input tokens; at ~$0.15/1M that is ~$0.40 input, plus ~$0.40 output, for ~$0.80 total. Budget for teacher labeling: ≤ $2 (~2.5× headroom for retries / re-labeling).

Loss

Huber regression on 1-5 anchors. Robust to outliers in teacher labels (~5% expected disagreement). Optional pairwise hinge term if direct ranking signal underperforms Huber alone on the held-out Spearman gate; don't ship if not needed.

loss = HuberLoss(delta=0.5)(predicted_score, teacher_label)
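
To make the effect of delta=0.5 concrete, here is a dependency-free sketch of the per-example Huber loss, using the same piecewise definition as torch.nn.HuberLoss: quadratic inside delta, linear outside. The function name huber and the sample values are illustrative only.

```python
def huber(pred: float, target: float, delta: float = 0.5) -> float:
    """Per-example Huber loss: quadratic near zero, linear in the tails."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


# Small error (0.2 rating points): stays in the quadratic regime.
small = huber(3.8, 4.0)   # 0.5 * 0.2**2, about 0.02
# Large error (2 rating points), e.g. a noisy teacher label: linear regime.
large = huber(1.0, 3.0)   # 0.5 * (2.0 - 0.25) = 0.875
# Squared error would charge 0.5 * 2.0**2 = 2.0 for the same miss.
squared_equivalent = 0.5 * (1.0 - 3.0) ** 2
```

With delta at half a rating point, any miss larger than 0.5 contributes a constant-magnitude gradient, which is what limits the pull of the expected outlier labels.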

Deployment

  1. Train head, save weights to data/runtime/naturalness_scorer.pt (LFS-tracked; expected ~1-2 MB for the regression head).
  2. Drop-in replace csp/reranker/predict.py:
     • _cached_model() → loads encoder + head weights
     • predict_axes_batch(candidates) → passes encoder.encode(batch) through the regression head and returns [{naturalness: float}]
  3. Pre-bake corpus scores: keep for v5.2. The scorer runs at build time on corpus_sentences_index.parquet; the precomputed naturalness_score column already exists from the bi-encoder path and is simply regenerated by the new scorer. v5.3 can simplify to runtime-only once latency is empirically verified.
  4. Retire naturalness_reference.npy + naturalness_reference_meta.jsonl from LFS once the new scorer ships. Keep on disk briefly (1 release cycle) for recovery.
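
A sketch of the drop-in call shape, with the encoder and head injected so the example runs without model weights. make_predict_axes_batch, fake_encode, and fake_head are hypothetical; the real predict.py would obtain both pieces from _cached_model(), and the clamp to [1, 5] is an assumption about how the output range is enforced.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def make_predict_axes_batch(
    encode: Callable[[Sequence[str]], Sequence[Vector]],
    head: Callable[[Sequence[Vector]], Sequence[float]],
    lo: float = 1.0,
    hi: float = 5.0,
) -> Callable[[Sequence[str]], list[dict[str, float]]]:
    """Build a predict_axes_batch(candidates) with the documented call shape."""

    def predict_axes_batch(candidates: Sequence[str]) -> list[dict[str, float]]:
        if not candidates:
            return []
        vectors = encode(candidates)  # frozen encoder: sentences -> vectors
        raw = head(vectors)           # trained head: vectors -> raw scores
        # Single axis populated; scores clamped into the rating range.
        return [{"naturalness": min(hi, max(lo, float(s)))} for s in raw]

    return predict_axes_batch


# Toy stand-ins so the sketch is runnable; not real models.
def fake_encode(batch):
    return [[float(len(s.split()))] for s in batch]


def fake_head(vectors):
    return [v[0] for v in vectors]


predict_axes_batch = make_predict_axes_batch(fake_encode, fake_head)
result = predict_axes_batch(["a b c", "a b c d e f g"])
```

Callers that sort on the naturalness key keep working unchanged, since the return shape is still a list of dicts with that key.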

Eval gates

| Gate | Threshold | Why |
| --- | --- | --- |
| Spearman vs teacher on held-out 15% | ≥ 0.75 | PHON-107 v2 hit 0.805 on naturalness with LightGBM + 4 axes; 0.75 is the floor for a single-axis encoder + regression head. |
| Latency, batched n=64, standard-2 vCPU | ≤ 100 ms / candidate | ~70× headroom over today's 1.4 ms (Qwen cosine). Anything slower risks user-perceived regression. |
| Image MB | ≤ 200 MB encoder weights | ~6× smaller than today's 1.2 GB Qwen. Frees container headroom. |
| Behavioral smoke (qualitative) | 20 hand-curated good/bad pairs all score correctly | The "the the the the" / "natural English" sanity check. Catches PLL-style degenerate-text failures. |
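
For reference, the Spearman gate can be computed without extra dependencies as the Pearson correlation of average ranks, which handles tied labels. This is a sketch with hypothetical function names and made-up scores; a real eval would more likely call an existing statistics library.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Made-up held-out labels and predictions; one adjacent pair is swapped.
teacher = [1.0, 2.5, 3.0, 4.5, 5.0]
predicted = [1.2, 2.0, 3.5, 3.4, 4.8]
rho = spearman(teacher, predicted)
passes_gate = rho >= 0.75
```

Spearman only looks at ordering, so a scorer that is miscalibrated in absolute terms can still pass this gate as long as it ranks sentences the way the teacher does.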

Risks

  1. Encoder bake-off doesn't clear 0.75 with small models. Mitigation: fall back to ~100M ceiling encoder. If even that doesn't clear, the gate itself may be wrong — re-examine teacher label noise vs model capacity.
  2. Teacher disagreement on CSP nonce. gpt-4.1-mini may rate CSP nonce inconsistently because the failure modes don't match its training distribution of natural-English-quality assessments. Mitigation: spot-check 100 random teacher labels manually before training; if disagreement > 20%, refine prompt.
  3. Distribution shift between training-time CSP nonce and post-PHON-110-iteration nonce. The CSP solver may emit different surface distributions in 6 months. Mitigation: schedule periodic retraining (v5.3 ticket) whenever the CSP solver or selectional.parquet changes substantially.
  4. Implementation timeline overruns v5.2 release pressure. Mitigation: spec includes hard scope boundaries (single axis, no cross-encoder, no manufactured negatives). If bake-off + training + eval doesn't fit, hold the bi-encoder cosine for the v5.2 release and file PHON-116 for v5.3 — DO NOT ship a half-trained scorer.
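
The spot-check in risk 2 needs a definition of "disagreement", which the spec leaves open. The sketch below assumes a human rating and a teacher rating disagree when they differ by more than one rating point; that tolerance, the function name, and the sample ratings are all illustrative assumptions.

```python
def disagreement_rate(human, teacher, tolerance=1.0):
    """Fraction of items where human and teacher ratings differ by > tolerance."""
    if len(human) != len(teacher) or not human:
        raise ValueError("need two equal-length, non-empty rating lists")
    disagreements = sum(1 for h, t in zip(human, teacher) if abs(h - t) > tolerance)
    return disagreements / len(human)


REFINE_PROMPT_ABOVE = 0.20  # threshold from the mitigation in risk 2

# Made-up spot-check of ten items (the spec calls for 100).
human_ratings = [4, 2, 5, 1, 3, 4, 2, 5, 3, 1]
teacher_ratings = [4.2, 3.6, 4.8, 1.1, 2.9, 2.5, 2.2, 4.9, 3.1, 2.4]
rate = disagreement_rate(human_ratings, teacher_ratings)
needs_prompt_refinement = rate > REFINE_PROMPT_ABOVE
```

In this toy sample three of ten items fall outside the tolerance, so the prompt would be refined before any training run.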

Implementation plan (sketch — to be detailed in plan doc once spec is signed off)

  1. Generate training data (~3 hours)
     • 5K CSP nonce via existing phonolex_generators.csp.solver with stratified constraint sets
     • 5K curated-corpus stratified sample from corpus_sentences_index.parquet
     • 2K CoLA unacceptable subset (already in repo)
     • Label all via gpt-4.1-mini batch API (parity with PHON-115 harness)
  2. Encoder bake-off (~4-8 hours)
     • Implement train_head.py (PyTorch, single-script, scriptable across encoder candidates)
     • Run for each candidate; collect metrics
     • Write decision memo
  3. Train winner + eval (~1 hour)
  4. Deployment swap (~2 hours)
     • Replace predict.py internals
     • Regenerate corpus index naturalness scores
     • Update CLAUDE.md "Naturalness reranker" section + Data Contract
  5. Smoke + PR (~1 hour)

Total estimate: ~10-15 hours.
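
The smoke step in the plan can be sketched as a pairwise check: for every hand-curated (good, bad) pair, the good sentence must score strictly higher. failed_pairs is a hypothetical helper, and toy_score is a deliberately crude stand-in (distinct-word ratio) so the example runs; it is not the trained scorer and not a proposed baseline.

```python
from typing import Callable, Sequence


def failed_pairs(
    score: Callable[[str], float],
    pairs: Sequence[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return the (good, bad) pairs where good does not outscore bad."""
    return [(good, bad) for good, bad in pairs if not score(good) > score(bad)]


def toy_score(sentence: str) -> float:
    # Stand-in only: maps the share of distinct words onto the 1-5 range.
    words = sentence.split()
    if not words:
        return 1.0
    return 1.0 + 4.0 * len(set(words)) / len(words)


# The degenerate-text pair named in the Non-goals; the real set has 20 pairs.
smoke_pairs = [("the cat chased the ball", "the the the the the")]
failures = failed_pairs(toy_score, smoke_pairs)
smoke_ok = not failures
```

Returning the failing pairs, not just a boolean, makes the PR smoke output directly actionable when a pair regresses.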

Linked precedents

  • FineWeb-Edu quality classifier (HuggingFace 2024): the cleanest precedent. Large LLM scores documents, small encoder distills into a runtime classifier. Direct sentence-level analog.
  • CoLA (Warstadt 2019): sentence acceptability dataset, public, used as the linguistic anchor.
  • COMET (Rei 2020): MT quality estimation, regression-head-over-encoder at sentence-pair level (we adapt to single-sentence absolute scoring).
  • PHON-107: 4-axis LightGBM precedent. Same teacher family; same loss family; different architecture (LGB → encoder+head) and scope (4 axes → 1).

Open items pending implementation greenlight

  • Spec sign-off from user
  • Final encoder ceiling-option pick (bge-base-en-v1.5 vs other; decide during bake-off if small encoders fail the gate)
  • Ticket: PHON-116 to be filed in Jira after spec sign-off (parent: PHON-56)

No code changes will land until user explicitly greenlights implementation per feedback_scope_per_authorization.