PHON-116: Naturalness scorer (real reranking model, not bi-encoder cosine)
Status: SPEC
Owner: TBD
Filed: 2026-05-13
Parent workstream: PHON-56 (Generation Quality Overhaul)
Branch (when implementation greenlit): feature/phon-116-naturalness-scorer off release/v5.2.0
TL;DR
Ship a real trained naturalness scorer to replace the bi-encoder Qwen-cosine path that PHON-110 substituted for PHON-107's 4-axis LightGBM. Single axis: naturalness. Teacher: gpt-4.1-mini (cloze-rating family parity with PHON-73/82/83/115). Training data: ~12K labeled sentences sourced naturally (~13K with the optional ~1K supplement): CSP-generated nonce (~5K) + curated-corpus stratified sample (~5K) + CoLA unacceptable (~2K). Architecture: encoder + regression head; pick the encoder via bake-off.
This is v5.2 scope because PHON-110 (2026-05-10) ripped out the 4-axis LightGBM stack and shipped bi-encoder cosine under the name "reranker" without flagging the substitution. What ships in packages/generators/src/phonolex_generators/csp/reranker/predict.py today computes cosine distance to a 23K-sentence reference matrix — not a reranker by any standard sense (a reranker scores (query, candidate) pairs; this scores absolute candidate quality without a query, against a fixed reference distribution).
User direction 2026-05-13: that's shipping the wrong feature. Build the real thing for v5.2.
Why v5.2, not v5.3
project_reranker_simplification.md originally scoped this work as v5.3+ with explicit "don't rip out during release" guidance. PHON-110 ripped it out anyway, against that guidance, and didn't surface the substitution. Two consequences:
- v5.2 currently ships scoring that's fundamentally different from what users / docs / memory describe ("reranker"). The feature is wrong both in label and in substance.
- The bi-encoder cosine path is also a weaker signal than a trained scorer. Cosine-to-corpus measures "is this sentence shaped like sentences in the reference corpus?", which conflates surface register with grammaticality and disproportionately rewards verbose / formal phrasings. A trained scorer learns the per-sentence naturalness judgment directly.
Both reasons justify pulling the work forward.
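To make the contrast concrete, here is a minimal sketch of what cosine-to-reference scoring looks like. The reduction used here (maximum similarity over reference rows) and the toy vectors are assumptions for illustration; this spec does not state how predict.py reduces the similarity matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cosine_to_reference(candidate_vec, reference_matrix):
    """Score one candidate by its best match in a fixed reference set.

    ASSUMPTION: max over reference rows; the production reduction may differ.
    """
    return max(cosine(candidate_vec, ref) for ref in reference_matrix)

# Toy 2-d "embeddings", not real encoder output.
reference = [[1.0, 0.0], [0.0, 1.0]]
score_a = cosine_to_reference([2.0, 0.0], reference)  # parallel to a reference row
score_b = cosine_to_reference([1.0, 1.0], reference)  # 45 degrees from both rows
```

No query appears anywhere in the signature: the score depends only on how close the candidate sits to the reference distribution, which is why the "reranker" label does not fit.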
Goals
- Drop-in replacement for the _cached_model + cosine pathway in csp/reranker/predict.py. Same call shape (predict_axes_batch(candidates) -> [{naturalness: float, ...}]); single axis populated.
- Eval gate: Spearman ≥ 0.75 vs gpt-4.1-mini on a held-out 15% split of the labeled data.
- Latency budget: ≤ 100 ms per candidate, batched, on the Cloudflare Container's 1 vCPU (standard-2 tier). Today's Qwen3-Embedding-0.6B cosine is ~1.4 ms / candidate batched; that is the baseline, and any regression from it should be measured and reported, not silently absorbed.
- Image size budget: model weights ≤ 200 MB on disk. Today's Qwen path is ~1.2 GB; that frees ~1 GB of image size for headroom on future container upgrades.
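The latency goal above could be checked with a small timing harness along these lines. This is a sketch only: ms_per_candidate and dummy_scorer are hypothetical names, the dummy scorer stands in for the real encoder + head, and a meaningful measurement would have to run on the target container, not a development machine.

```python
import time

def ms_per_candidate(score_batch, candidates, batch_size=64, repeats=5):
    """Return the best-of-`repeats` wall-clock milliseconds per candidate."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(0, len(candidates), batch_size):
            score_batch(candidates[i:i + batch_size])
        best = min(best, time.perf_counter() - start)
    return best * 1000.0 / len(candidates)

def dummy_scorer(batch):
    # HYPOTHETICAL stand-in for the real encoder + head.
    return [{"naturalness": float(len(s.split()))} for s in batch]

sentences = ["the cat chased the ball"] * 256
latency_ms = ms_per_candidate(dummy_scorer, sentences)
within_budget = latency_ms <= 100.0  # the 100 ms / candidate budget
```

Taking the minimum over several repeats reduces noise from warm-up and scheduling; reporting a percentile instead would be a reasonable alternative.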
Non-goals
- No multi-axis breakdown. PHON-107's 4-axis (naturalness / grammaticality / age_appropriate / coherence) is explicitly retired. Single scalar is what callers actually consume and what the UI surfaces; the others were either intrinsic to the realizer (grammaticality) or redundant in practice (coherence ≈ naturalness on single-sentence inputs; age_appropriate is poorly defined on CSP nonce and easier to enforce via constraints).
- No cross-encoder. The task is absolute scoring (one input → one score), not (query, candidate) pair ranking. Cross-encoder pattern doesn't fit.
- No PLL / pseudo-log-likelihood baseline. PHON-100 hit the canonical-degenerate failure mode ("the the the the the" outranks "the cat chased the ball" because "the" is over-predictable in masked context). Hard skip.
- No manufactured surface-chaos negatives (shuffle, random substitution, truncation). CSP nonce is the natural negative generator — diverse failure modes that reflect production. Manufactured chaos teaches a useless distinction (well-formed vs gibberish) that doesn't reflect what the scorer needs to distinguish at runtime.
- No retention of bi-encoder cosine as fallback. Single code path post-cutover.
Design
Architecture
candidate sentence (str)
│
├─ encoder (frozen): sentence → vector
│ candidates: bge-small-en-v1.5 (33M, 2023)
│ snowflake-arctic-embed-s (33M, 2024)
│ potion-retrieval-32M (Model2Vec distilled static)
│ one ~100M+ ceiling option (TBD by bake-off)
│
├─ regression head (trained)
│ Linear(d_embed → 128) → ReLU → Dropout(0.1) → Linear(128 → 1)
│ Output: predicted naturalness in [1, 5]
│
└─ scalar score → reranker sort key
The encoder is frozen during training (only the head is learned). This keeps training cheap and image size predictable, and gives a clean apples-to-apples comparison across encoder candidates.
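As a rough illustration of the head in the diagram, the inference-time forward pass can be written in a few lines of NumPy. This is a sketch, not the training code: the weights are random, d_embed = 384 assumes a bge-small-sized encoder, and the final clamp to [1, 5] is an assumption (a bare linear output layer is unbounded).

```python
import numpy as np

def init_head(d_embed, hidden=128, seed=0):
    """Random parameters for Linear(d_embed -> hidden) -> ReLU -> Linear(hidden -> 1)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.02, size=(d_embed, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, size=(hidden, 1)),
        "b2": np.zeros(1),
    }

def head_forward(params, embeddings):
    """Inference-time forward pass; dropout is the identity at inference."""
    hidden = np.maximum(embeddings @ params["w1"] + params["b1"], 0.0)  # ReLU
    raw = (hidden @ params["w2"] + params["b2"]).ravel()
    # ASSUMPTION: clamp into the 1-5 label range.
    return np.clip(raw, 1.0, 5.0)

d_embed = 384  # ASSUMPTION: embedding width of the chosen encoder
params = init_head(d_embed)
fake_embeddings = np.random.default_rng(1).normal(size=(4, d_embed))  # not real encoder output
scores = head_forward(params, fake_embeddings)
n_params = sum(p.size for p in params.values())
```

At this width the head has 49,409 parameters, so its footprint is negligible next to any of the encoder candidates.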
Encoder bake-off (Phase 1)
Train one regression head per encoder candidate on the same training data + same loss + same hyperparameters. Hold out 15% for eval. Decision criteria, in priority order:
- Spearman vs teacher labels on held-out (must clear 0.75)
- Image MB (encoder weights on disk)
- Inference latency (ms per candidate, batched n=64, on standard-2 vCPU)
- Tie-break: prefer smaller model if Spearman within 0.02
Bake-off models:
| Candidate | Params | Released | Notes |
|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 33M | Sept 2023 | Battle-tested baseline; user has prior experience |
| Snowflake/snowflake-arctic-embed-s | 33M | Apr 2024 | Newer with stronger MTEB scores at small size |
| minishlab/potion-retrieval-32M | 32M (static) | 2024 | Model2Vec distillation; no transformer at inference (~100× faster) |
| TBD ceiling | ~100M | — | If small models all fail Spearman gate, fall back to one ~100M option (bge-base-en-v1.5 or similar) before re-scoping |
Bake-off output: a decision memo + the winning encoder + saved head weights. Total bake-off compute: ~1-2 hours on user's MPS / CPU per encoder × 4 = ~4-8 hours.
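One way to read the decision criteria is as a small selection function. The sketch below is one interpretation, not a settled rule: it treats every candidate within 0.02 Spearman of the best passing candidate as tied, then prefers the smaller image and, after that, the lower latency. BakeoffResult and pick_encoder are hypothetical names, and the numbers are made up to exercise the logic.

```python
from dataclasses import dataclass

@dataclass
class BakeoffResult:
    name: str
    spearman: float
    image_mb: float
    latency_ms: float

def pick_encoder(results, gate=0.75, tie_margin=0.02):
    """Return the winning result, or None if nothing clears the gate."""
    passing = [r for r in results if r.spearman >= gate]
    if not passing:
        return None
    best = max(r.spearman for r in passing)
    # Candidates within `tie_margin` of the best Spearman count as tied.
    tied = [r for r in passing if best - r.spearman <= tie_margin]
    # Among ties, prefer the smaller image, then the lower latency.
    return min(tied, key=lambda r: (r.image_mb, r.latency_ms))

# Made-up numbers purely to exercise the logic; not measured results.
results = [
    BakeoffResult("encoder-a", 0.78, 130.0, 6.0),
    BakeoffResult("encoder-b", 0.79, 440.0, 20.0),
    BakeoffResult("encoder-c", 0.70, 60.0, 0.1),
]
winner = pick_encoder(results)
no_winner = pick_encoder([results[2]])
```

Here encoder-b has the best Spearman, but encoder-a is within the margin and smaller, so it wins; encoder-c is fast and small but fails the gate outright. A None result corresponds to the "fall back to the ~100M ceiling option" branch.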
Training data composition (~13K labeled)
| Source | Count | Why |
|---|---|---|
| CSP-generated nonce | ~5,000 | The natural negative generator. Stratified across diverse constraint sets so the scorer sees the failure modes it'll encounter at runtime: weak selectional fillers, awkward agreement, ungrammatical determiner choices, semantically odd subject-verb-object triples. |
| Curated-corpus sample | ~5,000 | Stratified across the 6 sources (CoLA / UD-EWT / GUM / CHILDES / PhonBank / Tatoeba). Real-world "bad" surfaces appear naturally: Tatoeba awkward translations, CHILDES disfluencies, PhonBank fragments, GUM register variance. |
| CoLA unacceptable | ~2,000 | Warstadt 2019 linguistic-acceptability gold labels. Anchors the low end with published-research-grade negatives. |
| Optional supplement | ~1,000 | LLM-generated "make slightly worse" rewrites. Only if held-out shows the model can't distinguish near-misses. Don't manufacture this preemptively. |
No manufactured surface-chaos. The CSP solver enumerates broadly precisely because the reranker is the quality gate — diverse CSP output IS the negative distribution the scorer needs to learn.
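Stratifying a sample across sources, as the curated-corpus row describes, can be sketched as follows. Equal allocation per source and the "source" field name are assumptions; the real split may weight sources differently.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, total, seed=0):
    """Draw roughly `total` rows, split evenly across the strata named by `key`."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[key]].append(row)
    per_stratum = total // len(by_stratum)
    sample = []
    for stratum in sorted(by_stratum):
        pool = by_stratum[stratum]
        # Take everything if a stratum is smaller than its quota.
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample

# Toy rows; the `source` field name is an assumption.
rows = [
    {"source": src, "text": f"{src} sentence {i}"}
    for src in ("CoLA", "UD-EWT", "GUM", "CHILDES", "PhonBank", "Tatoeba")
    for i in range(50)
]
sample = stratified_sample(rows, key="source", total=60)
counts = defaultdict(int)
for row in sample:
    counts[row["source"]] += 1
```

The same helper would apply to the CSP nonce if each generated sentence carried an identifier for its constraint set to stratify on.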
Teacher: gpt-4.1-mini
Cloze-rating prompt scores 1-5 naturalness per sentence. Single label, not multi-label. Prompt template parity with PHON-73/82/83/115 (uses the same logprob-expected-value extraction).
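The logprob-expected-value idea can be sketched as below: instead of taking the single most likely rating token, weight each rating 1-5 by its probability. The function name, the input shape (a token-to-logprob mapping for the rating position) and the renormalisation over rating tokens are assumptions; the PHON-73/82/83/115 harness may differ in detail.

```python
import math

def expected_rating(logprobs_by_token):
    """Probability-weighted mean rating from the logprobs of the tokens '1'..'5'.

    Tokens outside 1-5 are ignored and the remaining mass is renormalised.
    """
    weights = {}
    for token, logprob in logprobs_by_token.items():
        stripped = token.strip()
        if stripped in {"1", "2", "3", "4", "5"}:
            rating = int(stripped)
            weights[rating] = weights.get(rating, 0.0) + math.exp(logprob)
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no rating tokens among the returned logprobs")
    return sum(r * w for r, w in weights.items()) / total

# Toy top-logprobs for the single rating token (made-up values).
example = {"4": math.log(0.6), "5": math.log(0.3), "3": math.log(0.1)}
label = expected_rating(example)  # 4*0.6 + 5*0.3 + 3*0.1
```

This yields continuous labels between the integer anchors, which suits a regression target better than hard 1-5 classes.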
Cost estimate: ~13K sentences × ~200 input tokens × ~$0.15/1M = ~$0.40 input + ~$0.40 output. Total budget for teacher labeling: ≤ $2 (~2.5× headroom for retries / re-labeling).
Loss
Huber regression on 1-5 anchors. Robust to outliers in teacher labels (~5% expected disagreement). Optional pairwise hinge term, added only if Huber alone misses the held-out Spearman gate; don't ship it if not needed.
loss = HuberLoss(delta=0.5)(predicted_score, teacher_label)
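For reference, the Huber loss with delta = 0.5 behaves as in this plain-Python sketch, which follows the same piecewise definition as torch.nn.HuberLoss. The function names are illustrative.

```python
def huber(pred, target, delta=0.5):
    """Huber loss for one example: quadratic near zero, linear beyond `delta`."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)

def mean_huber(preds, targets, delta=0.5):
    """Mean Huber loss over a batch."""
    return sum(huber(p, t, delta) for p, t in zip(preds, targets)) / len(preds)

small = huber(3.2, 3.0)            # inside the quadratic zone
large = huber(1.0, 4.0)            # a 3-point teacher/model disagreement
squared = 0.5 * (1.0 - 4.0) ** 2   # what plain squared error would charge
batch = mean_huber([3.2, 1.0], [3.0, 4.0])
```

A 3-point disagreement costs 1.375 under Huber versus 4.5 under squared error, which is why a handful of noisy teacher labels pulls the head around less.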
Deployment
- Train head, save weights to data/runtime/naturalness_scorer.pt (LFS-tracked; expected well under 1 MB for the regression head: ~50K fp32 parameters at d_embed = 384).
- Drop-in replace csp/reranker/predict.py:
  - _cached_model() → loads encoder + head weights
  - predict_axes_batch(candidates) → head(encoder.encode(batch)), i.e. the full regression head applied to the embeddings (not a single matmul), returning [{naturalness: float}]
- Pre-bake corpus scores: keep for v5.2. Scorer runs at build time on corpus_sentences_index.parquet; the precomputed naturalness_score column already exists from the bi-encoder path and is simply regenerated by the new scorer. v5.3 can simplify to runtime-only once latency is empirically verified.
- Retire naturalness_reference.npy + naturalness_reference_meta.jsonl from LFS once the new scorer ships. Keep on disk briefly (1 release cycle) for recovery.
Eval gates
| Gate | Threshold | Why |
|---|---|---|
| Spearman vs teacher on held-out 15% | ≥ 0.75 | PHON-107 v2 hit 0.805 on naturalness with LightGBM + 4 axes; 0.75 is the floor for a single-axis encoder + regression head. |
| Latency, batched n=64, standard-2 vCPU | ≤ 100 ms / candidate | ~70× headroom over today's 1.4 ms (Qwen cosine). Anything slower risks user-perceived regression. |
| Image MB | ≤ 200 MB encoder weights | ~6× smaller than today's 1.2 GB Qwen. Frees container headroom. |
| Behavioral smoke (qualitative) | 20 hand-curated good/bad pairs all score correctly | The "the the the the" / "natural English" sanity check. Catches PLL-style degenerate-text failures. |
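The Spearman gate compares rank order, not absolute values. A dependency-free sketch of the computation follows (in practice scipy.stats.spearmanr does the same job); the labels and scores are made up.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

teacher = [1.2, 2.5, 3.1, 4.0, 4.8]    # made-up teacher labels
predicted = [1.5, 2.0, 3.5, 3.4, 4.9]  # made-up model scores
rho = spearman(teacher, predicted)
passes_gate = rho >= 0.75
```

In this toy case one adjacent pair is swapped, giving rho = 0.9. Because only ordering matters, a scorer can pass the gate while being miscalibrated on the 1-5 scale, which is acceptable when the score is used purely as a sort key.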
Risks
- Encoder bake-off doesn't clear 0.75 with small models. Mitigation: fall back to ~100M ceiling encoder. If even that doesn't clear, the gate itself may be wrong — re-examine teacher label noise vs model capacity.
- Teacher disagreement on CSP nonce. gpt-4.1-mini may rate CSP nonce inconsistently because the failure modes don't match its training distribution of natural-English-quality assessments. Mitigation: spot-check 100 random teacher labels manually before training; if disagreement > 20%, refine prompt.
- Distribution shift between training-time CSP nonce and post-PHON-110-iteration nonce. The CSP solver may emit different surface distributions in 6 months. Mitigation: schedule periodic retraining (v5.3 ticket) when CSP solver or selectional.parquet change substantially.
- Implementation timeline overruns v5.2 release pressure. Mitigation: spec includes hard scope boundaries (single axis, no cross-encoder, no manufactured negatives). If bake-off + training + eval doesn't fit, hold the bi-encoder cosine for the v5.2 release and file PHON-116 for v5.3 — DO NOT ship a half-trained scorer.
Implementation plan (sketch; to be detailed in plan doc once spec is signed off)
- Generate training data (~3 hours)
  - 5K CSP nonce via existing phonolex_generators.csp.solver with stratified constraint sets
  - 5K curated-corpus stratified sample from corpus_sentences_index.parquet
  - 2K CoLA unacceptable subset (already in repo)
  - Label all via gpt-4.1-mini batch API (parity with PHON-115 harness)
- Encoder bake-off (~4-8 hours)
  - Implement train_head.py (PyTorch, single-script, scriptable across encoder candidates)
  - Run for each candidate; collect metrics
  - Write decision memo
- Train winner + eval (~1 hour)
- Deployment swap (~2 hours)
  - Replace predict.py internals
  - Regenerate corpus index naturalness scores
  - Update CLAUDE.md "Naturalness reranker" section + Data Contract
- Smoke + PR (~1 hour)
Total estimate: ~11-15 hours.
Linked precedents
- FineWeb-Edu quality classifier (HuggingFace 2024): the cleanest precedent. Large LLM scores documents, small encoder distills into a runtime classifier. Direct sentence-level analog.
- CoLA (Warstadt 2019): sentence acceptability dataset, public, used as the linguistic anchor.
- COMET (Rei 2020): MT quality estimation, regression-head-over-encoder at sentence-pair level (we adapt to single-sentence absolute scoring).
- PHON-107: 4-axis LightGBM precedent. Same teacher family; same loss family; different architecture (LGB → encoder+head) and scope (4 axes → 1).
Open items pending implementation greenlight
- Spec sign-off from user
- Final encoder ceiling-option pick (bge-base-en-v1.5 vs other; decide during bake-off if small encoders fail the gate)
- Ticket: PHON-116 to be filed in Jira after spec sign-off (parent: PHON-56)
No code changes will land until user explicitly greenlights implementation per feedback_scope_per_authorization.