PHON-107: Reranker v2 design
Goal
Replace the v1 single-scalar quality axis with 4 independent per-axis scorers (one each for naturalness, grammaticality, age_appropriate, coherence). Train on a 5× expanded dataset gathered via active-learning loops over the PHON-112/113 constraint-driven harness output. Caller-supplied axis weights compose into a single sort score; each candidate also exposes its per-axis predictions for downstream UI breakdown (PHON-110) and use-case-specific weighting.
Motivation
The v1 reranker (quality_axis.py, $1.58 one-shot, Spearman 0.633) collapses 4 distinct quality signals into one mean target. This loses information in two ways:
- No actionable axis breakdown for the frontend. The PHON-110 reframe asks for top-K candidates with per-axis breakdown so users can see why a candidate ranks where it does. v1's scalar can't supply that.
- No use-case-specific weighting. A clinical-SLP request may legitimately want age-appropriate and coherent over natural and grammatical (a slightly stilted but age-targeted sentence is often preferable). v1's fixed mean disallows this.
Secondary: v1 trained once on the v6/spike output. PHON-112 (single-sentence pivot) and PHON-113 (paragraph rebuild) shifted the candidate distribution substantially. The reranker should retrain on output from the new path so the model learns what the productionized system actually emits, not what a retired research path produced.
5× training data is a means, not a target: active learning will tell us when the per-axis Spearmans plateau. A budget of ~$8 bounds the teacher cost.
Architecture
candidate (sentence or paragraph dict)
  │
  ├─ feature extraction
  │    ├─ tabular: ppmi_total, feature_distance, sonorant_diff, n_sentences,
  │    │           has_discourse_marker, candidate_length, … (existing v1 set)
  │    └─ embedding: MiniLM-L6-v2 of the rendered text (existing v1 set)
  │
  ├─ 4 independent LightGBM scorers
  │    ├─ naturalness_lgb     → 1-5 prediction
  │    ├─ grammaticality_lgb  → 1-5 prediction
  │    ├─ age_appropriate_lgb → 1-5 prediction
  │    └─ coherence_lgb       → 1-5 prediction
  │
  └─ composite score
       composite = sum(weights[axis] * pred[axis] for axis in axes)
       weights default to {0.25, 0.25, 0.25, 0.25}; caller may override
The composite is the existing reranker's sort key, so v2 is a drop-in replacement. Per-axis predictions are new outputs available to downstream consumers.
API
# v2 reranker: replaces quality_axis.predict_quality
def predict_axes(
    candidate: dict,
    *,
    is_paragraph: bool,
    band: str,
) -> dict[str, float]:
    """Return per-axis predictions: {naturalness, grammaticality, age_appropriate, coherence}.

    Each value is a model prediction in roughly [1.0, 5.0]."""


def composite_score(
    axis_scores: dict[str, float],
    *,
    weights: dict[str, float] | None = None,
) -> float:
    """Return weighted sum of axis scores. Default weights = 0.25 each."""


def rerank_with_axes(
    candidates: list[dict],
    *,
    is_paragraph: bool,
    band: str,
    weights: dict[str, float] | None = None,
    top_k: int = 8,
) -> list[dict]:
    """Score each candidate's 4 axes, compute composite, sort desc, return top_k.

    Each returned candidate is annotated with:
        candidate["axis_scores"] = {naturalness, grammaticality, age_appropriate, coherence}
        candidate["composite_score"] = float
    """
The existing reranker.rerank() (with τ + MMR + sampling) keeps its existing role — it's the sampling layer over a scored candidate pool. PHON-107 replaces the scoring primitive, not the sampling logic.
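As a rough illustration of how the scoring primitives could fit together, here is a minimal sketch. The name `rerank_with_axes_sketch`, the injected `predict_axes` callable, the choice to copy rather than mutate each candidate, and the rule that axes missing from a caller-supplied weights dict count as 0.0 are all assumptions made for the example, not decisions from this design.

```python
from __future__ import annotations

from typing import Callable

AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def composite_score(
    axis_scores: dict[str, float],
    *,
    weights: dict[str, float] | None = None,
) -> float:
    """Weighted sum of axis scores; defaults to 0.25 per axis."""
    if weights is None:
        weights = {axis: 0.25 for axis in AXES}
    unknown = set(weights) - set(AXES)
    if unknown:
        raise ValueError(f"unknown axes in weights: {sorted(unknown)}")
    # Assumption: an axis omitted from caller-supplied weights contributes 0.
    return sum(weights.get(axis, 0.0) * axis_scores[axis] for axis in AXES)


def rerank_with_axes_sketch(
    candidates: list[dict],
    *,
    predict_axes: Callable[..., dict[str, float]],  # injected so the sketch runs without models
    is_paragraph: bool,
    band: str,
    weights: dict[str, float] | None = None,
    top_k: int = 8,
) -> list[dict]:
    """Annotate each candidate with axis scores and composite, sort desc, keep top_k."""
    scored = []
    for candidate in candidates:
        axis_scores = predict_axes(candidate, is_paragraph=is_paragraph, band=band)
        scored.append({
            **candidate,  # shallow copy; the caller's dict is left untouched
            "axis_scores": axis_scores,
            "composite_score": composite_score(axis_scores, weights=weights),
        })
    # list.sort is stable, so ties keep their input order.
    scored.sort(key=lambda c: c["composite_score"], reverse=True)
    return scored[:top_k]
```

With equal weights two candidates can tie on the composite while differing sharply per axis; passing different weights reorders them without retraining anything.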
Active learning protocol
Round 0: bootstrap
Use existing v1 labeled data: ~2K candidates with 4-axis ratings already in outputs/judged.jsonl. Train 4 LightGBM models on this base. Compute baseline per-axis Spearman on a held-out (verb, band) split.
Round N (N=1..K): uncertainty-driven
1. Run build_judging_set.py to produce a large unlabeled candidate pool from the PHON-112/113 harness (single-sentence + paragraph, all constraint configs).
2. For each unlabeled candidate, compute the variance of its 4 axis predictions (high variance = the axis models disagree = informative).
3. Pick the top-N most uncertain candidates (batch size N=200 per round).
4. Send them to llm_judge (Sonnet 4.6) for 4-axis ratings.
5. Add the new labels to the training set and retrain the 4 axis models.
6. Recompute the held-out Spearmans.
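Steps 2 and 3 above can be sketched as follows. The `predict` callable and both function names are hypothetical. The score follows the definition in the steps (variance across the 4 axis predictions of one candidate), which measures disagreement between axes rather than any single model's confidence.

```python
from statistics import pvariance

AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def axis_disagreement(axis_scores):
    """Population variance across the 4 axis predictions of one candidate."""
    return pvariance([axis_scores[axis] for axis in AXES])


def select_uncertain(pool, *, predict, batch_size=200):
    """Return the batch_size candidates whose axis predictions disagree most.

    `predict` is a hypothetical callable: candidate -> {axis: score}.
    """
    scored = [(axis_disagreement(predict(c)), i) for i, c in enumerate(pool)]
    # Highest variance first; the pool index breaks ties so selection is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [pool[i] for _, i in scored[:batch_size]]
```

The explicit tie-break keeps batch selection reproducible for a fixed pool and fixed models.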
Stopping criterion
Stop when EITHER:
- Per-axis Spearman improvements between rounds drop below 0.01, OR
- Total teacher spend exceeds $8 (~10K labels at ~$0.0008/label, using the v1 cost as the unit rate).
Whichever fires first. The budget cap is the hard ceiling; the Spearman criterion is the soft "diminishing returns" signal.
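A minimal sketch of that check is below. The function name and return shape are hypothetical, and it assumes "improvements drop below 0.01" means that every axis improved by less than 0.01 since the previous round. If a single stalled axis should stop the loop, `max` would become `min`.

```python
def should_stop(prev_spearman, curr_spearman, total_spend_usd,
                *, min_gain=0.01, budget_usd=8.0):
    """Return (stop, reason) for the active-learning loop.

    prev_spearman / curr_spearman: {axis: held-out Spearman} for two consecutive rounds.
    """
    # The budget cap is the hard ceiling, so it is checked first.
    if total_spend_usd > budget_usd:
        return True, "budget"
    gains = [curr_spearman[axis] - prev_spearman[axis] for axis in curr_spearman]
    # Assumption: plateau means no axis gained at least min_gain this round.
    if max(gains) < min_gain:
        return True, "plateau"
    return False, "continue"
```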
Diversity safeguard
Pure variance sampling can collapse onto one corner of the candidate space. Add a coverage stratification: at each round, the top-N uncertainty pool is bucketed by (request_type, band, constraint_config) and we sample proportionally so the labeled distribution stays representative.
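One way to read the safeguard is sketched below. It assumes "sample proportionally" means each bucket's quota is proportional to its share of the full unlabeled pool, and that the uncertainty ranking passed in is larger than the final batch. Both are assumptions, as are the function names and the candidate field names. Quotas use largest-remainder rounding so they sum to the batch size.

```python
from collections import Counter


def bucket_key(candidate):
    """Stratum of a candidate; the field names are hypothetical."""
    return (candidate["request_type"], candidate["band"], candidate["constraint_config"])


def stratified_batch(full_pool, ranked_uncertain, batch_size):
    """Pick up to batch_size candidates, most uncertain first, within per-bucket quotas.

    ranked_uncertain: candidates sorted most-uncertain first (a superset of the batch).
    """
    share = Counter(bucket_key(c) for c in full_pool)
    total = sum(share.values())
    exact = {key: batch_size * n / total for key, n in share.items()}
    quota = {key: int(value) for key, value in exact.items()}
    # Largest-remainder rounding: hand leftover slots to the biggest fractional parts.
    leftover = batch_size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda key: (-(exact[key] - quota[key]), key))
    for key in by_remainder[:leftover]:
        quota[key] += 1
    picked, taken = [], Counter()
    for candidate in ranked_uncertain:
        key = bucket_key(candidate)
        if taken[key] < quota.get(key, 0):
            picked.append(candidate)
            taken[key] += 1
    return picked
```

If a bucket has fewer uncertain candidates than its quota, this sketch returns a short batch rather than backfilling from other buckets. Whether to backfill is a design choice the text leaves open.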
Pipeline (training run walk-through)
A full v2 training run:
# Round 0: bootstrap from existing labels
uv run python <spike>/train_reranker_v2.py --round 0
# Round 1: generate fresh candidate pool, pick uncertain, label, retrain
uv run python <spike>/build_judging_set.py --no-judge # large unlabeled pool
uv run python <spike>/active_learning_select.py --batch 200
uv run python <spike>/llm_judge.py --jsonl outputs/active_round_1.jsonl
uv run python <spike>/train_reranker_v2.py --round 1
# Repeat for rounds 2..K until stopping criterion
Final artifact: <spike>/outputs/reranker_v2.pkl — a dict with 4 LightGBM Boosters, the MiniLM embedder reference, and the version metadata.
Per-axis weighting in production
Frontend (PHON-110) sends a request like:
{
  "spec": "spec1",
  "constraints": [...],
  "axis_weights": {"age_appropriate": 0.4, "coherence": 0.4, "grammaticality": 0.2, "naturalness": 0.0}
}
The generation server passes weights to rerank_with_axes. Equal weights (0.25 each) reproduce v1's mean-of-4 composition, now as the mean of four per-axis predictions. SLP-clinical defaults could prefer age_appropriate and coherence; child-language defaults could prefer naturalness and grammaticality. Per-band defaults can be hard-coded server-side and overridden per request.
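Server-side weight resolution might look like the sketch below. The band name and default values in `BAND_DEFAULTS` are placeholders and `resolve_weights` is a hypothetical helper. Three behaviors here are assumptions that the design does not specify: a request override replaces the band default wholesale, omitted axes count as 0.0, and weights are normalized to sum to 1 so the composite stays on the 1-5 scale.

```python
AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")
EQUAL_WEIGHTS = {axis: 0.25 for axis in AXES}

# Placeholder per-band defaults; the band name and numbers are illustrative only.
BAND_DEFAULTS = {
    "example_band": {"naturalness": 0.1, "grammaticality": 0.1,
                     "age_appropriate": 0.4, "coherence": 0.4},
}


def resolve_weights(band, request_weights=None):
    """Request override > per-band default > equal weights, normalized to sum to 1."""
    if request_weights is None:
        weights = dict(BAND_DEFAULTS.get(band, EQUAL_WEIGHTS))
    else:
        unknown = set(request_weights) - set(AXES)
        if unknown:
            raise ValueError(f"unknown axes: {sorted(unknown)}")
        # Assumption: axes omitted from the request get weight 0.
        weights = {axis: float(request_weights.get(axis, 0.0)) for axis in AXES}
    if any(w < 0 for w in weights.values()):
        raise ValueError("axis weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one axis weight must be positive")
    return {axis: w / total for axis, w in weights.items()}
```

Validating at the server boundary keeps malformed `axis_weights` payloads from silently producing an all-zero sort key.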
Scope
In scope:
- v2 trainer (<spike>/train_reranker_v2.py)
- 4 independent LightGBM models, MiniLM-L6-v2 features (same as v1)
- v2 predictor (<spike>/quality_axis_v2.py or replace quality_axis.py)
- v2 reranker (<spike>/reranker_v2.py with rerank_with_axes)
- Active-learning selector (<spike>/active_learning_select.py)
- Updated build_judging_set.py to optionally skip teacher labeling (--no-judge mode)
- Tests: per-axis Spearman regression, composite score determinism, active-learning batch selection
- Updated demo_quality_axis.py to surface per-axis breakdown
Out of scope:
- PHON-109 productionization (moving reranker to a package, server integration)
- PHON-110 frontend reframe (UI consumption of axis breakdown)
- Cross-axis correlation modeling (v2 keeps axes independent at the model level)
- Adaptive composite weighting (learned-from-data weights; defer to v3 if useful)
Migration plan
- Survives unchanged: llm_judge.py (already produces 4-axis ratings), the existing outputs/judged.jsonl (becomes the bootstrap training set).
- Gets retired: quality_axis._composite_target (the mean-of-4 reduction). The composite role moves to composite_score with caller-supplied weights.
- Gets rewritten: train_reranker.py → train_reranker_v2.py (4 separate model-training loops + per-axis Spearman reporting). quality_axis.py → quality_axis_v2.py (loads 4 models, predicts 4 scores).
- Tests: existing v1 reranker tests retired or updated. New tests cover per-axis prediction shape, composite scoring with custom weights, and active-learning batch determinism.
- Branch: continues feature/csp-iteration after PHON-113. No PR until PHON-109 productionization.
Risks
- Per-axis Spearman could be lower than the v1 composite Spearman. v1's mean-of-4 averages out noise; per-axis exposes the noise. Mitigation: use the v1 baseline as a sanity check — if any individual axis Spearman drops below ~0.5 after the active-learning rounds, the per-axis approach is failing for that axis. Could fall back to a 2-axis model (e.g., merge correlated axes) or accept the lower Spearman as the cost of granularity.
- Active-learning variance collapse. Pure uncertainty sampling can chase noise: the model is most uncertain about low-quality candidates that are hard to tell apart from other low-quality candidates. The diversity safeguard (stratify by request_type × band × constraint_config) mitigates this.
- Teacher cost overrun. v1 cost $1.58 for ~2K labels (~$0.0008/label), so 5× (~10K labels) is ~$8, the budget in the stopping criterion. If Sonnet pricing changes or labels run longer, the cost could double. Absolute ceiling is $20: stop and reassess if rounds get expensive.
- Distribution drift. Active-learning rounds sample from the new constraint-driven harness output. If the harness changes (e.g., post-PHON-109), v2 model may degrade. v3 would retrain on the production distribution; for now, v2 trains on the spike harness which is the closest proxy.
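The teacher-cost arithmetic in the risk list can be checked with a few lines. The constants come from the v1 figures quoted above ("~2K" is treated as exactly 2,000 for the estimate), and the helper names are hypothetical.

```python
V1_COST_USD = 1.58
V1_LABELS = 2_000  # "~2K" in the text, taken as exact for this estimate
V1_RATE = V1_COST_USD / V1_LABELS  # about $0.00079 per label


def projected_cost(n_labels, cost_per_label=V1_RATE):
    """Teacher spend in USD for n_labels at a flat per-label rate."""
    return n_labels * cost_per_label


def labels_affordable(budget_usd, cost_per_label=V1_RATE):
    """Whole labels that fit inside the budget at a flat per-label rate."""
    return int(budget_usd // cost_per_label)


print(f"10K labels at v1 rate:     ${projected_cost(10_000):.2f}")
print(f"10K labels at doubled rate: ${projected_cost(10_000, 2 * V1_RATE):.2f}")
print(f"labels within $8:           {labels_affordable(8.0)}")
```

At the v1 rate, 10K labels cost about $7.90. If the per-label rate doubled, the same label count would cost about $15.80, above the $8 budget but below the $20 ceiling.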
Open questions
- Should embedding features be re-extracted per round or cached? MiniLM embeddings are deterministic given input text. Cache by text hash to avoid recomputation. Implement as a sidecar disk cache (~10K entries × 384 dims × float32 = ~15MB).
- What's the correct held-out split for active-learning Spearman tracking? v1 used (verb, band) groups. v2 should keep that — splitting on (request_type, band, constraint_config) groups instead would over-fragment the held-out and add variance.
- Should we collect per-axis confidence intervals during teacher labeling? Sonnet 4.6 returns a single rating per axis with no built-in uncertainty estimate. Could ask for "low/medium/high confidence" alongside, but that adds prompt complexity. Defer; v2 treats each axis label as a noiseless training signal.
- Granularity beyond 1-5 ratings? Sonnet's 1-5 ordinal scale is coarse. Could request 1-10 or 0.0-1.0 continuous. v1 used 1-5 successfully; preserve.
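For the embedding-cache question, a sidecar cache keyed by text hash could be as small as the sketch below. The class name, the one-file-per-entry layout, and the injected `embed_fn` (standing in for the MiniLM encoder) are assumptions for illustration. Each entry stores 384 float32 values, i.e. 1,536 bytes on platforms where a C float is 4 bytes, which matches the ~15MB estimate for ~10K entries.

```python
import hashlib
from array import array
from pathlib import Path

EMBED_DIM = 384  # MiniLM-L6-v2 output size, per the estimate above


class EmbeddingCache:
    """Disk cache of embeddings keyed by the SHA-256 of the rendered text."""

    def __init__(self, root, embed_fn):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        # Hypothetical callable: text -> sequence of EMBED_DIM floats.
        self.embed_fn = embed_fn

    def _path(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.f32"

    def get(self, text):
        path = self._path(text)
        if path.exists():
            vec = array("f")
            with path.open("rb") as fh:
                vec.fromfile(fh, EMBED_DIM)  # cache hit: no model call
            return vec
        vec = array("f", self.embed_fn(text))
        if len(vec) != EMBED_DIM:
            raise ValueError(f"expected {EMBED_DIM} dims, got {len(vec)}")
        with path.open("wb") as fh:
            vec.tofile(fh)
        return vec
```

Because the key depends only on the text, the cache stays valid across active-learning rounds as long as the embedder itself does not change.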
Self-review
- [x] All decisions concrete: 4 LightGBM models, MiniLM-L6-v2 features, composite weighting, active-learning loop with stratified uncertainty + Spearman/budget stopping criterion.
- [x] No "TBD" / placeholder language.
- [x] Internal consistency: v1 already produces 4-axis labels; v2 just stops collapsing them. Active learning bootstraps from v1's existing data; retraining cost is incremental.
- [x] Scope is decomposed correctly: v2 stays in spike; productionization is PHON-109.
- [x] Ambiguity check: caller-supplied weights default to 0.25 each (recovers v1 behavior). The composite is always a weighted sum (not learned). Active learning has a hard $8 budget cap.