PHON-107 — Reranker v2 design

Goal

Replace the v1 single-scalar quality axis with 4 independent per-axis scorers (one each for naturalness, grammaticality, age_appropriate, coherence). Train on a 5× expanded dataset gathered via active-learning loops over the PHON-112/113 constraint-driven harness output. Caller-supplied axis weights compose into a single sort score; each candidate also exposes its per-axis predictions for downstream UI breakdown (PHON-110) and use-case-specific weighting.

Motivation

The v1 reranker (quality_axis.py, $1.58 one-shot, Spearman 0.633) collapses 4 distinct quality signals into one mean target. This loses information in two ways:

No actionable axis breakdown for the frontend. The PHON-110 reframe asks for top-K candidates with per-axis breakdown so users can see why a candidate ranks where it does. v1's scalar can't supply that.
No use-case-specific weighting. A clinical-SLP request may legitimately want age-appropriate and coherent over natural and grammatical (a slightly stilted but age-targeted sentence is often preferable). v1's fixed mean disallows this.

Secondary: v1 was trained once, on the v6/spike output. PHON-112 (single-sentence pivot) and PHON-113 (paragraph rebuild) shifted the candidate distribution substantially. The reranker should be retrained on output from the new path so the model learns what the productionized system actually emits, not what a retired research path produced.

5× training data is a means, not a target — active learning will tell us when the per-axis Spearmans plateau. ~$8 budget bounds the teacher cost.

Architecture

candidate (sentence or paragraph dict)
    │
    ├─ feature extraction
    │     ├─ tabular: ppmi_total, feature_distance, sonorant_diff, n_sentences,
    │     │           has_discourse_marker, candidate_length, … (existing v1 set)
    │     └─ embedding: MiniLM-L6-v2 of the rendered text (existing v1 set)
    │
    ├─ 4 independent LightGBM scorers
    │     ├─ naturalness_lgb       → 1-5 prediction
    │     ├─ grammaticality_lgb    → 1-5 prediction
    │     ├─ age_appropriate_lgb   → 1-5 prediction
    │     └─ coherence_lgb         → 1-5 prediction
    │
    └─ composite score
          composite = sum(weights[axis] * pred[axis] for axis in axes)
          weights default to {0.25, 0.25, 0.25, 0.25}; caller may override

The composite is the existing reranker's sort key, so v2 is a drop-in replacement. Per-axis predictions are NEW outputs available to downstream consumers.

API

# v2 reranker — replaces quality_axis.predict_quality
def predict_axes(
    candidate: dict,
    *,
    is_paragraph: bool,
    band: str,
) -> dict[str, float]:
    """Return per-axis predictions: {naturalness, grammaticality, age_appropriate, coherence}.
    Each value is a model prediction in roughly [1.0, 5.0]."""


def composite_score(
    axis_scores: dict[str, float],
    *,
    weights: dict[str, float] | None = None,
) -> float:
    """Return weighted sum of axis scores. Default weights = 0.25 each."""


def rerank_with_axes(
    candidates: list[dict],
    *,
    is_paragraph: bool,
    band: str,
    weights: dict[str, float] | None = None,
    top_k: int = 8,
) -> list[dict]:
    """Score each candidate's 4 axes, compute composite, sort desc, return top_k.

    Each returned candidate is annotated with:
        candidate["axis_scores"] = {naturalness, grammaticality, age_appropriate, coherence}
        candidate["composite_score"] = float
    """

The existing reranker.rerank() (with τ + MMR + sampling) keeps its existing role — it's the sampling layer over a scored candidate pool. PHON-107 replaces the scoring primitive, not the sampling logic.
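As a rough illustration of how the scoring primitive could compose, here is a minimal sketch of the weighted-sum composite and the sort-and-annotate step. It is not the planned implementation: the `predict` callable is a hypothetical stand-in for `predict_axes` (so the `is_paragraph` and `band` arguments are omitted), the function name `rerank_with_axes_sketch` is invented for this example, and treating axes missing from `weights` as weight 0.0 is an assumption.

```python
from __future__ import annotations

from typing import Callable

AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def composite_score(
    axis_scores: dict[str, float],
    *,
    weights: dict[str, float] | None = None,
) -> float:
    """Weighted sum of axis scores; equal weights (0.25 each) when none are given."""
    if weights is None:
        weights = {axis: 0.25 for axis in AXES}
    # Assumption: an axis absent from `weights` contributes nothing.
    return sum(weights.get(axis, 0.0) * axis_scores[axis] for axis in AXES)


def rerank_with_axes_sketch(
    candidates: list[dict],
    *,
    predict: Callable[[dict], dict[str, float]],  # hypothetical stand-in for predict_axes
    weights: dict[str, float] | None = None,
    top_k: int = 8,
) -> list[dict]:
    """Annotate each candidate with axis_scores and composite_score, sort desc, return top_k."""
    scored = []
    for cand in candidates:
        axis_scores = predict(cand)
        annotated = dict(cand)  # shallow copy so the caller's dict is not mutated
        annotated["axis_scores"] = axis_scores
        annotated["composite_score"] = composite_score(axis_scores, weights=weights)
        scored.append(annotated)
    scored.sort(key=lambda c: c["composite_score"], reverse=True)
    return scored[:top_k]


# Toy data: the "fake_scores" field exists only for this example.
demo_candidates = [
    {"id": "a", "fake_scores": {"naturalness": 4.0, "grammaticality": 4.0,
                                "age_appropriate": 2.0, "coherence": 2.0}},
    {"id": "b", "fake_scores": {"naturalness": 2.0, "grammaticality": 3.0,
                                "age_appropriate": 5.0, "coherence": 5.0}},
]


def fake_predict(cand: dict) -> dict[str, float]:
    return cand["fake_scores"]
```

With equal weights candidate `b` sorts first; a caller that weights only naturalness would see `a` first, which is the use-case-specific behavior the Motivation section asks for.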

Active learning protocol

Round 0 — bootstrap

Use existing v1 labeled data: ~2K candidates with 4-axis ratings already in outputs/judged.jsonl. Train 4 LightGBM models on this base. Compute baseline per-axis Spearman on a held-out (verb, band) split.
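One way to keep the held-out (verb, band) split stable as new labels arrive each round is to assign each group to a side by hashing its key, so a group never moves between train and held-out. The sketch below assumes each labeled row carries `verb` and `band` fields and uses a 20% held-out fraction; both the field names and the fraction are assumptions, not something the v1 code is known to do.

```python
import hashlib


def is_held_out(verb: str, band: str, *, holdout_fraction: float = 0.2) -> bool:
    """Deterministically map a (verb, band) group to the held-out side."""
    key = f"{verb}|{band}".encode("utf-8")
    # First 8 bytes of the digest, scaled into [0, 1).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return bucket < holdout_fraction


def split_by_group(rows: list, *, holdout_fraction: float = 0.2) -> tuple:
    """Split rows so that every (verb, band) group lands wholly on one side."""
    train, held_out = [], []
    for row in rows:
        if is_held_out(row["verb"], row["band"], holdout_fraction=holdout_fraction):
            held_out.append(row)
        else:
            train.append(row)
    return train, held_out
```

Because the assignment depends only on the group key, rows added in later rounds for an existing group join the same side, which keeps the per-round Spearman numbers comparable.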

Round N (N=1..K) — uncertainty-driven

1. Run build_judging_set.py to produce a large unlabeled candidate pool
   from PHON-112/113 harness (single-sentence + paragraph, all constraint configs)
2. For each unlabeled candidate, compute per-axis prediction VARIANCE
   (across the 4 axis models — high variance = axes disagree = informative)
3. Pick the top-N most uncertain candidates (here N is the batch size, 200 per round, not the round index)
4. Send to llm_judge (Sonnet 4.6) for 4-axis ratings
5. Add to training set, retrain 4 axis models
6. Recompute held-out Spearmans
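Steps 2 and 3 above can be sketched as follows. The score used here is the population variance of the four axis predictions for a single candidate, which is the "axes disagree" signal described in step 2. The sketch assumes each pool item already carries an `axis_scores` dict from the current models; the helper names are illustrative only.

```python
from statistics import pvariance

AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def axis_disagreement(axis_scores: dict) -> float:
    """Population variance of the four axis predictions for one candidate."""
    return pvariance([axis_scores[axis] for axis in AXES])


def select_uncertain(pool: list, *, batch_size: int = 200) -> list:
    """Return the batch_size candidates whose axis predictions disagree the most."""
    ranked = sorted(
        pool,
        key=lambda cand: axis_disagreement(cand["axis_scores"]),
        reverse=True,
    )
    return ranked[:batch_size]
```

Note that this measures disagreement between axes, not the uncertainty of any one axis model; a candidate that is confidently natural but confidently incoherent also scores high.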

Stopping criterion

Stop when EITHER:
- Per-axis Spearman improvements between rounds drop below 0.01, OR
- Total teacher spend exceeds $8 (~10K labels @ ~$0.0008/label, with the v1 cost as the unit rate).

Whichever fires first. The budget cap is the hard ceiling; the Spearman criterion is the soft "diminishing returns" signal.
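The two conditions can be expressed as a small check run after each round. This is a sketch under one reading of the criterion: the plateau condition fires only when every axis improved by less than 0.01 (the text does not say whether "any" or "all" axes is meant), and the budget condition is checked first because it is the hard ceiling. The function name and return shape are illustrative.

```python
def should_stop(
    prev_spearman: dict,
    curr_spearman: dict,
    *,
    spend_usd: float,
    min_gain: float = 0.01,
    budget_usd: float = 8.0,
) -> tuple:
    """Return (stop, reason) for the round that just finished."""
    # Hard ceiling: total teacher spend so far.
    if spend_usd > budget_usd:
        return True, "budget"
    # Soft signal (assumed reading): no axis gained at least min_gain.
    gains = [curr_spearman[axis] - prev_spearman[axis] for axis in curr_spearman]
    if max(gains) < min_gain:
        return True, "plateau"
    return False, "continue"
```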

Diversity safeguard

Pure variance sampling can collapse onto one corner of the candidate space. Add a coverage stratification: at each round, the top-N uncertainty pool is bucketed by (request_type, band, constraint_config) and we sample proportionally so the labeled distribution stays representative.
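A sketch of one possible reading of the stratification: split the batch across (request_type, band, constraint_config) strata in proportion to each stratum's share of the candidate pool, then take the most uncertain candidates within each stratum. The field names, the precomputed `uncertainty` value on each item, and the largest-remainder rounding are all assumptions made for this example.

```python
from collections import defaultdict


def stratified_select(pool: list, *, batch_size: int) -> list:
    """Pick batch_size items, proportionally per stratum, most uncertain first."""
    if not pool:
        return []
    buckets = defaultdict(list)
    for item in pool:
        key = (item["request_type"], item["band"], item["constraint_config"])
        buckets[key].append(item)

    batch_size = min(batch_size, len(pool))
    # Largest-remainder allocation: floor each proportional share, then hand
    # the leftover slots to the strata with the largest fractional parts.
    exact = {key: batch_size * len(items) / len(pool) for key, items in buckets.items()}
    quota = {key: int(share) for key, share in exact.items()}
    leftover = batch_size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda key: exact[key] - quota[key], reverse=True)
    for key in by_remainder[:leftover]:
        quota[key] += 1

    selected = []
    for key, items in buckets.items():
        ranked = sorted(items, key=lambda item: item["uncertainty"], reverse=True)
        selected.extend(ranked[: quota[key]])
    return selected
```

Allocating quotas before ranking means a stratum with uniformly low uncertainty still receives its share of labels, which is what keeps the labeled distribution representative.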

Pipeline (training run walk-through)

A full v2 training run:

# Round 0: bootstrap from existing labels
uv run python <spike>/train_reranker_v2.py --round 0

# Round 1: generate fresh candidate pool, pick uncertain, label, retrain
uv run python <spike>/build_judging_set.py --no-judge  # large unlabeled pool
uv run python <spike>/active_learning_select.py --batch 200
uv run python <spike>/llm_judge.py --jsonl outputs/active_round_1.jsonl
uv run python <spike>/train_reranker_v2.py --round 1

# Repeat for rounds 2..K until stopping criterion

Final artifact: <spike>/outputs/reranker_v2.pkl — a dict with 4 LightGBM Boosters, the MiniLM embedder reference, and the version metadata.

Per-axis weighting in production

Frontend (PHON-110) sends a request like:

{
  "spec": "spec1",
  "constraints": [...],
  "axis_weights": {"age_appropriate": 0.4, "coherence": 0.4, "grammaticality": 0.2, "naturalness": 0.0}
}

The generation server passes weights to rerank_with_axes. Equal weights (0.25 each) recover v1's mean-of-4 behavior. SLP-clinical defaults could prefer age + coherence; child-language defaults could prefer naturalness + grammaticality. Per-band defaults can be hard-coded server-side and overridden per request.
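The precedence described above (per-request override, then per-band default, then equal weights) could be resolved server-side along these lines. This is a sketch only: the `resolve_weights` helper, the `band_defaults` mapping and its example values, and the choice to reject unknown axes and negative weights are all assumptions rather than part of the planned API.

```python
AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")
EQUAL_WEIGHTS = {axis: 0.25 for axis in AXES}

# Hypothetical per-band defaults; the band name and values are illustrative.
EXAMPLE_BAND_DEFAULTS = {
    "clinical": {"age_appropriate": 0.4, "coherence": 0.4,
                 "grammaticality": 0.1, "naturalness": 0.1},
}


def resolve_weights(request: dict, band: str, band_defaults: dict) -> dict:
    """Pick request weights if present, else the band default, else equal weights."""
    weights = request.get("axis_weights") or band_defaults.get(band) or EQUAL_WEIGHTS
    unknown = set(weights) - set(AXES)
    if unknown:
        raise ValueError(f"unknown axes in axis_weights: {sorted(unknown)}")
    if any(value < 0 for value in weights.values()):
        raise ValueError("axis weights must be non-negative")
    # Assumption: an axis omitted from the chosen weights gets weight 0.0.
    return {axis: float(weights.get(axis, 0.0)) for axis in AXES}
```

The returned dict always has all four axes, so it can be passed straight through as the `weights` argument of `rerank_with_axes`.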

Scope

In scope:
- v2 trainer (<spike>/train_reranker_v2.py)
- 4 independent LightGBM models, MiniLM-L6-v2 features (same as v1)
- v2 predictor (<spike>/quality_axis_v2.py or replace quality_axis.py)
- v2 reranker (<spike>/reranker_v2.py with rerank_with_axes)
- Active-learning selector (<spike>/active_learning_select.py)
- Updated build_judging_set.py to optionally skip teacher labeling (--no-judge mode)
- Tests: per-axis Spearman regression, composite score determinism, active-learning batch selection
- Updated demo_quality_axis.py to surface per-axis breakdown

Out of scope:
- PHON-109 productionization (moving reranker to a package, server integration)
- PHON-110 frontend reframe (UI consumption of axis breakdown)
- Cross-axis correlation modeling (v2 keeps axes independent at the model level)
- Adaptive composite weighting (learned-from-data weights; defer to v3 if useful)

Migration plan

Survives unchanged: llm_judge.py (already produces 4-axis ratings), the existing outputs/judged.jsonl (becomes the bootstrap training set).
Gets retired: quality_axis._composite_target (the mean-of-4 reduction). The composite role moves to composite_score with caller-supplied weights.
Gets rewritten: train_reranker.py → train_reranker_v2.py (4 separate model-training loops + per-axis Spearman reporting). quality_axis.py → quality_axis_v2.py (loads 4 models, predicts 4 scores).
Tests: existing v1 reranker tests retired or updated. New tests cover per-axis prediction shape, composite scoring with custom weights, and active-learning batch determinism.
Branch: continues feature/csp-iteration after PHON-113. No PR until PHON-109 productionization.

Risks

Per-axis Spearman could be lower than the v1 composite Spearman. v1's mean-of-4 averages out noise; per-axis exposes the noise. Mitigation: use the v1 baseline as a sanity check. If any individual axis Spearman drops below ~0.5 after the active-learning rounds, the per-axis approach is failing for that axis. We could then fall back to a 2-axis model (e.g., merge correlated axes) or accept the lower Spearman as the cost of granularity.
Active-learning variance collapse. Pure uncertainty sampling can chase noise (the model is uncertain about low-quality candidates that are hard to tell apart from other low-quality candidates). The diversity safeguard (stratify by request_type × band × constraint_config) mitigates this.
Teacher cost overrun. v1 cost $1.58 for ~2K labels, or ~$0.0008/label, so 5× (~10K labels) comes to ~$8, the budget used by the stopping criterion. If Sonnet pricing changes or labels run longer, the cost could double. Absolute ceiling of $20: stop and reassess if rounds get expensive.
Distribution drift. Active-learning rounds sample from the new constraint-driven harness output. If the harness changes (e.g., post-PHON-109), v2 model may degrade. v3 would retrain on the production distribution; for now, v2 trains on the spike harness which is the closest proxy.

Open questions

Should embedding features be re-extracted per round or cached? MiniLM embeddings are deterministic given input text. Cache by text hash to avoid recomputation. Implement as a sidecar disk cache (~10K entries × 384 dims × float32 = ~15MB).
What's the correct held-out split for active-learning Spearman tracking? v1 used (verb, band) groups. v2 should keep that; splitting on (request_type, band, constraint_config) groups instead would over-fragment the held-out set and add variance.
Should we collect per-axis confidence intervals during teacher labeling? Sonnet 4.6 returns a single rating per axis, with no built-in uncertainty estimate. We could ask for "low/medium/high confidence" alongside, but that adds prompt complexity. Defer; v2 treats each axis label as a noiseless training signal.
Granularity beyond 1-5 ratings? Sonnet's 1-5 ordinal scale is coarse. Could request 1-10 or 0.0-1.0 continuous. v1 used 1-5 successfully; preserve.
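For the embedding-cache question above, a sidecar cache keyed by text hash could look like the sketch below. It stores float32 vectors in a SQLite file using only the standard library; the `EmbeddingCache` class, the table layout, and the `embed_fn` callable (a stand-in for whatever wraps MiniLM-L6-v2) are assumptions for illustration, not existing code.

```python
import hashlib
import sqlite3
from array import array
from typing import Callable, Sequence


class EmbeddingCache:
    """Disk cache mapping sha256(text) to a float32 embedding vector."""

    def __init__(self, path: str, embed_fn: Callable[[str], Sequence[float]]):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS emb (text_hash TEXT PRIMARY KEY, vec BLOB NOT NULL)"
        )
        self._embed_fn = embed_fn  # hypothetical wrapper around the real embedder

    def get(self, text: str) -> list:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        row = self._db.execute(
            "SELECT vec FROM emb WHERE text_hash = ?", (key,)
        ).fetchone()
        if row is not None:
            vec = array("f")  # "f" is single-precision float
            vec.frombytes(row[0])
            return vec.tolist()
        # Cache miss: embed once, store as float32 bytes, return the stored values.
        vec = array("f", self._embed_fn(text))
        self._db.execute(
            "INSERT INTO emb (text_hash, vec) VALUES (?, ?)", (key, vec.tobytes())
        )
        self._db.commit()
        return vec.tolist()


# Size estimate from the text: 10K entries x 384 dims x 4 bytes per float32.
ESTIMATED_CACHE_BYTES = 10_000 * 384 * 4
```

A single SQLite file avoids creating ~10K small files and survives across rounds, so each round only embeds text it has not seen before.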

Self-review

[x] All decisions concrete: 4 LightGBM models, MiniLM-L6-v2 features, composite weighting, active-learning loop with stratified uncertainty + Spearman/budget stopping criterion.
[x] No "TBD" / placeholder language.
[x] Internal consistency: v1 already produces 4-axis labels; v2 just stops collapsing them. Active learning bootstraps from v1's existing data; retraining cost is incremental.
[x] Scope is decomposed correctly: v2 stays in spike; productionization is PHON-109.
[x] Ambiguity check: caller-supplied weights default to 0.25 each (recovers v1 behavior). The composite is always a weighted sum (not learned). Active learning has a hard $8 budget cap.