PHON-107 — Reranker v2 Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Replace v1's mean-of-4 collapse with 4 independent LightGBM per-axis scorers over MiniLM-L6-v2 features. Active-learning rounds on the PHON-112/113 harness output expand the training data until per-axis Spearman plateaus (delta < 0.01 on all 4 axes) or teacher spend exceeds the $8 budget.

Architecture: 4 independent models, composite weighting at scoring time, uncertainty-driven active-learning rounds with diversity stratification.

Tech Stack: LightGBM (existing), sentence-transformers MiniLM-L6-v2 (existing), Anthropic SDK for Sonnet 4.6 teacher labeling (existing via llm_judge.py), Polars for batch operations.


File map

Files created (new):
- <spike>/train_reranker_v2.py: trains 4 LightGBM models, reports per-axis Spearman
- <spike>/quality_axis_v2.py: loads 4 models, returns per-axis predictions
- <spike>/reranker_v2.py: rerank_with_axes, sorts candidates by the composite of the 4 axis predictions
- <spike>/active_learning_select.py: uncertainty-driven candidate selector
- <spike>/embedding_cache.py: text-hash-keyed MiniLM cache (sidecar disk)
- <spike>/test_reranker_v2.py: tests for trainer, predictor, composite, active-learning

Files modified:
- <spike>/build_judging_set.py: add --no-judge mode (emit unlabeled pool)
- <spike>/demo_quality_axis.py: surface per-axis breakdown
- <spike>/llm_judge.py: accept active-learning batch input format

Where <spike> = /Users/jneumann/Repos/PhonoLex/packages/generation/research/2026-05-07-sentence-generation-paradigms/.

Tests run via cd /Users/jneumann/Repos/PhonoLex/packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v.


Task 1: per-axis target extractor

Files:
- Create: <spike>/test_reranker_v2.py
- Create: <spike>/train_reranker_v2.py (new home for the helper; v1's train_reranker.py is left untouched)

Replace v1's _composite_target with _per_axis_targets that returns dict[axis, float | None].

  • [ ] Step 1.1: Write failing test

Create <spike>/test_reranker_v2.py:

"""Tests for PHON-107 reranker v2."""
from __future__ import annotations

import sys
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).parent))


AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def test_per_axis_targets_returns_4_floats():
    from train_reranker_v2 import _per_axis_targets
    ratings = {"naturalness": 4, "grammaticality": 5, "age_appropriate": 3, "coherence": 4}
    targets = _per_axis_targets(ratings)
    assert set(targets.keys()) == set(AXES)
    for ax in AXES:
        assert isinstance(targets[ax], float)
        assert 1.0 <= targets[ax] <= 5.0


def test_per_axis_targets_partial_returns_none_for_missing():
    from train_reranker_v2 import _per_axis_targets
    ratings = {"naturalness": 4, "grammaticality": None}  # missing age_appropriate, coherence
    targets = _per_axis_targets(ratings)
    assert targets["naturalness"] == 4.0
    assert targets["grammaticality"] is None
    assert targets["age_appropriate"] is None
    assert targets["coherence"] is None
  • [ ] Step 1.2: Run, verify fail (ModuleNotFoundError on train_reranker_v2)

  • [ ] Step 1.3: Create train_reranker_v2.py with helper

Create <spike>/train_reranker_v2.py:

"""PHON-107: Reranker v2 trainer — 4 independent LightGBM models per axis.

Replaces v1's mean-of-4 collapse with per-axis scorers, trained on
existing llm_ratings labels. Active-learning rounds layer on top.
"""
from __future__ import annotations

import json
import pickle
from pathlib import Path

AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def _per_axis_targets(ratings: dict) -> dict[str, float | None]:
    """Return per-axis float targets from llm_ratings dict.

    None if the rating is missing, non-numeric, or outside the 1-5 scale."""
    out: dict[str, float | None] = {}
    for ax in AXES:
        v = ratings.get(ax)
        # bool is a subclass of int, so exclude it explicitly.
        is_number = isinstance(v, (int, float)) and not isinstance(v, bool)
        out[ax] = float(v) if is_number and 1 <= v <= 5 else None
    return out
  • [ ] Step 1.4: Run, verify pass (2 tests)
cd /Users/jneumann/Repos/PhonoLex/packages/generation && \
  uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_reranker_v2.py -v

Expected: 2 passed.

  • [ ] Step 1.5: Commit
cd /Users/jneumann/Repos/PhonoLex && \
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/train_reranker_v2.py \
        packages/generation/research/2026-05-07-sentence-generation-paradigms/test_reranker_v2.py
git commit -m "$(cat <<'EOF'
PHON-107: per-axis target extractor

_per_axis_targets returns {axis: float | None} from llm_ratings dict.
None on missing or non-numeric. Replaces v1's _composite_target.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 2: load_data + build_features for per-axis training

Files:
- Modify: <spike>/train_reranker_v2.py
- Modify: <spike>/test_reranker_v2.py

Reuse v1's tabular feature extraction + MiniLM embeddings, but produce per-axis target arrays instead of a single composite.

  • [ ] Step 2.1: Inspect v1's load_data + build_features
sed -n '116,175p' packages/generation/research/2026-05-07-sentence-generation-paradigms/train_reranker.py

Identify the existing helpers' shape — what fields they expect, what they return.

  • [ ] Step 2.2: Append failing tests
def test_load_data_v2_returns_per_axis_targets():
    from train_reranker_v2 import load_data
    candidates, axis_targets, groups = load_data()
    # axis_targets is now a dict, not a list of floats
    assert set(axis_targets.keys()) == set(AXES)
    for ax in AXES:
        assert len(axis_targets[ax]) == len(candidates)
        # Each target is float or None (None for partial labels)
        for t in axis_targets[ax]:
            assert t is None or isinstance(t, float)
  • [ ] Step 2.3: Implement load_data() and build_features() in v2 trainer
def load_data(judged_jsonl: Path | None = None) -> tuple[list[dict], dict[str, list[float | None]], list[tuple[str, str]]]:
    """Load judged candidates, return (candidates, per-axis-targets, group_keys).

    Each candidate's llm_ratings is split into 4 per-axis targets. None for missing.
    """
    if judged_jsonl is None:
        judged_jsonl = Path(__file__).parent / "outputs" / "judged.jsonl"

    candidates: list[dict] = []
    axis_targets: dict[str, list[float | None]] = {ax: [] for ax in AXES}
    groups: list[tuple[str, str]] = []

    for line in judged_jsonl.read_text().splitlines():
        if not line.strip():
            continue
        req = json.loads(line)
        verb = req.get("verb", "(constraint-driven)")
        band = req["band"]
        for c in req["candidates"]:
            ratings = c.get("llm_ratings", {})
            if not ratings:
                continue
            targets = _per_axis_targets(ratings)
            # Skip candidates with no usable rating
            if all(v is None for v in targets.values()):
                continue
            candidates.append(c)
            for ax in AXES:
                axis_targets[ax].append(targets[ax])
            groups.append((verb, band))

    return candidates, axis_targets, groups

For build_features, copy v1's structure (tabular + MiniLM), wrap into a function that returns numpy arrays. Keep backwards-compatible feature extraction so the v1 features can be reused.
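The shape of build_features described above could be sketched as follows. This is only an illustration under stated assumptions: the tabular field names are taken from the candidate dicts used in the tests (v1's real feature list may differ), missing fields default to 0.0, and the embed_fn parameter is a hypothetical hook added so tests can inject a stub instead of loading MiniLM.

```python
from __future__ import annotations

from typing import Callable, Sequence

import numpy as np

# Assumed tabular fields, taken from the candidate dicts in the tests;
# v1's actual feature list may be longer.
TABULAR_KEYS = ("ppmi_total", "feature_distance", "sonorant_diff")

EmbedFn = Callable[[Sequence[str]], np.ndarray]


def _default_embed(texts: Sequence[str]) -> np.ndarray:
    """Embed with MiniLM-L6-v2 (imported lazily so tests can inject a stub)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return np.asarray(model.encode(list(texts)), dtype=np.float32)


def build_features(candidates: list[dict], embed_fn: EmbedFn | None = None) -> np.ndarray:
    """Return an (N, len(TABULAR_KEYS) + embedding_dim) float32 matrix."""
    embed = embed_fn if embed_fn is not None else _default_embed
    # Missing or None tabular values fall back to 0.0 (an assumption).
    tabular = np.array(
        [[float(c.get(k) or 0.0) for k in TABULAR_KEYS] for c in candidates],
        dtype=np.float32,
    ).reshape(len(candidates), len(TABULAR_KEYS))
    embeddings = np.asarray(embed([c["sentence"] for c in candidates]), dtype=np.float32)
    return np.hstack([tabular, embeddings])
```

With a 384-dimensional embedder, a single candidate yields a (1, 387) row, which is the shape predict_axes relies on in Task 4.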

  • [ ] Step 2.4: Run, verify pass

  • [ ] Step 2.5: Commit


Task 3: train 4 LightGBM models (Round 0)

Files:
- Modify: <spike>/train_reranker_v2.py
- Modify: <spike>/test_reranker_v2.py

Per-axis training loop: for each axis, drop rows with None target, train LightGBM on (features, axis_target). Save 4 models + per-axis Spearman to disk.

  • [ ] Step 3.1: Append failing test
def test_train_per_axis_models_round_0(tmp_path):
    from train_reranker_v2 import train_round
    out_path = tmp_path / "reranker_v2.pkl"
    metrics = train_round(round_num=0, out_path=out_path)
    assert out_path.exists()
    # Metrics: per-axis Spearman + held-out R2 for each of 4 axes
    for ax in AXES:
        assert ax in metrics
        assert "spearman" in metrics[ax]
        assert "r2" in metrics[ax]
  • [ ] Step 3.2: Implement train_round
import lightgbm as lgb
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import r2_score

def train_round(*, round_num: int, out_path: Path | None = None) -> dict:
    """Train 4 axis models. Returns per-axis Spearman + R2 metrics."""
    candidates, axis_targets, groups = load_data()
    X = build_features(candidates)
    train_idx, test_idx = split_by_group(groups, test_frac=0.25)

    models: dict[str, lgb.Booster] = {}
    metrics: dict[str, dict[str, float]] = {}

    for ax in AXES:
        # Encode missing targets as NaN so they can be masked out.
        y = np.array([np.nan if v is None else v for v in axis_targets[ax]], dtype=float)
        valid = ~np.isnan(y)
        # train_idx / test_idx are boolean masks aligned with candidates.
        train_ax = train_idx & valid
        test_ax = test_idx & valid

        y_train, y_test = y[train_ax], y[test_ax]
        X_train, X_test = X[train_ax], X[test_ax]

        params = {"objective": "regression", "metric": "rmse", "verbose": -1}
        model = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=200)
        y_pred = model.predict(X_test)

        sp, _ = spearmanr(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        models[ax] = model
        metrics[ax] = {"spearman": float(sp), "r2": float(r2)}

    if out_path is None:
        out_path = Path(__file__).parent / "outputs" / "reranker_v2.pkl"

    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("wb") as f:
        pickle.dump({
            "version": 2,
            "round": round_num,
            "axes": AXES,
            "models": {ax: models[ax].model_to_string() for ax in AXES},
            "metrics": metrics,
        }, f)

    return metrics

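split_by_group reuses v1's helper if present (copy it into train_reranker_v2.py rather than importing it, since Task 12 deletes train_reranker.py); otherwise implement a random split of unique groups into train/test. Either way it must return two boolean masks of length len(groups), because train_round combines them with the per-axis validity mask via &.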
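If v1 has no such helper, a group-level split might look like the sketch below. The seed parameter and the rule that at least one group is always held out are assumptions of this sketch, not something v1 is known to do.

```python
from __future__ import annotations

import numpy as np


def split_by_group(
    groups: list[tuple[str, str]],
    test_frac: float = 0.25,
    seed: int = 0,
) -> tuple[np.ndarray, np.ndarray]:
    """Return (train_mask, test_mask) boolean arrays; no group spans both."""
    unique = sorted(set(groups))
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(unique))
    # Hold out at least one whole group (assumption of this sketch).
    n_test = max(1, round(test_frac * len(unique)))
    test_groups = {unique[i] for i in order[:n_test]}
    test_mask = np.array([g in test_groups for g in groups], dtype=bool)
    return ~test_mask, test_mask
```

Splitting on (verb, band) groups rather than on rows keeps candidates from the same request context out of both sides, so the held-out Spearman is not inflated by near-duplicates.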

  • [ ] Step 3.3: Run, verify pass

The test should produce real Spearmans. Sanity-check: each axis's Spearman should be in [0.3, 0.8] roughly, comparable to v1's 0.633 composite.

  • [ ] Step 3.4: Commit

Task 4: per-axis predictor

Files:
- Create: <spike>/quality_axis_v2.py
- Modify: <spike>/test_reranker_v2.py

predict_axes(candidate) returns 4 axis predictions. composite_score(axis_scores, weights) returns weighted sum.

  • [ ] Step 4.1: Append failing tests
def test_predict_axes_returns_4_floats():
    from quality_axis_v2 import predict_axes
    candidate = {
        "sentence": "The cat eats the bat.",
        "ppmi_total": 5.0,
        "feature_distance": 0.0,
        "sonorant_diff": 0.0,
    }
    pred = predict_axes(candidate, is_paragraph=False, band="fineweb_adult")
    assert set(pred.keys()) == set(AXES)
    for ax in AXES:
        assert isinstance(pred[ax], float)
        assert 0.0 < pred[ax] < 6.0  # plausible range for 1-5 ratings


def test_composite_score_weighted_sum():
    from quality_axis_v2 import composite_score
    axis = {"naturalness": 4.0, "grammaticality": 5.0, "age_appropriate": 3.0, "coherence": 4.0}
    # Default equal weights
    s = composite_score(axis)
    assert abs(s - 4.0) < 0.001  # mean of 4, 5, 3, 4 = 4.0
    # Custom weights
    s = composite_score(axis, weights={"naturalness": 1.0, "grammaticality": 0.0, "age_appropriate": 0.0, "coherence": 0.0})
    assert abs(s - 4.0) < 0.001  # only naturalness counts
    s = composite_score(axis, weights={"grammaticality": 1.0, "naturalness": 0.0, "age_appropriate": 0.0, "coherence": 0.0})
    assert abs(s - 5.0) < 0.001
  • [ ] Step 4.2: Implement

Create <spike>/quality_axis_v2.py:

"""PHON-107: per-axis quality predictor.

Loads 4 LightGBM models trained by train_reranker_v2 + the existing
MiniLM-L6-v2 embedder. Returns 4 axis predictions per candidate."""
from __future__ import annotations

import pickle
from pathlib import Path
from typing import Any

import lightgbm as lgb
import numpy as np

DEFAULT_MODEL_PATH = Path(__file__).parent / "outputs" / "reranker_v2.pkl"
AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")
DEFAULT_WEIGHTS = {ax: 0.25 for ax in AXES}


def _load_model(path: Path = DEFAULT_MODEL_PATH) -> dict:
    with path.open("rb") as f:
        return pickle.load(f)


_MODEL_CACHE: dict[Path, dict] = {}

def _cached_model(path: Path = DEFAULT_MODEL_PATH) -> dict:
    if path not in _MODEL_CACHE:
        m = _load_model(path)
        # Reconstitute LightGBM Boosters
        m["boosters"] = {
            ax: lgb.Booster(model_str=m["models"][ax]) for ax in m["axes"]
        }
        _MODEL_CACHE[path] = m
    return _MODEL_CACHE[path]


def predict_axes(
    candidate: dict,
    *,
    is_paragraph: bool,
    band: str,
    model_path: Path = DEFAULT_MODEL_PATH,
) -> dict[str, float]:
    """Predict 4 axis scores for a candidate."""
    from train_reranker_v2 import build_features
    m = _cached_model(model_path)
    X = build_features([candidate])  # 1xD feature row
    out: dict[str, float] = {}
    for ax in AXES:
        pred = m["boosters"][ax].predict(X)[0]
        out[ax] = float(pred)
    return out


def composite_score(
    axis_scores: dict[str, float],
    *,
    weights: dict[str, float] | None = None,
) -> float:
    """Weighted sum across axes. Default equal weights."""
    w = weights if weights is not None else DEFAULT_WEIGHTS
    return sum(w.get(ax, 0.0) * axis_scores.get(ax, 0.0) for ax in AXES)

The build_features import from train_reranker_v2 returns a 2D numpy array; passing a single candidate produces shape (1, D).

  • [ ] Step 4.3: Run, verify pass

  • [ ] Step 4.4: Commit


Task 5: rerank_with_axes

Files:
- Create: <spike>/reranker_v2.py
- Modify: <spike>/test_reranker_v2.py

Score each candidate's 4 axes, attach to candidate dict, sort by composite, return top_k.

  • [ ] Step 5.1: Append failing test
def test_rerank_with_axes_attaches_predictions():
    from reranker_v2 import rerank_with_axes
    candidates = [
        {"sentence": "The cat eats the bat.", "ppmi_total": 5.0, "feature_distance": 0.0, "sonorant_diff": 0.0},
        {"sentence": "The dog runs in the park.", "ppmi_total": 4.5, "feature_distance": 0.0, "sonorant_diff": 0.0},
    ]
    out = rerank_with_axes(candidates, is_paragraph=False, band="fineweb_adult", top_k=2)
    assert len(out) == 2
    for c in out:
        assert "axis_scores" in c
        assert set(c["axis_scores"].keys()) == set(AXES)
        assert "composite_score" in c
    # Sort: top result has highest composite
    assert out[0]["composite_score"] >= out[1]["composite_score"]
  • [ ] Step 5.2: Implement

Create <spike>/reranker_v2.py:

"""PHON-107: rerank_with_axes — sort by composite of 4 per-axis predictions."""
from __future__ import annotations

from quality_axis_v2 import predict_axes, composite_score, AXES


def rerank_with_axes(
    candidates: list[dict],
    *,
    is_paragraph: bool,
    band: str,
    weights: dict[str, float] | None = None,
    top_k: int = 8,
) -> list[dict]:
    """Score per-axis, attach to candidate, sort by composite, return top_k."""
    scored = []
    for c in candidates:
        axis_scores = predict_axes(c, is_paragraph=is_paragraph, band=band)
        composite = composite_score(axis_scores, weights=weights)
        scored.append({**c, "axis_scores": axis_scores, "composite_score": composite})
    scored.sort(key=lambda c: -c["composite_score"])
    return scored[:top_k]
  • [ ] Step 5.3: Run, verify pass

  • [ ] Step 5.4: Commit


Task 6: embedding cache

Files:
- Create: <spike>/embedding_cache.py
- Modify: <spike>/test_reranker_v2.py

Sidecar cache: text_hash → MiniLM-384 vector. Survives across rounds; halves embedding time after Round 0.

  • [ ] Step 6.1: Append failing test
def test_embedding_cache_round_trip(tmp_path):
    from embedding_cache import EmbeddingCache
    cache = EmbeddingCache(path=tmp_path / "embeddings.pkl")
    text = "The cat sat on the mat."
    assert cache.get(text) is None
    vec = [0.1] * 384  # mock
    cache.put(text, vec)
    assert cache.get(text) == vec
    cache.save()
    cache2 = EmbeddingCache(path=tmp_path / "embeddings.pkl")
    cache2.load()
    assert cache2.get(text) == vec
  • [ ] Step 6.2: Implement
import hashlib
import pickle
from pathlib import Path


class EmbeddingCache:
    def __init__(self, path: Path):
        self.path = path
        self._cache: dict[str, list[float]] = {}

    def _key(self, text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str):
        return self._cache.get(self._key(text))

    def put(self, text: str, vec):
        self._cache[self._key(text)] = list(vec) if not isinstance(vec, list) else vec

    def save(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("wb") as f:
            pickle.dump(self._cache, f)

    def load(self):
        if self.path.exists():
            with self.path.open("rb") as f:
                self._cache = pickle.load(f)

Wire build_features to consult the cache before invoking MiniLM.
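One possible way to do that wiring is a small helper that embeds only the cache misses. This is a sketch: embed_with_cache and its embed_fn parameter are hypothetical names, and the cache argument only needs the get/put methods that EmbeddingCache exposes.

```python
from __future__ import annotations

from typing import Callable, Protocol, Sequence

import numpy as np


class _Cache(Protocol):
    """Minimal interface this helper needs; EmbeddingCache satisfies it."""

    def get(self, text: str): ...

    def put(self, text: str, vec) -> None: ...


def embed_with_cache(
    texts: Sequence[str],
    cache: _Cache,
    embed_fn: Callable[[Sequence[str]], np.ndarray],
) -> np.ndarray:
    """Return one embedding row per text, calling embed_fn only on cache misses."""
    # Deduplicate misses while preserving order so each text is embedded once.
    misses = list(dict.fromkeys(t for t in texts if cache.get(t) is None))
    if misses:
        for text, vec in zip(misses, embed_fn(misses)):
            cache.put(text, vec)
    return np.asarray([cache.get(t) for t in texts], dtype=np.float32)
```

build_features would call this in place of the direct MiniLM call, and the round script would call cache.save() once at the end so later rounds start warm.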

  • [ ] Step 6.3: Run, verify pass

  • [ ] Step 6.4: Commit


Task 7: build_judging_set --no-judge

Files: - Modify: <spike>/build_judging_set.py

Add a flag that emits the request/candidate JSONL but skips the teacher-labeling step. Used to produce unlabeled candidate pools for active-learning rounds.

  • [ ] Step 7.1: Inspect main()
grep -n "argparse\|add_argument\|parse_args\|--judge\|llm_judge" packages/generation/research/2026-05-07-sentence-generation-paradigms/build_judging_set.py

Find the argparse section.

  • [ ] Step 7.2: Add --no-judge flag

In main():

parser.add_argument("--no-judge", action="store_true",
    help="Emit unlabeled candidates only (skip Sonnet teacher labeling). "
         "Used for active-learning batch generation.")
args = parser.parse_args()

# After requests are built:
if args.no_judge:
    # Emit raw judging_set.jsonl, skip llm_judge
    ...
else:
    # Existing path: build then judge
    ...

Adapt to existing code structure. The output file should still be outputs/judging_set.jsonl so downstream tools work; outputs/judged.jsonl only gets written when judging happens.

  • [ ] Step 7.3: Smoke test --no-judge
cd /Users/jneumann/Repos/PhonoLex/packages/generation/research/2026-05-07-sentence-generation-paradigms && \
  uv run python build_judging_set.py --dry-run --no-judge 2>&1 | tail -10

Expected: produces judging_set.jsonl without invoking Sonnet.

  • [ ] Step 7.4: Commit

Task 8: active-learning selector

Files:
- Create: <spike>/active_learning_select.py
- Modify: <spike>/test_reranker_v2.py

Given an unlabeled candidate pool + the Round-N model, score each candidate by the variance across its 4 predicted axis scores and pick the top-uncertainty candidates, stratified by (request_type, band, constraint_label).
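As a worked example of the uncertainty score, the snippet below computes it for the two candidates used in the test that follows. It assumes the score is the sample variance from the standard library's statistics.variance, which is what the implementation in Step 8.2 uses; note that this measures disagreement between axes and serves only as a proxy for model uncertainty.

```python
from statistics import variance

agree = {"naturalness": 4.0, "grammaticality": 4.0, "age_appropriate": 4.0, "coherence": 4.0}
disagree = {"naturalness": 2.0, "grammaticality": 5.0, "age_appropriate": 3.0, "coherence": 1.0}

# statistics.variance is the sample variance (divides by n - 1).
low = variance(list(agree.values()))
high = variance(list(disagree.values()))
print(f"agree={low:.3f} disagree={high:.3f}")  # agree=0.000 disagree=2.917
```

For the second candidate the mean is 2.75, the squared deviations sum to 8.75, and 8.75 / 3 gives roughly 2.917, so it is selected ahead of the first.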

  • [ ] Step 8.1: Append failing test
def test_active_learning_selector_picks_high_variance():
    """Candidates with high cross-axis variance should rank above low-variance ones."""
    from active_learning_select import select_uncertain_batch
    candidates = [
        # Low variance: all axes agree (low informativeness)
        {"sentence": "The cat eats.", "axis_scores": {"naturalness": 4.0, "grammaticality": 4.0, "age_appropriate": 4.0, "coherence": 4.0}},
        # High variance: axes disagree (high informativeness)
        {"sentence": "Cat very eats.", "axis_scores": {"naturalness": 2.0, "grammaticality": 5.0, "age_appropriate": 3.0, "coherence": 1.0}},
    ]
    selected = select_uncertain_batch(candidates, n=1)
    assert selected[0]["sentence"] == "Cat very eats."  # higher variance picked
  • [ ] Step 8.2: Implement
"""PHON-107: active-learning selector — pick uncertain candidates for teacher labeling."""
from __future__ import annotations

from collections import defaultdict
from statistics import variance


AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def select_uncertain_batch(
    candidates: list[dict],
    *,
    n: int = 200,
    stratify_keys: tuple[str, ...] = ("request_type", "band", "constraint_label"),
) -> list[dict]:
    """Select top-N candidates by per-axis variance, stratified by stratify_keys.

    Each candidate must have axis_scores already attached (from rerank_with_axes).
    """
    # Compute variance for each candidate
    for c in candidates:
        scores = list(c["axis_scores"].values())
        c["_uncertainty"] = variance(scores) if len(scores) > 1 else 0.0

    # Bucket by stratify keys
    buckets: dict[tuple, list[dict]] = defaultdict(list)
    for c in candidates:
        key = tuple(c.get(k, "") for k in stratify_keys)
        buckets[key].append(c)

    # Per-bucket: sort desc by uncertainty
    for k in buckets:
        buckets[k].sort(key=lambda c: -c["_uncertainty"])

    # Round-robin across buckets, take top per until n total
    selected: list[dict] = []
    bucket_keys = list(buckets.keys())
    bucket_idx = [0] * len(bucket_keys)
    while len(selected) < n:
        progress = False
        for i, k in enumerate(bucket_keys):
            if bucket_idx[i] < len(buckets[k]):
                selected.append(buckets[k][bucket_idx[i]])
                bucket_idx[i] += 1
                progress = True
                if len(selected) >= n:
                    break
        if not progress:
            break

    # Strip the temporary _uncertainty key before returning
    for c in selected:
        c.pop("_uncertainty", None)

    return selected
  • [ ] Step 8.3: Run, verify pass

  • [ ] Step 8.4: Commit


Task 9: Round 1 active-learning run

Files: - (run script — no new module)

Execute the first active-learning round end-to-end:
1. Generate unlabeled pool via build_judging_set.py --no-judge (large request count)
2. Score all candidates with the Round-0 model via rerank_with_axes
3. Select 200 most-uncertain via active_learning_select
4. Write to outputs/active_round_1.jsonl for teacher
5. Run llm_judge.py outputs/active_round_1.jsonl (Sonnet 4.6)
6. Merge round_1 labels into existing judged.jsonl
7. Retrain via train_reranker_v2.py --round 1
8. Compare per-axis Spearmans to Round 0

  • [ ] Step 9.1: Run pipeline
cd /Users/jneumann/Repos/PhonoLex/packages/generation/research/2026-05-07-sentence-generation-paradigms

# 1. Unlabeled pool (much larger than the eval harness — N_REQUESTS = 50+)
uv run python build_judging_set.py --no-judge --n-requests 50

# 2-4. Score + select + write outputs/active_round_1.jsonl
uv run python -c "
import json
from pathlib import Path
import pair_driven, reranker_v2
from active_learning_select import select_uncertain_batch

reqs = [json.loads(line) for line in Path('outputs/judging_set.jsonl').read_text().splitlines() if line.strip()]
pool = []
for req in reqs:
    band = req['band']
    for c in req['candidates']:
        # Keep request metadata in a separate dict so no helper key leaks into the candidate
        meta = {'request_type': req['request_type'], 'band': band, 'constraint_label': req.get('constraint_label', 'C0')}
        scored = reranker_v2.rerank_with_axes([c], is_paragraph=req['request_type']=='paragraph', band=band, top_k=1)
        pool.append({**scored[0], **meta})

selected = select_uncertain_batch(pool, n=200)
# Write back to active_round_1.jsonl in the request/candidates shape
# (group by request_id, embed selected candidates)
...
"

# 5. Run teacher
uv run python llm_judge.py --jsonl outputs/active_round_1.jsonl --output outputs/active_round_1_judged.jsonl

# 6. Merge labels
cat outputs/judged.jsonl outputs/active_round_1_judged.jsonl > outputs/judged_round_1.jsonl
mv outputs/judged_round_1.jsonl outputs/judged.jsonl

# 7-8. Retrain + report
uv run python train_reranker_v2.py --round 1
  • [ ] Step 9.2: Verify Spearman delta

The reported Round 1 metrics should show Spearman improvement on at least 2 of 4 axes vs Round 0. If improvement is < 0.01 across all axes, that's the stopping criterion firing — proceed to Task 12 verification with Round 0 model.

  • [ ] Step 9.3: Commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/
git commit -m "PHON-107: Round 1 active-learning labels + retrained models"

Task 10: Rounds 2..K (loop until plateau or budget)

Files: - Same as Task 9, repeated

Repeat Task 9 for Round 2, 3, … until:
- Per-axis Spearman delta < 0.01 across all 4 axes for a round, OR
- Total teacher spend tracked across rounds exceeds $8

  • [ ] Step 10.1: Track teacher cost

After each llm_judge.py run, log the dollar cost (Anthropic SDK reports input/output token usage). Maintain a running total in a small outputs/active_learning_log.json.
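The running total could be kept with a helper along these lines. This is a sketch under assumptions: log_round_cost is a hypothetical name, the log layout is invented for illustration, and the per-million-token prices are placeholders that should be replaced with the current rates for the teacher model. The token counts would come from the usage.input_tokens and usage.output_tokens fields on each Anthropic SDK response, summed over the round.

```python
from __future__ import annotations

import json
from pathlib import Path

# Placeholder prices in USD per million tokens (assumption; use current rates).
INPUT_USD_PER_MTOK = 3.0
OUTPUT_USD_PER_MTOK = 15.0


def log_round_cost(log_path: Path, round_num: int, input_tokens: int, output_tokens: int) -> float:
    """Append one round's teacher cost to the log and return the running total."""
    cost = (input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
    if log_path.exists():
        log = json.loads(log_path.read_text())
    else:
        log = {"rounds": [], "total_usd": 0.0}
    log["rounds"].append(
        {
            "round": round_num,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": cost,
        }
    )
    # Recompute the total from the per-round entries so the two never drift apart.
    log["total_usd"] = sum(r["cost_usd"] for r in log["rounds"])
    log_path.parent.mkdir(parents=True, exist_ok=True)
    log_path.write_text(json.dumps(log, indent=2))
    return log["total_usd"]
```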

  • [ ] Step 10.2: Stop when criterion fires

After each round, check both criteria. Print the stopping reason.
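Both criteria can be checked in one place, for example with the sketch below. stopping_reason is a hypothetical helper; it assumes the metrics dicts have the {axis: {"spearman": ...}} shape returned by train_round and that the thresholds are the 0.01 delta and $8 budget stated in this task.

```python
from __future__ import annotations

AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def stopping_reason(
    prev: dict[str, dict[str, float]],
    curr: dict[str, dict[str, float]],
    total_usd: float,
    *,
    min_delta: float = 0.01,
    budget_usd: float = 8.0,
) -> str | None:
    """Return why the loop should stop, or None to run another round."""
    if total_usd > budget_usd:
        return f"budget: ${total_usd:.2f} spent exceeds ${budget_usd:.2f}"
    deltas = {ax: curr[ax]["spearman"] - prev[ax]["spearman"] for ax in AXES}
    # Plateau only when every axis improved by less than min_delta.
    if all(d < min_delta for d in deltas.values()):
        return f"plateau: Spearman delta < {min_delta} on all axes"
    return None
```

The round script would print the returned string and exit the loop when it is not None.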

  • [ ] Step 10.3: Final commit

After the last round, commit the final model artifact + the active_learning_log.json.


Task 11: update demo_quality_axis to surface per-axis breakdown

Files: - Modify: <spike>/demo_quality_axis.py

Rewrite the demo to call rerank_with_axes and print per-axis scores per candidate. Useful for the next session of qualitative review.

  • [ ] Step 11.1: Read current demo
grep -n "predict_quality\|demo\|main\|composite" packages/generation/research/2026-05-07-sentence-generation-paradigms/demo_quality_axis.py
  • [ ] Step 11.2: Rewrite to call v2

Replace predict_quality(c) calls with rerank_with_axes(candidates, ...). Print per-axis breakdown like:

Top candidate: "The cat eats the bat."
  composite: 4.13
  naturalness: 4.5  grammaticality: 4.8  age_appropriate: 3.6  coherence: 3.5
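A formatter for that layout might look like the following sketch. format_breakdown is a hypothetical helper name; it assumes each candidate carries the axis_scores and composite_score keys that rerank_with_axes attaches.

```python
from __future__ import annotations

AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def format_breakdown(candidate: dict) -> str:
    """Render one candidate as a three-line per-axis breakdown."""
    axes = "  ".join(f"{ax}: {candidate['axis_scores'][ax]:.1f}" for ax in AXES)
    return (
        f'Top candidate: "{candidate["sentence"]}"\n'
        f"  composite: {candidate['composite_score']:.2f}\n"
        f"  {axes}"
    )
```

Axis scores are shown to one decimal and the composite to two, so the composite can differ slightly from the mean of the displayed axis values.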
  • [ ] Step 11.3: Smoke test
cd /Users/jneumann/Repos/PhonoLex/packages/generation/research/2026-05-07-sentence-generation-paradigms && \
  uv run python demo_quality_axis.py 2>&1 | tail -20

Expected: prints per-axis breakdown for top-K of a few sample requests.

  • [ ] Step 11.4: Commit

Task 12: final verification + retire v1

Files:
- Delete: <spike>/quality_axis.py (replaced by v2)
- Delete: <spike>/train_reranker.py (replaced by v2)
- Delete: <spike>/reranker.py if it's the v1 one (vs the rerank-with-mmr-sampling layer; check first)

  • [ ] Step 12.1: Identify retire candidates
grep -rn "quality_axis\.\|from quality_axis import\|train_reranker\.\|from train_reranker import" packages/generation/research/2026-05-07-sentence-generation-paradigms/

For each caller of v1 modules: update to import v2 equivalents.

  • [ ] Step 12.2: Replace v1 imports

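In each caller, change "from quality_axis import predict_quality" to "from quality_axis_v2 import predict_axes, composite_score". Update the caller logic accordingly: predict_axes returns a dict of 4 axis scores, so pass it through composite_score wherever a single number is needed.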

  • [ ] Step 12.3: Delete v1 files
rm packages/generation/research/2026-05-07-sentence-generation-paradigms/quality_axis.py
rm packages/generation/research/2026-05-07-sentence-generation-paradigms/train_reranker.py

If reranker.py contains the MMR/sampling layer (different from quality_axis), KEEP it — the MMR sampling layer is orthogonal.

  • [ ] Step 12.4: Run full spike suite
cd /Users/jneumann/Repos/PhonoLex/packages/generation && \
  uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v

Expected: all tests pass; no quality_axis or train_reranker (v1) imports remain.

  • [ ] Step 12.5: Smoke — full pipeline end-to-end
cd /Users/jneumann/Repos/PhonoLex/packages/generation/research/2026-05-07-sentence-generation-paradigms && \
  uv run python -c "
import paragraph_csp
import paradigm_3_csp
import reranker_v2
import polars as pl
from pathlib import Path
from phonolex_data.runtime.store import WordStore
from constraint_surface import MinpairConstraint

repo = Path('/Users/jneumann/Repos/PhonoLex')
store = WordStore.from_parquet(repo / 'data' / 'runtime' / 'words.parquet')
sel_df = pl.read_parquet(repo / 'data' / 'runtime' / 'selectional.parquet')
skeletons_df = pl.read_parquet(Path('outputs') / 'skeletons.parquet')

import pair_driven
spec_words = paradigm_3_csp.spec_lexicon(store, 'spec1')
candidates = pair_driven.solve(
    spec_words=spec_words, word_df=store.df, sel_df=sel_df,
    pairs_df=store.pairs_df, skeletons_df=skeletons_df,
    band='fineweb_adult',
    constraints=[MinpairConstraint(phoneme1='d', phoneme2='z', position='final')],
    top_k=10,
)
ranked = reranker_v2.rerank_with_axes(candidates, is_paragraph=False, band='fineweb_adult', top_k=3)
for c in ranked:
    print(f'{c[\"sentence\"]} | composite={c[\"composite_score\"]:.2f}')
    print(f'  axes: {c[\"axis_scores\"]}')
"

Expected: prints 3 sentences with composite + per-axis scores.

  • [ ] Step 12.6: Final commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/
git commit -m "$(cat <<'EOF'
PHON-107: retire v1 reranker; v2 is the canonical path

Removes quality_axis.py and train_reranker.py. All callers updated
to use quality_axis_v2 / train_reranker_v2 / reranker_v2.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Done

After Task 12 verification, PHON-107 (reranker v2) is complete. Stack continues toward PHON-109 productionization.

Follow-ups not in this plan:
- PHON-109: productionize; move v2 reranker into a package, integrate with /api/generate-single
- PHON-110: frontend reframe; surface per-axis breakdown + axis weight controls in the UI
- v3 reranker (deferred): adaptive composite weighting (learned from per-band defaults), cross-axis correlation modeling