PHON-107 — Reranker v2 Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Replace v1's mean-of-4 collapse with 4 independent LightGBM per-axis scorers over MiniLM-L6-v2 features. Active-learning loops on the PHON-112/113 harness output expand the training data until per-axis Spearman plateaus (delta < 0.01 on all 4 axes in a round) or the $8 teacher-labeling budget is spent.
Architecture: 4 independent models, composite weighting at scoring time, uncertainty-driven active-learning rounds with diversity stratification.
Tech Stack: LightGBM (existing), sentence-transformers MiniLM-L6-v2 (existing), Anthropic SDK for Sonnet 4.6 teacher labeling (existing via llm_judge.py), Polars for batch operations.
File map
Files created (new):
- <spike>/train_reranker_v2.py — trains 4 LightGBM models, reports per-axis Spearman
- <spike>/quality_axis_v2.py — loads 4 models, returns per-axis predictions
- <spike>/active_learning_select.py — uncertainty-driven candidate selector
- <spike>/embedding_cache.py — text-hash-keyed MiniLM cache (sidecar disk)
- <spike>/test_reranker_v2.py — tests for trainer, predictor, composite, active-learning
- <spike>/reranker_v2.py (rerank_with_axes: scores each candidate per axis, sorts by composite, returns top_k; created in Task 5)
Files modified:
- <spike>/build_judging_set.py — add --no-judge mode (emit unlabeled pool)
- <spike>/demo_quality_axis.py — surface per-axis breakdown
- <spike>/llm_judge.py — accept active-learning batch input format
Where <spike> = /Users/jneumann/Repos/PhonoLex/packages/generation/research/2026-05-07-sentence-generation-paradigms/.
Tests run via cd /Users/jneumann/Repos/PhonoLex/packages/generation && uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v.
Task 1: per-axis target extractor
Files:
- Create: <spike>/test_reranker_v2.py
- Create: <spike>/train_reranker_v2.py (new module that holds the helper, per Step 1.3; v1's train_reranker.py is left as-is)
Replace v1's _composite_target with _per_axis_targets that returns dict[axis, float | None].
- [ ] Step 1.1: Write failing test
Create <spike>/test_reranker_v2.py:
"""Tests for PHON-107 reranker v2."""
from __future__ import annotations
import sys
from pathlib import Path
import pytest
sys.path.insert(0, str(Path(__file__).parent))
AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")
def test_per_axis_targets_returns_4_floats():
    from train_reranker_v2 import _per_axis_targets
    ratings = {"naturalness": 4, "grammaticality": 5, "age_appropriate": 3, "coherence": 4}
    targets = _per_axis_targets(ratings)
    assert set(targets.keys()) == set(AXES)
    for ax in AXES:
        assert isinstance(targets[ax], float)
        assert 1.0 <= targets[ax] <= 5.0


def test_per_axis_targets_partial_returns_none_for_missing():
    from train_reranker_v2 import _per_axis_targets
    ratings = {"naturalness": 4, "grammaticality": None}  # missing age_appropriate, coherence
    targets = _per_axis_targets(ratings)
    assert targets["naturalness"] == 4.0
    assert targets["grammaticality"] is None
    assert targets["age_appropriate"] is None
    assert targets["coherence"] is None
- [ ] Step 1.2: Run, verify fail (ModuleNotFoundError on train_reranker_v2)
- [ ] Step 1.3: Create train_reranker_v2.py with helper
Create <spike>/train_reranker_v2.py:
"""PHON-107: Reranker v2 trainer — 4 independent LightGBM models per axis.
Replaces v1's mean-of-4 collapse with per-axis scorers, trained on
existing llm_ratings labels. Active-learning rounds layer on top.
"""
from __future__ import annotations
import json
import pickle
from pathlib import Path
AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")
def _per_axis_targets(ratings: dict) -> dict[str, float | None]:
    """Return per-axis float targets from llm_ratings dict.

    None if missing, non-numeric, or outside the 1-5 rating scale."""
    out: dict[str, float | None] = {}
    for ax in AXES:
        v = ratings.get(ax)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(v, (int, float)) and not isinstance(v, bool) and 1 <= v <= 5:
            out[ax] = float(v)
        else:
            out[ax] = None
    return out
- [ ] Step 1.4: Run, verify pass (2 tests)
cd /Users/jneumann/Repos/PhonoLex/packages/generation && \
uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/test_reranker_v2.py -v
Expected: 2 passed.
- [ ] Step 1.5: Commit
cd /Users/jneumann/Repos/PhonoLex && \
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/train_reranker_v2.py \
packages/generation/research/2026-05-07-sentence-generation-paradigms/test_reranker_v2.py
git commit -m "$(cat <<'EOF'
PHON-107: per-axis target extractor
_per_axis_targets returns {axis: float | None} from llm_ratings dict.
None on missing or non-numeric. Replaces v1's _composite_target.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 2: load_data + build_features for per-axis training
Files:
- Modify: <spike>/train_reranker_v2.py
- Modify: <spike>/test_reranker_v2.py
Reuse v1's tabular feature extraction + MiniLM embeddings, but produce per-axis target arrays instead of a single composite.
- [ ] Step 2.1: Inspect v1's load_data + build_features
sed -n '116,175p' packages/generation/research/2026-05-07-sentence-generation-paradigms/train_reranker.py
Identify the existing helpers' shape — what fields they expect, what they return.
- [ ] Step 2.2: Append failing tests
def test_load_data_v2_returns_per_axis_targets():
    from train_reranker_v2 import load_data
    candidates, axis_targets, groups = load_data()
    # axis_targets is now a dict, not a list of floats
    assert set(axis_targets.keys()) == set(AXES)
    for ax in AXES:
        assert len(axis_targets[ax]) == len(candidates)
        # Each target is float or None (None for partial labels)
        for t in axis_targets[ax]:
            assert t is None or isinstance(t, float)
- [ ] Step 2.3: Implement load_data() and build_features() in v2 trainer
def load_data(judged_jsonl: Path | None = None) -> tuple[list[dict], dict[str, list[float | None]], list[tuple[str, str]]]:
    """Load judged candidates, return (candidates, per-axis-targets, group_keys).

    Each candidate's llm_ratings is split into 4 per-axis targets. None for missing.
    """
    if judged_jsonl is None:
        judged_jsonl = Path(__file__).parent / "outputs" / "judged.jsonl"
    candidates: list[dict] = []
    axis_targets: dict[str, list[float | None]] = {ax: [] for ax in AXES}
    groups: list[tuple[str, str]] = []
    for line in judged_jsonl.read_text().splitlines():
        if not line.strip():
            continue
        req = json.loads(line)
        verb = req.get("verb", "(constraint-driven)")
        band = req["band"]
        for c in req["candidates"]:
            ratings = c.get("llm_ratings", {})
            if not ratings:
                continue
            targets = _per_axis_targets(ratings)
            # Skip candidates with no usable rating
            if all(v is None for v in targets.values()):
                continue
            candidates.append(c)
            for ax in AXES:
                axis_targets[ax].append(targets[ax])
            groups.append((verb, band))
    return candidates, axis_targets, groups
For build_features, copy v1's structure (tabular + MiniLM), wrap into a function that returns numpy arrays. Keep backwards-compatible feature extraction so the v1 features can be reused.
- [ ] Step 2.4: Run, verify pass
- [ ] Step 2.5: Commit
Task 3: train 4 LightGBM models (Round 0)
Files:
- Modify: <spike>/train_reranker_v2.py
- Modify: <spike>/test_reranker_v2.py
Per-axis training loop: for each axis, drop rows with None target, train LightGBM on (features, axis_target). Save 4 models + per-axis Spearman to disk.
- [ ] Step 3.1: Append failing test
def test_train_per_axis_models_round_0(tmp_path):
    from train_reranker_v2 import train_round
    out_path = tmp_path / "reranker_v2.pkl"
    metrics = train_round(round_num=0, out_path=out_path)
    assert out_path.exists()
    # Metrics: per-axis Spearman + held-out R2 for each of 4 axes
    for ax in AXES:
        assert ax in metrics
        assert "spearman" in metrics[ax]
        assert "r2" in metrics[ax]
- [ ] Step 3.2: Implement train_round
import lightgbm as lgb
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import r2_score
def train_round(*, round_num: int, out_path: Path | None = None) -> dict:
    """Train 4 axis models. Returns per-axis Spearman + R2 metrics."""
    candidates, axis_targets, groups = load_data()
    X = build_features(candidates)
    # split_by_group must return boolean masks aligned with candidates,
    # because they are combined with the validity mask via & below
    train_idx, test_idx = split_by_group(groups, test_frac=0.25)
    models: dict[str, lgb.Booster] = {}
    metrics: dict[str, dict[str, float]] = {}
    for ax in AXES:
        y = np.asarray(axis_targets[ax], dtype=object)
        # Mask out None
        valid = np.array([v is not None for v in y])
        train_ax = train_idx & valid
        test_ax = test_idx & valid
        y_train = np.array([y[i] for i in np.where(train_ax)[0]], dtype=float)
        y_test = np.array([y[i] for i in np.where(test_ax)[0]], dtype=float)
        X_train = X[train_ax]
        X_test = X[test_ax]
        params = {"objective": "regression", "metric": "rmse", "verbose": -1}
        model = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=200)
        y_pred = model.predict(X_test)
        sp, _ = spearmanr(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        models[ax] = model
        metrics[ax] = {"spearman": float(sp), "r2": float(r2)}
    if out_path is None:
        out_path = Path(__file__).parent / "outputs" / "reranker_v2.pkl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("wb") as f:
        pickle.dump({
            "version": 2,
            "round": round_num,
            "axes": AXES,
            # Boosters are stored as model strings and rebuilt at load time
            "models": {ax: models[ax].model_to_string() for ax in AXES},
            "metrics": metrics,
        }, f)
    return metrics
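split_by_group reuses v1's helper if present; otherwise implement it as a random split of unique groups into train/test. Either way it must return two boolean masks aligned with the candidate list, because train_round combines them with the per-axis validity mask via &.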
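If v1 has no reusable helper, a group-level split could look roughly like the sketch below. It is only an illustration: the `seed` parameter and the "at least one test group" rule are assumptions, not part of v1. The key property is that every (verb, band) group lands entirely in train or entirely in test, and that the return values are boolean masks, since train_round combines them with a validity mask using &.

```python
import random

import numpy as np


def split_by_group(groups, test_frac=0.25, seed=0):
    """Randomly assign whole groups to train or test.

    Returns (train_mask, test_mask): boolean arrays of length len(groups).
    `seed` is a hypothetical parameter added here for reproducibility.
    """
    unique = sorted(set(groups))      # fixed order before shuffling
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Assumption: always hold out at least one group
    n_test = max(1, round(len(unique) * test_frac))
    test_groups = set(unique[:n_test])
    test_mask = np.array([g in test_groups for g in groups], dtype=bool)
    return ~test_mask, test_mask
```

With very few unique groups the train side can end up empty, so a real implementation may want to guard against that before training.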
- [ ] Step 3.3: Run, verify pass
The test should produce real Spearmans. Sanity-check: each axis's Spearman should be in [0.3, 0.8] roughly, comparable to v1's 0.633 composite.
- [ ] Step 3.4: Commit
Task 4: per-axis predictor
Files:
- Create: <spike>/quality_axis_v2.py
- Modify: <spike>/test_reranker_v2.py
predict_axes(candidate) returns 4 axis predictions. composite_score(axis_scores, weights) returns weighted sum.
- [ ] Step 4.1: Append failing tests
def test_predict_axes_returns_4_floats():
    from quality_axis_v2 import predict_axes
    candidate = {
        "sentence": "The cat eats the bat.",
        "ppmi_total": 5.0,
        "feature_distance": 0.0,
        "sonorant_diff": 0.0,
    }
    pred = predict_axes(candidate, is_paragraph=False, band="fineweb_adult")
    assert set(pred.keys()) == set(AXES)
    for ax in AXES:
        assert isinstance(pred[ax], float)
        assert 0.0 < pred[ax] < 6.0  # plausible range for 1-5 ratings


def test_composite_score_weighted_sum():
    from quality_axis_v2 import composite_score
    axis = {"naturalness": 4.0, "grammaticality": 5.0, "age_appropriate": 3.0, "coherence": 4.0}
    # Default equal weights
    s = composite_score(axis)
    assert abs(s - 4.0) < 0.001  # mean of 4, 5, 3, 4 = 4.0
    # Custom weights
    s = composite_score(axis, weights={"naturalness": 1.0, "grammaticality": 0.0, "age_appropriate": 0.0, "coherence": 0.0})
    assert abs(s - 4.0) < 0.001  # only naturalness counts
    s = composite_score(axis, weights={"grammaticality": 1.0, "naturalness": 0.0, "age_appropriate": 0.0, "coherence": 0.0})
    assert abs(s - 5.0) < 0.001
- [ ] Step 4.2: Implement
Create <spike>/quality_axis_v2.py:
"""PHON-107: per-axis quality predictor.
Loads 4 LightGBM models trained by train_reranker_v2 + the existing
MiniLM-L6-v2 embedder. Returns 4 axis predictions per candidate."""
from __future__ import annotations
import pickle
from pathlib import Path
from typing import Any
import lightgbm as lgb
import numpy as np
DEFAULT_MODEL_PATH = Path(__file__).parent / "outputs" / "reranker_v2.pkl"
AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")
DEFAULT_WEIGHTS = {ax: 0.25 for ax in AXES}
def _load_model(path: Path = DEFAULT_MODEL_PATH) -> dict:
    with path.open("rb") as f:
        return pickle.load(f)


_MODEL_CACHE: dict[Path, dict] = {}


def _cached_model(path: Path = DEFAULT_MODEL_PATH) -> dict:
    if path not in _MODEL_CACHE:
        m = _load_model(path)
        # Reconstitute LightGBM Boosters
        m["boosters"] = {
            ax: lgb.Booster(model_str=m["models"][ax]) for ax in m["axes"]
        }
        _MODEL_CACHE[path] = m
    return _MODEL_CACHE[path]


def predict_axes(
    candidate: dict,
    *,
    is_paragraph: bool,
    band: str,
    model_path: Path = DEFAULT_MODEL_PATH,
) -> dict[str, float]:
    """Predict 4 axis scores for a candidate."""
    from train_reranker_v2 import build_features
    m = _cached_model(model_path)
    X = build_features([candidate])  # 1xD feature row
    out: dict[str, float] = {}
    for ax in AXES:
        pred = m["boosters"][ax].predict(X)[0]
        out[ax] = float(pred)
    return out


def composite_score(
    axis_scores: dict[str, float],
    *,
    weights: dict[str, float] | None = None,
) -> float:
    """Weighted sum across axes. Default equal weights."""
    w = weights if weights is not None else DEFAULT_WEIGHTS
    return sum(w.get(ax, 0.0) * axis_scores.get(ax, 0.0) for ax in AXES)
The build_features import from train_reranker_v2 returns a 2D numpy array; passing a single candidate produces shape (1, D).
- [ ] Step 4.3: Run, verify pass
- [ ] Step 4.4: Commit
Task 5: rerank_with_axes
Files:
- Create: <spike>/reranker_v2.py
- Modify: <spike>/test_reranker_v2.py
Score each candidate's 4 axes, attach to candidate dict, sort by composite, return top_k.
- [ ] Step 5.1: Append failing test
def test_rerank_with_axes_attaches_predictions():
    from reranker_v2 import rerank_with_axes
    candidates = [
        {"sentence": "The cat eats the bat.", "ppmi_total": 5.0, "feature_distance": 0.0, "sonorant_diff": 0.0},
        {"sentence": "The dog runs in the park.", "ppmi_total": 4.5, "feature_distance": 0.0, "sonorant_diff": 0.0},
    ]
    out = rerank_with_axes(candidates, is_paragraph=False, band="fineweb_adult", top_k=2)
    assert len(out) == 2
    for c in out:
        assert "axis_scores" in c
        assert set(c["axis_scores"].keys()) == set(AXES)
        assert "composite_score" in c
    # Sort: top result has highest composite
    assert out[0]["composite_score"] >= out[1]["composite_score"]
- [ ] Step 5.2: Implement
Create <spike>/reranker_v2.py:
"""PHON-107: rerank_with_axes — sort by composite of 4 per-axis predictions."""
from __future__ import annotations
from quality_axis_v2 import predict_axes, composite_score, AXES
def rerank_with_axes(
    candidates: list[dict],
    *,
    is_paragraph: bool,
    band: str,
    weights: dict[str, float] | None = None,
    top_k: int = 8,
) -> list[dict]:
    """Score per-axis, attach to candidate, sort by composite, return top_k."""
    scored = []
    for c in candidates:
        axis_scores = predict_axes(c, is_paragraph=is_paragraph, band=band)
        composite = composite_score(axis_scores, weights=weights)
        scored.append({**c, "axis_scores": axis_scores, "composite_score": composite})
    scored.sort(key=lambda c: -c["composite_score"])
    return scored[:top_k]
- [ ] Step 5.3: Run, verify pass
- [ ] Step 5.4: Commit
Task 6: embedding cache
Files:
- Create: <spike>/embedding_cache.py
- Modify: <spike>/test_reranker_v2.py
Sidecar cache: text_hash → MiniLM-384 vector. Survives across rounds; halves embedding time after Round 0.
- [ ] Step 6.1: Append failing test
def test_embedding_cache_round_trip(tmp_path):
    from embedding_cache import EmbeddingCache
    cache = EmbeddingCache(path=tmp_path / "embeddings.pkl")
    text = "The cat sat on the mat."
    assert cache.get(text) is None
    vec = [0.1] * 384  # mock
    cache.put(text, vec)
    assert cache.get(text) == vec
    cache.save()
    cache2 = EmbeddingCache(path=tmp_path / "embeddings.pkl")
    cache2.load()
    assert cache2.get(text) == vec
- [ ] Step 6.2: Implement
import hashlib
import pickle
from pathlib import Path
class EmbeddingCache:
    def __init__(self, path: Path):
        self.path = path
        self._cache: dict[str, list[float]] = {}

    def _key(self, text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str):
        return self._cache.get(self._key(text))

    def put(self, text: str, vec):
        self._cache[self._key(text)] = list(vec) if not isinstance(vec, list) else vec

    def save(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("wb") as f:
            pickle.dump(self._cache, f)

    def load(self):
        if self.path.exists():
            with self.path.open("rb") as f:
                self._cache = pickle.load(f)
Wire build_features to consult the cache before invoking MiniLM.
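One way the wiring might look is sketched below, under assumptions: `embed_with_cache` and its `embed_fn` parameter are hypothetical names, and `cache` is any object with the `get(text)` / `put(text, vec)` methods shown above. Only cache misses are sent to the embedder, and duplicates within a batch are embedded once.

```python
from __future__ import annotations

from typing import Callable, Sequence

import numpy as np


def embed_with_cache(
    texts: Sequence[str],
    cache,  # any object exposing get(text) and put(text, vec)
    embed_fn: Callable[[list[str]], Sequence[Sequence[float]]],
) -> np.ndarray:
    """Return one embedding row per text, computing only the cache misses."""
    # dict.fromkeys deduplicates while preserving first-seen order
    misses = [t for t in dict.fromkeys(texts) if cache.get(t) is None]
    if misses:
        for text, vec in zip(misses, embed_fn(misses)):
            cache.put(text, vec)
    # Every text is now cached, so this lookup cannot return None
    return np.asarray([cache.get(t) for t in texts], dtype=np.float32)
```

In build_features, `embed_fn` would presumably be the `encode` method of the existing sentence-transformers MiniLM model, with `cache.load()` called once before the first batch and `cache.save()` after the last.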
- [ ] Step 6.3: Run, verify pass
- [ ] Step 6.4: Commit
Task 7: build_judging_set --no-judge
Files:
- Modify: <spike>/build_judging_set.py
Add a flag that emits the request/candidate JSONL but skips the teacher-labeling step. Used to produce unlabeled candidate pools for active-learning rounds.
- [ ] Step 7.1: Inspect main()
grep -n "argparse\|add_argument\|parse_args\|--judge\|llm_judge" packages/generation/research/2026-05-07-sentence-generation-paradigms/build_judging_set.py
Find the argparse section.
- [ ] Step 7.2: Add --no-judge flag
In main():
parser.add_argument("--no-judge", action="store_true",
                    help="Emit unlabeled candidates only (skip Sonnet teacher labeling). "
                         "Used for active-learning batch generation.")
args = parser.parse_args()
# After requests are built:
if args.no_judge:
    # Emit raw judging_set.jsonl, skip llm_judge
    ...
else:
    # Existing path: build then judge
    ...
Adapt to existing code structure. The output file should still be outputs/judging_set.jsonl so downstream tools work; outputs/judged.jsonl only gets written when judging happens.
- [ ] Step 7.3: Smoke test --no-judge
cd /Users/jneumann/Repos/PhonoLex/packages/generation/research/2026-05-07-sentence-generation-paradigms && \
uv run python build_judging_set.py --dry-run --no-judge 2>&1 | tail -10
Expected: produces judging_set.jsonl without invoking Sonnet.
- [ ] Step 7.4: Commit
Task 8: active-learning selector
Files:
- Create: <spike>/active_learning_select.py
- Modify: <spike>/test_reranker_v2.py
Given an unlabeled candidate pool + the Round-N model, score each candidate by the variance across its 4 predicted axis scores (used here as the uncertainty signal) and pick the highest-variance candidates, stratified by (request_type, band, constraint_label).
- [ ] Step 8.1: Append failing test
def test_active_learning_selector_picks_high_variance():
    """Candidates with high cross-axis variance should rank above low-variance ones."""
    from active_learning_select import select_uncertain_batch
    candidates = [
        # Low variance: all axes agree (low informativeness)
        {"sentence": "The cat eats.", "axis_scores": {"naturalness": 4.0, "grammaticality": 4.0, "age_appropriate": 4.0, "coherence": 4.0}},
        # High variance: axes disagree (high informativeness)
        {"sentence": "Cat very eats.", "axis_scores": {"naturalness": 2.0, "grammaticality": 5.0, "age_appropriate": 3.0, "coherence": 1.0}},
    ]
    selected = select_uncertain_batch(candidates, n=1)
    assert selected[0]["sentence"] == "Cat very eats."  # higher variance picked
- [ ] Step 8.2: Implement
"""PHON-107: active-learning selector — pick uncertain candidates for teacher labeling."""
from __future__ import annotations
from collections import defaultdict
from statistics import variance
AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")
def select_uncertain_batch(
    candidates: list[dict],
    *,
    n: int = 200,
    stratify_keys: tuple[str, ...] = ("request_type", "band", "constraint_label"),
) -> list[dict]:
    """Select top-N candidates by per-axis variance, stratified by stratify_keys.

    Each candidate must have axis_scores already attached (from rerank_with_axes).
    """
    # Compute variance for each candidate
    for c in candidates:
        scores = list(c["axis_scores"].values())
        c["_uncertainty"] = variance(scores) if len(scores) > 1 else 0.0
    # Bucket by stratify keys
    buckets: dict[tuple, list[dict]] = defaultdict(list)
    for c in candidates:
        key = tuple(c.get(k, "") for k in stratify_keys)
        buckets[key].append(c)
    # Per-bucket: sort desc by uncertainty
    for k in buckets:
        buckets[k].sort(key=lambda c: -c["_uncertainty"])
    # Round-robin across buckets, take top per until n total
    selected: list[dict] = []
    bucket_keys = list(buckets.keys())
    bucket_idx = [0] * len(bucket_keys)
    while len(selected) < n:
        progress = False
        for i, k in enumerate(bucket_keys):
            if bucket_idx[i] < len(buckets[k]):
                selected.append(buckets[k][bucket_idx[i]])
                bucket_idx[i] += 1
                progress = True
                if len(selected) >= n:
                    break
        if not progress:
            break
    # Strip the temporary _uncertainty key before returning
    for c in selected:
        c.pop("_uncertainty", None)
    return selected
- [ ] Step 8.3: Run, verify pass
- [ ] Step 8.4: Commit
Task 9: Round 1 active-learning run
Files:
- (run script — no new module)
Execute the first active-learning round end-to-end:
1. Generate unlabeled pool via build_judging_set.py --no-judge (large request count)
2. Score all candidates with the Round-0 model via rerank_with_axes
3. Select 200 most-uncertain via active_learning_select
4. Write to outputs/active_round_1.jsonl for teacher
5. Run llm_judge.py outputs/active_round_1.jsonl (Sonnet 4.6)
6. Merge round_1 labels into existing judged.jsonl
7. Retrain via train_reranker_v2.py --round 1
8. Compare per-axis Spearmans to Round 0
- [ ] Step 9.1: Run pipeline
cd /Users/jneumann/Repos/PhonoLex/packages/generation/research/2026-05-07-sentence-generation-paradigms
# 1. Unlabeled pool (much larger than the eval harness — N_REQUESTS = 50+)
uv run python build_judging_set.py --no-judge --n-requests 50
# 2-3. Score + select
uv run python -c "
import json
from pathlib import Path
import pair_driven, reranker_v2
from active_learning_select import select_uncertain_batch
reqs = [json.loads(line) for line in Path('outputs/judging_set.jsonl').read_text().splitlines() if line.strip()]
pool = []
for req in reqs:
    band = req['band']
    for c in req['candidates']:
        c['_request_meta'] = {'request_type': req['request_type'], 'band': band, 'constraint_label': req.get('constraint_label', 'C0')}
        scored = reranker_v2.rerank_with_axes([c], is_paragraph=req['request_type']=='paragraph', band=band, top_k=1)
        pool.append({**scored[0], **c['_request_meta']})
selected = select_uncertain_batch(pool, n=200)
# Write back to active_round_1.jsonl in the request/candidates shape
# (group by request_id, embed selected candidates)
...
"
# 5. Run teacher
uv run python llm_judge.py --jsonl outputs/active_round_1.jsonl --output outputs/active_round_1_judged.jsonl
# 6. Merge labels
cat outputs/judged.jsonl outputs/active_round_1_judged.jsonl > outputs/judged_round_1.jsonl
mv outputs/judged_round_1.jsonl outputs/judged.jsonl
# 7-8. Retrain + report
uv run python train_reranker_v2.py --round 1
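The write-back step that the scoring snippet leaves elided (step 4 in the list above) might be sketched as follows. This is an assumption-heavy illustration: `write_active_round` is a hypothetical helper, and it presumes each selected candidate also carries a `request_id` copied from its parent request, which the snippet above does not yet attach. The output keeps the request/candidates shape, including `band`, because load_data reads `req["band"]` and `req["candidates"]` from each line.

```python
from __future__ import annotations

import json
from pathlib import Path


def write_active_round(selected: list[dict], out_path: Path) -> int:
    """Group selected candidates by request and write one JSON object per line.

    Assumes (hypothetically) that every candidate has `request_id`,
    `request_type`, and `band` fields copied from its parent request.
    Returns the number of request lines written.
    """
    by_request: dict[str, dict] = {}
    for cand in selected:
        rid = cand["request_id"]
        # First candidate of a request creates the request record
        req = by_request.setdefault(rid, {
            "request_id": rid,
            "request_type": cand.get("request_type"),
            "band": cand.get("band"),
            "candidates": [],
        })
        req["candidates"].append(cand)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for req in by_request.values():
            f.write(json.dumps(req) + "\n")
    return len(by_request)
```

The real field names should follow whatever build_judging_set.py emits and llm_judge.py expects.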
- [ ] Step 9.2: Verify Spearman delta
The reported Round 1 metrics should show Spearman improvement on at least 2 of 4 axes vs Round 0. If improvement is < 0.01 across all axes, that's the stopping criterion firing — proceed to Task 12 verification with Round 0 model.
- [ ] Step 9.3: Commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/
git commit -m "PHON-107: Round 1 active-learning labels + retrained models"
Task 10: Rounds 2..K (loop until plateau or budget)
Files:
- Same as Task 9, repeated
Repeat Task 9 for Round 2, 3, … until:
- Per-axis Spearman delta < 0.01 across all 4 axes for a round, OR
- Total teacher spend tracked across rounds exceeds $8
- [ ] Step 10.1: Track teacher cost
After each llm_judge.py run, log the dollar cost (Anthropic SDK reports input/output token usage). Maintain a running total in a small outputs/active_learning_log.json.
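A minimal sketch of such a log is shown below. The function name, the JSON layout, and the per-million-token price parameters are all assumptions; prices are passed in rather than hard-coded so that the current rate for the teacher model can be supplied by the caller. The token counts would come from the `usage.input_tokens` and `usage.output_tokens` fields of the Anthropic SDK responses, summed over the run.

```python
from __future__ import annotations

import json
from pathlib import Path


def log_round_cost(
    log_path: Path,
    *,
    round_num: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_mtok_in: float,   # caller-supplied price per 1M input tokens
    usd_per_mtok_out: float,  # caller-supplied price per 1M output tokens
) -> float:
    """Append one round's teacher cost to the log; return the running total in USD."""
    log = {"rounds": [], "total_usd": 0.0}
    if log_path.exists():
        log = json.loads(log_path.read_text())
    cost = (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000
    log["rounds"].append({
        "round": round_num,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "usd": round(cost, 6),
    })
    # Recompute the total from the per-round entries to avoid drift
    log["total_usd"] = round(sum(r["usd"] for r in log["rounds"]), 6)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    log_path.write_text(json.dumps(log, indent=2))
    return log["total_usd"]
```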
- [ ] Step 10.2: Stop when criterion fires
After each round, check both criteria. Print the stopping reason.
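The two criteria could be checked with a small helper along these lines. It is a sketch: `stopping_reason` is a hypothetical name, and it assumes the metrics dicts have the shape returned by train_round (`{axis: {"spearman": ..., "r2": ...}}`) and that the running spend comes from the active-learning log.

```python
from __future__ import annotations

AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def stopping_reason(
    prev_metrics: dict[str, dict[str, float]],
    cur_metrics: dict[str, dict[str, float]],
    total_usd: float,
    *,
    min_delta: float = 0.01,
    budget_usd: float = 8.0,
) -> str | None:
    """Return a human-readable stopping reason, or None to keep going."""
    if total_usd > budget_usd:
        return f"budget: spent ${total_usd:.2f} > ${budget_usd:.2f}"
    deltas = {
        ax: cur_metrics[ax]["spearman"] - prev_metrics[ax]["spearman"]
        for ax in AXES
    }
    # Plateau only when every axis improved by less than min_delta
    if all(d < min_delta for d in deltas.values()):
        worst = max(deltas.values())
        return f"plateau: best per-axis Spearman delta {worst:.4f} < {min_delta}"
    return None
```

Checking the budget first means an over-budget round is reported as such even if it also plateaued; swap the order if the other reason is more useful to see.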
- [ ] Step 10.3: Final commit
After the last round, commit the final model artifact + the active_learning_log.json.
Task 11: update demo_quality_axis to surface per-axis breakdown
Files:
- Modify: <spike>/demo_quality_axis.py
Rewrite the demo to call rerank_with_axes and print per-axis scores per candidate. Useful for the next session of qualitative review.
- [ ] Step 11.1: Read current demo
grep -n "predict_quality\|demo\|main\|composite" packages/generation/research/2026-05-07-sentence-generation-paradigms/demo_quality_axis.py
- [ ] Step 11.2: Rewrite to call v2
Replace predict_quality(c) calls with rerank_with_axes(candidates, ...). Print per-axis breakdown like:
Top candidate: "The cat eats the bat."
composite: 4.13
naturalness: 4.5 grammaticality: 4.8 age_appropriate: 3.6 coherence: 3.5
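A formatting helper that produces output in roughly this layout might look like the following. `format_breakdown` is a hypothetical name, and the sketch assumes the candidate dict already carries the `axis_scores` and `composite_score` keys attached by rerank_with_axes.

```python
AXES = ("naturalness", "grammaticality", "age_appropriate", "coherence")


def format_breakdown(candidate: dict, label: str = "Top candidate") -> str:
    """Render one candidate's composite and per-axis scores as three lines."""
    axis_part = "  ".join(
        f'{ax}: {candidate["axis_scores"][ax]:.1f}' for ax in AXES
    )
    return "\n".join([
        f'{label}: "{candidate["sentence"]}"',
        f'  composite: {candidate["composite_score"]:.2f}',
        f"  {axis_part}",
    ])
```

In the demo this would be called once per ranked candidate, for example `print(format_breakdown(c))` inside the loop over the rerank_with_axes result.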
- [ ] Step 11.3: Smoke test
cd /Users/jneumann/Repos/PhonoLex/packages/generation/research/2026-05-07-sentence-generation-paradigms && \
uv run python demo_quality_axis.py 2>&1 | tail -20
Expected: prints per-axis breakdown for top-K of a few sample requests.
- [ ] Step 11.4: Commit
Task 12: final verification + retire v1
Files:
- Delete: <spike>/quality_axis.py (replaced by v2)
- Delete: <spike>/train_reranker.py (replaced by v2)
- Delete: <spike>/reranker.py if it's the v1 one (vs the rerank-with-mmr-sampling layer — check)
- [ ] Step 12.1: Identify retire candidates
grep -rn "quality_axis\.\|from quality_axis import\|train_reranker\.\|from train_reranker import" packages/generation/research/2026-05-07-sentence-generation-paradigms/
For each caller of v1 modules: update to import v2 equivalents.
- [ ] Step 12.2: Replace v1 imports
In each caller, change from quality_axis import predict_quality → from quality_axis_v2 import predict_axes, composite_score. Update the caller logic accordingly.
- [ ] Step 12.3: Delete v1 files
rm packages/generation/research/2026-05-07-sentence-generation-paradigms/quality_axis.py
rm packages/generation/research/2026-05-07-sentence-generation-paradigms/train_reranker.py
If reranker.py contains the MMR/sampling layer (different from quality_axis), KEEP it — the MMR sampling layer is orthogonal.
- [ ] Step 12.4: Run full spike suite
cd /Users/jneumann/Repos/PhonoLex/packages/generation && \
uv run python -m pytest research/2026-05-07-sentence-generation-paradigms/ -v
Expected: all tests pass; no quality_axis or train_reranker (v1) imports remain.
- [ ] Step 12.5: Smoke — full pipeline end-to-end
cd /Users/jneumann/Repos/PhonoLex/packages/generation/research/2026-05-07-sentence-generation-paradigms && \
uv run python -c "
import paragraph_csp
import paradigm_3_csp
import reranker_v2
import polars as pl
from pathlib import Path
from phonolex_data.runtime.store import WordStore
from constraint_surface import MinpairConstraint
repo = Path('/Users/jneumann/Repos/PhonoLex')
store = WordStore.from_parquet(repo / 'data' / 'runtime' / 'words.parquet')
sel_df = pl.read_parquet(repo / 'data' / 'runtime' / 'selectional.parquet')
skeletons_df = pl.read_parquet(Path('outputs') / 'skeletons.parquet')
import pair_driven
spec_words = paradigm_3_csp.spec_lexicon(store, 'spec1')
candidates = pair_driven.solve(
    spec_words=spec_words, word_df=store.df, sel_df=sel_df,
    pairs_df=store.pairs_df, skeletons_df=skeletons_df,
    band='fineweb_adult',
    constraints=[MinpairConstraint(phoneme1='d', phoneme2='z', position='final')],
    top_k=10,
)
ranked = reranker_v2.rerank_with_axes(candidates, is_paragraph=False, band='fineweb_adult', top_k=3)
for c in ranked:
    print(f'{c[\"sentence\"]} | composite={c[\"composite_score\"]:.2f}')
    print(f' axes: {c[\"axis_scores\"]}')
"
Expected: prints 3 sentences with composite + per-axis scores.
- [ ] Step 12.6: Final commit
git add packages/generation/research/2026-05-07-sentence-generation-paradigms/
git commit -m "$(cat <<'EOF'
PHON-107: retire v1 reranker; v2 is the canonical path
Removes quality_axis.py and train_reranker.py. All callers updated
to use quality_axis_v2 / train_reranker_v2 / reranker_v2.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Done
After Task 12 verification, PHON-107 (reranker v2) is complete. The stack continues toward PHON-109 productionization.
Follow-ups not in this plan:
- PHON-109 — productionize: move v2 reranker into a package, integrate with /api/generate-single
- PHON-110 — frontend reframe: surface per-axis breakdown + axis weight controls in the UI
- v3 reranker (deferred) — adaptive composite weighting (learned from per-band defaults), cross-axis correlation modeling