Generation Quality Eval Harness: Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Build a fresh eval harness under packages/generation/research/2026-04-29-eval-harness-v1/ that measures governed-generation quality across (config × prompt × constraints) using auto metrics + an LLM-judge rubric.
Architecture: Three deterministic CLI stages — generate.py (sweep → JSONL), score.py (auto metrics + Claude judge → scored JSONL), compare.py (markdown report) — driven by YAML configs, rubric, prompts, constraint pool. Autoresearch-ready (file-based primitives, append-only experiments.jsonl).
Tech Stack: Python 3.11, httpx (SSE client), pyyaml, transformers + torch (GPT-2 PPL), anthropic SDK (judge), pytest (testing).
Spec: docs/superpowers/specs/2026-04-29-generation-quality-eval-harness-design.md
Jira: PHON-57 (parent: PHON-56 Workstream; blocked by PHON-63 for full sweep usage)
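The append-only `experiments.jsonl` primitive named under Architecture can be pictured with a minimal sketch. The function names `append_experiment` and `read_experiments` and the record fields are assumptions made for illustration; the plan's own `experiments.py` may look different.

```python
import json
import tempfile
from pathlib import Path


def append_experiment(path: Path, record: dict) -> None:
    """Append one JSON record as a single line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_experiments(path: Path) -> list[dict]:
    """Read every record back in order, skipping blank lines."""
    if not path.exists():
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records; the real schema is not defined in this part of the plan.
demo_log = Path(tempfile.mkdtemp()) / "experiments.jsonl"
append_experiment(demo_log, {"config": "baseline-v6", "status": "proposed"})
append_experiment(demo_log, {"config": "baseline-v6", "status": "scored"})
demo_records = read_experiments(demo_log)
```

Opening the file in `"a"` mode is what keeps the log append-only: each stage adds lines, and a reader reconstructs history by replaying them in order.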
File Structure
packages/generation/research/2026-04-29-eval-harness-v1/
├── notebook.md          # lab notebook, append-only
├── rubric.yaml          # judge dims + auto metrics + scales
├── prompts.yaml         # 5 prompts (narrative, kids, declamatory, procedural, instructional)
├── constraints.yaml     # 8 constraint cells
├── configs/
│   └── baseline-v6.yaml # production decoding + governor settings, snapshot 2026-04-29
├── vocab.txt            # D1 207K word snapshot (committed once, refreshable)
├── refresh_vocab.sh     # re-snapshots vocab.txt from the local D1 words table
├── generate.py          # CLI: stage 1 (sweep runner)
├── score.py             # CLI: stage 2 (auto metrics + judge)
├── compare.py           # CLI: stage 3 (markdown comparison report)
├── auto_metrics.py      # OOV gate + n-gram + PPL
├── judge.py             # Claude API call + rubric-driven prompting
├── sse_client.py        # POST /api/generate-single, parse SSE events
├── schemas.py           # YAML loaders + dataclass validators
├── experiments.py       # experiments.jsonl helpers (autoresearch readiness)
├── tests/
│   ├── conftest.py      # shared fixtures
│   ├── test_auto_metrics.py
│   ├── test_judge.py
│   ├── test_schemas.py
│   ├── test_sse_client.py
│   ├── test_experiments.py
│   └── test_e2e.py
└── runs/, reports/, experiments.jsonl  # runtime artifacts (gitignored)
Dependencies to add (to packages/generation/pyproject.toml):
- anthropic>=0.40.0 (judge)
- pyyaml>=6.0 (config loaders)
- (existing) transformers, torch, httpx, pytest
Task 1: Scaffold the research directory
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/notebook.md
- Create: packages/generation/research/2026-04-29-eval-harness-v1/rubric.yaml
- Create: packages/generation/research/2026-04-29-eval-harness-v1/prompts.yaml
- Create: packages/generation/research/2026-04-29-eval-harness-v1/constraints.yaml
- Create: packages/generation/research/2026-04-29-eval-harness-v1/configs/baseline-v6.yaml
- Create: packages/generation/research/2026-04-29-eval-harness-v1/.gitignore
- Modify: packages/generation/pyproject.toml (add anthropic, pyyaml dependencies)
- [ ] Step 1: Create notebook.md
# Eval Harness v1 — Lab Notebook
**Created:** 2026-04-29
**Spec:** `docs/superpowers/specs/2026-04-29-generation-quality-eval-harness-design.md`
**Jira:** PHON-57
## Methodology
Epistemic scavenging: fresh design from first principles. Existing scripts (`constraint_grid_sweep.py`, `analyze_sweep.py`) are historical references for consilience, not foundations.
## Run index
(populated as runs accumulate)
## Hypothesis log
(populated as experiments are proposed)
## Findings
(populated as runs complete and patterns emerge)
- [ ] Step 2: Create rubric.yaml
version: 1
description: First-cut rubric, designed from observed failure modes (friend testing, 2026-04-29)
judge:
  default_model: claude-haiku-4-5
  cache_system_prompt: true
dimensions:
  - id: grammaticality
    description: Is the output well-formed English syntax?
    scale: [1, 2, 3, 4, 5]
    anchors:
      1: "Severe grammar errors throughout (broken syntax, missing core words)"
      3: "Some grammar issues but generally readable"
      5: "Fully grammatical, no syntactic errors"
  - id: coherence
    description: Does the output hold together as discourse and stay on topic?
    scale: [1, 2, 3, 4, 5]
    anchors:
      1: "Incoherent, topic salad, contradictory"
      3: "Mostly coherent with some drift or non-sequiturs"
      5: "Tight coherence end to end"
  - id: prompt_following
    description: Does the output address what the prompt asked for?
    scale: [1, 2, 3, 4, 5]
    anchors:
      1: "Ignores the prompt entirely"
      3: "Partially addresses it"
      5: "Fully addresses the prompt"
  - id: natural_ending
    description: Does it conclude on its own, or pad to budget?
    scale: [1, 2, 3, 4, 5]
    anchors:
      1: "Truncated mid-thought or padded with filler"
      3: "Acceptable ending but somewhat abrupt or padded"
      5: "Concludes naturally"
  - id: stays_in_english
    description: Does it avoid code-switching into other languages?
    scale: [1, 2, 3, 4, 5]
    anchors:
      1: "Heavy code-switching, mostly non-English tokens"
      3: "Some non-English tokens"
      5: "Fully English"
auto_metrics:
  - id: real_english_rate
    description: Fraction of [a-zA-Z]+ tokens present in the D1 207K vocab snapshot
    source: vocab.txt
  - id: distinct_3
    description: distinct-3-gram ratio (1.0 = no 3-gram repeats)
  - id: distinct_5
    description: distinct-5-gram ratio
  - id: max_3gram_rep
    description: max repetition rate of any single 3-gram
  - id: ppl_gpt2
    description: token-level perplexity from GPT-2 base
    model: gpt2
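To make the rubric's role concrete, here is a minimal sketch of how one dimension entry might be rendered into judge-prompt text. The function `render_dimension` and the output layout are assumptions for illustration; the plan's `judge.py` is not shown in this section and may format the prompt differently.

```python
def render_dimension(dim: dict) -> str:
    """Format one rubric dimension (id, description, scale, anchors) as prompt text."""
    lo, hi = min(dim["scale"]), max(dim["scale"])
    lines = [f"{dim['id']} ({lo}-{hi}): {dim['description']}"]
    # Anchors are sparse (1, 3, 5); the judge interpolates the unlabeled scores.
    for score in sorted(dim["anchors"]):
        lines.append(f"  {score} = {dim['anchors'][score]}")
    return "\n".join(lines)


# Copied from the coherence entry in rubric.yaml above.
coherence = {
    "id": "coherence",
    "description": "Does the output hold together as discourse and stay on topic?",
    "scale": [1, 2, 3, 4, 5],
    "anchors": {
        1: "Incoherent, topic salad, contradictory",
        3: "Mostly coherent with some drift or non-sequiturs",
        5: "Tight coherence end to end",
    },
}
rendered = render_dimension(coherence)
print(rendered)
```

Keeping the wording in `rubric.yaml` and generating the prompt from it means a rubric revision changes the judge's instructions without touching Python code.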
- [ ] Step 3: Create prompts.yaml
- id: dog_park
  text: "Write a short paragraph about a dog playing in the park."
  genre: narrative
- id: asteroid_kids
  text: "Tell a kids' story about an asteroid in space."
  genre: kids_story
- id: king_edicts
  text: "Write a proclamation by a king announcing three royal edicts."
  genre: declamation
- id: pb_sandwich
  text: "Describe how to make a peanut butter sandwich in three steps."
  genre: procedural
- id: tying_shoes
  text: "Give a brief instruction for tying shoes."
  genre: instructional
- [ ] Step 4: Create constraints.yaml
- id: exc_r
  type: exclude
  phonemes: ["ɹ", "ɝ", "ɚ"]
- id: exc_szshzh
  type: exclude
  phonemes: ["s", "z", "ʃ", "ʒ"]
- id: aoa_le5
  type: bound
  norm: aoa_kuperman
  max: 5.0
- id: aoa_le7
  type: bound
  norm: aoa_kuperman
  max: 7.0
- id: mixed_aoa7_excr
  combine:
    - { type: bound, norm: aoa_kuperman, max: 7.0 }
    - { type: exclude, phonemes: ["ɹ", "ɝ", "ɚ"] }
- id: mixed_aoa5_excsz
  combine:
    - { type: bound, norm: aoa_kuperman, max: 5.0 }
    - { type: exclude, phonemes: ["s", "z"] }
- id: inc_k
  type: include
  phonemes: ["k"]
  target_rate: 0.20
- id: con_sz
  type: contrastive
  pair_type: minpair
  phoneme1: s
  phoneme2: z
  position: any
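The sweep crosses the 8 constraint cells with the 5 prompts, and the resume logic in `generate.py` identifies finished work by a `combo_id|prompt_id` key. The sketch below shows that arithmetic under one assumption: that `combo_id` corresponds to the constraint cell's `id`. The helper `sweep_keys` and the `done` set are hypothetical.

```python
from itertools import product

constraint_ids = [
    "exc_r", "exc_szshzh", "aoa_le5", "aoa_le7",
    "mixed_aoa7_excr", "mixed_aoa5_excsz", "inc_k", "con_sz",
]
prompt_ids = ["dog_park", "asteroid_kids", "king_edicts", "pb_sandwich", "tying_shoes"]


def sweep_keys(constraints: list[str], prompts: list[str]) -> list[str]:
    """One key per (constraint cell, prompt) pair, constraint-major order."""
    return [f"{c}|{p}" for c, p in product(constraints, prompts)]


all_keys = sweep_keys(constraint_ids, prompt_ids)

# Hypothetical state after an interrupted run: two generations already on disk.
done = {"exc_r|dog_park", "exc_r|asteroid_kids"}
pending = [k for k in all_keys if k not in done]

print(len(all_keys), len(pending))  # 40 38
```

Because the key is derived only from ids, renaming a cell or prompt between runs makes earlier rows look unfinished, which is one reason to treat the ids as stable.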
- [ ] Step 5: Create configs/baseline-v6.yaml
name: baseline-v6
description: Production decoding + governor settings, snapshot 2026-04-29
parent_config: null
decoding:
  temperature: 0.8
  top_p: 0.92
  top_k: 80
  repetition_penalty: 1.2
  max_new_tokens: 128
  num_drafts: 4
governor:
  use_punctuation_boost: true
  use_trie_steering: true
  use_lookahead: true
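As a rough illustration of how a decoding block like this can travel through the harness, the sketch below turns it into a plain dict and derives a variant from it. The `Decoding` class is a stand-in written for this example, and the `hot` variant is hypothetical; how `parent_config` is actually resolved is not specified in this section.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Decoding:
    temperature: float
    top_p: float
    top_k: int
    repetition_penalty: float
    max_new_tokens: int
    num_drafts: int


# Values copied from configs/baseline-v6.yaml above.
baseline = Decoding(
    temperature=0.8,
    top_p=0.92,
    top_k=80,
    repetition_penalty=1.2,
    max_new_tokens=128,
    num_drafts=4,
)

# Plain dict, suitable for a JSON request body.
baseline_payload = asdict(baseline)

# Hypothetical child config: identical to baseline except for temperature.
hot = replace(baseline, temperature=1.0)
changed = {k: v for k, v in asdict(hot).items() if baseline_payload[k] != v}
print(changed)  # {'temperature': 1.0}
```

Recording only the changed keys alongside the parent's name is one way a comparison report could state exactly what an experiment varied.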
- [ ] Step 6: Create .gitignore for runtime artifacts
runs/
reports/
experiments.jsonl
__pycache__/
*.pyc
.pytest_cache/
- [ ] Step 7: Add anthropic + pyyaml to packages/generation/pyproject.toml
Open packages/generation/pyproject.toml, find the dependencies list, add:
"anthropic>=0.40.0",
"pyyaml>=6.0",
(Keep the list in alphabetical order: place each new entry after the existing entries that sort before it.)
- [ ] Step 8: Install new dependencies
Run: uv sync (from repo root)
Expected: pulls anthropic + pyyaml into the env.
- [ ] Step 9: Commit scaffold
git add packages/generation/research/2026-04-29-eval-harness-v1/ packages/generation/pyproject.toml uv.lock
git commit -m "$(cat <<'EOF'
scaffold(generation/research): PHON-57 — eval harness v1 directory + YAML stubs
Notebook, rubric, prompts, constraints, baseline-v6 config. Adds anthropic + pyyaml deps. No Python code yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 2: Snapshot D1 vocabulary to vocab.txt
The OOV gate needs the canonical PhonoLex vocabulary. Snapshot once, commit, refreshable on demand.
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/vocab.txt
- Create: packages/generation/research/2026-04-29-eval-harness-v1/refresh_vocab.sh
- [ ] Step 1: Create refresh_vocab.sh helper
#!/usr/bin/env bash
# Snapshot the local D1 words table to vocab.txt.
# Run from this directory: ./refresh_vocab.sh
# Requires: local D1 to be seeded (npx wrangler d1 execute phonolex --local --file scripts/d1-seed.sql)
set -e
cd "$(dirname "$0")"

WORKERS_DIR="../../../web/workers"
DB_PATH="$WORKERS_DIR/.wrangler/state/v3/d1/miniflare-D1DatabaseObject"
DB_FILE=$(find "$DB_PATH" -name "*.sqlite" 2>/dev/null | head -1)

if [[ -z "$DB_FILE" ]]; then
    echo "ERROR: local D1 SQLite file not found at $DB_PATH"
    echo "Seed it first: cd packages/web/workers && npx wrangler d1 execute phonolex --local --file scripts/d1-seed.sql"
    exit 1
fi

echo "Snapshotting from $DB_FILE..."
sqlite3 "$DB_FILE" "SELECT word FROM words ORDER BY word" > vocab.txt
echo "Wrote $(wc -l < vocab.txt) words to vocab.txt"
Make executable: chmod +x packages/generation/research/2026-04-29-eval-harness-v1/refresh_vocab.sh
- [ ] Step 2: Run the snapshot
cd packages/generation/research/2026-04-29-eval-harness-v1
./refresh_vocab.sh
Expected: "Wrote 207665 words to vocab.txt" (or close — exact count depends on seed version).
- [ ] Step 3: Sanity-check vocab.txt
wc -l vocab.txt
grep -c '^the$' vocab.txt
grep -c '^mek$' vocab.txt
Expected: ~207665 lines, ^the$ returns 1, ^mek$ returns 0.
- [ ] Step 4: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/vocab.txt packages/generation/research/2026-04-29-eval-harness-v1/refresh_vocab.sh
git commit -m "$(cat <<'EOF'
snapshot(generation/research): PHON-57 — D1 vocabulary snapshot for OOV gate
Snapshot of local D1 words table (207,665 entries) committed as vocab.txt. refresh_vocab.sh re-snapshots on demand. Decouples the eval harness from D1 state during runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 3: OOV gate (real_english_rate auto-metric)
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/__init__.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_auto_metrics.py
- [ ] Step 1: Write the failing test
Create tests/test_auto_metrics.py:
from pathlib import Path

import pytest

from auto_metrics import VocabGate, real_english_rate


@pytest.fixture
def vocab_gate():
    return VocabGate(Path(__file__).parent.parent / "vocab.txt")


def test_real_english_rate_all_in_vocab(vocab_gate):
    text = "the dog ran in the park"
    assert real_english_rate(text, vocab_gate) == 1.0


def test_real_english_rate_all_oov(vocab_gate):
    text = "mek aan alen vivere"
    assert real_english_rate(text, vocab_gate) == 0.0


def test_real_english_rate_mixed(vocab_gate):
    text = "the mek dog aan ran"  # 3 in-vocab / 5 total
    assert real_english_rate(text, vocab_gate) == pytest.approx(3 / 5)


def test_real_english_rate_case_insensitive(vocab_gate):
    text = "The DOG ran"
    assert real_english_rate(text, vocab_gate) == 1.0


def test_real_english_rate_empty(vocab_gate):
    # No tokens: the rate is undefined; by convention return 1.0 (vacuously true)
    assert real_english_rate("", vocab_gate) == 1.0


def test_real_english_rate_strips_punctuation(vocab_gate):
    text = "The dog, running fast!"
    assert real_english_rate(text, vocab_gate) == 1.0
- [ ] Step 2: Run test, verify it fails
cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python -m pytest tests/test_auto_metrics.py -v
Expected: FAIL with "ModuleNotFoundError: No module named 'auto_metrics'"
- [ ] Step 3: Implement minimal VocabGate + real_english_rate
Create auto_metrics.py:
"""Auto metrics for the eval harness — local computation, no network."""
from __future__ import annotations

import re
from pathlib import Path

WORD_RE = re.compile(r"[a-zA-Z]+")


class VocabGate:
    """Membership check against a one-word-per-line vocabulary file."""

    def __init__(self, vocab_path: Path) -> None:
        # Explicit encoding so the snapshot loads the same way on every platform.
        with open(vocab_path, encoding="utf-8") as f:
            self._words = {line.strip().lower() for line in f if line.strip()}

    def __contains__(self, word: str) -> bool:
        return word.lower() in self._words

    def __len__(self) -> int:
        return len(self._words)


def real_english_rate(text: str, gate: VocabGate) -> float:
    """Fraction of [a-zA-Z]+ tokens in `text` that are members of `gate`."""
    tokens = WORD_RE.findall(text)
    if not tokens:
        return 1.0
    in_vocab = sum(1 for t in tokens if t in gate)
    return in_vocab / len(tokens)
- [ ] Step 4: Run test, verify it passes
uv run python -m pytest tests/test_auto_metrics.py -v
Expected: 6 passed.
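One consequence of tokenizing with `[a-zA-Z]+` is worth seeing on a small input: apostrophes split words, so a contraction is checked as two fragments. The sketch below uses a toy in-memory vocabulary (an assumption for illustration, not the D1 snapshot) and a standalone `rate` function that mirrors the metric's arithmetic.

```python
import re

WORD_RE = re.compile(r"[a-zA-Z]+")


def rate(text: str, vocab: set[str]) -> float:
    """Share of alphabetic tokens found in `vocab`; 1.0 when there are no tokens."""
    tokens = WORD_RE.findall(text)
    if not tokens:
        return 1.0
    return sum(t.lower() in vocab for t in tokens) / len(tokens)


# Toy vocabulary: contains the contraction as a whole word, but not its fragments.
toy_vocab = {"the", "dog", "don't", "run"}

tokens = WORD_RE.findall("The dog don't run")
print(tokens)                               # ['The', 'dog', 'don', 't', 'run']
print(rate("The dog don't run", toy_vocab))  # 0.6
```

Whether this matters in practice depends on whether the vocabulary snapshot happens to contain fragments such as `don` and `t`; the point is only that the score reflects fragments, not the original word.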
- [ ] Step 5: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py packages/generation/research/2026-04-29-eval-harness-v1/tests/
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — OOV gate auto-metric (real_english_rate)
VocabGate loads vocab.txt as a Python set; real_english_rate returns the fraction of [a-zA-Z]+ tokens present in the gate. Catches the friend's "mek"/"aan"/"alen" pseudo-English failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 4: N-gram repetition metrics
Files:
- Modify: packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py
- Modify: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_auto_metrics.py
- [ ] Step 1: Write failing tests for n-gram metrics
Append to tests/test_auto_metrics.py:
from auto_metrics import distinct_n, max_ngram_rep


def test_distinct_n_no_repeats():
    text = "the dog ran fast"
    # 2 trigrams: "the dog ran", "dog ran fast"; both unique
    assert distinct_n(text, n=3) == 1.0


def test_distinct_n_full_repetition():
    text = "mek mek mek mek mek"
    # 3 trigrams, all "mek mek mek": distinct = 1/3
    assert distinct_n(text, n=3) == pytest.approx(1 / 3)


def test_distinct_n_short_text():
    # Fewer tokens than n: distinct-n is undefined; return 1.0
    text = "hi"
    assert distinct_n(text, n=3) == 1.0


def test_max_3gram_rep_no_repetition():
    text = "the quick brown fox jumps over"
    # 4 trigrams, each appears once: max rep = 1/4
    assert max_ngram_rep(text, n=3) == pytest.approx(1 / 4)


def test_max_3gram_rep_heavy_repetition():
    text = "mek mek mek mek mek"
    # 3 trigrams, "mek mek mek" appears 3 times: max rep = 3/3 = 1.0
    assert max_ngram_rep(text, n=3) == 1.0


def test_max_3gram_rep_partial_repetition():
    text = "the dog ran the dog ran fast"
    # 7 tokens -> 5 trigrams:
    # (the,dog,ran), (dog,ran,the), (ran,the,dog), (the,dog,ran), (dog,ran,fast)
    # Most frequent: "the dog ran" appears 2x out of 5 = 0.4
    assert max_ngram_rep(text, n=3) == pytest.approx(2 / 5)
- [ ] Step 2: Run tests, verify they fail
uv run python -m pytest tests/test_auto_metrics.py::test_distinct_n_no_repeats -v
Expected: FAIL with "ImportError: cannot import name 'distinct_n'".
- [ ] Step 3: Implement n-gram functions
Append to auto_metrics.py:
from collections import Counter


def _ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    tokens = [t.lower() for t in WORD_RE.findall(text)]
    if len(tokens) < n:
        return []
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(text: str, n: int) -> float:
    """distinct-n: ratio of unique n-grams to total n-grams. 1.0 = no repeats."""
    grams = _ngrams(text, n)
    if not grams:
        return 1.0
    return len(set(grams)) / len(grams)


def max_ngram_rep(text: str, n: int) -> float:
    """Max repetition rate of any single n-gram (count of most frequent / total)."""
    grams = _ngrams(text, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    most_freq = counts.most_common(1)[0][1]
    return most_freq / len(grams)
- [ ] Step 4: Run tests, verify they pass
uv run python -m pytest tests/test_auto_metrics.py -v
Expected: all tests pass (12 total at this point).
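The two n-gram metrics are easy to confuse, so here is the arithmetic for the partial-repetition case worked through in a standalone sketch. It re-derives the numbers with plain `collections.Counter` rather than importing `auto_metrics`; the helper `ngrams` is written for this example only.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-token windows, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the dog ran the dog ran fast".split()
grams = ngrams(tokens, 3)
counts = Counter(grams)

distinct_3 = len(counts) / len(grams)                   # 4 unique / 5 total
max_3gram_rep = counts.most_common(1)[0][1] / len(grams)  # 2 occurrences / 5 total

print(len(grams), distinct_3, max_3gram_rep)  # 5 0.8 0.4
```

Note that `max_ngram_rep` has a floor of 1/N for a text with N n-grams and no repeats (the 1/4 in the test above), so values are only comparable between outputs of similar length.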
- [ ] Step 5: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_auto_metrics.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — n-gram repetition auto-metrics
distinct_n (1 - repetition rate) and max_ngram_rep (worst-offending n-gram). Quantifies the "mek mek mek" / "Alen Alen Alen" failures from friend testing.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 5: GPT-2 PPL scorer
Files:
- Modify: packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py
- Modify: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_auto_metrics.py
- [ ] Step 1: Write the failing test
Append to tests/test_auto_metrics.py:
from auto_metrics import GPT2PPLScorer


@pytest.fixture(scope="module")
def ppl_scorer():
    return GPT2PPLScorer()


def test_ppl_fluent_lower_than_gibberish(ppl_scorer):
    fluent = "The dog ran across the park and chased a ball."
    gibberish = "mek aan alen oude vivere allaha goed buona"
    assert ppl_scorer.ppl(fluent) < ppl_scorer.ppl(gibberish)


def test_ppl_returns_finite_positive(ppl_scorer):
    text = "The cat sat on the mat."
    val = ppl_scorer.ppl(text)
    assert val > 0
    assert val < float("inf")


def test_ppl_short_text(ppl_scorer):
    # Very short input: should still return a value, not crash
    val = ppl_scorer.ppl("Hi.")
    assert val > 0
- [ ] Step 2: Run tests, verify they fail
uv run python -m pytest tests/test_auto_metrics.py::test_ppl_fluent_lower_than_gibberish -v
Expected: FAIL with "ImportError: cannot import name 'GPT2PPLScorer'".
- [ ] Step 3: Implement GPT2PPLScorer
Append to auto_metrics.py:
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast


class GPT2PPLScorer:
    """Token-level perplexity from GPT-2 base. Loaded once; thread-unsafe but fast."""

    def __init__(self, model_name: str = "gpt2") -> None:
        self.tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        self.model.eval()
        self.device = "mps" if torch.backends.mps.is_available() else "cpu"
        self.model.to(self.device)

    @torch.no_grad()
    def ppl(self, text: str) -> float:
        if not text.strip():
            return float("inf")
        enc = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        input_ids = enc["input_ids"].to(self.device)
        if input_ids.shape[1] < 2:
            # Need at least 2 tokens for a meaningful loss
            return float("nan")
        outputs = self.model(input_ids, labels=input_ids)
        return float(torch.exp(outputs.loss).item())
- [ ] Step 4: Run tests, verify they pass
uv run python -m pytest tests/test_auto_metrics.py -v
Expected: all pass. First run downloads GPT-2 (~500MB, cached to ~/.cache/huggingface/) — subsequent runs are fast.
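The quantity the scorer returns is the exponential of the mean negative log-likelihood per predicted token. The sketch below reproduces that formula in pure Python on made-up token probabilities (the probability lists are illustrative assumptions, not GPT-2 outputs), which makes the scale of the metric easier to read.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp(mean negative log-likelihood) over the scored tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Illustrative probabilities the model might assign to each next token.
fluent = [0.5, 0.25, 0.5, 0.25]
degenerate = [0.01, 0.02, 0.01, 0.02]

print(perplexity(fluent))      # about 2.83, i.e. 2 ** 1.5
print(perplexity(degenerate))  # about 70.7
```

A uniform guess over k options gives a perplexity of k, so the value can be read as an effective branching factor. With `labels=input_ids`, the Hugging Face model shifts the labels internally, so the loss averages over the n-1 predicted tokens of an n-token input; that is why a single-token input has no meaningful loss.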
- [ ] Step 5: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_auto_metrics.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — GPT-2 PPL auto-metric
GPT2PPLScorer wraps gpt2 base for token-level perplexity. Tripwire metric — degenerate outputs (vivere vivere vivere) get absurd PPL.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 6: YAML schema loaders + validators
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/schemas.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_schemas.py
- [ ] Step 1: Write failing tests
Create tests/test_schemas.py:
from pathlib import Path

import pytest
import yaml

from schemas import load_rubric, load_config, load_prompts, load_constraints

HERE = Path(__file__).parent
ROOT = HERE.parent


def test_load_rubric_well_formed():
    rubric = load_rubric(ROOT / "rubric.yaml")
    assert rubric.version == 1
    assert len(rubric.dimensions) == 5
    assert {d.id for d in rubric.dimensions} == {
        "grammaticality", "coherence", "prompt_following", "natural_ending", "stays_in_english"
    }
    assert {m.id for m in rubric.auto_metrics} == {
        "real_english_rate", "distinct_3", "distinct_5", "max_3gram_rep", "ppl_gpt2"
    }
    assert rubric.judge.default_model == "claude-haiku-4-5"


def test_load_config_baseline():
    config = load_config(ROOT / "configs" / "baseline-v6.yaml")
    assert config.name == "baseline-v6"
    assert config.decoding.temperature == 0.8
    assert config.decoding.num_drafts == 4
    assert config.governor.use_punctuation_boost is True


def test_load_prompts():
    prompts = load_prompts(ROOT / "prompts.yaml")
    assert len(prompts) == 5
    assert {p.id for p in prompts} == {
        "dog_park", "asteroid_kids", "king_edicts", "pb_sandwich", "tying_shoes"
    }


def test_load_constraints():
    constraints = load_constraints(ROOT / "constraints.yaml")
    assert len(constraints) == 8
    assert {c["id"] for c in constraints} == {
        "exc_r", "exc_szshzh", "aoa_le5", "aoa_le7",
        "mixed_aoa7_excr", "mixed_aoa5_excsz", "inc_k", "con_sz",
    }


def test_load_rubric_invalid_missing_dims(tmp_path):
    bad = tmp_path / "bad.yaml"
    bad.write_text(yaml.safe_dump({"version": 1, "judge": {"default_model": "x"}}))
    # load_rubric raises ValueError when dimensions are missing; catching bare
    # Exception here would let unrelated failures pass the test.
    with pytest.raises(ValueError):
        load_rubric(bad)
- [ ] Step 2: Run tests, verify they fail
uv run python -m pytest tests/test_schemas.py -v
Expected: FAIL with "ModuleNotFoundError: No module named 'schemas'".
- [ ] Step 3: Implement schemas.py
Create schemas.py:
"""YAML loaders for rubric, config, prompts, constraints."""
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Any

import yaml


@dataclass
class JudgeConfig:
    default_model: str
    cache_system_prompt: bool = True


@dataclass
class RubricDimension:
    id: str
    description: str
    scale: list[int]
    anchors: dict[int, str]


@dataclass
class AutoMetric:
    id: str
    description: str = ""
    source: str | None = None
    model: str | None = None


@dataclass
class Rubric:
    version: int
    judge: JudgeConfig
    dimensions: list[RubricDimension]
    auto_metrics: list[AutoMetric]
    description: str = ""


@dataclass
class DecodingConfig:
    temperature: float
    top_p: float
    top_k: int
    repetition_penalty: float
    max_new_tokens: int
    num_drafts: int


@dataclass
class GovernorConfig:
    use_punctuation_boost: bool = True
    use_trie_steering: bool = True
    use_lookahead: bool = True


@dataclass
class ExperimentConfig:
    name: str
    description: str
    decoding: DecodingConfig
    governor: GovernorConfig
    parent_config: str | None = None


@dataclass
class Prompt:
    id: str
    text: str
    genre: str = ""


def _load_yaml(path: Path) -> Any:
    with open(path) as f:
        return yaml.safe_load(f)


def load_rubric(path: Path) -> Rubric:
    raw = _load_yaml(path)
    if "dimensions" not in raw or not raw["dimensions"]:
        raise ValueError(f"rubric.yaml missing dimensions: {path}")
    return Rubric(
        version=raw["version"],
        description=raw.get("description", ""),
        judge=JudgeConfig(**raw["judge"]),
        dimensions=[
            RubricDimension(
                id=d["id"],
                description=d["description"],
                scale=d["scale"],
                anchors={int(k): v for k, v in d["anchors"].items()},
            )
            for d in raw["dimensions"]
        ],
        auto_metrics=[AutoMetric(**m) for m in raw["auto_metrics"]],
    )


def load_config(path: Path) -> ExperimentConfig:
    raw = _load_yaml(path)
    return ExperimentConfig(
        name=raw["name"],
        description=raw["description"],
        parent_config=raw.get("parent_config"),
        decoding=DecodingConfig(**raw["decoding"]),
        governor=GovernorConfig(**raw.get("governor", {})),
    )


def load_prompts(path: Path) -> list[Prompt]:
    raw = _load_yaml(path)
    return [Prompt(**p) for p in raw]


def load_constraints(path: Path) -> list[dict]:
    """Constraints stay as raw dicts; they pass directly to the generation API."""
    raw = _load_yaml(path)
    if not isinstance(raw, list):
        raise ValueError(f"constraints.yaml must be a list: {path}")
    for c in raw:
        if "id" not in c:
            raise ValueError(f"constraint missing id: {c}")
    return raw
- [ ] Step 4: Run tests, verify they pass
uv run python -m pytest tests/test_schemas.py -v
Expected: 5 passed.
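The loaders above check structure but not consistency between fields. As one possible extension, the sketch below verifies that every anchor key of a rubric dimension lies on its declared scale. The function `check_dimension` is hypothetical and not part of the plan's `schemas.py`.

```python
def check_dimension(dim_id: str, scale: list[int], anchors: dict[int, str]) -> None:
    """Raise ValueError if any anchor key falls outside the declared scale."""
    stray = sorted(set(anchors) - set(scale))
    if stray:
        raise ValueError(f"{dim_id}: anchors {stray} are not on scale {scale}")


# Passes silently: anchors 1, 3, 5 are all on the 1-5 scale.
check_dimension("coherence", [1, 2, 3, 4, 5], {1: "low", 3: "mid", 5: "high"})

# A typo such as anchoring score 6 on a 1-5 scale is reported by name.
try:
    check_dimension("coherence", [1, 2, 3, 4, 5], {1: "low", 6: "high"})
except ValueError as exc:
    message = str(exc)
    print(message)
```

A check like this would fit naturally inside `load_rubric`, so a malformed rubric fails at load time instead of producing a confusing judge prompt later.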
- [ ] Step 5: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/schemas.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_schemas.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — YAML schema loaders for rubric/config/prompts/constraints
Dataclass-based loaders with validation. Constraints stay as raw dicts (passed verbatim to /api/generate-single).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 7: SSE client for /api/generate-single
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/sse_client.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_sse_client.py
- [ ] Step 1: Write failing test using a stubbed HTTP client
Create tests/test_sse_client.py:
from unittest.mock import MagicMock, patch

from sse_client import generate_one, GenerationResult

SAMPLE_SSE = (
    'data: {"status": "Building VocabTrie"}\n'
    'data: {"status": "Vocabulary survival: 48%"}\n'
    'data: {"status": "Generating 4 drafts (attempt 1)"}\n'
    'data: {"status": " Draft 1: compliant"}\n'
    'data: {"result": {"text": "The dog ran.", "compliant": true, "violation_count": 0, "violation_words": [], "boost_coverage": [], "warnings": null, "gen_time_ms": 1234}}\n'
)


def _mock_client(sse_text: str) -> MagicMock:
    """Build a stand-in for httpx.Client whose stream yields `sse_text` as one chunk."""
    mock_resp = MagicMock()
    mock_resp.iter_text.return_value = [sse_text]
    mock_resp.raise_for_status = MagicMock()
    mock_stream = MagicMock()
    mock_stream.__enter__ = MagicMock(return_value=mock_resp)
    mock_stream.__exit__ = MagicMock(return_value=None)
    mock_client = MagicMock()
    mock_client.stream = MagicMock(return_value=mock_stream)
    mock_client.__enter__ = MagicMock(return_value=mock_client)
    mock_client.__exit__ = MagicMock(return_value=None)
    return mock_client


def test_generate_one_parses_sse():
    with patch("sse_client.httpx.Client", return_value=_mock_client(SAMPLE_SSE)):
        result = generate_one(
            url="http://localhost:8000",
            prompt="Test prompt",
            constraints=[{"type": "exclude", "phonemes": ["s"]}],
            decoding={"temperature": 0.8},
        )
    assert isinstance(result, GenerationResult)
    assert result.text == "The dog ran."
    assert result.compliant is True
    assert result.survival_ratio == 0.48
    assert result.gen_time_ms == 1234
    assert result.error is None
    assert result.drafts_compliant == 1


def test_generate_one_handles_error_event():
    error_sse = (
        'data: {"status": "starting"}\n'
        'data: {"error": "Server crashed"}\n'
    )
    with patch("sse_client.httpx.Client", return_value=_mock_client(error_sse)):
        result = generate_one(
            url="http://localhost:8000",
            prompt="Test",
            constraints=[],
            decoding={},
        )
    assert result.error == "Server crashed"
    assert result.text == ""
    assert result.compliant is False
- [ ] Step 2: Run tests, verify they fail
uv run python -m pytest tests/test_sse_client.py -v
Expected: FAIL with "ModuleNotFoundError: No module named 'sse_client'".
- [ ] Step 3: Implement sse_client.py
Create sse_client.py:
"""SSE client for the local generation server's /api/generate-single endpoint.
The server streams `data: {...json...}` events with either {"status": "..."},
{"result": {...}}, or {"error": "..."} payloads. We collect statuses for
pipeline metrics and return the final result (or error).
"""
from __future__ import annotations

import json
import re
from dataclasses import dataclass, field
from typing import Any

import httpx


@dataclass
class GenerationResult:
    text: str = ""
    compliant: bool = False
    violation_count: int = 0
    violation_words: list[str] = field(default_factory=list)
    boost_coverage: list[dict] = field(default_factory=list)
    warnings: str | None = None
    gen_time_ms: int = 0
    survival_ratio: float | None = None
    retry_count: int = 0
    drafts_compliant: int = 0
    hit_escalation: bool = False
    statuses: list[str] = field(default_factory=list)
    error: str | None = None


def generate_one(
    url: str,
    prompt: str,
    constraints: list[dict],
    decoding: dict | None = None,
    governor: dict | None = None,
    timeout: float = 600.0,
) -> GenerationResult:
    """POST to /api/generate-single, parse SSE stream, return structured result."""
    payload: dict[str, Any] = {"prompt": prompt, "constraints": constraints}
    if decoding:
        payload["decoding"] = decoding
    if governor:
        payload["governor"] = governor

    statuses: list[str] = []
    result_payload: dict | None = None
    error_msg: str | None = None

    with httpx.Client(timeout=httpx.Timeout(timeout, connect=30.0)) as client:
        with client.stream(
            "POST",
            f"{url}/api/generate-single",
            json=payload,
            headers={"Content-Type": "application/json"},
        ) as resp:
            resp.raise_for_status()
            buffer = ""
            for chunk in resp.iter_text():
                buffer += chunk
                while "\n" in buffer:
                    line, buffer = buffer.split("\n", 1)
                    line = line.strip()
                    if not line.startswith("data: "):
                        continue
                    data_str = line[6:].strip()
                    if not data_str:
                        continue
                    try:
                        event = json.loads(data_str)
                    except json.JSONDecodeError:
                        continue
                    if "status" in event:
                        statuses.append(event["status"])
                    elif "result" in event:
                        result_payload = event["result"]
                    elif "error" in event:
                        error_msg = event["error"]

    return _build_result(statuses, result_payload, error_msg)


def _build_result(
    statuses: list[str],
    result: dict | None,
    error: str | None,
) -> GenerationResult:
    out = GenerationResult(statuses=statuses, error=error)

    for s in statuses:
        m = re.search(r"Vocabulary survival: (\d+)%", s)
        if m:
            out.survival_ratio = int(m.group(1)) / 100.0
        attempt_m = re.match(r"Generating \d+ drafts \(attempt (\d+)\)", s)
        if attempt_m and int(attempt_m.group(1)) > 1:
            out.retry_count = int(attempt_m.group(1)) - 1
        if re.match(r"\s+Draft \d+: compliant", s):
            out.drafts_compliant += 1
        if "targeted rollout" in s.lower():
            out.hit_escalation = True

    if error or result is None:
        return out

    out.text = result.get("text", "")
    out.compliant = result.get("compliant", False)
    out.violation_count = result.get("violation_count", 0)
    out.violation_words = result.get("violation_words", [])
    out.boost_coverage = result.get("boost_coverage", [])
    out.warnings = result.get("warnings")
    out.gen_time_ms = result.get("gen_time_ms", 0)
    return out
- [ ] Step 4: Run tests, verify they pass
uv run python -m pytest tests/test_sse_client.py -v
Expected: 2 passed.
- [ ] Step 5: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/sse_client.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_sse_client.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — SSE client for /api/generate-single
generate_one() POSTs prompt+constraints+decoding+governor, parses SSE events into a structured GenerationResult capturing text, compliance, pipeline metrics (survival, retries, escalation), and any error.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 8: generate.py (Stage 1 CLI)
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/generate.py
- [ ] Step 1: Implement generate.py
Create generate.py:
"""Stage 1: sweep runner.
Usage:
uv run python generate.py <config_name> [--server http://localhost:8000] [--resume]
"""
from __future__ import annotations
import argparse
import hashlib
import json
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
import httpx
from schemas import load_config, load_prompts, load_constraints
from sse_client import generate_one
HERE = Path(__file__).parent
RUNS_DIR = HERE / "runs"
def _sha(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()[:12]


def _hash_yaml(path: Path) -> str:
    return _sha(path.read_text())


def _check_server_ready(url: str) -> dict:
    resp = httpx.get(f"{url}/api/server/status", timeout=10.0)
    resp.raise_for_status()
    status = resp.json()
    if status.get("status") != "ready":
        raise RuntimeError(f"Server not ready: {status}")
    return status


def _load_done(path: Path) -> set[str]:
    """Load (combo_id|prompt_id) keys from existing generations.jsonl."""
    if not path.exists():
        return set()
    done = set()
    with open(path) as f:
        for line in f:
            try:
                row = json.loads(line)
                done.add(f"{row['combo_id']}|{row['prompt_id']}")
            except (json.JSONDecodeError, KeyError):
                continue
    return done
def run(config_name: str, server: str, resume: bool) -> None:
config_path = HERE / "configs" / f"{config_name}.yaml"
config = load_config(config_path)
prompts = load_prompts(HERE / "prompts.yaml")
constraints = load_constraints(HERE / "constraints.yaml")
server_status = _check_server_ready(server)
print(f"Server ready: {server_status.get('model', 'unknown')}")
# Establish run dir
if resume:
existing = sorted((RUNS_DIR).glob(f"{config_name}-*"))
if existing:
run_dir = existing[-1]
print(f"Resuming into {run_dir}")
else:
print(f"--resume requested but no prior run for {config_name}; starting new")
timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
run_dir = RUNS_DIR / f"{config_name}-{timestamp}"
else:
timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
run_dir = RUNS_DIR / f"{config_name}-{timestamp}"
run_dir.mkdir(parents=True, exist_ok=True)
run_id = run_dir.name
generations_path = run_dir / "generations.jsonl"
# Write meta.json
meta = {
"run_id": run_id,
"config_name": config_name,
"config_path": str(config_path.relative_to(HERE)),
"config_hash": _hash_yaml(config_path),
"prompts_hash": _hash_yaml(HERE / "prompts.yaml"),
"constraints_hash": _hash_yaml(HERE / "constraints.yaml"),
"server_status": server_status,
"started_at": datetime.now(timezone.utc).isoformat(),
}
(run_dir / "meta.json").write_text(json.dumps(meta, indent=2))
done = _load_done(generations_path) if resume else set()
total = len(constraints) * len(prompts)
print(f"Sweep: {len(constraints)} constraints × {len(prompts)} prompts = {total} generations")
if done:
print(f"Resuming: {len(done)} already done")
decoding = {
"temperature": config.decoding.temperature,
"top_p": config.decoding.top_p,
"top_k": config.decoding.top_k,
"repetition_penalty": config.decoding.repetition_penalty,
"max_new_tokens": config.decoding.max_new_tokens,
"num_drafts": config.decoding.num_drafts,
}
governor_block = {
"use_punctuation_boost": config.governor.use_punctuation_boost,
"use_trie_steering": config.governor.use_trie_steering,
"use_lookahead": config.governor.use_lookahead,
}
t_start = time.time()
completed = 0
with open(generations_path, "a") as out:
for combo in constraints:
combo_id = combo["id"]
# Constraint payload: 'combine' expands; otherwise the combo dict itself sans id
if "combine" in combo:
combo_payload = combo["combine"]
else:
combo_payload = [{k: v for k, v in combo.items() if k != "id"}]
for prompt in prompts:
key = f"{combo_id}|{prompt.id}"
if key in done:
continue
                # Count rows skipped by --resume so the counter reflects the whole sweep.
                progress = len(done) + completed + 1
                print(f"[{progress}/{total}] {combo_id} × {prompt.id} ", end="", flush=True)
                try:
                    result = generate_one(
                        url=server,
                        prompt=prompt.text,
                        constraints=combo_payload,
                        decoding=decoding,
                        governor=governor_block,
                    )
                    err = result.error
                except Exception as e:
                    # Reported once, after the row is written (see the `if err:` branch).
                    err = str(e)
                    result = None
row = {
"run_id": run_id,
"config_name": config_name,
"prompt_id": prompt.id,
"prompt": prompt.text,
"combo_id": combo_id,
"constraints": combo_payload,
"text": result.text if result else "",
"pipeline_metrics": {
"compliant": result.compliant if result else False,
"violation_count": result.violation_count if result else 0,
"violation_words": result.violation_words if result else [],
"survival_ratio": result.survival_ratio if result else None,
"retry_count": result.retry_count if result else 0,
"drafts_compliant": result.drafts_compliant if result else 0,
"hit_escalation": result.hit_escalation if result else False,
"gen_time_ms": result.gen_time_ms if result else 0,
},
"error": err,
"ts": datetime.now(timezone.utc).isoformat(),
}
out.write(json.dumps(row, ensure_ascii=False) + "\n")
out.flush()
completed += 1
if err:
print(f"ERROR: {err}")
else:
c = "✓" if result.compliant else "✗"
surv = result.survival_ratio
surv_s = f" surv={surv:.0%}" if surv is not None else ""
print(f"{c} {result.gen_time_ms / 1000:.1f}s{surv_s}")
elapsed = time.time() - t_start
print(f"\nRun complete: {completed} generations in {elapsed/60:.1f} min")
print(f"Output: {generations_path}")
def main():
parser = argparse.ArgumentParser(description="Stage 1: sweep generator")
parser.add_argument("config_name", help="Name of a configs/<name>.yaml preset")
parser.add_argument("--server", default="http://localhost:8000")
parser.add_argument("--resume", action="store_true")
args = parser.parse_args()
run(args.config_name, args.server, args.resume)
if __name__ == "__main__":
main()
- [ ] Step 2: Sanity-check imports
cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python -c "import generate; print('ok')"
Expected: ok
- [ ] Step 3: Sanity-check CLI help
uv run python generate.py --help
Expected: argparse help with config_name, --server, --resume.
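The `--resume` flag picks the most recent run directory by sorting directory names, which works because the fixed-width `%Y%m%dT%H%M%SZ` timestamp sorts lexicographically in chronological order. A bare `glob(f"{config_name}-*")` would also match runs of any config whose name starts with the same prefix (for example `baseline` vs `baseline-v6`). The sketch below shows one way to make the match stricter; the helper name `latest_run_dir` and the timestamp regex are assumptions for illustration, not part of generate.py as written.

```python
from __future__ import annotations

import re
from pathlib import Path

# Hypothetical helper, not part of the harness.
# Matches the %Y%m%dT%H%M%SZ suffix that generate.py appends to run directories.
_TS = re.compile(r"\d{8}T\d{6}Z")


def latest_run_dir(runs_dir: Path, config_name: str) -> Path | None:
    """Return the newest runs/<config_name>-<timestamp> directory, or None."""
    prefix = f"{config_name}-"
    matches = [
        p
        for p in runs_dir.glob(f"{config_name}-*")
        # Require that everything after the prefix is exactly a timestamp.
        if p.is_dir() and _TS.fullmatch(p.name[len(prefix):])
    ]
    # Fixed-width UTC timestamps sort lexicographically in time order.
    return max(matches, key=lambda p: p.name, default=None)
```

With this check, resuming `baseline` would ignore a `baseline-v6-...` directory even though both match the glob.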
- [ ] Step 4: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/generate.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — generate.py (Stage 1 sweep runner)
Loads named config, iterates constraints × prompts, POSTs each via SSE client, writes generations.jsonl with structured pipeline metrics. Resume-safe by (combo_id, prompt_id) key. Writes meta.json with config + pool hashes for reproducibility.
Decoding + governor blocks are passed through to /api/generate-single, but the server endpoint won't honor them until PHON-63 lands. Until then, the harness sends the blocks as documentation/forward-compat; the server uses compiled-in defaults.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
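Each row that generate.py writes carries `combo_id`, `prompt_id`, and a `pipeline_metrics` block, so a sweep can be checked before any judge budget is spent. The following is a minimal sketch of such a check, assuming the row schema shown in generate.py; the function name `compliance_by_combo` is hypothetical.

```python
from __future__ import annotations

import json
from collections import defaultdict
from pathlib import Path


def compliance_by_combo(path: Path) -> dict[str, tuple[int, int]]:
    """Map combo_id -> (compliant rows, total rows), skipping malformed lines."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                row = json.loads(line)
                combo_id = row["combo_id"]
                compliant = bool(row["pipeline_metrics"]["compliant"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # Same tolerance as _load_done: ignore lines that do not parse.
                continue
            counts[combo_id][0] += int(compliant)
            counts[combo_id][1] += 1
    return {k: (v[0], v[1]) for k, v in counts.items()}
```

Rows that failed with an error are written with `compliant: false`, so they count against the constraint cell in this summary.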
Task 9: Claude judge module
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/judge.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_judge.py
- [ ] Step 1: Write failing test using mocked Anthropic client
Create tests/test_judge.py:
import json
from unittest.mock import MagicMock, patch
import pytest
from judge import build_system_prompt, build_user_prompt, score_one
from schemas import load_rubric
from pathlib import Path
HERE = Path(__file__).parent
ROOT = HERE.parent
@pytest.fixture
def rubric():
return load_rubric(ROOT / "rubric.yaml")
def test_build_system_prompt_includes_all_dims(rubric):
sysp = build_system_prompt(rubric)
for dim in rubric.dimensions:
assert dim.id in sysp
assert dim.description in sysp
def test_build_user_prompt_includes_text(rubric):
user = build_user_prompt(prompt="Tell a story", text="Once upon a time...")
assert "Once upon a time" in user
assert "Tell a story" in user
def test_score_one_parses_response(rubric):
fake_response = MagicMock()
fake_response.content = [MagicMock()]
fake_response.content[0].text = json.dumps({
"grammaticality": {"score": 5, "rationale": "Well-formed."},
"coherence": {"score": 4, "rationale": "Coherent."},
"prompt_following": {"score": 5, "rationale": "Addresses prompt."},
"natural_ending": {"score": 3, "rationale": "Abrupt."},
"stays_in_english": {"score": 5, "rationale": "Fully English."},
})
fake_response.usage.input_tokens = 1234
fake_response.usage.output_tokens = 200
fake_client = MagicMock()
fake_client.messages.create.return_value = fake_response
result = score_one(
client=fake_client,
rubric=rubric,
prompt="Tell a story",
text="Once upon a time, a dog ran in the park.",
)
assert result["dim_scores"]["grammaticality"] == 5
assert result["dim_scores"]["natural_ending"] == 3
assert result["dim_rationales"]["coherence"] == "Coherent."
assert result["input_tokens"] == 1234
assert result["output_tokens"] == 200
assert result["model"] == rubric.judge.default_model
assert result["judge_cost_usd"] > 0
def test_score_one_handles_malformed_response(rubric):
fake_response = MagicMock()
fake_response.content = [MagicMock()]
fake_response.content[0].text = "This is not JSON."
fake_response.usage.input_tokens = 10
fake_response.usage.output_tokens = 5
fake_client = MagicMock()
fake_client.messages.create.return_value = fake_response
result = score_one(
client=fake_client,
rubric=rubric,
prompt="x",
text="y",
)
# Bad JSON -> all dim scores null, error captured
assert all(v is None for v in result["dim_scores"].values())
assert result["error"] is not None
- [ ] Step 2: Run test, verify it fails
uv run python -m pytest tests/test_judge.py -v
Expected: FAIL with "ModuleNotFoundError: No module named 'judge'".
- [ ] Step 3: Implement judge.py
Create judge.py:
"""Claude-based rubric judge.
Single API call per generation. System prompt = rubric definitions + scale anchors,
cached via Anthropic prompt caching. User prompt = the prompt+text being scored.
Output: JSON with score+rationale per dimension.
"""
from __future__ import annotations
import json
import os
import re
import time
from typing import Any
from anthropic import Anthropic, APIError
from schemas import Rubric
# Judge model pricing (as of 2026-04, USD per million tokens)
PRICING_USD_PER_MTOK = {
"claude-haiku-4-5": {"input": 1.0, "output": 5.0, "cache_read": 0.1, "cache_write": 1.25},
"claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_read": 0.3, "cache_write": 3.75},
}
def build_system_prompt(rubric: Rubric) -> str:
"""System prompt — cached via Anthropic prompt caching when called."""
lines = [
"You are an expert evaluator of short text generations from a constrained-generation system.",
"",
"Score the generation on each dimension below. Each dimension is independent — score them separately.",
"",
"DIMENSIONS:",
"",
]
for d in rubric.dimensions:
lines.append(f"## {d.id}: {d.description}")
lines.append(f"Scale: {min(d.scale)}-{max(d.scale)}")
for s, anchor in sorted(d.anchors.items()):
lines.append(f" {s}: {anchor}")
lines.append("")
lines.extend([
"OUTPUT FORMAT:",
"Return a JSON object with one key per dimension. Each value is an object with `score` (integer in scale) and `rationale` (one short sentence).",
"",
"Example:",
'```json',
"{",
])
for i, d in enumerate(rubric.dimensions):
comma = "," if i < len(rubric.dimensions) - 1 else ""
lines.append(f' "{d.id}": {{"score": 4, "rationale": "<short reason>"}}{comma}')
lines.extend([
"}",
"```",
"",
"Respond with ONLY the JSON object, no preamble or commentary.",
])
return "\n".join(lines)
def build_user_prompt(prompt: str, text: str) -> str:
return (
f"PROMPT GIVEN TO THE GENERATOR:\n{prompt}\n\n"
f"GENERATED OUTPUT:\n{text}\n\n"
"Score this generation on each dimension and return the JSON object."
)
def _extract_json(s: str) -> dict | None:
    """Pull the first JSON object out of a string. Tolerates ```json fences."""
    candidates: list[str] = []
    fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", s, re.DOTALL)
    if fence:
        candidates.append(fence.group(1))
    # Then try the whole string
    candidates.append(s)
    # Finally, a greedy match from the first "{" to the last "}"
    # (a heuristic, not a true balanced-brace parse)
    m = re.search(r"\{.*\}", s, re.DOTALL)
    if m:
        candidates.append(m.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        # Only a JSON object is usable; a bare list or scalar is rejected.
        if isinstance(parsed, dict):
            return parsed
    return None


def _cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    # Only the base input/output rates are applied; the cache_read and
    # cache_write rates in the table are not used here.
    p = PRICING_USD_PER_MTOK.get(model)
    if not p:
        return 0.0
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
def score_one(
client: Anthropic,
rubric: Rubric,
prompt: str,
text: str,
model: str | None = None,
max_retries: int = 3,
) -> dict[str, Any]:
"""Score a single (prompt, text) pair with the Claude judge. Retries on transient failures."""
chosen_model = model or rubric.judge.default_model
sysp = build_system_prompt(rubric)
user = build_user_prompt(prompt, text)
dim_ids = [d.id for d in rubric.dimensions]
last_error: str | None = None
response = None
for attempt in range(max_retries):
try:
t0 = time.time()
kwargs = {
"model": chosen_model,
"max_tokens": 1024,
"messages": [{"role": "user", "content": user}],
}
if rubric.judge.cache_system_prompt:
kwargs["system"] = [
{"type": "text", "text": sysp, "cache_control": {"type": "ephemeral"}}
]
else:
kwargs["system"] = sysp
response = client.messages.create(**kwargs)
judge_ms = int((time.time() - t0) * 1000)
break
except APIError as e:
last_error = str(e)
if attempt == max_retries - 1:
break
time.sleep(2 ** attempt)
except Exception as e:
last_error = str(e)
break
if response is None:
return {
"model": chosen_model,
"dim_scores": {d: None for d in dim_ids},
"dim_rationales": {d: None for d in dim_ids},
"judge_ms": 0,
"judge_cost_usd": 0.0,
"input_tokens": 0,
"output_tokens": 0,
"error": last_error or "unknown",
}
raw = response.content[0].text if response.content else ""
parsed = _extract_json(raw)
dim_scores: dict[str, int | None] = {d: None for d in dim_ids}
dim_rationales: dict[str, str | None] = {d: None for d in dim_ids}
error = None
if parsed is None:
error = f"failed to parse judge JSON: {raw[:200]}"
else:
        for did in dim_ids:
            cell = parsed.get(did)
            if isinstance(cell, dict):
                s = cell.get("score")
                # bool is a subclass of int; reject True/False as scores.
                if isinstance(s, int) and not isinstance(s, bool):
                    dim_scores[did] = s
                dim_rationales[did] = cell.get("rationale")
return {
"model": chosen_model,
"dim_scores": dim_scores,
"dim_rationales": dim_rationales,
"judge_ms": judge_ms,
"judge_cost_usd": _cost_usd(
chosen_model, response.usage.input_tokens, response.usage.output_tokens
),
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
"error": error,
}
def make_client() -> Anthropic:
"""Create an Anthropic client. Reads ANTHROPIC_API_KEY from env."""
key = os.environ.get("ANTHROPIC_API_KEY")
if not key:
raise RuntimeError(
"ANTHROPIC_API_KEY not set. Source from Repos/eureka or your local secret store."
)
return Anthropic(api_key=key)
- [ ] Step 4: Run tests, verify they pass
uv run python -m pytest tests/test_judge.py -v
Expected: 4 passed.
- [ ] Step 5: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/judge.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_judge.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — Claude judge module
score_one() takes (rubric, prompt, text), builds a cacheable system prompt from rubric dims, calls Claude (default haiku-4-5), parses structured JSON output, returns dim scores + rationales + cost. Retries with exponential backoff on transient failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
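The pricing table in judge.py lists `cache_read` and `cache_write` rates, but `_cost_usd` bills every token at the base input and output rates. When prompt caching is active, the Anthropic usage object reports cached tokens separately from `input_tokens` (as `cache_read_input_tokens` and `cache_creation_input_tokens`), so a cache-aware estimate could look like the sketch below. The function name and the `HAIKU_RATES` constant are illustrative; the rates are copied from the table in judge.py and are not independently verified.

```python
def cost_usd_with_cache(
    rates: dict[str, float],
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: int = 0,
    cache_write_tokens: int = 0,
) -> float:
    """Estimate cost in USD from per-million-token rates, counting cache traffic."""
    total = (
        input_tokens * rates["input"]
        + output_tokens * rates["output"]
        + cache_read_tokens * rates["cache_read"]
        + cache_write_tokens * rates["cache_write"]
    )
    return total / 1_000_000


# Assumed rates, taken from PRICING_USD_PER_MTOK["claude-haiku-4-5"] in judge.py.
HAIKU_RATES = {"input": 1.0, "output": 5.0, "cache_read": 0.1, "cache_write": 1.25}
```

A caller would pass `getattr(response.usage, "cache_read_input_tokens", 0) or 0` and the matching creation field, so that responses without cache fields still cost out at the base rates.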
Task 10: score.py (Stage 2 CLI)
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/score.py
- [ ] Step 1: Implement score.py
Create score.py:
"""Stage 2: scoring runner.
Usage:
uv run python score.py <run_id>
uv run python score.py runs/baseline-v6-20260429T120000Z
Reads runs/<run_id>/generations.jsonl, augments each row with auto metrics +
judge scores, writes runs/<run_id>/scored.jsonl. Idempotent — skips rows already
scored (matched on (run_id, combo_id, prompt_id) key).
"""
from __future__ import annotations
import argparse
import json
import sys
from datetime import datetime, timezone
from pathlib import Path
from auto_metrics import GPT2PPLScorer, VocabGate, distinct_n, max_ngram_rep, real_english_rate
from judge import make_client, score_one
from schemas import load_rubric
HERE = Path(__file__).parent
RUNS_DIR = HERE / "runs"
def _load_scored(path: Path) -> set[str]:
if not path.exists():
return set()
out = set()
with open(path) as f:
for line in f:
try:
row = json.loads(line)
out.add(f"{row['run_id']}|{row['combo_id']}|{row['prompt_id']}")
except (json.JSONDecodeError, KeyError):
continue
return out
def _resolve_run_dir(run_id: str) -> Path:
"""Accept either a run_id or a path to the run directory."""
candidate = Path(run_id)
if candidate.is_dir() and (candidate / "generations.jsonl").exists():
return candidate
candidate = RUNS_DIR / run_id
if candidate.is_dir():
return candidate
print(f"ERROR: run not found: {run_id}", file=sys.stderr)
sys.exit(1)
def run(run_id: str, model_override: str | None) -> None:
run_dir = _resolve_run_dir(run_id)
generations_path = run_dir / "generations.jsonl"
scored_path = run_dir / "scored.jsonl"
rubric = load_rubric(HERE / "rubric.yaml")
vocab = VocabGate(HERE / "vocab.txt")
print(f"Vocab gate loaded: {len(vocab):,} entries")
print("Loading GPT-2 PPL scorer...")
ppl = GPT2PPLScorer()
client = make_client()
done = _load_scored(scored_path)
rows: list[dict] = []
with open(generations_path) as f:
for line in f:
line = line.strip()
if line:
rows.append(json.loads(line))
todo = [r for r in rows if f"{r['run_id']}|{r['combo_id']}|{r['prompt_id']}" not in done]
print(f"{len(rows)} total, {len(done)} already scored, {len(todo)} to score")
total_cost = 0.0
with open(scored_path, "a") as out:
for i, row in enumerate(todo, 1):
text = row.get("text", "")
if not text or row.get("error"):
# Skip judging — write a row with null scores so we don't reprocess
scored_row = {
**row,
"auto_metrics": {
"real_english_rate": None,
"distinct_3": None,
"distinct_5": None,
"max_3gram_rep": None,
"ppl_gpt2": None,
},
"judge": {
"model": rubric.judge.default_model,
"dim_scores": {d.id: None for d in rubric.dimensions},
"dim_rationales": {d.id: None for d in rubric.dimensions},
"judge_ms": 0,
"judge_cost_usd": 0.0,
"input_tokens": 0,
"output_tokens": 0,
"error": "skipped (no text)",
},
"scored_at": datetime.now(timezone.utc).isoformat(),
}
out.write(json.dumps(scored_row, ensure_ascii=False) + "\n")
out.flush()
continue
auto = {
"real_english_rate": real_english_rate(text, vocab),
"distinct_3": distinct_n(text, 3),
"distinct_5": distinct_n(text, 5),
"max_3gram_rep": max_ngram_rep(text, 3),
"ppl_gpt2": ppl.ppl(text),
}
judge = score_one(
client=client,
rubric=rubric,
prompt=row.get("prompt", ""),
text=text,
model=model_override,
)
total_cost += judge["judge_cost_usd"]
scored_row = {
**row,
"auto_metrics": auto,
"judge": judge,
"scored_at": datetime.now(timezone.utc).isoformat(),
}
out.write(json.dumps(scored_row, ensure_ascii=False) + "\n")
out.flush()
            # real_english_rate is the share of in-vocabulary words, not an OOV rate.
            print(
                f"[{i}/{len(todo)}] {row['combo_id']} × {row['prompt_id']} "
                f"real_en={auto['real_english_rate']:.0%} "
                f"PPL={auto['ppl_gpt2']:.0f} "
                f"gram={judge['dim_scores'].get('grammaticality')} "
                f"coh={judge['dim_scores'].get('coherence')} "
                f"${judge['judge_cost_usd']:.4f}"
            )
print(f"\nScored: {len(todo)} new rows. Cumulative judge cost: ${total_cost:.4f}")
print(f"Output: {scored_path}")
def main():
parser = argparse.ArgumentParser(description="Stage 2: score generations")
parser.add_argument("run_id", help="Run id or path to run dir")
parser.add_argument("--model", default=None, help="Override judge model (e.g., claude-sonnet-4-6)")
args = parser.parse_args()
run(args.run_id, args.model)
if __name__ == "__main__":
main()
- [ ] Step 2: Sanity-check imports + CLI
cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python score.py --help
Expected: argparse help output.
- [ ] Step 3: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/score.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — score.py (Stage 2 scoring runner)
For each generation row: compute auto metrics (real_english_rate, distinct_3/5, max_3gram_rep, ppl_gpt2) + Claude judge scores. Idempotent — skips rows already in scored.jsonl. Tracks cumulative judge cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
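The auto metrics named in this commit come from auto_metrics.py, which is defined in an earlier task. For orientation, the usual definitions are: distinct-n is the number of unique n-grams divided by the total number of n-grams, and the max n-gram repetition is the highest count of any single n-gram. The sketch below implements those definitions under an assumed lowercase whitespace tokenization, which may differ from the tokenizer the harness actually uses; the `_sketch` names are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def _ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    # Assumption: lowercase whitespace tokens; the real harness may tokenize differently.
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n_sketch(text: str, n: int) -> float | None:
    """Unique n-grams / total n-grams; None when the text is shorter than n tokens."""
    grams = _ngrams(text, n)
    if not grams:
        return None
    return len(set(grams)) / len(grams)


def max_ngram_rep_sketch(text: str, n: int) -> int:
    """Highest occurrence count of any single n-gram; 0 when there are none."""
    grams = _ngrams(text, n)
    return max(Counter(grams).values(), default=0)
```

A low distinct-n together with a high max repetition count is the typical signature of a generation stuck in a loop.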
Task 11: compare.py (Stage 3 CLI)
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/compare.py
- [ ] Step 1: Implement compare.py
Create compare.py:
"""Stage 3: comparison report.
Usage:
uv run python compare.py <run_id> # single-run report
uv run python compare.py <run_id_a> <run_id_b> [<run_id_c>] # multi-run comparison
Aggregates one or more scored.jsonl files, emits a markdown report under reports/.
"""
from __future__ import annotations
import argparse
import json
import statistics
import sys
from datetime import datetime, timezone
from pathlib import Path
HERE = Path(__file__).parent
RUNS_DIR = HERE / "runs"
REPORTS_DIR = HERE / "reports"
JUDGE_DIMS = ["grammaticality", "coherence", "prompt_following", "natural_ending", "stays_in_english"]
AUTO_METRICS = ["real_english_rate", "distinct_3", "distinct_5", "max_3gram_rep", "ppl_gpt2"]
def _resolve_run_dir(run_id: str) -> Path:
candidate = Path(run_id)
if candidate.is_dir():
return candidate
return RUNS_DIR / run_id
def load_scored(run_id: str) -> tuple[str, list[dict]]:
run_dir = _resolve_run_dir(run_id)
path = run_dir / "scored.jsonl"
if not path.exists():
print(f"ERROR: scored.jsonl not found at {path}", file=sys.stderr)
sys.exit(1)
rows = []
with open(path) as f:
for line in f:
line = line.strip()
if line:
rows.append(json.loads(line))
return run_dir.name, rows
def _mean(values: list[float | int | None]) -> float | None:
nums = [v for v in values if v is not None]
if not nums:
return None
return statistics.fmean(nums)
def aggregate(rows: list[dict]) -> dict[str, float | None]:
out: dict[str, float | None] = {}
for d in JUDGE_DIMS:
out[d] = _mean([r["judge"]["dim_scores"].get(d) for r in rows])
for m in AUTO_METRICS:
out[m] = _mean([r["auto_metrics"].get(m) for r in rows])
out["cost_usd_total"] = sum(r["judge"].get("judge_cost_usd", 0.0) for r in rows)
out["count"] = len(rows)
return out
def worst_per_dim(rows: list[dict], dim: str, n: int = 3) -> list[dict]:
scored = [(r, r["judge"]["dim_scores"].get(dim)) for r in rows]
scored = [(r, s) for r, s in scored if s is not None]
scored.sort(key=lambda x: x[1])
return [r for r, _ in scored[:n]]
def best_per_dim(rows: list[dict], dim: str, n: int = 3) -> list[dict]:
scored = [(r, r["judge"]["dim_scores"].get(dim)) for r in rows]
scored = [(r, s) for r, s in scored if s is not None]
scored.sort(key=lambda x: x[1], reverse=True)
return [r for r, _ in scored[:n]]
def fmt(value: float | None, decimals: int = 2) -> str:
if value is None:
return "—"
return f"{value:.{decimals}f}"
def render_summary_table(per_run: dict[str, dict]) -> str:
runs = list(per_run.keys())
lines = ["| Metric | " + " | ".join(runs) + (" | Δ |" if len(runs) == 2 else " |")]
lines.append("|" + "---|" * (len(runs) + (2 if len(runs) == 2 else 1)))
metrics = JUDGE_DIMS + AUTO_METRICS
for m in metrics:
cells = []
for r in runs:
cells.append(fmt(per_run[r].get(m)))
if len(runs) == 2:
a = per_run[runs[0]].get(m)
b = per_run[runs[1]].get(m)
if a is not None and b is not None:
cells.append(f"{b - a:+.2f}")
else:
cells.append("—")
lines.append(f"| {m} | " + " | ".join(cells) + " |")
# totals
cost_cells = [fmt(per_run[r].get("cost_usd_total"), 4) for r in runs]
if len(runs) == 2:
cost_cells.append("—")
    lines.append("| total cost USD | " + " | ".join(cost_cells) + " |")
return "\n".join(lines)
def render_examples(rows: list[dict], dim: str, kind: str = "worst", n: int = 3) -> str:
selected = (worst_per_dim if kind == "worst" else best_per_dim)(rows, dim, n)
lines = []
for r in selected:
score = r["judge"]["dim_scores"].get(dim, "?")
rationale = r["judge"]["dim_rationales"].get(dim, "?")
snippet = (r.get("text") or "").replace("\n", " ")[:200]
lines.append(
f"- **{r['combo_id']} × {r['prompt_id']}** — score {score}\n"
f" > {snippet}\n"
f" *Rationale:* {rationale}"
)
return "\n".join(lines) if lines else "_(no rows)_"
def render_report(per_run: dict[str, list[dict]]) -> str:
runs = list(per_run.keys())
aggregates = {r: aggregate(rows) for r, rows in per_run.items()}
lines = [
f"# Eval Comparison: {' vs '.join(runs)}",
"",
f"_Generated {datetime.now(timezone.utc).isoformat()}_",
"",
"## Summary",
"",
render_summary_table(aggregates),
"",
]
# Per-run examples
for run_id, rows in per_run.items():
lines.extend([
f"## Run: {run_id}",
"",
f"Generations: {len(rows)} | total judge cost: ${aggregates[run_id]['cost_usd_total']:.4f}",
"",
])
for dim in JUDGE_DIMS:
lines.extend([f"### Worst on `{dim}` (run: {run_id})", ""])
lines.append(render_examples(rows, dim, kind="worst", n=3))
lines.append("")
# Auto-vs-judge sanity check (cheap proxy: real_english_rate vs stays_in_english)
lines.extend(["## Sanity check: real_english_rate vs stays_in_english", ""])
for run_id, rows in per_run.items():
rer_se = [
(r["auto_metrics"].get("real_english_rate"), r["judge"]["dim_scores"].get("stays_in_english"))
for r in rows
]
rer_se = [(a, b) for a, b in rer_se if a is not None and b is not None]
if len(rer_se) >= 2:
try:
corr = statistics.correlation([a for a, _ in rer_se], [b for _, b in rer_se])
lines.append(f"- {run_id}: Pearson r = {corr:+.2f} (n={len(rer_se)})")
except statistics.StatisticsError:
lines.append(f"- {run_id}: correlation undefined (constant data)")
else:
lines.append(f"- {run_id}: insufficient data")
lines.append("")
return "\n".join(lines)
def run(run_ids: list[str]) -> None:
per_run: dict[str, list[dict]] = {}
for rid in run_ids:
name, rows = load_scored(rid)
per_run[name] = rows
report = render_report(per_run)
REPORTS_DIR.mkdir(parents=True, exist_ok=True)
timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
name = "-vs-".join(per_run.keys())[:80]
report_path = REPORTS_DIR / f"{timestamp}-{name}.md"
report_path.write_text(report)
print(report)
print(f"\nReport saved to {report_path}")
def main():
parser = argparse.ArgumentParser(description="Stage 3: compare scored runs")
parser.add_argument("run_ids", nargs="+", help="Run id(s) or path(s) to run dir(s)")
args = parser.parse_args()
run(args.run_ids)
if __name__ == "__main__":
main()
- [ ] Step 2: Sanity-check CLI
cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python compare.py --help
Expected: argparse help.
- [ ] Step 3: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/compare.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — compare.py (Stage 3 markdown report)
Aggregates one or more scored runs into a markdown report: per-dim summary table with deltas, worst-3-per-dim examples per run, auto-vs-judge sanity correlation.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Task 12: experiments.jsonl helpers (autoresearch readiness)
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/experiments.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_experiments.py
- [ ] Step 1: Write failing test
Create tests/test_experiments.py:
import json
from pathlib import Path
from experiments import append_experiment, ExperimentEntry
def test_append_creates_file(tmp_path):
log = tmp_path / "experiments.jsonl"
entry = ExperimentEntry(
actor="human",
hypothesis="Lower rep_penalty improves naturalness",
config_path="configs/lower-rep-penalty.yaml",
run_id="lower-rep-penalty-20260429T120000Z",
comparison_against="baseline-v6-20260429T100000Z",
verdict="rejected",
verdict_evidence="grammaticality unchanged, real_english_rate dropped 8pp",
)
append_experiment(log, entry)
assert log.exists()
with open(log) as f:
lines = f.readlines()
assert len(lines) == 1
parsed = json.loads(lines[0])
assert parsed["hypothesis"] == "Lower rep_penalty improves naturalness"
assert "ts" in parsed
def test_append_appends_to_existing(tmp_path):
log = tmp_path / "experiments.jsonl"
log.write_text('{"existing": true}\n')
entry = ExperimentEntry(
actor="agent",
hypothesis="x",
config_path="y",
run_id="z",
)
append_experiment(log, entry)
with open(log) as f:
lines = f.readlines()
assert len(lines) == 2
- [ ] Step 2: Run test, verify it fails
uv run python -m pytest tests/test_experiments.py -v
Expected: FAIL with import error.
- [ ] Step 3: Implement experiments.py
Create experiments.py:
"""experiments.jsonl — append-only log of hypothesis → run → verdict entries.
Both human and agent contribute. Same primitives drive manual and autoresearch loops.
"""
from __future__ import annotations
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Literal
@dataclass
class ExperimentEntry:
    actor: Literal["human", "agent"]
    hypothesis: str
    config_path: str
    run_id: str
    scored_path: str | None = None
    comparison_against: str | None = None
    verdict: str | None = None  # 'accepted', 'rejected', 'mixed', None=pending
    verdict_evidence: str | None = None
    ts: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def append_experiment(log_path: Path, entry: ExperimentEntry) -> None:
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(entry), ensure_ascii=False) + "\n")


def read_experiments(log_path: Path) -> list[dict]:
    if not log_path.exists():
        return []
    out = []
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if line:
                out.append(json.loads(line))
    return out
- [ ] Step 4: Run tests, verify they pass
uv run python -m pytest tests/test_experiments.py -v
Expected: 2 passed.
- [ ] Step 5: Commit
git add packages/generation/research/2026-04-29-eval-harness-v1/experiments.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_experiments.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — experiments.jsonl helpers
Append-only log of hypothesis → config → run → verdict entries. Same primitives drive manual and autoresearch flows.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
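Because a `verdict` of `None` marks an entry as pending, an autoresearch loop can find unfinished work by scanning the log for entries without a verdict. The following is a sketch of that lookup under the entry schema above; `pending_experiments` is a hypothetical helper, written here without importing experiments.py so that it stands alone.

```python
import json
from pathlib import Path


def pending_experiments(log_path: Path) -> list[dict]:
    """Return logged entries that have a run_id but no verdict yet."""
    if not log_path.exists():
        return []
    pending = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            # Skip rows that do not follow the ExperimentEntry shape.
            if "run_id" in entry and entry.get("verdict") is None:
                pending.append(entry)
    return pending
```

Since the log is append-only, recording a verdict for a pending run means appending a new entry. A fuller version would therefore keep only the latest entry per `run_id` before filtering.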
Task 13: E2E integration test
Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/conftest.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_e2e.py
- [ ] Step 1: Write the e2e test
Create tests/conftest.py:
"""Shared fixtures."""
import sys
from pathlib import Path
# Make the harness modules importable in tests
sys.path.insert(0, str(Path(__file__).parent.parent))
Create tests/test_e2e.py:
"""End-to-end integration test.
One config × one prompt × one constraint, end-to-end with stubbed Claude judge
and stubbed SSE response. Exercises generate.py → score.py → compare.py.
"""
import json
import sys
from pathlib import Path
from unittest.mock import MagicMock, patch
import pytest
HERE = Path(__file__).parent
ROOT = HERE.parent
@pytest.fixture
def fake_sse_response():
return (
'data: {"status": "Vocabulary survival: 80%"}\n'
'data: {"status": " Draft 1: compliant"}\n'
'data: {"result": {"text": "The dog ran across the park.", "compliant": true, "violation_count": 0, "violation_words": [], "boost_coverage": [], "warnings": null, "gen_time_ms": 1500}}\n'
)
def _mock_httpx_stream(body: str):
    """Build a context-manager mock whose response streams `body` as one text chunk."""
    mock_resp = MagicMock()
    mock_resp.iter_text.return_value = [body]
    mock_resp.raise_for_status = MagicMock()
    mock_stream = MagicMock()
    mock_stream.__enter__ = MagicMock(return_value=mock_resp)
    mock_stream.__exit__ = MagicMock(return_value=None)
    return mock_stream


def _mock_anthropic_response():
    """Build a fake Anthropic message whose text is a JSON verdict for every rubric dim."""
    fake_response = MagicMock()
    fake_response.content = [MagicMock()]
    fake_response.content[0].text = json.dumps({
        "grammaticality": {"score": 5, "rationale": "Good."},
        "coherence": {"score": 4, "rationale": "Clear."},
        "prompt_following": {"score": 5, "rationale": "On topic."},
        "natural_ending": {"score": 4, "rationale": "Reasonable."},
        "stays_in_english": {"score": 5, "rationale": "All English."},
    })
    fake_response.usage.input_tokens = 1000
    fake_response.usage.output_tokens = 100
    return fake_response


def test_e2e_one_cell(tmp_path, fake_sse_response, monkeypatch):
    """generate.py → score.py → compare.py with a single cell."""
    # Set up a temporary harness dir mirror
    harness = tmp_path / "harness"
    harness.mkdir()
    (harness / "configs").mkdir()
    (harness / "tests").mkdir()
    (harness / "runs").mkdir()
    (harness / "reports").mkdir()

    # Copy YAMLs from real harness
    for name in ["rubric.yaml", "prompts.yaml", "constraints.yaml", "vocab.txt"]:
        (harness / name).write_text((ROOT / name).read_text())
    (harness / "configs" / "baseline-v6.yaml").write_text((ROOT / "configs" / "baseline-v6.yaml").read_text())

    # Reduce the prompt + constraint pools to 1 each for fast e2e
    (harness / "prompts.yaml").write_text(
        '- {id: dog_park, text: "Write a short paragraph about a dog playing in the park.", genre: narrative}\n'
    )
    (harness / "constraints.yaml").write_text(
        '- {id: exc_r, type: exclude, phonemes: ["ɹ"]}\n'
    )

    # Make harness importable
    monkeypatch.syspath_prepend(str(harness))
    monkeypatch.chdir(harness)

    # Force-reimport modules from this dir
    for mod in ["generate", "score", "compare", "schemas", "sse_client", "auto_metrics", "judge", "experiments"]:
        if mod in sys.modules:
            del sys.modules[mod]

    # Copy the python modules in too
    for mod in ["schemas.py", "sse_client.py", "auto_metrics.py", "judge.py", "generate.py", "score.py", "compare.py", "experiments.py"]:
        (harness / mod).write_text((ROOT / mod).read_text())

    import generate
    import score
    import compare

    # Patch the SSE call + server-status check
    with patch("sse_client.httpx.Client") as mock_httpx_client_cls, \
         patch("generate.httpx.get") as mock_status:
        # Server status check returns ready
        mock_status.return_value = MagicMock(
            json=lambda: {"status": "ready", "model": "stub"},
            raise_for_status=lambda: None,
        )
        # SSE client returns canned response
        mock_client_instance = MagicMock()
        mock_client_instance.stream = MagicMock(return_value=_mock_httpx_stream(fake_sse_response))
        mock_client_instance.__enter__ = MagicMock(return_value=mock_client_instance)
        mock_client_instance.__exit__ = MagicMock(return_value=None)
        mock_httpx_client_cls.return_value = mock_client_instance

        generate.run("baseline-v6", "http://localhost:8000", resume=False)

    # Verify generations.jsonl exists with one row
    runs = list((harness / "runs").iterdir())
    assert len(runs) == 1
    run_dir = runs[0]
    gens_path = run_dir / "generations.jsonl"
    assert gens_path.exists()
    with open(gens_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    assert len(rows) == 1
    assert rows[0]["text"] == "The dog ran across the park."

    # Now stub Claude + run score.py
    with patch("judge.make_client") as mock_make_client:
        fake_client = MagicMock()
        fake_client.messages.create.return_value = _mock_anthropic_response()
        mock_make_client.return_value = fake_client

        score.run(run_dir.name, model_override=None)

    scored_path = run_dir / "scored.jsonl"
    assert scored_path.exists()
    with open(scored_path) as f:
        scored = [json.loads(line) for line in f if line.strip()]
    assert len(scored) == 1
    assert scored[0]["judge"]["dim_scores"]["grammaticality"] == 5
    assert scored[0]["auto_metrics"]["real_english_rate"] == 1.0

    # Run compare.py: single-run report
    compare.run([run_dir.name])
    reports = list((harness / "reports").iterdir())
    assert len(reports) == 1
    report_text = reports[0].read_text()
    assert run_dir.name in report_text
    assert "grammaticality" in report_text
- [ ] Step 2: Run e2e test, verify it passes
cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python -m pytest tests/test_e2e.py -v
Expected: 1 passed. (May take 10-30s — first run loads GPT-2.)
- [ ] Step 3: Run full test suite as final check
uv run python -m pytest tests/ -v
Expected: ~17 tests pass total across all test files.
- [ ] Step 4: Commit (from the repository root; the paths below are root-relative)
git add packages/generation/research/2026-04-29-eval-harness-v1/tests/conftest.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_e2e.py
git commit -m "$(cat <<'EOF'
test(generation/research): PHON-57 — e2e integration test
generate → score → compare with stubbed SSE + stubbed Claude. Exercises the full pipeline against a 1×1 cell, verifies generations.jsonl, scored.jsonl, and the markdown report all materialize correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
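For orientation, the canned body in the fake_sse_response fixture is a sequence of `data: {json}` lines, where the last event carries a `result` object. The sketch below shows one way such a body could be turned into events. `parse_sse_events` and `final_result` are hypothetical helper names used only for illustration, and the real sse_client.py may be structured differently.

```python
import json
from typing import Iterable, Iterator, Optional


def parse_sse_events(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per `data: {...}` line found in the streamed text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are parsed; a trailing partial line stays buffered.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if line.startswith("data:"):
                yield json.loads(line[len("data:"):].strip())
    # Flush a final line that arrived without a trailing newline.
    tail = buffer.strip()
    if tail.startswith("data:"):
        yield json.loads(tail[len("data:"):].strip())


def final_result(chunks: Iterable[str]) -> Optional[dict]:
    """Return the payload of the last event that has a `result` key, if any."""
    result = None
    for event in parse_sse_events(chunks):
        if "result" in event:
            result = event["result"]
    return result


# Shortened version of the fixture body (assumption: same event shape).
body = (
    'data: {"status": " Draft 1: compliant"}\n'
    'data: {"result": {"text": "The dog ran across the park.", "compliant": true, "violation_count": 0}}\n'
)
result = final_result([body])
```

Buffering by line matters because a streaming client may split an event across chunks. The mock in the test delivers the whole body as a single chunk, so it does not exercise that case.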
Task 14: First baseline run + notebook entry
This task is operational: it runs the harness for real against a live generation server and logs the result in the lab notebook. Skip it if the server isn't available locally, and come back to it after PHON-63 lands.
Files:
- Modify: packages/generation/research/2026-04-29-eval-harness-v1/notebook.md
- [ ] Step 1: Confirm ANTHROPIC_API_KEY is set
echo "${ANTHROPIC_API_KEY:0:10}..."
If this prints only "...", the variable is unset. Source it from ~/Repos/eureka (per project memory) or set it with:
export ANTHROPIC_API_KEY=sk-ant-...
- [ ] Step 2: Confirm generation server is running
curl http://localhost:8000/api/server/status
Expected: {"status": "ready", ...}. If not running:
cd packages/generation
uv run python scripts/build_lookup.py
uv run uvicorn server.main:app --host 0.0.0.0 --port 8000
(Leave running in another terminal.)
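If you would rather script the readiness check than eyeball the curl output, a minimal sketch follows. It assumes only what the step above states: that /api/server/status returns JSON with a "status" field equal to "ready" once the server is up. `fetch_status` and `server_is_ready` are hypothetical names, not part of the harness.

```python
import json
import urllib.request
from typing import Callable


def fetch_status(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/api/server/status and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/server/status", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def server_is_ready(base_url: str, fetch: Callable[[str], dict] = fetch_status) -> bool:
    """Return True only when the status payload reports "ready".

    Connection failures (URLError is an OSError) and malformed JSON
    (a ValueError) are treated as "not ready" rather than raised.
    """
    try:
        payload = fetch(base_url)
    except (OSError, ValueError):
        return False
    return payload.get("status") == "ready"


# Example (requires the server from this step to be running):
# print(server_is_ready("http://localhost:8000"))
```

The `fetch` parameter exists so the check can be exercised without a live server, in the same spirit as the stubbed status call in the e2e test.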
- [ ] Step 3: Confirm vocab.txt is present
ls -lh packages/generation/research/2026-04-29-eval-harness-v1/vocab.txt
If missing, run ./refresh_vocab.sh in that directory.
- [ ] Step 4: Run the baseline sweep
cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python generate.py baseline-v6
Expected: 40 generations (8 constraints × 5 prompts), takes ~10–20 minutes depending on T5Gemma latency. Output: runs/baseline-v6-<timestamp>/generations.jsonl.
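The expected count is the cross product of the two pools. The sketch below spells out that arithmetic and the per-generation latency implied by the 10 to 20 minute estimate. The prompt genres are the five listed in the file structure; the constraint ids are placeholders, since the actual ids live in constraints.yaml.

```python
from itertools import product

# Placeholder ids standing in for the 8 cells defined in constraints.yaml.
constraint_ids = [f"constraint_{i}" for i in range(1, 9)]
# The five prompt genres named in the plan's file structure.
prompt_genres = ["narrative", "kids", "declamatory", "procedural", "instructional"]

# One generation per (constraint, prompt) pair for a single config.
cells = list(product(constraint_ids, prompt_genres))
n_cells = len(cells)


def per_generation_seconds(total_minutes: float, n_generations: int) -> float:
    """Average seconds per generation implied by a total sweep duration."""
    return total_minutes * 60 / n_generations


low = per_generation_seconds(10, n_cells)   # 15.0 seconds
high = per_generation_seconds(20, n_cells)  # 30.0 seconds
```

If a sweep runs well outside the 15 to 30 second per-generation range, that is a hint to check server load before trusting the run.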
- [ ] Step 5: Score the run
uv run python score.py runs/baseline-v6-<timestamp>
Expected: ~$0.12 in Claude API cost, takes ~5–10 minutes.
- [ ] Step 6: Generate the report
uv run python compare.py runs/baseline-v6-<timestamp>
Expected: markdown report in reports/<timestamp>-baseline-v6-...md.
- [ ] Step 7: Add notebook entry
Append to notebook.md:
## Run 1: baseline-v6 — 2026-04-29 (or actual date)
**Config:** `configs/baseline-v6.yaml`
**Run id:** `baseline-v6-<timestamp>`
**Report:** `reports/<timestamp>-baseline-v6-...md`
**Setup:** local generation server, T5Gemma 9B-2B, vocab.txt = D1 207K snapshot.
**Headline numbers (fill in from report):**
- grammaticality: ?
- coherence: ?
- prompt_following: ?
- natural_ending: ?
- stays_in_english: ?
- real_english_rate: ?
- distinct_3: ?
- ppl_gpt2: ?
**Worst cells (top 1 per failure-mode dim):**
- grammaticality: <combo × prompt> — <snippet>
- stays_in_english: <combo × prompt> — <snippet>
- real_english_rate: <combo × prompt> — <snippet>
**First takeaway:** <one paragraph synthesizing the baseline picture>
**Hypotheses generated (fed into PHON-58 / PHON-59):**
- <hypothesis 1>
- <hypothesis 2>
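The headline numbers can be copied from the report, or recomputed directly from scored.jsonl as a cross-check. The sketch below assumes the row shape asserted in the e2e test, with `judge.dim_scores` mapping dimension to score and `auto_metrics` mapping metric to value. `headline_means` is a hypothetical helper, and a plain mean is only one possible aggregation, so the report may aggregate differently.

```python
import json
from collections import defaultdict
from statistics import mean
from typing import Iterable


def headline_means(lines: Iterable[str]) -> dict[str, float]:
    """Average every judge dimension and numeric auto metric across scored rows."""
    values = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        row = json.loads(line)
        for dim, score in row["judge"]["dim_scores"].items():
            values[dim].append(score)
        for name, value in row["auto_metrics"].items():
            # Skip non-numeric metrics (and booleans, which are ints in Python).
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                values[name].append(value)
    return {name: round(mean(vals), 3) for name, vals in values.items()}


# Usage, with the run directory placeholder left as in the steps above:
# with open("runs/baseline-v6-<timestamp>/scored.jsonl") as f:
#     print(headline_means(f))
```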
- [ ] Step 8: Append to experiments.jsonl
uv run python -c "
from pathlib import Path
from experiments import append_experiment, ExperimentEntry
append_experiment(
Path('experiments.jsonl'),
ExperimentEntry(
actor='human',
hypothesis='Establish baseline measurement of v6 governed generation under representative prompt × constraint matrix',
config_path='configs/baseline-v6.yaml',
run_id='<actual run_id>',
scored_path='runs/<run_id>/scored.jsonl',
verdict='baseline',
verdict_evidence='See notebook.md Run 1 entry',
)
)
"
- [ ] Step 9: Commit notebook + experiments log (from the repository root; the paths below are root-relative)
git add packages/generation/research/2026-04-29-eval-harness-v1/notebook.md packages/generation/research/2026-04-29-eval-harness-v1/experiments.jsonl
git commit -m "$(cat <<'EOF'
research(generation): PHON-57 — first baseline-v6 sweep + notebook entry
40 generations × baseline config × baseline prompt+constraint pools, scored end-to-end. Establishes the measurement substrate that PHON-58/59/60/61 will compare future configs against.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
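Because the Step 8 command contains placeholders such as `<actual run_id>`, it is easy to log an entry with one left unfilled. The sketch below is a sanity check for the last line of experiments.jsonl, ideally run before the commit. It assumes ExperimentEntry is serialized with the same key names as the keyword arguments in Step 8, which this plan section does not confirm. `last_entry` and `problems` are hypothetical helpers.

```python
import json
from typing import Iterable

# Keys taken from the keyword arguments passed to ExperimentEntry in Step 8
# (assumption: the JSONL row uses the same names).
REQUIRED_FIELDS = (
    "actor", "hypothesis", "config_path", "run_id",
    "scored_path", "verdict", "verdict_evidence",
)


def last_entry(lines: Iterable[str]) -> dict:
    """Parse the final non-blank line of an append-only JSONL log."""
    rows = [json.loads(line) for line in lines if line.strip()]
    if not rows:
        raise ValueError("experiments log is empty")
    return rows[-1]


def problems(entry: dict) -> list[str]:
    """List missing required fields and string values that still look like <placeholders>."""
    issues = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    issues += [
        f"unfilled placeholder in {name}"
        for name, value in entry.items()
        if isinstance(value, str) and "<" in value and ">" in value
    ]
    return issues


# Usage:
# with open("experiments.jsonl") as f:
#     print(problems(last_entry(f)) or "entry looks complete")
```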
Done
The eval harness is implemented, tested, and has produced its first measurement of the production system. PHON-57 is complete.
Next steps (other tickets, not this plan):
- PHON-63: server-side decoding-param override API. After this lands, the harness can sweep over real config alternatives.
- PHON-58: research spike. Reads the baseline report, surveys SOTA, audits implementation, proposes candidate configs to feed back into the harness.
- PHON-59/60/61: quality / length / compliance work. Each candidate fix becomes a named config, swept, scored, compared against baseline.
Push the branch and open a PR against develop per project convention (feedback_branch_management.md).