Generation Quality Eval Harness — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Build a fresh eval harness under packages/generation/research/2026-04-29-eval-harness-v1/ that measures governed-generation quality across (config × prompt × constraints) using auto metrics + an LLM-judge rubric.

Architecture: Three deterministic CLI stages — generate.py (sweep → JSONL), score.py (auto metrics + Claude judge → scored JSONL), compare.py (markdown report) — driven by YAML configs, rubric, prompts, constraint pool. Autoresearch-ready (file-based primitives, append-only experiments.jsonl).

Tech Stack: Python 3.11, httpx (SSE client), pyyaml, transformers + torch (GPT-2 PPL), anthropic SDK (judge), pytest (testing).

Spec: docs/superpowers/specs/2026-04-29-generation-quality-eval-harness-design.md

Jira: PHON-57 (parent: PHON-56 Workstream; blocked by PHON-63 for full sweep usage)
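
The (config × prompt × constraints) sweep named in the Goal is a plain cross product of the YAML-defined cells. A minimal sketch of that enumeration, assuming the cell IDs defined in this plan; `sweep_cells` and the record field names are hypothetical and not part of generate.py:

```python
from itertools import product


def sweep_cells(configs, prompts, constraints):
    """Enumerate every (config, prompt, constraint) cell of the sweep grid."""
    return [
        {"config": c, "prompt": p, "constraint": k}
        for c, p, k in product(configs, prompts, constraints)
    ]


cells = sweep_cells(
    configs=["baseline-v6"],
    prompts=["dog_park", "asteroid_kids", "king_edicts", "pb_sandwich", "tying_shoes"],
    constraints=["exc_r", "exc_szshzh", "aoa_le5", "aoa_le7",
                 "mixed_aoa7_excr", "mixed_aoa5_excsz", "inc_k", "con_sz"],
)
print(len(cells))  # 1 config x 5 prompts x 8 constraints = 40 cells
```

With one config, five prompts and eight constraint cells, a full sweep is 40 generations before any repeats per cell.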

File Structure

packages/generation/research/2026-04-29-eval-harness-v1/
├── notebook.md                # lab notebook — append-only
├── rubric.yaml                # judge dims + auto metrics + scales
├── prompts.yaml               # 5 prompts (narrative, kids, declamatory, procedural, instructional)
├── constraints.yaml           # 8 constraint cells
├── configs/
│   └── baseline-v6.yaml       # production decoding + governor settings, snapshot 2026-04-29
├── vocab.txt                  # D1 207K word snapshot (committed once, refreshable)
├── generate.py                # CLI: stage 1 (sweep runner)
├── score.py                   # CLI: stage 2 (auto metrics + judge)
├── compare.py                 # CLI: stage 3 (markdown comparison report)
├── auto_metrics.py            # OOV gate + n-gram + PPL
├── judge.py                   # Claude API call + rubric-driven prompting
├── sse_client.py              # POST /api/generate-single, parse SSE events
├── schemas.py                 # YAML loaders + dataclass validators
├── experiments.py             # experiments.jsonl helpers (autoresearch readiness)
├── tests/
│   ├── conftest.py            # shared fixtures
│   ├── test_auto_metrics.py
│   ├── test_judge.py
│   ├── test_schemas.py
│   ├── test_sse_client.py
│   ├── test_experiments.py
│   └── test_e2e.py
└── runs/, reports/, experiments.jsonl    # runtime artifacts (gitignored)

Dependencies to add (to packages/generation/pyproject.toml):
- anthropic>=0.40.0 (judge)
- pyyaml>=6.0 (config loaders)
- (existing) transformers, torch, httpx, pytest

Task 1: Scaffold the research directory

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/notebook.md
- Create: packages/generation/research/2026-04-29-eval-harness-v1/rubric.yaml
- Create: packages/generation/research/2026-04-29-eval-harness-v1/prompts.yaml
- Create: packages/generation/research/2026-04-29-eval-harness-v1/constraints.yaml
- Create: packages/generation/research/2026-04-29-eval-harness-v1/configs/baseline-v6.yaml
- Create: packages/generation/research/2026-04-29-eval-harness-v1/.gitignore
- Modify: packages/generation/pyproject.toml (add anthropic, pyyaml dependencies)

[ ] Step 1: Create notebook.md

# Eval Harness v1 — Lab Notebook

**Created:** 2026-04-29
**Spec:** `docs/superpowers/specs/2026-04-29-generation-quality-eval-harness-design.md`
**Jira:** PHON-57

## Methodology

Epistemic scavenging — fresh design from first principles. Existing scripts (`constraint_grid_sweep.py`, `analyze_sweep.py`) are historical references for consilience, not foundations.

## Run index

(populated as runs accumulate)

## Hypothesis log

(populated as experiments are proposed)

## Findings

(populated as runs complete and patterns emerge)

[ ] Step 2: Create rubric.yaml

version: 1
description: First-cut rubric — designed from observed failure modes (friend testing, 2026-04-29)
judge:
  default_model: claude-haiku-4-5
  cache_system_prompt: true
dimensions:
  - id: grammaticality
    description: Is the output well-formed English syntax?
    scale: [1, 2, 3, 4, 5]
    anchors:
      1: "Severe grammar errors throughout (broken syntax, missing core words)"
      3: "Some grammar issues but generally readable"
      5: "Fully grammatical, no syntactic errors"
  - id: coherence
    description: Does the output hold together as discourse and stay on topic?
    scale: [1, 2, 3, 4, 5]
    anchors:
      1: "Incoherent, topic salad, contradictory"
      3: "Mostly coherent with some drift or non-sequiturs"
      5: "Tight coherence end to end"
  - id: prompt_following
    description: Does the output address what the prompt asked for?
    scale: [1, 2, 3, 4, 5]
    anchors:
      1: "Ignores the prompt entirely"
      3: "Partially addresses it"
      5: "Fully addresses the prompt"
  - id: natural_ending
    description: Does it conclude on its own, or pad to budget?
    scale: [1, 2, 3, 4, 5]
    anchors:
      1: "Truncated mid-thought or padded with filler"
      3: "Acceptable ending but somewhat abrupt or padded"
      5: "Concludes naturally"
  - id: stays_in_english
    description: Does it avoid code-switching into other languages?
    scale: [1, 2, 3, 4, 5]
    anchors:
      1: "Heavy code-switching, mostly non-English tokens"
      3: "Some non-English tokens"
      5: "Fully English"
auto_metrics:
  - id: real_english_rate
    description: Fraction of [a-zA-Z]+ tokens present in the D1 207K vocab snapshot
    source: vocab.txt
  - id: distinct_3
    description: distinct-3-gram ratio (1.0 = no 3-gram repeats)
  - id: distinct_5
    description: distinct-5-gram ratio
  - id: max_3gram_rep
    description: max repetition rate of any single 3-gram
  - id: ppl_gpt2
    description: token-level perplexity from GPT-2 base
    model: gpt2
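
Each rubric dimension carries an id, a description, a scale and anchor texts, which is enough to render a scoring instruction for the judge. The sketch below shows one way that rendering could look; the output layout and the `render_dimension` name are assumptions, since the prompt format used by judge.py is not shown in this section:

```python
def render_dimension(dim: dict) -> str:
    """Render one rubric dimension as a plain-text block for a judge prompt."""
    lo, hi = min(dim["scale"]), max(dim["scale"])
    lines = [f"{dim['id']} ({lo}-{hi}): {dim['description']}"]
    # Anchors are sparse (1, 3, 5), so list only the points that have text.
    for score in sorted(dim["anchors"]):
        lines.append(f"  {score} = {dim['anchors'][score]}")
    return "\n".join(lines)


grammaticality = {
    "id": "grammaticality",
    "description": "Is the output well-formed English syntax?",
    "scale": [1, 2, 3, 4, 5],
    "anchors": {
        1: "Severe grammar errors throughout (broken syntax, missing core words)",
        3: "Some grammar issues but generally readable",
        5: "Fully grammatical, no syntactic errors",
    },
}
print(render_dimension(grammaticality))
```

Because the text is derived from the YAML alone, editing an anchor in rubric.yaml would change the judge instruction without touching code.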

[ ] Step 3: Create prompts.yaml

- id: dog_park
  text: "Write a short paragraph about a dog playing in the park."
  genre: narrative
- id: asteroid_kids
  text: "Tell a kids' story about an asteroid in space."
  genre: kids_story
- id: king_edicts
  text: "Write a proclamation by a king announcing three royal edicts."
  genre: declamation
- id: pb_sandwich
  text: "Describe how to make a peanut butter sandwich in three steps."
  genre: procedural
- id: tying_shoes
  text: "Give a brief instruction for tying shoes."
  genre: instructional

[ ] Step 4: Create constraints.yaml

- id: exc_r
  type: exclude
  phonemes: ["ɹ", "ɝ", "ɚ"]
- id: exc_szshzh
  type: exclude
  phonemes: ["s", "z", "ʃ", "ʒ"]
- id: aoa_le5
  type: bound
  norm: aoa_kuperman
  max: 5.0
- id: aoa_le7
  type: bound
  norm: aoa_kuperman
  max: 7.0
- id: mixed_aoa7_excr
  combine:
    - { type: bound, norm: aoa_kuperman, max: 7.0 }
    - { type: exclude, phonemes: ["ɹ", "ɝ", "ɚ"] }
- id: mixed_aoa5_excsz
  combine:
    - { type: bound, norm: aoa_kuperman, max: 5.0 }
    - { type: exclude, phonemes: ["s", "z"] }
- id: inc_k
  type: include
  phonemes: ["k"]
  target_rate: 0.20
- id: con_sz
  type: contrastive
  pair_type: minpair
  phoneme1: s
  phoneme2: z
  position: any
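
Six of the eight cells are single constraints and two use `combine` to bundle several. If the generation API expects a flat list of primitive constraints, each cell has to be expanded before it is sent. A sketch under that assumption; `flatten_constraint` is a hypothetical helper, and dropping the harness-side `id` key is also an assumption:

```python
def flatten_constraint(cell: dict) -> list[dict]:
    """Expand one constraints.yaml cell into a list of primitive constraints."""
    if "combine" in cell:
        # Copy each part so callers can mutate the result without touching the cell.
        return [dict(part) for part in cell["combine"]]
    # Single-constraint cell: keep every key except the harness-side id.
    return [{k: v for k, v in cell.items() if k != "id"}]


exc_r = {"id": "exc_r", "type": "exclude", "phonemes": ["ɹ", "ɝ", "ɚ"]}
mixed = {
    "id": "mixed_aoa5_excsz",
    "combine": [
        {"type": "bound", "norm": "aoa_kuperman", "max": 5.0},
        {"type": "exclude", "phonemes": ["s", "z"]},
    ],
}
print(flatten_constraint(exc_r))
print(len(flatten_constraint(mixed)))  # two primitive constraints
```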

[ ] Step 5: Create configs/baseline-v6.yaml

name: baseline-v6
description: Production decoding + governor settings, snapshot 2026-04-29
parent_config: null
decoding:
  temperature: 0.8
  top_p: 0.92
  top_k: 80
  repetition_penalty: 1.2
  max_new_tokens: 128
  num_drafts: 4
governor:
  use_punctuation_boost: true
  use_trie_steering: true
  use_lookahead: true
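
The `decoding` and `governor` blocks map onto the optional `decoding` and `governor` keys of the request body that `generate_one` builds later in this plan. A minimal sketch of that mapping, assuming the server accepts these keys verbatim; `build_payload` is a hypothetical helper, not part of the harness:

```python
def build_payload(prompt: str, constraints: list[dict], config: dict) -> dict:
    """Assemble a request body from a loaded config dict (hypothetical helper)."""
    payload = {"prompt": prompt, "constraints": constraints}
    # Only attach the optional sections when the config actually defines them.
    for section in ("decoding", "governor"):
        if config.get(section):
            payload[section] = dict(config[section])
    return payload


baseline = {
    "name": "baseline-v6",
    "decoding": {"temperature": 0.8, "top_p": 0.92, "top_k": 80,
                 "repetition_penalty": 1.2, "max_new_tokens": 128, "num_drafts": 4},
    "governor": {"use_punctuation_boost": True, "use_trie_steering": True,
                 "use_lookahead": True},
}
payload = build_payload(
    "Write a short paragraph about a dog playing in the park.",
    [{"type": "exclude", "phonemes": ["s", "z"]}],
    baseline,
)
print(sorted(payload))  # ['constraints', 'decoding', 'governor', 'prompt']
```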

[ ] Step 6: Create .gitignore for runtime artifacts

runs/
reports/
experiments.jsonl
__pycache__/
*.pyc
.pytest_cache/

[ ] Step 7: Add anthropic + pyyaml to packages/generation/pyproject.toml

Open packages/generation/pyproject.toml, find the dependencies list, add:

"anthropic>=0.40.0",
"pyyaml>=6.0",

(Keep the dependency list in alphabetical order when inserting the new entries.)

[ ] Step 8: Install new dependencies

Run: uv sync (from repo root)
Expected: pulls anthropic + pyyaml into the env.

[ ] Step 9: Commit scaffold

git add packages/generation/research/2026-04-29-eval-harness-v1/ packages/generation/pyproject.toml uv.lock
git commit -m "$(cat <<'EOF'
scaffold(generation/research): PHON-57 — eval harness v1 directory + YAML stubs

Notebook, rubric, prompts, constraints, baseline-v6 config. Adds anthropic + pyyaml deps. No Python code yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 2: Snapshot D1 vocabulary to vocab.txt

The OOV gate needs the canonical PhonoLex vocabulary. Snapshot once, commit, refreshable on demand.

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/vocab.txt
- Create: packages/generation/research/2026-04-29-eval-harness-v1/refresh_vocab.sh

[ ] Step 1: Create refresh_vocab.sh helper

#!/usr/bin/env bash
# Snapshot the local D1 words table to vocab.txt.
# Run from this directory: ./refresh_vocab.sh
# Requires: local D1 to be seeded (npx wrangler d1 execute phonolex --local --file scripts/d1-seed.sql)

set -e
cd "$(dirname "$0")"

WORKERS_DIR="../../../web/workers"
DB_PATH="$WORKERS_DIR/.wrangler/state/v3/d1/miniflare-D1DatabaseObject"

DB_FILE=$(find "$DB_PATH" -name "*.sqlite" 2>/dev/null | head -1)

if [[ -z "$DB_FILE" ]]; then
  echo "ERROR: local D1 SQLite file not found at $DB_PATH"
  echo "Seed it first: cd packages/web/workers && npx wrangler d1 execute phonolex --local --file scripts/d1-seed.sql"
  exit 1
fi

echo "Snapshotting from $DB_FILE..."
sqlite3 "$DB_FILE" "SELECT word FROM words ORDER BY word" > vocab.txt
echo "Wrote $(wc -l < vocab.txt) words to vocab.txt"

Make executable: chmod +x packages/generation/research/2026-04-29-eval-harness-v1/refresh_vocab.sh

[ ] Step 2: Run the snapshot

cd packages/generation/research/2026-04-29-eval-harness-v1
./refresh_vocab.sh

Expected: "Wrote 207665 words to vocab.txt" (or close — exact count depends on seed version).

[ ] Step 3: Sanity-check vocab.txt

wc -l vocab.txt
grep -c '^the$' vocab.txt
grep -c '^mek$' vocab.txt

Expected: about 207665 lines; grep -c '^the$' prints 1 and grep -c '^mek$' prints 0.

[ ] Step 4: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/vocab.txt packages/generation/research/2026-04-29-eval-harness-v1/refresh_vocab.sh
git commit -m "$(cat <<'EOF'
snapshot(generation/research): PHON-57 — D1 vocabulary snapshot for OOV gate

Snapshot of local D1 words table (207,665 entries) committed as vocab.txt. refresh_vocab.sh re-snapshots on demand. Decouples the eval harness from D1 state during runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 3: OOV gate (real_english_rate auto-metric)

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/__init__.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_auto_metrics.py

[ ] Step 1: Write the failing test

Create tests/test_auto_metrics.py:

from pathlib import Path
import pytest

from auto_metrics import VocabGate, real_english_rate


@pytest.fixture
def vocab_gate():
    return VocabGate(Path(__file__).parent.parent / "vocab.txt")


def test_real_english_rate_all_in_vocab(vocab_gate):
    text = "the dog ran in the park"
    assert real_english_rate(text, vocab_gate) == 1.0


def test_real_english_rate_all_oov(vocab_gate):
    text = "mek aan alen vivere"
    assert real_english_rate(text, vocab_gate) == 0.0


def test_real_english_rate_mixed(vocab_gate):
    text = "the mek dog aan ran"   # 3 in-vocab / 5 total
    assert real_english_rate(text, vocab_gate) == pytest.approx(3 / 5)


def test_real_english_rate_case_insensitive(vocab_gate):
    text = "The DOG ran"
    assert real_english_rate(text, vocab_gate) == 1.0


def test_real_english_rate_empty(vocab_gate):
    # No tokens, so the rate is undefined; convention: return 1.0 (vacuously true)
    assert real_english_rate("", vocab_gate) == 1.0


def test_real_english_rate_strips_punctuation(vocab_gate):
    text = "The dog, running fast!"
    assert real_english_rate(text, vocab_gate) == 1.0

[ ] Step 2: Run test, verify it fails

cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python -m pytest tests/test_auto_metrics.py -v

Expected: FAIL with "ModuleNotFoundError: No module named 'auto_metrics'"

[ ] Step 3: Implement minimal VocabGate + real_english_rate

Create auto_metrics.py:

"""Auto metrics for the eval harness — local computation, no network."""

from __future__ import annotations

import re
from pathlib import Path


WORD_RE = re.compile(r"[a-zA-Z]+")


class VocabGate:
    """Membership check against a one-word-per-line vocabulary file."""

    def __init__(self, vocab_path: Path) -> None:
        # Read as UTF-8 explicitly so loading does not depend on the platform locale.
        with open(vocab_path, encoding="utf-8") as f:
            self._words = {line.strip().lower() for line in f if line.strip()}

    def __contains__(self, word: str) -> bool:
        return word.lower() in self._words

    def __len__(self) -> int:
        return len(self._words)


def real_english_rate(text: str, gate: VocabGate) -> float:
    """Fraction of [a-zA-Z]+ tokens in `text` that are members of `gate`."""
    tokens = WORD_RE.findall(text)
    if not tokens:
        return 1.0
    in_vocab = sum(1 for t in tokens if t in gate)
    return in_vocab / len(tokens)
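
To make the tokenisation behaviour concrete, here is a self-contained sketch that mirrors the same regex and convention against a toy in-memory vocabulary instead of vocab.txt. The `rate` function and `toy_vocab` are illustrative stand-ins, not harness code:

```python
import re

WORD_RE = re.compile(r"[a-zA-Z]+")


def rate(text: str, vocab: set[str]) -> float:
    """Fraction of alphabetic tokens found in `vocab` (case-insensitive)."""
    tokens = WORD_RE.findall(text)
    if not tokens:
        return 1.0  # same vacuous-truth convention as real_english_rate
    return sum(t.lower() in vocab for t in tokens) / len(tokens)


toy_vocab = {"the", "dog", "ran"}
print(rate("the mek dog aan ran", toy_vocab))  # 3 of 5 tokens are in the toy vocab
print(WORD_RE.findall("don't stop"))           # the apostrophe splits "don't" in two
```

One consequence of the `[a-zA-Z]+` pattern is that contractions and hyphenated words are split into fragments, so a fragment such as "t" counts against the rate unless the vocabulary happens to contain it.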

[ ] Step 4: Run test, verify it passes

uv run python -m pytest tests/test_auto_metrics.py -v

Expected: 6 passed.

[ ] Step 5: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py packages/generation/research/2026-04-29-eval-harness-v1/tests/
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — OOV gate auto-metric (real_english_rate)

VocabGate loads vocab.txt as a Python set; real_english_rate returns the fraction of [a-zA-Z]+ tokens present in the gate. Catches the friend's "mek"/"aan"/"alen" pseudo-English failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 4: N-gram repetition metrics

Files:
- Modify: packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py
- Modify: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_auto_metrics.py

[ ] Step 1: Write failing tests for n-gram metrics

Append to tests/test_auto_metrics.py:

from auto_metrics import distinct_n, max_ngram_rep


def test_distinct_n_no_repeats():
    text = "the dog ran fast"
    # 2 trigrams: "the dog ran", "dog ran fast" — both unique
    assert distinct_n(text, n=3) == 1.0


def test_distinct_n_full_repetition():
    text = "mek mek mek mek mek"
    # 3 trigrams, all "mek mek mek" — distinct = 1/3
    assert distinct_n(text, n=3) == pytest.approx(1 / 3)


def test_distinct_n_short_text():
    # Fewer tokens than n: distinct-n is undefined; return 1.0
    text = "hi"
    assert distinct_n(text, n=3) == 1.0


def test_max_3gram_rep_no_repetition():
    text = "the quick brown fox jumps over"
    # 4 trigrams, each appears once — max rep = 1/4
    assert max_ngram_rep(text, n=3) == pytest.approx(1 / 4)


def test_max_3gram_rep_heavy_repetition():
    text = "mek mek mek mek mek"
    # 3 trigrams, "mek mek mek" appears 3 times — max rep = 3/3 = 1.0
    assert max_ngram_rep(text, n=3) == 1.0


def test_max_3gram_rep_partial_repetition():
    text = "the dog ran the dog ran fast"
    # tokens: the dog ran the dog ran fast (7 tokens) -> 5 trigrams:
    # (the,dog,ran), (dog,ran,the), (ran,the,dog), (the,dog,ran), (dog,ran,fast)
    # most frequent: "the dog ran" appears 2x out of 5 = 0.4
    assert max_ngram_rep(text, n=3) == pytest.approx(2 / 5)

[ ] Step 2: Run tests, verify they fail

uv run python -m pytest tests/test_auto_metrics.py::test_distinct_n_no_repeats -v

Expected: FAIL with "ImportError: cannot import name 'distinct_n'".

[ ] Step 3: Implement n-gram functions

Append to auto_metrics.py:

from collections import Counter


def _ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    tokens = [t.lower() for t in WORD_RE.findall(text)]
    if len(tokens) < n:
        return []
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(text: str, n: int) -> float:
    """distinct-n: ratio of unique n-grams to total n-grams. 1.0 = no repeats."""
    grams = _ngrams(text, n)
    if not grams:
        return 1.0
    return len(set(grams)) / len(grams)


def max_ngram_rep(text: str, n: int) -> float:
    """Max repetition rate of any single n-gram (count of most frequent / total)."""
    grams = _ngrams(text, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    most_freq = counts.most_common(1)[0][1]
    return most_freq / len(grams)
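
The two metrics measure different things: distinct-n looks at how many n-grams are unique, while the max repetition rate looks only at the single worst offender. A short worked example on the partial-repetition sentence from the tests, written standalone with whitespace tokenisation (a simplification of the regex tokeniser above):

```python
from collections import Counter

tokens = "the dog ran the dog ran fast".split()
# Sliding window of width 3 over 7 tokens gives 5 trigrams.
trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
counts = Counter(trigrams)

distinct_3 = len(counts) / len(trigrams)               # 4 unique / 5 total
max_rep = counts.most_common(1)[0][1] / len(trigrams)  # "the dog ran" 2x / 5 total

print(distinct_3, max_rep)
```

Here four of the five trigrams are unique (0.8) and the most frequent one accounts for two of five (0.4).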

[ ] Step 4: Run tests, verify they pass

uv run python -m pytest tests/test_auto_metrics.py -v

Expected: all tests pass (12 total at this point).

[ ] Step 5: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_auto_metrics.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — n-gram repetition auto-metrics

distinct_n (1 - repetition rate) and max_ngram_rep (worst-offending n-gram). Quantifies the "mek mek mek" / "Alen Alen Alen" failures from friend testing.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 5: GPT-2 PPL scorer

Files:
- Modify: packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py
- Modify: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_auto_metrics.py

[ ] Step 1: Write the failing test

Append to tests/test_auto_metrics.py:

from auto_metrics import GPT2PPLScorer


@pytest.fixture(scope="module")
def ppl_scorer():
    return GPT2PPLScorer()


def test_ppl_fluent_lower_than_gibberish(ppl_scorer):
    fluent = "The dog ran across the park and chased a ball."
    gibberish = "mek aan alen oude vivere allaha goed buona"
    assert ppl_scorer.ppl(fluent) < ppl_scorer.ppl(gibberish)


def test_ppl_returns_finite_positive(ppl_scorer):
    text = "The cat sat on the mat."
    val = ppl_scorer.ppl(text)
    assert val > 0
    assert val < float("inf")


def test_ppl_short_text(ppl_scorer):
    # Very short input — should still return a value, not crash
    val = ppl_scorer.ppl("Hi.")
    assert val > 0

[ ] Step 2: Run tests, verify they fail

uv run python -m pytest tests/test_auto_metrics.py::test_ppl_fluent_lower_than_gibberish -v

Expected: FAIL with "ImportError: cannot import name 'GPT2PPLScorer'".

[ ] Step 3: Implement GPT2PPLScorer

Append to auto_metrics.py:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast


class GPT2PPLScorer:
    """Token-level perplexity from GPT-2 base. Loaded once; thread-unsafe but fast."""

    def __init__(self, model_name: str = "gpt2") -> None:
        self.tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        self.model.eval()
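        # Prefer CUDA, then Apple MPS, then fall back to CPU.
        if torch.cuda.is_available():
            self.device = "cuda"
        elif torch.backends.mps.is_available():
            self.device = "mps"
        else:
            self.device = "cpu"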
        self.model.to(self.device)

    @torch.no_grad()
    def ppl(self, text: str) -> float:
        if not text.strip():
            return float("inf")
        enc = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        input_ids = enc["input_ids"].to(self.device)
        if input_ids.shape[1] < 2:
            # Need at least 2 tokens for a meaningful loss
            return float("nan")
        outputs = self.model(input_ids, labels=input_ids)
        return float(torch.exp(outputs.loss).item())
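
`GPT2PPLScorer.ppl` exponentiates the mean cross-entropy that the model returns when `labels` are passed; Hugging Face shifts the labels internally, so the mean runs over the N-1 predicted tokens. The same arithmetic can be sketched without torch, using made-up per-token probabilities (the numbers are illustrative, not GPT-2 outputs):

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the predicted tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


confident = [0.5, 0.5, 0.5, 0.5]    # model assigns each next token p = 0.5
surprised = [0.5, 0.5, 0.5, 0.001]  # one very unlikely token

print(perplexity(confident))  # close to 2: like a fair coin flip per token
print(perplexity(surprised) > perplexity(confident))
```

A single very unlikely token raises the perplexity of the whole sequence, which is why the metric works as a tripwire for out-of-distribution output.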

[ ] Step 4: Run tests, verify they pass

uv run python -m pytest tests/test_auto_metrics.py -v

Expected: all pass. First run downloads GPT-2 (~500MB, cached to ~/.cache/huggingface/) — subsequent runs are fast.

[ ] Step 5: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/auto_metrics.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_auto_metrics.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — GPT-2 PPL auto-metric

GPT2PPLScorer wraps gpt2 base for token-level perplexity. Tripwire metric — degenerate outputs (vivere vivere vivere) get absurd PPL.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 6: YAML schema loaders + validators

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/schemas.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_schemas.py

[ ] Step 1: Write failing tests

Create tests/test_schemas.py:

from pathlib import Path
import pytest
import yaml

from schemas import load_rubric, load_config, load_prompts, load_constraints


HERE = Path(__file__).parent
ROOT = HERE.parent


def test_load_rubric_well_formed():
    rubric = load_rubric(ROOT / "rubric.yaml")
    assert rubric.version == 1
    assert len(rubric.dimensions) == 5
    assert {d.id for d in rubric.dimensions} == {
        "grammaticality", "coherence", "prompt_following", "natural_ending", "stays_in_english"
    }
    assert {m.id for m in rubric.auto_metrics} == {
        "real_english_rate", "distinct_3", "distinct_5", "max_3gram_rep", "ppl_gpt2"
    }
    assert rubric.judge.default_model == "claude-haiku-4-5"


def test_load_config_baseline():
    config = load_config(ROOT / "configs" / "baseline-v6.yaml")
    assert config.name == "baseline-v6"
    assert config.decoding.temperature == 0.8
    assert config.decoding.num_drafts == 4
    assert config.governor.use_punctuation_boost is True


def test_load_prompts():
    prompts = load_prompts(ROOT / "prompts.yaml")
    assert len(prompts) == 5
    assert {p.id for p in prompts} == {
        "dog_park", "asteroid_kids", "king_edicts", "pb_sandwich", "tying_shoes"
    }


def test_load_constraints():
    constraints = load_constraints(ROOT / "constraints.yaml")
    assert len(constraints) == 8
    assert {c["id"] for c in constraints} == {
        "exc_r", "exc_szshzh", "aoa_le5", "aoa_le7",
        "mixed_aoa7_excr", "mixed_aoa5_excsz", "inc_k", "con_sz",
    }


def test_load_rubric_invalid_missing_dims(tmp_path):
    bad = tmp_path / "bad.yaml"
    bad.write_text(yaml.safe_dump({"version": 1, "judge": {"default_model": "x"}}))
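    # load_rubric raises ValueError when dimensions are missing; catching
    # Exception here would let any unrelated failure pass the test.
    with pytest.raises(ValueError):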
        load_rubric(bad)

[ ] Step 2: Run tests, verify they fail

uv run python -m pytest tests/test_schemas.py -v

Expected: FAIL with "ModuleNotFoundError: No module named 'schemas'".

[ ] Step 3: Implement schemas.py

Create schemas.py:

"""YAML loaders for rubric, config, prompts, constraints."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Any

import yaml


@dataclass
class JudgeConfig:
    default_model: str
    cache_system_prompt: bool = True


@dataclass
class RubricDimension:
    id: str
    description: str
    scale: list[int]
    anchors: dict[int, str]


@dataclass
class AutoMetric:
    id: str
    description: str = ""
    source: str | None = None
    model: str | None = None


@dataclass
class Rubric:
    version: int
    judge: JudgeConfig
    dimensions: list[RubricDimension]
    auto_metrics: list[AutoMetric]
    description: str = ""


@dataclass
class DecodingConfig:
    temperature: float
    top_p: float
    top_k: int
    repetition_penalty: float
    max_new_tokens: int
    num_drafts: int


@dataclass
class GovernorConfig:
    use_punctuation_boost: bool = True
    use_trie_steering: bool = True
    use_lookahead: bool = True


@dataclass
class ExperimentConfig:
    name: str
    description: str
    decoding: DecodingConfig
    governor: GovernorConfig
    parent_config: str | None = None


@dataclass
class Prompt:
    id: str
    text: str
    genre: str = ""


def _load_yaml(path: Path) -> Any:
    with open(path) as f:
        return yaml.safe_load(f)


def load_rubric(path: Path) -> Rubric:
    raw = _load_yaml(path)
    if "dimensions" not in raw or not raw["dimensions"]:
        raise ValueError(f"rubric.yaml missing dimensions: {path}")
    return Rubric(
        version=raw["version"],
        description=raw.get("description", ""),
        judge=JudgeConfig(**raw["judge"]),
        dimensions=[
            RubricDimension(
                id=d["id"],
                description=d["description"],
                scale=d["scale"],
                anchors={int(k): v for k, v in d["anchors"].items()},
            )
            for d in raw["dimensions"]
        ],
        auto_metrics=[AutoMetric(**m) for m in raw["auto_metrics"]],
    )


def load_config(path: Path) -> ExperimentConfig:
    raw = _load_yaml(path)
    return ExperimentConfig(
        name=raw["name"],
        description=raw["description"],
        parent_config=raw.get("parent_config"),
        decoding=DecodingConfig(**raw["decoding"]),
        governor=GovernorConfig(**raw.get("governor", {})),
    )


def load_prompts(path: Path) -> list[Prompt]:
    raw = _load_yaml(path)
    return [Prompt(**p) for p in raw]


def load_constraints(path: Path) -> list[dict]:
    """Constraints stay as raw dicts — they pass directly to the generation API."""
    raw = _load_yaml(path)
    if not isinstance(raw, list):
        raise ValueError(f"constraints.yaml must be a list: {path}")
    for c in raw:
        if "id" not in c:
            raise ValueError(f"constraint missing id: {c}")
    return raw
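
The loaders above check that required keys are present but do not cross-check values. One extra check that might be worth adding is that every anchor key is a point on its dimension's scale. A sketch, with `check_anchors` as a hypothetical helper that is not part of schemas.py:

```python
def check_anchors(dim_id: str, scale: list[int], anchors: dict[int, str]) -> None:
    """Raise ValueError if any anchor key is not a point on the scale."""
    stray = sorted(set(anchors) - set(scale))
    if stray:
        raise ValueError(f"{dim_id}: anchors {stray} are not on scale {scale}")


# Anchors at 1, 3 and 5 on a 1-5 scale pass silently.
check_anchors("coherence", [1, 2, 3, 4, 5], {1: "low", 3: "mid", 5: "high"})

# An anchor at 0 is off the scale and is reported by dimension id.
try:
    check_anchors("coherence", [1, 2, 3, 4, 5], {0: "typo", 5: "high"})
except ValueError as exc:
    print(exc)
```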

[ ] Step 4: Run tests, verify they pass

uv run python -m pytest tests/test_schemas.py -v

Expected: 5 passed.

[ ] Step 5: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/schemas.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_schemas.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — YAML schema loaders for rubric/config/prompts/constraints

Dataclass-based loaders with validation. Constraints stay as raw dicts (passed verbatim to /api/generate-single).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 7: SSE client for /api/generate-single

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/sse_client.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_sse_client.py

[ ] Step 1: Write failing test using a stubbed HTTP client

Create tests/test_sse_client.py:

from unittest.mock import MagicMock, patch

from sse_client import generate_one, GenerationResult


SAMPLE_SSE = (
    'data: {"status": "Building VocabTrie"}\n'
    'data: {"status": "Vocabulary survival: 48%"}\n'
    'data: {"status": "Generating 4 drafts (attempt 1)"}\n'
    'data: {"status": "  Draft 1: compliant"}\n'
    'data: {"result": {"text": "The dog ran.", "compliant": true, "violation_count": 0, "violation_words": [], "boost_coverage": [], "warnings": null, "gen_time_ms": 1234}}\n'
)


def test_generate_one_parses_sse():
    mock_resp = MagicMock()
    mock_resp.iter_text.return_value = [SAMPLE_SSE]
    mock_resp.raise_for_status = MagicMock()

    mock_stream = MagicMock()
    mock_stream.__enter__ = MagicMock(return_value=mock_resp)
    mock_stream.__exit__ = MagicMock(return_value=None)

    mock_client = MagicMock()
    mock_client.stream = MagicMock(return_value=mock_stream)
    mock_client.__enter__ = MagicMock(return_value=mock_client)
    mock_client.__exit__ = MagicMock(return_value=None)

    with patch("sse_client.httpx.Client", return_value=mock_client):
        result = generate_one(
            url="http://localhost:8000",
            prompt="Test prompt",
            constraints=[{"type": "exclude", "phonemes": ["s"]}],
            decoding={"temperature": 0.8},
        )

    assert isinstance(result, GenerationResult)
    assert result.text == "The dog ran."
    assert result.compliant is True
    assert result.survival_ratio == 0.48
    assert result.gen_time_ms == 1234
    assert result.error is None
    assert result.drafts_compliant == 1


def test_generate_one_handles_error_event():
    error_sse = (
        'data: {"status": "starting"}\n'
        'data: {"error": "Server crashed"}\n'
    )
    mock_resp = MagicMock()
    mock_resp.iter_text.return_value = [error_sse]
    mock_resp.raise_for_status = MagicMock()
    mock_stream = MagicMock()
    mock_stream.__enter__ = MagicMock(return_value=mock_resp)
    mock_stream.__exit__ = MagicMock(return_value=None)
    mock_client = MagicMock()
    mock_client.stream = MagicMock(return_value=mock_stream)
    mock_client.__enter__ = MagicMock(return_value=mock_client)
    mock_client.__exit__ = MagicMock(return_value=None)

    with patch("sse_client.httpx.Client", return_value=mock_client):
        result = generate_one(
            url="http://localhost:8000",
            prompt="Test",
            constraints=[],
            decoding={},
        )

    assert result.error == "Server crashed"
    assert result.text == ""
    assert result.compliant is False

[ ] Step 2: Run tests, verify they fail

uv run python -m pytest tests/test_sse_client.py -v

Expected: FAIL with "ModuleNotFoundError: No module named 'sse_client'".

[ ] Step 3: Implement sse_client.py

Create sse_client.py:

"""SSE client for the local generation server's /api/generate-single endpoint.

The server streams `data: {...json...}` events with either {"status": "..."},
{"result": {...}}, or {"error": "..."} payloads. We collect statuses for
pipeline metrics and return the final result (or error).
"""

from __future__ import annotations

import json
import re
from dataclasses import dataclass, field
from typing import Any

import httpx


@dataclass
class GenerationResult:
    text: str = ""
    compliant: bool = False
    violation_count: int = 0
    violation_words: list[str] = field(default_factory=list)
    boost_coverage: list[dict] = field(default_factory=list)
    warnings: str | None = None
    gen_time_ms: int = 0
    survival_ratio: float | None = None
    retry_count: int = 0
    drafts_compliant: int = 0
    hit_escalation: bool = False
    statuses: list[str] = field(default_factory=list)
    error: str | None = None


def generate_one(
    url: str,
    prompt: str,
    constraints: list[dict],
    decoding: dict | None = None,
    governor: dict | None = None,
    timeout: float = 600.0,
) -> GenerationResult:
    """POST to /api/generate-single, parse SSE stream, return structured result."""
    payload: dict[str, Any] = {"prompt": prompt, "constraints": constraints}
    if decoding:
        payload["decoding"] = decoding
    if governor:
        payload["governor"] = governor

    statuses: list[str] = []
    result_payload: dict | None = None
    error_msg: str | None = None

    with httpx.Client(timeout=httpx.Timeout(timeout, connect=30.0)) as client:
        with client.stream(
            "POST",
            f"{url}/api/generate-single",
            json=payload,
            headers={"Content-Type": "application/json"},
        ) as resp:
            resp.raise_for_status()
            buffer = ""
            for chunk in resp.iter_text():
                buffer += chunk
                while "\n" in buffer:
                    line, buffer = buffer.split("\n", 1)
                    line = line.strip()
                    if not line.startswith("data: "):
                        continue
                    data_str = line[6:].strip()
                    if not data_str:
                        continue
                    try:
                        event = json.loads(data_str)
                    except json.JSONDecodeError:
                        continue
                    if "status" in event:
                        statuses.append(event["status"])
                    elif "result" in event:
                        result_payload = event["result"]
                    elif "error" in event:
                        error_msg = event["error"]

    return _build_result(statuses, result_payload, error_msg)


def _build_result(
    statuses: list[str],
    result: dict | None,
    error: str | None,
) -> GenerationResult:
    out = GenerationResult(statuses=statuses, error=error)

    for s in statuses:
        m = re.search(r"Vocabulary survival: (\d+)%", s)
        if m:
            out.survival_ratio = int(m.group(1)) / 100.0
        attempt_m = re.match(r"Generating \d+ drafts \(attempt (\d+)\)", s)
        if attempt_m and int(attempt_m.group(1)) > 1:
            out.retry_count = int(attempt_m.group(1)) - 1
        if re.match(r"\s+Draft \d+: compliant", s):
            out.drafts_compliant += 1
        if "targeted rollout" in s.lower():
            out.hit_escalation = True

    if error or result is None:
        return out

    out.text = result.get("text", "")
    out.compliant = result.get("compliant", False)
    out.violation_count = result.get("violation_count", 0)
    out.violation_words = result.get("violation_words", [])
    out.boost_coverage = result.get("boost_coverage", [])
    out.warnings = result.get("warnings")
    out.gen_time_ms = result.get("gen_time_ms", 0)
    return out

[ ] Step 4: Run tests, verify they pass

uv run python -m pytest tests/test_sse_client.py -v

Expected: 2 passed.
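
The status-line parsing in `_build_result` can also be exercised on its own. The sketch below applies the same regular expressions to a handful of status strings. The sample strings are assumptions shaped to fit the patterns, not captured server output, and `parse_statuses` is a hypothetical helper name used only for illustration.

```python
import re


def parse_statuses(statuses: list[str]) -> dict:
    """Extract pipeline metrics from SSE status strings (same regexes as _build_result)."""
    out = {
        "survival_ratio": None,
        "retry_count": 0,
        "drafts_compliant": 0,
        "hit_escalation": False,
    }
    for s in statuses:
        m = re.search(r"Vocabulary survival: (\d+)%", s)
        if m:
            out["survival_ratio"] = int(m.group(1)) / 100.0
        attempt = re.match(r"Generating \d+ drafts \(attempt (\d+)\)", s)
        if attempt and int(attempt.group(1)) > 1:
            out["retry_count"] = int(attempt.group(1)) - 1
        # This pattern requires leading whitespace before "Draft".
        if re.match(r"\s+Draft \d+: compliant", s):
            out["drafts_compliant"] += 1
        if "targeted rollout" in s.lower():
            out["hit_escalation"] = True
    return out


# Hypothetical status strings, shaped to match the patterns above.
sample = [
    "Vocabulary survival: 42%",
    "Generating 4 drafts (attempt 1)",
    "  Draft 1: compliant",
    "  Draft 2: 3 violations",
    "Generating 4 drafts (attempt 2)",
    "  Draft 1: compliant",
    "Escalating to targeted rollout",
]
metrics = parse_statuses(sample)
print(metrics)
```

One thing the sketch makes visible: a status of `Draft 1: compliant` with no leading whitespace would not be counted, so the draft counter depends on the server indenting those lines.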

[ ] Step 5: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/sse_client.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_sse_client.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — SSE client for /api/generate-single

generate_one() POSTs prompt+constraints+decoding+governor, parses SSE events into a structured GenerationResult capturing text, compliance, pipeline metrics (survival, retries, escalation), and any error.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 8: generate.py (Stage 1 CLI)

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/generate.py

[ ] Step 1: Implement generate.py

Create generate.py:

"""Stage 1: sweep runner.

Usage:
    uv run python generate.py <config_name> [--server http://localhost:8000] [--resume]
"""

from __future__ import annotations

import argparse
import hashlib
import json
import time
from datetime import datetime, timezone
from pathlib import Path

import httpx

from schemas import load_config, load_prompts, load_constraints
from sse_client import generate_one


HERE = Path(__file__).parent
RUNS_DIR = HERE / "runs"


def _sha(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()[:12]


def _hash_yaml(path: Path) -> str:
    return _sha(path.read_text())


def _check_server_ready(url: str) -> dict:
    resp = httpx.get(f"{url}/api/server/status", timeout=10.0)
    resp.raise_for_status()
    status = resp.json()
    if status.get("status") != "ready":
        raise RuntimeError(f"Server not ready: {status}")
    return status


def _load_done(path: Path) -> set[str]:
    """Load (combo_id|prompt_id) keys from existing generations.jsonl."""
    if not path.exists():
        return set()
    done = set()
    with open(path) as f:
        for line in f:
            try:
                row = json.loads(line)
                done.add(f"{row['combo_id']}|{row['prompt_id']}")
            except (json.JSONDecodeError, KeyError):
                continue
    return done


def run(config_name: str, server: str, resume: bool) -> None:
    config_path = HERE / "configs" / f"{config_name}.yaml"
    config = load_config(config_path)
    prompts = load_prompts(HERE / "prompts.yaml")
    constraints = load_constraints(HERE / "constraints.yaml")

    server_status = _check_server_ready(server)
    print(f"Server ready: {server_status.get('model', 'unknown')}")

    # Establish run dir
    if resume:
        existing = sorted((RUNS_DIR).glob(f"{config_name}-*"))
        if existing:
            run_dir = existing[-1]
            print(f"Resuming into {run_dir}")
        else:
            print(f"--resume requested but no prior run for {config_name}; starting new")
            timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
            run_dir = RUNS_DIR / f"{config_name}-{timestamp}"
    else:
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        run_dir = RUNS_DIR / f"{config_name}-{timestamp}"

    run_dir.mkdir(parents=True, exist_ok=True)
    run_id = run_dir.name
    generations_path = run_dir / "generations.jsonl"

    # Write meta.json
    meta = {
        "run_id": run_id,
        "config_name": config_name,
        "config_path": str(config_path.relative_to(HERE)),
        "config_hash": _hash_yaml(config_path),
        "prompts_hash": _hash_yaml(HERE / "prompts.yaml"),
        "constraints_hash": _hash_yaml(HERE / "constraints.yaml"),
        "server_status": server_status,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    (run_dir / "meta.json").write_text(json.dumps(meta, indent=2))

    done = _load_done(generations_path) if resume else set()
    total = len(constraints) * len(prompts)
    print(f"Sweep: {len(constraints)} constraints × {len(prompts)} prompts = {total} generations")
    if done:
        print(f"Resuming: {len(done)} already done")

    decoding = {
        "temperature": config.decoding.temperature,
        "top_p": config.decoding.top_p,
        "top_k": config.decoding.top_k,
        "repetition_penalty": config.decoding.repetition_penalty,
        "max_new_tokens": config.decoding.max_new_tokens,
        "num_drafts": config.decoding.num_drafts,
    }
    governor_block = {
        "use_punctuation_boost": config.governor.use_punctuation_boost,
        "use_trie_steering": config.governor.use_trie_steering,
        "use_lookahead": config.governor.use_lookahead,
    }

    t_start = time.time()
    completed = 0

    with open(generations_path, "a") as out:
        for combo in constraints:
            combo_id = combo["id"]
            # Constraint payload: 'combine' expands; otherwise the combo dict itself sans id
            if "combine" in combo:
                combo_payload = combo["combine"]
            else:
                combo_payload = [{k: v for k, v in combo.items() if k != "id"}]

            for prompt in prompts:
                key = f"{combo_id}|{prompt.id}"
                if key in done:
                    continue

                progress = completed + 1
                print(f"[{progress}] {combo_id} × {prompt.id} ", end="", flush=True)

                try:
                    result = generate_one(
                        url=server,
                        prompt=prompt.text,
                        constraints=combo_payload,
                        decoding=decoding,
                        governor=governor_block,
                    )
                    err = result.error
                except Exception as e:
                    # Transport-level failure; it is reported once, below, via `err`.
                    err = str(e)
                    result = None

                row = {
                    "run_id": run_id,
                    "config_name": config_name,
                    "prompt_id": prompt.id,
                    "prompt": prompt.text,
                    "combo_id": combo_id,
                    "constraints": combo_payload,
                    "text": result.text if result else "",
                    "pipeline_metrics": {
                        "compliant": result.compliant if result else False,
                        "violation_count": result.violation_count if result else 0,
                        "violation_words": result.violation_words if result else [],
                        "survival_ratio": result.survival_ratio if result else None,
                        "retry_count": result.retry_count if result else 0,
                        "drafts_compliant": result.drafts_compliant if result else 0,
                        "hit_escalation": result.hit_escalation if result else False,
                        "gen_time_ms": result.gen_time_ms if result else 0,
                    },
                    "error": err,
                    "ts": datetime.now(timezone.utc).isoformat(),
                }
                out.write(json.dumps(row, ensure_ascii=False) + "\n")
                out.flush()
                completed += 1

                if err:
                    print(f"ERROR: {err}")
                else:
                    c = "✓" if result.compliant else "✗"
                    surv = result.survival_ratio
                    surv_s = f" surv={surv:.0%}" if surv is not None else ""
                    print(f"{c} {result.gen_time_ms / 1000:.1f}s{surv_s}")

    elapsed = time.time() - t_start
    print(f"\nRun complete: {completed} generations in {elapsed/60:.1f} min")
    print(f"Output: {generations_path}")


def main():
    parser = argparse.ArgumentParser(description="Stage 1: sweep generator")
    parser.add_argument("config_name", help="Name of a configs/<name>.yaml preset")
    parser.add_argument("--server", default="http://localhost:8000")
    parser.add_argument("--resume", action="store_true")
    args = parser.parse_args()
    run(args.config_name, args.server, args.resume)


if __name__ == "__main__":
    main()

[ ] Step 2: Sanity-check imports

cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python -c "import generate; print('ok')"

Expected: ok
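
As a quick illustration of the constraint-payload rule in `run()` (a cell with `combine` sends that list; any other cell sends itself minus its `id`), here is a minimal standalone sketch. The field names in the sample cells (`type`, `phoneme`, `value`) are hypothetical; the real ones are whatever constraints.yaml defines. `to_payload` is an illustrative name, not a function in generate.py.

```python
def to_payload(combo: dict) -> list[dict]:
    """Turn one constraint cell into the list sent as `constraints` in the request."""
    if "combine" in combo:
        # A combined cell already carries a list of constraint dicts.
        return combo["combine"]
    # A single-constraint cell: send the cell itself, minus its bookkeeping id.
    return [{k: v for k, v in combo.items() if k != "id"}]


# Hypothetical cells; the real field names live in constraints.yaml.
single = {"id": "no-r", "type": "exclude", "phoneme": "r"}
combined = {
    "id": "no-r-plus-short",
    "combine": [
        {"type": "exclude", "phoneme": "r"},
        {"type": "max_syllables", "value": 2},
    ],
}

print(to_payload(single))
print(to_payload(combined))
```

Either way the server receives a flat list of constraint dicts, and the cell's `id` is kept only in the output row as `combo_id`.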

[ ] Step 3: Sanity-check CLI help

uv run python generate.py --help

Expected: argparse help with config_name, --server, --resume.
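
The `--resume` behavior rests on the `combo_id|prompt_id` key set built by `_load_done`. The sketch below reproduces that logic against a temporary file to show how a truncated final line (as left by an interrupted run) is tolerated. The ids and file contents are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_done(path: Path) -> set[str]:
    """Collect 'combo_id|prompt_id' keys, skipping malformed lines."""
    if not path.exists():
        return set()
    done: set[str] = set()
    with open(path) as f:
        for line in f:
            try:
                row = json.loads(line)
                done.add(f"{row['combo_id']}|{row['prompt_id']}")
            except (json.JSONDecodeError, KeyError):
                continue  # e.g. a half-written last line from an interrupted run
    return done


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "generations.jsonl"
    path.write_text(
        '{"combo_id": "c1", "prompt_id": "narrative"}\n'
        '{"combo_id": "c1", "prompt_id": "kids"}\n'
        '{"combo_id": "c2", "prompt_'  # truncated line, as after a crash
    )
    done = load_done(path)

# Remaining work = full sweep minus keys already on disk.
sweep = [(c, p) for c in ("c1", "c2") for p in ("narrative", "kids")]
todo = [(c, p) for c, p in sweep if f"{c}|{p}" not in done]
print(sorted(done))
print(todo)
```

Because generate.py writes a row even when a generation errors, those rows also count as done, so `--resume` as written does not retry failed generations.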

[ ] Step 4: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/generate.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — generate.py (Stage 1 sweep runner)

Loads named config, iterates constraints × prompts, POSTs each via SSE client, writes generations.jsonl with structured pipeline metrics. Resume-safe by (combo_id, prompt_id) key. Writes meta.json with config + pool hashes for reproducibility.

Decoding + governor blocks are passed through to /api/generate-single, but the server endpoint won't honor them until PHON-63 lands. Until then, the harness sends the blocks as documentation/forward-compat; the server uses compiled-in defaults.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 9: Claude judge module

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/judge.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_judge.py

[ ] Step 1: Write failing test using mocked Anthropic client

Create tests/test_judge.py:

import json
from pathlib import Path
from unittest.mock import MagicMock

import pytest

from judge import build_system_prompt, build_user_prompt, score_one
from schemas import load_rubric


HERE = Path(__file__).parent
ROOT = HERE.parent


@pytest.fixture
def rubric():
    return load_rubric(ROOT / "rubric.yaml")


def test_build_system_prompt_includes_all_dims(rubric):
    sysp = build_system_prompt(rubric)
    for dim in rubric.dimensions:
        assert dim.id in sysp
        assert dim.description in sysp


def test_build_user_prompt_includes_text(rubric):
    user = build_user_prompt(prompt="Tell a story", text="Once upon a time...")
    assert "Once upon a time" in user
    assert "Tell a story" in user


def test_score_one_parses_response(rubric):
    fake_response = MagicMock()
    fake_response.content = [MagicMock()]
    fake_response.content[0].text = json.dumps({
        "grammaticality": {"score": 5, "rationale": "Well-formed."},
        "coherence": {"score": 4, "rationale": "Coherent."},
        "prompt_following": {"score": 5, "rationale": "Addresses prompt."},
        "natural_ending": {"score": 3, "rationale": "Abrupt."},
        "stays_in_english": {"score": 5, "rationale": "Fully English."},
    })
    fake_response.usage.input_tokens = 1234
    fake_response.usage.output_tokens = 200

    fake_client = MagicMock()
    fake_client.messages.create.return_value = fake_response

    result = score_one(
        client=fake_client,
        rubric=rubric,
        prompt="Tell a story",
        text="Once upon a time, a dog ran in the park.",
    )

    assert result["dim_scores"]["grammaticality"] == 5
    assert result["dim_scores"]["natural_ending"] == 3
    assert result["dim_rationales"]["coherence"] == "Coherent."
    assert result["input_tokens"] == 1234
    assert result["output_tokens"] == 200
    assert result["model"] == rubric.judge.default_model
    assert result["judge_cost_usd"] > 0


def test_score_one_handles_malformed_response(rubric):
    fake_response = MagicMock()
    fake_response.content = [MagicMock()]
    fake_response.content[0].text = "This is not JSON."
    fake_response.usage.input_tokens = 10
    fake_response.usage.output_tokens = 5

    fake_client = MagicMock()
    fake_client.messages.create.return_value = fake_response

    result = score_one(
        client=fake_client,
        rubric=rubric,
        prompt="x",
        text="y",
    )
    # Bad JSON -> all dim scores null, error captured
    assert all(v is None for v in result["dim_scores"].values())
    assert result["error"] is not None

[ ] Step 2: Run test, verify it fails

uv run python -m pytest tests/test_judge.py -v

Expected: collection fails with "ModuleNotFoundError: No module named 'judge'".

[ ] Step 3: Implement judge.py

Create judge.py:

"""Claude-based rubric judge.

Single API call per generation. System prompt = rubric definitions + scale anchors,
cached via Anthropic prompt caching. User prompt = the prompt+text being scored.
Output: JSON with score+rationale per dimension.
"""

from __future__ import annotations

import json
import os
import re
import time
from typing import Any

from anthropic import Anthropic, APIError

from schemas import Rubric


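# Judge model pricing (as of 2026-04, USD per million tokens).
# NOTE: _cost_usd() below applies only the "input" and "output" rates; the
# cache_read / cache_write rates are recorded here but not yet used.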
PRICING_USD_PER_MTOK = {
    "claude-haiku-4-5": {"input": 1.0, "output": 5.0, "cache_read": 0.1, "cache_write": 1.25},
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_read": 0.3, "cache_write": 3.75},
}


def build_system_prompt(rubric: Rubric) -> str:
    """System prompt — cached via Anthropic prompt caching when called."""
    lines = [
        "You are an expert evaluator of short text generations from a constrained-generation system.",
        "",
        "Score the generation on each dimension below. Each dimension is independent — score them separately.",
        "",
        "DIMENSIONS:",
        "",
    ]
    for d in rubric.dimensions:
        lines.append(f"## {d.id}: {d.description}")
        lines.append(f"Scale: {min(d.scale)}-{max(d.scale)}")
        for s, anchor in sorted(d.anchors.items()):
            lines.append(f"  {s}: {anchor}")
        lines.append("")
    lines.extend([
        "OUTPUT FORMAT:",
        "Return a JSON object with one key per dimension. Each value is an object with `score` (integer in scale) and `rationale` (one short sentence).",
        "",
        "Example:",
        '```json',
        "{",
    ])
    for i, d in enumerate(rubric.dimensions):
        comma = "," if i < len(rubric.dimensions) - 1 else ""
        lines.append(f'  "{d.id}": {{"score": 4, "rationale": "<short reason>"}}{comma}')
    lines.extend([
        "}",
        "```",
        "",
        "Respond with ONLY the JSON object, no preamble or commentary.",
    ])
    return "\n".join(lines)


def build_user_prompt(prompt: str, text: str) -> str:
    return (
        f"PROMPT GIVEN TO THE GENERATOR:\n{prompt}\n\n"
        f"GENERATED OUTPUT:\n{text}\n\n"
        "Score this generation on each dimension and return the JSON object."
    )


def _extract_json(s: str) -> dict | None:
    """Pull the first JSON object out of a string. Tolerates ```json fences."""
    fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", s, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass
    # Try the whole string
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        pass
    # Fall back to the widest {...} span (greedy match, not a true balanced-brace parse)
    m = re.search(r"\{.*\}", s, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            pass
    return None


def _cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING_USD_PER_MTOK.get(model)
    if not p:
        return 0.0
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def score_one(
    client: Anthropic,
    rubric: Rubric,
    prompt: str,
    text: str,
    model: str | None = None,
    max_retries: int = 3,
) -> dict[str, Any]:
    """Score a single (prompt, text) pair with the Claude judge. Retries on transient failures."""
    chosen_model = model or rubric.judge.default_model
    sysp = build_system_prompt(rubric)
    user = build_user_prompt(prompt, text)
    dim_ids = [d.id for d in rubric.dimensions]

    last_error: str | None = None
    response = None
    for attempt in range(max_retries):
        try:
            t0 = time.time()
            kwargs = {
                "model": chosen_model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": user}],
            }
            if rubric.judge.cache_system_prompt:
                kwargs["system"] = [
                    {"type": "text", "text": sysp, "cache_control": {"type": "ephemeral"}}
                ]
            else:
                kwargs["system"] = sysp
            response = client.messages.create(**kwargs)
            judge_ms = int((time.time() - t0) * 1000)
            break
        except APIError as e:
            last_error = str(e)
            if attempt == max_retries - 1:
                break
            time.sleep(2 ** attempt)
        except Exception as e:
            last_error = str(e)
            break

    if response is None:
        return {
            "model": chosen_model,
            "dim_scores": {d: None for d in dim_ids},
            "dim_rationales": {d: None for d in dim_ids},
            "judge_ms": 0,
            "judge_cost_usd": 0.0,
            "input_tokens": 0,
            "output_tokens": 0,
            "error": last_error or "unknown",
        }

    raw = response.content[0].text if response.content else ""
    parsed = _extract_json(raw)

    dim_scores: dict[str, int | None] = {d: None for d in dim_ids}
    dim_rationales: dict[str, str | None] = {d: None for d in dim_ids}
    error = None

    if parsed is None:
        error = f"failed to parse judge JSON: {raw[:200]}"
    else:
        for did in dim_ids:
            cell = parsed.get(did)
            if isinstance(cell, dict):
                s = cell.get("score")
                # bool is a subclass of int, so reject JSON true/false explicitly
                if isinstance(s, int) and not isinstance(s, bool):
                    dim_scores[did] = s
                dim_rationales[did] = cell.get("rationale")

    return {
        "model": chosen_model,
        "dim_scores": dim_scores,
        "dim_rationales": dim_rationales,
        "judge_ms": judge_ms,
        "judge_cost_usd": _cost_usd(
            chosen_model, response.usage.input_tokens, response.usage.output_tokens
        ),
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "error": error,
    }


def make_client() -> Anthropic:
    """Create an Anthropic client. Reads ANTHROPIC_API_KEY from env."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY not set. Source from Repos/eureka or your local secret store."
        )
    return Anthropic(api_key=key)

[ ] Step 4: Run tests, verify they pass

uv run python -m pytest tests/test_judge.py -v

Expected: 4 passed.
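
To make the judge's reply handling concrete, the sketch below mirrors the three-step fallback in `_extract_json` (fenced block, then whole string, then widest brace span) and runs it on a few reply shapes. It is a variation rather than a copy: it also rejects non-object JSON, and it builds the fence marker indirectly so the example itself contains no literal fence. The sample replies are invented.

```python
from __future__ import annotations

import json
import re

FENCE = "`" * 3  # built indirectly so this example contains no literal fence


def extract_json(s: str) -> dict | None:
    """Try a fenced block, then the whole string, then the widest {...} span."""
    candidates: list[str] = []
    fenced = re.search(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, s, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))
    candidates.append(s)
    widest = re.search(r"\{.*\}", s, re.DOTALL)
    if widest:
        candidates.append(widest.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):  # extra guard, not in the original
            return parsed
    return None


obj = '{"coherence": {"score": 4, "rationale": "ok"}}'
bare = extract_json(obj)
fenced_reply = extract_json("Here you go:\n" + FENCE + "json\n" + obj + "\n" + FENCE)
chatty = extract_json("Scores: " + obj + " Hope that helps.")
bad = extract_json("This is not JSON.")
print(bare, fenced_reply, chatty, bad)
```

The widest-span fallback is greedy, so a reply containing two separate JSON objects would fail to parse rather than return the first one; `score_one` then records a parse error with null scores.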

[ ] Step 5: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/judge.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_judge.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — Claude judge module

score_one() takes (rubric, prompt, text), builds a cacheable system prompt from rubric dims, calls Claude (default haiku-4-5), parses structured JSON output, returns dim scores + rationales + cost. Retries with exponential backoff on transient failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
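
One gap worth noting in the judge's cost accounting: `_cost_usd` uses only the input and output rates, although `PRICING_USD_PER_MTOK` also lists cache rates. The sketch below shows what a cache-aware variant could look like. It assumes the API reports cached prompt tokens separately from `input_tokens` (the Anthropic usage object exposes `cache_creation_input_tokens` and `cache_read_input_tokens`); the function name, signature, and token counts are hypothetical.

```python
# Haiku row copied from PRICING_USD_PER_MTOK in judge.py (USD per million tokens).
RATES = {"input": 1.0, "output": 5.0, "cache_read": 0.1, "cache_write": 1.25}


def cost_usd(
    rates: dict,
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: int = 0,
    cache_write_tokens: int = 0,
) -> float:
    """Cost of one call, pricing cached prompt tokens at their own rates."""
    total = (
        input_tokens * rates["input"]
        + output_tokens * rates["output"]
        + cache_read_tokens * rates["cache_read"]
        + cache_write_tokens * rates["cache_write"]
    )
    return total / 1_000_000


# Hypothetical call shapes: a 1,000-token system prompt plus a 200-token user turn.
first_call = cost_usd(RATES, 200, 200, cache_write_tokens=1000)   # writes the cache
cached_call = cost_usd(RATES, 200, 200, cache_read_tokens=1000)   # reads the cache
uncached_call = cost_usd(RATES, 1200, 200)                        # caching disabled

print(f"first={first_call:.6f} cached={cached_call:.6f} uncached={uncached_call:.6f}")
```

Under these assumptions the first call costs slightly more than an uncached call and every later call costs less, which is the trade the cacheable system prompt is making. If cached tokens are indeed excluded from `input_tokens`, the current `_cost_usd` under-reports cost whenever caching is active.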

Task 10: score.py (Stage 2 CLI)

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/score.py

[ ] Step 1: Implement score.py

Create score.py:

"""Stage 2: scoring runner.

Usage:
    uv run python score.py <run_id>
    uv run python score.py runs/baseline-v6-20260429T120000Z

Reads runs/<run_id>/generations.jsonl, augments each row with auto metrics +
judge scores, writes runs/<run_id>/scored.jsonl. Idempotent — skips rows already
scored (matched on (run_id, combo_id, prompt_id) key).
"""

from __future__ import annotations

import argparse
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

from auto_metrics import GPT2PPLScorer, VocabGate, distinct_n, max_ngram_rep, real_english_rate
from judge import make_client, score_one
from schemas import load_rubric


HERE = Path(__file__).parent
RUNS_DIR = HERE / "runs"


def _load_scored(path: Path) -> set[str]:
    if not path.exists():
        return set()
    out = set()
    with open(path) as f:
        for line in f:
            try:
                row = json.loads(line)
                out.add(f"{row['run_id']}|{row['combo_id']}|{row['prompt_id']}")
            except (json.JSONDecodeError, KeyError):
                continue
    return out


def _resolve_run_dir(run_id: str) -> Path:
    """Accept either a run_id or a path to the run directory."""
    candidate = Path(run_id)
    if candidate.is_dir() and (candidate / "generations.jsonl").exists():
        return candidate
    candidate = RUNS_DIR / run_id
    if candidate.is_dir():
        return candidate
    print(f"ERROR: run not found: {run_id}", file=sys.stderr)
    sys.exit(1)


def run(run_id: str, model_override: str | None) -> None:
    run_dir = _resolve_run_dir(run_id)
    generations_path = run_dir / "generations.jsonl"
    scored_path = run_dir / "scored.jsonl"

    rubric = load_rubric(HERE / "rubric.yaml")
    vocab = VocabGate(HERE / "vocab.txt")
    print(f"Vocab gate loaded: {len(vocab):,} entries")

    print("Loading GPT-2 PPL scorer...")
    ppl = GPT2PPLScorer()

    client = make_client()

    done = _load_scored(scored_path)
    rows: list[dict] = []
    with open(generations_path) as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))

    todo = [r for r in rows if f"{r['run_id']}|{r['combo_id']}|{r['prompt_id']}" not in done]
    print(f"{len(rows)} total, {len(done)} already scored, {len(todo)} to score")

    total_cost = 0.0

    with open(scored_path, "a") as out:
        for i, row in enumerate(todo, 1):
            text = row.get("text", "")
            if not text or row.get("error"):
                # Skip judging — write a row with null scores so we don't reprocess
                scored_row = {
                    **row,
                    "auto_metrics": {
                        "real_english_rate": None,
                        "distinct_3": None,
                        "distinct_5": None,
                        "max_3gram_rep": None,
                        "ppl_gpt2": None,
                    },
                    "judge": {
                        "model": rubric.judge.default_model,
                        "dim_scores": {d.id: None for d in rubric.dimensions},
                        "dim_rationales": {d.id: None for d in rubric.dimensions},
                        "judge_ms": 0,
                        "judge_cost_usd": 0.0,
                        "input_tokens": 0,
                        "output_tokens": 0,
                        "error": "skipped (no text)",
                    },
                    "scored_at": datetime.now(timezone.utc).isoformat(),
                }
                out.write(json.dumps(scored_row, ensure_ascii=False) + "\n")
                out.flush()
                continue

            auto = {
                "real_english_rate": real_english_rate(text, vocab),
                "distinct_3": distinct_n(text, 3),
                "distinct_5": distinct_n(text, 5),
                "max_3gram_rep": max_ngram_rep(text, 3),
                "ppl_gpt2": ppl.ppl(text),
            }
            judge = score_one(
                client=client,
                rubric=rubric,
                prompt=row.get("prompt", ""),
                text=text,
                model=model_override,
            )
            total_cost += judge["judge_cost_usd"]

            scored_row = {
                **row,
                "auto_metrics": auto,
                "judge": judge,
                "scored_at": datetime.now(timezone.utc).isoformat(),
            }
            out.write(json.dumps(scored_row, ensure_ascii=False) + "\n")
            out.flush()

            print(
                f"[{i}/{len(todo)}] {row['combo_id']} × {row['prompt_id']} "
                f"real_en={auto['real_english_rate']:.0%} "
                f"PPL={auto['ppl_gpt2']:.0f} "
                f"gram={judge['dim_scores'].get('grammaticality')} "
                f"coh={judge['dim_scores'].get('coherence')} "
                f"${judge['judge_cost_usd']:.4f}"
            )

    print(f"\nScored: {len(todo)} new rows. Judge cost this invocation: ${total_cost:.4f}")
    print(f"Output: {scored_path}")


def main():
    parser = argparse.ArgumentParser(description="Stage 2: score generations")
    parser.add_argument("run_id", help="Run id or path to run dir")
    parser.add_argument("--model", default=None, help="Override judge model (e.g., claude-sonnet-4-6)")
    args = parser.parse_args()
    run(args.run_id, args.model)


if __name__ == "__main__":
    main()

[ ] Step 2: Sanity-check imports + CLI

cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python score.py --help

Expected: argparse help output.
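
For readers unfamiliar with the n-gram metrics that score.py records (`distinct_3`, `distinct_5`, `max_3gram_rep`), here is a sketch of one common way to define them. This is an assumption for illustration only: the actual implementations live in auto_metrics.py and may tokenize or normalize differently. The `_sketch` suffix marks these as stand-ins.

```python
from __future__ import annotations

from collections import Counter


def _ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    # Assumption: lowercase + whitespace tokenization.
    tokens = text.lower().split()
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n_sketch(text: str, n: int) -> float | None:
    """Unique n-grams divided by total n-grams; None if the text is too short."""
    grams = _ngrams(text, n)
    if not grams:
        return None
    return len(set(grams)) / len(grams)


def max_ngram_rep_sketch(text: str, n: int) -> int:
    """Occurrence count of the most frequent n-gram (0 if there are none)."""
    counts = Counter(_ngrams(text, n))
    return max(counts.values(), default=0)


looping = "the cat sat on the mat the cat sat on the mat"
d3 = distinct_n_sketch(looping, 3)
rep3 = max_ngram_rep_sketch(looping, 3)
print(d3, rep3)
```

Under this definition a low distinct-n together with a high max repetition count signals looping output, which is the failure mode these metrics are meant to surface alongside the judge scores.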

[ ] Step 3: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/score.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — score.py (Stage 2 scoring runner)

For each generation row: compute auto metrics (real_english_rate, distinct_3/5, max_3gram_rep, ppl_gpt2) + Claude judge scores. Idempotent — skips rows already in scored.jsonl. Tracks cumulative judge cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 11: compare.py (Stage 3 CLI)

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/compare.py

[ ] Step 1: Implement compare.py

Create compare.py:

"""Stage 3: comparison report.

Usage:
    uv run python compare.py <run_id>                            # single-run report
    uv run python compare.py <run_id_a> <run_id_b> [<run_id_c>]  # multi-run comparison

Aggregates one or more scored.jsonl files, emits a markdown report under reports/.
"""

from __future__ import annotations

import argparse
import json
import statistics
import sys
from datetime import datetime, timezone
from pathlib import Path


HERE = Path(__file__).parent
RUNS_DIR = HERE / "runs"
REPORTS_DIR = HERE / "reports"


JUDGE_DIMS = ["grammaticality", "coherence", "prompt_following", "natural_ending", "stays_in_english"]
AUTO_METRICS = ["real_english_rate", "distinct_3", "distinct_5", "max_3gram_rep", "ppl_gpt2"]


def _resolve_run_dir(run_id: str) -> Path:
    candidate = Path(run_id)
    if candidate.is_dir():
        return candidate
    return RUNS_DIR / run_id


def load_scored(run_id: str) -> tuple[str, list[dict]]:
    run_dir = _resolve_run_dir(run_id)
    path = run_dir / "scored.jsonl"
    if not path.exists():
        print(f"ERROR: scored.jsonl not found at {path}", file=sys.stderr)
        sys.exit(1)
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return run_dir.name, rows


def _mean(values: list[float | int | None]) -> float | None:
    nums = [v for v in values if v is not None]
    if not nums:
        return None
    return statistics.fmean(nums)


def aggregate(rows: list[dict]) -> dict[str, float | None]:
    out: dict[str, float | None] = {}
    for d in JUDGE_DIMS:
        out[d] = _mean([r["judge"]["dim_scores"].get(d) for r in rows])
    for m in AUTO_METRICS:
        out[m] = _mean([r["auto_metrics"].get(m) for r in rows])
    out["cost_usd_total"] = sum(r["judge"].get("judge_cost_usd", 0.0) for r in rows)
    out["count"] = len(rows)
    return out


def worst_per_dim(rows: list[dict], dim: str, n: int = 3) -> list[dict]:
    scored = [(r, r["judge"]["dim_scores"].get(dim)) for r in rows]
    scored = [(r, s) for r, s in scored if s is not None]
    scored.sort(key=lambda x: x[1])
    return [r for r, _ in scored[:n]]


def best_per_dim(rows: list[dict], dim: str, n: int = 3) -> list[dict]:
    scored = [(r, r["judge"]["dim_scores"].get(dim)) for r in rows]
    scored = [(r, s) for r, s in scored if s is not None]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [r for r, _ in scored[:n]]


def fmt(value: float | None, decimals: int = 2) -> str:
    if value is None:
        return "—"
    return f"{value:.{decimals}f}"


def render_summary_table(per_run: dict[str, dict]) -> str:
    runs = list(per_run.keys())
    lines = ["| Metric | " + " | ".join(runs) + (" | Δ |" if len(runs) == 2 else " |")]
    lines.append("|" + "---|" * (len(runs) + (2 if len(runs) == 2 else 1)))
    metrics = JUDGE_DIMS + AUTO_METRICS
    for m in metrics:
        cells = []
        for r in runs:
            cells.append(fmt(per_run[r].get(m)))
        if len(runs) == 2:
            a = per_run[runs[0]].get(m)
            b = per_run[runs[1]].get(m)
            if a is not None and b is not None:
                cells.append(f"{b - a:+.2f}")
            else:
                cells.append("—")
        lines.append(f"| {m} | " + " | ".join(cells) + " |")
    # totals
    cost_cells = [fmt(per_run[r].get("cost_usd_total"), 4) for r in runs]
    if len(runs) == 2:
        cost_cells.append("—")
    lines.append("| total cost USD | " + " | ".join(cost_cells) + " |")
    return "\n".join(lines)


def render_examples(rows: list[dict], dim: str, kind: str = "worst", n: int = 3) -> str:
    selected = (worst_per_dim if kind == "worst" else best_per_dim)(rows, dim, n)
    lines = []
    for r in selected:
        score = r["judge"]["dim_scores"].get(dim, "?")
        rationale = r["judge"]["dim_rationales"].get(dim, "?")
        snippet = (r.get("text") or "").replace("\n", " ")[:200]
        lines.append(
            f"- **{r['combo_id']} × {r['prompt_id']}** — score {score}\n"
            f"  > {snippet}\n"
            f"  *Rationale:* {rationale}"
        )
    return "\n".join(lines) if lines else "_(no rows)_"


def render_report(per_run: dict[str, list[dict]]) -> str:
    runs = list(per_run.keys())
    aggregates = {r: aggregate(rows) for r, rows in per_run.items()}

    lines = [
        f"# Eval Comparison: {' vs '.join(runs)}",
        "",
        f"_Generated {datetime.now(timezone.utc).isoformat()}_",
        "",
        "## Summary",
        "",
        render_summary_table(aggregates),
        "",
    ]

    # Per-run examples
    for run_id, rows in per_run.items():
        lines.extend([
            f"## Run: {run_id}",
            "",
            f"Generations: {len(rows)} | total judge cost: ${aggregates[run_id]['cost_usd_total']:.4f}",
            "",
        ])
        for dim in JUDGE_DIMS:
            lines.extend([f"### Worst on `{dim}` (run: {run_id})", ""])
            lines.append(render_examples(rows, dim, kind="worst", n=3))
            lines.append("")

    # Auto-vs-judge sanity check (cheap proxy: real_english_rate vs stays_in_english)
    lines.extend(["## Sanity check: real_english_rate vs stays_in_english", ""])
    for run_id, rows in per_run.items():
        rer_se = [
            (r["auto_metrics"].get("real_english_rate"), r["judge"]["dim_scores"].get("stays_in_english"))
            for r in rows
        ]
        rer_se = [(a, b) for a, b in rer_se if a is not None and b is not None]
        if len(rer_se) >= 2:
            try:
                corr = statistics.correlation([a for a, _ in rer_se], [b for _, b in rer_se])
                lines.append(f"- {run_id}: Pearson r = {corr:+.2f} (n={len(rer_se)})")
            except statistics.StatisticsError:
                lines.append(f"- {run_id}: correlation undefined (constant data)")
        else:
            lines.append(f"- {run_id}: insufficient data")
    lines.append("")
    return "\n".join(lines)


def run(run_ids: list[str]) -> None:
    per_run: dict[str, list[dict]] = {}
    for rid in run_ids:
        name, rows = load_scored(rid)
        per_run[name] = rows

    report = render_report(per_run)

    REPORTS_DIR.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    name = "-vs-".join(per_run.keys())[:80]
    report_path = REPORTS_DIR / f"{timestamp}-{name}.md"
    # The report contains non-ASCII characters (e.g. "×"), so pin the encoding
    report_path.write_text(report, encoding="utf-8")
    print(report)
    print(f"\nReport saved to {report_path}")


def main():
    parser = argparse.ArgumentParser(description="Stage 3: compare scored runs")
    parser.add_argument("run_ids", nargs="+", help="Run id(s) or path(s) to run dir(s)")
    args = parser.parse_args()
    run(args.run_ids)


if __name__ == "__main__":
    main()

[ ] Step 2: Sanity-check CLI

cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python compare.py --help

Expected: argparse help.

[ ] Step 3: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/compare.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — compare.py (Stage 3 markdown report)

Aggregates one or more scored runs into a markdown report: per-dim summary table with deltas, worst-3-per-dim examples per run, auto-vs-judge sanity correlation.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 12: experiments.jsonl helpers (autoresearch readiness)

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/experiments.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_experiments.py

[ ] Step 1: Write failing test

Create tests/test_experiments.py:

import json
from pathlib import Path

from experiments import append_experiment, ExperimentEntry


def test_append_creates_file(tmp_path):
    log = tmp_path / "experiments.jsonl"
    entry = ExperimentEntry(
        actor="human",
        hypothesis="Lower rep_penalty improves naturalness",
        config_path="configs/lower-rep-penalty.yaml",
        run_id="lower-rep-penalty-20260429T120000Z",
        comparison_against="baseline-v6-20260429T100000Z",
        verdict="rejected",
        verdict_evidence="grammaticality unchanged, real_english_rate dropped 8pp",
    )
    append_experiment(log, entry)

    assert log.exists()
    with open(log) as f:
        lines = f.readlines()
    assert len(lines) == 1
    parsed = json.loads(lines[0])
    assert parsed["hypothesis"] == "Lower rep_penalty improves naturalness"
    assert "ts" in parsed


def test_append_appends_to_existing(tmp_path):
    log = tmp_path / "experiments.jsonl"
    log.write_text('{"existing": true}\n')
    entry = ExperimentEntry(
        actor="agent",
        hypothesis="x",
        config_path="y",
        run_id="z",
    )
    append_experiment(log, entry)

    with open(log) as f:
        lines = f.readlines()
    assert len(lines) == 2

[ ] Step 2: Run test, verify it fails

uv run python -m pytest tests/test_experiments.py -v

Expected: FAIL with import error.

[ ] Step 3: Implement experiments.py

Create experiments.py:

"""experiments.jsonl — append-only log of hypothesis → run → verdict entries.

Both human and agent contribute. Same primitives drive manual and autoresearch loops.
"""

from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Literal


@dataclass
class ExperimentEntry:
    actor: Literal["human", "agent"]
    hypothesis: str
    config_path: str
    run_id: str
    scored_path: str | None = None
    comparison_against: str | None = None
    verdict: str | None = None              # 'accepted', 'rejected', 'mixed', 'baseline', None=pending
    verdict_evidence: str | None = None
    ts: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


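def append_experiment(log_path: Path, entry: ExperimentEntry) -> None:
    """Append one entry as a single JSON line, creating the log if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Explicit UTF-8: ensure_ascii=False can emit non-ASCII characters (e.g. "→")
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(entry), ensure_ascii=False) + "\n")


def read_experiments(log_path: Path) -> list[dict]:
    """Return all logged entries in file order; a missing log yields an empty list."""
    if not log_path.exists():
        return []
    out = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                out.append(json.loads(line))
    return out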
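
Because read_experiments returns plain dicts, an autoresearch loop can query prior verdicts with ordinary comprehensions. The sketch below assumes entries shaped like asdict(ExperimentEntry); pending_runs and already_tested are hypothetical helpers and are not part of experiments.py.

```python
def pending_runs(entries: list[dict]) -> list[str]:
    """Run ids whose verdict is still None (pending)."""
    return [
        e["run_id"]
        for e in entries
        if "run_id" in e and e.get("verdict") is None
    ]


def already_tested(entries: list[dict], hypothesis: str) -> bool:
    """True if an entry with the same hypothesis text already has a verdict."""
    return any(
        e.get("hypothesis") == hypothesis and e.get("verdict") is not None
        for e in entries
    )


# Entries shaped like the output of read_experiments(); values are illustrative.
entries = [
    {"actor": "human", "hypothesis": "Lower rep_penalty improves naturalness",
     "run_id": "lower-rep-penalty-20260429T120000Z", "verdict": "rejected"},
    {"actor": "agent", "hypothesis": "x", "run_id": "z", "verdict": None},
    {"existing": True},  # legacy line without the expected keys is skipped
]

print(pending_runs(entries))         # ['z']
print(already_tested(entries, "x"))  # False
```

Matching on exact hypothesis text is deliberately naive; a real loop would probably key on config_path or a normalized hypothesis id.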

[ ] Step 4: Run tests, verify they pass

uv run python -m pytest tests/test_experiments.py -v

Expected: 2 passed.

[ ] Step 5: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/experiments.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_experiments.py
git commit -m "$(cat <<'EOF'
feat(generation/research): PHON-57 — experiments.jsonl helpers

Append-only log of hypothesis → config → run → verdict entries. Same primitives drive manual and autoresearch flows.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 13: E2E integration test

Files:
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/conftest.py
- Create: packages/generation/research/2026-04-29-eval-harness-v1/tests/test_e2e.py

[ ] Step 1: Write the e2e test

Create tests/conftest.py:

"""Shared fixtures."""

import sys
from pathlib import Path

# Make the harness modules importable in tests
sys.path.insert(0, str(Path(__file__).parent.parent))

Create tests/test_e2e.py:

"""End-to-end integration test.

One config × one prompt × one constraint, end-to-end with stubbed Claude judge
and stubbed SSE response. Exercises generate.py → score.py → compare.py.
"""

import json
import sys
from pathlib import Path
from unittest.mock import MagicMock, patch

import pytest


HERE = Path(__file__).parent
ROOT = HERE.parent


@pytest.fixture
def fake_sse_response():
    return (
        'data: {"status": "Vocabulary survival: 80%"}\n'
        'data: {"status": "  Draft 1: compliant"}\n'
        'data: {"result": {"text": "The dog ran across the park.", "compliant": true, "violation_count": 0, "violation_words": [], "boost_coverage": [], "warnings": null, "gen_time_ms": 1500}}\n'
    )


def _mock_httpx_stream(body: str):
    mock_resp = MagicMock()
    mock_resp.iter_text.return_value = [body]
    mock_resp.raise_for_status = MagicMock()
    mock_stream = MagicMock()
    mock_stream.__enter__ = MagicMock(return_value=mock_resp)
    mock_stream.__exit__ = MagicMock(return_value=None)
    return mock_stream


def _mock_anthropic_response():
    fake_response = MagicMock()
    fake_response.content = [MagicMock()]
    fake_response.content[0].text = json.dumps({
        "grammaticality": {"score": 5, "rationale": "Good."},
        "coherence": {"score": 4, "rationale": "Clear."},
        "prompt_following": {"score": 5, "rationale": "On topic."},
        "natural_ending": {"score": 4, "rationale": "Reasonable."},
        "stays_in_english": {"score": 5, "rationale": "All English."},
    })
    fake_response.usage.input_tokens = 1000
    fake_response.usage.output_tokens = 100
    return fake_response


def test_e2e_one_cell(tmp_path, fake_sse_response, monkeypatch):
    """generate.py → score.py → compare.py with a single cell."""
    # Set up a temporary harness dir mirror
    harness = tmp_path / "harness"
    harness.mkdir()
    (harness / "configs").mkdir()
    (harness / "tests").mkdir()
    (harness / "runs").mkdir()
    (harness / "reports").mkdir()

    # Copy YAMLs from real harness
    for name in ["rubric.yaml", "prompts.yaml", "constraints.yaml", "vocab.txt"]:
        (harness / name).write_text((ROOT / name).read_text())
    (harness / "configs" / "baseline-v6.yaml").write_text((ROOT / "configs" / "baseline-v6.yaml").read_text())

    # Reduce the prompt + constraint pools to 1 each for fast e2e
    (harness / "prompts.yaml").write_text(
        '- {id: dog_park, text: "Write a short paragraph about a dog playing in the park.", genre: narrative}\n'
    )
    (harness / "constraints.yaml").write_text(
        '- {id: exc_r, type: exclude, phonemes: ["ɹ"]}\n'
    )

    # Make harness importable
    monkeypatch.syspath_prepend(str(harness))
    monkeypatch.chdir(harness)

    # Force-reimport modules from this dir
    for mod in ["generate", "score", "compare", "schemas", "sse_client", "auto_metrics", "judge", "experiments"]:
        if mod in sys.modules:
            del sys.modules[mod]

    # Copy the python modules in too
    for mod in ["schemas.py", "sse_client.py", "auto_metrics.py", "judge.py", "generate.py", "score.py", "compare.py", "experiments.py"]:
        (harness / mod).write_text((ROOT / mod).read_text())

    import generate
    import score
    import compare

    # Patch the SSE call + server-status check
    with patch("sse_client.httpx.Client") as mock_httpx_client_cls, \
         patch("generate.httpx.get") as mock_status:
        # Server status check returns ready
        mock_status.return_value = MagicMock(
            json=lambda: {"status": "ready", "model": "stub"},
            raise_for_status=lambda: None,
        )
        # SSE client returns canned response
        mock_client_instance = MagicMock()
        mock_client_instance.stream = MagicMock(return_value=_mock_httpx_stream(fake_sse_response))
        mock_client_instance.__enter__ = MagicMock(return_value=mock_client_instance)
        mock_client_instance.__exit__ = MagicMock(return_value=None)
        mock_httpx_client_cls.return_value = mock_client_instance

        generate.run("baseline-v6", "http://localhost:8000", resume=False)

    # Verify generations.jsonl exists with one row
    runs = list((harness / "runs").iterdir())
    assert len(runs) == 1
    run_dir = runs[0]
    gens_path = run_dir / "generations.jsonl"
    assert gens_path.exists()
    with open(gens_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    assert len(rows) == 1
    assert rows[0]["text"] == "The dog ran across the park."

    # Now stub Claude + run score.py
    with patch("judge.make_client") as mock_make_client:
        fake_client = MagicMock()
        fake_client.messages.create.return_value = _mock_anthropic_response()
        mock_make_client.return_value = fake_client

        score.run(run_dir.name, model_override=None)

    scored_path = run_dir / "scored.jsonl"
    assert scored_path.exists()
    with open(scored_path) as f:
        scored = [json.loads(line) for line in f if line.strip()]
    assert len(scored) == 1
    assert scored[0]["judge"]["dim_scores"]["grammaticality"] == 5
    assert scored[0]["auto_metrics"]["real_english_rate"] == 1.0

    # Run compare.py — single-run report
    compare.run([run_dir.name])
    reports = list((harness / "reports").iterdir())
    assert len(reports) == 1
    report_text = reports[0].read_text()
    assert run_dir.name in report_text
    assert "grammaticality" in report_text

[ ] Step 2: Run e2e test, verify it passes

cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python -m pytest tests/test_e2e.py -v

Expected: 1 passed. (May take 10-30s — first run loads GPT-2.)

[ ] Step 3: Run full test suite as final check

uv run python -m pytest tests/ -v

Expected: ~17 tests pass total across all test files.

[ ] Step 4: Commit

git add packages/generation/research/2026-04-29-eval-harness-v1/tests/conftest.py packages/generation/research/2026-04-29-eval-harness-v1/tests/test_e2e.py
git commit -m "$(cat <<'EOF'
test(generation/research): PHON-57 — e2e integration test

generate → score → compare with stubbed SSE + stubbed Claude. Exercises the full pipeline against a 1×1 cell, verifies generations.jsonl, scored.jsonl, and the markdown report all materialize correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Task 14: First baseline run + notebook entry

This task is operational: it runs the harness for real against a live generation server and logs the result in the lab notebook. Skip it if the server isn't available locally, and come back to it after PHON-63 lands.

Files:
- Modify: packages/generation/research/2026-04-29-eval-harness-v1/notebook.md

[ ] Step 1: Confirm ANTHROPIC_API_KEY is set

echo "${ANTHROPIC_API_KEY:0:10}..."

If the output is just "...", the key is unset. Source it from ~/Repos/eureka (per project memory) or set it with:

export ANTHROPIC_API_KEY=sk-ant-...

[ ] Step 2: Confirm generation server is running

curl http://localhost:8000/api/server/status

Expected: {"status": "ready", ...}. If not running:

cd packages/generation
uv run python scripts/build_lookup.py
uv run uvicorn server.main:app --host 0.0.0.0 --port 8000

(Leave running in another terminal.)

[ ] Step 3: Confirm vocab.txt is present

ls -lh packages/generation/research/2026-04-29-eval-harness-v1/vocab.txt

If missing, run ./refresh_vocab.sh in that directory.

[ ] Step 4: Run the baseline sweep

cd packages/generation/research/2026-04-29-eval-harness-v1
uv run python generate.py baseline-v6

Expected: 40 generations (8 constraints × 5 prompts) in roughly 10–20 minutes, depending on T5Gemma latency. Output: runs/baseline-v6-<timestamp>/generations.jsonl.
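
One way to confirm the sweep covered the full matrix is to count distinct (combo_id, prompt_id) pairs in generations.jsonl. This is a minimal sketch, assuming each row carries the combo_id and prompt_id keys that compare.py reads; check_sweep is a hypothetical helper, and the default of 40 comes from the 8 × 5 matrix above.

```python
import json
from pathlib import Path


def check_sweep(gens_path: Path, expected: int = 40) -> tuple[int, int]:
    """Return (row_count, distinct_cell_count); raise if either differs from expected."""
    with open(gens_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    # A cell is one constraint combo paired with one prompt
    cells = {(r["combo_id"], r["prompt_id"]) for r in rows}
    if len(rows) != expected or len(cells) != expected:
        raise ValueError(
            f"expected {expected} cells, got {len(rows)} rows / {len(cells)} distinct"
        )
    return len(rows), len(cells)
```

Called as check_sweep(Path("runs/baseline-v6-<timestamp>/generations.jsonl")), it would catch both a truncated sweep and duplicate rows left behind by a resumed run.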

[ ] Step 5: Score the run

uv run python score.py runs/baseline-v6-<timestamp>

Expected: ~$0.12 in Claude API cost and ~5–10 minutes of runtime.

[ ] Step 6: Generate the report

uv run python compare.py runs/baseline-v6-<timestamp>

Expected: markdown report in reports/<timestamp>-baseline-v6-...md.

[ ] Step 7: Add notebook entry

Append to notebook.md:

## Run 1: baseline-v6 — 2026-04-29 (or actual date)

**Config:** `configs/baseline-v6.yaml`
**Run id:** `baseline-v6-<timestamp>`
**Report:** `reports/<timestamp>-baseline-v6-...md`

**Setup:** local generation server, T5Gemma 9B-2B, vocab.txt = D1 207K snapshot.

**Headline numbers (fill in from report):**
- grammaticality: ?
- coherence: ?
- prompt_following: ?
- natural_ending: ?
- stays_in_english: ?
- real_english_rate: ?
- distinct_3: ?
- ppl_gpt2: ?

**Worst cells (top 1 per failure-mode dim):**
- grammaticality: <combo × prompt> — <snippet>
- stays_in_english: <combo × prompt> — <snippet>
- real_english_rate: <combo × prompt> — <snippet>

**First takeaway:** <one paragraph synthesizing the baseline picture>

**Hypotheses generated (fed into PHON-58 / PHON-59):**
- <hypothesis 1>
- <hypothesis 2>

[ ] Step 8: Append to experiments.jsonl

uv run python -c "
from pathlib import Path
from experiments import append_experiment, ExperimentEntry
append_experiment(
    Path('experiments.jsonl'),
    ExperimentEntry(
        actor='human',
        hypothesis='Establish baseline measurement of v6 governed generation under representative prompt × constraint matrix',
        config_path='configs/baseline-v6.yaml',
        run_id='<actual run_id>',
        scored_path='runs/<run_id>/scored.jsonl',
        verdict='baseline',
        verdict_evidence='See notebook.md Run 1 entry',
    )
)
"

[ ] Step 9: Commit notebook + experiments log

git add packages/generation/research/2026-04-29-eval-harness-v1/notebook.md packages/generation/research/2026-04-29-eval-harness-v1/experiments.jsonl
git commit -m "$(cat <<'EOF'
research(generation): PHON-57 — first baseline-v6 sweep + notebook entry

40 generations × baseline config × baseline prompt+constraint pools, scored end-to-end. Establishes the measurement substrate that PHON-58/59/60/61 will compare future configs against.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"

Done

The eval harness is implemented and tested, and it has produced its first measurement of the production system. PHON-57 is complete.

Next steps (other tickets, not this plan):
- PHON-63: server-side decoding-param override API. After this lands, the harness can sweep over real config alternatives.
- PHON-58: research spike. Reads the baseline report, surveys SOTA, audits implementation, proposes candidate configs to feed back into the harness.
- PHON-59/60/61: quality / length / compliance work. Each candidate fix becomes a named config, swept, scored, compared against baseline.

Push the branch and open a PR against develop per project convention (feedback_branch_management.md).