Generation Quality Eval Harness — Design¶
Date: 2026-04-29
Status: Spec — pending user review
Branch: TBD — feature branch off develop, created at writing-plans handoff
Tickets: Epic + 7 children, Jira keys reserved at writing-plans handoff (per feedback_verify_jira_state.md — JQL the next free PHON-XX rather than guessing)
Problem¶
Friend testing of the v6 governed-generation system surfaced quality regressions that are real, multi-causal, and currently unmeasurable:
- Heavy phoneme exclusion (e.g., sibilant cluster) collapses output into code-switched degeneration: German, Italian, Latin, Hebrew, Arabic tokens; severe repetition (
mek mek mek,Alen Alen Alen); "compliant" non-English pseudo-words (mek,aan,alen) that the dictionary doesn't recognize. - Even successful runs aren't grammatical: "lived an old little" (no head noun), "On this new acquaintance was a smiling smiley face!" (mangled syntax).
- Output runs to ~
max_new_tokensrather than stopping at a natural conclusion. - Compliance-pass conflates "no banned phonemes detected" with "is a real English word." A token sequence the trie didn't trip can fail dictionary lookup.
The existing constraint_grid_sweep.py captures pipeline metrics (compliance, violations, survival, retries) but no quality dimension at all — no PPL, no judge scoring, no OOV detection, no repetition rate, no code-switching detection. We can't tell whether a candidate fix actually helps.
Goal: build a fresh eval harness that systematically measures generation quality across (config × prompt × constraints), supports A/B comparison between named experiment configs, and is structured so an autoresearch loop can drive it as easily as a human can.
Scope¶
This spec covers child #1 of the Generation Quality Overhaul epic.
Epic decomposition (parent + 7 children): 1. Eval harness — this spec 2. Research spike: landscape (constrained-generation SOTA, eval-methodology SOTA) + implementation audit (are we using our stack correctly?) 3. Quality / grammaticality / coherence under constraint 4. Length / stop control 5. Compliance trustworthiness (OOV gating) 6. UX polish (punch list, no separate brainstorm) 7. Server-side decoding-param override API (small; blocks #1's actual usage but not its design/scaffolding)
Each remaining child gets its own brainstorm session; #1 unlocks measurement for all of them.
Methodology principle¶
Epistemic scavenging. This is a fresh design from first principles, not an extension of constraint_grid_sweep.py or analyze_sweep.py. Those existing scripts stay as historical artifacts — references to consiliate against when looking back across iterations, not foundations to build on. Rediscovering shapes that exist in the prior art is fine and validates the underlying truth; landing somewhere different is also fine and escapes the local minimum.
Future eval iterations will likewise spawn new artifacts under packages/generation/research/<date>-<topic>/, never in-place edits to this one.
This applies to methodology (harness shape, rubric, scoring protocol). Off-the-shelf models (Claude as judge, GPT-2 as PPL scorer) are tools, not methodology.
Architecture¶
Three deterministic CLI stages plus three artifact types (configs, rubric, prompt/constraint pools).
packages/generation/research/2026-04-29-eval-harness-v1/
├── notebook.md # lab notebook: hypotheses, findings, run index
├── rubric.yaml # judge dims + auto metrics + scales (versioned)
├── prompts.yaml # prompt set we sweep against
├── constraints.yaml # constraint pool we sweep over
├── configs/
│ ├── baseline-v6.yaml # production decoding + governor settings, snapshot
│ └── ... # other named presets as we hypothesize
├── generate.py # stage 1: sweep runner
├── score.py # stage 2: scoring runner
├── compare.py # stage 3: comparison/aggregation
├── experiments.jsonl # append-only log: hypothesis → run → verdict
├── runs/
│ └── <run_id>/ # run_id = <config_name>-<timestamp>
│ ├── meta.json # config snapshot, pool hashes, server version
│ ├── generations.jsonl # one row per (combo × prompt) generation
│ └── scored.jsonl # generations.jsonl + auto metrics + judge scores
└── reports/
└── <comparison-name>.md # comparison(s) between runs
Stage 1 — Generate¶
generate.py <config_name>
Iterates (combo × prompt), POSTs each to /api/generate-single with the config's decoding-param overrides, captures text + SSE pipeline metrics → runs/<run_id>/generations.jsonl.
- Resume-safe via row-key checking
- Reads
prompts.yaml,constraints.yaml,configs/<config_name>.yaml - Writes
meta.jsonsnapshot at start - Server must be running locally on
:8000; harness checks/api/server/statusfirst
Stage 2 — Score¶
score.py <run_id>
For each generation row, computes:
- Auto metrics (local Python, no network):
- real_english_rate — fraction of [a-zA-Z]+ tokens present in the D1 words table (207,665 entries, verified 2026-04-29)
- distinct_3, distinct_5 — distinct-n-gram ratios (1.0 = no repeats)
- max_3gram_rep — max repetition rate of any 3-gram in the output
- ppl_t5gemma_2b_cond — conditional perplexity of generated text given prompt, scored by held-out google/t5gemma-2b-2b-prefixlm-it. Same Gemma family as the production 9B-2B generator (matching tokenizer + architecture) so cross-family calibration noise doesn't pollute the signal. The encoder gets the prompt, the decoder scores the generated text — answers "did the held-out LM expect this output given the prompt?" rather than "is this English-shaped?"
- Judge dimensions (Claude API, single call per row, structured JSON output):
- grammaticality, coherence, prompt_following, natural_ending, stays_in_english (1–5 scales)
- System prompt = rubric definitions + scale anchors + few-shot examples → cached via Anthropic prompt caching
- User content = prompt + generated text → cache miss per row (unavoidable)
- Default model: claude-haiku-4-5 (fast/cheap iterations); claude-sonnet-4-6 for definitive runs (config field on rubric)
Output → runs/<run_id>/scored.jsonl. Idempotent: re-runs skip already-scored rows by run_id|combo_id|prompt_id key.
Stage 3 — Compare¶
compare.py <run_id> [<run_id> ...]
Aggregates one or more scored runs. Emits markdown report → reports/<comparison-name>.md.
Report sections:
- Header: configs being compared, run paths, timestamps
- Summary table: per-dimension means + Δ between configs
- Worst-example list per dim (top 3 lowest-scoring cells per dim, per config, with generated text)
- Best-example list (top 3 highest-scoring constrained cells)
- Auto-metric vs judge-dim correlation (sanity check on the judge — if real_english_rate doesn't correlate with stays_in_english, the judge is unreliable)
- Suggested next experiment (manual fill or, in autoresearch mode, agent-generated)
Data contracts¶
rubric.yaml (v1 schema)¶
version: 1
judge:
model: claude-haiku-4-5 # default; sonnet for definitive runs
cache_system_prompt: true
dimensions: # LLM-judge scored, 1-5 scale
- id: grammaticality
description: Is the output well-formed English syntax?
anchors:
1: "Severe grammar errors throughout (broken syntax, missing core words)"
3: "Some grammar issues but generally readable"
5: "Fully grammatical, no syntactic errors"
- id: coherence
description: Does the output hold together as discourse and stay on topic?
anchors:
1: "Incoherent, topic salad, contradictory"
3: "Mostly coherent with some drift or non-sequiturs"
5: "Tight coherence end to end"
- id: prompt_following
description: Does the output address what the prompt asked for?
anchors:
1: "Ignores the prompt entirely"
3: "Partially addresses it"
5: "Fully addresses the prompt"
- id: natural_ending
description: Does it conclude on its own, or pad to budget?
anchors:
1: "Truncated mid-thought or padded with filler"
3: "Acceptable ending but somewhat abrupt or padded"
5: "Concludes naturally"
- id: stays_in_english
description: Does it avoid code-switching into other languages?
anchors:
1: "Heavy code-switching, mostly non-English tokens"
3: "Some non-English tokens"
5: "Fully English"
auto_metrics:
- { id: real_english_rate, source: d1_words }
- { id: distinct_3 }
- { id: distinct_5 }
- { id: max_3gram_rep }
- { id: ppl_t5gemma_2b_cond, model: google/t5gemma-2b-2b-prefixlm-it }
configs/<name>.yaml (one experiment-config preset)¶
name: baseline-v6
description: Production decoding + governor settings, snapshot 2026-04-29
parent_config: null # for diff tracking across iterations
decoding:
temperature: 0.8
top_p: 0.92
top_k: 80
repetition_penalty: 1.2
max_new_tokens: 128
num_drafts: 4
governor:
use_punctuation_boost: true
use_trie_steering: true
use_lookahead: true
The decoding block gets passed through to the server-side override API (child #7). The governor block lists feature toggles that the server may or may not currently expose; until #7 lands, these fields are documentation-only — the server runs with its compiled-in defaults and the harness sweeps only over the (prompt × constraints) axis. After #7, both blocks are honored. Configs written today are forward-compatible with that handoff.
prompts.yaml¶
- id: dog_park
text: "Write a short paragraph about a dog playing in the park."
genre: narrative
- id: asteroid_kids
text: "Tell a kids' story about an asteroid in space."
genre: kids_story
- id: king_edicts
text: "Write a proclamation by a king announcing three royal edicts."
genre: declamation
- id: pb_sandwich
text: "Describe how to make a peanut butter sandwich in three steps."
genre: procedural
- id: tying_shoes
text: "Give a brief instruction for tying shoes."
genre: instructional
Span: narrative, kid-genre, declamatory, procedural, instructional. Different natural-ending signals so the natural_ending dim has signal to detect.
constraints.yaml¶
- id: exc_r
type: exclude
phonemes: ["ɹ", "ɝ", "ɚ"]
- id: exc_szshzh
type: exclude
phonemes: ["s", "z", "ʃ", "ʒ"]
- id: aoa_le5
type: bound
norm: aoa_kuperman
max: 5.0
- id: aoa_le7
type: bound
norm: aoa_kuperman
max: 7.0
- id: mixed_aoa7_excr
combine:
- { type: bound, norm: aoa_kuperman, max: 7.0 }
- { type: exclude, phonemes: ["ɹ", "ɝ", "ɚ"] }
- id: mixed_aoa5_excsz
combine:
- { type: bound, norm: aoa_kuperman, max: 5.0 }
- { type: exclude, phonemes: ["s", "z"] }
- id: inc_k
type: include
phonemes: ["k"]
target_rate: 0.20
- id: con_sz
type: contrastive
pair_type: minpair
phoneme1: s
phoneme2: z
position: any
8 cells × 5 prompts = 40 SSE calls per config (each call produces num_drafts internal drafts, returns the best). Judge cost back-of-envelope: 40 calls × ~$0.003/call at Haiku 4.5 rates ≈ $0.12 per sweep. Affordable to iterate.
generations.jsonl row¶
{
"run_id": "baseline-v6-20260429T120000",
"config_name": "baseline-v6",
"prompt_id": "asteroid_kids",
"prompt": "Tell a kids' story about an asteroid in space.",
"combo_id": "exc_szshzh",
"constraints": [...],
"text": "Once upon a time...",
"drafts": [...],
"alternatives": [...],
"pipeline_metrics": {
"compliant": true,
"violation_count": 0,
"violation_words": [],
"survival_ratio": 0.48,
"retry_count": 0,
"hit_escalation": false,
"gen_time_ms": 8243
},
"ts": "2026-04-29T12:34:56Z"
}
scored.jsonl row¶
Generation row + augmented:
{
...generation row fields...,
"auto_metrics": {
"real_english_rate": 0.78,
"distinct_3": 0.62,
"distinct_5": 0.81,
"max_3gram_rep": 0.04,
"ppl_t5gemma_2b_cond": 142.3
},
"judge": {
"model": "claude-haiku-4-5",
"dim_scores": { "grammaticality": 3, "coherence": 4, "prompt_following": 4, "natural_ending": 2, "stays_in_english": 5 },
"dim_rationales": { "grammaticality": "...", ... },
"judge_ms": 1842,
"judge_cost_usd": 0.0048,
"input_tokens": 1234,
"output_tokens": 187
}
}
experiments.jsonl row (shared log; both human and agent append)¶
{
"ts": "2026-04-29T12:34:56Z",
"actor": "human",
"hypothesis": "Lower repetition_penalty improves naturalness on heavy-constraint runs",
"config_path": "configs/lower-rep-penalty.yaml",
"run_id": "lower-rep-penalty-20260429T123456",
"scored_path": "runs/lower-rep-penalty-20260429T123456/scored.jsonl",
"comparison_against": "baseline-v6-20260429T100000",
"verdict": "rejected",
"verdict_evidence": "grammaticality unchanged, real_english_rate dropped 8pp, ppl +26"
}
Autoresearch readiness¶
The harness is designed so an agent can drive it with the same primitives a human uses:
- Configs are first-class YAML files in configs/ — agent writes, agent runs
- All three stages are deterministic CLI scripts with file IO; no in-memory state to recover
- experiments.jsonl is append-only; both human and agent contribute rows
- notebook.md is a free-form lab journal that both contribute to
- The autoresearch loop itself becomes a thin orchestrator (out of scope for this child ticket; ties into the existing "Adversarial Autoresearch" memory entry as a future child): read latest experiments.jsonl → propose new config → run generate→score→compare → append verdict → repeat
No additional primitives need to exist for autoresearch. v1 here is the substrate.
Operational concerns¶
- Idempotency:
score.pyskips already-scored rows by(run_id, combo_id, prompt_id).compare.pyis read-only.generate.pyresumes from existinggenerations.jsonlif--resume. - Judge failure: exponential backoff retry (3x); on final fail, dim values =
null, error logged to row. Comparison aggregation tolerates nulls. - PPL scorer failure: mark
null, log; don't crash the run. - Cost tracking:
judge_cost_usdper row;compare.pyreports total per run. - Reproducibility:
meta.jsonper run captures: config snapshot, prompt-set hash, constraint-set hash, server build (read from/api/server/status), git SHA, timestamps. - API key:
ANTHROPIC_API_KEYfrom environment. Source fromRepos/eurekaif not already in user's local env (per user note 2026-04-29). - Server dependency: harness requires the local generation server running on
:8000. Decoding-param overrides need child #7 (server-side API) for full sweep; until then, baseline-config-only runs are possible.
Testing¶
- Unit tests (
tests/): test_n_gram_metrics.py— distinct-n, max-3gram-rep over fixturestest_oov_gate.py— real-English-rate against fixture vocabularytest_rubric_validation.py— well-formed/malformedrubric.yamltest_config_validation.py— well-formed/malformedconfigs/*.yamltest_aggregation.py— mean / Δ / null-tolerance over fixture scored rows- Integration test (
tests/test_e2e.py): 1 config × 1 prompt × 1 constraint, end-to-end with stubbed Claude judge, real (small) GPT-2 PPL, real local OOV gate, real local server.
Run with: cd packages/generation && uv run python -m pytest research/2026-04-29-eval-harness-v1/tests/ -v
Out of scope (deferred)¶
- Full autoresearch agent loop (own child ticket later, builds on this substrate)
- Multi-judge ensembles
- Rubric drift / inter-annotator agreement studies
- Web UI for browsing runs (markdown reports + raw JSONL is the v1 surface)
- CI integration (too expensive to run per-PR; manual-run for now)
- Human comparison gold standard (worth doing later in the spike)
Build sequence (informs writing-plans)¶
- Scaffold
packages/generation/research/2026-04-29-eval-harness-v1/directory + initialnotebook.md+prompts.yaml+constraints.yaml+rubric.yaml+configs/baseline-v6.yaml - Implement
generate.pyagainst current server (no decoding-param override yet — uses defaults) - Implement
score.py: auto metrics first (local, fast feedback), then judge call with prompt caching - Implement
compare.py(single-run report first, then 2+ run comparison with deltas) - Tests (unit + e2e with stubbed judge)
- First baseline run on the current production system → first comparison report → first
experiments.jsonlentry - (After child #7 lands) re-run with decoding-param overrides; design first non-baseline config; first real A/B comparison
Open questions for spike (#2) to address¶
- Which auto-metrics best correlate with the judge dims? (informs whether we can skip the judge in some cases for cheaper iteration)
- Is the rubric well-calibrated against human judgment? (small human comparison study)
- Are there constrained-generation eval methodologies we should adopt (G-Eval, Prometheus, MT-Bench-style)?
- Are there better PPL scorers for short outputs than GPT-2 base?
- Are we using T5Gemma + LogitsProcessorList correctly? (impl audit)