Generation Quality Eval Harness: Design

Date: 2026-04-29
Status: Spec, pending user review
Branch: TBD (feature branch off develop, created at writing-plans handoff)
Tickets: Epic + 7 children; Jira keys reserved at writing-plans handoff (per feedback_verify_jira_state.md: JQL the next free PHON-XX rather than guessing)

Problem

Friend testing of the v6 governed-generation system surfaced quality regressions that are real, multi-causal, and currently unmeasurable:

Heavy phoneme exclusion (e.g., sibilant cluster) collapses output into code-switched degeneration: German, Italian, Latin, Hebrew, Arabic tokens; severe repetition (mek mek mek, Alen Alen Alen); "compliant" non-English pseudo-words (mek, aan, alen) that the dictionary doesn't recognize.
Even successful runs aren't grammatical: "lived an old little" (no head noun), "On this new acquaintance was a smiling smiley face!" (mangled syntax).
Output runs to ~max_new_tokens rather than stopping at a natural conclusion.
Compliance-pass conflates "no banned phonemes detected" with "is a real English word." A token sequence the trie didn't trip can fail dictionary lookup.

The existing constraint_grid_sweep.py captures pipeline metrics (compliance, violations, survival, retries) but no quality dimension at all — no PPL, no judge scoring, no OOV detection, no repetition rate, no code-switching detection. We can't tell whether a candidate fix actually helps.

Goal: build a fresh eval harness that systematically measures generation quality across (config × prompt × constraints), supports A/B comparison between named experiment configs, and is structured so an autoresearch loop can drive it as easily as a human can.

Scope

This spec covers child #1 of the Generation Quality Overhaul epic.

Epic decomposition (parent + 7 children):

1. Eval harness (this spec)
2. Research spike: landscape (constrained-generation SOTA, eval-methodology SOTA) + implementation audit (are we using our stack correctly?)
3. Quality / grammaticality / coherence under constraint
4. Length / stop control
5. Compliance trustworthiness (OOV gating)
6. UX polish (punch list, no separate brainstorm)
7. Server-side decoding-param override API (small; blocks #1's actual usage but not its design/scaffolding)

Each remaining child gets its own brainstorm session; #1 unlocks measurement for all of them.

Methodology principle

Epistemic scavenging. This is a fresh design from first principles, not an extension of constraint_grid_sweep.py or analyze_sweep.py. Those existing scripts stay as historical artifacts: references to cross-check against when looking back across iterations, not foundations to build on. Rediscovering shapes that exist in the prior art is fine and validates the underlying truth; landing somewhere different is also fine and escapes the local minimum.

Future eval iterations will likewise spawn new artifacts under packages/generation/research/<date>-<topic>/, never in-place edits to this one.

This applies to methodology (harness shape, rubric, scoring protocol). Off-the-shelf models (Claude as judge, T5Gemma as PPL scorer) are tools, not methodology.

Architecture

Three deterministic CLI stages plus three artifact types (configs, rubric, prompt/constraint pools).

packages/generation/research/2026-04-29-eval-harness-v1/
├── notebook.md                    # lab notebook: hypotheses, findings, run index
├── rubric.yaml                    # judge dims + auto metrics + scales (versioned)
├── prompts.yaml                   # prompt set we sweep against
├── constraints.yaml               # constraint pool we sweep over
├── configs/
│   ├── baseline-v6.yaml           # production decoding + governor settings, snapshot
│   └── ...                        # other named presets as we hypothesize
├── generate.py                    # stage 1: sweep runner
├── score.py                       # stage 2: scoring runner
├── compare.py                     # stage 3: comparison/aggregation
├── experiments.jsonl              # append-only log: hypothesis → run → verdict
├── runs/
│   └── <run_id>/                  # run_id = <config_name>-<timestamp>
│       ├── meta.json              # config snapshot, pool hashes, server version
│       ├── generations.jsonl      # one row per (combo × prompt) generation
│       └── scored.jsonl           # generations.jsonl + auto metrics + judge scores
└── reports/
    └── <comparison-name>.md       # comparison(s) between runs
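
The tree above has meta.json recording pool hashes. A minimal sketch of one way to compute such a hash, assuming the pools are already parsed into plain Python objects and that a SHA-256 over canonical JSON is acceptable. The helper name `pool_hash` and the hashing scheme are illustrative assumptions; the spec does not fix either.

```python
import hashlib
import json


def pool_hash(pool) -> str:
    """Return a SHA-256 hex digest of a parsed pool (hypothetical helper).

    Serializing with sorted keys and fixed separators makes the digest
    independent of mapping key order and of YAML whitespace or comments.
    List order still matters, so reordering prompts changes the hash.
    """
    canonical = json.dumps(
        pool, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


prompts = [{"id": "dog_park", "genre": "narrative"}]
reordered = [{"genre": "narrative", "id": "dog_park"}]
assert pool_hash(prompts) == pool_hash(reordered)
```

Hashing parsed content rather than raw file bytes means a comment-only edit to prompts.yaml does not invalidate comparisons between runs; hashing raw bytes would be the stricter alternative.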

Stage 1: Generate

generate.py <config_name>

Iterates (combo × prompt), POSTs each to /api/generate-single with the config's decoding-param overrides, captures text + SSE pipeline metrics → runs/<run_id>/generations.jsonl.

Resume-safe via row-key checking
Reads prompts.yaml, constraints.yaml, configs/<config_name>.yaml
Writes meta.json snapshot at start
Server must be running locally on :8000; harness checks /api/server/status first
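
The resume behaviour described above could be sketched as follows. This is a minimal illustration that assumes rows are keyed by run_id|combo_id|prompt_id (the same key the scoring stage uses); the function names are hypothetical and the actual HTTP call to the server is left out.

```python
import itertools
import json
from pathlib import Path


def row_key(run_id: str, combo_id: str, prompt_id: str) -> str:
    """Build the run_id|combo_id|prompt_id key used for idempotency checks."""
    return f"{run_id}|{combo_id}|{prompt_id}"


def load_done_keys(generations_path: Path) -> set[str]:
    """Collect keys of rows already written; a missing file means a fresh run."""
    if not generations_path.exists():
        return set()
    done = set()
    with generations_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the log
            row = json.loads(line)
            done.add(row_key(row["run_id"], row["combo_id"], row["prompt_id"]))
    return done


def pending_cells(run_id, combo_ids, prompt_ids, done):
    """Yield (combo_id, prompt_id) pairs that still need a generation call."""
    for combo_id, prompt_id in itertools.product(combo_ids, prompt_ids):
        if row_key(run_id, combo_id, prompt_id) not in done:
            yield combo_id, prompt_id
```

A sweep runner would call `load_done_keys` once at startup and then iterate `pending_cells`, appending one row to generations.jsonl per completed call so that an interrupted run loses at most the cell in flight.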

Stage 2: Score

score.py <run_id>

For each generation row, computes:

- Auto metrics (local Python, no network):
  - real_english_rate: fraction of [a-zA-Z]+ tokens present in the D1 words table (207,665 entries, verified 2026-04-29)
  - distinct_3, distinct_5: distinct-n-gram ratios (1.0 = no repeats)
  - max_3gram_rep: max repetition rate of any 3-gram in the output
  - ppl_t5gemma_2b_cond: conditional perplexity of generated text given prompt, scored by held-out google/t5gemma-2b-2b-prefixlm-it. Same Gemma family as the production 9B-2B generator (matching tokenizer + architecture) so cross-family calibration noise doesn't pollute the signal. The encoder gets the prompt and the decoder scores the generated text; this answers "did the held-out LM expect this output given the prompt?" rather than "is this English-shaped?"
- Judge dimensions (Claude API, single call per row, structured JSON output):
  - grammaticality, coherence, prompt_following, natural_ending, stays_in_english (1–5 scales)
  - System prompt = rubric definitions + scale anchors + few-shot examples → cached via Anthropic prompt caching
  - User content = prompt + generated text → cache miss per row (unavoidable)
  - Default model: claude-haiku-4-5 (fast/cheap iterations); claude-sonnet-4-6 for definitive runs (config field on rubric)
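
For reference, the conditional perplexity named in ppl_t5gemma_2b_cond is conventionally defined as below. Here x is the prompt fed to the encoder, y_1..y_N are the tokens of the generated text under the scorer's tokenizer, and p_θ is the scorer's next-token distribution. Normalizing by the generated-token count N is the standard form and is assumed here; the spec does not state the normalization explicitly.

```latex
\mathrm{PPL}(y \mid x) \;=\; \exp\!\left( -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta\!\left( y_t \mid y_{<t},\, x \right) \right)
```

Lower values mean the held-out model found the output more expected given the prompt.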

Output → runs/<run_id>/scored.jsonl. Idempotent: re-runs skip already-scored rows by run_id|combo_id|prompt_id key.
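
The local auto metrics could be sketched as below. This is a minimal illustration under stated assumptions: whitespace tokenization for the n-gram metrics, max_3gram_rep read as "share of all 3-grams taken by the most frequent one", and a lowercased lookup for real_english_rate. None of these choices is fixed by the spec, and the in-memory `vocabulary` set stands in for the D1 words table.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-zA-Z]+")  # token pattern named in the spec


def _ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(tokens, n):
    """Unique n-grams / total n-grams; 1.0 means no repeats, None if too short."""
    grams = _ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else None


def max_ngram_rep(tokens, n=3):
    """Share of all n-grams taken by the most frequent one (assumed definition)."""
    grams = _ngrams(tokens, n)
    if not grams:
        return None
    return Counter(grams).most_common(1)[0][1] / len(grams)


def real_english_rate(text, vocabulary):
    """Fraction of [a-zA-Z]+ tokens found in `vocabulary` (lowercased lookup assumed)."""
    words = WORD_RE.findall(text)
    if not words:
        return None
    return sum(w.lower() in vocabulary for w in words) / len(words)


tokens = "mek mek mek mek the dog".split()
print(distinct_n(tokens, 3), max_ngram_rep(tokens, 3))  # 0.75 0.5
```

Returning None for outputs too short to form an n-gram keeps these rows out of the means instead of silently scoring them as perfect.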

Stage 3: Compare

compare.py <run_id> [<run_id> ...]

Aggregates one or more scored runs. Emits markdown report → reports/<comparison-name>.md.

Report sections:

- Header: configs being compared, run paths, timestamps
- Summary table: per-dimension means + Δ between configs
- Worst-example list per dim (top 3 lowest-scoring cells per dim, per config, with generated text)
- Best-example list (top 3 highest-scoring constrained cells)
- Auto-metric vs judge-dim correlation (sanity check on the judge: if real_english_rate doesn't correlate with stays_in_english, the judge is unreliable)
- Suggested next experiment (manual fill or, in autoresearch mode, agent-generated)
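
The summary table and the correlation sanity check could be computed along these lines. This is a sketch only: the function names are hypothetical, rows are assumed to follow the scored.jsonl shape defined later in this spec, and Pearson correlation is one reasonable choice that the spec does not mandate.

```python
import math


def dim_means(scored_rows, dims):
    """Mean judge score per dimension, skipping rows whose score is null."""
    means = {}
    for dim in dims:
        values = [
            row["judge"]["dim_scores"].get(dim)
            for row in scored_rows
            if row.get("judge") and row["judge"].get("dim_scores")
        ]
        values = [v for v in values if v is not None]
        means[dim] = sum(values) / len(values) if values else None
    return means


def deltas(baseline_means, candidate_means):
    """Candidate minus baseline per dimension; None where either side is missing."""
    out = {}
    for dim, base in baseline_means.items():
        cand = candidate_means.get(dim)
        out[dim] = None if base is None or cand is None else cand - base
    return out


def pearson(pairs):
    """Pearson r over (x, y) pairs; pairs containing a null are dropped."""
    clean = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(clean)
    if n < 2:
        return None
    mx = sum(x for x, _ in clean) / n
    my = sum(y for _, y in clean) / n
    sxy = sum((x - mx) * (y - my) for x, y in clean)
    sxx = sum((x - mx) ** 2 for x, _ in clean)
    syy = sum((y - my) ** 2 for _, y in clean)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    return sxy / math.sqrt(sxx * syy)
```

For the sanity check, `pairs` would be built from each row's `auto_metrics["real_english_rate"]` and `judge["dim_scores"]["stays_in_english"]`. With only 40 rows per sweep the coefficient is noisy, so it is better read as a warning light than as a precise estimate.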

Data contracts

`rubric.yaml` (v1 schema)

version: 1
judge:
  model: claude-haiku-4-5      # default; sonnet for definitive runs
  cache_system_prompt: true
dimensions:                     # LLM-judge scored, 1-5 scale
  - id: grammaticality
    description: Is the output well-formed English syntax?
    anchors:
      1: "Severe grammar errors throughout (broken syntax, missing core words)"
      3: "Some grammar issues but generally readable"
      5: "Fully grammatical, no syntactic errors"
  - id: coherence
    description: Does the output hold together as discourse and stay on topic?
    anchors:
      1: "Incoherent, topic salad, contradictory"
      3: "Mostly coherent with some drift or non-sequiturs"
      5: "Tight coherence end to end"
  - id: prompt_following
    description: Does the output address what the prompt asked for?
    anchors:
      1: "Ignores the prompt entirely"
      3: "Partially addresses it"
      5: "Fully addresses the prompt"
  - id: natural_ending
    description: Does it conclude on its own, or pad to budget?
    anchors:
      1: "Truncated mid-thought or padded with filler"
      3: "Acceptable ending but somewhat abrupt or padded"
      5: "Concludes naturally"
  - id: stays_in_english
    description: Does it avoid code-switching into other languages?
    anchors:
      1: "Heavy code-switching, mostly non-English tokens"
      3: "Some non-English tokens"
      5: "Fully English"
auto_metrics:
  - { id: real_english_rate, source: d1_words }
  - { id: distinct_3 }
  - { id: distinct_5 }
  - { id: max_3gram_rep }
  - { id: ppl_t5gemma_2b_cond, model: google/t5gemma-2b-2b-prefixlm-it }
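
A validation pass over the parsed rubric might look like the sketch below. It assumes the YAML has already been loaded into a dict and checks only rules that can be read off the v1 example above (version 1, a judge model, unique dimension ids, anchors at least at 1 and 5 and within 1-5). Which rules the harness actually enforces is an assumption, and `validate_rubric` is a hypothetical name.

```python
def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems found in a parsed rubric; empty means it passed."""
    problems = []
    if rubric.get("version") != 1:
        problems.append("version must be 1")
    if not (rubric.get("judge") or {}).get("model"):
        problems.append("judge.model is required")

    seen = set()
    for dim in rubric.get("dimensions") or []:
        dim_id = dim.get("id")
        if not dim_id:
            problems.append("dimension without id")
            continue
        if dim_id in seen:
            problems.append(f"duplicate dimension id: {dim_id}")
        seen.add(dim_id)
        anchors = dim.get("anchors") or {}
        # The v1 example anchors each scale at 1, 3 and 5; both ends are required here.
        if not {1, 5} <= set(anchors):
            problems.append(f"{dim_id}: anchors for 1 and 5 are required")
        if any(key not in range(1, 6) for key in anchors):
            problems.append(f"{dim_id}: anchor keys must be within 1-5")
    if not seen:
        problems.append("at least one dimension is required")

    for metric in rubric.get("auto_metrics") or []:
        if not metric.get("id"):
            problems.append("auto metric without id")
    return problems
```

Returning a list rather than raising on the first problem lets a malformed rubric be fixed in one pass.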

`configs/<name>.yaml` (one experiment-config preset)

name: baseline-v6
description: Production decoding + governor settings, snapshot 2026-04-29
parent_config: null              # for diff tracking across iterations
decoding:
  temperature: 0.8
  top_p: 0.92
  top_k: 80
  repetition_penalty: 1.2
  max_new_tokens: 128
  num_drafts: 4
governor:
  use_punctuation_boost: true
  use_trie_steering: true
  use_lookahead: true

The decoding block gets passed through to the server-side override API (child #7). The governor block lists feature toggles that the server may or may not currently expose; until #7 lands, these fields are documentation-only — the server runs with its compiled-in defaults and the harness sweeps only over the (prompt × constraints) axis. After #7, both blocks are honored. Configs written today are forward-compatible with that handoff.
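
Since parent_config exists for diff tracking, a small helper could report what a child config changes relative to its parent. A minimal sketch, assuming both configs are already parsed into dicts; `flatten` and `config_diff` are hypothetical names and the 1.05 value is purely illustrative.

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Flatten nested mappings into dotted keys, e.g. decoding.top_p."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def config_diff(parent: dict, child: dict,
                ignore=("name", "description", "parent_config")) -> dict:
    """Map each changed dotted key to a (parent_value, child_value) pair."""
    a, b = flatten(parent), flatten(child)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if key not in ignore and a.get(key) != b.get(key)
    }


baseline = {"name": "baseline-v6", "decoding": {"temperature": 0.8, "repetition_penalty": 1.2}}
candidate = {"name": "lower-rep-penalty", "parent_config": "baseline-v6",
             "decoding": {"temperature": 0.8, "repetition_penalty": 1.05}}
print(config_diff(baseline, candidate))
# {'decoding.repetition_penalty': (1.2, 1.05)}
```

Embedding this diff in the comparison report header would make it obvious which single knob an A/B run actually varied.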

`prompts.yaml`

- id: dog_park
  text: "Write a short paragraph about a dog playing in the park."
  genre: narrative
- id: asteroid_kids
  text: "Tell a kids' story about an asteroid in space."
  genre: kids_story
- id: king_edicts
  text: "Write a proclamation by a king announcing three royal edicts."
  genre: declamation
- id: pb_sandwich
  text: "Describe how to make a peanut butter sandwich in three steps."
  genre: procedural
- id: tying_shoes
  text: "Give a brief instruction for tying shoes."
  genre: instructional

The prompts span five genres: narrative, kids' story, declamation, procedural, and instructional. Each genre ends naturally in a different way, which gives the natural_ending dim a signal to detect.

`constraints.yaml`

- id: exc_r
  type: exclude
  phonemes: ["ɹ", "ɝ", "ɚ"]
- id: exc_szshzh
  type: exclude
  phonemes: ["s", "z", "ʃ", "ʒ"]
- id: aoa_le5
  type: bound
  norm: aoa_kuperman
  max: 5.0
- id: aoa_le7
  type: bound
  norm: aoa_kuperman
  max: 7.0
- id: mixed_aoa7_excr
  combine:
    - { type: bound, norm: aoa_kuperman, max: 7.0 }
    - { type: exclude, phonemes: ["ɹ", "ɝ", "ɚ"] }
- id: mixed_aoa5_excsz
  combine:
    - { type: bound, norm: aoa_kuperman, max: 5.0 }
    - { type: exclude, phonemes: ["s", "z"] }
- id: inc_k
  type: include
  phonemes: ["k"]
  target_rate: 0.20
- id: con_sz
  type: contrastive
  pair_type: minpair
  phoneme1: s
  phoneme2: z
  position: any

8 cells × 5 prompts = 40 SSE calls per config (each call produces num_drafts internal drafts, returns the best). Judge cost back-of-envelope: 40 calls × ~$0.003/call at Haiku 4.5 rates ≈ $0.12 per sweep. Affordable to iterate.
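
The back-of-envelope figure above can be reproduced with a few lines, which also makes it easy to re-estimate when the pools grow. The per-call cost is the approximate figure quoted above, not a measured rate, and `sweep_cost` is a hypothetical helper.

```python
def sweep_cost(n_cells: int, n_prompts: int, cost_per_call_usd: float):
    """Return (number of judge calls, estimated total cost in USD)."""
    calls = n_cells * n_prompts
    return calls, round(calls * cost_per_call_usd, 4)


calls, cost = sweep_cost(n_cells=8, n_prompts=5, cost_per_call_usd=0.003)
print(calls, cost)  # 40 0.12
```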

`generations.jsonl` row

{
  "run_id": "baseline-v6-20260429T120000",
  "config_name": "baseline-v6",
  "prompt_id": "asteroid_kids",
  "prompt": "Tell a kids' story about an asteroid in space.",
  "combo_id": "exc_szshzh",
  "constraints": [...],
  "text": "Once upon a time...",
  "drafts": [...],
  "alternatives": [...],
  "pipeline_metrics": {
    "compliant": true,
    "violation_count": 0,
    "violation_words": [],
    "survival_ratio": 0.48,
    "retry_count": 0,
    "hit_escalation": false,
    "gen_time_ms": 8243
  },
  "ts": "2026-04-29T12:34:56Z"
}

`scored.jsonl` row

Each row is the generation row augmented with auto_metrics and judge fields:

{
  ...generation row fields...,
  "auto_metrics": {
    "real_english_rate": 0.78,
    "distinct_3": 0.62,
    "distinct_5": 0.81,
    "max_3gram_rep": 0.04,
    "ppl_t5gemma_2b_cond": 142.3
  },
  "judge": {
    "model": "claude-haiku-4-5",
    "dim_scores": { "grammaticality": 3, "coherence": 4, "prompt_following": 4, "natural_ending": 2, "stays_in_english": 5 },
    "dim_rationales": { "grammaticality": "...", ... },
    "judge_ms": 1842,
    "judge_cost_usd": 0.0048,
    "input_tokens": 1234,
    "output_tokens": 187
  }
}

`experiments.jsonl` row (shared log; both human and agent append)

{
  "ts": "2026-04-29T12:34:56Z",
  "actor": "human",
  "hypothesis": "Lower repetition_penalty improves naturalness on heavy-constraint runs",
  "config_path": "configs/lower-rep-penalty.yaml",
  "run_id": "lower-rep-penalty-20260429T123456",
  "scored_path": "runs/lower-rep-penalty-20260429T123456/scored.jsonl",
  "comparison_against": "baseline-v6-20260429T100000",
  "verdict": "rejected",
  "verdict_evidence": "grammaticality unchanged, real_english_rate dropped 8pp, ppl +26"
}
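
Appending to the shared log could be as simple as the sketch below. The required-field list is an assumption drawn from the example row above, and the helper names are hypothetical; the only property taken from the spec is that the file is append-only.

```python
import json
from pathlib import Path

# Assumed minimum; taken from the example row, not from a stated schema.
REQUIRED_FIELDS = ("ts", "actor", "hypothesis", "config_path", "run_id", "verdict")


def append_experiment(log_path: Path, row: dict) -> None:
    """Append one experiment row; existing lines are never rewritten."""
    missing = [field for field in REQUIRED_FIELDS if field not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_experiments(log_path: Path) -> list[dict]:
    """Read all rows in order; a missing log is treated as empty."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Opening in append mode and writing one complete line per call keeps human and agent writers from clobbering each other's earlier entries, though it does not by itself guard against two processes writing at the same instant.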

Autoresearch readiness

The harness is designed so an agent can drive it with the same primitives a human uses:

- Configs are first-class YAML files in configs/: agent writes, agent runs
- All three stages are deterministic CLI scripts with file IO; no in-memory state to recover
- experiments.jsonl is append-only; both human and agent contribute rows
- notebook.md is a free-form lab journal that both contribute to
- The autoresearch loop itself becomes a thin orchestrator (out of scope for this child ticket; ties into the existing "Adversarial Autoresearch" memory entry as a future child): read latest experiments.jsonl → propose new config → run generate→score→compare → append verdict → repeat

No additional primitives need to exist for autoresearch. v1 here is the substrate.

Operational concerns

Idempotency: score.py skips already-scored rows by (run_id, combo_id, prompt_id). compare.py is read-only. generate.py resumes from existing generations.jsonl if --resume.
Judge failure: exponential backoff retry (3x); on final fail, dim values = null, error logged to row. Comparison aggregation tolerates nulls.
PPL scorer failure: mark null, log; don't crash the run.
Cost tracking: judge_cost_usd per row; compare.py reports total per run.
Reproducibility: meta.json per run captures: config snapshot, prompt-set hash, constraint-set hash, server build (read from /api/server/status), git SHA, timestamps.
API key: ANTHROPIC_API_KEY from environment. Source from Repos/eureka if not already in user's local env (per user note 2026-04-29).
Server dependency: harness requires the local generation server running on :8000. Decoding-param overrides need child #7 (server-side API) for full sweep; until then, baseline-config-only runs are possible.
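
The judge-failure policy above (retry with exponential backoff, then record nulls) could be sketched like this. It reads "3x" as three retries after the first attempt, which is an assumption, as are the helper names, the 1-second base delay, and the `error` field name. The judge call itself is passed in as a plain callable so the sketch stays independent of any API client.

```python
import time


def call_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt, then retry.

    Returns (result, None) on success, or (None, error_message) once the
    initial attempt plus `retries` retries have all failed.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fn(), None
        except Exception as exc:  # a real harness would catch narrower API errors
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)
    return None, last_error


def null_judge(dims, error):
    """Judge block for a row whose scoring failed: all dims null, error kept."""
    return {"dim_scores": {dim: None for dim in dims}, "error": error}
```

Injecting `sleep` keeps unit tests fast: a test can pass `delays.append` and assert on the recorded waits instead of actually sleeping.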

Testing

Unit tests (tests/):
test_n_gram_metrics.py — distinct-n, max-3gram-rep over fixtures
test_oov_gate.py — real-English-rate against fixture vocabulary
test_rubric_validation.py — well-formed/malformed rubric.yaml
test_config_validation.py — well-formed/malformed configs/*.yaml
test_aggregation.py — mean / Δ / null-tolerance over fixture scored rows
Integration test (tests/test_e2e.py): 1 config × 1 prompt × 1 constraint, end-to-end with stubbed Claude judge, real (small) GPT-2 PPL, real local OOV gate, real local server.

Run with: cd packages/generation && uv run python -m pytest research/2026-04-29-eval-harness-v1/tests/ -v

Out of scope (deferred)

Full autoresearch agent loop (own child ticket later, builds on this substrate)
Multi-judge ensembles
Rubric drift / inter-annotator agreement studies
Web UI for browsing runs (markdown reports + raw JSONL is the v1 surface)
CI integration (too expensive to run per-PR; manual-run for now)
Human comparison gold standard (worth doing later in the spike)

Build sequence (informs writing-plans)

1. Scaffold packages/generation/research/2026-04-29-eval-harness-v1/ directory + initial notebook.md + prompts.yaml + constraints.yaml + rubric.yaml + configs/baseline-v6.yaml
2. Implement generate.py against current server (no decoding-param override yet; uses defaults)
3. Implement score.py: auto metrics first (local, fast feedback), then judge call with prompt caching
4. Implement compare.py (single-run report first, then 2+ run comparison with deltas)
5. Tests (unit + e2e with stubbed judge)
6. First baseline run on the current production system → first comparison report → first experiments.jsonl entry
7. (After child #7 lands) re-run with decoding-param overrides; design first non-baseline config; first real A/B comparison

Open questions for spike (#2) to address

Which auto-metrics best correlate with the judge dims? (informs whether we can skip the judge in some cases for cheaper iteration)
Is the rubric well-calibrated against human judgment? (small human comparison study)
Are there constrained-generation eval methodologies we should adopt (G-Eval, Prometheus, MT-Bench-style)?
Are there better PPL scorers for short outputs than the T5Gemma 2B scorer specified here?
Are we using T5Gemma + LogitsProcessorList correctly? (impl audit)