C1 Track Evaluation + Integration — Handoff Spec (PHON-102)

Date: 2026-05-07

Status: Spec (handoff for next session)

Branch: off release/v5.2.0 (working branch named at writing-plans handoff)

Ticket: PHON-102 (to be filed; next free key)

Predecessors:
- PHON-66 (governed-generation rethink): chose C1 (combinatorial) as the path
- PHON-95 (MLM iterative editor + argstruc CFG enumerator, merged PR #88): the C1 v1 substrate
- PHON-96 / OQ1 (continuous PMI bias, merged PR #89): opt-in α·ppmi logit shift
- PHON-97 / OQ2 (adverbial CFG expansion, merged PR #93): 6-token NP V NP AdvP
- PHON-98 / OQ3 (subject-verb agreement, merged PR #90): agreement_correct post-processor
- PHON-99 / OQ4 (diversity metrics, merged PR #91): measurement infrastructure
- PHON-100 / OQ5 (PLL robustness probe, merged PR #92): N=10 acceptance gate
- PHON-101 / OQ6 (CDS fine-tune): not started; heaviest, separate ticket

§0 — Frame

phonolex_generators is on release/v5.2.0. Five OQ follow-ups landed alongside the PHON-95 base; all five preserve PHON-95 default behavior (each is opt-in via a separate function or non-default kwarg). 49 default-suite tests pass; 18 slow-marked tests gate model-loading work behind explicit -m slow.

What we don't know:

Compositional behavior. Each OQ was validated in isolation. We have not measured what enumerate_seeds_with_adverb + pmi_bias_fn(α=0.5) + agreement_correct + diversity_score produces compared to enumerate_seeds + edit alone. We don't know if the ensemble is strictly better, strictly worse, or behavior-mixed across verbs/specs.
v6 vs C1 head-to-head. The whole point of PHON-66's C1 selection was to replace v6's chatbot-shaped logit steering. We have NOT compared C1 outputs against v6 outputs on overlapping spec types.
Integration story. None of the merged PRs touched packages/generation/server/ or packages/web/. The production /api/generate-single endpoint still serves v6's T5Gemma stack. There is no path-to-production for the C1 track yet.
Long-tail verb coverage. Spot checks ran on cut, melt, chase, ate, chased. PHON-94's selectional table has thousands of verbs with count_v_r_* ≥ 50. We don't know how the editor performs across a representative sample.
PMI calibration. PHON-96 chose α=0.5 conservatively. The 5-seed acceptance run shows it shifts outputs on melt but not cut. We have no calibration sweep.

This ticket is the empirical investigation that makes the next architectural decision concrete.

§1 — Three questions to answer

Q1 — Compositional gain

Across a representative seed matrix, does composing the OQs (bias + morph + adverbial, scored with the diversity metrics) produce measurably better output than the PHON-95 baseline? What does "better" mean here, and which configurations dominate?

Q2 — v6 head-to-head

On a curated set of v6 spec types (e.g., exclude /ɹ/, bound aoa < 5, bound_boost concrete > 4.5), does the C1 stack produce outputs that a clinician-grade rater (or a frontier-LLM-judge proxy) prefers?

Q3 — Integration shape

What's the minimum viable wiring of phonolex_generators into the production endpoint? Three options to investigate:
- Replace v6 entirely at /api/generate-single. Big bang.
- Run alongside under a feature flag / new endpoint (/api/generate-c1). A/B testing path.
- Different entry points: v6 stays for free-form chat, C1 owns batch / catalog content (PHON-37).

§2 — Investigation matrix

Concrete experimental design. Sized for one productive session of work, not a full week.

§2.1 — Seed matrix

10 verbs × 2 specs × 1 band = 20 (verb, spec, band) cells. Per cell, enumerate up to 8 seeds via enumerate_seeds. Total ≤ 160 seeds.

Verb selection (10 transitive verbs spanning frequency and semantic class):
- High-frequency transitive: cut, chase, eat, see, make
- Mid-frequency transitive: melt, fill, kick, paint, wash

(Pick from PHON-94 selectional table where count_v_r_* ≥ 1000 for both nsubj and dobj. The full pool is much larger; v2 expansion is OQ work.)

Spec selection:
- spec1 (/k/-initial NOUN/VERB ≤ 2 syll, 1798 words): the canonical narrow spec
- spec6 (NOUN/VERB/ADJ ≤ 2 syll, iconicity ≥ 1.8, imageability ≥ 4.5, 649 words): a "psycholinguistic" spec

Band: fineweb_adult (no register variation in this matrix; OQ6 / PHON-101 territory).

§2.2 — Configuration matrix

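5 configurations per seed. The IDs C0–C4 are local to this matrix; configuration C1 (+ bias) is not the same thing as the C1 track itself.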

| ID | Configuration | Description |
|---|---|---|
| C0 | baseline | `edit(seed)` with PHON-95 defaults |
| C1 | + bias | `pmi_alpha=0.5` with `make_pmi_bias_fn(...)` |
| C2 | + morph | `agreement_correct(result.best, ...)` |
| C3 | + bias + morph | C1 then C2 |
| C4 | adverbial + bias + morph | `enumerate_seeds_with_adverb` then C3 |

5 configs × 160 seeds = 800 edit calls. Wall-clock estimate: 800 × ~0.4s/seed warm = ~5–6 minutes per full sweep.
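The sizing arithmetic above can be checked with a short enumeration. This is a minimal sketch, not the eventual run_matrix.py: the constant names are illustrative, and the 0.4 s figure is simply the warm per-edit estimate quoted above.

```python
from itertools import product

# Matrix axes as listed in §2.1 and §2.2 (constant names are illustrative).
VERBS = ["cut", "chase", "eat", "see", "make",
         "melt", "fill", "kick", "paint", "wash"]
SPECS = ["spec1", "spec6"]
BANDS = ["fineweb_adult"]
CONFIGS = ["C0", "C1", "C2", "C3", "C4"]

MAX_SEEDS_PER_CELL = 8
SECONDS_PER_EDIT_WARM = 0.4  # estimate from this section, not a measurement

# One cell per (verb, spec, band) combination.
cells = list(product(VERBS, SPECS, BANDS))

# Upper bounds: enumerate_seeds may return fewer than 8 seeds for a cell.
max_seeds = len(cells) * MAX_SEEDS_PER_CELL
max_edit_calls = max_seeds * len(CONFIGS)
est_minutes = max_edit_calls * SECONDS_PER_EDIT_WARM / 60

print(f"{len(cells)} cells, <= {max_seeds} seeds, "
      f"<= {max_edit_calls} edit calls, ~{est_minutes:.1f} min warm")
```

Because the seed count is an upper bound, the real sweep can only be shorter than this estimate, excluding model load time.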

§2.3 — Per-output metrics (cheap, automatic)

For every output, capture:
- coherence_seed and coherence_best
- unique_outputs count
- pairwise_normalized_edit_distance(result.unique_outputs)
- content_word_ttr(result.unique_outputs, content_indices)
- Spec compliance per content slot (boolean)
- Verb lock preserved (boolean)
- Wall-clock per edit() call
- For C2/C3/C4 (every configuration that applies morph): did agreement_correct change anything? (boolean)
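To make the two diversity metrics concrete, here is a self-contained sketch of what such measures typically compute. These are stand-ins written for illustration, not the PHON-99 implementations: the `_sketch` function names, whitespace tokenization, token-level (rather than character-level) distance, and normalization by the longer sequence are all assumptions.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def pairwise_normalized_edit_distance_sketch(outputs):
    """Mean token-level edit distance over all pairs, each scaled to [0, 1]."""
    token_lists = [s.split() for s in outputs]
    pairs = list(combinations(token_lists, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        longest = max(len(a), len(b))
        total += levenshtein(a, b) / longest if longest else 0.0
    return total / len(pairs)


def content_word_ttr_sketch(outputs, content_indices):
    """Type-token ratio restricted to the given content-slot positions."""
    tokens = []
    for s in outputs:
        words = s.split()
        tokens.extend(words[i].lower() for i in content_indices if i < len(words))
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Toy sentences, made up for illustration only.
demo = ["the cat cut the cake", "the cook cut the cake", "the cat cut the corn"]
print(pairwise_normalized_edit_distance_sketch(demo))
print(content_word_ttr_sketch(demo, content_indices=[1, 4]))
```

Both functions return 0.0 for degenerate input (fewer than two outputs, or no content tokens), which keeps per-cell aggregation from failing on cells with a single unique output.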

§2.4 — LLM-as-judge eval (per OQ5 v2 candidate)

For a stratified 50-output sample across configurations, send to a frontier LLM (Claude Sonnet or GPT-4) with the prompt:

"Rate this generated sentence on three axes from 1–5: (a) grammaticality, (b) semantic coherence, (c) age-appropriateness for SLP use. The sentence was produced by a constrained generation system; the verb is fixed. Return JSON only."

Aggregate ratings per configuration. Use as the human-proxy quality signal.

This is the only piece that requires an external API key. If no key is available in the next session, fall back to coherence + diversity metrics only and note the gap in findings.md.
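The sampling and response-handling side of the judge step can be sketched without any API call. Everything here is an assumption for illustration: the record field `config`, the equal-quota stratification scheme, and the three JSON key names (the prompt above asks for JSON but does not fix a schema).

```python
import json
import random
from collections import defaultdict

# Hypothetical key names for the three rating axes in the judge's JSON reply.
AXES = ("grammaticality", "coherence", "age_appropriateness")


def stratified_sample(records, key, total, seed=0):
    """Draw roughly total/k records from each of the k strata given by key."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    names = sorted(groups)
    base, extra = divmod(total, len(names))
    sample = []
    for idx, name in enumerate(names):
        quota = base + (1 if idx < extra else 0)
        pool = groups[name]
        # A stratum smaller than its quota contributes everything it has.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


def parse_rating(raw):
    """Return the three 1-5 ratings, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    ratings = {}
    for axis in AXES:
        value = data.get(axis)
        if type(value) is not int or not 1 <= value <= 5:
            return None
        ratings[axis] = value
    return ratings


# Toy records standing in for rows of the results JSONL.
records = [{"config": f"C{i % 5}", "id": i} for i in range(200)]
sample = stratified_sample(records, key="config", total=50)
print(len(sample))
```

Rejecting malformed replies with `None` rather than raising lets the scoring script count unusable ratings per configuration, which is itself worth reporting.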

§3 — v6 vs C1 head-to-head (Q2)

Smaller eval. Pick 3 v6 spec types from the live web app:
- exclude /ɹ/ (phonological ban)
- bound aoa < 5 (norm range)
- bound_boost concrete > 4.5 (norm range with boost)

For each: identify the equivalent C1 spec definition. Some translate cleanly (exclude /ɹ/ ↔ phoneme filter on phonemes_str); some don't (bound_boost is a soft constraint with no clean C1 equivalent — note this as an integration design constraint).
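As an illustration of the clean case, the exclude /ɹ/ translation amounts to a hard filter over each word's phoneme string. This sketch assumes `phonemes_str` is a space-separated IPA sequence, which the spec does not state; the helper name and the three lexicon rows are made up for the example.

```python
def excludes_phoneme(phonemes_str, banned, sep=" "):
    """True if the banned phoneme does not occur as a whole segment."""
    # Split first so that a ban on one symbol cannot match inside a
    # multi-character segment (assumes sep-delimited segments).
    return banned not in phonemes_str.split(sep)


# Toy rows for illustration only; not taken from the real lexicon.
lexicon = [
    {"word": "cat", "phonemes_str": "k æ t"},
    {"word": "car", "phonemes_str": "k ɑ ɹ"},
    {"word": "cake", "phonemes_str": "k eɪ k"},
]

allowed = [row["word"] for row in lexicon
           if excludes_phoneme(row["phonemes_str"], "ɹ")]
print(allowed)
```

A hard filter like this either admits or rejects a word, which is why bound_boost (a soft preference) has no direct counterpart in this form.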

Run both stacks on 5 prompts per spec type. Send outputs to the same LLM-judge.

Honest framing: v6 is a 7-step generation pipeline (T5Gemma + reranker + GUARD + best-of-3 + ...). C1 is a 3-step pipeline (CFG + editor + scorer). They're not comparable at the architecture level — they make different assumptions about what "generation" means. The eval is about user-facing output quality, not architectural symmetry.

§4 — Integration design (Q3)

This is architectural design work, not implementation. Output is a written recommendation, not code. Compare three options against three criteria:

| Option | Wall-clock per request | Surface area change | Risk to v6 users |
|---|---|---|---|
| Replace v6 entirely | ~10s (RoBERTa load + edit batch) | Single endpoint, breaking change | High: chat-shaped requests break |
| Parallel endpoint (`/api/generate-c1`) | Same | Two endpoints | None: v6 unchanged |
| Use-case split (chat=v6, batch=C1) | Same | Two endpoints + routing | Low: v6 keeps its niche |

The recommendation memo should also address:
- Where does C1 live operationally? RunPod serverless (like v6) or a different deploy shape (CPU-OK if MPS, or maybe local-only)?
- Does the FastAPI server need a new route handler, or does C1 just expose a Python callable for batch use?
- How does selectional.parquet get into the deploy environment? (It's already LFS-tracked; the v6 server doesn't load it. C1 needs it.)

§5 — Methodology decisions

§5.1 — Where does the lab notebook live?

packages/generation/research/2026-05-08-phon-102-c1-eval/ (date is next session's day; rename if needed). Mirror the shape of 2026-05-07-phon-95-editor/ and 2026-04-29-eval-harness-v1/:
- run_matrix.py: single command that produces the full results JSONL
- analyze.py: produces summary tables + plots (matplotlib, simple)
- findings.md: written analysis + decision recommendation
- outputs/results-{date}.jsonl: raw per-output records
- outputs/llm-judge-{date}.jsonl: frontier-LLM ratings (gitignore the API key, commit the ratings)

§5.2 — Reproducibility

Pin RNG seeds. Pin model versions: roberta-large is not version-stable across HuggingFace updates, so pin and record the date of the snapshot used. Capture the device (mps vs cpu vs cuda) with every run.
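One way to keep these three items together is a small run manifest written next to the results. This is a stdlib-only sketch: the function name and manifest fields are hypothetical, only Python's own RNG is seeded here, and a real run would also need to seed whatever numerical libraries the editor uses.

```python
import json
import platform
import random
import sys

KNOWN_DEVICES = {"mps", "cpu", "cuda"}


def build_run_manifest(rng_seed, model_name, model_snapshot, device):
    """Seed Python's RNG and return a record of what was pinned."""
    if device not in KNOWN_DEVICES:
        raise ValueError(f"unexpected device: {device!r}")
    random.seed(rng_seed)  # only the stdlib RNG; other libraries need their own seed
    return {
        "rng_seed": rng_seed,
        "model_name": model_name,
        "model_snapshot": model_snapshot,  # date of the snapshot used
        "device": device,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


# "YYYY-MM-DD" is a placeholder; fill in the actual snapshot date at run time.
manifest = build_run_manifest(
    rng_seed=1234,
    model_name="roberta-large",
    model_snapshot="YYYY-MM-DD",
    device="cpu",
)
print(json.dumps(manifest, indent=2))
```

Writing the manifest as the first line of the results JSONL, or as a sibling file, would make each results file self-describing.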

§5.3 — Decision-recommendation framing

Per feedback_no_substitute_in_frame: rank assumptions, point forward to candidate paths. Don't pre-commit to a winning option. Engineering names options; the user picks.

§6 — Acceptance criteria

Next session is "done" when:

run_matrix.py runs end-to-end and produces results JSONL covering all 5 configurations × 20 cells × up to 8 seeds per cell (≤ 160 seeds, ≤ 800 records).
analyze.py produces a summary table showing per-configuration mean coherence, unique-output count, edit-distance, content-TTR, and spec compliance rate.
LLM-judge ratings collected on a 50-output stratified sample (or documented absence with a stand-in metric).
v6-vs-C1 head-to-head ran on 3 spec types × 5 prompts each.
findings.md exists with: results summary, observed compositional gain (or lack thereof), v6 vs C1 quality delta, recommended integration option (one of the three from §4), and open questions surfaced from the data.
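The summary table that analyze.py is expected to produce is essentially a per-configuration mean over the results records. A minimal sketch follows, assuming one flat dict per output with a `config` field; the field names and the toy values are made up for illustration and are not results.

```python
from collections import defaultdict
from statistics import mean


def summarize(records, metrics):
    """Per-configuration mean of each metric; booleans average to a rate."""
    by_config = defaultdict(list)
    for record in records:
        by_config[record["config"]].append(record)
    return {
        config: {m: mean(float(row[m]) for row in rows) for m in metrics}
        for config, rows in sorted(by_config.items())
    }


def to_markdown(summary, metrics):
    """Render the summary as a markdown table, one row per configuration."""
    header = "| config | " + " | ".join(metrics) + " |"
    rule = "|---" * (len(metrics) + 1) + "|"
    rows = [
        f"| {config} | " + " | ".join(f"{values[m]:.3f}" for m in metrics) + " |"
        for config, values in summary.items()
    ]
    return "\n".join([header, rule, *rows])


# Toy records with made-up values, standing in for rows of the results JSONL.
records = [
    {"config": "C0", "coherence_best": -2.0, "unique_outputs": 4, "spec_compliant": True},
    {"config": "C0", "coherence_best": -3.0, "unique_outputs": 6, "spec_compliant": False},
    {"config": "C4", "coherence_best": -1.0, "unique_outputs": 8, "spec_compliant": True},
]
metrics = ["coherence_best", "unique_outputs", "spec_compliant"]
summary = summarize(records, metrics)
print(to_markdown(summary, metrics))
```

Casting each value to float means the boolean compliance flag averages directly into a compliance rate, so no separate code path is needed for it.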

Out of scope for next session:
- Implementing the integration. The output is a memo; implementation is a follow-up ticket.
- Calibration sweep across α values for PMI bias (separate empirical study).
- Long-tail verb expansion (200+ verbs).
- CDS register evaluation (PHON-101 territory).

§7 — Open questions

Coherence ceiling. PHON-100 showed joint-mask PLL has known artifacts (repeat-content, function-word degenerate). Does the LLM-judge agree with PLL ranking on the full matrix, or does it diverge? The answer determines whether OQ5 v2 (LLM-judge backup) is critical-path or nice-to-have.
Diversity-at-N=8 vs N=16. Should run_matrix.py also sweep n_trajectories or hold it fixed at 8? Holding it fixed keeps the matrix small; sweeping adds another dimension. Recommendation: hold fixed at 8 for v1 of this investigation; add n_trajectories sweep as a follow-up if findings warrant.
What "compositional gain" looks like quantitatively. Suggested gate: ≥ 10% improvement in mean LLM-judge rating from C0 → C4 across the matrix would justify shipping all OQs as the production path. Smaller gains suggest individual OQs ship as opt-in.
Per-slot constraints (PHON-97 OQ2 v2). Do the findings show the editor's slot-blindness is a real problem? If C4 (adverbial) outputs are notably better than C0–C3, the locked-adverb workaround is sufficient; if C4 is mixed, per-slot trie views become higher priority.
What does "replace v6" actually cost? Even if C1 outputs are better, v6 has a year of integration work behind it (frontend tools, GUARD, telemetry). The integration recommendation needs to weigh that.
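The suggested compositional-gain gate can be written down as a one-line check. This sketch reads "≥ 10% improvement" as a relative change in the mean rating, which is an interpretation rather than something the text above pins down; the function names are illustrative.

```python
def relative_gain(baseline_mean, candidate_mean):
    """Relative change from the baseline mean rating to the candidate's."""
    if baseline_mean <= 0:
        raise ValueError("baseline mean must be positive (ratings are 1-5)")
    return (candidate_mean - baseline_mean) / baseline_mean


def passes_gate(baseline_mean, candidate_mean, threshold=0.10):
    """True if the candidate beats the baseline by at least the threshold."""
    return relative_gain(baseline_mean, candidate_mean) >= threshold


# Hypothetical mean ratings, purely to exercise the functions.
print(passes_gate(3.0, 3.6))
print(passes_gate(3.0, 3.1))
```

Values sitting exactly on the threshold are sensitive to floating-point rounding, so a gain right at 10% is better reported as borderline than settled by this comparison alone.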

§8 — Plan handoff

Successor plan should cover:

Bootstrap research dir — packages/generation/research/2026-05-08-phon-102-c1-eval/ with run_matrix.py, analyze.py, findings.md skeleton.
Implement run_matrix.py — drives the 5-config × 20-cell × 8-seed matrix, captures all metrics from §2.3.
Wire LLM-judge — score_with_llm.py that reads results JSONL, samples N=50 stratified, calls Claude/GPT-4, writes ratings JSONL.
Run v6-vs-C1 head-to-head — pick 3 v6 specs, replicate in C1, run both, capture for LLM-judge.
Implement analyze.py — produces summary tables (markdown + CSV) for the findings memo.
Write findings.md — results + decision recommendation per §4.
File follow-up tickets for the work surfaced (integration implementation, calibration sweeps, etc.).

Wall-clock budget: one full session (4–6 hours). The matrix run itself takes on the order of 5–10 minutes (§2.2 estimates ~5–6 minutes per warm sweep); the LLM-judge ratings + findings analysis is the bulk.

Notes on what we're NOT doing in this spike

Not training anything. No CDS fine-tune (that's PHON-101). No new MLM. RoBERTa-large stays.
Not changing the editor's algorithm. No per-slot trie views, no diverse sampling, no LLM-judge integration into the editor's loop. All such changes are tracked under existing OQ tickets and stay deferred.
Not shipping integration code. The output of this work is a decision-recommendation memo, not a wired-up endpoint.

The goal is information, not features.