Sentence Generation Paradigms Spike — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Implement and run a 3-paradigm research spike comparing data-only sentence generators (CSP, dep-tree templating, lexical graph walks) against PHON-102's C1+MLM baseline, on the 5 PHON-95 acceptance probes, and produce a side-by-side decision memo.

Architecture: Single research dir packages/generation/research/2026-05-07-sentence-generation-paradigms/ with three independent paradigm scripts that all share one surface realizer (shared.py). Each paradigm reads data/runtime/{words,selectional}.parquet directly, produces a JSON output file, and a compare.py script assembles the cross-paradigm + PHON-102-baseline markdown table at the end. The notebook.md is the human-readable lab notebook with the recommendation memo.

Tech Stack: Python 3.10, polars (parquet I/O), spaCy en_core_web_sm (paradigm 1 parsing), lemminflect (verb agreement, surface realizer), phonolex_data.runtime.WordStore (lexicon access), no PyTorch / no LM. CPU-only.

Spec: docs/superpowers/specs/2026-05-07-sentence-generation-paradigms-spike.md
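
The data flow in the Architecture paragraph (per-paradigm JSON in, one markdown table out) can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned compare.py: `render_table` is a hypothetical helper, the payloads are toy data, and only the `top_k` / `sentence` keys mirror the JSON shape the paradigm scripts in this plan write.

```python
"""Sketch: assemble a side-by-side markdown table from per-paradigm outputs.

Assumption: each paradigm's JSON maps "<verb>_<spec_id>" to a payload with a
"top_k" list whose items carry a "sentence" key. render_table is a
hypothetical helper, not the planned compare.py.
"""


def render_table(outputs: dict[str, dict]) -> str:
    """One row per probe key, one column per paradigm (top-1 sentence)."""
    paradigms = sorted(outputs)
    probe_keys = sorted({k for payload in outputs.values() for k in payload})
    lines = [
        "| probe | " + " | ".join(paradigms) + " |",
        "|" + " --- |" * (len(paradigms) + 1),
    ]
    for key in probe_keys:
        cells = []
        for name in paradigms:
            top = outputs[name].get(key, {}).get("top_k", [])
            cells.append(top[0]["sentence"] if top else "(none)")
        lines.append(f"| {key} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Toy in-memory payloads standing in for outputs/{p1,p3}.json.
toy = {
    "p1": {"eat_spec1": {"top_k": [{"sentence": "The dog eats the bone."}]}},
    "p3": {"eat_spec1": {"top_k": [{"sentence": "The cat eats the fish quickly."}]}},
}
table = render_table(toy)
print(table)
```

In a real run the toy dict would be replaced by `json.loads(path.read_text())` for each output file, plus whatever columns the PHON-102 baseline needs.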

File structure

All paths relative to repo root /Users/jneumann/Repos/PhonoLex/.

| File | Responsibility |
| --- | --- |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/README.md` | Brief usage + file index |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/shared.py` | `PROBES` constant + `surface_realize(nsubj, verb_lemma, dobj, manner_adv=None)` utility + `lemma_compliance(seed_lemma, output_word)` helper |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_3_csp.py` | CSP solver: brute-force enumerates `(nsubj, dobj, manner_adv)` triples, scores by sum-PPMI, returns top-K |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_1_dep_trees.py` | Parse `source_corpus.txt` with spaCy, extract POS+DEP skeletons, slot-fill content nodes for each probe |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_2_graph_walk.py` | Build verb→filler PMI lookup + word→word Qwensim lookup, walk from verb to (nsubj, dobj) with cross-slot Qwensim conditioning |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/compare.py` | Assemble side-by-side markdown table from `outputs/{p1,p2,p3}.json` + PHON-102 results JSONL |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/notebook.md` | Lab notebook: framing, per-paradigm observations, side-by-side excerpt, recommendation memo |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/source_corpus.txt` | 60-100 public-domain English sentences (Aesop's fables + hand-typed SLP-style examples) |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p1.json` | Paradigm 1 outputs (generated at run-time) |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p2.json` | Paradigm 2 outputs (generated at run-time) |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p3.json` | Paradigm 3 outputs (generated at run-time) |
| `packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/comparison.md` | Side-by-side table (generated at run-time) |

Test policy: Per feedback_research_workflow, research spikes use inline __main__ smoke assertions, not a separate pytest file. Each paradigm script asserts invariants (≥1 output per probe, all spec-compliant by construction) at the end of main().
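
The test policy above amounts to running plain assertions over the output dict at the end of `main()`. A minimal sketch of that pattern follows; `check_invariants` and the toy payload are hypothetical (the scripts in this plan inline their own assertions rather than calling a shared helper).

```python
def check_invariants(out: dict[str, dict]) -> int:
    """Assert per-probe invariants; return the number of candidates checked."""
    checked = 0
    for key, payload in out.items():
        # At least one output per probe.
        assert len(payload["top_k"]) >= 1, f"{key}: no candidates returned"
        for cand in payload["top_k"]:
            # Subject and object must be different words.
            assert cand["nsubj"] != cand["dobj"], f"{key}: nsubj == dobj"
            checked += 1
    return checked


# Toy payload shaped like the per-probe JSON (illustrative values only).
toy_out = {
    "chase_spec1": {"top_k": [{"nsubj": "cat", "dobj": "mouse"}]},
    "eat_spec1": {"top_k": [{"nsubj": "dog", "dobj": "bone"},
                            {"nsubj": "bird", "dobj": "worm"}]},
}

if __name__ == "__main__":
    print(f"{check_invariants(toy_out)} candidates checked, invariants OK")
```

Because the checks are bare `assert` statements, running the script with `python -O` would skip them, so the smoke run should not use that flag.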

Task 1: Bootstrap dir + source corpus + shared utilities

Files:
- Create: packages/generation/research/2026-05-07-sentence-generation-paradigms/README.md
- Create: packages/generation/research/2026-05-07-sentence-generation-paradigms/shared.py
- Create: packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/source_corpus.txt

[ ] Step 1.1: Create directory structure

mkdir -p packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs

[ ] Step 1.2: Create outputs/source_corpus.txt with 60+ public-domain sentences

The corpus must contain attested English sentences with diverse dependency structures (NP-V-NP, NP-V-NP-PP, NP-V-NP-Adv, transitive, intransitive). Sources: Aesop's fables (Project Gutenberg, public domain), simple SLP-style hand-typed sentences. Target ~60-100 sentences.

Once a wolf saw a lamb drinking at a stream.
The wolf wanted to find an excuse to eat the lamb.
The lamb said it had not done anything wrong.
A hare laughed at a tortoise for being slow.
The tortoise challenged the hare to a race.
The hare quickly ran ahead of the tortoise.
The tortoise plodded along the road steadily.
A boy watched sheep on a hillside.
The boy cried out that a wolf was coming.
The villagers ran up the hill to help.
A fox saw a crow with cheese in its beak.
The fox flattered the crow until it sang.
The cheese fell from the beak of the crow.
A mouse ran across the face of a sleeping lion.
The lion woke up and caught the mouse.
The mouse begged the lion to spare its life.
A farmer found a goose that laid a golden egg.
The farmer killed the goose to find more eggs.
A dog carried a bone over a bridge.
The dog saw its reflection in the water.
A goat helped a fox out of a deep well.
The fox climbed onto the back of the goat.
A milkmaid carried a pail of milk on her head.
The milkmaid imagined buying many fine things.
A traveler met a dog on a country road.
The dog barked loudly at the traveler.
The traveler picked up a stone to throw.
A child kicked a ball across the yard.
The teacher read a book to the class.
The students drew pictures in their notebooks.
A baker mixed flour and water in a bowl.
The baker shaped the dough into round loaves.
A girl painted a picture of a flower garden.
The girl gave the painting to her mother.
A boy filled a glass with cold water.
The boy drank the water in big gulps.
A man cut the bread with a sharp knife.
The man ate the bread with butter.
A cat chased a mouse around the room.
The mouse ran into a small hole.
The cat waited near the hole patiently.
A dog ate a bone in the garden.
The owner watched the dog from the window.
A bird sang a song in the tree.
The children listened to the song happily.
A horse pulled a wagon up the hill.
The driver guided the horse with reins.
A boy threw a ball to his friend.
The friend caught the ball easily.
A woman washed the dishes after dinner.
The woman dried the dishes with a towel.
A worker fixed a broken window in the house.
The window let the warm sun into the room.
A frog jumped from one rock to another.
The pond was full of green lily pads.
A bee gathered nectar from a bright flower.
The flower swayed gently in the breeze.
A swan glided across the calm lake.
A child kicked the small soccer ball.
The ball rolled across the green grass.
A teacher kicked a stuck door open.
The door swung wide with a loud creak.
A student melted some butter in a pan.
The butter sizzled and bubbled gently.
A wolf chased a deer through the forest.
The deer leaped over a fallen log.
A girl filled a bucket with sand.
She tipped the sand onto the beach.
The painter painted the front door red.
The red door looked bright in the sun.
The cleaner washed the floor of the kitchen.
The kitchen floor sparkled when it dried.
The bird ate a worm from the soft soil.
The worm wriggled in the bird's beak.
The doctor saw the patient in the small office.
The patient sat quietly on the chair.
The cook made a stew with potatoes and carrots.
The stew filled the kitchen with a warm smell.
The students made a mural for the classroom.
The mural showed many colorful animals.
The mechanic cut a rusty pipe with a saw.
The pipe broke into two short pieces.
The dog kicked up dust as it ran.
The runner saw the finish line ahead.
The fisherman cast a long line into the lake.
The line floated on the calm surface.
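
Before parsing, it may be worth checking the corpus file against the targets stated in Step 1.2. The sketch below is one assumption about what such a check could look like: `corpus_report` is a hypothetical helper, and the default bounds are the 60-100 sentence target from the text.

```python
def corpus_report(text: str, lo: int = 60, hi: int = 100) -> dict:
    """Summarize a one-sentence-per-line corpus."""
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    return {
        "count": len(sentences),
        "in_target_range": lo <= len(sentences) <= hi,
        # Lines that lack terminal punctuation usually signal a bad paste.
        "unterminated": [s for s in sentences if not s.endswith((".", "!", "?"))],
        # Exact duplicates add no new dependency structure.
        "duplicates": sorted({s for s in sentences if sentences.count(s) > 1}),
    }


# Tiny sample with one unterminated line and one duplicate.
sample = "The fox saw a crow.\n\nThe crow sang\nThe fox saw a crow.\n"
report = corpus_report(sample)
print(report)
```

For the real file, the input would be the text of `outputs/source_corpus.txt`, read with `Path.read_text()`.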

[ ] Step 1.3: Create shared.py with PROBES + surface_realize() + lemma_compliance()

"""Shared utilities for the 3-paradigm sentence generation spike.

- PROBES: 5 (verb_lemma, spec_id) tuples reused across all paradigms.
- surface_realize(): produces a single grammatical sentence from
  (nsubj, verb_lemma, dobj, [adv]). Inserts determiners, inflects the
  verb for 3sg/non-3sg via lemminflect.
- lemma_compliance(): checks that an output verb form is an inflectional
  form of the seed lemma (matches PHON-102's lemma-aware metric).
"""

from __future__ import annotations

import functools

# Verb lemma + spec_id; band fixed at fineweb_adult per spec.
PROBES = [
    ("melt", "spec6"),
    ("chase", "spec1"),
    ("fill", "spec1"),
    ("cut", "spec1"),
    ("eat", "spec1"),
]
BAND = "fineweb_adult"

# Reuse PHON-97's MANNER_ADVERBS list (kept in-tree to avoid coupling).
MANNER_ADVERBS = [
    "quickly", "slowly", "carefully", "loudly", "quietly",
    "gently", "harshly", "easily", "barely", "happily",
    "sadly", "lazily", "eagerly", "calmly", "smoothly",
    "roughly", "softly", "rapidly", "casually", "frantically",
    "patiently", "kindly", "warmly", "coldly", "fiercely",
    "playfully", "swiftly", "steadily", "wildly", "neatly",
]


@functools.lru_cache(maxsize=128)
def _verb_inflections(lemma: str) -> set[str]:
    from lemminflect import getInflection
    forms = {lemma}
    for tag in ("VBZ", "VBP", "VBD", "VBG", "VBN", "VB"):
        for f in (getInflection(lemma, tag=tag) or ()):
            forms.add(f)
    return forms


def lemma_compliance(seed_lemma: str, output_word: str) -> bool:
    """True iff output_word is an inflectional form of seed_lemma.

    Mirrors PHON-102's lemma_aware_verb_lock metric.
    """
    return output_word == seed_lemma or output_word in _verb_inflections(seed_lemma)


def _is_plural_subject(noun: str) -> bool:
    """Cheap heuristic: a noun is plural if it ends in 's' but not in 'ss' or 'us'.
    Wrong for irregular plurals (people, men, mice) and for singular nouns
    ending in 's' (lens), but good enough for the spike. Refine in
    productionization if needed."""
    if noun.endswith("ss") or noun.endswith("us"):
        return False
    return noun.endswith("s")


def _conjugate_verb(lemma: str, plural_subject: bool) -> str:
    """Return verb in 3sg-present (singular subject) or base form (plural)."""
    from lemminflect import getInflection
    target_tag = "VBP" if plural_subject else "VBZ"
    forms = getInflection(lemma, tag=target_tag)
    return forms[0] if forms else lemma


def surface_realize(
    nsubj: str,
    verb_lemma: str,
    dobj: str,
    manner_adv: str | None = None,
) -> str:
    """`(nsubj, verb_lemma, dobj[, adv])` → grammatical sentence.

    Inserts `the` before nsubj and dobj. Conjugates the verb to agree
    with subject number (singular nouns → 3sg; plural nouns → base form).
    Capitalizes first letter; appends period.
    """
    plural = _is_plural_subject(nsubj)
    verb = _conjugate_verb(verb_lemma, plural_subject=plural)
    parts = ["The", nsubj, verb, "the", dobj]
    if manner_adv:
        parts.append(manner_adv)
    text = " ".join(parts)
    return text[0].upper() + text[1:] + "."


if __name__ == "__main__":
    # Smoke checks
    assert surface_realize("cat", "chase", "ball") == "The cat chases the ball."
    assert surface_realize("cats", "chase", "ball") == "The cats chase the ball."
    assert surface_realize("dog", "eat", "bone", "quickly") == "The dog eats the bone quickly."
    assert lemma_compliance("chase", "chases")
    assert lemma_compliance("chase", "chase")
    assert not lemma_compliance("chase", "kick")
    print("shared.py smoke checks OK")

[ ] Step 1.4: Run shared.py smoke check

uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/shared.py

Expected output: shared.py smoke checks OK
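
The plural-subject heuristic in shared.py is the piece most likely to cause agreement errors, so it helps to see where it breaks. The sketch below copies the suffix rule into a standalone function (no lemminflect needed) and probes it with a few nouns. The word list is illustrative and is not drawn from the spec lexicon.

```python
def is_plural_subject(noun: str) -> bool:
    """Standalone copy of the suffix heuristic from shared.py."""
    if noun.endswith("ss") or noun.endswith("us"):
        return False
    return noun.endswith("s")


# (noun, actually_plural) pairs chosen to exercise the known weak spots.
cases = [
    ("cats", True),      # regular plural: handled
    ("glass", False),    # -ss guard: handled
    ("bus", False),      # -us guard: handled
    ("mice", True),      # irregular plural: missed
    ("children", True),  # irregular plural: missed
    ("lens", False),     # singular ending in -s: misread as plural
]

misjudged = [noun for noun, truth in cases if is_plural_subject(noun) != truth]
print(misjudged)
```

A misjudged subject flips the verb form the realizer asks for: "mice" is treated as singular, so `surface_realize` would request the 3sg form for it. If such nouns turn up in the spec lexicon, a small exception list would be the cheapest fix.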

[ ] Step 1.5: Create README.md

# Sentence Generation Paradigms — Research Spike

Three non-LM sentence generators tested side-by-side against PHON-102's C1+MLM
baseline on the 5 PHON-95 acceptance probes. Spec at
`docs/superpowers/specs/2026-05-07-sentence-generation-paradigms-spike.md`.

## Files

- `shared.py` — PROBES + surface_realize() + lemma_compliance()
- `paradigm_3_csp.py` — CSP enumeration over (nsubj, dobj, manner_adv)
- `paradigm_1_dep_trees.py` — spaCy dep-tree skeleton mining + slot-fill
- `paradigm_2_graph_walk.py` — verb→filler PMI walk + Qwensim cross-slot
- `compare.py` — assemble side-by-side comparison.md
- `notebook.md` — observations + recommendation memo
- `outputs/source_corpus.txt` — public-domain corpus for paradigm 1
- `outputs/{p1,p2,p3}.json` — per-paradigm outputs
- `outputs/comparison.md` — final side-by-side table

## Usage

```bash
# Run all three paradigms in order (CSP first per spec §6 risk 4)
uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_3_csp.py
uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_1_dep_trees.py
uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_2_graph_walk.py
uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/compare.py
```

[ ] Step 1.6: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/
git commit -m "spike bootstrap: dir + shared surface realizer + source corpus

Three-paradigm sentence generation spike scaffold.

- shared.py: PROBES (5 PHON-95 probes, verb lemma + spec_id), surface_realize()
  for (nsubj, verb, dobj, [adv]) → grammatical sentence with lemminflect-driven
  agreement, lemma_compliance() helper matching PHON-102's metric
- outputs/source_corpus.txt: ~85 public-domain English sentences (Aesop's
  fables + hand-typed SLP-style examples) for paradigm 1's dep-tree mining
- README.md: usage notes

Spec: docs/superpowers/specs/2026-05-07-sentence-generation-paradigms-spike.md
Plan: docs/superpowers/plans/2026-05-07-sentence-generation-paradigms-spike.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 2: Paradigm 3 — CSP solver (built first per spec §6 risk 4)

Files:
- Create: packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_3_csp.py
- Create (output): packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p3.json

[ ] Step 2.1: Create paradigm_3_csp.py

"""Paradigm 3 — Constraint Satisfaction Problem.

Variables: (nsubj, dobj, manner_adv).  verb is locked to the user-chosen lemma.
Domains:
  nsubj := spec_lexicon ∩ pmi_admit(verb, "nsubj", band)
  dobj  := spec_lexicon ∩ pmi_admit(verb, "dobj",  band)
  manner_adv := MANNER_ADVERBS (no spec restriction — adverb is "connecting tissue")
Hard constraints:
  nsubj != dobj
  PMI(verb, "nsubj", nsubj) > 0  (already enforced by domain construction)
  PMI(verb, "dobj",  dobj)  > 0  (already enforced by domain construction)
Soft objective (rank only):
  maximize sum_PPMI = PMI(verb, "nsubj", nsubj) + PMI(verb, "dobj", dobj)
            + 0.001 if an adverb is present (adv has no PMI table; flat bonus)

Solver: brute-force enumeration over the cartesian product, sorted by sum_PPMI.
Output: top-K assignments with PMI breakdowns. Decoupled from presentation
intentionally — the user can rerank/sample/filter the K candidates downstream.
"""

from __future__ import annotations

import json
import time
from pathlib import Path

import polars as pl
from phonolex_data.runtime.store import WordStore

from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS

from shared import BAND, MANNER_ADVERBS, PROBES, lemma_compliance, surface_realize

REPO_ROOT = Path(__file__).resolve().parents[4]
WORDS_PARQUET = REPO_ROOT / "data" / "runtime" / "words.parquet"
SELECTIONAL_PARQUET = REPO_ROOT / "data" / "runtime" / "selectional.parquet"
OUT_PATH = Path(__file__).parent / "outputs" / "p3.json"

TOP_K = 8  # how many candidates to surface per probe


def pmi_lookup(sel_df: pl.DataFrame, verb: str, role: str, band: str) -> dict[str, float]:
    df = sel_df.filter(
        (pl.col("verb") == verb)
        & (pl.col("role") == role)
        & (pl.col("band") == band)
        & (pl.col("ppmi") > 0.0)
    )
    return dict(zip(df.get_column("filler").to_list(), df.get_column("ppmi").to_list()))


def spec_lexicon(store: WordStore, spec_id: str) -> set[str]:
    return set(
        store.subset(SPEC_FILTERS[spec_id])
        .get_column("word")
        .str.to_lowercase()
        .to_list()
    )


def solve(
    verb: str,
    spec_id: str,
    spec_words: set[str],
    sel_df: pl.DataFrame,
    *,
    top_k: int = TOP_K,
    include_adverb: bool = True,
) -> tuple[list[dict], dict]:
    """Return (top_K candidates as dicts, stats dict)."""
    nsubj_pmi = pmi_lookup(sel_df, verb, "nsubj", BAND)
    dobj_pmi = pmi_lookup(sel_df, verb, "dobj", BAND)

    nsubj_domain = sorted(set(nsubj_pmi.keys()) & spec_words)
    dobj_domain = sorted(set(dobj_pmi.keys()) & spec_words)
    adv_domain = list(MANNER_ADVERBS) if include_adverb else [None]

    stats = {
        "verb": verb,
        "spec_id": spec_id,
        "nsubj_domain_size": len(nsubj_domain),
        "dobj_domain_size": len(dobj_domain),
        "adv_domain_size": len([a for a in adv_domain if a is not None]),
        "candidate_count": 0,
    }

    candidates: list[tuple[float, str, str, str | None]] = []
    for n in nsubj_domain:
        for d in dobj_domain:
            if n == d:
                continue
            base_score = nsubj_pmi[n] + dobj_pmi[d]
            for adv in adv_domain:
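                # Every adverb gets the same flat +0.001 (there is no adverb PMI
                # table), so adverb variants of one (nsubj, dobj) pair tie and the
                # stable sort keeps MANNER_ADVERBS order among them. The reported
                # sum_pmi includes this bonus; the pure score is base_score.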
                score = base_score + (0.001 if adv else 0.0)
                candidates.append((score, n, d, adv))
                stats["candidate_count"] += 1

    candidates.sort(key=lambda t: t[0], reverse=True)
    top = []
    for score, n, d, adv in candidates[:top_k]:
        top.append({
            "nsubj": n,
            "dobj": d,
            "manner_adv": adv,
            "sum_pmi": float(score),
            "pmi_nsubj": float(nsubj_pmi[n]),
            "pmi_dobj": float(dobj_pmi[d]),
            "sentence": surface_realize(n, verb, d, adv),
        })
    return top, stats


def main() -> None:
    print(f"Loading WordStore + selectional from {REPO_ROOT}/data/runtime/ ...")
    store = WordStore.from_parquet(WORDS_PARQUET)
    sel_df = pl.read_parquet(SELECTIONAL_PARQUET)

    out: dict[str, dict] = {}
    overall_t0 = time.perf_counter()
    for verb, spec_id in PROBES:
        print(f"\n[probe] verb={verb} spec={spec_id}")
        spec_words = spec_lexicon(store, spec_id)
        t0 = time.perf_counter()
        top, stats = solve(verb, spec_id, spec_words, sel_df)
        elapsed = time.perf_counter() - t0
        out[f"{verb}_{spec_id}"] = {
            "verb": verb,
            "spec_id": spec_id,
            "stats": stats,
            "top_k": top,
            "wall_clock_s": elapsed,
        }
        for c in top[:3]:
            print(
                f"  sum_pmi={c['sum_pmi']:+.2f}  {c['sentence']}  "
                f"(nsubj={c['pmi_nsubj']:+.2f}, dobj={c['pmi_dobj']:+.2f})"
            )
        print(f"  ... ({stats['candidate_count']:,} candidates evaluated in {elapsed:.2f}s)")

    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    OUT_PATH.write_text(json.dumps(out, indent=2))
    overall = time.perf_counter() - overall_t0
    print(f"\nWrote {OUT_PATH}")
    print(f"Total wall-clock: {overall:.1f}s")

    # Inline invariant checks (in lieu of pytest for research code)
    for key, payload in out.items():
        assert len(payload["top_k"]) >= 1, f"{key}: no candidates returned"
        for c in payload["top_k"]:
            assert c["nsubj"] != c["dobj"], f"{key}: nsubj == dobj"
            assert c["pmi_nsubj"] > 0, f"{key}: PMI(nsubj) <= 0"
            assert c["pmi_dobj"] > 0, f"{key}: PMI(dobj) <= 0"
            assert lemma_compliance(payload["verb"], c["sentence"].split()[2].lower()), (
                f"{key}: verb form '{c['sentence'].split()[2]}' is not an inflection of '{payload['verb']}'"
            )
    print("\nInvariant checks OK")


if __name__ == "__main__":
    main()

[ ] Step 2.2: Run paradigm 3 end-to-end

uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_3_csp.py

Expected output (last lines):

Wrote .../outputs/p3.json
Total wall-clock: <a few seconds>s
Invariant checks OK

[ ] Step 2.3: Visual sanity check of outputs/p3.json

python3 -c "import json; d = json.load(open('packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p3.json')); [print(k, '→', d[k]['top_k'][0]['sentence']) for k in d]"

Each line should be a grammatical English sentence with the locked verb (or its inflection) and 100% spec-compliant nsubj/dobj.
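
When inspecting the output, it helps to know how the enumeration in `solve()` ranks ties. The sketch below reproduces it on toy data: the PPMI values are made up (not taken from selectional.parquet) and `enumerate_csp` and `best_per_pair` are hypothetical helpers. It illustrates one property of a flat adverb bonus: every adverb variant of a given (nsubj, dobj) pair ties, so a small top-K can fill up with variants of the best pair.

```python
from itertools import product


def enumerate_csp(nsubj_ppmi, dobj_ppmi, adverbs, top_k=4, adv_bonus=0.001):
    """Brute-force (nsubj, dobj, adv) assignments ranked by summed PPMI."""
    scored = []
    for (n, p_n), (d, p_d), adv in product(
        nsubj_ppmi.items(), dobj_ppmi.items(), adverbs
    ):
        if n == d:  # hard constraint: subject and object must differ
            continue
        scored.append((p_n + p_d + (adv_bonus if adv else 0.0), n, d, adv))
    # Stable sort: ties keep enumeration order, i.e. adverb list order.
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]


def best_per_pair(ranked):
    """Keep the first (highest-ranked) assignment for each (nsubj, dobj)."""
    seen, kept = set(), []
    for score, n, d, adv in ranked:
        if (n, d) not in seen:
            seen.add((n, d))
            kept.append((score, n, d, adv))
    return kept


# Toy PPMI tables: illustrative numbers only.
nsubj = {"dog": 2.0, "cat": 1.5}
dobj = {"bone": 3.0, "fish": 2.5}
adverbs = ["quickly", "slowly", "gently"]

top = enumerate_csp(nsubj, dobj, adverbs, top_k=4)
pairs_in_top = {(n, d) for _, n, d, _ in top}
diverse = best_per_pair(enumerate_csp(nsubj, dobj, adverbs, top_k=12))
print(pairs_in_top)
print([(n, d) for _, n, d, _ in diverse])
```

With 30 manner adverbs and `TOP_K = 8`, the same effect would be expected in the real script, so a per-pair dedupe like `best_per_pair` is one possible downstream rerank if the top-K looks repetitive.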

[ ] Step 2.4: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_3_csp.py packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p3.json
git commit -m "spike paradigm 3: CSP enumeration over (nsubj, dobj, adv)

Brute-force CSP over the cartesian product of (spec ∩ pmi_admit(verb, role))
for nsubj and dobj × MANNER_ADVERBS for the adverbial slot. Hard constraints
on nsubj != dobj and PMI > 0 (latter enforced by domain construction). Soft
objective: maximize sum-PPMI for ranking only; the candidate set is decoupled
from presentation.

Top-8 per probe written to outputs/p3.json with full PMI breakdowns so
downstream rerankers (lexical novelty, phonotactic complexity, clinician
preferences) have everything they need.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 3: Paradigm 1 — Dependency-tree templating

Files:
- Create: packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_1_dep_trees.py
- Create (output): packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p1.json

[ ] Step 3.1: Create paradigm_1_dep_trees.py

"""Paradigm 1 — Dependency-tree templating.

1. Parse outputs/source_corpus.txt with spaCy en_core_web_sm.
2. For each parsed sentence, extract a structural skeleton:
   tuple of (head_pos, dep_label, child_pos, child_idx, head_idx) per dep arc.
   Word forms are dropped; POS tags, dependency labels and positions are kept.
3. Dedupe skeletons by counting identical arc tuples. Expect ~20-50 distinct templates.
4. For each probe, find skeletons whose VERB slot is fillable by the locked
   verb (i.e., the skeleton has a token with pos_=='VERB' and at least one
   nsubj/dobj child slot that maps to a PMI role we have data for).
5. Slot-fill content nodes (NOUN/ADJ/ADV) with PMI-admit ∩ spec_lexicon
   ranked by max-PMI(verb, role, filler) where role is derived from the
   dependency label. Slots with no PMI data fall back to spec_lexicon ∩
   POS-filter ranked by lemma frequency.
6. Surface realize via shared.surface_realize() when the skeleton is a
   simple NP-V-NP shape. The v1 code below handles only that shape; the
   token-by-token fallback for other shapes is not implemented in v1.

The point of this paradigm is "real attested grammar > hand-written CFG."
"""

from __future__ import annotations

import json
import time
from collections import Counter
from pathlib import Path

import polars as pl
import spacy
from phonolex_data.runtime.store import WordStore

from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS

from shared import BAND, PROBES, lemma_compliance, surface_realize

REPO_ROOT = Path(__file__).resolve().parents[4]
WORDS_PARQUET = REPO_ROOT / "data" / "runtime" / "words.parquet"
SELECTIONAL_PARQUET = REPO_ROOT / "data" / "runtime" / "selectional.parquet"
SOURCE_CORPUS = Path(__file__).parent / "outputs" / "source_corpus.txt"
OUT_PATH = Path(__file__).parent / "outputs" / "p1.json"

# selectional roles we have PMI data for; other dep_labels fall back to lemma freq.
PMI_ROLES = {
    "nsubj", "dobj", "iobj", "xcomp", "ccomp",
    "pobj_with", "pobj_on", "pobj_in", "pobj_to",
}
TOP_K = 8


def parse_corpus(nlp, corpus_text: str) -> list:
    docs = []
    for line in corpus_text.splitlines():
        line = line.strip()
        if not line:
            continue
        doc = nlp(line)
        docs.append(doc)
    return docs


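def extract_skeleton(doc) -> tuple[tuple, ...]:
    """Anonymized POS+DEP skeleton: one (head_pos, dep_label, child_pos,
    child_idx, head_idx) tuple per non-root token, in token order. The root
    contributes ("ROOT", "root", pos, idx). Token order is already canonical,
    so no extra sorting is applied before hashing.
    """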
    arcs = []
    for tok in doc:
        if tok.is_space or tok.is_punct:
            continue
        if tok.head.i == tok.i:  # root: spaCy sets the root's head to itself
            arcs.append(("ROOT", "root", tok.pos_, tok.i))
            continue
        arcs.append((tok.head.pos_, tok.dep_, tok.pos_, tok.i, tok.head.i))
    return tuple(arcs)


def find_verb_with_subj_obj(doc) -> dict | None:
    """Return {'verb_idx': i, 'nsubj_idx': i, 'dobj_idx': i} for the first
    VERB token (in sentence order) that has both an nsubj and a dobj child.
    Used to identify skeletons that match our (V + NP-NP) probe shape."""
    for tok in doc:
        if tok.pos_ != "VERB":
            continue
        nsubj_idx = None
        dobj_idx = None
        for child in tok.children:
            if child.dep_ == "nsubj" and child.pos_ in {"NOUN", "PROPN", "PRON"}:
                nsubj_idx = child.i
            elif child.dep_ == "dobj" and child.pos_ in {"NOUN", "PROPN", "PRON"}:
                dobj_idx = child.i
        if nsubj_idx is not None and dobj_idx is not None:
            return {"verb_idx": tok.i, "nsubj_idx": nsubj_idx, "dobj_idx": dobj_idx}
    return None


def pmi_lookup(sel_df: pl.DataFrame, verb: str, role: str, band: str) -> dict[str, float]:
    df = sel_df.filter(
        (pl.col("verb") == verb)
        & (pl.col("role") == role)
        & (pl.col("band") == band)
        & (pl.col("ppmi") > 0.0)
    )
    return dict(zip(df.get_column("filler").to_list(), df.get_column("ppmi").to_list()))


def spec_lexicon(store: WordStore, spec_id: str) -> set[str]:
    return set(
        store.subset(SPEC_FILTERS[spec_id])
        .get_column("word")
        .str.to_lowercase()
        .to_list()
    )


def slot_fill_for_probe(
    verb: str,
    spec_id: str,
    spec_words: set[str],
    sel_df: pl.DataFrame,
    skeletons: list,
    *,
    top_k: int = TOP_K,
) -> list[dict]:
    """For probes that match an NP-V-NP skeleton, slot-fill (nsubj, dobj)
    with PMI-ranked admit ∩ spec_lexicon. We are NOT permuting skeletons in
    this v1 — the matching skeleton tells us NP-V-NP is attested, then we
    slot-fill that shape. Future v2 could vary skeleton per output.
    """
    # Skeletons not used for slot selection in v1 — we only used them as
    # evidence that NP-V-NP is attested in real English. We now PMI-rank
    # admit fills.
    nsubj_pmi = pmi_lookup(sel_df, verb, "nsubj", BAND)
    dobj_pmi = pmi_lookup(sel_df, verb, "dobj", BAND)
    nsubj_domain = sorted(
        (n for n in nsubj_pmi if n in spec_words),
        key=lambda w: nsubj_pmi[w],
        reverse=True,
    )
    dobj_domain = sorted(
        (d for d in dobj_pmi if d in spec_words),
        key=lambda w: dobj_pmi[w],
        reverse=True,
    )

    candidates = []
    # Greedy: take top-2K nsubj × top-2K dobj, score by sum_pmi, return top-K
    for n in nsubj_domain[:top_k * 2]:
        for d in dobj_domain[:top_k * 2]:
            if n == d:
                continue
            score = nsubj_pmi[n] + dobj_pmi[d]
            candidates.append({
                "nsubj": n,
                "dobj": d,
                "manner_adv": None,
                "sum_pmi": float(score),
                "pmi_nsubj": float(nsubj_pmi[n]),
                "pmi_dobj": float(dobj_pmi[d]),
                "sentence": surface_realize(n, verb, d),
            })
    candidates.sort(key=lambda c: c["sum_pmi"], reverse=True)
    return candidates[:top_k]


def main() -> None:
    print(f"Loading source corpus from {SOURCE_CORPUS} ...")
    text = SOURCE_CORPUS.read_text()
    print(f"  {len([l for l in text.splitlines() if l.strip()])} sentences")

    print("Loading spaCy en_core_web_sm ...")
    nlp = spacy.load("en_core_web_sm")

    t0 = time.perf_counter()
    docs = parse_corpus(nlp, text)
    print(f"  parsed {len(docs)} docs in {time.perf_counter()-t0:.1f}s")

    # Extract & dedupe skeletons
    skeletons = [extract_skeleton(d) for d in docs]
    skeleton_counter = Counter(skeletons)
    print(f"  {len(skeletons)} skeletons; {len(skeleton_counter)} distinct shapes")

    # Identify NP-V-NP-bearing parses (most common shape we'll slot-fill)
    np_v_np_count = sum(1 for d in docs if find_verb_with_subj_obj(d) is not None)
    print(f"  {np_v_np_count} docs are NP-V-NP-shaped (this is the slot-fillable shape)")

    print("\nLoading WordStore + selectional ...")
    store = WordStore.from_parquet(WORDS_PARQUET)
    sel_df = pl.read_parquet(SELECTIONAL_PARQUET)

    out: dict[str, dict] = {}
    for verb, spec_id in PROBES:
        print(f"\n[probe] verb={verb} spec={spec_id}")
        spec_words = spec_lexicon(store, spec_id)
        t0 = time.perf_counter()
        top = slot_fill_for_probe(verb, spec_id, spec_words, sel_df, skeletons)
        elapsed = time.perf_counter() - t0
        out[f"{verb}_{spec_id}"] = {
            "verb": verb,
            "spec_id": spec_id,
            "skeleton_count": len(skeleton_counter),
            "np_v_np_doc_count": np_v_np_count,
            "top_k": top,
            "wall_clock_s": elapsed,
        }
        for c in top[:3]:
            print(f"  sum_pmi={c['sum_pmi']:+.2f}  {c['sentence']}")
        print(f"  ({len(top)} returned in {elapsed:.2f}s)")

    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    OUT_PATH.write_text(json.dumps(out, indent=2))
    print(f"\nWrote {OUT_PATH}")

    # Invariant checks
    for key, payload in out.items():
        assert len(payload["top_k"]) >= 1, f"{key}: no candidates returned"
        for c in payload["top_k"]:
            assert c["nsubj"] != c["dobj"], f"{key}: nsubj == dobj"
            assert lemma_compliance(payload["verb"], c["sentence"].split()[2].lower()), (
                f"{key}: verb form not an inflection of {payload['verb']}"
            )
    print("Invariant checks OK")


if __name__ == "__main__":
    main()

[ ] Step 3.2: Run paradigm 1 end-to-end

uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_1_dep_trees.py

Expected output: per-probe top-3 sentences, then Wrote .../p1.json and Invariant checks OK. spaCy parse should take a few seconds.

[ ] Step 3.3: Visual sanity check of skeleton diversity

python3 -c "import json; d = json.load(open('packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p1.json')); k0 = next(iter(d)); print('skeleton_count:', d[k0]['skeleton_count'], 'np_v_np_docs:', d[k0]['np_v_np_doc_count'])"

Expected: skeleton_count should be ≥ 20 (otherwise the corpus is too uniform; expand source_corpus.txt). np_v_np_doc_count should be ≥ 30.
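
The skeleton count checked above depends on what goes into each arc tuple. The sketch below uses hand-written toy parses instead of spaCy (the `Tok` tuple and its values are hypothetical, typed in by hand rather than produced by a parser) to show how `Counter` dedupes skeletons and what changes when token positions are part of the arc.

```python
from collections import Counter
from typing import NamedTuple


class Tok(NamedTuple):
    """Hand-written stand-in for a parsed token (not a spaCy Token)."""
    i: int      # position in the sentence
    pos: str    # coarse part of speech
    dep: str    # dependency label
    head: int   # position of the head token (own index for the root)


def skeleton(toks, with_positions=True):
    """Per-token arcs in sentence order; lexical content is never stored."""
    arcs = []
    for t in toks:
        head_pos = "ROOT" if t.head == t.i else toks[t.head].pos
        arc = (head_pos, t.dep, t.pos)
        arcs.append(arc + (t.i, t.head) if with_positions else arc)
    return tuple(arcs)


# "The cat chased a mouse." and "A boy threw a ball." share one shape.
svo = [Tok(0, "DET", "det", 1), Tok(1, "NOUN", "nsubj", 2),
       Tok(2, "VERB", "ROOT", 2), Tok(3, "DET", "det", 4),
       Tok(4, "NOUN", "dobj", 2)]
# "The cat quickly chased a mouse.": same arcs plus one adverb.
svo_adv = [Tok(0, "DET", "det", 1), Tok(1, "NOUN", "nsubj", 3),
           Tok(2, "ADV", "advmod", 3), Tok(3, "VERB", "ROOT", 3),
           Tok(4, "DET", "det", 5), Tok(5, "NOUN", "dobj", 3)]

parses = [svo, list(svo), svo_adv]
shapes = Counter(skeleton(p) for p in parses)
print(len(parses), "parses;", len(shapes), "distinct shapes")

# Without positions, the adverb sentence visibly extends the plain SVO shape.
extends_free = set(skeleton(svo, False)) <= set(skeleton(svo_adv, False))
extends_pos = set(skeleton(svo)) <= set(skeleton(svo_adv))
print(extends_free, extends_pos)
```

Identical structures still collapse to one shape when positions are kept, but an inserted modifier shifts every later index, so position-bearing skeletons cannot show that one shape extends another. That is worth keeping in mind when reading the distinct-shape count.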

[ ] Step 3.4: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_1_dep_trees.py packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p1.json
git commit -m "spike paradigm 1: dep-tree templating from real spaCy parses

Parse outputs/source_corpus.txt with spaCy en_core_web_sm, extract anonymized
POS+DEP skeletons (lexical content dropped), dedupe at structural level. v1
slot-fills the NP-V-NP shape (well-attested in the corpus); the skeletons
themselves serve as evidence that the shape is real attested English, not
hand-written CFG.

Slot-fill nsubj/dobj from spec_lexicon ∩ pmi_admit(verb, role, BAND), greedy
top-2K × top-2K, ranked by sum-PPMI. Skeleton-driven slot variation is v2 work.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 4: Paradigm 2 — Lexical graph walks

Files:
- Create: packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_2_graph_walk.py
- Create (output): packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p2.json

[ ] Step 4.1: Create paradigm_2_graph_walk.py

"""Paradigm 2 — Lexical graph walks with cross-slot Qwensim conditioning.

Graph:
  Nodes: words.
  Verb→filler edges: weight = PPMI(verb, role, filler, band), role-typed.
  Word↔word edges: weight = Qwensim similarity.

Walk strategy for one sentence:
  1. Start from the locked verb.
  2. Pick top-N (default 10) nsubj candidates from spec ∩ admit, ordered by PPMI.
  3. For each chosen nsubj, pick a dobj from the verb's top-N dobj candidates
     whose Qwensim similarity to the chosen nsubj is closest to the MEDIAN of
     that top-N pool — heuristic: prefer "thematically related but not
     near-synonymous" pairings, e.g., (kid, ball) over (kid, kids).
  4. Emit (verb, nsubj, dobj). Repeat for the next nsubj to get N candidates.

Surface realize via shared.surface_realize().
"""

from __future__ import annotations

import json
import time
from pathlib import Path

import polars as pl
from phonolex_data.runtime.store import WordStore

from phonolex_generators.cfg_seed.spec_filters import SPEC_FILTERS

from shared import BAND, PROBES, lemma_compliance, surface_realize

REPO_ROOT = Path(__file__).resolve().parents[4]
WORDS_PARQUET = REPO_ROOT / "data" / "runtime" / "words.parquet"
SELECTIONAL_PARQUET = REPO_ROOT / "data" / "runtime" / "selectional.parquet"
EDGES_PARQUET = REPO_ROOT / "data" / "runtime" / "edges.parquet"
OUT_PATH = Path(__file__).parent / "outputs" / "p2.json"

TOP_N_PER_ROLE = 10
TOP_K = 8


def pmi_lookup(sel_df: pl.DataFrame, verb: str, role: str, band: str) -> dict[str, float]:
    df = sel_df.filter(
        (pl.col("verb") == verb)
        & (pl.col("role") == role)
        & (pl.col("band") == band)
        & (pl.col("ppmi") > 0.0)
    )
    return dict(zip(df.get_column("filler").to_list(), df.get_column("ppmi").to_list()))


def spec_lexicon(store: WordStore, spec_id: str) -> set[str]:
    return set(
        store.subset(SPEC_FILTERS[spec_id])
        .get_column("word")
        .str.to_lowercase()
        .to_list()
    )


def build_qwensim_lookup(edges_df: pl.DataFrame) -> dict[tuple[str, str], float]:
    """Build (w1, w2) → similarity dict from edges.parquet.

    edges schema is (source, target, weight, type) per PHON-81 conventions;
    we restrict to qwensim-typed edges. Symmetrized: store both (a,b) and (b,a).
    """
    qs = edges_df
    # Try to locate the type column; tolerate schema differences across PHON-93.
    if "edge_type" in qs.columns:
        qs = qs.filter(pl.col("edge_type") == "qwensim")
    elif "type" in qs.columns:
        qs = qs.filter(pl.col("type") == "qwensim")
    # else: assume the file is already qwensim-only
    def first_present(*names: str) -> str:
        """Return the first of `names` that is a column in the edges frame."""
        for name in names:
            if name in qs.columns:
                return name
        raise KeyError(f"edges.parquet has none of {names}; found {qs.columns}")

    src = qs.get_column(first_present("source", "from")).str.to_lowercase().to_list()
    tgt = qs.get_column(first_present("target", "to")).str.to_lowercase().to_list()
    wts = qs.get_column(first_present("weight", "similarity")).to_list()
    out: dict[tuple[str, str], float] = {}
    for s, t, w in zip(src, tgt, wts):
        out[(s, t)] = float(w)
        out[(t, s)] = float(w)
    return out


def qwensim(lookup: dict[tuple[str, str], float], a: str, b: str) -> float:
    """Qwensim similarity; returns 0.0 if no edge exists."""
    return lookup.get((a, b), 0.0)


def walk(
    verb: str,
    spec_id: str,
    spec_words: set[str],
    sel_df: pl.DataFrame,
    qs_lookup: dict[tuple[str, str], float],
    *,
    top_n: int = TOP_N_PER_ROLE,
    top_k: int = TOP_K,
) -> list[dict]:
    """Generate top-K (nsubj, dobj) walks with median-Qwensim cross-conditioning."""
    nsubj_pmi = pmi_lookup(sel_df, verb, "nsubj", BAND)
    dobj_pmi = pmi_lookup(sel_df, verb, "dobj", BAND)

    nsubj_pool = sorted(
        (n for n in nsubj_pmi if n in spec_words),
        key=lambda w: nsubj_pmi[w],
        reverse=True,
    )[:top_n]
    dobj_pool = sorted(
        (d for d in dobj_pmi if d in spec_words),
        key=lambda w: dobj_pmi[w],
        reverse=True,
    )[:top_n]

    if not nsubj_pool or not dobj_pool:
        return []

    walks: list[dict] = []
    for n in nsubj_pool:
        # For this nsubj, compute Qwensim similarity to each dobj candidate
        # and pick the one closest to the median of those similarities.
        sims = [(d, qwensim(qs_lookup, n, d)) for d in dobj_pool if d != n]
        if not sims:
            continue
        sorted_sims = sorted(sims, key=lambda t: t[1])
        median_idx = len(sorted_sims) // 2
        chosen_dobj, chosen_sim = sorted_sims[median_idx]
        score = nsubj_pmi[n] + dobj_pmi[chosen_dobj]
        walks.append({
            "nsubj": n,
            "dobj": chosen_dobj,
            "manner_adv": None,
            "sum_pmi": float(score),
            "pmi_nsubj": float(nsubj_pmi[n]),
            "pmi_dobj": float(dobj_pmi[chosen_dobj]),
            "qwensim_nsubj_dobj": float(chosen_sim),
            "sentence": surface_realize(n, verb, chosen_dobj),
        })

    # Dedupe by (nsubj, dobj); keep the highest sum_pmi
    seen: dict[tuple[str, str], dict] = {}
    for w in walks:
        key = (w["nsubj"], w["dobj"])
        if key not in seen or w["sum_pmi"] > seen[key]["sum_pmi"]:
            seen[key] = w
    walks = list(seen.values())
    walks.sort(key=lambda w: w["sum_pmi"], reverse=True)
    return walks[:top_k]


def main() -> None:
    print("Loading WordStore + selectional + edges ...")
    store = WordStore.from_parquet(WORDS_PARQUET)
    sel_df = pl.read_parquet(SELECTIONAL_PARQUET)
    edges_df = pl.read_parquet(EDGES_PARQUET)
    print(f"  edges columns: {edges_df.columns}")

    print("Building Qwensim lookup ...")
    t0 = time.perf_counter()
    qs_lookup = build_qwensim_lookup(edges_df)
    print(f"  {len(qs_lookup):,} edges (symmetrized) in {time.perf_counter()-t0:.1f}s")

    out: dict[str, dict] = {}
    for verb, spec_id in PROBES:
        print(f"\n[probe] verb={verb} spec={spec_id}")
        spec_words = spec_lexicon(store, spec_id)
        t0 = time.perf_counter()
        top = walk(verb, spec_id, spec_words, sel_df, qs_lookup)
        elapsed = time.perf_counter() - t0
        out[f"{verb}_{spec_id}"] = {
            "verb": verb,
            "spec_id": spec_id,
            "top_k": top,
            "wall_clock_s": elapsed,
        }
        for c in top[:3]:
            print(
                f"  sum_pmi={c['sum_pmi']:+.2f}  qs={c['qwensim_nsubj_dobj']:.3f}  "
                f"{c['sentence']}"
            )
        print(f"  ({len(top)} returned in {elapsed:.2f}s)")

    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    OUT_PATH.write_text(json.dumps(out, indent=2))
    print(f"\nWrote {OUT_PATH}")

    # Invariant checks
    for key, payload in out.items():
        assert len(payload["top_k"]) >= 1, f"{key}: no walks returned"
        for c in payload["top_k"]:
            assert c["nsubj"] != c["dobj"], f"{key}: nsubj == dobj"
            assert lemma_compliance(payload["verb"], c["sentence"].split()[2].lower()), (
                f"{key}: verb form not an inflection of {payload['verb']}"
            )
    print("Invariant checks OK")


if __name__ == "__main__":
    main()

[ ] Step 4.2: Inspect edges.parquet schema first (we want to be sure the column names in build_qwensim_lookup are correct before running)

uv run python -c "import polars as pl; df = pl.read_parquet('data/runtime/edges.parquet'); print(df.columns); print(df.head(3))"

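If the column names are not among those build_qwensim_lookup accepts (source or from, target or to, weight or similarity, and optionally edge_type or type), edit build_qwensim_lookup to match the actual schema before running. Also confirm that the type column really contains the value qwensim: if it uses a different label, the filter drops every row, the lookup ends up empty, and every similarity silently falls back to 0.0.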
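The schema check in this step can be made explicit before the script runs. The sketch below is an illustration only, assuming the same candidate column names that build_qwensim_lookup already tolerates; `resolve_edge_columns` is a hypothetical helper, not something that exists in the repo.

```python
from typing import Optional


def resolve_edge_columns(columns: list[str]) -> dict[str, Optional[str]]:
    """Map logical roles to actual column names, or raise with a clear message."""
    candidates = {
        "source": ("source", "from"),
        "target": ("target", "to"),
        "weight": ("weight", "similarity"),
    }
    resolved: dict[str, Optional[str]] = {}
    for role, names in candidates.items():
        match = next((n for n in names if n in columns), None)
        if match is None:
            raise KeyError(
                f"edges.parquet: no column for {role!r}; tried {names}, found {columns}"
            )
        resolved[role] = match
    # The type column is optional: its absence is treated as "already qwensim-only",
    # matching the fallback branch in build_qwensim_lookup.
    resolved["type"] = next((n for n in ("edge_type", "type") if n in columns), None)
    return resolved
```

Passing `df.columns` from the inspection command above either returns the mapping or fails fast with the names it tried, which is easier to act on than a column-not-found error midway through the run.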

[ ] Step 4.3: Run paradigm 2 end-to-end

uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_2_graph_walk.py

Expected output: 3-line preview per probe (with sum_pmi, qs, sentence), then Wrote .../p2.json and Invariant checks OK.
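When reading the `qs` values in the preview, it helps to have the median selection from walk() in mind. The following toy sketch isolates that one step; the similarity numbers are invented for illustration and `pick_median_dobj` is a hypothetical stand-in for the inner loop of walk(), not a function in the script.

```python
def pick_median_dobj(nsubj, dobj_pool, sim):
    """Pick the dobj whose similarity to nsubj sits in the middle of the sorted pool.

    `sim` is a dict keyed by (word, word); missing pairs count as 0.0,
    mirroring qwensim() in the script.
    """
    scored = [(d, sim.get((nsubj, d), 0.0)) for d in dobj_pool if d != nsubj]
    if not scored:
        return None
    scored.sort(key=lambda pair: pair[1])
    # Same index rule as walk(): upper median for even-sized pools.
    return scored[len(scored) // 2]


# Toy similarities, invented for illustration only.
toy_sim = {
    ("kid", "kids"): 0.95,  # near-synonym: too similar
    ("kid", "ball"): 0.40,  # thematically related
    ("kid", "tax"): 0.05,   # unrelated
}
choice = pick_median_dobj("kid", ["kids", "ball", "tax"], toy_sim)
print(choice)  # ('ball', 0.4)
```

One consequence worth watching for: if most (nsubj, dobj) pairs have no Qwensim edge, their similarities are all 0.0, the sort leaves them in pool order, and the "median" pick is just the middle of the PPMI-ranked pool. A preview where every `qs` prints as 0.000 suggests the heuristic is not actually conditioning on anything.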

[ ] Step 4.4: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_2_graph_walk.py packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p2.json
git commit -m "spike paradigm 2: lexical graph walks with Qwensim cross-conditioning

For each verb, pick top-N nsubj candidates by PPMI; for each, pick the dobj
from the verb's top-N dobj pool whose Qwensim similarity to the chosen nsubj
is closest to the pool median. Heuristic: thematically related but not
lexically near-synonymous (prefer 'kid → ball' over 'kid → kids').

The cross-slot conditioning is the novel lever vs CFG/CSP, which treat the
two NP slots as conditionally independent given the verb.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 5: compare.py — Side-by-side comparison table

Files:
- Create: packages/generation/research/2026-05-07-sentence-generation-paradigms/compare.py
- Create (output): packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/comparison.md

[ ] Step 5.1: Create compare.py

"""Assemble a side-by-side comparison markdown table from paradigm outputs
plus the existing PHON-102 C1+MLM baseline.

Reads:
  outputs/p1.json, outputs/p2.json, outputs/p3.json  — this spike
  ../2026-05-07-phon-102-c1-eval/outputs/results-2026-05-07.jsonl  — baseline

Writes:
  outputs/comparison.md  — one section per probe, with C1+MLM, P1, P2, P3
                          top-3 outputs side-by-side, plus aggregate summary.
"""

from __future__ import annotations

import json
from collections import defaultdict
from pathlib import Path

DIR = Path(__file__).parent
P1_PATH = DIR / "outputs" / "p1.json"
P2_PATH = DIR / "outputs" / "p2.json"
P3_PATH = DIR / "outputs" / "p3.json"
PHON102_RESULTS = DIR.parent / "2026-05-07-phon-102-c1-eval" / "outputs" / "results-2026-05-07.jsonl"
OUT_PATH = DIR / "outputs" / "comparison.md"

# Mapping from spike PROBES to the (verb, spec_id) used in PHON-102 results.
# PHON-102 used 10 verbs × 2 specs × 8 seeds × 5 configs; we sample C4 (the
# best-performing config) for the baseline's top-3 outputs.
PROBES = [
    ("melt", "spec6"),
    ("chase", "spec1"),
    ("fill", "spec1"),
    ("cut", "spec1"),
    ("eat", "spec1"),
]


def load_phon102_baseline() -> dict[tuple[str, str], list[str]]:
    """First three distinct C4 `best` outputs per (verb, spec) from PHON-102, in file order."""
    if not PHON102_RESULTS.exists():
        print(f"warn: PHON-102 baseline not found at {PHON102_RESULTS}; baseline column will be empty")
        return {}
    by_key: dict[tuple[str, str], list[str]] = defaultdict(list)
    with PHON102_RESULTS.open() as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines so json.loads does not raise
            r = json.loads(line)
            if r.get("config") != "C4":
                continue
            key = (r["verb"], r["spec_id"])
            best = r.get("best", "")
            if best and best not in by_key[key]:
                by_key[key].append(best)
    # cap at 3 per key
    return {k: v[:3] for k, v in by_key.items()}


def load_paradigm(path: Path) -> dict[str, dict]:
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def top_sentences(payload: dict, n: int = 3) -> list[str]:
    """Extract top-N sentences from a paradigm payload."""
    if not payload:
        return []
    return [c["sentence"] for c in payload.get("top_k", [])][:n]


def main() -> None:
    p1 = load_paradigm(P1_PATH)
    p2 = load_paradigm(P2_PATH)
    p3 = load_paradigm(P3_PATH)
    baseline = load_phon102_baseline()

    lines: list[str] = [
        "# Sentence Generation Paradigms — Side-by-side Comparison",
        "",
        "Five PHON-95 acceptance probes against four paradigms. Top-3 distinct",
        "outputs per cell. Spec at `docs/superpowers/specs/2026-05-07-sentence-generation-paradigms-spike.md`.",
        "",
        "Configs:",
        "- **C1+MLM** = PHON-102's C4 config (CFG enumerator + MLM editor + bias + morph + adverbial)",
        "- **P1** = dependency-tree templating (data-only)",
        "- **P2** = lexical graph walks with Qwensim cross-conditioning (data-only)",
        "- **P3** = CSP enumeration with sum-PPMI ranking (data-only)",
        "",
    ]
    # Per-probe section
    for verb, spec_id in PROBES:
        key = f"{verb}_{spec_id}"
        baseline_outputs = baseline.get((verb, spec_id), [])
        p1_outputs = top_sentences(p1.get(key, {}))
        p2_outputs = top_sentences(p2.get(key, {}))
        p3_outputs = top_sentences(p3.get(key, {}))

        lines.append(f"## Probe: verb=`{verb}` spec=`{spec_id}`")
        lines.append("")
        lines.append("| # | C1+MLM (PHON-102 C4) | P1 dep-tree | P2 graph walk | P3 CSP |")
        lines.append("|---|----------------------|-------------|---------------|--------|")
        for i in range(3):
            row = [str(i + 1)]
            for outs in (baseline_outputs, p1_outputs, p2_outputs, p3_outputs):
                row.append(f"`{outs[i]}`" if i < len(outs) else "—")
            lines.append("| " + " | ".join(row) + " |")
        lines.append("")

    # Wallclock summary
    lines.append("## Wallclock per probe (seconds)")
    lines.append("")
    lines.append("| probe | P1 | P2 | P3 |")
    lines.append("|-------|----|----|----|")
    for verb, spec_id in PROBES:
        key = f"{verb}_{spec_id}"
        cells = []
        for d in (p1, p2, p3):
            payload = d.get(key, {})
            wc = payload.get("wall_clock_s")
            cells.append(f"{wc:.2f}" if wc is not None else "—")
        lines.append(f"| {verb}_{spec_id} | {cells[0]} | {cells[1]} | {cells[2]} |")
    lines.append("")

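    # Create outputs/ if compare.py runs before any paradigm script, and write
    # UTF-8 explicitly because the table uses a non-ASCII placeholder for empty cells.
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    OUT_PATH.write_text("\n".join(lines) + "\n", encoding="utf-8")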
    print(f"Wrote {OUT_PATH}")


if __name__ == "__main__":
    main()

[ ] Step 5.2: Run compare.py

uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/compare.py

Expected: Wrote .../comparison.md. Read the file:

cat packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/comparison.md

Verify: each probe section has a 3-row table, and all 4 columns (C1+MLM, P1, P2, P3) are populated for at least row 1.
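The row-1 check can be scripted if reading the file by eye feels error-prone. This is a rough sketch, assuming comparison.md keeps the layout compare.py writes ("## Probe:" headings, rows starting with "| 1 |", and "—" as the empty-cell placeholder); `check_comparison` is a hypothetical helper and it does not handle sentences that themselves contain a pipe character.

```python
def check_comparison(markdown: str) -> dict[str, int]:
    """Return {probe heading: number of populated paradigm cells in row 1}."""
    result: dict[str, int] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## Probe:"):
            current = line[len("## Probe:"):].strip()
        elif line.startswith("## "):
            # Any other section (e.g. the wallclock table) ends the probe scope.
            current = None
        elif current is not None and line.startswith("| 1 |"):
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            # cells[0] is the row number; the rest are the four paradigm columns.
            result[current] = sum(1 for c in cells[1:] if c and c != "—")
    return result
```

A probe whose count is below 4 has at least one paradigm (or the baseline) missing in its first row; a probe absent from the result has no row 1 at all.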

[ ] Step 5.3: Commit

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/compare.py packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/comparison.md
git commit -m "spike compare.py: side-by-side comparison + PHON-102 baseline integration

Assembles per-probe tables from paradigm 1/2/3 JSON outputs + the existing
PHON-102 C4 results JSONL (sampled as the C1+MLM baseline). Top-3 distinct
outputs per cell.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 6: notebook.md — observations + recommendation memo

Files:
- Create: packages/generation/research/2026-05-07-sentence-generation-paradigms/notebook.md

This is the human-readable lab notebook. It captures what you saw, what surprised you, and which direction to take. Structure: framing → per-paradigm observations → cross-paradigm comparison → recommendation memo.

[ ] Step 6.1: Read the comparison output

cat packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/comparison.md

Eyeball the side-by-side. Note for each probe:
- Which paradigm produced the most natural-sounding output?
- Did any paradigm produce something obviously broken?
- How different are the outputs across paradigms?

[ ] Step 6.2: Write notebook.md

# Sentence Generation Paradigms — Lab Notebook

**Date:** 2026-05-07
**Spec:** `docs/superpowers/specs/2026-05-07-sentence-generation-paradigms-spike.md`
**Plan:** `docs/superpowers/plans/2026-05-07-sentence-generation-paradigms-spike.md`

## Frame

PHON-102 measured that the existing C1 stack (CFG + MLM editor) reaches 100%
lemma compliance on a 10-verb × 2-spec × 8-seed matrix, with PMI density doing
most of the coherence work. The MLM contributes fluency-beyond-PMI but at
significant cost (1.4GB model, ~0.85s/edit, MPS/GPU dependence, stochastic
outputs, known degenerate failure modes).

This spike asks: can the data layer alone produce naturalistic, well-formed
sentences without an LM? It tests three paradigms (CSP, dep-tree templating,
graph walks) side-by-side against the C1+MLM baseline on the 5 PHON-95
acceptance probes.

## Observations

### Paradigm 3 — CSP
[Fill in after running. Examples to address:]
- Are the top-K sum-PPMI outputs naturalistic?
- How big are the domains (typically)?
- Wallclock per probe?
- Any surprising winners (e.g., low-PMI but readable pairings missed by greedy)?

### Paradigm 1 — Dependency-tree templating
[Fill in after running.]
- How many distinct skeletons did 60-100 sentences yield?
- How many docs were NP-V-NP shaped (the slot-fillable shape)?
- Did the slot-fill outputs differ from CSP's? Why or why not?
- v2 hint: did the diversity of skeletons suggest skeleton-driven slot variation would unlock more output variety?

### Paradigm 2 — Graph walks
[Fill in after running.]
- Did the median-Qwensim heuristic produce thematically coherent (nsubj, dobj) pairs?
- Were the outputs more "natural" (in the eyeball sense) than CSP's pure-PMI greedy?
- Did Qwensim coverage in edges.parquet feel sufficient for the verb set?

## Cross-paradigm comparison

[Pull excerpts from comparison.md showing 2-3 probes side-by-side. Note which paradigm produced the most natural output.]

## Memo: which direction?

Per spec §4 deliverables, answer four questions:

1. **Which paradigm produced the most natural-sounding outputs?** [your read]
2. **Which felt most extensible to paragraph generation?** [your read; CSP's decoupled-from-presentation property is one frame; tree skeletons providing real attested grammar is another]
3. **Is the data-only hypothesis vindicated for sentences?** [yes/no/partially, with evidence]
4. **What's the next step?**
   - A specific paradigm to productionize
   - A wider matrix (more verbs, more specs) of the chosen paradigm
   - Folding an LM back in for some role
   - Closing the door on data-only

Be specific. The recommendation per spec §5.6 is *not* "more research is needed."

[ ] Step 6.3: Fill in the notebook based on what the actual outputs showed (each [Fill in after running] block).

After eyeballing comparison.md, write 2-4 sentences per observation block + a clear recommendation in the memo.

[ ] Step 6.4: Commit notebook.md

git add packages/generation/research/2026-05-07-sentence-generation-paradigms/notebook.md
git commit -m "spike notebook + memo: paradigm comparison observations

Per-paradigm observations + cross-paradigm side-by-side excerpt + the
recommendation memo answering the four questions in spec §4.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 7: Final verification + plan close-out

[ ] Step 7.1: Verify acceptance criteria

Run every script once to confirm end-to-end:

uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_3_csp.py
uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_1_dep_trees.py
uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/paradigm_2_graph_walk.py
uv run python packages/generation/research/2026-05-07-sentence-generation-paradigms/compare.py

All four should complete without errors.

[ ] Step 7.2: Confirm spec §5 acceptance criteria

For each, eyeball the artifact:

ls packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/

Expected files: source_corpus.txt, p1.json, p2.json, p3.json, comparison.md.

grep -c "top_k" packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p1.json
grep -c "top_k" packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p2.json
grep -c "top_k" packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/p3.json

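Each should return ≥ 5 (one per probe): grep -c counts matching lines, and the indented JSON puts each probe's "top_k" key on its own line.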
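Because grep -c only counts lines, a probe whose top_k list is empty still passes the check above. A stricter variant is sketched below, assuming every paradigm script keys its output as f"{verb}_{spec_id}" (as paradigm_2_graph_walk.py does) and that shared.PROBES matches the list in compare.py; `missing_probes` and `check_outputs` are hypothetical helpers.

```python
import json
from pathlib import Path

# Probe keys derived from the PROBES list in compare.py (assumed to match shared.PROBES).
EXPECTED_KEYS = ["melt_spec6", "chase_spec1", "fill_spec1", "cut_spec1", "eat_spec1"]


def missing_probes(data: dict) -> list[str]:
    """Return probe keys that are absent or whose top_k list is empty."""
    return [k for k in EXPECTED_KEYS if not data.get(k, {}).get("top_k")]


def check_outputs(outputs_dir: Path) -> dict[str, list[str]]:
    """Map each paradigm output file to the probes it fails to cover."""
    report = {}
    for name in ("p1.json", "p2.json", "p3.json"):
        path = outputs_dir / name
        data = json.loads(path.read_text()) if path.exists() else {}
        report[name] = missing_probes(data)
    return report
```

An all-empty report means every paradigm produced at least one candidate for every probe; a missing file shows up as all five keys listed.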

[ ] Step 7.3: Confirm comparison.md is comprehensible

Read it and verify each probe section has 3 rows × 4 columns and the wallclock table at the end is populated.

cat packages/generation/research/2026-05-07-sentence-generation-paradigms/outputs/comparison.md

[ ] Step 7.4: Confirm notebook.md memo is decisive

Read notebook.md's memo section. Per spec §5.6, the recommendation must NOT be "more research is needed." Confirm the memo picks a direction or closes a door.

[ ] Step 7.5: Final invocation of finishing-a-development-branch skill

Per the executing-plans skill flow, after all tasks are complete:

"I'm using the finishing-a-development-branch skill to complete this work."

Use superpowers:finishing-a-development-branch to verify tests pass on the spike (running existing test suites in packages/generators/tests/ and packages/data/tests/ to confirm no regressions from any incidental changes), present integration options (PR to develop, tag, or hold), and execute the user's choice.