Sentence Generation Paradigms — Research Spike
Date: 2026-05-07
Status: Spec — handoff for execution
Branch: feature/phon-102-c1-eval-spec (continues from PHON-102 results)
Ticket: to be filed (next free PHON-XXX)
Predecessor: PHON-102 — C1 track evaluation surfaced that PMI density (not the MLM) is doing the coherence work in the existing C1 stack. The MLM is expensive scaffolding around a PMI-coherent core. This spike asks: what does sentence generation look like without a language model, given the data layer we have?
§0 — Frame
PhonoLex's data layer (after PHON-72…PHON-101) is unusually rich for an SLP/phonological tool:
- 125K phonology-tagged words × ~160 norm columns
- 5.4M (verb, role, filler, band) PPMI selectional rows
- 1.6M Qwensim association edges
- POS/lemma/morphology features
PHON-102 measured that the existing C1 stack (CFG seed enumerator + RoBERTa-large MLM iterative editor) achieves 100% lemma compliance on a 10-verb × 2-spec × 8-seed matrix, with PMI density as the dominant coherence signal. The MLM contributes fluency beyond PMI, but at a cost: a 1.4GB model, ~0.85s/edit, MPS-or-GPU dependence, stochastic outputs, and known degenerate failure modes (PHON-100).
Strategic question driving this spike: is the MLM earning its keep, or is the data layer sufficient for naturalistic well-formed sentence generation if we use it differently?
The user's product target (clarified 2026-05-07): given any constraint set an SLP or teacher might select, produce coherent text for a patient/student/child to read in a clinical or practice context, in both adult and child settings. Generalizing to paragraphs is a stretch goal.
This spike is sentence-level only. Paragraphs deferred.
The user has explicitly flagged CSP framing as the architectural pattern they've always wanted to consider for this problem — generation as constraint satisfaction with rerank-and-present-as-needed. This spike treats CSP as a peer of the other two paradigms, not a preordained winner; the side-by-side comparison decides whether CSP delivers on that intuition or whether one of the other paradigms is a better fit. Per §3.3 of feedback_no_substitute_in_frame, the recommendation memo (§4) ranks assumptions and points at candidates without naming a winner before the data lands.
§1 — Three paradigms under test
CFG (the existing C1 architecture) is the implicit baseline; PHON-102 already produced its outputs. We compare against three non-CFG paradigms:
Paradigm 1 — Dependency-tree templating from real parses
Premise: the grammar of English sentences is attested in parsed corpora; we don't need to hand-write CFG rules. Mining anonymized dependency tree skeletons from real text gives us a richer structural template bank than what CFG can express.
Method:
1. Parse a fresh small sample (50–200 sentences) of redistributable SLP-style English text with spaCy en_core_web_sm. Sources: simple Aesop fables (public domain), hand-curated SLP example sentences from public sources, and the existing PHON-95 acceptance probes themselves. No CHILDES/PhonBank/FineWeb-Edu content. All inputs are public-domain or trivial seed text.
2. Extract per-sentence dependency tree skeletons: tuples of (head_pos, dep_label, child_pos, head_idx, child_idx). Anonymize: keep only POS + DEP, drop lexical content. The skeleton is a structural template.
3. Dedupe skeletons by canonical hash. Expect ~20–80 distinct templates from a 200-sentence sample.
4. For each probe (verb, spec, band), select skeletons that contain a VERB slot fillable by the locked verb. Slot-fill content nodes (NOUN, ADJ, ADV) with PMI-admit ∩ spec_lexicon, ranked by max-PMI(verb, role, filler). The selectional.parquet table covers nine roles: nsubj, dobj, iobj, xcomp, ccomp, pobj_with, pobj_on, pobj_in, pobj_to. Skeleton slots whose dep_label matches one of these get PMI-ranked fill; slots labeled amod/advmod/etc. (no PMI data) fall back to spec_lexicon ∩ POS-filter, ranked by lemma frequency.
5. Surface realization: traverse the skeleton in dependency order, insert determiners from a small map, run lemminflect for agreement on the locked verb.
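Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, not the spike's implementation: the `Tok` record, the function names, and the choice of SHA-1 over a pipe-joined string as the "canonical hash" are all assumptions. The parses are hand-written stand-ins; with spaCy the same fields would come from `token.pos_`, `token.dep_`, and `token.head.i`.

```python
import hashlib
from typing import NamedTuple


class Tok(NamedTuple):
    """One parsed token (hypothetical record, filled by hand here)."""
    text: str
    pos: str
    dep: str
    head: int  # index of the head token; the root points at itself


def skeleton(sent):
    """Return (head_pos, dep_label, child_pos, head_idx, child_idx) tuples.

    Lexical content is dropped; only POS, DEP, and positions survive.
    """
    edges = [
        (sent[tok.head].pos, tok.dep, tok.pos, tok.head, idx)
        for idx, tok in enumerate(sent)
        if tok.head != idx  # skip the root's self-loop
    ]
    return tuple(sorted(edges, key=lambda e: e[4]))


def skeleton_hash(skel):
    """Assumed canonical form: edge tuples in child order, joined, then SHA-1."""
    canonical = "|".join(",".join(str(field) for field in edge) for edge in skel)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def dedupe(sents):
    """Map skeleton hash -> skeleton, keeping the first skeleton seen per hash."""
    bank = {}
    for sent in sents:
        skel = skeleton(sent)
        bank.setdefault(skeleton_hash(skel), skel)
    return bank


# Hand-written parses (assumed, not real parser output) for three sentences.
CAT = [Tok("the", "DET", "det", 1), Tok("cat", "NOUN", "nsubj", 2),
       Tok("chased", "VERB", "ROOT", 2), Tok("the", "DET", "det", 4),
       Tok("ball", "NOUN", "dobj", 2)]
DOG = [Tok("the", "DET", "det", 1), Tok("dog", "NOUN", "nsubj", 2),
       Tok("ate", "VERB", "ROOT", 2), Tok("the", "DET", "det", 4),
       Tok("bone", "NOUN", "dobj", 2)]
SNOW = [Tok("snow", "NOUN", "nsubj", 1), Tok("melts", "VERB", "ROOT", 1)]

bank = dedupe([CAT, DOG, SNOW])
print(len(bank))  # CAT and DOG collapse into a single template
```

Because head and child indices are part of the tuple, two sentences share a template only when they have the same length and attachment pattern. Whether that is the right granularity for a 20–80 template bank is an open design choice.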
Why this paradigm: the core critique of CFG is that hand-written rules don't generalize. Real parses give us a learned grammar as a side effect of having parsed corpora. The skeletons are non-copyrightable structural abstractions of attested English.
Paradigm 2 — Lexical graph walks
Premise: if generation is "find a coherent path through the lexicon," the data already encodes the graph. Walk it.
Method:
1. Construct a directed graph from data/runtime/{words,edges,selectional}.parquet:
- Nodes: words (filtered to spec_lexicon).
- Edges:
- selectional: verb → filler, weight = PPMI(verb, role, filler, band), role-typed.
- qwensim: word ↔ word, weight = similarity score, undirected.
2. Walk strategy for a single sentence: start from the locked verb. Pick the highest-weight nsubj outgoing edge into the spec lexicon. From that nsubj node, pick the dobj from the top-N (e.g., N=10) highest-PMI dobj candidates whose Qwensim similarity to the chosen nsubj is closest to the median of that top-N pool — heuristic: prefer "thematically related but not lexically near-synonymous" pairings (a kid → ball style pairing, not a kid → kids style pairing). Optionally pick a manner advmod from the verb's adverb set.
3. Walks return tuples (verb, nsubj, dobj[, adv]). Surface realization: same lightweight wrapping as paradigm 1.
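The walk in step 2 can be sketched like this. It is an illustration under assumptions: the graph is reduced to plain dicts (per-role PPMI keyed by filler, Qwensim keyed by an unordered word pair), the function names are hypothetical, ties are broken by PPMI and then alphabetically, and every score is a made-up toy value.

```python
from statistics import median


def pick_dobj(nsubj, dobj_ppmi, sim, top_n=10):
    """Among the top_n highest-PPMI dobj candidates, return the one whose
    similarity to nsubj is closest to the median similarity of that pool."""
    pool = sorted(
        (w for w in dobj_ppmi if w != nsubj),
        key=lambda w: (-dobj_ppmi[w], w),
    )[:top_n]
    if not pool:
        return None
    sims = {w: sim.get(frozenset((nsubj, w)), 0.0) for w in pool}
    target = median(sims.values())
    # Tie-break on higher PPMI, then alphabetically, to stay deterministic.
    return min(pool, key=lambda w: (abs(sims[w] - target), -dobj_ppmi[w], w))


def walk(verb, nsubj_ppmi, dobj_ppmi, sim, top_n=10):
    """Return (verb, nsubj, dobj), or None if either slot has no candidate."""
    if not nsubj_ppmi:
        return None
    nsubj = min(nsubj_ppmi, key=lambda w: (-nsubj_ppmi[w], w))
    dobj = pick_dobj(nsubj, dobj_ppmi, sim, top_n)
    return None if dobj is None else (verb, nsubj, dobj)


# Toy scores, invented for illustration only.
NSUBJ = {"kid": 3.1, "dog": 2.9}
DOBJ = {"ball": 2.5, "kids": 1.9, "car": 1.2}
SIM = {
    frozenset(("kid", "ball")): 0.40,
    frozenset(("kid", "kids")): 0.95,  # near-synonym: too similar
    frozenset(("kid", "car")): 0.10,   # unrelated: too dissimilar
}

print(walk("chase", NSUBJ, DOBJ, SIM))
```

With these toy values the median rule skips both the near-synonym and the unrelated word and lands on the middle candidate, which is the behavior the heuristic is meant to produce.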
Why this paradigm: slot fillers are no longer chosen by independent lookups. Cross-slot conditioning (nsubj ↔ dobj via Qwensim) is the novel lever, and the output should be less template-y than CFG or paradigm 1.
Paradigm 3 — Constraint Satisfaction (CSP)
Premise (the user's longstanding intuition): treating generation as CSP separates finding all valid sentences from ranking and presenting them. The product gets to choose how to surface options independently of how they're generated.
Method:
1. Variables: nsubj, verb, dobj, optionally manner_adv.
2. Domains:
- verb: locked to user-chosen lemma.
- nsubj: spec_lexicon ∩ pmi_admit(verb, "nsubj", band).
- dobj: spec_lexicon ∩ pmi_admit(verb, "dobj", band).
- manner_adv: a small curated MANNER_ADVERBS list (the existing PHON-97 list).
3. Hard constraints:
- nsubj != dobj
- PMI(verb, "nsubj", nsubj) > 0
- PMI(verb, "dobj", dobj) > 0
- Subject-verb agreement: verb_form consistent with nsubj_number (lemminflect-derivable; CSP enforces by domain restriction or post-solve filter).
4. Soft constraints (objective for ranking): maximize sum-PPMI(verb, role, filler) across all (verb, role, filler) triples in the assignment.
5. Solver: brute-force enumeration over the Cartesian product of domains (typical nsubj and dobj domain sizes of 50–500 each → ≤250K (nsubj, dobj) candidates → ms-scale on CPU). Optional library fallback to python-constraint if domains explode.
6. Output: the top-K assignments by sum-PPMI objective, with their full PMI scores attached. The product can rerank/sample/filter the K candidates as needed: by Δ-PMI from optimum, by lexical novelty, by clinician-curated criteria, by phonological complexity, anything. CSP returns the set of valid sentences, not a single best.
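A minimal sketch of the brute-force solver described above, restricted to the nsubj and dobj variables. Assumptions: domains arrive as filler → PPMI dicts already intersected with spec_lexicon, the function name and output shape are hypothetical, ties are broken alphabetically, and agreement and manner_adv are left out. Scores are toy values.

```python
import heapq
from itertools import product


def solve_csp(verb, nsubj_ppmi, dobj_ppmi, k=5):
    """Enumerate every (nsubj, dobj) pair, drop pairs that violate a hard
    constraint, and return the top-k by summed PPMI with scores attached."""
    valid = []
    for (subj, s_score), (obj, o_score) in product(
        sorted(nsubj_ppmi.items()), sorted(dobj_ppmi.items())
    ):
        if subj == obj:                     # hard: nsubj != dobj
            continue
        if s_score <= 0 or o_score <= 0:    # hard: both PMI scores > 0
            continue
        valid.append((s_score + o_score, subj, obj, s_score, o_score))
    # Highest sum first; alphabetical tie-break keeps the ranking stable.
    best = heapq.nsmallest(k, valid, key=lambda c: (-c[0], c[1], c[2]))
    return [
        {"verb": verb, "nsubj": subj, "dobj": obj,
         "sum_ppmi": total, "ppmi": {"nsubj": s, "dobj": o}}
        for total, subj, obj, s, o in best
    ]


# Toy domains, invented for illustration only.
NSUBJ_DOMAIN = {"cat": 3.0, "dog": 2.5, "ball": 0.5}
DOBJ_DOMAIN = {"ball": 2.0, "cat": 1.0, "mouse": 2.5, "rock": 0.0}

for row in solve_csp("chase", NSUBJ_DOMAIN, DOBJ_DOMAIN, k=3):
    print(row["nsubj"], row["dobj"], row["sum_ppmi"])
```

Returning per-role scores alongside the sum is what lets a downstream reranker apply its own criteria without re-querying the PMI table.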
Why this paradigm: it's the cleanest decoupling. Generation produces a candidate set; presentation chooses how to use it. Reproducible by construction. Auditable. Composable with future constraints (add a phonotactic constraint, add a register constraint, etc., without rewriting the generator).
§2 — Probes and comparison
Probes: the 5 PHON-95 acceptance seeds, for direct comparability with PHON-102 results:
| # | verb | spec | seed sentence (CFG-MLM baseline) |
|---|---|---|---|
| 1 | melt | spec6 | "the puppy melt the baby" |
| 2 | chase | spec1 | "the cat chased the ball" |
| 3 | fill | spec1 | "the snow filled the cup" |
| 4 | cut | spec1 | "the coleslaw cut the control" |
| 5 | eat (was: ate) | spec1 | "the dog ate the bone" |
For each probe, each paradigm generates 5–8 candidate sentences under the same constraint set (spec_lexicon + verb_lock + band=fineweb_adult).
Side-by-side output: a markdown table per probe with columns:
- C1+MLM (top-3 from PHON-102's results JSONL)
- Paradigm 1 (dep-tree)
- Paradigm 2 (graph walk)
- Paradigm 3 (CSP top-3)
Plus aggregate metrics:
- Wallclock per generated sentence
- Distinct outputs
- Compliance (lemma-aware, per PHON-102 metric)
- Eyeball notes (you and me)
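One way the per-probe table and the distinct-outputs metric could be assembled is sketched below. The function names and the column → sentences input shape are assumptions, not the layout of the real p1/p2/p3 JSON files, and the cell strings are placeholders rather than results.

```python
def render_probe_table(columns):
    """Render {column header: [sentences]} as a markdown table.

    Shorter columns are padded with empty cells so every row is complete.
    """
    headers = list(columns)
    depth = max((len(cells) for cells in columns.values()), default=0)
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "---|" * len(headers),
    ]
    for row in range(depth):
        cells = [
            columns[h][row] if row < len(columns[h]) else ""
            for h in headers
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


def distinct_outputs(sentences):
    """Count unique sentences, ignoring case and surrounding whitespace."""
    return len({s.strip().lower() for s in sentences})


# Placeholder cells only; no real paradigm output is shown here.
probe = {
    "C1+MLM": ["(baseline 1)", "(baseline 2)", "(baseline 3)"],
    "Paradigm 1 (dep-tree)": ["(p1 candidate 1)", "(p1 candidate 2)"],
    "Paradigm 2 (graph walk)": ["(p2 candidate 1)"],
    "Paradigm 3 (CSP top-3)": ["(p3 candidate 1)", "(p3 candidate 2)", "(p3 candidate 3)"],
}
print(render_probe_table(probe))
```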
Decision criteria (eyeball):
- Naturalness: does it read like English a person might say?
- Readability: would a patient/student be able to read it aloud?
- Diversity per probe
- Compliance (already structurally guaranteed for paradigms 1–3 by construction)
- Architectural appeal: which paradigm feels most extensible to paragraphs and to richer constraint vocabularies?
§3 — Methodology details
§3.1 — Source-text policy for paradigm 1
Use: public-domain text only — Aesop's fables (Project Gutenberg), trivially-licensed SLP example sentences, the PHON-95 seeds themselves. Hand-typed if needed; any 100–200 attested English sentences will do for a 50-template skeleton bank.
Do not use: CHILDES, PhonBank, FineWeb-Edu, or any other licensed corpus. Per feedback_consistent_license_standard and feedback_distinguish_direct_vs_trained — even though tree skeletons are anonymized structural abstractions, we keep this spike clean of any potentially-restricted source.
§3.2 — PMI lookup interface
Reuse phonolex_generators.editor.pmi_bias.make_pmi_bias_fn and phonolex_generators.cfg_seed.argstruc_enumerator.pmi_admit for parity with the C1 stack. Both operate on selectional.parquet directly via Polars.
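For readers without the C1 code at hand, the lookup these helpers provide amounts to roughly the following. This is a toy stand-in over in-memory rows, not the signature or behavior of the real `pmi_admit` or `make_pmi_bias_fn`. The function name is hypothetical, and the column names (`verb`, `role`, `filler`, `band`, `ppmi`) are assumed from the (verb, role, filler, band) PPMI row description in §0.

```python
def toy_pmi_admit(rows, verb, role, band, spec_lexicon):
    """Return (filler, ppmi) pairs for one (verb, role, band), limited to
    spec_lexicon and positive PPMI, sorted best-first."""
    scored = {}
    for row in rows:
        if (row["verb"], row["role"], row["band"]) != (verb, role, band):
            continue
        if row["filler"] not in spec_lexicon or row["ppmi"] <= 0:
            continue
        # Keep the best score if a filler appears more than once.
        scored[row["filler"]] = max(row["ppmi"], scored.get(row["filler"], 0.0))
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy rows, invented for illustration only.
ROWS = [
    {"verb": "chase", "role": "dobj", "filler": "ball", "band": "fineweb_adult", "ppmi": 2.5},
    {"verb": "chase", "role": "dobj", "filler": "mouse", "band": "fineweb_adult", "ppmi": 2.8},
    {"verb": "chase", "role": "dobj", "filler": "dream", "band": "fineweb_adult", "ppmi": 3.0},
    {"verb": "chase", "role": "nsubj", "filler": "cat", "band": "fineweb_adult", "ppmi": 3.1},
    {"verb": "eat", "role": "dobj", "filler": "ball", "band": "fineweb_adult", "ppmi": 0.2},
]
SPEC = {"ball", "mouse", "cat"}

print(toy_pmi_admit(ROWS, "chase", "dobj", "fineweb_adult", SPEC))
```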
§3.3 — Surface realization
All three paradigms share a small surface realizer:
- Insert the before NOUN heads (skip mass nouns by simple POS+countability heuristic).
- Inflect the verb via lemminflect for subject-verb agreement.
- Capitalize first letter; append period.
This is a 30-line utility, not a separate paradigm.
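A sketch of what such a utility might look like, under stated assumptions: subjects are singular, tense is present, the mass-noun set is a placeholder for the POS+countability heuristic, and `third_singular` is a naive stand-in so the example runs without dependencies. In the spike, the `inflect` argument would be a lemminflect-backed callable instead.

```python
# Placeholder list; the spec calls for a POS+countability heuristic instead.
MASS_NOUNS = frozenset({"snow", "water", "milk", "sand"})


def third_singular(lemma):
    """Naive present-tense third-person-singular rule (stand-in only)."""
    if lemma.endswith(("s", "sh", "ch", "x", "z", "o")):
        return lemma + "es"
    if lemma.endswith("y") and len(lemma) > 1 and lemma[-2] not in "aeiou":
        return lemma[:-1] + "ies"
    return lemma + "s"


def noun_phrase(noun):
    """Prefix 'the' unless the noun is treated as a mass noun."""
    return noun if noun in MASS_NOUNS else f"the {noun}"


def realize(verb, nsubj, dobj, adv=None, inflect=third_singular):
    """Turn a (verb, nsubj, dobj[, adv]) tuple into a sentence string."""
    words = [noun_phrase(nsubj), inflect(verb), noun_phrase(dobj)]
    if adv:
        words.append(adv)
    text = " ".join(words)
    # Capitalize the first letter and append a period.
    return text[0].upper() + text[1:] + "."


print(realize("chase", "cat", "ball"))
print(realize("fill", "snow", "cup"))
```

Passing the inflector in as a parameter keeps the realizer testable without the inflection library and makes the swap a one-line change.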
§3.4 — Determinism
All three paradigms must be deterministic. Seed any RNG with a fixed value if used. No torch sampling, no temperature.
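A small sketch of the two habits this implies in Python, with a hypothetical function name and an arbitrary seed value. The first is a local seeded RNG rather than the global one. The second is sorting before sampling, since iteration order over a set of strings can differ between interpreter runs when hash randomization is on.

```python
import random

SEED = 20260507  # arbitrary fixed value (assumption)


def sample_candidates(candidates, k, seed=SEED):
    """Sample up to k unique candidates, reproducibly across runs."""
    rng = random.Random(seed)       # local RNG; global state is untouched
    pool = sorted(set(candidates))  # fix the order before sampling
    return rng.sample(pool, min(k, len(pool)))


first = sample_candidates(["b", "a", "c", "a"], 2)
second = sample_candidates(["c", "a", "b"], 2)
print(first == second)  # input order and duplicates do not change the result
```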
§4 — Deliverables
Location: packages/generation/research/2026-05-07-sentence-generation-paradigms/
Files:
- notebook.md — lab notebook: framing, per-paradigm observations, side-by-side comparison, recommendation memo
- paradigm_1_dep_trees.py — runs paradigm 1 across the 5 probes, writes outputs/p1.json
- paradigm_2_graph_walk.py — paradigm 2, writes outputs/p2.json
- paradigm_3_csp.py — paradigm 3, writes outputs/p3.json
- compare.py — assembles side-by-side comparison table from p1/p2/p3 JSON + the existing PHON-102 results JSONL
- outputs/source_corpus.txt — the 100–200 public-domain sentences used by paradigm 1 (so the artifact is self-contained)
Memo (in notebook.md): 1–2 paragraphs answering:
- Which paradigm produced the most natural-sounding outputs?
- Which paradigm felt most extensible to paragraph generation?
- Is the data-only hypothesis vindicated for sentences? (Compliance + naturalness with no MLM.)
- What's the next step? (Pick a paradigm and productionize. Or run a wider matrix. Or fold an LM back in for some role.)
Out of scope:
- Paragraph-level generation (deferred until we pick a sentence paradigm)
- Productionization, including FastAPI route work, Worker proxying, frontend integration
- Calibration sweeps within paradigms
- LLM-as-judge or external review (see PHON-102 user note: rapid iteration first; external review reserved for "when we think we have something")
- Any LM (n-gram, MLM, decoder LM); the spike's premise is non-LM
§5 — Acceptance criteria
The spike is "done" when:
- All three paradigm scripts run end-to-end on the 5 probes.
- outputs/{p1,p2,p3}.json exist with 5–8 candidate sentences per probe.
- compare.py produces a complete side-by-side markdown table including the C1+MLM baseline.
- notebook.md has the memo answering the four questions in §4.
- Wallclock per paradigm captured (rough; order-of-magnitude is enough).
- The recommendation in the memo is not "more research is needed." It picks a direction or explicitly closes the door on data-only.
§6 — Risks and open questions
- Paradigm 1 corpus too small. 100–200 sentences may not yield enough distinct dependency tree skeletons to cover the 5 probes. Mitigation: extend the seed corpus on the fly if needed, or fall back to spaCy-parsed hand-typed exemplars. Not a blocker.
- Paradigm 2 surface realization is the weak point. Walks return tuples, not trees. Lightweight wrapping may produce ungrammatical output. Mitigation: keep walks short (verb + 2-3 fillers), use the same surface realizer as paradigms 1 and 3.
- CSP combinatorial explosion. If a probe has nsubj domain = 1798 (spec1's full noun list), brute-force over 1798 × 1798 = 3.2M candidates is still ms-scale, but if we add 4+ variables it could blow up. Mitigation: cap variables at 4; use library python-constraint only if needed.
- One-day deliverable assumption. Spike is sized for 3–4 hours. If paradigm 1's mining or paradigm 2's graph construction takes longer than expected, paradigm 3 (the user's most-wanted) could get squeezed. Mitigation: implement paradigm 3 first, then 1, then 2. CSP is the must-keep.
- Comparison subjectivity. "Naturalness" is eyeballed. Two-person eyeball (you and me) is the spike-iteration agreement; we can fold in LLM-judge or clinician review later if a paradigm advances.
§7 — Plan handoff
The successor implementation plan should cover:
- Bootstrap research dir: packages/generation/research/2026-05-07-sentence-generation-paradigms/ with skeleton files and outputs/ subdir.
- Implement paradigm 3 (CSP) first: the user's priority. Brute-force enumeration over (nsubj, dobj) × verb × adv. Top-K by sum-PMI.
- Implement paradigm 1 (dep-tree): parse a small public-domain corpus, mine skeletons, slot-fill. Surface realize.
- Implement paradigm 2 (graph walk): construct the graph from runtime parquets, walk, surface realize.
- Write compare.py: assemble the side-by-side table including PHON-102 baseline.
- Write notebook.md: observations + memo per §4.
- File follow-up tickets for whichever paradigm survives the comparison.
Wallclock budget: 3–4 hours for all of the above.