Sentence Generation Paradigms — Research Spike

Date: 2026-05-07
Status: Spec (handoff for execution)
Branch: feature/phon-102-c1-eval-spec (continues from PHON-102 results)
Ticket: to be filed (next free PHON-XXX)
Predecessor: PHON-102. The C1 track evaluation surfaced that PMI density (not the MLM) is doing the coherence work in the existing C1 stack. The MLM is expensive scaffolding around a PMI-coherent core.

This spike asks: what does sentence generation look like without a language model, given the data layer we have?


§0 — Frame

PhonoLex's data layer (after PHON-72…PHON-101) is unusually rich for an SLP/phonological tool:

- 125K phonology-tagged words × ~160 norm columns
- 5.4M (verb, role, filler, band) PPMI selectional rows
- 1.6M Qwensim association edges
- POS/lemma/morphology features

PHON-102 measured that the existing C1 stack (CFG seed enumerator + RoBERTa-large MLM iterative editor) achieves 100% lemma compliance on a 10-verb × 2-spec × 8-seed matrix, with PMI density as the dominant coherence signal. The MLM contributes fluency-beyond-PMI — but at 1.4GB model size, ~0.85s/edit, MPS-or-GPU dependence, stochastic outputs, and known degenerate failure modes (PHON-100).

Strategic question driving this spike: is the MLM earning its keep, or is the data layer sufficient for naturalistic well-formed sentence generation if we use it differently?

The user's product target (clarified 2026-05-07): given any constraint set an SLP or teacher might select, produce coherent text for a patient/student/child to read in a clinical or practice context. Adult and child contexts are both in scope. Generalizing to paragraphs is a stretch goal.

This spike is sentence-level only. Paragraphs deferred.

The user has explicitly flagged CSP framing as the architectural pattern they've always wanted to consider for this problem — generation as constraint satisfaction with rerank-and-present-as-needed. This spike treats CSP as a peer of the other two paradigms, not a preordained winner; the side-by-side comparison decides whether CSP delivers on that intuition or whether one of the other paradigms is a better fit. Per §3.3 of feedback_no_substitute_in_frame, the recommendation memo (§4) ranks assumptions and points at candidates without naming a winner before the data lands.


§1 — Three paradigms under test

CFG (the existing C1 architecture) is the implicit baseline; PHON-102 already produced its outputs. We compare against three non-CFG paradigms:

Paradigm 1 — Dependency-tree templating from real parses

Premise: the grammar of English sentences is attested in parsed corpora; we don't need to hand-write CFG rules. Mining anonymized dependency tree skeletons from real text gives us a richer structural template bank than what CFG can express.

Method:

  1. Parse a fresh small sample (50–200 sentences) of redistributable SLP-style English text with spaCy en_core_web_sm. Sources: simple Aesop fables (public domain), hand-curated SLP example sentences from public sources, and the existing PHON-95 acceptance probes themselves. No CHILDES/PhonBank/FineWeb-Edu content. All inputs are public-domain or trivial seed text.
  2. Extract per-sentence dependency tree skeletons: tuples of (head_pos, dep_label, child_pos, head_idx, child_idx). Anonymize: keep only POS + DEP, drop lexical content. The skeleton is a structural template.
  3. Dedupe skeletons by canonical hash. Expect ~20–80 distinct templates from a 200-sentence sample.
  4. For each probe (verb, spec, band), select skeletons that contain a VERB slot fillable by the locked verb. Slot-fill content nodes (NOUN, ADJ, ADV) with PMI-admit ∩ spec_lexicon, ranked by max-PMI(verb, role, filler). The selectional.parquet table covers nine roles: nsubj, dobj, iobj, xcomp, ccomp, pobj_with, pobj_on, pobj_in, pobj_to. Skeleton slots whose dep_label matches one of these get PMI-ranked fill; slots labeled amod/advmod/etc. (no PMI data) fall back to spec_lexicon ∩ POS-filter, ranked by lemma frequency.
  5. Surface realization: traverse the skeleton in dependency order, insert determiners from a small map, run lemminflect for agreement on the locked verb.
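Steps 2 and 3 can be sketched without a parser in the loop. The block below is a minimal illustration, not the spike's implementation: `Token`, `skeleton`, `skeleton_hash`, `build_bank`, and the `svo` helper are hypothetical names, and the parses are hand-built toy data. With spaCy, the `Token` fields would correspond to `token.i`, `token.text`, `token.pos_`, `token.dep_`, and `token.head.i` (spaCy's root token is its own head, which the sketch relies on).

```python
import hashlib
from typing import NamedTuple


class Token(NamedTuple):
    """One parsed token (hypothetical container mirroring parser output)."""
    idx: int    # position in the sentence
    text: str   # surface form, dropped during anonymization
    pos: str    # coarse POS tag
    dep: str    # dependency label to the head
    head: int   # index of the head token (the root points to itself)


def skeleton(tokens):
    """Return (head_pos, dep_label, child_pos, head_idx, child_idx) edges, no lexical content."""
    edges = [
        (tokens[t.head].pos, t.dep, t.pos, t.head, t.idx)
        for t in tokens
        if t.head != t.idx  # skip the root's self-loop
    ]
    # Sort so the same tree always serializes the same way.
    return tuple(sorted(edges, key=lambda e: (e[4], e[3])))


def skeleton_hash(skel):
    """Canonical hash used as the dedupe key."""
    canonical = "|".join(",".join(str(field) for field in edge) for edge in skel)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def build_bank(parsed_sentences):
    """Map hash -> skeleton, keeping the first occurrence of each distinct template."""
    bank = {}
    for tokens in parsed_sentences:
        skel = skeleton(tokens)
        bank.setdefault(skeleton_hash(skel), skel)
    return bank


def svo(subj, verb, obj):
    """Hand-built parse of 'the SUBJ VERB the OBJ' (toy data, not parser output)."""
    return [
        Token(0, "the", "DET", "det", 1),
        Token(1, subj, "NOUN", "nsubj", 2),
        Token(2, verb, "VERB", "ROOT", 2),
        Token(3, "the", "DET", "det", 4),
        Token(4, obj, "NOUN", "dobj", 2),
    ]


# Two different sentences with the same structure collapse to one template.
bank = build_bank([svo("cat", "chased", "ball"), svo("dog", "ate", "bone")])
```

Because the skeleton keeps token indices, two sentences only share a template when their word order matches as well as their tree shape; whether that is the right granularity for the template bank is a judgment call the spike would need to make.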

Why this paradigm: the CFG critique is exactly that hand-written rules don't generalize. Real parses give us a learned grammar as a side-effect of having parsed corpora. The skeletons are non-copyrightable structural abstractions of attested English.

Paradigm 2 — Lexical graph walks

Premise: if generation is "find a coherent path through the lexicon," the data already encodes the graph. Walk it.

Method:

  1. Construct a directed graph from data/runtime/{words,edges,selectional}.parquet:
     - Nodes: words (filtered to spec_lexicon).
     - Edges:
       - selectional: verb → filler, weight = PPMI(verb, role, filler, band), role-typed.
       - qwensim: word ↔ word, weight = similarity score, undirected.
  2. Walk strategy for a single sentence: start from the locked verb. Pick the highest-weight nsubj outgoing edge into the spec lexicon. From that nsubj node, pick the dobj from the top-N (e.g., N=10) highest-PMI dobj candidates whose Qwensim similarity to the chosen nsubj is closest to the median of that top-N pool. The heuristic: prefer "thematically related but not lexically near-synonymous" pairings (a kid → ball style pairing, not a kid → kids style pairing). Optionally pick a manner advmod from the verb's adverb set.
  3. Walks return tuples (verb, nsubj, dobj[, adv]). Surface realization: same lightweight wrapping as paradigm 1.
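The median-similarity pick in step 2 is the least obvious part of the walk, so here is a minimal sketch of it. The function name `pick_dobj`, the dict-based inputs, and every score are made up for illustration; treating a missing Qwensim edge as similarity 0.0 is also an assumption, not something the data layer specifies.

```python
import statistics


def pick_dobj(nsubj, dobj_ppmi, similarity, top_n=10):
    """Pick the dobj whose similarity to nsubj is closest to the top-N pool's median.

    dobj_ppmi:  filler -> PPMI(verb, "dobj", filler) for the locked verb
    similarity: frozenset({a, b}) -> Qwensim score (undirected edge)
    """
    # Top-N by PPMI; ties broken alphabetically so the result is deterministic.
    ranked = sorted((w for w in dobj_ppmi if w != nsubj),
                    key=lambda w: (-dobj_ppmi[w], w))
    pool = ranked[:top_n]
    if not pool:
        return None
    # Assumption: an absent edge counts as similarity 0.0.
    sims = {w: similarity.get(frozenset((nsubj, w)), 0.0) for w in pool}
    median = statistics.median(sims.values())
    return min(pool, key=lambda w: (abs(sims[w] - median), w))


# Toy numbers only: "kids" is a near-synonym of the subject, "cloud" is unrelated.
toy_ppmi = {"ball": 3.0, "kids": 2.8, "toy": 2.5, "rock": 2.0, "cloud": 1.0}
toy_sim = {
    frozenset(("kid", "kids")): 0.95,
    frozenset(("kid", "toy")): 0.60,
    frozenset(("kid", "ball")): 0.50,
    frozenset(("kid", "rock")): 0.20,
    frozenset(("kid", "cloud")): 0.10,
}
choice = pick_dobj("kid", toy_ppmi, toy_sim, top_n=5)
```

In this toy pool the median similarity is 0.50, so the walk lands on "ball" and skips both the near-synonym and the unrelated filler.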

Why this paradigm: slot fillers are no longer independent lookups. Cross-slot conditioning (nsubj ↔ dobj via Qwensim) is the novel lever, and the output should be less template-y than CFG or paradigm 1.

Paradigm 3 — Constraint Satisfaction (CSP)

Premise (the user's longstanding intuition): treating generation as CSP separates finding all valid sentences from ranking and presenting them. The product gets to choose how to surface options independently of how they're generated.

Method:

  1. Variables: nsubj, verb, dobj, optionally manner_adv.
  2. Domains:
     - verb: locked to user-chosen lemma.
     - nsubj: spec_lexicon ∩ pmi_admit(verb, "nsubj", band).
     - dobj: spec_lexicon ∩ pmi_admit(verb, "dobj", band).
     - manner_adv: a small curated MANNER_ADVERBS list (the existing PHON-97 list).
  3. Hard constraints:
     - nsubj != dobj
     - PMI(verb, "nsubj", nsubj) > 0
     - PMI(verb, "dobj", dobj) > 0
     - Subject-verb agreement: verb_form consistent with nsubj_number (lemminflect-derivable; CSP enforces by domain restriction or post-solve filter).
  4. Soft constraints (objective for ranking): maximize sum-PPMI(verb, role, filler) across all (verb, role, filler) triples in the assignment.
  5. Solver: brute-force enumeration over Cartesian product of domains (typical sizes 50–500 each → ≤250K candidates → ms-scale on CPU). Optional library fallback to python-constraint if domains explode.
  6. Output: the top-K assignments by sum-PPMI objective, with their full PMI scores attached. The product can rerank/sample/filter the K candidates as needed: by Δ-PMI from optimum, by lexical novelty, by clinician-curated criteria, by phonological complexity, anything. CSP returns the set of valid sentences, not a single best.
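The enumeration in steps 3 to 6 can be sketched in a few lines. This is an illustration under stated assumptions, not the spike's solver: `Assignment` and `solve` are hypothetical names, the domains arrive as plain dicts of filler → PPMI (already intersected with the spec lexicon), the PPMI values are invented, and subject-verb agreement is left to a post-solve step, which the method explicitly allows.

```python
import heapq
from itertools import product
from typing import NamedTuple, Optional


class Assignment(NamedTuple):
    score: float          # sum-PPMI objective
    nsubj: str
    verb: str
    dobj: str
    adv: Optional[str]


def solve(verb, nsubj_ppmi, dobj_ppmi, adverbs=(None,), k=3):
    """Brute-force the Cartesian product of domains and return the top-k assignments."""
    def valid_assignments():
        for (s, s_pmi), (o, o_pmi), adv in product(
            nsubj_ppmi.items(), dobj_ppmi.items(), adverbs
        ):
            if s == o:                       # hard constraint: nsubj != dobj
                continue
            if s_pmi <= 0 or o_pmi <= 0:     # hard constraint: PPMI > 0 for both roles
                continue
            yield Assignment(s_pmi + o_pmi, s, verb, o, adv)

    # Highest score first; remaining fields break ties so output order is reproducible.
    order = lambda a: (-a.score, a.nsubj, a.dobj, a.adv or "")
    return heapq.nsmallest(k, valid_assignments(), key=order)


# Toy domains with invented PPMI values for the locked verb "eat".
top = solve(
    "eat",
    nsubj_ppmi={"dog": 2.0, "cat": 1.5, "bone": 0.1},
    dobj_ppmi={"bone": 3.0, "ball": 1.0, "dog": 0.5},
)
for a in top:
    print(a.score, a.nsubj, a.verb, a.dobj)
```

Returning scored `Assignment` tuples rather than strings keeps the rerank/sample/filter step independent of generation: the caller can re-sort the same candidates by any other criterion before realizing them as text.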

Why this paradigm: it's the cleanest decoupling. Generation produces a candidate set; presentation chooses how to use it. Reproducible by construction. Auditable. Composable with future constraints (add a phonotactic constraint, add a register constraint, etc., without rewriting the generator).


§2 — Probes and comparison

Probes: the 5 PHON-95 acceptance seeds, for direct comparability with PHON-102 results:

| # | verb | spec | seed sentence (CFG-MLM baseline) |
| --- | --- | --- | --- |
| 1 | melt | spec6 | "the puppy melt the baby" |
| 2 | chase | spec1 | "the cat chased the ball" |
| 3 | fill | spec1 | "the snow filled the cup" |
| 4 | cut | spec1 | "the coleslaw cut the control" |
| 5 | eat (was: ate) | spec1 | "the dog ate the bone" |

For each probe, each paradigm generates 5–8 candidate sentences under the same constraint set (spec_lexicon + verb_lock + band=fineweb_adult).

Side-by-side output: a markdown table per probe with columns:

- C1+MLM (top-3 from PHON-102's results JSONL)
- Paradigm 1 (dep-tree)
- Paradigm 2 (graph walk)
- Paradigm 3 (CSP top-3)
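One possible shape for the per-probe table is sketched below. The spec does not fix the JSON schema, so the input here (a dict of column name → list of sentences) is an assumption, `comparison_table` is a hypothetical helper, and the cell contents are placeholders rather than real outputs.

```python
def comparison_table(probe, columns, rows=3):
    """Render one probe's candidates as a markdown table, one column per paradigm."""
    headers = list(columns)
    lines = [
        f"Probe: {probe}",
        "",
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for i in range(rows):
        # Paradigms that produced fewer candidates get empty cells.
        cells = [columns[h][i] if i < len(columns[h]) else "" for h in headers]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = comparison_table(
    "chase / spec1",
    {
        "C1+MLM": ["baseline 1", "baseline 2", "baseline 3"],
        "Paradigm 1 (dep-tree)": ["p1 candidate 1", "p1 candidate 2"],
        "Paradigm 2 (graph walk)": ["p2 candidate 1"],
        "Paradigm 3 (CSP top-3)": ["p3 candidate 1", "p3 candidate 2", "p3 candidate 3"],
    },
)
print(table)
```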

Plus aggregate metrics:

- Wallclock per generated sentence
- Distinct outputs
- Compliance (lemma-aware, per PHON-102 metric)
- Eyeball notes (you and me)
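The two mechanical metrics (wallclock and distinct outputs) could be captured with a small wrapper like the one below. This is a sketch: `run_with_metrics` is a hypothetical helper, and it assumes each paradigm exposes a callable that takes a probe and returns an iterable of sentences. Compliance is not computed here because it depends on the PHON-102 metric.

```python
import time


def run_with_metrics(generate, probe):
    """Run one paradigm on one probe and report count, distinct count, and seconds per sentence."""
    start = time.perf_counter()
    sentences = list(generate(probe))
    elapsed = time.perf_counter() - start
    return {
        "sentences": sentences,
        "n": len(sentences),
        "distinct": len(set(sentences)),  # exact string match; no normalization
        "seconds_per_sentence": elapsed / len(sentences) if sentences else None,
    }
```

Since the acceptance criteria only ask for order-of-magnitude timing, a single `perf_counter` measurement per probe is probably enough; repeated runs would matter only if two paradigms land within the same order of magnitude.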

Decision criteria (eyeball):

- Naturalness: does it read like English a person might say?
- Readability: would a patient/student be able to read it aloud?
- Diversity per probe
- Compliance (already structurally guaranteed for paradigms 1–3 by construction)
- Architectural appeal: which paradigm feels most extensible to paragraphs and to richer constraint vocabularies?


§3 — Methodology details

§3.1 — Source-text policy for paradigm 1

Use: public-domain text only — Aesop's fables (Project Gutenberg), trivially-licensed SLP example sentences, the PHON-95 seeds themselves. Hand-typed if needed; any 100–200 attested English sentences will do for a 50-template skeleton bank.

Do not use: CHILDES, PhonBank, FineWeb-Edu, or any other licensed corpus. Per feedback_consistent_license_standard and feedback_distinguish_direct_vs_trained — even though tree skeletons are anonymized structural abstractions, we keep this spike clean of any potentially-restricted source.

§3.2 — PMI lookup interface

Reuse phonolex_generators.editor.pmi_bias.make_pmi_bias_fn and phonolex_generators.cfg_seed.argstruc_enumerator.pmi_admit for parity with the C1 stack. Both operate on selectional.parquet directly via Polars.

§3.3 — Surface realization

All three paradigms share a small surface realizer:

- Insert "the" before NOUN heads (skip mass nouns by simple POS+countability heuristic).
- Inflect the verb via lemminflect for subject-verb agreement.
- Capitalize first letter; append period.

This is a 30-line utility, not a separate paradigm.
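A minimal sketch of such a realizer follows. To stay dependency-free it swaps lemminflect for a naive regular-verb rule (`naive_third_person`), which is a stand-in and will be wrong for irregular verbs; the `mass_nouns` set is likewise a hypothetical placeholder for the POS+countability heuristic, and the subject is assumed to be singular.

```python
def naive_third_person(lemma):
    """Stand-in for lemminflect: regular third-person singular present only."""
    if lemma.endswith(("s", "sh", "ch", "x", "z")):
        return lemma + "es"
    if lemma.endswith("y") and len(lemma) > 1 and lemma[-2] not in "aeiou":
        return lemma[:-1] + "ies"
    return lemma + "s"


def realize(nsubj, verb, dobj, adv=None, mass_nouns=frozenset(), inflect=naive_third_person):
    """Turn a (nsubj, verb, dobj[, adv]) tuple into a capitalized, period-terminated sentence."""
    def noun_phrase(noun):
        # Skip the determiner for nouns flagged as mass nouns.
        return noun if noun in mass_nouns else f"the {noun}"

    words = [noun_phrase(nsubj), inflect(verb), noun_phrase(dobj)]
    if adv:
        words.append(adv)
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."


print(realize("dog", "chase", "ball"))
print(realize("snow", "fill", "cup", mass_nouns={"snow"}))
```

Passing the inflector in as a parameter means the lemminflect-backed version can replace the stand-in without touching the rest of the function.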

§3.4 — Determinism

All three paradigms must be deterministic. Seed any RNG with a fixed value if used. No torch sampling, no temperature.
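If any paradigm does need randomness (for example, sampling from the CSP candidate set), one way to keep it reproducible is sketched below. The helper name and default seed are illustrative. The two points it demonstrates are using a local `random.Random` instance instead of the global RNG, and sorting the input first, since iteration order over a Python set of strings can differ between processes.

```python
import random


def sample_candidates(candidates, k, seed=0):
    """Deterministically sample up to k candidates, independent of input ordering."""
    rng = random.Random(seed)   # local RNG; global random state is left untouched
    pool = sorted(candidates)   # fix the order before sampling
    return rng.sample(pool, min(k, len(pool)))
```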


§4 — Deliverables

Location: packages/generation/research/2026-05-07-sentence-generation-paradigms/

Files:

- notebook.md: lab notebook (framing, per-paradigm observations, side-by-side comparison, recommendation memo)
- paradigm_1_dep_trees.py: runs paradigm 1 across the 5 probes, writes outputs/p1.json
- paradigm_2_graph_walk.py: paradigm 2, writes outputs/p2.json
- paradigm_3_csp.py: paradigm 3, writes outputs/p3.json
- compare.py: assembles side-by-side comparison table from p1/p2/p3 JSON + the existing PHON-102 results JSONL
- outputs/source_corpus.txt: the 100–200 public-domain sentences used by paradigm 1 (so the artifact is self-contained)

Memo (in notebook.md): 1–2 paragraphs answering:

- Which paradigm produced the most natural-sounding outputs?
- Which paradigm felt most extensible to paragraph generation?
- Is the data-only hypothesis vindicated for sentences? (Compliance + naturalness with no MLM.)
- What's the next step? (Pick a paradigm and productionize. Or run a wider matrix. Or fold an LM back in for some role.)

Out of scope:

- Paragraph-level generation (deferred until we pick a sentence paradigm)
- Productionization, including FastAPI route work, Worker proxying, frontend integration
- Calibration sweeps within paradigms
- LLM-as-judge or external review (see PHON-102 user note: rapid iteration first; external review reserved for "when we think we have something")
- Any LM (n-gram, MLM, decoder LM); the spike's premise is non-LM


§5 — Acceptance criteria

The spike is "done" when:

  1. All three paradigm scripts run end-to-end on the 5 probes.
  2. outputs/{p1,p2,p3}.json exist with 5–8 candidate sentences per probe.
  3. compare.py produces a complete side-by-side markdown table including the C1+MLM baseline.
  4. notebook.md has the memo answering the four questions in §4.
  5. Wallclock per paradigm captured (rough — order-of-magnitude is enough).
  6. The recommendation in the memo is not "more research is needed." It picks a direction or explicitly closes the door on data-only.

§6 — Risks and open questions

  1. Paradigm 1 corpus too small. 100–200 sentences may not yield enough distinct dependency tree skeletons to cover the 5 probes. Mitigation: extend the seed corpus on the fly if needed, or fall back to spaCy-parsed hand-typed exemplars. Not a blocker.
  2. Paradigm 2 surface realization is the weak point. Walks return tuples, not trees. Lightweight wrapping may produce ungrammatical output. Mitigation: keep walks short (verb + 2-3 fillers), use the same surface realizer as paradigms 1 and 3.
  3. CSP combinatorial explosion. If a probe has nsubj domain = 1798 (spec1's full noun list), brute-force over 1798 × 1798 = 3.2M candidates is still ms-scale, but if we add 4+ variables it could blow up. Mitigation: cap variables at 4; use library python-constraint only if needed.
  4. One-day deliverable assumption. Spike is sized for 3–4 hours. If paradigm 1's mining or paradigm 2's graph construction takes longer than expected, paradigm 3 (the user's most-wanted) could get squeezed. Mitigation: implement paradigm 3 first, then 1, then 2. CSP is the must-keep.
  5. Comparison subjectivity. "Naturalness" is eyeballed. Two-person eyeball (you and me) is the spike-iteration agreement; we can fold in LLM-judge or clinician review later if a paradigm advances.
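The candidate-count arithmetic behind risk 3 is easy to check directly. In the sketch below, the brute-force budget `MAX_BRUTE_FORCE` is a hypothetical number chosen for illustration, not a measured threshold, and the third full-size domain is a what-if rather than a planned variable.

```python
from math import prod

MAX_BRUTE_FORCE = 5_000_000  # hypothetical budget, not a measured threshold


def candidate_count(domain_sizes):
    """Size of the Cartesian product the brute-force solver would enumerate."""
    return prod(domain_sizes)


def brute_force_ok(domain_sizes, limit=MAX_BRUTE_FORCE):
    """True when enumeration stays within the budget; otherwise consider python-constraint."""
    return candidate_count(domain_sizes) <= limit


two_vars = candidate_count([1798, 1798])          # nsubj × dobj over spec1's full noun list
three_vars = candidate_count([1798, 1798, 1798])  # what-if: a third full-size noun slot
print(f"{two_vars:,} vs {three_vars:,}")
```

Two full-size noun domains give 3,232,804 candidates; a third pushes the count to 5,812,581,592, which is where the cap on variables or the library fallback would come in.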

§7 — Plan handoff

The successor implementation plan should cover:

  1. Bootstrap research dir packages/generation/research/2026-05-07-sentence-generation-paradigms/ with skeleton files and outputs/ subdir.
  2. Implement paradigm 3 (CSP) first, as the user's priority. Brute-force enumeration over (nsubj, dobj) × verb × adv. Top-K by sum-PPMI.
  3. Implement paradigm 1 (dep-tree) — parse a small public-domain corpus, mine skeletons, slot-fill. Surface realize.
  4. Implement paradigm 2 (graph walk) — construct the graph from runtime parquets, walk, surface realize.
  5. Write compare.py — assemble the side-by-side table including PHON-102 baseline.
  6. Write notebook.md — observations + memo per §4.
  7. File follow-up tickets for whichever paradigm survives the comparison.

Wallclock budget: 3–4 hours for all of the above.