PHON-113 — Paragraph CSP design¶
Goal¶
Rewrite paragraph_csp.solve_paragraph on top of pair_driven.solve() so contrastive constraints (Minpair, Maxopp) fire correctly in paragraph mode, and add MultoppConstraint as a paragraph-native constraint via an (N+1)-way selectional self-join. Drop the caller-supplied verb chain. Preserve discourse-coherence machinery (shared subject, pronoun coref, agreement, discourse markers, subject variety).
Motivation¶
PHON-112 retired solve_shape's contrastive detection block. paragraph_csp._solve_sentence still calls solve_shape(constraints=...) directly, so MinpairConstraint and MaxoppConstraint in paragraph requests are now silently ignored — the solver returns unconstrained candidates. This is a real defect in the post-PHON-112 state.
MultoppConstraint was deferred from PHON-112 with the message "MultoppConstraint requires multi-sentence paragraph composition; deferred to PHON-113". The deferral was correct: multopp's semantics ("substitute vs. N targets at the same phoneme position") only make sense across sentences sharing a controlled verb-frame.
The third motivator: paragraph composition was sketched in the CSP iteration spike but never fully designed for the join-driven model. PHON-113 lands the architecture.
Architecture¶
constraints
   │
   ├─ filter words.parquet → constraint_filtered_lexicon
   │
   ├─ contrastive constraint dispatch
   │     ├─ Minpair / Maxopp: pair filler set (size 2)
   │     │     → resolve_contrastive_join → (verb, role_a, w1, role_b, w2)
   │     │     → 1 sentence carries the contrast; remaining sentences run independently
   │     │
   │     └─ Multopp: filler set (size N+1)
   │           → resolve_multopp_join (N+1-way self-join on verb+role)
   │           → (verb, role, sub, t1, ..., tN)
   │           → N+1 sentences share verb+role+discourse_subject; each uses one filler
   │
   └─ no contrastive constraint
         → independent per-sentence pair_driven.solve()
         → paragraph candidates = bounded cartesian over per-sentence top-K

→ discourse subject pick (top-N from sentence-1 unconstrained probe)
→ per-sentence realization with pronoun coref + agreement + subject variety
→ return top-K paragraph candidates ranked by sum of per-sentence scores
→ (PHON-107 reranker eventually scores paragraph-level coherence)
Paragraph coherence is not the CSP's job to optimize. The CSP produces a candidate space wide enough that a coherent paragraph exists somewhere in it; the reranker (PHON-107) selects the coherent one. The CSP keeps only the cheap-and-true coherence devices: shared discourse subject, pronoun coref, subject-verb agreement, discourse markers.
Multopp as N+1-way join¶
MultoppConstraint(substitute, targets, n_targets, position, slots) produces a filler set:
filler_set = (substitute, *targets[:n_targets])
The join shape for multopp is a self-join across N+1 selectional rows — all sharing the same (verb, role, band), each contributing one of the N+1 fillers:
import polars as pl

def resolve_multopp_join(*, filler_set, sel_df, verb_candidates, band, slots=None):
    fillers = list(filler_set)  # is_in expects a list-like, not a bare tuple
    sel_window = sel_df.filter(
        (pl.col("band") == band)
        & pl.col("filler").is_in(fillers)
        & pl.col("verb").is_in(list(verb_candidates))
    )
    # Group by (verb, role); a join row exists iff all N+1 fillers
    # have a sel row in that group.
    grouped = (
        sel_window
        .group_by(["verb", "role"])
        .agg([
            pl.col("filler").alias("fillers_present"),
            pl.col("ppmi").alias("ppmis_per_filler"),
        ])
        # sel_window only holds rows for fillers in the set, so counting the
        # distinct fillers per group is equivalent to the coverage check.
        .filter(pl.col("fillers_present").list.n_unique() == len(set(fillers)))
    )
    if slots is not None:
        grouped = grouped.filter(pl.col("role").is_in(list(slots)))
    return grouped
Each surviving row of grouped is a complete multopp paragraph spec: one verb, one role, and all N+1 fillers, each with its PPMI. The paragraph realizes as N+1 sentences sharing the (verb, role, discourse_subject) trio, each sentence substituting one filler in the locked role.
If no (verb, role) group has all N+1 fillers, the result is empty and the request is unsatisfiable, same as any over-constrained case.
API¶
def paragraph_solve(
    *,
    spec_words: frozenset[str],
    store: WordStore,                       # carries word_df, pairs_df, sel_df
    skeletons_df: pl.DataFrame,
    band: str,
    constraints: Sequence[Constraint] = (),  # immutable default; any sequence accepted
    n_sentences: int = 3,                   # ignored when multopp present (set to N+1)
    top_k_paragraphs: int = 5,
    per_sentence_top_k: int = 4,            # bounds the per-sentence candidate pool
    discourse_subject: str | None = None,   # auto-pick if None
    use_pronoun_coref: bool = True,
    locked_slots: dict[str, str] | None = None,  # None means no locked slots
) -> list[dict]:
    """Constraint-driven paragraph resolver. Each candidate is a paragraph
    dict with sentences, fillers, scores, and rendered text."""
The shape change vs paragraph_csp.solve_paragraph:
- verbs: tuple[str, ...] is dropped — verbs fall out of pair_driven.solve() per sentence (or of the multopp join for multopp paragraphs).
- n_sentences: int replaces len(spec.verbs) for non-multopp paragraphs. Default 3, user can pass higher.
- n_sentences is ignored when a MultoppConstraint is present — paragraph length is n_targets + 1 (substitute + N targets).
- top_k_paragraphs and per_sentence_top_k are separate parameters, so the paragraph search space is bounded.
Pipeline (request walk-through)¶
A request paragraph_solve(spec_words=spec1, band="fineweb_adult", constraints=[MultoppConstraint(substitute="t", targets=("s","ʃ","tʃ"), n_targets=3)], n_sentences=4, ...):
1. Constraint dispatch + lexicon filter¶
Same as PHON-112: resolve_per_slot_allow_sets + compute_verb_candidates (constrained by Exclude/Bound; verb from full lexicon ∩ constraints).
Multopp constraint detected → use the multopp branch.
2. Multopp branch¶
The join cannot consume the constraint's filler set directly. For this request the set is ("t", "s", "ʃ", "tʃ"), which are phonemes, whereas sel_df rows are keyed by words, so calling resolve_multopp_join(filler_set=("t", "s", "ʃ", "tʃ"), sel_df=sel_df, verb_candidates=verb_candidates, band=band, slots=cc.slots) would compare phonemes against word fillers. Each phoneme must first be expanded to the lexicon words that carry it at the constraint's position.
So multopp pre-resolves to one content-word set per filler phoneme:
substitute_words = {w for w in spec_words if has_phoneme(w, sub, position)}
target_words = [{w for w in spec_words if has_phoneme(w, t, position)} for t in targets]
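A minimal sketch of this expansion, under loud assumptions: PRON is a hypothetical pronunciation table, this has_phoneme is an illustrative stand-in for the real helper, and position is assumed to be one of "initial", "final", or "any" (the design does not specify its encoding).

```python
# Hypothetical pronunciation table: word -> phoneme tuple (illustrative only).
PRON = {
    "tip": ("t", "ɪ", "p"),
    "sip": ("s", "ɪ", "p"),
    "ship": ("ʃ", "ɪ", "p"),
    "chip": ("tʃ", "ɪ", "p"),
    "pit": ("p", "ɪ", "t"),
}


def has_phoneme(word, phoneme, position):
    """Assumed semantics: position is 'initial', 'final', or 'any'."""
    phones = PRON.get(word, ())
    if not phones:
        return False
    if position == "initial":
        return phones[0] == phoneme
    if position == "final":
        return phones[-1] == phoneme
    return phoneme in phones


def phoneme_buckets(spec_words, substitute, targets, position):
    """Return one word set per filler phoneme, substitute first."""
    return {
        ph: {w for w in spec_words if has_phoneme(w, ph, position)}
        for ph in (substitute, *targets)
    }


buckets = phoneme_buckets(frozenset(PRON), "t", ("s", "ʃ", "tʃ"), "initial")
```

With a fixed position each word lands in at most one bucket, which is what makes the N+1 sets disjoint.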
The join finds (verb, role) where there exists at least one word from substitute_words, one word from target_words[0], ..., one word from target_words[N-1], all with ppmi > 0 for (verb, role, band). This is more involved than minpair's pair_frame because the filler set is N+1 disjoint sets, not one pre-paired list.
Implementation: filter sel_df by filler IN union(substitute_words, *target_words), group by (verb, role), and for each group check every filler-phoneme bucket has ≥1 representative in the group.
Output: one row per (verb, role) group that satisfies coverage, with the K-best representative per filler-phoneme bucket selected by ppmi.
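The bucket-coverage check can be sketched in plain Python as follows. This is not the polars implementation: sel rows are modelled as (verb, role, band, filler, ppmi) tuples, multopp_groups is a hypothetical name, and the verb_candidates filter is omitted for brevity.

```python
from collections import defaultdict


def multopp_groups(sel_rows, buckets, band, k_best=1):
    """Return {(verb, role): {phoneme: [(word, ppmi), ...]}} for every group
    in which each phoneme bucket has at least one word with ppmi > 0.

    sel_rows: iterable of (verb, role, band, filler, ppmi) tuples.
    buckets:  {phoneme: set_of_words}, assumed disjoint.
    """
    word_to_phoneme = {w: ph for ph, words in buckets.items() for w in words}
    groups = defaultdict(lambda: defaultdict(list))
    for verb, role, row_band, filler, ppmi in sel_rows:
        if row_band != band or ppmi <= 0:
            continue
        ph = word_to_phoneme.get(filler)
        if ph is not None:
            groups[(verb, role)][ph].append((filler, ppmi))

    result = {}
    for key, by_phoneme in groups.items():
        # Coverage: every bucket must have a representative in this group.
        if set(by_phoneme) == set(buckets):
            result[key] = {
                ph: sorted(reps, key=lambda r: r[1], reverse=True)[:k_best]
                for ph, reps in by_phoneme.items()
            }
    return result


rows = [
    ("see", "dobj", "fineweb_adult", "tip", 1.2),
    ("see", "dobj", "fineweb_adult", "sip", 0.4),
    ("see", "dobj", "fineweb_adult", "ship", 2.5),
    ("see", "dobj", "fineweb_adult", "chip", 0.9),
    ("eat", "dobj", "fineweb_adult", "chip", 3.0),  # "eat" lacks s and ʃ words
    ("eat", "dobj", "fineweb_adult", "tip", 0.1),
]
toy_buckets = {"t": {"tip"}, "s": {"sip"}, "ʃ": {"ship"}, "tʃ": {"chip"}}
specs = multopp_groups(rows, toy_buckets, "fineweb_adult")
```

Only the ("see", "dobj") group survives here, because the "eat" group covers two of the four buckets.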
3. Discourse subject pick¶
For multopp, the subject is shared across all N+1 sentences. The existing _pick_discourse_subjects helper probes sentence 1 with the verb locked and returns the top-N nsubj candidates. Adapt it to call pair_driven.solve(locked_slots={"V": verb, "dobj": filler}, ...) for one filler from the multopp set (with "dobj" standing in for the role from the join row), and take the top-N nsubj results as the discourse subject candidates.
4. Per-sentence realization¶
Each of the N+1 sentences:
- Verb: locked from the multopp join row
- Role: locked from the multopp join row (e.g., "dobj")
- Filler: one of the substitute or N targets
- Discourse subject: shared across all N+1 sentences (with optional pronoun coref for sentences 2..N+1)
- Other slots: filled per-sentence by a single-sentence pair_driven.solve() invocation with locked_slots = {"V": verb, role: filler, "nsubj": subject} (or similar)
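As a rough illustration of the list above, the per-sentence locks for one multopp join row could be built like this. sentence_locks is a hypothetical helper name, and the slot keys simply follow the ones used in this document.

```python
def sentence_locks(verb, role, fillers, subject):
    """One locked_slots dict per filler; verb, role, and subject are shared."""
    return [{"V": verb, role: filler, "nsubj": subject} for filler in fillers]


locks = sentence_locks("see", "dobj", ("tip", "sip", "ship", "chip"), "dog")
```

Each dict would then be passed as locked_slots to one pair_driven.solve() call. Note that if the join row's role were itself "nsubj", the subject key would overwrite the filler in this sketch; the design does not yet say how that case is handled.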
5. Score + return¶
Paragraph score = sum of per-sentence scores. Return top_k_paragraphs paragraphs ranked by score, diversified by discourse subject.
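A sketch of the scoring and diversification step, assuming each candidate paragraph is a dict with a 'subject' and a list of 'sentence_scores'. Both the dict layout and the rank_and_diversify name are illustrative, not the existing _diversify_by_subject implementation.

```python
def rank_and_diversify(paragraphs, top_k):
    """Score each paragraph by the sum of its sentence scores, then keep the
    best paragraph per discourse subject, highest score first."""
    scored = sorted(
        paragraphs,
        key=lambda p: sum(p["sentence_scores"]),
        reverse=True,
    )
    seen, out = set(), []
    for p in scored:
        if len(out) >= top_k:
            break
        if p["subject"] in seen:
            continue  # a better paragraph with this subject is already kept
        seen.add(p["subject"])
        out.append({**p, "score": sum(p["sentence_scores"])})
    return out


candidates = [
    {"subject": "dog", "sentence_scores": [1.0, 2.0, 0.5, 1.5]},
    {"subject": "dog", "sentence_scores": [2.0, 2.0, 1.0, 1.0]},
    {"subject": "cat", "sentence_scores": [1.0, 1.0, 1.0, 1.0]},
]
best = rank_and_diversify(candidates, top_k=5)
```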
Non-multopp paragraph branch¶
For requests without a multopp constraint:
- Pick discourse subject(s): same as multopp. Probe with pair_driven.solve() once, unlocked, and take the top-N nsubj candidates.
- Per-sentence solve: for each sentence, call pair_driven.solve(locked_slots={"nsubj": subject}, constraints=...). Each call independently picks its own verb and remaining fillers.
- Minpair / Maxopp constraint, if present, fires in ONE sentence (the one carrying the contrast); other sentences run unconstrained-by-contrast. Which sentence carries the contrast is a heuristic: the first sentence by default, since the contrast word should be early in the paragraph for SLP attention.
- Pronoun coref: sentences 2..N substitute pronouns for the discourse subject if use_pronoun_coref=True.
- Subject-verb agreement: handled by the existing realize() machinery (already plural-aware via _is_plural).
- Compose: bounded cartesian over per-sentence top-K. With per_sentence_top_k=4 and n_sentences=3, that's 4³ = 64 candidate paragraphs; cap at top_k_paragraphs.
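The compose step can be sketched with the standard library. The (score, text) candidate shape and the compose_paragraphs name are assumptions made for illustration; the real per-sentence candidates are dicts returned by pair_driven.solve().

```python
import heapq
from itertools import product


def compose_paragraphs(per_sentence, top_k_paragraphs):
    """Enumerate the cartesian product of per-sentence candidates and keep
    the top_k_paragraphs combinations by summed score.

    per_sentence: one list per sentence of (score, text) candidates, each
    already truncated to per_sentence_top_k entries.
    """
    combos = product(*per_sentence)
    return heapq.nlargest(
        top_k_paragraphs,
        combos,
        key=lambda combo: sum(score for score, _ in combo),
    )


# 3 sentences x 4 candidates each: 4**3 = 64 combinations are enumerated.
per_sentence = [
    [(float(k), f"s{i}c{k}") for k in range(4)]
    for i in range(3)
]
top = compose_paragraphs(per_sentence, top_k_paragraphs=5)
```

Because heapq.nlargest consumes the product lazily, memory stays proportional to top_k_paragraphs even though all K^N combinations are still scored.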
Constraint dispatch table for paragraphs¶
| Constraint | Effect | Notes |
|---|---|---|
| ExcludeConstraint | Pre-filter lexicon (incl. verb) | Same as PHON-112 |
| IncludeConstraint | Per-word axis | Per-sentence scoring |
| BoundConstraint | Pre-filter lexicon | Same as PHON-112 |
| BoundBoostConstraint | Per-word axis | Per-sentence scoring |
| MinpairConstraint | 1 sentence carries contrast | First sentence by default |
| MaxoppConstraint | 1 sentence carries contrast | Same as Minpair |
| MultoppConstraint | N+1-way join → N+1 sentences | Locks verb+role; paragraph length = N+1 |
Multopp + Minpair/Maxopp simultaneously: rejected (raise ValueError). Two contrastive constraints over-constrain.
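A minimal sketch of the branch selection and the rejection rule. The three dataclasses are hypothetical stand-ins for the real constraint classes, reduced to the fields the sketch needs, and plan_paragraph is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinpairConstraint:  # stand-in, fields omitted
    pass


@dataclass(frozen=True)
class MaxoppConstraint:  # stand-in, fields omitted
    pass


@dataclass(frozen=True)
class MultoppConstraint:  # stand-in with only the fields used here
    substitute: str
    targets: tuple
    n_targets: int


def plan_paragraph(constraints, n_sentences=3):
    """Return (branch, paragraph_length); raise ValueError on a mix."""
    multopp = [c for c in constraints if isinstance(c, MultoppConstraint)]
    pairwise = [
        c for c in constraints
        if isinstance(c, (MinpairConstraint, MaxoppConstraint))
    ]
    if multopp and pairwise:
        raise ValueError("Multopp cannot be combined with Minpair/Maxopp")
    if multopp:
        # n_sentences is ignored: substitute + N targets.
        return "multopp", multopp[0].n_targets + 1
    if pairwise:
        return "pair", n_sentences
    return "independent", n_sentences


plan = plan_paragraph([MultoppConstraint("t", ("s", "ʃ", "tʃ"), 3)], n_sentences=3)
```

The sketch takes the first MultoppConstraint if several are passed; whether multiple multopp constraints should also be rejected is not decided in this design.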
Discourse coherence machinery (preserved from v1)¶
- Shared discourse subject (_pick_discourse_subjects): top-N nsubj candidates from probe; one paragraph per subject for diversification.
- Pronoun coref: sentences 2..N substitute "it"/"they" for the discourse subject (existing _pronoun_for helper).
- Subject-verb agreement: realize() already handles plural-aware conjugation.
- Discourse markers: keep the existing list of sentence-initial discourse markers ("Then,", "After that,", "Finally,"); apply to sentences 2..N at random per paragraph for variety.
- Subject variety: _diversify_by_subject returns top-K paragraphs with distinct discourse subjects.
These are cheap, true, and uncontroversial. The reranker handles the harder coherence questions (semantic flow, topical drift).
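To make the coref and marker devices concrete, here is a toy sketch. pronoun_for is a hypothetical stand-in for _pronoun_for, the string templates are invented for illustration (real rendering goes through realize()), and a seeded random.Random is passed in so the marker choice is reproducible.

```python
import random

MARKERS = ("Then,", "After that,", "Finally,")  # the markers listed above


def pronoun_for(subject_is_plural):
    """Hypothetical stand-in for _pronoun_for: 'they' if plural, else 'it'."""
    return "they" if subject_is_plural else "it"


def apply_coherence(subject, is_plural, predicates, rng, use_pronoun_coref=True):
    """Sentence 1 names the subject; sentences 2..N get a random discourse
    marker and, if enabled, a pronoun in place of the subject.

    predicates: one verb phrase per sentence, e.g. 'chases the ball'.
    """
    out = []
    for i, pred in enumerate(predicates):
        if i == 0:
            out.append(f"The {subject} {pred}.")
            continue
        who = pronoun_for(is_plural) if use_pronoun_coref else f"the {subject}"
        out.append(f"{rng.choice(MARKERS)} {who} {pred}.")
    return out


sentences = apply_coherence(
    "dog", False, ["chases the ball", "sits down", "eats"], random.Random(0)
)
```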
Scope¶
In scope:
- New resolve_multopp_join (in pair_driven.py)
- New paragraph_solve (or replace paragraph_csp.solve_paragraph body)
- Multopp constraint dispatch + paragraph realization
- Minpair/Maxopp paragraph integration (carried by 1 sentence)
- Preserve coherence machinery (subject, coref, agreement, markers, variety)
- Tests for paragraph behavior
Out of scope:
- Reranker (PHON-107)
- Productionization (PHON-109)
- Frontend (PHON-110)
- Verb-chain semantic coherence (deferred; reranker's job)
Migration plan¶
- Survives unchanged: pair_driven.solve(), pair_driven.resolve_contrastive_join, pair_driven.resolve_per_slot_allow_sets, pair_driven.select_host_skeletons, verb_candidates.compute_verb_candidates, skeleton_csp.realize / SkeletonShape / parse_arg_structure, _load_pairs_for_request.
- Gets retired: paragraph_csp.solve_paragraph body (replaced); ParagraphSpec.verbs field; _solve_sentence (replaced by pair_driven.solve(locked_slots=...)).
- Gets rewritten: paragraph_csp becomes a thin wrapper over pair_driven.solve() with discourse-coherence orchestration.
- Tests: paragraph_csp tests rewritten for the new architecture. test_pair_driven_solve.py extended with multopp tests (similar shape to minpair tests).
- Branch: continues feature/csp-iteration after PHON-112; no PR until PHON-109 productionization.
Risks¶
- Multopp join cost: N+1 way self-join on selectional could be expensive. Mitigation: pre-filter sel by filler IN union(substitute_words, *target_words) first; the union is typically <1K words. Group-by + cardinality check is then on a small frame. Likely <100ms.
- Cartesian explosion in non-multopp paragraphs: K^N candidate paragraphs for K per-sentence candidates × N sentences. K=4, N=3 gives 64, which is fine; K=8, N=5 gives 32,768, which is rough. Default per_sentence_top_k=4 and n_sentences=3 keeps it bounded; expose both as kwargs so callers can opt in to bigger search.
- Pronoun coref on minpair: if minpair carries a pair word in nsubj, the paragraph's discourse subject might overlap with the pair, and pronoun substitution becomes ambiguous. Disable coref for sentences whose subject = pair word; handle as edge case.
- Discourse subject = pair word: if _pick_discourse_subjects returns a word that's also in the pair, coref handling needs to know. Keep the existing logic but flag this state.
Open questions¶
- What's the default n_sentences? Spike used 3-tuple verb chains (chase/sit/eat). Default 3 is reasonable. User can pass higher.
- Where does Minpair/Maxopp's contrast land? First sentence by default; future could expose contrast_position: int = 0 to the constraint or paragraph_solve.
- Should we expose verb chains as an opt-in? A future user might want explicit verb chains for clinical scripting. Keep API minimal in v1; add verb_chain: list[str] | None = None later if demand is real.
Self-review¶
- [x] All decisions concrete: API signature, multopp join shape, constraint dispatch.
- [x] No "TBD" / placeholder language.
- [x] Internal consistency: WordStore carries pairs_df + sel_df throughout; constraint dispatch table covers all 7 constraint types; minpair/maxopp paragraph behavior consistent with PHON-112's single-sentence behavior.
- [x] Scope: paragraphs only; reranker stays out (PHON-107); productionization stays out (PHON-109).
- [x] Ambiguity: n_sentences is overridden by multopp's n_targets + 1; documented.