PHON-106: CSP contrastive scorers (minpair / maxopp / multopp)
Date: 2026-05-09
Branch: feature/csp-iteration
Scope: spike-internal — single-sentence linked-slot enumeration for MinpairConstraint and MaxoppConstraint. MultoppConstraint defined but deferred to PHON-108 paragraph integration.
Frame
The CSP spike currently has a partial contrastive scorer in constraint_surface.cross_slot_axes: minpair is a post-hoc cross-slot scorer that checks whether the candidate's already-selected slot fillers happen to form a minimal pair, and maxopp is a no-op TODO. This is structurally wrong for CSP — constraints should shape the per-slot data BEFORE enumeration, not score after.
The web app's routes/contrastive.ts has the right shape for all three intervention modes (Storkel 2022; Gierut 1989–1992):
- Minimal Pairs: precomputed minimal_pairs table in D1, queried by (p1, p2, position) — returns valid (word1, word2) rows.
- Maximal Opposition: 100 + countFeatureDiffs ranking with hasMajorClassDiff filter to pick PAIRS from a set of unknowns; word lists then delegate to minimal_pairs.
- Multiple Opposition: target-set selection by greedy distance from substitute, then word-set assembly across positions.
The data underlying these queries is built by packages/data/src/phonolex_data/pipeline/derived.py:_compute_minimal_pairs (already exists, fed only to D1 today). PHON-106 emits this data as a runtime parquet so the Python CSP can read it, defines proper constraint types, and shifts contrastive enforcement from post-hoc scoring to linked-slot CSP enumeration — where a contrast group of slots is filled jointly by selecting a row from the pair list. Realization is structural.
Goal
For single-sentence shapes with ≥ 2 nominal content slots:
- MinpairConstraint and MaxoppConstraint produce sentences in which the (nsubj, dobj) slots are guaranteed to form a valid minimal pair contrasting the requested phonemes at the requested position.
- The vectorized enumeration path handles linked-slot mode natively (no python fallback for these two).
- The contrast realization is structural (not a soft scoring axis); maxopp adds a continuous feature_distance ranking signal.
MultoppConstraint is defined and validated in v1 but errors out in single-sentence shapes; paragraph integration is PHON-108's scope.
Non-goals
- Multopp paragraph integration (deferred to PHON-108 / paragraph polish).
- Combining multiple contrastive constraints in one request (ValueError for v1).
- ContrastiveConstraint propagation into recursive _resolve_embedded_clause ccomp solves (outer constraint doesn't apply to embedded clauses).
- Web app harmonization to use the same continuous learned vectors (PHON-111, separate ticket).
- Productionization to packages/generators/csp/ (PHON-109 scope).
Constraint surface
Replace the current single ContrastiveConstraint with a union of three frozen dataclasses:
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class MinpairConstraint:
    phoneme1: str
    phoneme2: str
    position: Literal["initial", "medial", "final", "any"] = "any"
    type: Literal["contrastive_minpair"] = "contrastive_minpair"


@dataclass(frozen=True)
class MaxoppConstraint:
    """The user has selected (p1, p2) via the web's /maximal-opposition/pairs
    endpoint or equivalent. The CSP scorer rewards pair rows by continuous
    feature_distance (from the learned 27-d posterior vectors), and
    pre-filters pair rows to those satisfying min_sonorant_diff (default 0.5,
    matches the web app's hasMajorClassDiff binary threshold)."""
    phoneme1: str
    phoneme2: str
    position: Literal["initial", "medial", "final", "any"] = "any"
    min_sonorant_diff: float = 0.5
    type: Literal["contrastive_maxopp"] = "contrastive_maxopp"


@dataclass(frozen=True)
class MultoppConstraint:
    """Multiple-opposition therapy: one substitute phoneme + a set of target
    phonemes the child collapses to it. Defined in v1; integration deferred
    to PHON-108 paragraph composition (multopp set has 3-5 words; needs
    multiple sentences to distribute coherently)."""
    substitute: str
    targets: tuple[str, ...]
    position: Literal["initial", "medial", "final", "any"] = "any"
    n_targets: int = 3  # default ≥ 3 contrasts per output (frontend-overridable)
    type: Literal["contrastive_multopp"] = "contrastive_multopp"


# Existing constraint types plus the three new contrastive ones
Constraint = (
    ExcludeConstraint
    | IncludeConstraint
    | BoundConstraint
    | BoundBoostConstraint
    | MinpairConstraint
    | MaxoppConstraint
    | MultoppConstraint
)
The existing ContrastiveConstraint import is removed throughout the spike. Test fixtures and downstream callers reference the new types directly.
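As a rough illustration of what the frozen-dataclass choice buys, the sketch below re-declares two of the types as defined above and exercises hashability, defaults, and immutability. The `Position` alias and the sample phonemes are illustrative only, not part of the spike.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Literal

# Illustrative alias; the spec spells the Literal out on each field.
Position = Literal["initial", "medial", "final", "any"]


@dataclass(frozen=True)
class MinpairConstraint:
    phoneme1: str
    phoneme2: str
    position: Position = "any"
    type: Literal["contrastive_minpair"] = "contrastive_minpair"


@dataclass(frozen=True)
class MaxoppConstraint:
    phoneme1: str
    phoneme2: str
    position: Position = "any"
    min_sonorant_diff: float = 0.5
    type: Literal["contrastive_maxopp"] = "contrastive_maxopp"


minpair = MinpairConstraint("k", "g", position="initial")
maxopp = MaxoppConstraint("s", "m")

# frozen=True (with the default eq=True) generates __hash__, so equal
# constraints collapse to one entry in a set or dict key.
seen = {minpair, MinpairConstraint("k", "g", position="initial"), maxopp}

# Attribute assignment on a frozen instance raises FrozenInstanceError.
try:
    minpair.position = "final"
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Hashable, immutable constraints can be used as cache keys or deduplicated in a request without defensive copying.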
pairs.parquet schema & pipeline
New runtime artifact data/runtime/pairs.parquet (LFS-tracked, sibling to words/edges/selectional). Built from the same _compute_minimal_pairs data the D1 seed already consumes — emitted as a parquet so the Python generator can read it without a D1 round-trip.
pl.Schema({
    "word1": pl.String,
    "word2": pl.String,
    "phoneme1": pl.String,           # phoneme in word1 at the contrasting position
    "phoneme2": pl.String,           # same in word2
    "position": pl.UInt8,            # 0-indexed phoneme position
    "position_type": pl.String,      # "initial" | "medial" | "final"
    "feature_distance": pl.Float32,  # cosine or L2 distance between learned 27-d posterior vectors
    "sonorant_diff": pl.Float32,     # |vec1[sonorant_idx] - vec2[sonorant_idx]| ∈ [0, 1]
})
feature_distance and sonorant_diff are derived from packages/features/outputs/vectors.csv (the Bayesian-learned posterior means) — finer resolution than the discretized +/-/0 features the web app uses.
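A minimal sketch of the per-(phoneme1, phoneme2) lookup these two columns could come from. It assumes L2 distance (the schema leaves cosine vs. L2 open), uses toy 3-d vectors in place of the 27-d rows of vectors.csv, and treats index 0 as the sonorant dimension. `build_pair_metrics`, `VECTORS`, and `SONORANT_IDX` are hypothetical names.

```python
import math
from itertools import combinations

# Toy stand-in for vectors.csv: phoneme -> posterior-mean vector.
# The real vectors are 27-d; the values and the sonorant index are made up.
SONORANT_IDX = 0
VECTORS = {
    "k": [0.0, 0.9, 0.1],
    "g": [0.0, 0.9, 0.8],
    "m": [1.0, 0.2, 0.9],
}


def build_pair_metrics(vectors, sonorant_idx):
    """Return {(p1, p2): (feature_distance, sonorant_diff)} for both orderings."""
    table = {}
    for p1, p2 in combinations(sorted(vectors), 2):
        v1, v2 = vectors[p1], vectors[p2]
        feature_distance = math.dist(v1, v2)  # L2 distance (assumed metric)
        sonorant_diff = abs(v1[sonorant_idx] - v2[sonorant_idx])
        table[(p1, p2)] = (feature_distance, sonorant_diff)
        table[(p2, p1)] = (feature_distance, sonorant_diff)
    return table


PAIR_METRICS = build_pair_metrics(VECTORS, SONORANT_IDX)
```

Because both metrics depend only on the phoneme pair, the table is built once and each minimal-pair row does a constant-time lookup rather than recomputing a distance.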
Pipeline changes
| File | Change |
|---|---|
| packages/data/src/phonolex_data/pipeline/derived.py:_compute_minimal_pairs | Extend the returned tuples to include feature_distance and sonorant_diff per row, computed via a per-(phoneme1, phoneme2) lookup table built from vectors.csv. ~30 line addition. Existing _compute_phoneme_data already loads the vectors; reuse. |
| packages/data/src/phonolex_data/pipeline/schema.py:Derived | Existing minimal_pairs field tuple shape extends from 6 elements to 8 elements. |
| packages/data/src/phonolex_data/runtime/schema.py | Add pairs_schema() returning the Polars schema above. |
| packages/data/src/phonolex_data/runtime/emit_parquet.py | Emit data/runtime/pairs.parquet from Derived.minimal_pairs. |
| packages/data/src/phonolex_data/runtime/store.py | Extend WordStore.from_parquet() to also load pairs.parquet; expose as store.pairs_df. |
| packages/web/workers/scripts/export-to-d1.py | Read pairs from the new parquet (via WordStore) rather than from Derived.minimal_pairs directly: single source of truth, no schema drift. The two new columns (feature_distance, sonorant_diff) are NOT seeded into D1, only the legacy 6 columns. PHON-111 will harmonize the web side. |
| .gitattributes | LFS-track data/runtime/pairs.parquet. |
Estimated pair count: ~125K words × avg 2–4 minimal-pair partners ≈ 250–500K rows. ~30 MB parquet.
Linked-slot enumeration mechanics
When a MinpairConstraint or MaxoppConstraint is in the request, solve_shape switches to linked-slot mode for the contrast group:
Standard mode:
cart = nsubj_frame ⨯ V_frame ⨯ dobj_frame [⨯ advmod_frame] ← independent slots
Linked-slot mode:
pair_rows = pairs_df.filter(p1, p2, position_type, sonorant_filter)
pair_rows = pair_rows.filter(word1 ∈ filtered_spec ∧ word2 ∈ filtered_spec)
# Emit both orientations
pair_frame = (
pair_rows.select(word1.alias("nsubj"), word2.alias("dobj"), feature_distance, sonorant_diff)
.vstack(
pair_rows.select(word2.alias("nsubj"), word1.alias("dobj"), feature_distance, sonorant_diff)
)
)
cart = pair_frame ⨯ V_frame [⨯ advmod_frame]
Cardinality: the independent (nsubj × dobj) Cartesian over the ~125K-word lexicon shrinks to the pair-row count for the requested (p1, p2, position), typically 100–2000 rows before both orientations are emitted, so the reduction takes effect before reranking.
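The linked-slot construction can be sketched in plain Python, using lists of rows as a stand-in for the Polars frames. The pair rows, the `allowed_words` set, and `build_pair_frame` are illustrative assumptions; in particular, matching (p1, p2) as an unordered pair is a guess about how the filter treats phoneme order.

```python
from typing import NamedTuple


class PairRow(NamedTuple):
    word1: str
    word2: str
    phoneme1: str
    phoneme2: str
    position_type: str
    feature_distance: float
    sonorant_diff: float


# Toy stand-in for pairs_df; the distances are made up.
PAIRS = [
    PairRow("cap", "gap", "k", "g", "initial", 0.7, 0.0),
    PairRow("coat", "goat", "k", "g", "initial", 0.7, 0.0),
    PairRow("back", "bag", "k", "g", "final", 0.7, 0.0),
    PairRow("cot", "got", "k", "g", "initial", 0.7, 0.0),
]


def build_pair_frame(pairs, p1, p2, position, allowed_words, min_sonorant_diff=0.0):
    """Filter pair rows, then emit both (nsubj, dobj) orientations of each."""
    frame = []
    for row in pairs:
        if {row.phoneme1, row.phoneme2} != {p1, p2}:  # unordered match (assumption)
            continue
        if position != "any" and row.position_type != position:
            continue
        if row.sonorant_diff < min_sonorant_diff:
            continue
        # Both halves must survive the spec lexicon and hard constraints.
        if row.word1 not in allowed_words or row.word2 not in allowed_words:
            continue
        for nsubj, dobj in ((row.word1, row.word2), (row.word2, row.word1)):
            frame.append(
                {"nsubj": nsubj, "dobj": dobj, "feature_distance": row.feature_distance}
            )
    return frame


allowed = {"cap", "gap", "coat", "goat", "back", "bag"}  # "cot" and "got" are excluded
pair_frame = build_pair_frame(PAIRS, "k", "g", "initial", allowed)

# Cross the linked (nsubj, dobj) rows with the independent verb slot.
verbs = ["sees", "likes"]
cart = [(r["nsubj"], v, r["dobj"]) for r in pair_frame for v in verbs]
```

Two surviving pair rows yield four oriented rows, so the cross product with two verbs has eight candidates instead of one per (nsubj, dobj, verb) combination of the full slot domains.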
Routing: the existing _should_use_vectorized check currently sends ContrastiveConstraint to the python fallback. PHON-106 narrows that: only MultoppConstraint (deferred) forces fallback. Minpair and Maxopp run vectorized via linked-slot mode.
Contrast group selection rules:
- Shape has slots nsubj AND dobj → group = (nsubj, dobj).
- Shape has ≥ 2 content slots but no nsubj/dobj pair (e.g., nsubj-V-pobj_to) → group = first two shape.content_slots in order.
- Shape has < 2 content slots → constraint raises ValueError("contrastive needs ≥ 2 content nominal slots").
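The three rules above can be sketched as a single function. `select_contrast_group` is a hypothetical helper, and its `content_slots` argument is assumed to be the ordered nominal content slots of the shape (a stand-in for shape.content_slots).

```python
def select_contrast_group(content_slots):
    """Pick the two slots that are filled jointly from one pair row."""
    if len(content_slots) < 2:
        raise ValueError("contrastive needs ≥ 2 content nominal slots")
    # Prefer the canonical (nsubj, dobj) pairing when both are present.
    if "nsubj" in content_slots and "dobj" in content_slots:
        return ("nsubj", "dobj")
    # Otherwise fall back to the first two content slots in shape order.
    return (content_slots[0], content_slots[1])


group_transitive = select_contrast_group(["nsubj", "dobj"])
group_oblique = select_contrast_group(["nsubj", "pobj_to"])

try:
    select_contrast_group(["nsubj"])
    single_slot_error = None
except ValueError as exc:
    single_slot_error = str(exc)
```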
Scoring within linked-slot mode
| Component | Standard mode | Linked-slot mode |
|---|---|---|
| pmi_nsubj, pmi_dobj | PMI from selectional table | Same: looked up from the pair row's word1/word2 against the verb's PMI table |
| pmi_advmod, freq_xcomp, etc. | Standard | Standard; non-linked slots score independently |
| Per-word axes (include_*, bound_boost_*) | Standard | Standard; apply to the pair row's nsubj/dobj fillers like any other word |
| contrast_maxopp_<p1>_<p2> | Not present | NEW: equals feature_distance from the pair row. Default weight 1.0; weights={"contrast_maxopp_k_g": 0.0} reduces to binary minpair-style realization, while a higher weight rewards harder contrasts more. |
| contrast_minpair_<p1>_<p2> | Not present | Not added: every linked-slot candidate has the realization, so it carries no ranking signal. |
pair_rows is pre-filtered by the spec lexicon ∩ hard constraints (both halves) and (for maxopp) sonorant_diff ≥ min_sonorant_diff. Both orientations of each pair enter the cartesian; the higher-total-score orientation wins via the existing dedup-by-content-pair logic.
If pair_rows.height == 0 after filtering, solve_shape returns [].
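A hedged sketch of how the maxopp axis and the orientation dedup might interact. `score_candidates` and the candidate dict layout are hypothetical; the sketch assumes the total is a weighted sum of components with default weight 1.0, and that the dedup key is the unordered (nsubj, dobj) content pair.

```python
def score_candidates(candidates, weights=None, axis="contrast_maxopp_k_g"):
    """Add the maxopp axis, total each candidate, keep the best orientation per pair."""
    weights = weights or {}
    best = {}
    for cand in candidates:
        components = dict(cand["components"])
        components[axis] = cand["feature_distance"]  # axis value comes from the pair row
        total = sum(weights.get(name, 1.0) * value for name, value in components.items())
        key = frozenset((cand["nsubj"], cand["dobj"]))  # unordered content pair (assumption)
        if key not in best or total > best[key]["total"]:
            best[key] = {**cand, "components": components, "total": total}
    return sorted(best.values(), key=lambda c: c["total"], reverse=True)


# Two orientations of the same pair row, with made-up PMI components.
candidates = [
    {"nsubj": "goat", "dobj": "coat", "feature_distance": 0.7,
     "components": {"pmi_nsubj": 1.2, "pmi_dobj": 0.9}},
    {"nsubj": "coat", "dobj": "goat", "feature_distance": 0.7,
     "components": {"pmi_nsubj": 0.1, "pmi_dobj": 0.3}},
]

ranked = score_candidates(candidates)
binary = score_candidates(candidates, weights={"contrast_maxopp_k_g": 0.0})
```

Both orientations carry the same feature_distance, so the PMI components decide which one survives; setting the axis weight to 0.0 removes the distance from the total without changing which pairs are eligible.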
Edge cases
| Case | Behavior |
|---|---|
| MinpairConstraint + nsubj-V-dobj shape | Linked-slot mode on (nsubj, dobj). Standard. |
| MinpairConstraint + nsubj-V-xcomp shape | Single nominal content slot. ValueError("contrastive needs ≥ 2 content nominal slots"). |
| MinpairConstraint + nsubj-V (intransitive) | Same: single content slot, error. |
| MaxoppConstraint + (p1, p2) where every pair row has sonorant_diff < 0.5 | Empty after filter. solve_shape returns []. Matches web's "no valid pair" behavior. |
| Both MinpairConstraint and MaxoppConstraint in one request | ValueError("at most one contrastive constraint per request") for v1. |
| MultoppConstraint in single-sentence shape | ValueError("multopp constraint requires multi-sentence paragraph composition; not implemented in v1"). |
| ccomp recursion + outer MinpairConstraint | Outer call uses linked-slot mode for matrix (nsubj, dobj). Inner _resolve_embedded_clause solve_shape doesn't inherit the constraint; runs standard mode. |
| paragraph_csp with MinpairConstraint | Each sentence applies the constraint independently. Sentence-shapes with ≥ 2 nominal content slots get linked-slot mode; others raise. PHON-108 paragraph polish refines cross-sentence coordination. |
| paragraph_csp with MultoppConstraint | v1: error (deferred). v2 (PHON-108): distribute multopp set across sentences. |
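The error rows of the table could be enforced by one up-front check, sketched below. `validate_contrastive` and `error_of` are hypothetical helpers, the constraint classes are trimmed to the fields needed here, and the order of the checks is an assumption; only the error messages are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinpairConstraint:
    phoneme1: str
    phoneme2: str


@dataclass(frozen=True)
class MaxoppConstraint:
    phoneme1: str
    phoneme2: str
    min_sonorant_diff: float = 0.5


@dataclass(frozen=True)
class MultoppConstraint:
    substitute: str
    targets: tuple


CONTRASTIVE_TYPES = (MinpairConstraint, MaxoppConstraint, MultoppConstraint)


def validate_contrastive(constraints, n_content_slots):
    """Return the single contrastive constraint (or None), raising on the error cases."""
    contrastive = [c for c in constraints if isinstance(c, CONTRASTIVE_TYPES)]
    if not contrastive:
        return None  # standard mode, nothing to validate
    if len(contrastive) > 1:
        raise ValueError("at most one contrastive constraint per request")
    (constraint,) = contrastive
    if isinstance(constraint, MultoppConstraint):
        raise ValueError(
            "multopp constraint requires multi-sentence paragraph composition; "
            "not implemented in v1"
        )
    if n_content_slots < 2:
        raise ValueError("contrastive needs ≥ 2 content nominal slots")
    return constraint


def error_of(constraints, n_content_slots):
    """Return the ValueError message for a request, or None if it validates."""
    try:
        validate_contrastive(constraints, n_content_slots)
    except ValueError as exc:
        return str(exc)
    return None


minpair = MinpairConstraint("k", "g")
maxopp = MaxoppConstraint("s", "m")
multopp = MultoppConstraint("t", ("k", "s", "f"))
```

Failing before enumeration keeps the empty-result case (solve_shape returning []) distinct from the malformed-request cases, which raise.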
Testing
New file test_contrastive_scorers.py next to existing spike tests.
Pipeline-level:
- test_pairs_parquet_emit_includes_feature_distance — pipeline emits pairs.parquet with all 8 schema columns.
- test_pairs_parquet_count_matches_d1_minimal_pairs — pair row count matches D1's minimal_pairs row count modulo schema additions (legacy 6 cols equivalent).
Constraint surface:
- test_minpair_constraint_hashable — frozen dataclass, hashable.
- test_maxopp_constraint_default_min_sonorant_diff — default = 0.5.
Linked-slot enumeration:
- test_minpair_linked_slot_cartesian_size — Cartesian shrinks from O(domain²) to O(pair_rows).
- test_minpair_linked_slot_realization — every output candidate's (nsubj, dobj) IS a valid (p1, p2) minimal pair (verify against pairs_df).
- test_maxopp_filters_by_sonorant_diff — pair rows with sonorant_diff < 0.5 are excluded.
- test_maxopp_feature_distance_in_components — contrast_maxopp_<p1>_<p2> equals the pair row's feature_distance.
- test_orientation_swap_higher_total_wins — both orientations enter the cartesian; the dedup-by-content-pair step picks the higher-total-score orientation.
Errors:
- test_minpair_single_content_slot_errors
- test_both_minpair_and_maxopp_errors
- test_multopp_in_single_sentence_errors
Routing:
- test_minpair_uses_vectorized_path — _should_use_vectorized returns True for MinpairConstraint (was forced to fallback under the old ContrastiveConstraint).
Equivalence:
The vectorized path is the only implementation for linked-slot mode in v1: there is no python fallback, so there is no cross-path equivalence test for contrastive cases. This is a deliberate v1 simplification documented in the spec; PHON-108 may add a python-path equivalent if paragraph integration warrants it.
Open questions
None.
References
- Web app contrastive: packages/web/workers/src/routes/contrastive.ts (algorithms ported).
- Web standalone tool: packages/web/frontend/src/components/tools/ContrastiveInterventionTool.tsx (three-mode UX confirms the constraint type split).
- Pipeline: packages/data/src/phonolex_data/pipeline/derived.py:_compute_minimal_pairs (extend), _compute_phoneme_data (reuse vector lookup).
- Continuous learned vectors: packages/features/outputs/vectors.csv (40 phonemes × 27 features).
- Linguistic basis: Storkel 2022 ("Minimal, Maximal, or Multiple…"); Gierut 1989–1992 (maximal opposition).
- Forward references: PHON-108 (paragraph integration for multopp), PHON-111 (web app continuous-vector parity).