PHON-106 — CSP contrastive scorers (minpair / maxopp / multopp)

Date: 2026-05-09
Branch: feature/csp-iteration
Scope: spike-internal. Single-sentence linked-slot enumeration for MinpairConstraint and MaxoppConstraint. MultoppConstraint is defined but deferred to PHON-108 paragraph integration.

Frame

The CSP spike currently has a partial contrastive scorer in constraint_surface.cross_slot_axes: minpair is a post-hoc cross-slot scorer that checks whether the candidate's already-selected slot fillers happen to form a minimal pair, and maxopp is a no-op TODO. This is structurally wrong for CSP — constraints should shape the per-slot data BEFORE enumeration, not score after.

The web app's routes/contrastive.ts has the right shape for all three intervention modes (Storkel 2022; Gierut 1989–1992):

  • Minimal Pairs: precomputed minimal_pairs table in D1, queried by (p1, p2, position); returns valid (word1, word2) rows.
  • Maximal Opposition: 100 + countFeatureDiffs ranking with hasMajorClassDiff filter to pick PAIRS from a set of unknowns; word lists then delegate to minimal_pairs.
  • Multiple Opposition: target-set selection by greedy distance from substitute, then word-set assembly across positions.

The data underlying these queries is built by packages/data/src/phonolex_data/pipeline/derived.py:_compute_minimal_pairs (already exists, fed only to D1 today). PHON-106 emits this data as a runtime parquet so the Python CSP can read it, defines proper constraint types, and shifts contrastive enforcement from post-hoc scoring to linked-slot CSP enumeration — where a contrast group of slots is filled jointly by selecting a row from the pair list. Realization is structural.

Goal

For single-sentence shapes with ≥ 2 nominal content slots:

  • MinpairConstraint and MaxoppConstraint produce sentences in which the (nsubj, dobj) slots are guaranteed to form a valid minimal pair contrasting the requested phonemes at the requested position.
  • The vectorized enumeration path handles linked-slot mode natively (no python fallback for these two).
  • The contrast realization is structural (not a soft scoring axis); maxopp adds a continuous feature_distance ranking signal.

MultoppConstraint is defined and validated in v1 but errors out in single-sentence shapes; paragraph integration is PHON-108's scope.

Non-goals

  • Multopp paragraph integration (deferred to PHON-108 / paragraph polish).
  • Combining multiple contrastive constraints in one request (ValueError for v1).
  • ContrastiveConstraint propagation into recursive _resolve_embedded_clause ccomp solves (outer constraint doesn't apply to embedded clauses).
  • Web app harmonization to use the same continuous learned vectors (PHON-111 — separate ticket).
  • Productionization to packages/generators/csp/ (PHON-109 scope).

Constraint surface

Replace the current single ContrastiveConstraint with a union of three frozen dataclasses:

from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class MinpairConstraint:
    phoneme1: str
    phoneme2: str
    position: Literal["initial", "medial", "final", "any"] = "any"
    type: Literal["contrastive_minpair"] = "contrastive_minpair"


@dataclass(frozen=True)
class MaxoppConstraint:
    """The user has selected (p1, p2) via the web's /maximal-opposition/pairs
    endpoint or equivalent. The CSP scorer rewards pair rows by continuous
    feature_distance (from the learned 27-d posterior vectors), and pre-
    filters pair rows to those satisfying min_sonorant_diff (default 0.5,
    matches the web app's hasMajorClassDiff binary threshold)."""
    phoneme1: str
    phoneme2: str
    position: Literal["initial", "medial", "final", "any"] = "any"
    min_sonorant_diff: float = 0.5
    type: Literal["contrastive_maxopp"] = "contrastive_maxopp"


@dataclass(frozen=True)
class MultoppConstraint:
    """Multiple-opposition therapy: one substitute phoneme + a set of target
    phonemes the child collapses to it. Defined in v1; integration deferred
    to PHON-108 paragraph composition (multopp set has 3-5 words; needs
    multiple sentences to distribute coherently)."""
    substitute: str
    targets: tuple[str, ...]
    position: Literal["initial", "medial", "final", "any"] = "any"
    n_targets: int = 3   # default ≥ 3 contrasts per output (frontend-overridable)
    type: Literal["contrastive_multopp"] = "contrastive_multopp"


Constraint = (
    ExcludeConstraint
    | IncludeConstraint
    | BoundConstraint
    | BoundBoostConstraint
    | MinpairConstraint
    | MaxoppConstraint
    | MultoppConstraint
)

The existing ContrastiveConstraint import is removed throughout the spike. Test fixtures and downstream callers reference the new types directly.
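As a quick illustration of what the frozen-dataclass design buys, the sketch below repeats two of the constraint definitions and exercises them. The `Position` alias and the `is_frozen` helper are introduced here for illustration only and are not part of the spike.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Literal

# Hypothetical alias, used here only to shorten the field annotations.
Position = Literal["initial", "medial", "final", "any"]


@dataclass(frozen=True)
class MinpairConstraint:
    phoneme1: str
    phoneme2: str
    position: Position = "any"
    type: Literal["contrastive_minpair"] = "contrastive_minpair"


@dataclass(frozen=True)
class MaxoppConstraint:
    phoneme1: str
    phoneme2: str
    position: Position = "any"
    min_sonorant_diff: float = 0.5
    type: Literal["contrastive_maxopp"] = "contrastive_maxopp"


minpair = MinpairConstraint("k", "g", position="initial")
maxopp = MaxoppConstraint("k", "g")

# frozen=True together with the default eq=True generates __hash__, so equal
# constraints collapse in a set and can be used as cache keys.
unique = {minpair, MinpairConstraint("k", "g", position="initial"), maxopp}

# The `type` tag lets callers discriminate union members by value.
tags = sorted(c.type for c in unique)


def is_frozen(constraint) -> bool:
    """Hypothetical helper: True if attribute assignment is rejected."""
    try:
        constraint.position = "final"
    except FrozenInstanceError:
        return True
    return False
```

Hashability is what test_minpair_constraint_hashable checks; the literal `type` tag keeps each member of the Constraint union distinguishable after serialization.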

pairs.parquet schema & pipeline

New runtime artifact data/runtime/pairs.parquet (LFS-tracked, sibling to words/edges/selectional). Built from the same _compute_minimal_pairs data the D1 seed already consumes — emitted as a parquet so the Python generator can read it without a D1 round-trip.

import polars as pl

pl.Schema({
    "word1":            pl.String,
    "word2":            pl.String,
    "phoneme1":         pl.String,    # phoneme in word1 at the contrasting position
    "phoneme2":         pl.String,    # same in word2
    "position":         pl.UInt8,     # 0-indexed phoneme position
    "position_type":    pl.String,    # "initial" | "medial" | "final"
    "feature_distance": pl.Float32,   # cosine or L2 distance between learned 27-d posterior vectors
    "sonorant_diff":    pl.Float32,   # |vec1[sonorant_idx] - vec2[sonorant_idx]| ∈ [0, 1]
})

feature_distance and sonorant_diff are derived from packages/features/outputs/vectors.csv (the Bayesian-learned posterior means) — finer resolution than the discretized +/-/0 features the web app uses.
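The two derived columns can be sketched in plain Python as follows. This is only an illustration: the spec leaves the metric open ("cosine or L2"), so both are shown, and `SONORANT_IDX` plus the 3-d toy vectors are assumptions standing in for the real 27-d rows of vectors.csv.

```python
import math

# Assumption: position of the sonorant feature inside each vector. The real
# column index in vectors.csv is not stated in this spec.
SONORANT_IDX = 0


def l2_distance(v1, v2):
    """Euclidean distance between two equal-length feature vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def cosine_distance(v1, v2):
    """1 - cosine similarity; undefined when either vector is all zeros."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def sonorant_diff(v1, v2, idx=SONORANT_IDX):
    """Absolute difference on the sonorant dimension."""
    return abs(v1[idx] - v2[idx])


# Made-up 3-d vectors standing in for the 27-d posterior means.
vec_k = [0.0, 1.0, 0.5]
vec_m = [1.0, 1.0, 0.5]
```

With posterior means bounded in [0, 1], sonorant_diff stays in [0, 1] as the schema comment states, while an L2 feature_distance is not bounded by 1.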

Pipeline changes

| File | Change |
| --- | --- |
| packages/data/src/phonolex_data/pipeline/derived.py:_compute_minimal_pairs | Extend the returned tuples to include feature_distance and sonorant_diff per row, computed via a per-(phoneme1, phoneme2) lookup table built from vectors.csv. ~30 line addition. Existing _compute_phoneme_data already loads the vectors; reuse. |
| packages/data/src/phonolex_data/pipeline/schema.py:Derived | Existing minimal_pairs field tuple shape extends from 6 elements to 8 elements. |
| packages/data/src/phonolex_data/runtime/schema.py | Add pairs_schema() returning the Polars schema above. |
| packages/data/src/phonolex_data/runtime/emit_parquet.py | Emit data/runtime/pairs.parquet from Derived.minimal_pairs. |
| packages/data/src/phonolex_data/runtime/store.py | Extend WordStore.from_parquet() to also load pairs.parquet; expose as store.pairs_df. |
| packages/web/workers/scripts/export-to-d1.py | Read pairs from the new parquet (via WordStore) rather than from Derived.minimal_pairs directly, so there is a single source of truth and no schema drift. The two new columns (feature_distance, sonorant_diff) are NOT seeded into D1; only the legacy 6 columns are. PHON-111 will harmonize the web side. |
| .gitattributes | LFS-track data/runtime/pairs.parquet. |
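The per-(phoneme1, phoneme2) lookup and the 6-to-8 tuple extension might look roughly like the sketch below. The function names are hypothetical, the legacy tuple order (word1, word2, phoneme1, phoneme2, position, position_type) is assumed from the parquet schema, and L2 is used only as a placeholder metric.

```python
import math
from itertools import combinations


def build_pair_lookup(vectors, sonorant_idx):
    """Map each unordered phoneme pair to (feature_distance, sonorant_diff).

    `vectors` maps phoneme -> sequence of floats. L2 distance is a
    placeholder; the spec leaves the metric open.
    """
    lookup = {}
    for p1, p2 in combinations(sorted(vectors), 2):
        v1, v2 = vectors[p1], vectors[p2]
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
        lookup[frozenset((p1, p2))] = (distance, abs(v1[sonorant_idx] - v2[sonorant_idx]))
    return lookup


def extend_rows(rows, lookup):
    """Append the two new columns to each legacy 6-tuple.

    Assumed order: (word1, word2, phoneme1, phoneme2, position, position_type).
    """
    return [row + lookup[frozenset((row[2], row[3]))] for row in rows]


# Toy 2-d vectors; index 0 plays the role of the sonorant feature.
toy_vectors = {"k": (0.0, 1.0), "b": (0.0, 0.0), "m": (1.0, 0.0)}
toy_lookup = build_pair_lookup(toy_vectors, sonorant_idx=0)
toy_rows = extend_rows([("cat", "bat", "k", "b", 0, "initial")], toy_lookup)
```

Keying the lookup on an unordered pair means the distance is computed once per phoneme pair rather than once per word pair, which is the point of building the table up front.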

Estimated pair count: ~125K words × an average of 2–4 minimal-pair partners each gives roughly 300–500K rows, or about 30 MB of parquet.

Linked-slot enumeration mechanics

When a MinpairConstraint or MaxoppConstraint is in the request, solve_shape switches to linked-slot mode for the contrast group:

Standard mode:
    cart = nsubj_frame ⨯ V_frame ⨯ dobj_frame [⨯ advmod_frame]   ← independent slots

Linked-slot mode:
    pair_rows = pairs_df.filter(p1, p2, position_type, sonorant_filter)
    pair_rows = pair_rows.filter(word1 ∈ filtered_spec ∧ word2 ∈ filtered_spec)
    # Emit both orientations
    pair_frame = (
        pair_rows.select(word1.alias("nsubj"), word2.alias("dobj"), feature_distance, sonorant_diff)
        .vstack(
            pair_rows.select(word2.alias("nsubj"), word1.alias("dobj"), feature_distance, sonorant_diff)
        )
    )
    cart = pair_frame ⨯ V_frame [⨯ advmod_frame]
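The linked-slot schematic above can be made concrete with a small pure-Python sketch. It uses lists of dicts rather than Polars frames so that it runs without dependencies. The function names are hypothetical, and matching the requested phonemes in either stored order is an assumption about how pairs.parquet orients its rows.

```python
from itertools import product


def linked_pair_frame(pair_rows, p1, p2, position_type, allowed_words,
                      min_sonorant_diff=None):
    """Filter pair rows, then emit both (nsubj, dobj) orientations."""
    frame = []
    for row in pair_rows:
        # Assumption: a row may store the contrast as (p1, p2) or (p2, p1).
        if {row["phoneme1"], row["phoneme2"]} != {p1, p2}:
            continue
        if position_type != "any" and row["position_type"] != position_type:
            continue
        # Only maxopp passes a threshold; minpair leaves it as None.
        if min_sonorant_diff is not None and row["sonorant_diff"] < min_sonorant_diff:
            continue
        # Both halves must survive the spec lexicon and hard constraints.
        if row["word1"] not in allowed_words or row["word2"] not in allowed_words:
            continue
        for nsubj, dobj in ((row["word1"], row["word2"]), (row["word2"], row["word1"])):
            frame.append({"nsubj": nsubj, "dobj": dobj,
                          "feature_distance": row["feature_distance"]})
    return frame


def enumerate_candidates(pair_frame, verbs):
    """Cross the linked pair frame with the independent verb slot."""
    return [{**pair, "V": verb} for pair, verb in product(pair_frame, verbs)]


def make_row(w1, w2, pos_type):
    # Toy /k/ vs /g/ rows; the distances are invented for illustration.
    return {"word1": w1, "word2": w2, "phoneme1": "k", "phoneme2": "g",
            "position_type": pos_type, "feature_distance": 0.2, "sonorant_diff": 0.0}


toy_pairs = [make_row("coat", "goat", "initial"),
             make_row("cap", "gap", "initial"),
             make_row("back", "bag", "final")]
toy_frame = linked_pair_frame(toy_pairs, "k", "g", "initial",
                              allowed_words={"coat", "goat", "back", "bag"})
toy_candidates = enumerate_candidates(toy_frame, ["saw", "found"])
```

In this toy run only coat/goat survives: cap/gap fails the lexicon filter and back/bag fails the position filter, so the frame holds two oriented rows and the cross with two verbs yields four candidates.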

Cardinality: the ~125K independent (nsubj × dobj) Cartesian shrinks to the pair-row count for the requested (p1, p2, position), typically 100–2000 rows, so the candidate set is already smaller before reranking starts.

Routing: the existing _should_use_vectorized check currently sends ContrastiveConstraint to the python fallback. PHON-106 narrows that: only MultoppConstraint (deferred) forces fallback. Minpair and Maxopp run vectorized via linked-slot mode.

Contrast group selection rules:

  • Shape has slots nsubj AND dobj → group = (nsubj, dobj).
  • Shape has ≥ 2 content slots but no nsubj/dobj pair (e.g., nsubj-V-pobj_to) → group = first two shape.content_slots in order.
  • Shape has < 2 content slots → constraint raises ValueError("contrastive needs ≥ 2 content nominal slots").
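The selection rules above fit in a few lines. The sketch below is illustrative: `select_contrast_group` is a hypothetical name, and its input is assumed to be the shape's nominal content slots in order (the verb and non-nominal slots such as xcomp already excluded).

```python
def select_contrast_group(content_slots):
    """Pick the two slots that a contrastive pair row fills jointly.

    `content_slots` is assumed to list only nominal content slots, in
    shape order.
    """
    if len(content_slots) < 2:
        raise ValueError("contrastive needs ≥ 2 content nominal slots")
    if "nsubj" in content_slots and "dobj" in content_slots:
        return ("nsubj", "dobj")
    # No nsubj/dobj pair: fall back to the first two content slots.
    return (content_slots[0], content_slots[1])
```

For example, `select_contrast_group(["nsubj", "pobj_to"])` returns `("nsubj", "pobj_to")`, while a single-slot shape raises the ValueError listed in the edge cases.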

Scoring within linked-slot mode

| Component | Standard mode | Linked-slot mode |
| --- | --- | --- |
| pmi_nsubj, pmi_dobj | PMI from selectional table | Same: looked up from the pair row's word1/word2 against the verb's PMI table |
| pmi_advmod, freq_xcomp, etc. | Standard | Standard: non-linked slots score independently |
| Per-word axes (include_*, bound_boost_*) | Standard | Standard: apply to the pair row's nsubj/dobj fillers like any other word |
| contrast_maxopp_<p1>_<p2> | Not present | NEW: equals feature_distance from the pair row. Default weight 1.0; weights={"contrast_maxopp_k_g": 0.0} reduces to binary minpair-style realization, and a higher weight rewards harder contrasts more. |
| contrast_minpair_<p1>_<p2> | Not present | Not added: every linked-slot candidate has the realization, so it carries no ranking signal. |
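To make the weighting behaviour of contrast_maxopp_<p1>_<p2> concrete, here is a minimal sketch of a weighted component sum. `total_score` is a hypothetical name, and treating every unlisted component as weight 1.0 is an assumption (the spec states that default only for the maxopp component).

```python
def total_score(components, weights=None):
    """Weighted sum of score components.

    Assumption: any component missing from `weights` gets weight 1.0.
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in components.items())


# Invented component values for one linked-slot candidate.
components = {"pmi_nsubj": 1.5, "pmi_dobj": 0.5, "contrast_maxopp_k_g": 0.25}

default_total = total_score(components)
binary_total = total_score(components, {"contrast_maxopp_k_g": 0.0})
boosted_total = total_score(components, {"contrast_maxopp_k_g": 2.0})
```

With the weight at 0.0 the feature distance drops out, so candidates are ranked as under minpair; raising it above 1.0 lets larger feature distances dominate the PMI terms.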

pair_rows is pre-filtered by the spec lexicon ∩ hard constraints (both halves) and (for maxopp) sonorant_diff ≥ min_sonorant_diff. Both orientations of each pair enter the cartesian; the higher-total-score orientation wins via the existing dedup-by-content-pair logic.
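The orientation rule can be pictured as a keep-the-best pass keyed on the unordered filler pair. The existing dedup-by-content-pair logic is not shown in this spec, so the sketch below is a guess at its shape: the function name, the `total` field, and the inclusion of the verb in the key are all assumptions.

```python
def dedup_by_content_pair(candidates):
    """Keep the highest-total candidate per unordered (nsubj, dobj) pair.

    Assumption: the verb is part of the dedup key, so the same word pair
    can still appear once per verb.
    """
    best = {}
    for cand in candidates:
        key = (frozenset((cand["nsubj"], cand["dobj"])), cand["V"])
        if key not in best or cand["total"] > best[key]["total"]:
            best[key] = cand
    return sorted(best.values(), key=lambda c: c["total"], reverse=True)


# Two orientations of the same pair with invented totals.
toy = [
    {"nsubj": "goat", "dobj": "coat", "V": "chewed", "total": 3.0},
    {"nsubj": "coat", "dobj": "goat", "V": "chewed", "total": 1.5},
]
kept = dedup_by_content_pair(toy)
```

Here the "goat chewed coat" orientation wins because its total is higher, which is the behaviour test_orientation_swap_higher_total_wins is meant to pin down.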

If pair_rows.height == 0 after filtering, solve_shape returns [].

Edge cases

| Case | Behavior |
| --- | --- |
| MinpairConstraint + nsubj-V-dobj shape | Linked-slot mode on (nsubj, dobj). Standard. |
| MinpairConstraint + nsubj-V-xcomp shape | Single nominal content slot. ValueError("contrastive needs ≥ 2 content nominal slots"). |
| MinpairConstraint + nsubj-V (intransitive) | Same: single content slot, error. |
| MaxoppConstraint + (p1, p2) where every pair row has sonorant_diff < 0.5 | Empty after filter. solve_shape returns []. Matches web's "no valid pair" behavior. |
| Both MinpairConstraint and MaxoppConstraint in one request | ValueError("at most one contrastive constraint per request") for v1. |
| MultoppConstraint in single-sentence shape | ValueError("multopp constraint requires multi-sentence paragraph composition; not implemented in v1"). |
| ccomp recursion + outer MinpairConstraint | Outer call uses linked-slot mode for matrix (nsubj, dobj). Inner _resolve_embedded_clause solve_shape doesn't inherit the constraint; runs standard mode. |
| paragraph_csp with MinpairConstraint | Each sentence applies the constraint independently. Sentence-shapes with ≥ 2 nominal content slots get linked-slot mode; others raise. PHON-108 paragraph polish refines cross-sentence coordination. |
| paragraph_csp with MultoppConstraint | v1: error (deferred). v2 (PHON-108): distribute multopp set across sentences. |
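The request-level error cases in the table could be checked up front, before any enumeration. The sketch below works on the constraints' `type` tags; `pick_contrastive` is a hypothetical name and the order of the two checks is an assumption.

```python
MULTI_ERROR = "at most one contrastive constraint per request"
MULTOPP_ERROR = (
    "multopp constraint requires multi-sentence paragraph composition; "
    "not implemented in v1"
)


def pick_contrastive(type_tags):
    """Return the single contrastive type tag in a request, or None.

    `type_tags` holds the `type` field of every constraint in the request.
    """
    contrastive = [tag for tag in type_tags if tag.startswith("contrastive_")]
    if len(contrastive) > 1:
        raise ValueError(MULTI_ERROR)
    if contrastive == ["contrastive_multopp"]:
        # v1 rejects multopp in single sentences and in paragraph_csp alike.
        raise ValueError(MULTOPP_ERROR)
    return contrastive[0] if contrastive else None
```

A request with only a minpair constraint returns "contrastive_minpair", which the caller can use to switch solve_shape into linked-slot mode; requests without a contrastive tag return None and stay in standard mode.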

Testing

New file test_contrastive_scorers.py next to existing spike tests.

Pipeline-level:

  • test_pairs_parquet_emit_includes_feature_distance: pipeline emits pairs.parquet with all 8 schema columns.
  • test_pairs_parquet_count_matches_d1_minimal_pairs: pair row count matches D1's minimal_pairs row count modulo schema additions (legacy 6 cols equivalent).

Constraint surface:

  • test_minpair_constraint_hashable: frozen dataclass, hashable.
  • test_maxopp_constraint_default_min_sonorant_diff: default = 0.5.

Linked-slot enumeration:

  • test_minpair_linked_slot_cartesian_size: Cartesian shrinks from O(domain²) to O(pair_rows).
  • test_minpair_linked_slot_realization: every output candidate's (nsubj, dobj) IS a valid (p1, p2) minimal pair (verify against pairs_df).
  • test_maxopp_filters_by_sonorant_diff: pair rows with sonorant_diff < 0.5 are excluded.
  • test_maxopp_feature_distance_in_components: contrast_maxopp_<p1>_<p2> equals the pair row's feature_distance.
  • test_orientation_swap_higher_total_wins: both orientations enter the cartesian; the dedup-by-content-pair step picks the higher-total-score orientation.

Errors:

  • test_minpair_single_content_slot_errors
  • test_both_minpair_and_maxopp_errors
  • test_multopp_in_single_sentence_errors

Routing:

  • test_minpair_uses_vectorized_path: _should_use_vectorized returns True for MinpairConstraint (was forced to fallback under the old ContrastiveConstraint).

Equivalence:

The vectorized path is the only implementation of linked-slot mode in v1. There is no python fallback, so there is no cross-path equivalence test for contrastive cases. This is a deliberate v1 simplification documented in the spec; PHON-108 may add a python-path equivalent if paragraph integration warrants it.

Open questions

None.

References

  • Web app contrastive: packages/web/workers/src/routes/contrastive.ts — algorithms ported.
  • Web standalone tool: packages/web/frontend/src/components/tools/ContrastiveInterventionTool.tsx — three-mode UX confirms the constraint type split.
  • Pipeline: packages/data/src/phonolex_data/pipeline/derived.py:_compute_minimal_pairs (extend), _compute_phoneme_data (reuse vector lookup).
  • Continuous learned vectors: packages/features/outputs/vectors.csv (40 phonemes × 27 features).
  • Linguistic basis: Storkel 2022 ("Minimal, Maximal, or Multiple…"); Gierut 1989–1992 (maximal opposition).
  • Forward references: PHON-108 (paragraph integration for multopp), PHON-111 (web app continuous-vector parity).