Skip to content

PHON-154 Variant-Aware Matching — Phase 3: Minimal Pairs Across Variants

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Steps use checkbox (- [ ]) syntax.

Goal: Generate minimal pairs (Contrast Sets) across all attested pronunciations — words A and B form a pair if SOME variant of A and SOME variant of B differ by exactly one phoneme.

Architecture: _compute_minimal_pairs (pipeline/derived.py) buckets each word by masked-position key. Currently it buckets only the primary record.phonemes. Change it to bucket EVERY attested pronunciation (primary + each variant's phonemes, de-duplicated), skip same-word pairings, and dedup the final pair tuples. This is build-time Python; the pairs.parquet row count grows.

Tech Stack: Python 3.12, NumPy, pytest. File: packages/data/src/phonolex_data/pipeline/derived.py.

Spec: docs/superpowers/specs/2026-06-15-phon-154-variant-aware-matching-design.md. Depends on: nothing new (uses WordRecord.variants, already populated).

Out of scope: the pairs is_canonical flag stays word-level (unchanged); the reseed is deferred (folded with PHON-151).


Reference: current _compute_minimal_pairs (derived.py ~225-284)

Buckets {(length, pos, masked_tuple): [(word, phoneme_at_pos), …]} over record.phonemes only; within a bucket, every distinct-word pair with different phonemes at the masked slot is a minimal pair (8-tuple (w1, w2, ph1, ph2, pos, pos_type, feat_dist, son_diff)). pair_distances = _phoneme_pair_distances(vectors, feature_names) and .get((ph1,ph2),(0.0,0.0)) — a missing entry defaults to (0.0, 0.0), so an empty vectors dict is safe for tests.


Task 1: Bucket every attested pronunciation, skip same-word, dedup

Files: - Modify: packages/data/src/phonolex_data/pipeline/derived.py (_compute_minimal_pairs, ~225-284) - Test: packages/data/tests/test_derived.py

  • [ ] Step 1: Confirm _phoneme_pair_distances tolerates empty vectors

Read _phoneme_pair_distances (derived.py ~148). Confirm that passing vectors={}, feature_names=[] returns an empty dict (so the test can avoid constructing learned vectors and rely on the (0.0, 0.0) default). If it would raise on empty input, the test will instead pass a tiny 1-feature vector dict covering the test phonemes — note which you used.

  • [ ] Step 2: Write the failing tests

Add to packages/data/tests/test_derived.py (match the file's existing imports — it already imports WordRecord and _compute_minimal_pairs or imports the module; follow its style):

def test_minimal_pairs_match_via_variant_pronunciation():
    """alpha primary [k,æ,t]; beta primary [d,ɔ,ɡ] (no primary pair with alpha),
    but beta has a variant [k,æ,p] — differs from alpha only at the final phoneme,
    so alpha~beta is a minimal pair witnessed by beta's variant."""
    words = {
        "alpha": WordRecord(word="alpha", has_phonology=True,
                            phonemes=["k", "æ", "t"], phoneme_count=3, syllable_count=1),
        "beta": WordRecord(word="beta", has_phonology=True,
                           phonemes=["d", "ɔ", "ɡ"], phoneme_count=3, syllable_count=1,
                           variants=[{"phonemes": ["k", "æ", "p"], "ipa": "kæp",
                                      "syllables": [], "syllable_count": 1, "wcm_score": 0}]),
    }
    pairs = _compute_minimal_pairs(words, vectors={}, feature_names=[])
    pair_words = {(p[0], p[1]) for p in pairs}
    assert ("alpha", "beta") in pair_words or ("beta", "alpha") in pair_words


def test_minimal_pairs_no_self_pair_across_a_words_own_variants():
    """A word's own two pronunciations differing by one phoneme must NOT produce
    a self-pair (w1 == w2)."""
    words = {
        "gamma": WordRecord(word="gamma", has_phonology=True,
                            phonemes=["b", "æ", "t"], phoneme_count=3, syllable_count=1,
                            variants=[{"phonemes": ["b", "ɛ", "t"], "ipa": "bɛt",
                                       "syllables": [], "syllable_count": 1, "wcm_score": 0}]),
    }
    pairs = _compute_minimal_pairs(words, vectors={}, feature_names=[])
    assert all(p[0] != p[1] for p in pairs)


def test_minimal_pairs_dedup_same_contrast():
    """The same (w1, w2, ph1, ph2, pos) contrast must appear at most once even
    when multiple variant combinations witness it."""
    words = {
        "alpha": WordRecord(word="alpha", has_phonology=True,
                            phonemes=["k", "æ", "t"], phoneme_count=3, syllable_count=1,
                            variants=[{"phonemes": ["k", "æ", "t"], "ipa": "kæt",
                                       "syllables": [], "syllable_count": 1, "wcm_score": 0}]),
        "beta": WordRecord(word="beta", has_phonology=True,
                           phonemes=["k", "æ", "p"], phoneme_count=3, syllable_count=1,
                           variants=[{"phonemes": ["k", "æ", "p"], "ipa": "kæp",
                                      "syllables": [], "syllable_count": 1, "wcm_score": 0}]),
    }
    pairs = _compute_minimal_pairs(words, vectors={}, feature_names=[])
    keys = [(p[0], p[1], p[2], p[3], p[4]) for p in pairs]
    assert len(keys) == len(set(keys)), "duplicate contrast tuples emitted"

(If the test file imports the function differently, adapt the import; if _compute_minimal_pairs isn't imported there yet, import it from phonolex_data.pipeline.derived.)

  • [ ] Step 3: Run tests to verify they fail

Run: uv run python -m pytest packages/data/tests/test_derived.py -k "variant or self_pair or dedup" -v Expected: test_minimal_pairs_match_via_variant_pronunciation FAILS (primary-only bucketing never pairs alpha~beta). The other two may pass vacuously on the old code — that's fine; they guard the new behavior.

  • [ ] Step 4: Rewrite _compute_minimal_pairs bucketing + pairing

Replace the bucketing loop and the pairing loop in _compute_minimal_pairs (derived.py ~244-284) with:

    # {(length, position, masked_phonemes_tuple): {(word, phoneme_at_position), …}}
    # PHON-154: bucket EVERY attested pronunciation (primary + variants), so a
    # pair witnessed only by a non-primary variant is found. A set per bucket
    # de-dups identical (word, phoneme) entries when two of a word's variants
    # share the same masked key + slot phoneme.
    buckets: dict[tuple[int, int, tuple[str, ...]], set[tuple[str, str]]] = {}
    for word, record in words.items():
        if not record.has_phonology:
            continue
        # De-duplicated attested pronunciations, primary first.
        pronunciations: list[list[str]] = []
        if record.phonemes:
            pronunciations.append(list(record.phonemes))
        for v in record.variants:
            vp = v.get("phonemes")
            if vp and list(vp) not in pronunciations:
                pronunciations.append(list(vp))
        for phonemes in pronunciations:
            length = len(phonemes)
            if length < 2:
                continue
            p_list = list(phonemes)
            for pos in range(length):
                original = p_list[pos]
                p_list[pos] = "*"
                key = (length, pos, tuple(p_list))
                buckets.setdefault(key, set()).add((word, original))
                p_list[pos] = original

    minimal_pairs: list[tuple[str, str, str, str, int, str, float, float]] = []
    seen: set[tuple[str, str, str, str, int]] = set()
    for (length, pos, _masked), members_set in buckets.items():
        if len(members_set) < 2:
            continue
        if pos == 0:
            pos_type = "initial"
        elif pos == length - 1:
            pos_type = "final"
        else:
            pos_type = "medial"
        members = sorted(members_set)  # deterministic ordering
        for i in range(len(members)):
            w1, ph1 = members[i]
            for j in range(i + 1, len(members)):
                w2, ph2 = members[j]
                if w1 == w2:
                    # Two pronunciations of the SAME word — not a cross-word pair.
                    continue
                if ph1 == ph2:
                    # Same phoneme at the wildcard slot = homophones at this slot.
                    continue
                dedup_key = (w1, w2, ph1, ph2, pos)
                if dedup_key in seen:
                    # Same contrast already recorded via another variant combo.
                    continue
                seen.add(dedup_key)
                feat_dist, son_diff = pair_distances.get((ph1, ph2), (0.0, 0.0))
                minimal_pairs.append((w1, w2, ph1, ph2, pos, pos_type, feat_dist, son_diff))

    return minimal_pairs
  • [ ] Step 5: Run tests to verify they pass

Run: uv run python -m pytest packages/data/tests/test_derived.py -k "variant or self_pair or dedup" -v Expected: all three PASS.

  • [ ] Step 6: Full data-pipeline regression

Run: uv run python -m pytest packages/data/tests/test_derived.py packages/data/tests/runtime/ -q Expected: green. (Existing minimal-pair tests must still pass — variant-aware bucketing is a superset, so all prior primary-witnessed pairs are still produced.)

  • [ ] Step 7: Commit
git add packages/data/src/phonolex_data/pipeline/derived.py packages/data/tests/test_derived.py
git commit -m "feat(phon-154): minimal pairs generated across attested variants"

Phase 3 done — exit criteria

  • _compute_minimal_pairs buckets all attested pronunciations; a pair witnessed only by a variant is produced; no self-pairs; contrasts de-duplicated.
  • test_derived.py + runtime suite green.
  • pairs.parquet will grow on the next (deferred) reseed; the new pairs flow to Contrast Sets witnessed by variant pronunciations.

Next

  • Phase 4: audio /analyze per-variant scoring + frontend variant display + superscript flag.
  • Phase 2b: count-range matching (queries.ts) + CONTAINS_MEDIAL variant post-filter.
  • Reseed: single regeneration folded with PHON-151.