PHON-154 Variant-Aware Matching, Phase 3: Minimal Pairs Across Variants
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Steps use checkbox (- [ ]) syntax.
Goal: Generate minimal pairs (Contrast Sets) across all attested pronunciations: words A and B form a pair if some attested pronunciation of A (primary or variant) and some attested pronunciation of B have the same length and differ at exactly one position.
Architecture: _compute_minimal_pairs (pipeline/derived.py) buckets each word by masked-position key. Currently it buckets only the primary record.phonemes. Change it to bucket EVERY attested pronunciation (primary + each variant's phonemes, de-duplicated), skip same-word pairings, and dedup the final pair tuples. This is build-time Python; the pairs.parquet row count grows.
Tech Stack: Python 3.12, NumPy, pytest. File: packages/data/src/phonolex_data/pipeline/derived.py.
Spec: docs/superpowers/specs/2026-06-15-phon-154-variant-aware-matching-design.md. Depends on: nothing new (uses WordRecord.variants, already populated).
Out of scope: the pairs is_canonical flag stays word-level (unchanged); the reseed is deferred (folded with PHON-151).
Reference: current _compute_minimal_pairs (derived.py ~225-284)
The current implementation builds buckets of the form {(length, pos, masked_tuple): [(word, phoneme_at_pos), …]} over record.phonemes only. Within a bucket, every pair of distinct words with different phonemes at the masked slot is a minimal pair, emitted as the 8-tuple (w1, w2, ph1, ph2, pos, pos_type, feat_dist, son_diff). Distances come from pair_distances = _phoneme_pair_distances(vectors, feature_names), looked up with pair_distances.get((ph1, ph2), (0.0, 0.0)). A missing entry therefore defaults to (0.0, 0.0), so an empty vectors dict should be safe for tests, provided _phoneme_pair_distances itself accepts empty input (checked in Step 1).
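To make the primary-only behaviour concrete, here is a minimal sketch of masked-position bucketing over one pronunciation per word. The helper name bucket_primary_only and the plain word-to-phoneme-list mapping are illustrative stand-ins, not the actual derived.py code or the WordRecord type.

```python
from collections import defaultdict


def bucket_primary_only(
    words: dict[str, list[str]],
) -> dict[tuple, list[tuple[str, str]]]:
    """Bucket each word's single pronunciation by (length, pos, masked tuple).

    Hypothetical stand-in for the primary-only bucketing described above.
    """
    buckets: dict[tuple, list[tuple[str, str]]] = defaultdict(list)
    for word, phonemes in words.items():
        for pos, original in enumerate(phonemes):
            # Replace the phoneme at `pos` with a wildcard to form the key.
            masked = tuple("*" if i == pos else p for i, p in enumerate(phonemes))
            buckets[(len(phonemes), pos, masked)].append((word, original))
    return dict(buckets)


words = {
    "cat": ["k", "æ", "t"],
    "cap": ["k", "æ", "p"],
    "dog": ["d", "ɔ", "ɡ"],
}
buckets = bucket_primary_only(words)

# "cat" and "cap" meet in the bucket that masks the final position.
shared = buckets[(3, 2, ("k", "æ", "*"))]
```

In this sketch "dog" never shares a bucket with the other two words, which is the situation the variant-aware change addresses when a matching pronunciation exists only as a variant.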
Task 1: Bucket every attested pronunciation, skip same-word, dedup
Files:
- Modify: packages/data/src/phonolex_data/pipeline/derived.py (_compute_minimal_pairs, ~225-284)
- Test: packages/data/tests/test_derived.py
- [ ] Step 1: Confirm _phoneme_pair_distances tolerates empty vectors
Read _phoneme_pair_distances (derived.py ~148). Confirm that passing vectors={}, feature_names=[] returns an empty dict (so the test can avoid constructing learned vectors and rely on the (0.0, 0.0) default). If it would raise on empty input, the test will instead pass a tiny 1-feature vector dict covering the test phonemes — note which you used.
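The fallback the tests rely on can be sketched as follows. lookup_distance is a hypothetical helper, the empty dict stands in for what _phoneme_pair_distances is assumed to return on empty input (the thing this step confirms), and the populated values are made up.

```python
def lookup_distance(
    pair_distances: dict[tuple[str, str], tuple[float, float]],
    ph1: str,
    ph2: str,
) -> tuple[float, float]:
    """Return (feat_dist, son_diff), defaulting to (0.0, 0.0) when absent."""
    return pair_distances.get((ph1, ph2), (0.0, 0.0))


# Assumed result of _phoneme_pair_distances(vectors={}, feature_names=[]).
empty: dict[tuple[str, str], tuple[float, float]] = {}
# Made-up distances, for illustration only.
populated = {("t", "p"): (0.25, 1.0)}

from_empty = lookup_distance(empty, "t", "p")
from_populated = lookup_distance(populated, "t", "p")
```

If the real function raises on empty input instead, only the construction of the dict changes; the lookup default stays the same.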
- [ ] Step 2: Write the failing tests
Add to packages/data/tests/test_derived.py (match the file's existing imports — it already imports WordRecord and _compute_minimal_pairs or imports the module; follow its style):

    def test_minimal_pairs_match_via_variant_pronunciation():
        """alpha primary [k,æ,t]; beta primary [d,ɔ,ɡ] (no primary pair with alpha),
        but beta has a variant [k,æ,p] — differs from alpha only at the final phoneme,
        so alpha~beta is a minimal pair witnessed by beta's variant."""
        words = {
            "alpha": WordRecord(word="alpha", has_phonology=True,
                                phonemes=["k", "æ", "t"], phoneme_count=3, syllable_count=1),
            "beta": WordRecord(word="beta", has_phonology=True,
                               phonemes=["d", "ɔ", "ɡ"], phoneme_count=3, syllable_count=1,
                               variants=[{"phonemes": ["k", "æ", "p"], "ipa": "kæp",
                                          "syllables": [], "syllable_count": 1, "wcm_score": 0}]),
        }
        pairs = _compute_minimal_pairs(words, vectors={}, feature_names=[])
        pair_words = {(p[0], p[1]) for p in pairs}
        assert ("alpha", "beta") in pair_words or ("beta", "alpha") in pair_words


    def test_minimal_pairs_no_self_pair_across_a_words_own_variants():
        """A word's own two pronunciations differing by one phoneme must NOT produce
        a self-pair (w1 == w2)."""
        words = {
            "gamma": WordRecord(word="gamma", has_phonology=True,
                                phonemes=["b", "æ", "t"], phoneme_count=3, syllable_count=1,
                                variants=[{"phonemes": ["b", "ɛ", "t"], "ipa": "bɛt",
                                           "syllables": [], "syllable_count": 1, "wcm_score": 0}]),
        }
        pairs = _compute_minimal_pairs(words, vectors={}, feature_names=[])
        assert all(p[0] != p[1] for p in pairs)


    def test_minimal_pairs_dedup_same_contrast():
        """The same (w1, w2, ph1, ph2, pos) contrast must appear at most once even
        when multiple variant combinations witness it."""
        words = {
            "alpha": WordRecord(word="alpha", has_phonology=True,
                                phonemes=["k", "æ", "t"], phoneme_count=3, syllable_count=1,
                                variants=[{"phonemes": ["k", "æ", "t"], "ipa": "kæt",
                                           "syllables": [], "syllable_count": 1, "wcm_score": 0}]),
            "beta": WordRecord(word="beta", has_phonology=True,
                               phonemes=["k", "æ", "p"], phoneme_count=3, syllable_count=1,
                               variants=[{"phonemes": ["k", "æ", "p"], "ipa": "kæp",
                                          "syllables": [], "syllable_count": 1, "wcm_score": 0}]),
        }
        pairs = _compute_minimal_pairs(words, vectors={}, feature_names=[])
        keys = [(p[0], p[1], p[2], p[3], p[4]) for p in pairs]
        assert len(keys) == len(set(keys)), "duplicate contrast tuples emitted"

(If the test file imports the function differently, adapt the import; if _compute_minimal_pairs isn't imported there yet, import it from phonolex_data.pipeline.derived.)
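One observation on the dedup test: its variants are identical to the primaries, so the pronunciation-level de-dup in Step 4 would collapse them before bucketing and the contrast is witnessed in a single bucket. A fixture where the same contrast is witnessed in two different buckets might exercise the seen-set path more directly. The sketch below counts witnessing buckets for a candidate fixture; witnessing_keys is a hypothetical helper operating on plain phoneme lists, not part of derived.py.

```python
Pron = list[str]


def witnessing_keys(prons_a: list[Pron], prons_b: list[Pron], pos: int) -> set[tuple]:
    """Masked keys at `pos` shared by A and B with differing slot phonemes."""

    def slot_phonemes(prons: list[Pron]) -> dict[tuple, set[str]]:
        out: dict[tuple, set[str]] = {}
        for p in prons:
            if pos < len(p):
                masked = tuple("*" if i == pos else x for i, x in enumerate(p))
                out.setdefault((len(p), pos, masked), set()).add(p[pos])
        return out

    a, b = slot_phonemes(prons_a), slot_phonemes(prons_b)
    return {
        key
        for key in a.keys() & b.keys()
        if any(x != y for x in a[key] for y in b[key])
    }


# Fixture from the dedup test: each variant repeats its primary.
one_bucket = witnessing_keys(
    [["k", "æ", "t"], ["k", "æ", "t"]],
    [["k", "æ", "p"], ["k", "æ", "p"]],
    pos=2,
)

# Candidate fixture: both words have a second vowel, so the t/p contrast
# at position 2 is witnessed under two different masked keys.
two_buckets = witnessing_keys(
    [["k", "æ", "t"], ["k", "ɛ", "t"]],
    [["k", "æ", "p"], ["k", "ɛ", "p"]],
    pos=2,
)
```

With the candidate fixture, an implementation without the seen set would be expected to emit the alpha~beta t/p contrast twice, so the test would fail for the right reason.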
- [ ] Step 3: Run tests to verify they fail
Run: uv run python -m pytest packages/data/tests/test_derived.py -k "variant or self_pair or dedup" -v
Expected: test_minimal_pairs_match_via_variant_pronunciation FAILS (primary-only bucketing never pairs alpha~beta). The other two may pass vacuously on the old code — that's fine; they guard the new behavior.
- [ ] Step 4: Rewrite _compute_minimal_pairs bucketing + pairing
Replace the bucketing loop and the pairing loop in _compute_minimal_pairs (derived.py ~244-284) with:

        # {(length, position, masked_phonemes_tuple): {(word, phoneme_at_position), …}}
        # PHON-154: bucket EVERY attested pronunciation (primary + variants), so a
        # pair witnessed only by a non-primary variant is found. A set per bucket
        # de-dups identical (word, phoneme) entries when two of a word's variants
        # share the same masked key + slot phoneme.
        buckets: dict[tuple[int, int, tuple[str, ...]], set[tuple[str, str]]] = {}

        for word, record in words.items():
            if not record.has_phonology:
                continue
            # De-duplicated attested pronunciations, primary first.
            pronunciations: list[list[str]] = []
            if record.phonemes:
                pronunciations.append(list(record.phonemes))
            for v in record.variants:
                vp = v.get("phonemes")
                if vp and list(vp) not in pronunciations:
                    pronunciations.append(list(vp))

            for phonemes in pronunciations:
                length = len(phonemes)
                if length < 2:
                    continue
                p_list = list(phonemes)
                for pos in range(length):
                    original = p_list[pos]
                    p_list[pos] = "*"
                    key = (length, pos, tuple(p_list))
                    buckets.setdefault(key, set()).add((word, original))
                    p_list[pos] = original

        minimal_pairs: list[tuple[str, str, str, str, int, str, float, float]] = []
        seen: set[tuple[str, str, str, str, int]] = set()

        for (length, pos, _masked), members_set in buckets.items():
            if len(members_set) < 2:
                continue

            if pos == 0:
                pos_type = "initial"
            elif pos == length - 1:
                pos_type = "final"
            else:
                pos_type = "medial"

            members = sorted(members_set)  # deterministic ordering
            for i in range(len(members)):
                w1, ph1 = members[i]
                for j in range(i + 1, len(members)):
                    w2, ph2 = members[j]
                    if w1 == w2:
                        # Two pronunciations of the SAME word: not a cross-word pair.
                        continue
                    if ph1 == ph2:
                        # Same phoneme at the wildcard slot = homophones at this slot.
                        continue
                    dedup_key = (w1, w2, ph1, ph2, pos)
                    if dedup_key in seen:
                        # Same contrast already recorded via another variant combo.
                        continue
                    seen.add(dedup_key)
                    feat_dist, son_diff = pair_distances.get((ph1, ph2), (0.0, 0.0))
                    minimal_pairs.append((w1, w2, ph1, ph2, pos, pos_type, feat_dist, son_diff))

        return minimal_pairs

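For a quick sanity check outside the pipeline, the rewritten logic can be approximated on plain data. This is a reduced sketch, not the derived.py function: it assumes words are given as a mapping from word to an already de-duplicated list of pronunciations (a stand-in for WordRecord), and it omits pos_type and the distance lookup. The name minimal_pairs_sketch is illustrative.

```python
def minimal_pairs_sketch(
    words: dict[str, list[list[str]]],
) -> list[tuple[str, str, str, str, int]]:
    """Variant-aware minimal pairs as (w1, w2, ph1, ph2, pos) tuples."""
    buckets: dict[tuple, set[tuple[str, str]]] = {}
    for word, pronunciations in words.items():
        for phonemes in pronunciations:
            if len(phonemes) < 2:
                continue
            for pos, original in enumerate(phonemes):
                masked = tuple("*" if i == pos else p for i, p in enumerate(phonemes))
                key = (len(phonemes), pos, masked)
                buckets.setdefault(key, set()).add((word, original))

    pairs: list[tuple[str, str, str, str, int]] = []
    seen: set[tuple[str, str, str, str, int]] = set()
    for (_length, pos, _masked), members_set in buckets.items():
        members = sorted(members_set)  # deterministic ordering, so w1 <= w2
        for i, (w1, ph1) in enumerate(members):
            for w2, ph2 in members[i + 1:]:
                if w1 == w2 or ph1 == ph2:
                    continue  # same word, or no contrast at the slot
                contrast = (w1, w2, ph1, ph2, pos)
                if contrast not in seen:
                    seen.add(contrast)
                    pairs.append(contrast)
    return pairs


words = {
    # alpha's two pronunciations differ from each other at position 1.
    "alpha": [["k", "æ", "t"], ["k", "ɛ", "t"]],
    # beta's primary shares nothing with alpha; its variants do.
    "beta": [["d", "ɔ", "ɡ"], ["k", "æ", "p"], ["k", "ɛ", "p"]],
}
pairs = minimal_pairs_sketch(words)
```

On this input the sketch covers all three behaviours at once: the pair is found only through beta's variants, alpha's own two pronunciations do not pair with each other, and the t/p contrast witnessed under both vowels is recorded once.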
- [ ] Step 5: Run tests to verify they pass
Run: uv run python -m pytest packages/data/tests/test_derived.py -k "variant or self_pair or dedup" -v
Expected: all three PASS.
- [ ] Step 6: Full data-pipeline regression
Run: uv run python -m pytest packages/data/tests/test_derived.py packages/data/tests/runtime/ -q
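Expected: green. Existing minimal-pair tests should still pass: variant-aware bucketing is a superset, so every pair previously witnessed by primary pronunciations is still produced. Because bucket members are now sorted, the emission order and the (w1, w2) orientation of a pair may differ from the old output, so any existing test that asserts exact order or orientation may need adjusting.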
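If the superset claim needs checking by hand, for example against a saved copy of the old output, comparing pairs without regard to orientation avoids false alarms from the sorted ordering in Step 4. The helper contrast_keys below is hypothetical, and both pair lists are made-up sample data rather than real pipeline output.

```python
def contrast_keys(pairs) -> set[tuple[frozenset, int]]:
    """Reduce 8-tuples to orientation-independent (contrast, pos) keys."""
    keys = set()
    for w1, w2, ph1, ph2, pos, *_rest in pairs:
        # A frozenset of (word, phoneme) ignores which word came first.
        keys.add((frozenset({(w1, ph1), (w2, ph2)}), pos))
    return keys


# Made-up sample data in the 8-tuple layout described in this plan.
old_pairs = [("cat", "cap", "t", "p", 2, "final", 0.0, 0.0)]
new_pairs = [
    ("cap", "cat", "p", "t", 2, "final", 0.0, 0.0),  # same contrast, flipped
    ("cap", "cop", "æ", "ɑ", 1, "medial", 0.0, 0.0),  # newly found pair
]

missing = contrast_keys(old_pairs) - contrast_keys(new_pairs)
```

An empty missing set means every old contrast survives in the new output, regardless of which word is listed first.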
- [ ] Step 7: Commit

    git add packages/data/src/phonolex_data/pipeline/derived.py packages/data/tests/test_derived.py
    git commit -m "feat(phon-154): minimal pairs generated across attested variants"

Phase 3 done: exit criteria
- _compute_minimal_pairs buckets all attested pronunciations; a pair witnessed only by a variant is produced; no self-pairs; contrasts de-duplicated.
- test_derived.py + runtime suite green.
- pairs.parquet will grow on the next (deferred) reseed; the new pairs flow to Contrast Sets witnessed by variant pronunciations.
Next
- Phase 4: audio /analyze per-variant scoring + frontend variant display + superscript flag.
- Phase 2b: count-range matching (queries.ts) + CONTAINS_MEDIAL variant post-filter.
- Reseed: single regeneration folded with PHON-151.