PHON-154 Variant-Aware Matching — Phase 1: Data Foundation
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Emit variant-matchable columns (variants_str, cv_shapes, has_variants, phoneme/syllable count ranges) into words.parquet and the D1 seed so downstream phases can match across all attested pronunciations.
Architecture: One row per word is preserved. A new variants_str column holds every attested pronunciation in pipe form, concatenated so adjacent variants are separated by || (an unambiguous boundary — no phoneme can span it). Parallel cv_shapes set and count-range columns let CV-shape and count filters match any variant. Computed in emit_parquet._word_record_to_row from WordRecord.phonemes + WordRecord.variants (each variant dict carries phonemes, syllables, syllable_count). This phase only ADDS columns — no behavior changes yet, so it is safe to land independently.
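The encoding described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline code: `encode_variants` and `decode_variants` are hypothetical helper names, and the sketch assumes no phoneme symbol is empty or contains `|`.

```python
from typing import Optional


def encode_variants(seqs: list[list[str]]) -> Optional[str]:
    """Pipe-wrap each pronunciation (|a|b|) and concatenate them, so
    adjacent variants end up separated by `||`."""
    if not seqs:
        return None
    return "".join("|" + "|".join(s) + "|" for s in seqs)


def decode_variants(variants_str: str) -> list[list[str]]:
    """Invert encode_variants: strip the outer pipes, split on the `||`
    boundary, then split each variant on single pipes."""
    return [v.split("|") for v in variants_str.strip("|").split("||")]


hello = [["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]]
encoded = encode_variants(hello)
print(encoded)                            # |h|ə|l|oʊ||h|ɛ|l|oʊ|
print(decode_variants(encoded) == hello)  # True
```

Because every phoneme is wrapped in single pipes, a doubled pipe can only appear where one variant ends and the next begins, which is what makes the round trip lossless.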
Tech Stack: Python 3.12, Polars, pytest. Files in packages/data/.
Spec: docs/superpowers/specs/2026-06-15-phon-154-variant-aware-matching-design.md
Governing rule (carried from spec): include if ANY variant satisfies; exclude if ANY variant violates. This phase just produces the data; matching semantics land in Phase 2.
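One possible reading of the governing rule, written as a predicate over a word's variants. The function name `word_passes` and the lambda filters are hypothetical; the authoritative semantics are in the spec and are implemented in Phase 2, not here.

```python
from typing import Callable

Phonemes = list[str]


def word_passes(
    variants: list[Phonemes],
    include: Callable[[Phonemes], bool],
    exclude: Callable[[Phonemes], bool],
) -> bool:
    """Keep a word if ANY variant satisfies `include` and NO variant
    trips `exclude` (that is, exclude if ANY variant violates)."""
    return any(include(v) for v in variants) and not any(exclude(v) for v in variants)


hello = [["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]]

# Include: some pronunciation contains /ɛ/, so the second variant qualifies.
print(word_passes(hello, lambda v: "ɛ" in v, lambda v: False))  # True
# Exclude: some pronunciation contains /ə/, so the first variant disqualifies.
print(word_passes(hello, lambda v: True, lambda v: "ə" in v))   # False
```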
Reference: existing shapes (already in the codebase)
- WordRecord.variants: list[dict] — each dict has keys phonemes (list[str]), ipa (str), syllables (list[dict] with onset/nucleus/coda/stress), syllable_count (int), wcm_score. May be empty for single-pronunciation words; when CMU has alternates it includes the primary too. (packages/data/src/phonolex_data/pipeline/words.py:270-290)
- Primary CV-shape derivation already used by the pipeline: per syllable "C"*len(onset) + "V" + "C"*len(coda), joined with - (packages/data/src/phonolex_data/pipeline/words.py:200-203).
- _word_record_to_row already builds phonemes_str = "|" + "|".join(phonemes) + "|" and JSON-encodes variants (packages/data/src/phonolex_data/runtime/emit_parquet.py:36-45).
- Words schema core columns live in _CORE_WORDS_COLUMNS (packages/data/src/phonolex_data/runtime/schema.py:37-79).
- D1 emit: a WORDS_COLUMNS ordered list + a hand-written CREATE TABLE words (...) DDL in packages/data/src/phonolex_data/runtime/emit_d1_sql.py (phonemes_str/variants/cv_shape appear around lines 49-61 and 134-146).
Task 1: Add the variant columns to the words schema
Files:
- Modify: packages/data/src/phonolex_data/runtime/schema.py (_CORE_WORDS_COLUMNS, after the variants entry ~line 68)
- Test: packages/data/tests/runtime/test_schema.py
- [ ] Step 1: Write the failing test
Add to packages/data/tests/runtime/test_schema.py:
def test_words_schema_has_variant_matching_columns():
    from phonolex_data.runtime.schema import words_schema
    schema = words_schema()
    for col, dtype in [
        ("variants_str", pl.Utf8),
        ("cv_shapes", pl.Utf8),
        ("has_variants", pl.Boolean),
        ("phoneme_count_min", pl.Int32),
        ("phoneme_count_max", pl.Int32),
        ("syllable_count_min", pl.Int32),
        ("syllable_count_max", pl.Int32),
    ]:
        assert schema[col] == dtype, f"{col} should be {dtype}"
(Ensure import polars as pl is present at the top of the test file.)
- [ ] Step 2: Run test to verify it fails
Run: uv run python -m pytest packages/data/tests/runtime/test_schema.py::test_words_schema_has_variant_matching_columns -v
Expected: FAIL with KeyError: 'variants_str'.
- [ ] Step 3: Add the columns
In packages/data/src/phonolex_data/runtime/schema.py, inside _CORE_WORDS_COLUMNS, immediately after the "variants": pl.Utf8, line, add:
# PHON-154 variant-aware matching. variants_str = every attested pronunciation
# in pipe form, concatenated so adjacent variants are separated by `||`
# (boundary marker — no phoneme spans it). cv_shapes = pipe-bounded distinct
# CV-shape set across variants. Counts as ranges so "match any variant" works.
"variants_str": pl.Utf8,
"cv_shapes": pl.Utf8,
"has_variants": pl.Boolean,
"phoneme_count_min": pl.Int32,
"phoneme_count_max": pl.Int32,
"syllable_count_min": pl.Int32,
"syllable_count_max": pl.Int32,
- [ ] Step 4: Run test to verify it passes
Run: uv run python -m pytest packages/data/tests/runtime/test_schema.py::test_words_schema_has_variant_matching_columns -v
Expected: PASS.
- [ ] Step 5: Commit
git add packages/data/src/phonolex_data/runtime/schema.py packages/data/tests/runtime/test_schema.py
git commit -m "feat(phon-154): add variant-matching columns to words schema"
Task 2: Compute variants_str + has_variants in emit
Files:
- Modify: packages/data/src/phonolex_data/runtime/emit_parquet.py (_word_record_to_row, ~lines 36-45)
- Test: packages/data/tests/runtime/test_emit_parquet.py
- [ ] Step 1: Write the failing test
Add to packages/data/tests/runtime/test_emit_parquet.py:
def test_variants_str_concatenates_with_double_pipe_boundary(tmp_path: Path):
    """A multi-pronunciation word emits all variants in |...||...| form,
    primary first; has_variants is True. Single-pronunciation words emit just
    the primary and has_variants is False."""
    db = LexicalDatabase(
        words={
            "hello": WordRecord(
                word="hello", has_phonology=True, is_canonical=True, pos="INTJ",
                phonemes=["h", "ə", "l", "oʊ"], phoneme_count=4, syllable_count=2,
                cv_shape="CV-CV",
                variants=[
                    {"phonemes": ["h", "ə", "l", "oʊ"], "ipa": "həloʊ",
                     "syllables": [{"onset": ["h"], "nucleus": "ə", "coda": [], "stress": 0},
                                   {"onset": ["l"], "nucleus": "oʊ", "coda": [], "stress": 1}],
                     "syllable_count": 2, "wcm_score": 3},
                    {"phonemes": ["h", "ɛ", "l", "oʊ"], "ipa": "hɛloʊ",
                     "syllables": [{"onset": ["h"], "nucleus": "ɛ", "coda": [], "stress": 0},
                                   {"onset": ["l"], "nucleus": "oʊ", "coda": [], "stress": 1}],
                     "syllable_count": 2, "wcm_score": 3},
                ],
            ),
            "cat": WordRecord(
                word="cat", has_phonology=True, is_canonical=True, pos="NOUN",
                phonemes=["k", "æ", "t"], phoneme_count=3, syllable_count=1,
                cv_shape="CVC", variants=[],
            ),
        },
        edges=[],
    )
    emit_parquet(db, tmp_path)
    df = pl.read_parquet(tmp_path / "words.parquet").sort("word")
    rows = {r["word"]: r for r in df.iter_rows(named=True)}
    assert rows["hello"]["variants_str"] == "|h|ə|l|oʊ||h|ɛ|l|oʊ|"
    assert rows["hello"]["has_variants"] is True
    assert rows["cat"]["variants_str"] == "|k|æ|t|"
    assert rows["cat"]["has_variants"] is False
- [ ] Step 2: Run test to verify it fails
Run: uv run python -m pytest packages/data/tests/runtime/test_emit_parquet.py::test_variants_str_concatenates_with_double_pipe_boundary -v
Expected: FAIL (variants_str is None / column absent — emit doesn't compute it yet).
- [ ] Step 3: Implement the computation
In packages/data/src/phonolex_data/runtime/emit_parquet.py, replace _word_record_to_row (lines 36-45) with the version below and add the _cv_shape_of helper after it:
def _word_record_to_row(record) -> dict:
    """Convert a WordRecord to a dict matching the words_schema."""
    row = {f.name: getattr(record, f.name) for f in dataclasses.fields(record)}
    # Pipe-delimited phoneme string for D1 LIKE-pattern queries
    phonemes = row.get("phonemes") or []
    row["phonemes_str"] = "|" + "|".join(phonemes) + "|" if phonemes else None
    # PHON-154: variant-matchable forms. Build the de-duplicated set of attested
    # pronunciations (primary first), then the other matching fields, BEFORE
    # serializing `variants` to JSON below.
    variants = row.get("variants") or []
    seqs: list[list[str]] = []
    if phonemes:
        seqs.append(phonemes)
    for v in variants:
        vp = v.get("phonemes")
        if vp and vp not in seqs:
            seqs.append(vp)
    # Each variant pipe-wrapped (|a|b|); concatenation makes boundaries `||`.
    row["variants_str"] = "".join("|" + "|".join(s) + "|" for s in seqs) if seqs else None
    row["has_variants"] = len(seqs) > 1
    # CV-shape set across variants (primary + each variant's syllables).
    cv_set: list[str] = []
    if row.get("cv_shape"):
        cv_set.append(row["cv_shape"])
    for v in variants:
        cv = _cv_shape_of(v.get("syllables"))
        if cv and cv not in cv_set:
            cv_set.append(cv)
    row["cv_shapes"] = "|" + "|".join(cv_set) + "|" if cv_set else None
    # Count ranges across variants (record.phoneme_count/syllable_count are the
    # primary; syllable_count is already the max across variants per pipeline).
    phon_counts = [len(s) for s in seqs] or ([row["phoneme_count"]] if row.get("phoneme_count") else [])
    syl_counts = [c for c in ([row.get("syllable_count")] + [v.get("syllable_count") for v in variants]) if c]
    row["phoneme_count_min"] = min(phon_counts) if phon_counts else None
    row["phoneme_count_max"] = max(phon_counts) if phon_counts else None
    row["syllable_count_min"] = min(syl_counts) if syl_counts else None
    row["syllable_count_max"] = max(syl_counts) if syl_counts else None
    # Variants → JSON string (schema is pl.Utf8; Parquet can't write empty struct)
    row["variants"] = json.dumps(variants) if variants else None
    return row


def _cv_shape_of(syllables) -> str | None:
    """CV skeleton from a variant's syllable dicts, matching the pipeline's
    primary cv_shape derivation (C*onset + V + C*coda per syllable, joined '-')."""
    if not syllables:
        return None
    parts = ["C" * len(s["onset"]) + "V" + "C" * len(s["coda"]) for s in syllables]
    return "-".join(parts) if parts else None
- [ ] Step 4: Run test to verify it passes
Run: uv run python -m pytest packages/data/tests/runtime/test_emit_parquet.py::test_variants_str_concatenates_with_double_pipe_boundary -v
Expected: PASS.
- [ ] Step 5: Commit
git add packages/data/src/phonolex_data/runtime/emit_parquet.py packages/data/tests/runtime/test_emit_parquet.py
git commit -m "feat(phon-154): emit variants_str/has_variants + variant CV/count fields"
Task 3: Cover cv_shapes and count ranges with a test
Files:
- Test: packages/data/tests/runtime/test_emit_parquet.py
- [ ] Step 1: Write the test (implementation already done in Task 2)
Add to packages/data/tests/runtime/test_emit_parquet.py (same fixture shape as the Task 2 hello/cat test, applied to a different word):
def test_variant_cv_shapes_and_count_ranges(tmp_path: Path):
    db = LexicalDatabase(
        words={
            "either": WordRecord(
                word="either", has_phonology=True, is_canonical=True, pos="ADV",
                phonemes=["i", "ð", "ɚ"], phoneme_count=3, syllable_count=2,
                cv_shape="V-CV",
                variants=[
                    {"phonemes": ["i", "ð", "ɚ"], "ipa": "iðɚ",
                     "syllables": [{"onset": [], "nucleus": "i", "coda": [], "stress": 1},
                                   {"onset": ["ð"], "nucleus": "ɚ", "coda": [], "stress": 0}],
                     "syllable_count": 2, "wcm_score": 2},
                    {"phonemes": ["aɪ", "ð", "ɚ"], "ipa": "aɪðɚ",
                     "syllables": [{"onset": [], "nucleus": "aɪ", "coda": [], "stress": 1},
                                   {"onset": ["ð"], "nucleus": "ɚ", "coda": [], "stress": 0}],
                     "syllable_count": 2, "wcm_score": 2},
                ],
            ),
        },
        edges=[],
    )
    emit_parquet(db, tmp_path)
    r = pl.read_parquet(tmp_path / "words.parquet").row(0, named=True)
    # Both variants are V-CV → the set has one shape, pipe-bounded.
    assert r["cv_shapes"] == "|V-CV|"
    # Both pronunciations are 3 phonemes / 2 syllables.
    assert r["phoneme_count_min"] == 3 and r["phoneme_count_max"] == 3
    assert r["syllable_count_min"] == 2 and r["syllable_count_max"] == 2
- [ ] Step 2: Run test to verify it passes
Run: uv run python -m pytest packages/data/tests/runtime/test_emit_parquet.py::test_variant_cv_shapes_and_count_ranges -v
Expected: PASS (logic was implemented in Task 2).
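The test above exercises the case where both variants share one shape. For contrast, here is a standalone sketch of what the same derivation would yield when variants differ in length. The two pronunciations are hand-written illustrative data (not taken from CMU), `cv_shape_of` is a local stand-in for the helper in Task 2, and syllable counts are taken from `len(syllables)` here rather than from the `syllable_count` key the real code reads.

```python
def cv_shape_of(syllables: list[dict]) -> str:
    """C*onset + V + C*coda per syllable, joined with '-'."""
    return "-".join("C" * len(s["onset"]) + "V" + "C" * len(s["coda"]) for s in syllables)


# Illustrative data only: two hand-written pronunciations of one word.
variants = [
    {"phonemes": ["f", "æ", "m", "ə", "l", "i"],
     "syllables": [{"onset": ["f"], "coda": []},
                   {"onset": ["m"], "coda": []},
                   {"onset": ["l"], "coda": []}]},
    {"phonemes": ["f", "æ", "m", "l", "i"],
     "syllables": [{"onset": ["f"], "coda": ["m"]},
                   {"onset": ["l"], "coda": []}]},
]

shapes: list[str] = []
for v in variants:
    shape = cv_shape_of(v["syllables"])
    if shape not in shapes:  # keep first-seen order, drop duplicates
        shapes.append(shape)

cv_shapes = "|" + "|".join(shapes) + "|"
phoneme_counts = [len(v["phonemes"]) for v in variants]
syllable_counts = [len(v["syllables"]) for v in variants]

print(cv_shapes)                                   # |CV-CV-CV|CVC-CV|
print(min(phoneme_counts), max(phoneme_counts))    # 5 6
print(min(syllable_counts), max(syllable_counts))  # 2 3
```

A case like this may be worth adding as a second test, since it is the one where the min and max columns actually diverge.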
- [ ] Step 3: Commit
git add packages/data/tests/runtime/test_emit_parquet.py
git commit -m "test(phon-154): cover variant cv_shapes + count ranges"
Task 4: Add the columns to the D1 seed emit (DDL + insert)
Files:
- Modify: packages/data/src/phonolex_data/runtime/emit_d1_sql.py (the words WORDS_COLUMNS ordered list ~lines 45-61, and the CREATE TABLE words (...) DDL ~lines 130-146)
- Test: packages/data/tests/runtime/test_emit_d1_sql.py
- [ ] Step 1: Write the failing test
Add to packages/data/tests/runtime/test_emit_d1_sql.py (follow the file's existing fixture/style; this asserts the generated SQL declares + populates the new columns):
def test_words_ddl_and_insert_include_variant_columns(tmp_path: Path):
    """The words CREATE TABLE + INSERT must carry the PHON-154 variant columns."""
    from phonolex_data.runtime.emit_d1_sql import emit_d1_sql
    # Build a tiny words.parquet via emit_parquet, then emit the seed.
    from phonolex_data.runtime.emit_parquet import emit_parquet
    from phonolex_data.pipeline.schema import LexicalDatabase, WordRecord
    db = LexicalDatabase(
        words={"cat": WordRecord(word="cat", has_phonology=True, is_canonical=True,
                                 pos="NOUN", phonemes=["k", "æ", "t"],
                                 phoneme_count=3, syllable_count=1, cv_shape="CVC")},
        edges=[],
    )
    emit_parquet(db, tmp_path)
    emit_d1_sql(tmp_path, tmp_path / "d1-seed.sql")
    sql = (tmp_path / "d1-seed.sql").read_text()
    for col in ["variants_str", "cv_shapes", "has_variants",
                "phoneme_count_min", "phoneme_count_max",
                "syllable_count_min", "syllable_count_max"]:
        assert col in sql, f"{col} missing from emitted seed SQL"
NOTE: confirm the exact emit_d1_sql(...) entry-point signature against the file before running (it may take (input_dir, output_path) or read a config); adjust the call to match. The assertion content (column names present in the seed) is the contract.
- [ ] Step 2: Run test to verify it fails
Run: uv run python -m pytest packages/data/tests/runtime/test_emit_d1_sql.py::test_words_ddl_and_insert_include_variant_columns -v
Expected: FAIL (variants_str not in seed SQL).
- [ ] Step 3: Add columns to the ordered column list + DDL
In packages/data/src/phonolex_data/runtime/emit_d1_sql.py:
(a) In the words ordered column list (the list containing "phonemes_str", "cv_shape", "variants", ~lines 45-61), add the new column names after "variants":
"variants_str",
"cv_shapes",
"has_variants",
"phoneme_count_min",
"phoneme_count_max",
"syllable_count_min",
"syllable_count_max",
(b) In the CREATE TABLE words (...) DDL string (~lines 130-146), add matching column declarations after the cv_shape TEXT / variants TEXT lines and before the closing ). Mind the trailing commas: the column that precedes the new block needs one, and the final column before the closing ) must not have one, so drop the comma after syllable_count_max if it ends up last:
variants_str TEXT,
cv_shapes TEXT,
has_variants INTEGER,
phoneme_count_min INTEGER,
phoneme_count_max INTEGER,
syllable_count_min INTEGER,
syllable_count_max INTEGER,
(Booleans serialize to 0/1 INTEGER in D1, consistent with the existing is_canonical INTEGER handling.)
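The 0/1 storage can be checked locally with Python's built-in sqlite3 module, used here only as a stand-in for D1 (which is built on SQLite). The table below is a trimmed-down, hypothetical version of the words table, and the real emit writes SQL text rather than binding parameters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down stand-in for the real words table: only the columns needed here.
conn.execute(
    """
    CREATE TABLE words (
        word TEXT PRIMARY KEY,
        variants_str TEXT,
        has_variants INTEGER
    )
    """
)
rows = [
    ("hello", "|h|ə|l|oʊ||h|ɛ|l|oʊ|", True),
    ("cat", "|k|æ|t|", False),
]
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", rows)

# Python bools are bound as integers, so the column holds 1 / 0.
stored = conn.execute(
    "SELECT word, has_variants, typeof(has_variants) FROM words ORDER BY word"
).fetchall()
print(stored)  # [('cat', 0, 'integer'), ('hello', 1, 'integer')]
```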
- [ ] Step 4: Run test to verify it passes
Run: uv run python -m pytest packages/data/tests/runtime/test_emit_d1_sql.py::test_words_ddl_and_insert_include_variant_columns -v
Expected: PASS.
- [ ] Step 5: Run the runtime emit test suite (regression)
Run: uv run python -m pytest packages/data/tests/runtime/ -q
Expected: all pass (including test_d1_parity.py, which checks parquet↔SQL column parity — the new columns must be present on both sides; if it fails, the column list/DDL is out of sync with the schema).
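If the parity test fails and the mismatch is not obvious, a quick way to compare the column list against the DDL is to let SQLite parse the DDL and read the columns back. This is a hypothetical debugging sketch, not the logic of test_d1_parity.py; `WORDS_COLUMNS` and `WORDS_DDL` below are shortened stand-ins for the real list and DDL string.

```python
import sqlite3

# Hypothetical, shortened stand-ins for the real WORDS_COLUMNS list and DDL string.
WORDS_COLUMNS = ["word", "variants", "variants_str", "cv_shapes", "has_variants"]
WORDS_DDL = """
CREATE TABLE words (
    word TEXT PRIMARY KEY,
    variants TEXT,
    variants_str TEXT,
    cv_shapes TEXT,
    has_variants INTEGER
)
"""


def ddl_columns(ddl: str, table: str) -> list[str]:
    """Let SQLite parse the DDL, then read the column names back in order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


declared = ddl_columns(WORDS_DDL, "words")
missing = [c for c in WORDS_COLUMNS if c not in declared]
extra = [c for c in declared if c not in WORDS_COLUMNS]
print(missing, extra)  # [] []
```

Executing the DDL this way also surfaces a stray trailing comma immediately, because SQLite rejects the statement with an OperationalError.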
- [ ] Step 6: Commit
git add packages/data/src/phonolex_data/runtime/emit_d1_sql.py packages/data/tests/runtime/test_emit_d1_sql.py
git commit -m "feat(phon-154): add variant columns to D1 seed DDL + insert"
Task 5: Regenerate the runtime parquet + D1 seed (developer build step)
This is a heavy local build, not a unit test. It produces the artifacts Phase 2 will query. Run it AFTER Tasks 1-4 land.
Files:
- Generates (gitignored): data/runtime/words.parquet, pairs.parquet, edges.parquet
- Generates (LFS): packages/web/workers/scripts/d1-seed.sql
- [ ] Step 1: Rebuild the runtime parquet from source
Run: uv run python packages/data/scripts/build_runtime_parquet.py
Expected: completes; prints words: ~125K rows.
- [ ] Step 2: Spot-check the new columns in the built parquet
Run:
uv run python -c "import polars as pl; df=pl.read_parquet('data/runtime/words.parquet'); r=df.filter(pl.col('word')=='hello').row(0,named=True); print(r['variants_str'], '|', r['has_variants'], '|', r['cv_shapes'])"
Expected: a ||-bounded multi-variant string for "hello" (e.g. |h|ə|l|oʊ||h|ɛ|l|oʊ|), has_variants=True, and a pipe-bounded cv_shapes set.
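Beyond eyeballing one word, the format invariants can be checked mechanically. The helper below is a hypothetical sketch (the name `check_variants_str` and the specific checks are assumptions, not part of the plan); it works on plain values, so it could be applied to each row pulled from the built parquet.

```python
def check_variants_str(variants_str: str, has_variants: bool) -> list[str]:
    """Return a list of problems found; an empty list means the value looks sane."""
    problems = []
    if not (variants_str.startswith("|") and variants_str.endswith("|")):
        problems.append("not pipe-wrapped")
    if "|||" in variants_str:
        problems.append("empty phoneme or malformed boundary (|||)")
    # Variants are separated by `||` once the outer pipes are stripped.
    n_variants = len(variants_str.strip("|").split("||"))
    if has_variants != (n_variants > 1):
        problems.append(f"has_variants={has_variants} but {n_variants} variant(s) encoded")
    return problems


print(check_variants_str("|h|ə|l|oʊ||h|ɛ|l|oʊ|", True))  # []
print(check_variants_str("|k|æ|t|", False))              # []
print(check_variants_str("|k|æ|t|", True))               # one problem reported
```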
- [ ] Step 3: Emit the D1 seed
Run: uv run python packages/web/workers/scripts/export-to-d1.py
Expected: writes packages/web/workers/scripts/d1-seed.sql with the new columns.
- [ ] Step 4: Apply to local D1 (re-chunk + load)
Run:
uv run python packages/web/workers/scripts/chunk-seed-sql.py
cd packages/web/workers && for f in scripts/d1-chunks/chunk_*.sql; do npx wrangler d1 execute phonolex --local --file "$f"; done
- [ ] Step 5: Verify in local D1
Run (from packages/web/workers):
npx wrangler d1 execute phonolex --local --command "SELECT word, variants_str, has_variants FROM words WHERE word='hello';"
Expected: one row for hello with a populated variants_str and has_variants=1.
- [ ] Step 6: Commit the seed
git add packages/web/workers/scripts/d1-seed.sql
git commit -m "data(phon-154): reseed with variant-matching columns"
NOTE: the seed is LFS-tracked and large. Confirm git lfs status shows it staged via LFS. This reseed folds into / coordinates with the PHON-151 reseed — if PHON-151 lands first or concurrently, regenerate once on top of its vectors rather than double-bumping the seed.
Phase 1 done — exit criteria
- words.parquet + d1-seed.sql carry variants_str (|| boundary), cv_shapes, has_variants, and phoneme/syllable count ranges.
- uv run python -m pytest packages/data/tests/runtime/ -q green (incl. d1-parity).
- No behavior change yet — columns are present but unused. Safe to merge independently.
Next phases (separate plans, written once these column shapes are locked)
- Phase 2 — Worker matching: patterns.ts variant LIKE clauses (STARTS/ENDS/CONTAINS against variants_str), wordFilter.ts CV/count/exclusion across variants, worker tests, /api/words/search + /api/sentences.
- Phase 3 — Contrast pairs: minimal-pair / opposition generation across variants (build-time), reseed pairs.
- Phase 4 — Audio + frontend: /analyze per-variant scoring (attribution from best match), Lookup variant display, superscript has_variants flag on result rows, ProductionCard multi-variant.
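To make the column shape concrete for Phase 2 planning, here is one way STARTS/ENDS/CONTAINS LIKE patterns against variants_str could look. This is a sketch under assumptions, not the Phase 2 design: the real clauses will live in patterns.ts, the helper names are hypothetical, and Python's sqlite3 stands in for D1. It also assumes phoneme symbols never contain `|`, `%`, or `_`.

```python
import sqlite3


def wrap(phonemes: list[str]) -> str:
    """['h', 'ɛ'] -> 'h|ɛ' (inner part only; callers add the outer pipes)."""
    return "|".join(phonemes)


def starts_with(phonemes: list[str]) -> tuple[str, list[str]]:
    # The first variant begins at the start of the string; later ones follow `||`.
    p = wrap(phonemes)
    return "(variants_str LIKE ? OR variants_str LIKE ?)", [f"|{p}|%", f"%||{p}|%"]


def ends_with(phonemes: list[str]) -> tuple[str, list[str]]:
    # The last variant ends at the end of the string; earlier ones precede `||`.
    p = wrap(phonemes)
    return "(variants_str LIKE ? OR variants_str LIKE ?)", [f"%|{p}|", f"%|{p}||%"]


def contains(phonemes: list[str]) -> tuple[str, list[str]]:
    # A pipe-wrapped run cannot straddle `||`, so one pattern covers all variants.
    return "variants_str LIKE ?", [f"%|{wrap(phonemes)}|%"]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, variants_str TEXT)")
conn.executemany("INSERT INTO words VALUES (?, ?)", [
    ("hello", "|h|ə|l|oʊ||h|ɛ|l|oʊ|"),
    ("cat", "|k|æ|t|"),
])


def search(clause_and_params: tuple[str, list[str]]) -> list[str]:
    clause, params = clause_and_params
    sql = f"SELECT word FROM words WHERE {clause} ORDER BY word"
    return [row[0] for row in conn.execute(sql, params)]


print(search(starts_with(["h", "ɛ"])))  # ['hello'] (matched via the second variant)
print(search(ends_with(["æ", "t"])))    # ['cat']
print(search(contains(["ɛ", "l"])))     # ['hello']
```

The starts-with and ends-with cases need two patterns each because a variant can sit either at an edge of the string or next to a `||` boundary; the contains case needs only one.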