PHON-154 Variant-Aware Matching — Phase 1: Data Foundation

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Emit variant-matchable columns (variants_str, cv_shapes, has_variants, phoneme/syllable count ranges) into words.parquet and the D1 seed so downstream phases can match across all attested pronunciations.

Architecture: One row per word is preserved. A new variants_str column holds every attested pronunciation in pipe form, concatenated so that adjacent variants are separated by || (an unambiguous boundary: no phoneme can span it). A parallel cv_shapes set and count-range columns let CV-shape and count filters match any variant. All new columns are computed in emit_parquet._word_record_to_row from WordRecord.phonemes and WordRecord.variants (each variant dict carries phonemes, syllables, and syllable_count). This phase only ADDS columns; no behavior changes yet, so it is safe to land independently.
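To make the boundary claim concrete, here is a minimal sketch. The helper names (to_pipe_form, to_variants_str, contains_sequence) are illustrative and do not exist in the codebase. It shows that a pipe-wrapped search term can match inside one variant but never across the || seam between two variants.

```python
def to_pipe_form(phonemes: list[str]) -> str:
    """Wrap one pronunciation as |p1|p2|...|."""
    return "|" + "|".join(phonemes) + "|"


def to_variants_str(pronunciations: list[list[str]]) -> str:
    """Concatenate pipe-form pronunciations; adjacent variants meet at `||`."""
    return "".join(to_pipe_form(p) for p in pronunciations)


def contains_sequence(variants_str: str, phonemes: list[str]) -> bool:
    """True if the phonemes occur contiguously inside a single variant."""
    if not phonemes:
        # An empty query would render as `||` and match the seam itself.
        return False
    return to_pipe_form(phonemes) in variants_str


hello = to_variants_str([["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]])
print(hello)                                  # |h|ə|l|oʊ||h|ɛ|l|oʊ|
print(contains_sequence(hello, ["ɛ", "l"]))   # True: inside the second variant
print(contains_sequence(hello, ["oʊ", "h"]))  # False: would straddle the seam
```

The last query fails because the search term renders as |oʊ|h| while the stored text has |oʊ||h| at the seam. This assumes no phoneme symbol is empty or contains a pipe.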

Tech Stack: Python 3.12, Polars, pytest. Files in packages/data/.

Spec: docs/superpowers/specs/2026-06-15-phon-154-variant-aware-matching-design.md

Governing rule (carried from spec): include if ANY variant satisfies; exclude if ANY variant violates. This phase just produces the data; matching semantics land in Phase 2.
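One possible reading of the governing rule, as a sketch only. Phase 2 defines the real semantics in the worker. The function name, the predicate style, and the choice to require every include criterion (each satisfiable by a different variant) are assumptions made here for illustration.

```python
from typing import Callable

Pronunciation = list[str]
Predicate = Callable[[Pronunciation], bool]


def word_passes(
    variants: list[Pronunciation],
    include: list[Predicate],
    exclude: list[Predicate],
) -> bool:
    """Include if ANY variant satisfies; exclude if ANY variant violates."""
    # Assumption: every include criterion must be met by at least one variant.
    satisfies_includes = all(any(pred(v) for v in variants) for pred in include)
    # One offending variant is enough to drop the whole word.
    violates_exclusion = any(pred(v) for pred in exclude for v in variants)
    return satisfies_includes and not violates_exclusion


def starts_with_ai(variant: Pronunciation) -> bool:
    return variant[:1] == ["aɪ"]


def contains_i(variant: Pronunciation) -> bool:
    return "i" in variant


either = [["i", "ð", "ɚ"], ["aɪ", "ð", "ɚ"]]
print(word_passes(either, include=[starts_with_ai], exclude=[]))            # True
print(word_passes(either, include=[starts_with_ai], exclude=[contains_i]))  # False
```

The second call returns False because the first variant contains the excluded phoneme, even though the second variant satisfies the include criterion.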

Reference: existing shapes (already in the codebase)

WordRecord.variants: list[dict] — each dict has keys phonemes (list[str]), ipa (str), syllables (list[dict] with onset/nucleus/coda/stress), syllable_count (int), wcm_score. May be empty for single-pronunciation words; when CMU has alternates it includes the primary too. (packages/data/src/phonolex_data/pipeline/words.py:270-290)
Primary CV-shape derivation already used by the pipeline: per syllable "C"*len(onset) + "V" + "C"*len(coda), joined with - (packages/data/src/phonolex_data/pipeline/words.py:200-203).
_word_record_to_row already builds phonemes_str = "|" + "|".join(phonemes) + "|" and JSON-encodes variants (packages/data/src/phonolex_data/runtime/emit_parquet.py:36-45).
Words schema core columns live in _CORE_WORDS_COLUMNS (packages/data/src/phonolex_data/runtime/schema.py:37-79).
D1 emit: a WORDS_COLUMNS ordered list + a hand-written CREATE TABLE words (...) DDL in packages/data/src/phonolex_data/runtime/emit_d1_sql.py (phonemes_str/variants/cv_shape appear around lines 49-61 and 134-146).
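The shapes listed above can be sketched as typed dictionaries. The class names Syllable and Variant are hypothetical (the codebase stores plain dicts), and the int annotation on wcm_score is an assumption based on the fixtures in this plan. The cv_shape function follows the per-syllable rule quoted above.

```python
from typing import TypedDict


class Syllable(TypedDict):
    onset: list[str]
    nucleus: str
    coda: list[str]
    stress: int


class Variant(TypedDict):
    phonemes: list[str]
    ipa: str
    syllables: list[Syllable]
    syllable_count: int
    wcm_score: int  # assumption: the fixtures in this plan use ints


def cv_shape(syllables: list[Syllable]) -> str:
    """One 'C' per onset/coda consonant around a single 'V', joined with '-'."""
    return "-".join(
        "C" * len(s["onset"]) + "V" + "C" * len(s["coda"]) for s in syllables
    )


hello_primary: Variant = {
    "phonemes": ["h", "ə", "l", "oʊ"],
    "ipa": "həloʊ",
    "syllables": [
        {"onset": ["h"], "nucleus": "ə", "coda": [], "stress": 0},
        {"onset": ["l"], "nucleus": "oʊ", "coda": [], "stress": 1},
    ],
    "syllable_count": 2,
    "wcm_score": 3,
}

phonemes_str = "|" + "|".join(hello_primary["phonemes"]) + "|"
print(cv_shape(hello_primary["syllables"]))  # CV-CV
print(phonemes_str)                          # |h|ə|l|oʊ|
```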

Task 1: Add the variant columns to the words schema

Files:
- Modify: packages/data/src/phonolex_data/runtime/schema.py (_CORE_WORDS_COLUMNS, after the variants entry ~line 68)
- Test: packages/data/tests/runtime/test_schema.py

[ ] Step 1: Write the failing test

Add to packages/data/tests/runtime/test_schema.py:

def test_words_schema_has_variant_matching_columns():
    from phonolex_data.runtime.schema import words_schema
    schema = words_schema()
    for col, dtype in [
        ("variants_str", pl.Utf8),
        ("cv_shapes", pl.Utf8),
        ("has_variants", pl.Boolean),
        ("phoneme_count_min", pl.Int32),
        ("phoneme_count_max", pl.Int32),
        ("syllable_count_min", pl.Int32),
        ("syllable_count_max", pl.Int32),
    ]:
        assert schema[col] == dtype, f"{col} should be {dtype}"

(Ensure import polars as pl is present at the top of the test file.)

[ ] Step 2: Run test to verify it fails

Run: uv run python -m pytest packages/data/tests/runtime/test_schema.py::test_words_schema_has_variant_matching_columns -v
Expected: FAIL with KeyError: 'variants_str'.

[ ] Step 3: Add the columns

In packages/data/src/phonolex_data/runtime/schema.py, inside _CORE_WORDS_COLUMNS, immediately after the "variants": pl.Utf8, line, add:

    # PHON-154 variant-aware matching. variants_str = every attested pronunciation
    # in pipe form, concatenated so adjacent variants are separated by `||`
    # (boundary marker — no phoneme spans it). cv_shapes = pipe-bounded distinct
    # CV-shape set across variants. Counts as ranges so "match any variant" works.
    "variants_str": pl.Utf8,
    "cv_shapes": pl.Utf8,
    "has_variants": pl.Boolean,
    "phoneme_count_min": pl.Int32,
    "phoneme_count_max": pl.Int32,
    "syllable_count_min": pl.Int32,
    "syllable_count_max": pl.Int32,

[ ] Step 4: Run test to verify it passes

Run: uv run python -m pytest packages/data/tests/runtime/test_schema.py::test_words_schema_has_variant_matching_columns -v
Expected: PASS.

[ ] Step 5: Commit

git add packages/data/src/phonolex_data/runtime/schema.py packages/data/tests/runtime/test_schema.py
git commit -m "feat(phon-154): add variant-matching columns to words schema"

Task 2: Compute `variants_str` + `has_variants` in emit

Files:
- Modify: packages/data/src/phonolex_data/runtime/emit_parquet.py (_word_record_to_row, ~lines 36-45)
- Test: packages/data/tests/runtime/test_emit_parquet.py

[ ] Step 1: Write the failing test

Add to packages/data/tests/runtime/test_emit_parquet.py:

def test_variants_str_concatenates_with_double_pipe_boundary(tmp_path: Path):
    """A multi-pronunciation word emits all variants in |...||...| form,
    primary first; has_variants is True. Single-pronunciation words emit just
    the primary and has_variants is False."""
    db = LexicalDatabase(
        words={
            "hello": WordRecord(
                word="hello", has_phonology=True, is_canonical=True, pos="INTJ",
                phonemes=["h", "ə", "l", "oʊ"], phoneme_count=4, syllable_count=2,
                cv_shape="CV-CV",
                variants=[
                    {"phonemes": ["h", "ə", "l", "oʊ"], "ipa": "həloʊ",
                     "syllables": [{"onset": ["h"], "nucleus": "ə", "coda": [], "stress": 0},
                                   {"onset": ["l"], "nucleus": "oʊ", "coda": [], "stress": 1}],
                     "syllable_count": 2, "wcm_score": 3},
                    {"phonemes": ["h", "ɛ", "l", "oʊ"], "ipa": "hɛloʊ",
                     "syllables": [{"onset": ["h"], "nucleus": "ɛ", "coda": [], "stress": 0},
                                   {"onset": ["l"], "nucleus": "oʊ", "coda": [], "stress": 1}],
                     "syllable_count": 2, "wcm_score": 3},
                ],
            ),
            "cat": WordRecord(
                word="cat", has_phonology=True, is_canonical=True, pos="NOUN",
                phonemes=["k", "æ", "t"], phoneme_count=3, syllable_count=1,
                cv_shape="CVC", variants=[],
            ),
        },
        edges=[],
    )
    emit_parquet(db, tmp_path)
    df = pl.read_parquet(tmp_path / "words.parquet").sort("word")
    rows = {r["word"]: r for r in df.iter_rows(named=True)}
    assert rows["hello"]["variants_str"] == "|h|ə|l|oʊ||h|ɛ|l|oʊ|"
    assert rows["hello"]["has_variants"] is True
    assert rows["cat"]["variants_str"] == "|k|æ|t|"
    assert rows["cat"]["has_variants"] is False

[ ] Step 2: Run test to verify it fails

Run: uv run python -m pytest packages/data/tests/runtime/test_emit_parquet.py::test_variants_str_concatenates_with_double_pipe_boundary -v
Expected: FAIL (variants_str is None or the column is absent, because emit doesn't compute it yet).

[ ] Step 3: Implement the computation

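In packages/data/src/phonolex_data/runtime/emit_parquet.py, replace _word_record_to_row (lines 36-45) with the version below and add the new _cv_shape_of helper after it. Confirm that dataclasses and json are imported at the top of the module: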

def _word_record_to_row(record) -> dict:
    """Convert a WordRecord to a dict matching the words_schema."""
    row = {f.name: getattr(record, f.name) for f in dataclasses.fields(record)}
    # Pipe-delimited phoneme string for D1 LIKE-pattern queries
    phonemes = row.get("phonemes") or []
    row["phonemes_str"] = "|" + "|".join(phonemes) + "|" if phonemes else None

    # PHON-154: variant-matchable forms. Build the de-duplicated set of attested
    # pronunciations (primary first), then the other matching fields, BEFORE
    # serializing `variants` to JSON below.
    variants = row.get("variants") or []
    seqs: list[list[str]] = []
    if phonemes:
        seqs.append(phonemes)
    for v in variants:
        vp = v.get("phonemes")
        if vp and vp not in seqs:
            seqs.append(vp)
    # Each variant pipe-wrapped (|a|b|); concatenation makes boundaries `||`.
    row["variants_str"] = "".join("|" + "|".join(s) + "|" for s in seqs) if seqs else None
    row["has_variants"] = len(seqs) > 1

    # CV-shape set across variants (primary + each variant's syllables).
    cv_set: list[str] = []
    if row.get("cv_shape"):
        cv_set.append(row["cv_shape"])
    for v in variants:
        cv = _cv_shape_of(v.get("syllables"))
        if cv and cv not in cv_set:
            cv_set.append(cv)
    row["cv_shapes"] = "|" + "|".join(cv_set) + "|" if cv_set else None

    # Count ranges across variants (record.phoneme_count/syllable_count are the
    # primary; syllable_count is already the max across variants per pipeline).
    phon_counts = [len(s) for s in seqs] or ([row["phoneme_count"]] if row.get("phoneme_count") else [])
    syl_counts = [c for c in ([row.get("syllable_count")] + [v.get("syllable_count") for v in variants]) if c]
    row["phoneme_count_min"] = min(phon_counts) if phon_counts else None
    row["phoneme_count_max"] = max(phon_counts) if phon_counts else None
    row["syllable_count_min"] = min(syl_counts) if syl_counts else None
    row["syllable_count_max"] = max(syl_counts) if syl_counts else None

    # Variants → JSON string (schema is pl.Utf8; Parquet can't write empty struct)
    row["variants"] = json.dumps(variants) if variants else None
    return row


def _cv_shape_of(syllables) -> str | None:
    """CV skeleton from a variant's syllable dicts, matching the pipeline's
    primary cv_shape derivation (C*onset + V + C*coda per syllable, joined '-')."""
    if not syllables:
        return None
    parts = ["C" * len(s["onset"]) + "V" + "C" * len(s["coda"]) for s in syllables]
    return "-".join(parts) if parts else None
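The count-range columns can be illustrated in isolation. This is a minimal sketch: the function names are hypothetical, and the two transcriptions are illustrative values chosen to differ in length, not rows taken from the dataset.

```python
def count_range(pronunciations: list[list[str]]) -> tuple[int, int]:
    """Smallest and largest phoneme count across a non-empty variant list."""
    counts = [len(p) for p in pronunciations]
    return min(counts), max(counts)


def range_overlaps(count_min: int, count_max: int, lo: int, hi: int) -> bool:
    """True if the stored range [count_min, count_max] intersects [lo, hi]."""
    return count_min <= hi and count_max >= lo


# Illustrative variants of one word: six phonemes vs. five phonemes.
variants = [
    ["f", "æ", "m", "ə", "l", "i"],
    ["f", "æ", "m", "l", "i"],
]
phoneme_count_min, phoneme_count_max = count_range(variants)
print(phoneme_count_min, phoneme_count_max)                               # 5 6
print(range_overlaps(phoneme_count_min, phoneme_count_max, lo=5, hi=5))  # True
print(range_overlaps(phoneme_count_min, phoneme_count_max, lo=7, hi=9))  # False
```

An overlap test on the stored range is exact only when the variant counts have no gaps. With counts of 3 and 5, for example, a filter for exactly 4 would overlap the range [3, 5] even though no variant has 4 phonemes.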

[ ] Step 4: Run test to verify it passes

Run: uv run python -m pytest packages/data/tests/runtime/test_emit_parquet.py::test_variants_str_concatenates_with_double_pipe_boundary -v
Expected: PASS.

[ ] Step 5: Commit

git add packages/data/src/phonolex_data/runtime/emit_parquet.py packages/data/tests/runtime/test_emit_parquet.py
git commit -m "feat(phon-154): emit variants_str/has_variants + variant CV/count fields"

Task 3: Cover `cv_shapes` and count ranges with a test

Files:
- Test: packages/data/tests/runtime/test_emit_parquet.py

[ ] Step 1: Write the test (implementation already done in Task 2)

Add to packages/data/tests/runtime/test_emit_parquet.py (same fixture shape as the Task 2 test, using "either" as the word):

def test_variant_cv_shapes_and_count_ranges(tmp_path: Path):
    db = LexicalDatabase(
        words={
            "either": WordRecord(
                word="either", has_phonology=True, is_canonical=True, pos="ADV",
                phonemes=["i", "ð", "ɚ"], phoneme_count=3, syllable_count=2,
                cv_shape="V-CV",
                variants=[
                    {"phonemes": ["i", "ð", "ɚ"], "ipa": "iðɚ",
                     "syllables": [{"onset": [], "nucleus": "i", "coda": [], "stress": 1},
                                   {"onset": ["ð"], "nucleus": "ɚ", "coda": [], "stress": 0}],
                     "syllable_count": 2, "wcm_score": 2},
                    {"phonemes": ["aɪ", "ð", "ɚ"], "ipa": "aɪðɚ",
                     "syllables": [{"onset": [], "nucleus": "aɪ", "coda": [], "stress": 1},
                                   {"onset": ["ð"], "nucleus": "ɚ", "coda": [], "stress": 0}],
                     "syllable_count": 2, "wcm_score": 2},
                ],
            ),
        },
        edges=[],
    )
    emit_parquet(db, tmp_path)
    r = pl.read_parquet(tmp_path / "words.parquet").row(0, named=True)
    # Both variants are V-CV → the set has one shape, pipe-bounded.
    assert r["cv_shapes"] == "|V-CV|"
    # Both pronunciations are 3 phonemes / 2 syllables.
    assert r["phoneme_count_min"] == 3 and r["phoneme_count_max"] == 3
    assert r["syllable_count_min"] == 2 and r["syllable_count_max"] == 2

[ ] Step 2: Run test to verify it passes

Run: uv run python -m pytest packages/data/tests/runtime/test_emit_parquet.py::test_variant_cv_shapes_and_count_ranges -v
Expected: PASS (logic was implemented in Task 2).

[ ] Step 3: Commit

git add packages/data/tests/runtime/test_emit_parquet.py
git commit -m "test(phon-154): cover variant cv_shapes + count ranges"

Task 4: Add the columns to the D1 seed emit (DDL + insert)

Files:
- Modify: packages/data/src/phonolex_data/runtime/emit_d1_sql.py (the words WORDS_COLUMNS ordered list ~lines 45-61, and the CREATE TABLE words (...) DDL ~lines 130-146)
- Test: packages/data/tests/runtime/test_emit_d1_sql.py

[ ] Step 1: Write the failing test

Add to packages/data/tests/runtime/test_emit_d1_sql.py (follow the file's existing fixture/style; this asserts the generated SQL declares + populates the new columns):

def test_words_ddl_and_insert_include_variant_columns(tmp_path: Path):
    """The words CREATE TABLE + INSERT must carry the PHON-154 variant columns."""
    from phonolex_data.runtime.emit_d1_sql import emit_d1_sql
    # Build a tiny words.parquet via emit_parquet, then emit the seed.
    from phonolex_data.runtime.emit_parquet import emit_parquet
    from phonolex_data.pipeline.schema import LexicalDatabase, WordRecord
    db = LexicalDatabase(
        words={"cat": WordRecord(word="cat", has_phonology=True, is_canonical=True,
                                 pos="NOUN", phonemes=["k", "æ", "t"],
                                 phoneme_count=3, syllable_count=1, cv_shape="CVC")},
        edges=[],
    )
    emit_parquet(db, tmp_path)
    emit_d1_sql(tmp_path, tmp_path / "d1-seed.sql")
    sql = (tmp_path / "d1-seed.sql").read_text()
    for col in ["variants_str", "cv_shapes", "has_variants",
                "phoneme_count_min", "phoneme_count_max",
                "syllable_count_min", "syllable_count_max"]:
        assert col in sql, f"{col} missing from emitted seed SQL"

NOTE: confirm the exact emit_d1_sql(...) entry-point signature against the file before running (it may take (input_dir, output_path) or read a config); adjust the call to match. The assertion content (column names present in the seed) is the contract.

[ ] Step 2: Run test to verify it fails

Run: uv run python -m pytest packages/data/tests/runtime/test_emit_d1_sql.py::test_words_ddl_and_insert_include_variant_columns -v
Expected: FAIL (variants_str not in seed SQL).

[ ] Step 3: Add columns to the ordered column list + DDL

In packages/data/src/phonolex_data/runtime/emit_d1_sql.py:

(a) In the words ordered column list (the list containing "phonemes_str", "cv_shape", "variants", ~lines 45-61), add the new column names after "variants":

    "variants_str",
    "cv_shapes",
    "has_variants",
    "phoneme_count_min",
    "phoneme_count_max",
    "syllable_count_min",
    "syllable_count_max",

(b) In the CREATE TABLE words (...) DDL string (~lines 130-146), add matching column declarations after the cv_shape TEXT / variants TEXT lines, before the closing ). Mind the trailing commas: if the new columns end up last in the table definition, drop the comma after syllable_count_max, because SQLite rejects a trailing comma before the closing parenthesis:

  variants_str TEXT,
  cv_shapes TEXT,
  has_variants INTEGER,
  phoneme_count_min INTEGER,
  phoneme_count_max INTEGER,
  syllable_count_min INTEGER,
  syllable_count_max INTEGER,

(Booleans serialize to 0/1 INTEGER in D1, consistent with the existing is_canonical INTEGER handling.)
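The column declarations above can be tried out locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for D1 (which is built on SQLite). The trimmed table and the sample rows are hypothetical and are not the real seed DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE words (
      word TEXT PRIMARY KEY,
      variants_str TEXT,
      cv_shapes TEXT,
      has_variants INTEGER,
      phoneme_count_min INTEGER,
      phoneme_count_max INTEGER,
      syllable_count_min INTEGER,
      syllable_count_max INTEGER
    )
    """
)

# (word, variants_str, cv_shapes, has_variants, counts...)
records = [
    ("hello", "|h|ə|l|oʊ||h|ɛ|l|oʊ|", "|CV-CV|", True, 4, 4, 2, 2),
    ("cat", "|k|æ|t|", "|CVC|", False, 3, 3, 1, 1),
]
# Convert the boolean explicitly so the stored value is visibly 0 or 1.
rows = [(w, vs, cv, int(hv), *counts) for (w, vs, cv, hv, *counts) in records]
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

stored = conn.execute(
    "SELECT word, has_variants FROM words ORDER BY word"
).fetchall()
multi = [r[0] for r in conn.execute("SELECT word FROM words WHERE has_variants = 1")]
print(stored)  # [('cat', 0), ('hello', 1)]
print(multi)   # ['hello']
```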

[ ] Step 4: Run test to verify it passes

Run: uv run python -m pytest packages/data/tests/runtime/test_emit_d1_sql.py::test_words_ddl_and_insert_include_variant_columns -v
Expected: PASS.

[ ] Step 5: Run the runtime emit test suite (regression)

Run: uv run python -m pytest packages/data/tests/runtime/ -q
Expected: all pass, including test_d1_parity.py, which checks parquet↔SQL column parity. The new columns must be present on both sides; if the test fails, the column list or DDL is out of sync with the schema.
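When a parity failure is hard to read, a quick local diff between an ordered column list and a DDL string can help pinpoint the missing name. The sketch below is not how test_d1_parity.py works (its internals are not described here). The column list, the DDL, and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical, trimmed stand-in for the ordered words column list.
EXPECTED_COLUMNS = ["word", "variants", "variants_str", "cv_shapes", "has_variants"]

# Hypothetical DDL that forgot one of the new columns.
STALE_DDL = """
CREATE TABLE words (
  word TEXT PRIMARY KEY,
  variants TEXT,
  variants_str TEXT,
  cv_shapes TEXT
)
"""


def ddl_columns(ddl: str, table: str = "words") -> list[str]:
    """Column names, in order, of the table the DDL creates."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(ddl)
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    finally:
        conn.close()


def parity_gaps(expected: list[str], ddl: str) -> tuple[list[str], list[str]]:
    """Return (missing from the DDL, present only in the DDL)."""
    actual = ddl_columns(ddl)
    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]
    return missing, extra


print(parity_gaps(EXPECTED_COLUMNS, STALE_DDL))  # (['has_variants'], [])
```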

[ ] Step 6: Commit

git add packages/data/src/phonolex_data/runtime/emit_d1_sql.py packages/data/tests/runtime/test_emit_d1_sql.py
git commit -m "feat(phon-154): add variant columns to D1 seed DDL + insert"

Task 5: Regenerate the runtime parquet + D1 seed (developer build step)

This is a heavy local build, not a unit test. It produces the artifacts Phase 2 will query. Run it AFTER Tasks 1-4 land.

Files:
- Generates (gitignored): data/runtime/words.parquet, pairs.parquet, edges.parquet
- Generates (LFS): packages/web/workers/scripts/d1-seed.sql

[ ] Step 1: Rebuild the runtime parquet from source

Run: uv run python packages/data/scripts/build_runtime_parquet.py
Expected: completes; prints words: ~125K rows.

[ ] Step 2: Spot-check the new columns in the built parquet

Run:

uv run python -c "import polars as pl; df=pl.read_parquet('data/runtime/words.parquet'); print(df.filter(pl.col('word')=='hello').select('variants_str','has_variants','cv_shapes').row(0,named=True))"

Expected: prints a ||-bounded multi-variant string for "hello" (e.g. |h|ə|l|oʊ||h|ɛ|l|oʊ|), has_variants=True, and a pipe-bounded cv_shapes set.
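For a closer look at a spot-checked value, the stored string can be split back into its variants. This is a small sketch with a hypothetical helper name. It assumes no phoneme symbol is empty or contains a pipe.

```python
def split_variants(variants_str: str) -> list[list[str]]:
    """Invert the pipe form: '|a|b||c|d|' -> [['a', 'b'], ['c', 'd']]."""
    if not variants_str:
        return []
    # Drop the outer pipes, cut at the `||` seams, then split each variant.
    return [v.split("|") for v in variants_str.strip("|").split("||")]


parsed = split_variants("|h|ə|l|oʊ||h|ɛ|l|oʊ|")
print(parsed)       # [['h', 'ə', 'l', 'oʊ'], ['h', 'ɛ', 'l', 'oʊ']]
print(len(parsed))  # 2, so has_variants should be True for this row
```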

[ ] Step 3: Emit the D1 seed

Run: uv run python packages/web/workers/scripts/export-to-d1.py
Expected: writes packages/web/workers/scripts/d1-seed.sql with the new columns.

[ ] Step 4: Apply to local D1 (re-chunk + load)

Run:

uv run python packages/web/workers/scripts/chunk-seed-sql.py
cd packages/web/workers && for f in scripts/d1-chunks/chunk_*.sql; do npx wrangler d1 execute phonolex --local --file "$f"; done

Expected: chunks apply cleanly.

[ ] Step 5: Verify in local D1

Run (from packages/web/workers):

npx wrangler d1 execute phonolex --local --command "SELECT word, variants_str, has_variants FROM words WHERE word='hello';"

Expected: row shows the multi-variant variants_str and has_variants=1.

[ ] Step 6: Commit the seed

git add packages/web/workers/scripts/d1-seed.sql
git commit -m "data(phon-154): reseed with variant-matching columns"

NOTE: the seed is LFS-tracked and large. Confirm git lfs status shows it staged via LFS. This reseed folds into / coordinates with the PHON-151 reseed — if PHON-151 lands first or concurrently, regenerate once on top of its vectors rather than double-bumping the seed.

Phase 1 done — exit criteria

words.parquet + d1-seed.sql carry variants_str (|| boundary), cv_shapes, has_variants, and phoneme/syllable count ranges.
uv run python -m pytest packages/data/tests/runtime/ -q green (incl. d1-parity).
No behavior change yet — columns are present but unused. Safe to merge independently.

Next phases (separate plans, written once these column shapes are locked)

Phase 2 — Worker matching: patterns.ts variant LIKE clauses (STARTS/ENDS/CONTAINS against variants_str), wordFilter.ts CV/count/exclusion across variants, worker tests, /api/words/search + /api/sentences.
Phase 3 — Contrast pairs: minimal-pair / opposition generation across variants (build-time), reseed pairs.
Phase 4 — Audio + frontend: /analyze per-variant scoring (attribution from best match), Lookup variant display, superscript has_variants flag on result rows, ProductionCard multi-variant.
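To sanity-check that the Phase 1 column shape can support the Phase 2 LIKE clauses, here is one possible sketch of STARTS/ENDS/CONTAINS patterns against variants_str. It is written in Python with sqlite3 rather than in the worker's TypeScript. The pattern shapes, function names, and sample rows are assumptions for illustration, not the Phase 2 design.

```python
import sqlite3


def pipe(phonemes: list[str]) -> str:
    return "|" + "|".join(phonemes) + "|"


def like_patterns(mode: str, phonemes: list[str]) -> list[str]:
    """LIKE patterns that anchor a phoneme sequence within any one variant."""
    core = pipe(phonemes)
    if mode == "CONTAINS":
        return [f"%{core}%"]
    if mode == "STARTS":
        # First variant starts the string; later ones follow a closing pipe.
        return [f"{core}%", f"%|{core}%"]
    if mode == "ENDS":
        # Last variant ends the string; earlier ones precede an opening pipe.
        return [f"%{core}", f"%{core}|%"]
    raise ValueError(f"unknown mode: {mode}")


def search(conn: sqlite3.Connection, mode: str, phonemes: list[str]) -> list[str]:
    patterns = like_patterns(mode, phonemes)
    where = " OR ".join("variants_str LIKE ?" for _ in patterns)
    sql = f"SELECT word FROM words WHERE {where} ORDER BY word"
    return [row[0] for row in conn.execute(sql, patterns)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, variants_str TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [
        ("hello", "|h|ə|l|oʊ||h|ɛ|l|oʊ|"),
        ("either", "|i|ð|ɚ||aɪ|ð|ɚ|"),
        ("cat", "|k|æ|t|"),
    ],
)

print(search(conn, "STARTS", ["aɪ"]))           # ['either']
print(search(conn, "ENDS", ["i", "ð", "ɚ"]))    # ['either']
print(search(conn, "STARTS", ["l"]))            # []
print(search(conn, "CONTAINS", ["æ"]))          # ['cat']
```

The STARTS query for aɪ matches only through the second variant of "either", and the ENDS query matches through the first. Without the || seam, neither anchor could tell a variant boundary from an ordinary phoneme boundary. The sketch assumes phoneme symbols never contain the LIKE wildcards % or _.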