PHON-154: Variant-aware matching + display across all tools
Status: design approved 2026-06-15 · Ticket: PHON-154 · Branch: feature/phon-154-variant-aware-matching
Blocks: PHON-152 (v6 audio prod cutover). Relates: PHON-153, PHON-151 (reseed).
Problem
The lexicon carries multiple attested pronunciations per word (words.variants, JSON), but no matching or scoring path consumes them — everything keys off the single primary pronunciation:
- Build: emit_parquet._word_record_to_row derives phonemes_str from the primary phonemes; variants is json.dumps'd to a column that is never indexed/matched.
- Worker: patterns.ts matches phonemes_str / initial_phoneme / final_phoneme (primary-only); wordFilter.ts, sentences.ts, contrastive.ts have no variant references.
- Audio: /analyze + /pronounce align/score against the single primary (SELECT phonemes FROM words); the scorer self-documents "assumes the speaker intends canonical."
- Frontend: Lookup shows only the primary ipa; tools never surface alternates.
Consequence (strictly wrong): a speaker producing a valid attested variant (e.g. "hello" → hɛloʊ, CMU HELLO(1)) is scored as a deviation; an ɛ phoneme filter misses "hello"; minimal-pair / corpus constraints ignore variants. This must be fixed before audio ships to production.
Governing principle
Conservative, applied uniformly: a word matches an include-constraint if any attested variant satisfies it; a word is excluded by an exclude-constraint if any attested variant violates it. Wherever a new variant interaction surfaces during implementation and the semantics are ambiguous, choose the interpretation under which a single attested variant is enough to bring the word into the constraint's scope (match-on-any for includes, exclusion-on-any for excludes).
Matching across variants is default-on everywhere — no UI toggle, no internal knob.
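The principle can be sketched as a small predicate. This is an illustrative Python sketch, not the Worker implementation: the function name passes, the Variant alias, and the representation of constraints as callables are all assumptions made for the example.

```python
from typing import Callable, Sequence

# Assumption: one pronunciation is a list of phoneme symbols.
Variant = Sequence[str]
Constraint = Callable[[Variant], bool]


def passes(variants: Sequence[Variant],
           include: Sequence[Constraint] = (),
           exclude: Sequence[Constraint] = ()) -> bool:
    """Include-on-any, exclude-on-any across a word's attested variants."""
    # Each include-constraint must be satisfied by at least one variant.
    included = all(any(test(v) for v in variants) for test in include)
    # A single variant violating any exclude-constraint removes the word.
    excluded = any(test(v) for v in variants for test in exclude)
    return included and not excluded


hello = [["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]]


def has_eh(variant: Variant) -> bool:
    return "ɛ" in variant


print(passes(hello, include=[has_eh]))  # True: the second variant contains ɛ
print(passes(hello, exclude=[has_eh]))  # False: the second variant violates the exclusion
```

Note that each constraint is evaluated independently, so two include-constraints may be satisfied by two different variants of the same word. That mirrors what AND-ed per-clause SQL matching does.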
Scope
In: data-layer variant-matchable forms; Worker phoneme/CV/count matching across variants; minimal-pair generation across variants; audio per-variant scoring; frontend variant display + a superscript "has variants" flag; reseed. Word-level property norms (frequency, AoA, concreteness, …) are pronunciation-independent and unchanged.
A. Data layer
One row per word is preserved (results stay per-word; no DISTINCT, no pagination/bind-batching breakage — the reason pipe-delimited columns beat per-variant rows here).
New/changed columns on words:
- variants_str (TEXT): all attested pronunciations (primary first), each rendered in the existing |p1|p2|…|pn| pipe form, concatenated directly. Because each segment is pipe-wrapped, the join between adjacent variants is ||, which doubles as an unambiguous variant boundary (no phoneme can span it).
  - Example: hello → |h|ə|l|oʊ||h|ɛ|l|oʊ|
- phonemes_str is retained unchanged (primary; canonical display + backward-compat).
- cv_shapes (TEXT): distinct CV shapes across variants as a pipe-bounded set (|CVC|CVCV|), matched with LIKE '%|shape|%'; derived at build from each variant's syllable structure. cv_shape (primary) retained.
- phoneme_count / syllable_count: matched as a range (min/max across variants). Add phoneme_count_min/max, syllable_count_min/max (or an equivalent compact form) so count filters match if any variant falls in range. Primary columns retained for display.
- has_variants (INTEGER 0/1): convenience flag for the frontend superscript (derivable as variant_count > 1, materialized to avoid parsing JSON on every render).
words is currently ~19 columns; these additions stay well under the D1 100-column limit. The variants JSON column stays (full per-variant phonemes/ipa/syllables for display + audio).
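A minimal sketch of how the new columns could be derived at build time. The helper names (pipe_form, cv_shape_of, variant_columns) are hypothetical, and the CV classification here is a simple per-phoneme vowel lookup over an illustrative vowel subset; the real build derives shapes from each variant's syllable structure.

```python
from typing import Dict, List, Sequence, Union

# Assumption: illustrative vowel subset, just enough for the examples below.
VOWELS = {"ə", "ɛ", "oʊ", "æ", "i"}


def pipe_form(phonemes: Sequence[str]) -> str:
    """Render one pronunciation as |p1|p2|...|pn|."""
    return "|" + "|".join(phonemes) + "|"


def cv_shape_of(phonemes: Sequence[str]) -> str:
    """Per-phoneme C/V string (stand-in for the syllable-based derivation)."""
    return "".join("V" if p in VOWELS else "C" for p in phonemes)


def variant_columns(variants: List[Sequence[str]]) -> Dict[str, Union[str, int]]:
    """Compute the variant-matchable columns for one word (primary first)."""
    # dict.fromkeys keeps first-seen order while dropping duplicate shapes.
    shapes = list(dict.fromkeys(cv_shape_of(v) for v in variants))
    counts = [len(v) for v in variants]
    return {
        # Adjacent pipe-wrapped variants join as "||", the variant boundary.
        "variants_str": "".join(pipe_form(v) for v in variants),
        "cv_shapes": "|" + "|".join(shapes) + "|",
        "phoneme_count_min": min(counts),
        "phoneme_count_max": max(counts),
        "has_variants": int(len(variants) > 1),
    }


hello = [["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]]
family = [["f", "æ", "m", "ə", "l", "i"], ["f", "æ", "m", "l", "i"]]

print(variant_columns(hello)["variants_str"])  # |h|ə|l|oʊ||h|ɛ|l|oʊ|
print(variant_columns(family)["cv_shapes"])    # |CVCVCV|CVCCV|
```

Both variants of "hello" share one shape, so its cv_shapes collapses to |CVCV|, while "family" carries two shapes and a phoneme-count range of 5 to 6.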
B. Matching (Worker)
patterns.ts emits variant-aware LIKE clauses against variants_str:
- CONTAINS / CONTAINS_MEDIAL (SQL part): variants_str LIKE '%|seq|%'
- STARTS_WITH: (variants_str LIKE '|seq|%' OR variants_str LIKE '%||seq|%'); matches the first variant, or any later variant (preceded by ||).
- ENDS_WITH: (variants_str LIKE '%|seq|' OR variants_str LIKE '%|seq||%'); matches the last variant, or any earlier variant (followed by ||).
- The single-phoneme initial_phoneme / final_phoneme equality fast-path is superseded by the STARTS_WITH/ENDS_WITH variant patterns above (minor extra LIKE cost; correctness over the micro-optimization).
- CONTAINS_MEDIAL app post-filter: checks medial position (not first/last) within each variant; matches if any variant satisfies. The route already fetches candidate rows; extend the post-filter to iterate variants.
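The three patterns above can be exercised against an in-memory SQLite database. D1 is SQLite-based, so Python's stdlib sqlite3 is used here as a stand-in. This is a sketch, not the patterns.ts code: like_clause and search are hypothetical helpers, and the three-word table is made-up test data.

```python
import sqlite3
from typing import List, Sequence, Tuple


def like_clause(op: str, seq: Sequence[str]) -> Tuple[str, List[str]]:
    """Return (sql, params) matching `seq` within any variant of variants_str."""
    body = "|".join(seq)
    if op == "CONTAINS":
        return "variants_str LIKE ?", [f"%|{body}|%"]
    if op == "STARTS_WITH":
        # First variant, or a later variant preceded by the "||" boundary.
        return ("(variants_str LIKE ? OR variants_str LIKE ?)",
                [f"|{body}|%", f"%||{body}|%"])
    if op == "ENDS_WITH":
        # Last variant, or an earlier variant followed by the "||" boundary.
        return ("(variants_str LIKE ? OR variants_str LIKE ?)",
                [f"%|{body}|", f"%|{body}||%"])
    raise ValueError(f"unsupported op: {op}")


def search(conn: sqlite3.Connection, op: str, seq: Sequence[str]) -> List[str]:
    sql, params = like_clause(op, seq)
    rows = conn.execute(f"SELECT word FROM words WHERE {sql} ORDER BY word", params)
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, variants_str TEXT)")
conn.executemany("INSERT INTO words VALUES (?, ?)", [
    ("hello", "|h|ə|l|oʊ||h|ɛ|l|oʊ|"),
    ("either", "|i|ð|ɚ||aɪ|ð|ɚ|"),
    ("low", "|l|oʊ|"),
])

print(search(conn, "CONTAINS", ["ɛ"]))        # ['hello'], via the non-primary variant
print(search(conn, "STARTS_WITH", ["aɪ"]))    # ['either'], via the non-primary variant
print(search(conn, "ENDS_WITH", ["l", "oʊ"])) # ['hello', 'low']
print(search(conn, "CONTAINS", ["oʊ", "h"]))  # [], the sequence cannot span "||"
```

The last query is the boundary check: "oʊ" ends hello's first variant and "h" starts its second, but the doubled pipe between them prevents the two-phoneme pattern from matching.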
wordFilter.ts (compileWordFilter, shared by /api/words/search + /api/sentences):
- Phoneme-pattern clauses use the variant patterns above.
- CV-shape filter matches against cv_shapes (any variant).
- phoneme_count / syllable_count filters match the variant range.
- Exclusion (phoneme-exclude, CV-exclude): excluded if any variant violates (conservative principle).
- is_canonical scoping is word-level and unchanged.
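A sketch of how the CV-shape, count-range, and exclusion rules above might compile to SQL, again using stdlib sqlite3 as a stand-in for D1. compile_variant_filter is a hypothetical simplification, not the real compileWordFilter, and the rows are illustrative.

```python
import sqlite3
from typing import List, Optional, Tuple


def compile_variant_filter(cv_shape: Optional[str] = None,
                           phonemes_between: Optional[Tuple[int, int]] = None,
                           exclude_phoneme: Optional[str] = None
                           ) -> Tuple[str, list]:
    """Build a WHERE fragment and bind params for variant-aware filters."""
    clauses: List[str] = []
    params: list = []
    if cv_shape is not None:
        # Any variant's shape appears in the pipe-bounded set.
        clauses.append("cv_shapes LIKE ?")
        params.append(f"%|{cv_shape}|%")
    if phonemes_between is not None:
        lo, hi = phonemes_between
        # Range overlap: the word's [min, max] intersects the filter's [lo, hi].
        clauses.append("(phoneme_count_min <= ? AND phoneme_count_max >= ?)")
        params += [hi, lo]
    if exclude_phoneme is not None:
        # Excluded if ANY variant contains the phoneme.
        clauses.append("variants_str NOT LIKE ?")
        params.append(f"%|{exclude_phoneme}|%")
    return (" AND ".join(clauses) or "1"), params


def search(conn: sqlite3.Connection, **filters) -> List[str]:
    where, params = compile_variant_filter(**filters)
    rows = conn.execute(f"SELECT word FROM words WHERE {where} ORDER BY word", params)
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE words (
    word TEXT, variants_str TEXT, cv_shapes TEXT,
    phoneme_count_min INTEGER, phoneme_count_max INTEGER)""")
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?)", [
    ("hello", "|h|ə|l|oʊ||h|ɛ|l|oʊ|", "|CVCV|", 4, 4),
    ("family", "|f|æ|m|ə|l|i||f|æ|m|l|i|", "|CVCVCV|CVCCV|", 5, 6),
    ("low", "|l|oʊ|", "|CV|", 2, 2),
])

print(search(conn, cv_shape="CVCCV"))        # ['family'], via the non-primary variant
print(search(conn, phonemes_between=(5, 5))) # ['family']
print(search(conn, exclude_phoneme="ɛ"))     # ['family', 'low']
```

One caveat with the min/max form: it is exact only when a word's variant counts are contiguous. A word with variants of 3 and 6 phonemes would overlap a 4-to-5 filter even though no variant falls in that range. That errs toward inclusion, which is consistent with the governing principle, but it may be worth a test case.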
C. Contrast Sets / minimal pairs
Minimal-pair / maximal-opposition / multiple-opposition generation (build-time, db.derived.minimal_pairs → pairs.parquet) considers variants: words A and B form a pair if some variant of A and some variant of B stand in the target relation (one-phoneme difference, etc.). Results surface per-word (the pair is between words, witnessed by a variant). Expect the pairs table to grow; record the new row count. The is_canonical pair flag stays word-level.
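The "some variant of A and some variant of B" rule can be sketched as a cross-product search that returns the witnessing variants. This is an illustration only: the function names are hypothetical, and the target relation is narrowed to a same-length, single-substitution difference, which is an assumption about what "one-phoneme difference" means in db.derived.minimal_pairs.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Variant = Sequence[str]


def one_phoneme_apart(a: Variant, b: Variant) -> bool:
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def pair_witness(variants_a: List[Variant],
                 variants_b: List[Variant]) -> Optional[Tuple[Variant, Variant]]:
    """Return the first (variant of A, variant of B) in the target relation, if any."""
    for va, vb in product(variants_a, variants_b):
        if one_phoneme_apart(va, vb):
            return va, vb
    return None


hello = [["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]]
fellow = [["f", "ɛ", "l", "oʊ"]]

# The primaries differ in two positions, so a primary-only pass finds no pair.
print(one_phoneme_apart(hello[0], fellow[0]))  # False
# The non-primary variant of "hello" witnesses the pair.
print(pair_witness(hello, fellow))
```

Because variants are listed primary first, the search prefers a primary-to-primary witness when one exists. Storing the witness alongside the pair would let the UI explain why two words were paired.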
D. Audio
/analyze (and /pronounce where applicable):
- Fetch the full variant set for the target word (not just phonemes).
- Align/score the produced transcript against each variant, returning a per-variant array: [{ variant_phonemes, ipa, positions, deviations, features }, …] (primary first).
- Display: the UI lists every variant with its own deviation overlay + score; the clinician picks which they care about. No auto-"best" collapsing in the display.
- Session attribution (the pooled read): fed from the best-matching variant per production (lowest aggregate deviation), so the attribution reflects the closest legitimate target rather than an arbitrary one. The per-production feature vector pooled into /attribute is the best-matching variant's.
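A sketch of per-variant scoring followed by best-match selection for attribution. Plain Levenshtein distance over phoneme lists stands in for the scorer's aggregate deviation, which is an assumption; the real aligner and its positions/deviations/features payload are not reproduced here. The names score_per_variant, best_variant, and the deviation_count key are hypothetical.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def score_per_variant(produced: Sequence[str],
                      variants: List[Sequence[str]]) -> List[Dict]:
    """Score the production against every variant, preserving primary-first order."""
    return [{"variant_phonemes": list(v),
             "deviation_count": edit_distance(produced, v)}
            for v in variants]


def best_variant(scores: List[Dict]) -> Dict:
    """Lowest deviation wins; min() keeps the earliest entry on ties (the primary)."""
    return min(scores, key=lambda s: s["deviation_count"])


hello = [["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]]
produced = ["h", "ɛ", "l", "oʊ"]

scores = score_per_variant(produced, hello)
print([s["deviation_count"] for s in scores])    # [1, 0]
print(best_variant(scores)["variant_phonemes"])  # ['h', 'ɛ', 'l', 'oʊ']
```

The full scores list is what the display would render, one row per variant, while only the best_variant result would feed the pooled attribution. Resolving ties toward the primary is a choice made in this sketch, not something the design specifies.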
E. Frontend display
- Lookup card: render all variant pronunciations (primary as headline, alternates listed).
- Result rows / cards (Word Lists, Sentences, Contrast Sets, Lookup neighbors): when has_variants, show a superscript flag indicating alternate pronunciations exist; expandable to view them. The matched word is surfaced once (per-word).
- Audio: per-variant target rows + deviation overlays + scores (extends ProductionCard / DeviationOverlay to a list).
- No new filter controls: variant matching is implicit.
F. Build / reseed
- build_runtime_parquet / emit_parquet: compute variants_str, cv_shapes, count ranges, has_variants.
- emit_d1_sql: new columns in the words DDL + inserts.
- Pair generation across variants.
- Regenerate d1-seed.sql; this is a matching-corpus change → folds into / gates with the PHON-151 reseed (run together; one seed bump).
- Property-metadata config (properties.ts + config.py) updated if any new filterable surface is exposed (most additions are matching-internal, not new filter knobs).
Testing
- Data: unit tests that variants_str renders || boundaries; multi-pronunciation words ("hello", "either", "data", "the") produce expected variant forms; has_variants / cv_shapes / count ranges are correct.
- Worker: patterns.ts variant LIKE clauses (STARTS/ENDS/CONTAINS) match a non-first variant; exclusion drops a word if any of its variants violates the constraint; /api/words/search + /api/sentences return a word matched only via a non-primary variant; the count/CV filters match a variant.
- Contrastive: a minimal pair witnessed only by variants is produced.
- Audio: /analyze returns per-variant scores; attribution uses the best-matching variant.
- Frontend: ProductionCard renders multiple variants; superscript flag appears iff has_variants.
- Full local CI matrix (frontend type-check/lint/build; workers type-check/tests; Python data tests) before PR, per house rules.
Out of scope
- Word-level property norms (frequency, AoA, etc.) — pronunciation-independent, unchanged.
- New user-facing filter controls or variant-selection toggles.
- Re-fitting feature vectors (that's PHON-151).
Risks / watch-items
- Seed size growth from variants_str + expanded pairs. Measure; the seed is already LFS-chunked.
- LIKE-pattern correctness at || boundaries: covered by tests; the boundary can't contain a phoneme, so cross-variant false matches are structurally impossible.
- Silent recall increase everywhere (intended): matching now catches words via alternates. Validate against a few known cases; this is the "improve performance on everything" goal.
- D1 100-bind / 100-column limits — additions are few; keep batching at 80.