PHON-154 — Variant-aware matching + display across all tools

Status: design approved 2026-06-15 · Ticket: PHON-154 · Branch: feature/phon-154-variant-aware-matching · Blocks: PHON-152 (v6 audio prod cutover) · Relates: PHON-153, PHON-151 (reseed).

Problem

The lexicon carries multiple attested pronunciations per word (words.variants, JSON), but no matching or scoring path consumes them — everything keys off the single primary pronunciation:

  • Build: emit_parquet._word_record_to_row derives phonemes_str from the primary phonemes; variants is json.dumps'd to a column that is never indexed/matched.
  • Worker: patterns.ts matches phonemes_str / initial_phoneme / final_phoneme (primary-only); wordFilter.ts, sentences.ts, contrastive.ts have no variant references.
  • Audio: /analyze + /pronounce align/score against the single primary (SELECT phonemes FROM words); the scorer self-documents "assumes the speaker intends canonical."
  • Frontend: Lookup shows only the primary ipa; tools never surface alternates.

Consequence (strictly wrong): a speaker producing a valid attested variant (e.g. "hello" → hɛloʊ, CMU HELLO(1)) is scored as a deviation; an ɛ phoneme filter misses "hello"; minimal-pair / corpus constraints ignore variants. This must be fixed before audio ships to production.

Governing principle

Conservative, applied uniformly: a word matches an include-constraint if any attested variant satisfies it; a word is excluded by an exclude-constraint if any attested variant violates it. Where a new variant interaction surfaces during implementation and its semantics are ambiguous, choose the interpretation that keeps every attested variant in scope, so that a single variant is enough to trigger the constraint (the same direction as exclusion-on-any).
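
The any-variant rule can be sketched as a plain predicate. This is a minimal illustration, not the Worker implementation: the function name word_in_scope and the callable-based constraint shape are hypothetical, and it assumes multiple constraints are evaluated independently (different variants may satisfy different include-constraints).

```python
from typing import Callable, Sequence

Variant = Sequence[str]  # one pronunciation as a list of phoneme symbols
Constraint = Callable[[Variant], bool]


def word_in_scope(
    variants: Sequence[Variant],
    includes: Sequence[Constraint],
    excludes: Sequence[Constraint],
) -> bool:
    """Hypothetical helper: apply the any-variant rule to one word."""
    # An include-constraint is met if ANY attested variant satisfies it.
    for include in includes:
        if not any(include(v) for v in variants):
            return False
    # An exclude-constraint removes the word if ANY attested variant violates it.
    for exclude in excludes:
        if any(exclude(v) for v in variants):
            return False
    return True


# Illustrative data: the two pronunciations of "hello" cited in the Problem section.
hello = [["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]]
has_eh = lambda v: "ɛ" in v

included = word_in_scope(hello, includes=[has_eh], excludes=[])  # matched via the alternate
excluded = word_in_scope(hello, includes=[], excludes=[has_eh])  # dropped via the alternate
```

The same phoneme therefore pulls "hello" into an include filter and pushes it out of an exclude filter, which is the intended asymmetry.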

Matching across variants is default-on everywhere — no UI toggle, no internal knob.

Scope

In: data-layer variant-matchable forms; Worker phoneme/CV/count matching across variants; minimal-pair generation across variants; audio per-variant scoring; frontend variant display + a superscript "has variants" flag; reseed. Word-level property norms (frequency, AoA, concreteness, …) are pronunciation-independent and unchanged.

A. Data layer

One row per word is preserved (results stay per-word; no DISTINCT, no pagination/bind-batching breakage — the reason pipe-delimited columns beat per-variant rows here).

New/changed columns on words:

  • variants_str (TEXT) — all attested pronunciations (primary first), each rendered in the existing |p1|p2|…|pn| pipe form, concatenated directly. Because each segment is pipe-wrapped, the join between adjacent variants is ||, which doubles as an unambiguous variant boundary (no phoneme can span it).
  • Example: hello → |h|ə|l|oʊ||h|ɛ|l|oʊ|
  • phonemes_str is retained unchanged (primary; canonical display + backward-compat).
  • cv_shapes (TEXT) — distinct CV shapes across variants as a pipe-bounded set (|CVC|CVCV|), matched with LIKE '%|shape|%'; derived at build from each variant's syllable structure. cv_shape (primary) retained.
  • phoneme_count / syllable_count — matched as a range (min/max across variants). Add phoneme_count_min/max, syllable_count_min/max (or an equivalent compact form) so count filters match if any variant falls in range. Primary columns retained for display.
  • has_variants (INTEGER 0/1) — convenience flag for the frontend superscript (derivable as variant_count > 1, materialized to avoid parsing JSON on every render).

words is currently ~19 columns; these additions stay well under the D1 100-column limit. The variants JSON column stays (full per-variant phonemes/ipa/syllables for display + audio).
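
A minimal sketch of how the derived columns could be computed from a word's variant list at build time. The function names and the per-variant dict shape (phonemes, cv_shape, syllable_count already computed) are assumptions for illustration; the real emit_parquet code and the variants JSON layout may differ.

```python
def render_variant(phonemes):
    """Render one pronunciation in the existing |p1|p2|...|pn| pipe form."""
    return "|" + "|".join(phonemes) + "|"


def derive_variant_columns(variants):
    """Hypothetical helper. `variants` is primary-first; each item is a dict
    with 'phonemes' (list of str), 'cv_shape' (str), 'syllable_count' (int)."""
    # Direct concatenation: adjacent pipe-wrapped variants meet at "||".
    variants_str = "".join(render_variant(v["phonemes"]) for v in variants)

    # Distinct CV shapes, order-preserving, as a pipe-bounded set.
    shapes = []
    for v in variants:
        if v["cv_shape"] not in shapes:
            shapes.append(v["cv_shape"])
    cv_shapes = "|" + "|".join(shapes) + "|"

    phoneme_counts = [len(v["phonemes"]) for v in variants]
    syllable_counts = [v["syllable_count"] for v in variants]
    return {
        "variants_str": variants_str,
        "cv_shapes": cv_shapes,
        "phoneme_count_min": min(phoneme_counts),
        "phoneme_count_max": max(phoneme_counts),
        "syllable_count_min": min(syllable_counts),
        "syllable_count_max": max(syllable_counts),
        "has_variants": 1 if len(variants) > 1 else 0,
    }


# Illustrative input; the CV shapes and syllable counts are assumed values.
hello_columns = derive_variant_columns([
    {"phonemes": ["h", "ə", "l", "oʊ"], "cv_shape": "CVCV", "syllable_count": 2},
    {"phonemes": ["h", "ɛ", "l", "oʊ"], "cv_shape": "CVCV", "syllable_count": 2},
])
```

For a single-pronunciation word the same function yields a variants_str equal to phonemes_str and has_variants = 0, so the new columns degrade to the primary-only behaviour.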

B. Matching (Worker)

patterns.ts emits variant-aware LIKE clauses against variants_str:

  • CONTAINS / CONTAINS_MEDIAL (SQL part): variants_str LIKE '%|seq|%'
  • STARTS_WITH: (variants_str LIKE '|seq|%' OR variants_str LIKE '%||seq|%') — first variant, or any later variant (preceded by ||).
  • ENDS_WITH: (variants_str LIKE '%|seq|' OR variants_str LIKE '%|seq||%') — last variant, or any earlier variant (followed by ||).
  • The single-phoneme initial_phoneme / final_phoneme equality fast-path is superseded by the STARTS_WITH/ENDS_WITH variant patterns above (minor extra LIKE cost; correctness over the micro-optimization).
  • CONTAINS_MEDIAL app post-filter: checks medial position (not first/last) within each variant; matches if any variant satisfies. The route already fetches candidate rows — extend the post-filter to iterate variants.
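
The clause shapes above can be exercised against an in-memory SQLite table. This is a Python sketch for checking the patterns, not the patterns.ts code: like_clause, contains_medial and search are hypothetical names, the two-column table is a simplification, and the sample pronunciations are illustrative. The medial check assumes "medial" means the matched sequence touches neither the first nor the last phoneme of a variant.

```python
import sqlite3


def like_clause(op, seq):
    """Return (sql_fragment, bind_params) for one phoneme-sequence constraint."""
    body = "|".join(seq)
    if op == "CONTAINS":
        return "variants_str LIKE ?", [f"%|{body}|%"]
    if op == "STARTS_WITH":
        # First variant, or any later variant (preceded by the || boundary).
        return ("(variants_str LIKE ? OR variants_str LIKE ?)",
                [f"|{body}|%", f"%||{body}|%"])
    if op == "ENDS_WITH":
        # Last variant, or any earlier variant (followed by the || boundary).
        return ("(variants_str LIKE ? OR variants_str LIKE ?)",
                [f"%|{body}|", f"%|{body}||%"])
    raise ValueError(f"unsupported op: {op}")


def contains_medial(variants_str, seq):
    """App-side post-filter: true if any variant has seq strictly inside it."""
    n = len(seq)
    for variant in variants_str.strip("|").split("||"):
        phonemes = variant.split("|")
        # Start index >= 1 and end index <= len - 2, i.e. not first, not last.
        for i in range(1, len(phonemes) - n):
            if phonemes[i:i + n] == list(seq):
                return True
    return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, variants_str TEXT)")
conn.executemany("INSERT INTO words VALUES (?, ?)", [
    ("hello", "|h|ə|l|oʊ||h|ɛ|l|oʊ|"),
    ("either", "|i|ð|ɚ||aɪ|ð|ɚ|"),
])


def search(op, seq):
    clause, params = like_clause(op, seq)
    rows = conn.execute(
        f"SELECT word FROM words WHERE {clause} ORDER BY word", params)
    return [row[0] for row in rows]
```

In this sketch, search("STARTS_WITH", ["aɪ"]) finds "either" through its second variant, while search("CONTAINS", ["oʊ", "h"]) finds nothing, because the || boundary keeps the end of one variant from pairing with the start of the next.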

wordFilter.ts (compileWordFilter, shared by /api/words/search + /api/sentences):

  • Phoneme-pattern clauses use the variant patterns above.
  • CV-shape filter matches against cv_shapes (any variant).
  • phoneme_count / syllable_count filters match the variant range.
  • Exclusion (phoneme-exclude, CV-exclude): excluded if any variant violates (conservative principle).
  • is_canonical scoping is word-level and unchanged.
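
A sketch of the count-range, CV-shape, and exclusion clauses, again checked in Python against in-memory SQLite rather than in the Worker. The helper names, the reduced table schema, and the sample rows (including the two pronunciations given for "every") are illustrative assumptions. The count filter is written as an interval-overlap test on the min/max columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE words (
    word TEXT, variants_str TEXT, cv_shapes TEXT,
    phoneme_count_min INTEGER, phoneme_count_max INTEGER)""")
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?)", [
    ("cat", "|k|æ|t|", "|CVC|", 3, 3),
    ("hello", "|h|ə|l|oʊ||h|ɛ|l|oʊ|", "|CVCV|", 4, 4),
    ("every", "|ɛ|v|ɹ|i||ɛ|v|ə|ɹ|i|", "|VCCV|VCVCV|", 4, 5),
])


def count_range(column, lo, hi):
    """Word matches if its [min, max] span overlaps the requested [lo, hi]."""
    return f"({column}_min <= ? AND {column}_max >= ?)", [hi, lo]


def cv_shape(shape):
    """Word matches if any variant has this CV shape."""
    return "cv_shapes LIKE ?", [f"%|{shape}|%"]


def exclude_phoneme(phoneme):
    """Word is dropped if any variant contains the phoneme."""
    return "NOT (variants_str LIKE ?)", [f"%|{phoneme}|%"]


def run(clause_and_params):
    clause, params = clause_and_params
    rows = conn.execute(
        f"SELECT word FROM words WHERE {clause} ORDER BY word", params)
    return [row[0] for row in rows]
```

One property of the overlap test worth knowing: if a word's variants had counts 3 and 5, a filter for exactly 4 would still match even though no variant has 4 phonemes. That over-match errs in the inclusive direction; if it proves unwanted, a pipe-bounded set of counts (one of the "equivalent compact forms") would make the test exact.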

C. Contrast Sets / minimal pairs

Minimal-pair / maximal-opposition / multiple-opposition generation (build-time, db.derived.minimal_pairs → pairs.parquet) considers variants: words A and B form a pair if some variant of A and some variant of B stand in the target relation (one-phoneme difference, etc.). Results surface per-word (the pair is between words, witnessed by a variant). Expect the pairs table to grow; record the new row count. The is_canonical pair flag stays word-level.
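
A minimal sketch of the "some variant of A and some variant of B" test, restricted to the simplest target relation: same length, exactly one substituted phoneme. The function names are hypothetical, and the real pair generator may use other relations (insertions, feature distance for maximal opposition) and a more efficient search than the pairwise scan shown here.

```python
from itertools import product


def differ_by_one(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def minimal_pair_witness(variants_a, variants_b):
    """Return the first (variant_a, variant_b) that witnesses a minimal pair,
    scanning primary-first, or None if no combination qualifies."""
    for va, vb in product(variants_a, variants_b):
        if differ_by_one(va, vb):
            return (va, vb)
    return None


# Illustrative pronunciations: "the" as ðə (primary) and ði (alternate); "see" as si.
the = [["ð", "ə"], ["ð", "i"]]
see = [["s", "i"]]

witness = minimal_pair_witness(the, see)           # found via the alternate ði
primary_only = minimal_pair_witness(the[:1], see)  # primary-only matching misses it
```

The pair is recorded once between the two words; the witness tuple is only the evidence for it.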

D. Audio

/analyze (and /pronounce where applicable):

  • Fetch the full variant set for the target word (not just phonemes).
  • Align/score the produced transcript against each variant, returning a per-variant array: [{ variant_phonemes, ipa, positions, deviations, features }, …] (primary first).
  • Display: the UI lists every variant with its own deviation overlay + score; the clinician picks which they care about. No auto-"best" collapsing in the display.
  • Session attribution (the pooled read): fed from the best-matching variant per production (lowest aggregate deviation), so the attribution reflects the closest legitimate target rather than an arbitrary one. The per-production feature vector pooled into /attribute is the best-matching variant's.
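
The per-variant scoring and best-variant selection can be sketched as follows. Phoneme-level edit distance stands in for the real aligner/scorer, which this document does not specify, and the "deviation" field is a simplified placeholder for the positions/deviations/features payload. Breaking ties toward the earlier variant (so the primary wins a tie) is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (stand-in scorer)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def score_against_variants(produced, variants):
    """One result per variant, in the given (primary-first) order."""
    return [
        {"variant_phonemes": list(v), "deviation": edit_distance(produced, v)}
        for v in variants
    ]


def best_variant_index(scores):
    """Index of the lowest aggregate deviation; min() keeps the earliest on ties."""
    return min(range(len(scores)), key=lambda i: scores[i]["deviation"])


# Illustrative production: the speaker says the attested alternate of "hello".
variants = [["h", "ə", "l", "oʊ"], ["h", "ɛ", "l", "oʊ"]]
scores = score_against_variants(["h", "ɛ", "l", "oʊ"], variants)
best = best_variant_index(scores)
```

Here the full scores list is what the display would show, while only scores[best] would feed session attribution.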

E. Frontend display

  • Lookup card: render all variant pronunciations (primary as headline, alternates listed).
  • Result rows / cards (Word Lists, Sentences, Contrast Sets, Lookup neighbors): when has_variants, show a superscript flag indicating alternate pronunciations exist; expandable to view them. The matched word is surfaced once (per-word).
  • Audio: per-variant target rows + deviation overlays + scores (extends ProductionCard / DeviationOverlay to a list).
  • No new filter controls — variant matching is implicit.

F. Build / reseed

  • build_runtime_parquet / emit_parquet: compute variants_str, cv_shapes, count ranges, has_variants.
  • emit_d1_sql: new columns in the words DDL + inserts.
  • Pair generation across variants.
  • Regenerate d1-seed.sql; this is a matching-corpus change → folds into / gates with the PHON-151 reseed (run together; one seed bump).
  • Property-metadata config (properties.ts + config.py) updated if any new filterable surface is exposed (most additions are matching-internal, not new filter knobs).

Testing

  • Data: unit tests that variants_str renders || boundaries; multi-pronunciation words ("hello", "either", "data", "the") produce expected variant forms; has_variants/cv_shapes/count ranges correct.
  • Worker: patterns.ts variant LIKE clauses (STARTS/ENDS/CONTAINS) match a non-first variant; exclusion drops a word if any of its variants violates the constraint; /api/words/search + /api/sentences return a word matched only via a non-primary variant; the count/CV filters match on a non-primary variant.
  • Contrastive: a minimal pair witnessed only by variants is produced.
  • Audio: /analyze returns per-variant scores; attribution uses the best-matching variant.
  • Frontend: ProductionCard renders multiple variants; superscript flag appears iff has_variants.
  • Full local CI matrix (frontend type-check/lint/build; workers type-check/tests; Python data tests) before PR, per house rules.

Out of scope

  • Word-level property norms (frequency, AoA, etc.) — pronunciation-independent, unchanged.
  • New user-facing filter controls or variant-selection toggles.
  • Re-fitting feature vectors (that's PHON-151).

Risks / watch-items

  • Seed size growth from variants_str + expanded pairs. Measure; the seed is already LFS-chunked.
  • LIKE-pattern correctness at || boundaries — covered by tests; the boundary can't contain a phoneme so cross-variant false matches are structurally impossible.
  • Silent recall increase everywhere (intended): matching now catches words via alternates. Validate against a few known cases; this is the "improve performance on everything" goal.
  • D1 100-bind / 100-column limits — additions are few; keep batching at 80.