G2P Character-to-Phoneme Alignment

Problem

The Inventio governor's LookupBuilder (in diffusion-governors) maps subword tokens to phonemes by decoding the token to a string and looking it up in CMUdict as a standalone word. This fails for subword fragments: token "th" gets the letter-name entry ([t, ˈi, ˈeɪ, tʃ]) instead of recognizing it carries /θ/ or /ð/ in actual words like "therapy" or "the".

Character-to-phoneme alignment solves this. Given a word's spelling and its known phoneme sequence, the alignment maps character spans to the phonemes they produce. Any tokenizer splits at character boundaries, so alignment data lets LookupBuilder project phonemes onto arbitrary subword tokens.

Scope

  • All ~134K entries in data/cmu/cmudict-0.7b, including ~8,600 alternate pronunciations
  • Output: one JSON file consumed by diffusion-governors' LookupBuilder
  • Pure Python, no external dependencies

Data Source

Parse the raw CMU dict directly (not cmu.json, which drops alternates). Convert ARPAbet to IPA using data/mappings/arpa_to_ipa.json. Stress-dependent vowel quality is preserved: AH0 -> ə, AH1 -> ˈʌ are distinct phonemes in the alignment.
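A minimal sketch of the stress-preserving conversion. The mapping dict below is a tiny illustrative subset, not the full data/mappings/arpa_to_ipa.json file:

```python
# Sketch: ARPAbet -> IPA conversion preserving stress-dependent vowel quality.
# Illustrative subset of data/mappings/arpa_to_ipa.json; the real file
# covers the full ARPAbet inventory.
ARPA_TO_IPA = {
    "AH0": "ə", "AH1": "ˈʌ", "AH2": "ˌʌ",
    "B": "b", "T": "t", "TH": "θ", "L": "l", "V": "v",
}

def arpa_to_ipa(phones):
    """Convert a list of ARPAbet phones to IPA, keeping stress markers."""
    return [ARPA_TO_IPA[p] for p in phones]
```

Because the stress digit is part of the lookup key, AH0 and AH1 land on distinct IPA targets with no extra stress-handling logic.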

CMU Dict Parsing Rules

  • Comment lines: Skip lines starting with ;;;
  • Entry format: WORD  PH1 PH2 PH3 — the word and the phoneme list are separated by two spaces; phonemes are separated by single spaces
  • Alternate pronunciations: WORD(N) format (e.g., LIVE(1) L IH1 V). Strip the (N) suffix to recover the base word. The base entry (no suffix) is variant 0; (1) is variant 1, (2) is variant 2, etc.
  • Apostrophes: Include as characters in the alignment. E.g., O'BRIEN -> OW0 B R AY1 IH0 N — the apostrophe is a character that produces no phoneme (silent, like silent-e).
  • Hyphens: Include as characters in the alignment, treated as silent (no phoneme produced). E.g., ABLE-BODIED -> the hyphen at position 4 maps to no phonemes.
  • Punctuation-prefixed entries: Skip entries whose key starts with !, ", #, or other non-alphabetic, non-apostrophe characters. These are meta-entries (e.g., !EXCLAMATION-POINT, "CLOSE-QUOTE, #HASH-MARK) not useful for subword attribution.
  • Possessives and contractions: Entries ending in 'S or starting with ' (e.g., 'BOUT) are included. The apostrophe is a character in the alignment.
  • Case: All word keys are lowercased in the output.
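The parsing rules above can be sketched as a single line parser. This is a minimal illustration of the documented rules, not the production script:

```python
import re

def parse_cmudict_line(line):
    """Parse one CMU dict line into (word, variant, phones), or None to skip.

    Applies the documented rules: skip ';;;' comments and punctuation-prefixed
    meta-entries, split on the two-space separator, strip the (N) alternate
    suffix, and lowercase the word key.
    """
    if line.startswith(";;;"):
        return None  # comment line
    key, _, phones = line.rstrip("\n").partition("  ")
    if not key or not (key[0].isalpha() or key[0] == "'"):
        return None  # meta-entries like !EXCLAMATION-POINT, "CLOSE-QUOTE
    m = re.match(r"^(.+)\((\d+)\)$", key)
    word, variant = (m.group(1), int(m.group(2))) if m else (key, 0)
    return word.lower(), variant, phones.split()
```

Note that the base entry (no suffix) comes out as variant 0, and leading-apostrophe entries like 'BOUT pass the meta-entry filter.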

Algorithm: Constrained DP with Bootstrapped Grapheme Table

Phase 1: Bootstrap grapheme-phoneme table

Seed with known multi-character English graphemes:

th, sh, ch, ck, ph, gh, ng, wh, wr, kn, gn, mb, mn, ps, pn, rh,
dg, tch, ght, qu, ee, ea, oo, ou, ow, oi, oy, aw, au, ew, ei, ey,
ai, ay, ie, ue, oa, oe, eigh, ough, augh

Run a greedy first pass (longest-matching-seed-grapheme-first) over all CMU entries. For each word, walk left-to-right: if the next characters match a seed grapheme, consume them as a unit; otherwise consume one character. Pair each grapheme unit with the next unconsumed phoneme (strictly 1-to-1). Validate each pairing against the known phoneme sequence — if the greedy segmentation can't account for exactly all phonemes, discard the word from the bootstrap (it will be handled in Phase 2). Only validated pairings enter the table.

This populates the full grapheme-to-phoneme table (including single-character mappings like "a" -> {ˈæ, æ, ə, ...}) without hand-coding.
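The greedy pass can be sketched as follows. The seed list here is an illustrative subset (ordered longest-first so "eigh" wins over "ei"), and the 1-to-1 pairing deliberately rejects words it cannot account for:

```python
# Illustrative subset of the seed grapheme list, longest first.
SEED_GRAPHEMES = ["eigh", "ough", "augh", "tch", "ght",
                  "th", "sh", "ch", "ck", "ph", "gh", "ng", "qu",
                  "ee", "ea", "oo", "ou", "ow"]

def greedy_bootstrap(word, ipa):
    """One greedy left-to-right pass, pairing grapheme units 1-to-1 with phonemes.

    Returns a list of (grapheme, phoneme) pairs, or None when the 1-to-1
    walk cannot account for exactly all phonemes (the word is discarded
    from the bootstrap and deferred to Phase 2).
    """
    pairs, i, j = [], 0, 0
    while i < len(word):
        for g in SEED_GRAPHEMES:          # longest-matching-seed first
            if word.startswith(g, i):
                unit = g
                break
        else:
            unit = word[i]                # no seed matched: single character
        if j >= len(ipa):
            return None                   # more grapheme units than phonemes
        pairs.append((unit, ipa[j]))
        i += len(unit)
        j += 1
    return pairs if j == len(ipa) else None
```

Words with silent letters ("make") or multi-phoneme graphemes ("box") fail the exact 1-to-1 accounting and return None, which is precisely why Phase 2 must allow novel pairings.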

Phase 2: DP alignment

For each word, find the optimal alignment path through a DP over states (i, j) where i = characters consumed, j = phonemes consumed.

Transitions from state (i, j):

For every grapheme span word[i:i+g], for g in 1..max_grapheme_len, try all three transition types:

  • Silent (0 phonemes): transition to (i+g, j), cost = 2
  • 1-phoneme: if ipa[j] exists, transition to (i+g, j+1), cost = 0 if this grapheme-phoneme pairing is in the bootstrap table, cost = 1 otherwise (novel pairing)
  • 2-phoneme: if ipa[j:j+2] has 2 elements, transition to (i+g, j+2), cost = 0 if in table, cost = 1 otherwise

All three transitions are always available for every grapheme span. The bootstrap table determines cost (0 vs 1), not availability. This is critical for multi-phoneme graphemes like "ew" -> [j, u] and "x" -> [k, s], which the Phase 1 bootstrap cannot learn (its 1-to-1 pairing discards these words) but Phase 2 must handle at cost = 1.

Cost function:

  • Each transition has a cost (defined above)
  • Total path cost = sum of transition costs
  • DP minimizes total cost
  • Primary: lowest total cost (prefer known grapheme-phoneme pairings over novel ones, and both over silent-letter transitions)
  • Secondary (tie-break): fewest grapheme units (prefer "th" -> θ over "t" -> t + "h" -> silent when both have the same cost)

Goal state: (len(word), len(ipa))

Stress handling: The grapheme table maps to full IPA forms including stress markers. ˈʌ and ə are distinct targets. No stripping.
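A compact sketch of the Phase 2 DP, assuming the bootstrap table is a dict mapping each grapheme to a set of IPA tuples it is known to produce. Memoized recursion stands in for an explicit DP table; the real script may structure this differently:

```python
from functools import lru_cache

MAX_G = 4  # longest grapheme span considered ("eigh", "ough", "augh")

def align(word, ipa, table):
    """DP over states (i, j): i characters consumed, j phonemes consumed.

    `table` maps grapheme -> set of IPA tuples learned in Phase 1.
    Minimizes (total cost, grapheme-unit count) lexicographically and
    returns [(char_start, char_end, grapheme, phonemes), ...], or None
    when no sequence of transitions reaches the goal state.
    """
    ipa = tuple(ipa)
    INF = (float("inf"), float("inf"))

    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(word):
            return ((0, 0), []) if j == len(ipa) else (INF, None)
        best_cost, best_path = INF, None
        for g in range(1, min(MAX_G, len(word) - i) + 1):
            grapheme = word[i:i + g]
            known = table.get(grapheme, set())
            # (phonemes consumed, transition cost): silent, 1-phoneme, 2-phoneme
            options = [(0, 2)]
            for n in (1, 2):
                if j + n <= len(ipa):
                    options.append((n, 0 if ipa[j:j + n] in known else 1))
            for n, cost in options:
                (sub_cost, sub_units), sub_path = best(i + g, j + n)
                if sub_path is None:
                    continue
                cand = (cost + sub_cost, 1 + sub_units)
                if cand < best_cost:
                    best_cost = cand
                    best_path = [(i, i + g, grapheme, list(ipa[j:j + n]))] + sub_path
        return best_cost, best_path

    return best(0, 0)[1]
```

The lexicographic (cost, units) comparison implements the primary/secondary ordering directly, and all three transitions stay available for every span — the table only discounts known pairings to cost 0.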

Silent letters & many-to-zero

Handled naturally by the DP's silent transition (cost = 2):

  • Silent "e" ("make" -> meɪk)
  • Silent "k" ("knee" -> ni)
  • Silent "gh" ("night" -> naɪt)
  • Doubled consonants ("ll", "ss") producing one phoneme
  • Apostrophes and hyphens (always silent)

The higher cost (2) for silent transitions means the DP prefers multi-character graphemes ("ck" -> k) over single-char + silent ("c" -> k + "k" -> silent) when both are possible.

Failure handling

Words that don't align (foreign borrowings, extreme irregulars like "colonel" -> kɝnəl):

  1. The DP returns no valid path (no sequence of transitions reaches the goal state).
  2. Fall back to a left-greedy proportional heuristic: assign phonemes left-to-right, one phoneme per character. If there are more characters than phonemes, trailing characters get empty phoneme lists; if there are more phonemes than characters, the last character gets all remaining phonemes.
  3. Flag the entry with "confidence": "low".
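The fallback heuristic is simple enough to sketch in full. A minimal version, producing alignment units in the output format below:

```python
def fallback_align(word, ipa):
    """Left-greedy proportional fallback: one phoneme per character.

    Trailing characters get empty phoneme lists when characters outnumber
    phonemes; the last character absorbs the remainder when phonemes
    outnumber characters. The caller flags the result "confidence": "low".
    """
    spans = []
    for i, ch in enumerate(word):
        if i == len(word) - 1:
            phones = list(ipa[i:])       # last char takes all remaining phonemes
        else:
            phones = list(ipa[i:i + 1])  # one phoneme, or [] once exhausted
        spans.append({"chars": [i, i + 1], "grapheme": ch, "phonemes": phones})
    return spans
```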

Output Format

File: data/g2p_alignment.json (gitignored)

{
  "bartholomew": [
    {
      "variant": 0,
      "arpa": ["B", "AA0", "R", "TH", "AA1", "L", "AH0", "M", "Y", "UW2"],
      "ipa": ["b", "ɑ", "ɹ", "θ", "ˈɑ", "l", "ə", "m", "j", "ˌu"],
      "alignment": [
        {"chars": [0, 1], "grapheme": "b", "phonemes": ["b"]},
        {"chars": [1, 2], "grapheme": "a", "phonemes": ["ɑ"]},
        {"chars": [2, 3], "grapheme": "r", "phonemes": ["ɹ"]},
        {"chars": [3, 5], "grapheme": "th", "phonemes": ["θ"]},
        {"chars": [5, 6], "grapheme": "o", "phonemes": ["ˈɑ"]},
        {"chars": [6, 7], "grapheme": "l", "phonemes": ["l"]},
        {"chars": [7, 8], "grapheme": "o", "phonemes": ["ə"]},
        {"chars": [8, 9], "grapheme": "m", "phonemes": ["m"]},
        {"chars": [9, 11], "grapheme": "ew", "phonemes": ["j", "ˌu"]}
      ],
      "confidence": "high"
    }
  ],
  "live": [
    {
      "variant": 0,
      "arpa": ["L", "AY1", "V"],
      "ipa": ["l", "ˈaɪ", "v"],
      "alignment": [
        {"chars": [0, 1], "grapheme": "l", "phonemes": ["l"]},
        {"chars": [1, 2], "grapheme": "i", "phonemes": ["ˈaɪ"]},
        {"chars": [2, 3], "grapheme": "v", "phonemes": ["v"]},
        {"chars": [3, 4], "grapheme": "e", "phonemes": []}
      ],
      "confidence": "high"
    },
    {
      "variant": 1,
      "arpa": ["L", "IH1", "V"],
      "ipa": ["l", "ˈɪ", "v"],
      "alignment": [
        {"chars": [0, 1], "grapheme": "l", "phonemes": ["l"]},
        {"chars": [1, 2], "grapheme": "i", "phonemes": ["ˈɪ"]},
        {"chars": [2, 3], "grapheme": "v", "phonemes": ["v"]},
        {"chars": [3, 4], "grapheme": "e", "phonemes": []}
      ],
      "confidence": "high"
    }
  ]
}

File Location

  • Script: workers/scripts/g2p_alignment.py
  • Output: data/g2p_alignment.json (gitignored)
  • Inputs: data/cmu/cmudict-0.7b, data/mappings/arpa_to_ipa.json

Testing

Hand-curated validation set (~50 words) covering:

  • Regular patterns ("cat", "dog", "fish")
  • Multi-char graphemes ("think", "church", "phone")
  • Silent letters ("knife", "write", "make")
  • Multi-phoneme graphemes ("box" -> bɑks)
  • Irregulars ("colonel", "choir", "enough")
  • Multiple pronunciations ("live", "read", "bass")
  • Apostrophes ("o'brien", "'bout")
  • Hyphens ("able-bodied")

Report: % high-confidence alignments, % needing fallback, list of failed words.
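The report computation could look like the following sketch, assuming the in-memory structure mirrors the output format (word -> list of variant dicts with a "confidence" key):

```python
def summarize(aligned):
    """Summarize alignment results for the validation report.

    `aligned` maps word -> list of variant dicts carrying a "confidence"
    key, mirroring the JSON output format above.
    """
    variants = [v for vs in aligned.values() for v in vs]
    total = len(variants)
    high = sum(v["confidence"] == "high" for v in variants)
    failed = sorted(w for w, vs in aligned.items()
                    if any(v["confidence"] == "low" for v in vs))
    return {
        "pct_high": 100.0 * high / total if total else 0.0,
        "pct_fallback": 100.0 * (total - high) / total if total else 0.0,
        "failed_words": failed,
    }
```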

Downstream Consumer

diffusion-governors' LookupBuilder will be updated (in a separate change) to load this alignment data and use it for subword token phoneme attribution. Given token "th" spanning chars [3:5] in "bartholomew", the lookup projects ["θ"] onto that token instead of looking up "th" as a standalone word.
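A minimal sketch of how that projection could work, assuming a simple span-overlap rule (the function name and exact policy are illustrative, not the actual LookupBuilder API):

```python
def phonemes_for_span(alignment, start, end):
    """Project phonemes onto a token's character span [start, end).

    Collects phonemes from every alignment unit whose character range
    overlaps the span — an assumed policy for illustration; the real
    LookupBuilder may weight partial overlaps differently.
    """
    out = []
    for unit in alignment:
        c0, c1 = unit["chars"]
        if c0 < end and c1 > start:  # half-open character ranges overlap
            out.extend(unit["phonemes"])
    return out
```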