PhonoLex 4.0.0 Release Plan

Monorepo merged. Integrated pipeline built. PHOIBLE replaced with learned vectors. Governor lookup clean. Now: expand constraints, build content catalog, ship chat.


Sub-project A: Integrated Lexical Database Pipeline (P0) — DONE

Pickle eliminated. Pipeline builds from raw datasets via phonolex_data loaders.

  • [x] A1: Quick integration fixes (imports, paths, pyproject.toml)
  • [x] A2: Rewrite export-to-d1.py → build_lexical_database() pipeline
      • [x] CMU dict with all pronunciation variants (load_cmudict returns list of variants)
      • [x] Syllabification, WCM, IPA normalization
      • [x] 15 norm datasets (Warriner, Glasgow, Concreteness, Sensorimotor, Kuperman, Semantic Diversity, Socialness, BOI, SUBTLEX, ELP, Iconicity, Prevalence, IPhOD, MorphoLex, CYP-LEX)
      • [x] 7 association datasets (SWOW, USF, SimLex, MEN, WordSim-353, SPP, ECCC)
      • [x] Vocab lists, morphological segmentation
      • [x] Percentiles, minimal pairs, phoneme dot products, syllable components
  • [x] A3: End-to-end verification
      • [x] Pipeline produces 207,673 words, 850,511 edges, 642,279 minimal pairs
      • [x] D1 seeded — 45 phonemes, 990 dot products
      • [x] All 7 API endpoints return 200 OK
      • [x] Governor lookup built — 104,853 tokens, clean attribution
      • [x] T5Gemma 9B loads, governed generation works (0 violations on /ɹ/ and /θ,ð/ exclusion)
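
The "45 phonemes, 990 dot products" figure matches one dot product per unordered pair of distinct phonemes (45 × 44 / 2 = 990). A minimal sketch of that computation, assuming each phoneme maps to a plain list of floats; the function name and the toy vectors are hypothetical and not taken from the pipeline:

```python
from itertools import combinations


def pairwise_dot_products(vectors):
    """Return {(a, b): dot} for every unordered pair of distinct phonemes."""
    products = {}
    # Sorting the keys gives each pair one canonical ordering.
    for a, b in combinations(sorted(vectors), 2):
        products[(a, b)] = sum(x * y for x, y in zip(vectors[a], vectors[b]))
    return products


# Toy inventory with made-up 2-dimensional vectors (illustration only).
toy = {"p": [1.0, 0.0], "b": [1.0, 1.0], "s": [0.0, 2.0]}
toy_products = pairwise_dot_products(toy)

# n * (n - 1) / 2 unordered pairs for an inventory of 45 phonemes.
pair_count_45 = 45 * 44 // 2
```

If the real table also stored self-products or both orderings, the count would differ, so this check is only a sanity test on the seeded row count.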

Sub-project F: Learned Feature Vectors (P5) — DONE

PHOIBLE replaced with Bayesian-learned continuous articulatory features.

  • [x] Hayes (2009) 26-feature prior → Beta priors
  • [x] ECCC perceptual confusion + MorphoLex alternation + Hillenbrand formant evidence
  • [x] r=0.987 cosine correlation with PHOIBLE, 8/8 voicing pairs, 4/6 clinical
  • [x] Composite vectors (α·v_onset + β·v_offset) for diphthong trajectories
  • [x] Integrated into pipeline, D1, and governor lookup
  • [x] All PHOIBLE references removed from codebase, frontend, citations, ToS
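
The composite form α·v_onset + β·v_offset can be sketched as a weighted element-wise sum. This is an illustration only: the function name, the default weights, and the toy vectors are assumptions, and the plan does not say how α and β are chosen.

```python
def composite_vector(v_onset, v_offset, alpha=0.5, beta=0.5):
    """Blend onset and offset feature vectors: alpha * v_onset + beta * v_offset."""
    if len(v_onset) != len(v_offset):
        raise ValueError("onset and offset vectors must have the same length")
    return [alpha * a + beta * b for a, b in zip(v_onset, v_offset)]


# Made-up 3-feature vectors for the two targets of a diphthong.
onset = [1.0, 0.0, 0.5]
offset = [0.0, 1.0, 0.5]
blended = composite_vector(onset, offset)
```

With equal weights the result sits midway between the two targets; unequal weights would pull the composite toward the onset or the offset.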

Sub-project D: Chat Nav (P3) — NOT STARTED

Trivial — add sidebar link in PhonoLex web to governed chat.

  • [ ] Add "Chat" nav item in sidebar (bottom of panel)
  • [ ] Dev-gate: disabled in production, enabled via env flag
  • [ ] Link to dashboard URL

Sub-project B2: Phoneme Inclusion Constraint (NEW — P1)

New governor mechanism: boost tokens containing target phonemes to achieve a minimum coverage rate in generated text. Enables therapy-targeted content where specific sounds must APPEAR, not just be excluded.

Mechanism: a LogitBoost that raises the probability of compliant tokens containing target phonemes. Coverage is tracked as a running percentage across the generated text. Boost strength grows the further coverage falls below the target.

  • [ ] Design inclusion constraint: Include(phonemes={"s"}, min_coverage=0.30)
  • [ ] Implement as LogitBoost mechanism in packages/governors/
  • [ ] Add coverage tracking to GovernorContext
  • [ ] Add constraint profile support in dashboard
  • [ ] Test: generate text with 30% /s/ coverage target
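
One possible shape for the inclusion mechanism, as a sketch rather than the planned implementation. It assumes coverage means the fraction of emitted tokens that contain at least one target phoneme, and that the boost is a linear additive bias on logits. `CoverageTracker`, `boost_strength`, `apply_inclusion_boost`, and `max_boost` are hypothetical names; only `Include(phonemes=..., min_coverage=...)` comes from the plan.

```python
from dataclasses import dataclass


@dataclass
class Include:
    phonemes: set        # target phonemes, e.g. {"s"}
    min_coverage: float  # target fraction of tokens containing a target phoneme


@dataclass
class CoverageTracker:
    hits: int = 0
    total: int = 0

    def update(self, token_phonemes, targets):
        """Record one emitted token and whether it contains a target phoneme."""
        self.total += 1
        if targets & set(token_phonemes):
            self.hits += 1

    @property
    def coverage(self):
        return self.hits / self.total if self.total else 0.0


def boost_strength(coverage, target, max_boost=5.0):
    """Zero at or above target; grows linearly to max_boost as coverage drops to 0."""
    if target <= 0 or coverage >= target:
        return 0.0
    return max_boost * (target - coverage) / target


def apply_inclusion_boost(logits, token_phonemes, constraint, tracker, max_boost=5.0):
    """Add the current boost to every token that contains a target phoneme."""
    boost = boost_strength(tracker.coverage, constraint.min_coverage, max_boost)
    boosted = {}
    for token, score in logits.items():
        has_target = constraint.phonemes & set(token_phonemes.get(token, ()))
        boosted[token] = score + boost if has_target else score
    return boosted


constraint = Include(phonemes={"s"}, min_coverage=0.30)
tracker = CoverageTracker()
token_phonemes = {"sun": ["s", "ʌ", "n"], "dog": ["d", "ɔ", "ɡ"]}
logits = {"sun": 1.0, "dog": 1.0}

first = apply_inclusion_boost(logits, token_phonemes, constraint, tracker)
tracker.update(token_phonemes["sun"], constraint.phonemes)  # coverage is now 1.0
second = apply_inclusion_boost(logits, token_phonemes, constraint, tracker)
```

Before any token is emitted, coverage is zero and the /s/ token receives the full boost. Once coverage meets the target, the boost drops to zero and the logits pass through unchanged. A real governor would operate on the model's logit tensor and would need to decide whether coverage is counted per token, per word, or per phoneme.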

Sub-project C: Content Catalog (P2) — NOT STARTED

Pre-generated compliant content as a companion to Text Analysis. Batch-generate text with constraint profiles, select for quality and compliance, bundle for download.

Depends on: B2 (phoneme inclusion) for coverage-rate profiles

  • [ ] Define constraint profiles for clinical use cases:
      • Phoneme exclusion bundles (per-sound: /ɹ/-free, /θ/-free, etc.)
      • Phoneme inclusion bundles (high /s/ density, high /ɹ/ density, etc.)
      • Minimal pair therapy passages
      • Maximal opposition passages
      • Multiple opposition passages
  • [ ] Batch generation pipeline — run profiles through governor + T5Gemma
  • [ ] Quality selection — fluency scoring, compliance verification, diversity
  • [ ] Storage — D1 table or static JSON catalog
  • [ ] "Generated Text" tool in PhonoLex web — browse/search/filter/download
  • [ ] Bundle formats: sentence bundles, passage bundles, CSV export

Sub-project E: Dashboard UX Overhaul (P4) — NOT STARTED

Depends on B2 for inclusion constraint UI.

  • [ ] Dynamic constraint builder — full IPA phoneme picker, norm sliders
  • [ ] Real-time constraint feedback — show active constraints, blocked/boosted tokens
  • [ ] Constraint profile management — save/load custom profiles
  • [ ] Generation parameter controls — temperature, top-k, max tokens

Dependency Graph

A (pipeline) ✅
├── F (learned vectors) ✅
├── D (chat nav) → trivial, do anytime
└── B2 (phoneme inclusion) → new constraint mechanism
    └── C (content catalog) → depends on B2 for coverage profiles
        └── E (dashboard UX) → depends on B2 + C patterns

Priority Order

  1. D — Chat nav link (trivial, do first)
  2. B2 — Phoneme inclusion constraint (enables C)
  3. C — Content catalog (the product deliverable)
  4. E — Dashboard UX overhaul (polish)

Fixed Bugs This Session

  • arpa_to_ipa.json: removed stress diacritics (ˈ, ˌ) — stress tracked as integers
  • CMU loader: includes all pronunciation variants (THE → ðə, ðʌ, ði)
  • Governor lookup: three-tier attribution (whole-word exact, subword morpheme-filtered, unknown best-effort)
  • Governor lookup: morpheme boundary filtering prevents cross-morpheme contamination
  • Governor lookup: letter-name filtering prevents CMU abbreviation pollution
  • Syllabification: is_vowel() works with clean IPA (no diacritics to strip)
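
The first two fixes can be illustrated with a small sketch that converts ARPAbet phones to diacritic-free IPA and keeps stress as separate integers. The mapping tables are a tiny hand-written subset, not the contents of arpa_to_ipa.json. The function names and the three ARPAbet variants for THE are assumptions chosen to reproduce the ðə, ðʌ, ði example above.

```python
# Illustrative subset only; the real arpa_to_ipa.json is not reproduced here.
ARPA_TO_IPA = {"DH": "ð", "AH": "ʌ", "IY": "i"}
# Assumption: unstressed AH (stress 0) is rendered as schwa.
UNSTRESSED_IPA = {"AH": "ə"}


def convert_phone(phone):
    """Split a phone like 'AH0' into (ipa_symbol, stress); stress is None for consonants."""
    if phone[-1].isdigit():
        base, stress = phone[:-1], int(phone[-1])
    else:
        base, stress = phone, None
    if stress == 0 and base in UNSTRESSED_IPA:
        return UNSTRESSED_IPA[base], stress
    return ARPA_TO_IPA[base], stress


def convert_pronunciation(phones):
    """Return (ipa_string, stress_list) with no stress marks inside the IPA string."""
    pairs = [convert_phone(p) for p in phones]
    ipa = "".join(symbol for symbol, _ in pairs)
    stresses = [stress for _, stress in pairs if stress is not None]
    return ipa, stresses


# Assumed pronunciation variants for THE.
the_variants = [["DH", "AH0"], ["DH", "AH1"], ["DH", "IY0"]]
converted = [convert_pronunciation(v) for v in the_variants]
```

Because the IPA string never contains ˈ or ˌ, a vowel check downstream can compare symbols directly instead of stripping diacritics first.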