PhonoLex 4.0.0 Release Plan
Monorepo merged. Integrated pipeline built. PHOIBLE replaced with learned vectors. Governor lookup clean. Now: expand constraints, build content catalog, ship chat.
Sub-project A: Integrated Lexical Database Pipeline (P0) - DONE
Pickle eliminated. Pipeline builds from raw datasets via phonolex_data loaders.
- [x] A1: Quick integration fixes (imports, paths, pyproject.toml)
- [x] A2: Rewrite export-to-d1.py as a build_lexical_database() pipeline
- [x] CMU dict with all pronunciation variants (load_cmudict returns a list of variants)
- [x] Syllabification, WCM, IPA normalization
- [x] 15 norm datasets (Warriner, Glasgow, Concreteness, Sensorimotor, Kuperman, Semantic Diversity, Socialness, BOI, SUBTLEX, ELP, Iconicity, Prevalence, IPhOD, MorphoLex, CYP-LEX)
- [x] 7 association datasets (SWOW, USF, SimLex, MEN, WordSim-353, SPP, ECCC)
- [x] Vocab lists, morphological segmentation
- [x] Percentiles, minimal pairs, phoneme dot products, syllable components
- [x] A3: End-to-end verification
- [x] Pipeline produces 207,673 words, 850,511 edges, 642,279 minimal pairs
- [x] D1 seeded — 45 phonemes, 990 dot products
- [x] All 7 API endpoints return 200 OK
- [x] Governor lookup built — 104,853 tokens, clean attribution
- [x] T5Gemma 9B loads, governed generation works (0 violations on /ɹ/ and /θ,ð/ exclusion)
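The figure of 990 dot products is consistent with one dot product per unordered pair of distinct phonemes, since 45 × 44 / 2 = 990. The sketch below only illustrates that counting. The function name, the random vectors, and the 26-dimensional size are assumptions for illustration, not the pipeline's actual code.

```python
from itertools import combinations
import random


def pairwise_dot_products(vectors):
    """Return {(a, b): dot(a, b)} for every unordered pair of distinct keys."""
    return {
        (a, b): sum(x * y for x, y in zip(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
    }


# Placeholder inventory: 45 symbols with random 26-dimensional vectors.
rng = random.Random(0)
vectors = {f"p{i:02d}": [rng.random() for _ in range(26)] for i in range(45)}

products = pairwise_dot_products(vectors)
print(len(products))  # 990
```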
Sub-project F: Learned Feature Vectors (P5) - DONE
PHOIBLE replaced with Bayesian-learned continuous articulatory features.
- [x] Hayes (2009) 26-feature prior → Beta priors
- [x] ECCC perceptual confusion + MorphoLex alternation + Hillenbrand formant evidence
- [x] r=0.987 cosine correlation with PHOIBLE, 8/8 voicing pairs, 4/6 clinical
- [x] Composite vectors (α·v_onset + β·v_offset) for diphthong trajectories
- [x] Integrated into pipeline, D1, and governor lookup
- [x] All PHOIBLE references removed from codebase, frontend, citations, ToS
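The composite formula α·v_onset + β·v_offset can be sketched as a weighted sum of two feature vectors. This is a minimal illustration: the function name, the default weights, and the three-feature toy vectors are assumptions, and the plan does not state how α and β are chosen.

```python
def composite_vector(v_onset, v_offset, alpha=0.5, beta=0.5):
    """Blend onset and offset feature vectors: alpha * v_onset + beta * v_offset."""
    if len(v_onset) != len(v_offset):
        raise ValueError("onset and offset vectors must have the same length")
    return [alpha * a + beta * b for a, b in zip(v_onset, v_offset)]


# Toy 3-feature vectors (values invented for illustration).
v_onset = [1.0, 0.0, 0.2]   # stand-in for the diphthong's starting vowel
v_offset = [0.0, 1.0, 0.8]  # stand-in for the diphthong's ending vowel

diphthong = composite_vector(v_onset, v_offset, alpha=0.6, beta=0.4)
print(diphthong)
```

Weighting the onset more heavily than the offset, as in this call, is just one possible choice for modelling a trajectory that spends more time near its starting vowel.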
Sub-project D: Chat Nav (P3) - NOT STARTED
Trivial: add a sidebar link in PhonoLex web to the governed chat.
- [ ] Add "Chat" nav item in sidebar (bottom of panel)
- [ ] Dev-gate: disabled in production, enabled via env flag
- [ ] Link to dashboard URL
Sub-project B2: Phoneme Inclusion Constraint (NEW, P1)
New governor mechanism: boost tokens containing target phonemes to achieve a minimum coverage rate in generated text. Enables therapy-targeted content where specific sounds must APPEAR, not just be excluded.
Mechanism: a LogitBoost that raises the probability of compliant tokens containing target phonemes. Coverage is tracked as a running percentage across the generated text, and the boost grows stronger the further coverage falls below the target.
- [ ] Design inclusion constraint: Include(phonemes={"s"}, min_coverage=0.30)
- [ ] Implement as LogitBoost mechanism in packages/governors/
- [ ] Add coverage tracking to GovernorContext
- [ ] Add constraint profile support in dashboard
- [ ] Test: generate text with 30% /s/ coverage target
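One possible shape for the mechanism described above, as a sketch only. Include and its two fields come from the design line in the checklist. Everything else is an assumption: CoverageTracker, boost_logits, the max_boost parameter, the linear scaling of the bonus, and the use of plain dicts in place of logit tensors. Coverage is assumed here to mean the share of generated tokens containing at least one target phoneme.

```python
from dataclasses import dataclass


@dataclass
class Include:
    phonemes: set          # target phonemes that must appear
    min_coverage: float    # minimum share of tokens containing a target


@dataclass
class CoverageTracker:
    """Hypothetical running share of generated tokens that contain a target phoneme."""
    hits: int = 0
    total: int = 0

    def update(self, token_phonemes, targets):
        self.total += 1
        if targets & set(token_phonemes):
            self.hits += 1

    @property
    def coverage(self):
        return self.hits / self.total if self.total else 0.0


def boost_logits(logits, token_phonemes, constraint, tracker, max_boost=5.0):
    """Add a bonus to tokens containing a target phoneme.

    The bonus scales linearly with the coverage deficit and is zero once
    coverage meets the target (an assumed schedule, not a specified one).
    """
    deficit = max(0.0, constraint.min_coverage - tracker.coverage)
    bonus = max_boost * deficit / constraint.min_coverage
    return {
        tok: score + (bonus if constraint.phonemes & set(token_phonemes.get(tok, ())) else 0.0)
        for tok, score in logits.items()
    }


constraint = Include(phonemes={"s"}, min_coverage=0.30)
tracker = CoverageTracker()
token_phonemes = {"sun": ["s", "ʌ", "n"], "moon": ["m", "u", "n"]}
logits = {"sun": 1.0, "moon": 1.0}

# No text generated yet, so coverage is 0 and the boost is at its maximum.
boosted = boost_logits(logits, token_phonemes, constraint, tracker)

# After emitting "sun", coverage is 100% and the boost switches off.
tracker.update(token_phonemes["sun"], constraint.phonemes)
settled = boost_logits(logits, token_phonemes, constraint, tracker)
```

Tying the bonus to the deficit rather than using a fixed boost keeps the governor from over-saturating the text with the target sound once the coverage goal is met.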
Sub-project C: Content Catalog (P2) - NOT STARTED
Pre-generated compliant content as a companion to Text Analysis. Batch-generate text with constraint profiles, select for quality and compliance, bundle for download.
Depends on: B2 (phoneme inclusion) for coverage-rate profiles
- [ ] Define constraint profiles for clinical use cases:
  - Phoneme exclusion bundles (per-sound: /ɹ/-free, /θ/-free, etc.)
  - Phoneme inclusion bundles (high /s/ density, high /ɹ/ density, etc.)
  - Minimal pair therapy passages
  - Maximal opposition passages
  - Multiple opposition passages
- [ ] Batch generation pipeline — run profiles through governor + T5Gemma
- [ ] Quality selection — fluency scoring, compliance verification, diversity
- [ ] Storage — D1 table or static JSON catalog
- [ ] "Generated Text" tool in PhonoLex web — browse/search/filter/download
- [ ] Bundle formats: sentence bundles, passage bundles, CSV export
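The compliance-verification step of quality selection could look roughly like the sketch below. It checks each candidate against an exclusion set and an optional inclusion coverage target. The function name, the toy lexicon, the whitespace tokenization, and the decision to reject passages containing unknown words are all assumptions. Fluency and diversity scoring are out of scope here.

```python
def is_compliant(text, lexicon, excluded=frozenset(), included=frozenset(), min_coverage=0.0):
    """Return True if no word contains an excluded phoneme and, when an
    inclusion set is given, enough words contain an included phoneme."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return False
    pronunciations = [lexicon.get(w) for w in words]
    if any(p is None for p in pronunciations):
        return False  # assumption: unverifiable words make the passage non-compliant
    if any(set(excluded) & set(p) for p in pronunciations):
        return False
    if included:
        coverage = sum(1 for p in pronunciations if set(included) & set(p)) / len(words)
        return coverage >= min_coverage
    return True


# Toy word-to-phoneme lexicon (hypothetical; a real one would come from the pipeline).
lexicon = {
    "the": ["ð", "ə"], "sun": ["s", "ʌ", "n"], "is": ["ɪ", "z"],
    "hot": ["h", "ɑ", "t"], "red": ["ɹ", "ɛ", "d"], "car": ["k", "ɑ", "ɹ"],
}
candidates = ["The sun is hot.", "The red car is hot."]

r_free = [c for c in candidates if is_compliant(c, lexicon, excluded={"ɹ"})]
s_rich = [c for c in candidates if is_compliant(c, lexicon, included={"s"}, min_coverage=0.25)]
print(r_free)
print(s_rich)
```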
Sub-project E: Dashboard UX Overhaul (P4) - NOT STARTED
Depends on B2 for inclusion constraint UI.
- [ ] Dynamic constraint builder — full IPA phoneme picker, norm sliders
- [ ] Real-time constraint feedback — show active constraints, blocked/boosted tokens
- [ ] Constraint profile management — save/load custom profiles
- [ ] Generation parameter controls — temperature, top-k, max tokens
Dependency Graph
A (pipeline) ✅
├── F (learned vectors) ✅
├── D (chat nav) → trivial, do anytime
├── B2 (phoneme inclusion) → new constraint mechanism
│   └── C (content catalog) → depends on B2 for coverage profiles
│       └── E (dashboard UX) → depends on B2 + C patterns
Priority Order
1. D: Chat nav link (trivial, do first)
2. B2: Phoneme inclusion constraint (enables C)
3. C: Content catalog (the product deliverable)
4. E: Dashboard UX overhaul (polish)
Fixed Bugs This Session
- arpa_to_ipa.json: removed stress diacritics (ˈ, ˌ) — stress tracked as integers
- CMU loader: includes all pronunciation variants (THE → ðə, ðʌ, ði)
- Governor lookup: three-tier attribution (whole-word exact, subword morpheme-filtered, unknown best-effort)
- Governor lookup: morpheme boundary filtering prevents cross-morpheme contamination
- Governor lookup: letter-name filtering prevents CMU abbreviation pollution
- Syllabification: is_vowel() works with clean IPA (no diacritics to strip)
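The first two fixes can be illustrated together: IPA output carries no stress diacritics, stress is returned separately as integers, and every pronunciation variant is converted. This is a sketch under assumptions. The mapping below is a hypothetical four-entry subset, not the contents of arpa_to_ipa.json, and the function name and return shape are invented for illustration.

```python
import re

# Hypothetical subset of an ARPAbet-to-IPA table; values carry no stress diacritics.
ARPA_TO_IPA = {"DH": "ð", "AH0": "ə", "AH": "ʌ", "IY": "i"}


def arpa_to_ipa(tokens):
    """Return (ipa_string, stress_levels) for one ARPAbet pronunciation."""
    ipa, stress = [], []
    for tok in tokens:
        match = re.fullmatch(r"([A-Z]+)([012])?", tok)
        if match is None:
            raise ValueError(f"not an ARPAbet token: {tok!r}")
        base, digit = match.groups()
        if digit is not None:
            stress.append(int(digit))  # stress kept as an integer, one per vowel
        # Prefer a stress-specific entry (AH0 -> ə), else fall back to the base symbol.
        ipa.append(ARPA_TO_IPA[tok] if tok in ARPA_TO_IPA else ARPA_TO_IPA[base])
    return "".join(ipa), stress


# Three variants of THE, matching the IPA forms listed above.
variants = [["DH", "AH0"], ["DH", "AH1"], ["DH", "IY0"]]
results = [arpa_to_ipa(v) for v in variants]
print(results)
```

Because the IPA strings contain only segment symbols, a vowel check such as is_vowel() can compare symbols directly without stripping diacritics first.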