PhonoLex 4.0.0 Release Plan

Monorepo merged. Integrated pipeline built. PHOIBLE replaced with learned vectors. Governor lookup clean. Now: expand constraints, build content catalog, ship chat.

Sub-project A: Integrated Lexical Database Pipeline (P0) — DONE

Pickle eliminated. Pipeline builds from raw datasets via phonolex_data loaders.

[x] A1: Quick integration fixes (imports, paths, pyproject.toml)
[x] A2: Rewrite export-to-d1.py — build_lexical_database() pipeline
  [x] CMU dict with all pronunciation variants (load_cmudict returns list of variants)
  [x] Syllabification, WCM, IPA normalization
  [x] 15 norm datasets (Warriner, Glasgow, Concreteness, Sensorimotor, Kuperman, Semantic Diversity, Socialness, BOI, SUBTLEX, ELP, Iconicity, Prevalence, IPhOD, MorphoLex, CYP-LEX)
  [x] 7 association datasets (SWOW, USF, SimLex, MEN, WordSim-353, SPP, ECCC)
  [x] Vocab lists, morphological segmentation
  [x] Percentiles, minimal pairs, phoneme dot products, syllable components
[x] A3: End-to-end verification
  [x] Pipeline produces 207,673 words, 850,511 edges, 642,279 minimal pairs
  [x] D1 seeded — 45 phonemes, 990 dot products
  [x] All 7 API endpoints return 200 OK
  [x] Governor lookup built — 104,853 tokens, clean attribution
  [x] T5Gemma 9B loads, governed generation works (0 violations on /ɹ/ and /θ,ð/ exclusion)
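
The 990 dot products are consistent with one entry per unordered pair of the 45 phonemes (45 × 44 / 2 = 990). A minimal sketch of such a table follows, assuming each phoneme maps to a plain list of floats. The function name `pairwise_dot_products` and the toy inventory are illustrative, not the pipeline's actual code.

```python
from itertools import combinations


def pairwise_dot_products(vectors):
    """Return {(a, b): dot} for every unordered pair of distinct phonemes."""
    return {
        (a, b): sum(x * y for x, y in zip(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
    }


# Toy inventory (hypothetical): 45 placeholder phonemes, 3-dimensional vectors.
toy = {f"p{i:02d}": [float(i), 1.0, 0.5] for i in range(45)}
table = pairwise_dot_products(toy)
print(len(table))  # 990
```

Sorting the keys before pairing gives each pair a single canonical orientation, so a lookup only needs to order its two phonemes the same way.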

Sub-project F: Learned Feature Vectors (P5) — DONE

PHOIBLE replaced with Bayesian-learned continuous articulatory features.

[x] Hayes (2009) 26-feature prior → Beta priors
[x] ECCC perceptual confusion + MorphoLex alternation + Hillenbrand formant evidence
[x] r=0.987 cosine correlation with PHOIBLE, 8/8 voicing pairs, 4/6 clinical
[x] Composite vectors (α·v_onset + β·v_offset) for diphthong trajectories
[x] Integrated into pipeline, D1, and governor lookup
[x] All PHOIBLE references removed from codebase, frontend, citations, ToS
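
The composite vector for a diphthong can be sketched as a weighted sum of its onset and offset vectors. The weights and the three toy feature dimensions below are assumptions for illustration. The plan does not state the actual values of α and β, and the real vectors follow the 26-feature Hayes layout.

```python
import math


def composite(v_onset, v_offset, alpha=0.5, beta=0.5):
    """Weighted sum alpha * v_onset + beta * v_offset, element by element."""
    return [alpha * a + beta * b for a, b in zip(v_onset, v_offset)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical 3-feature vectors (e.g. high, back, round); values are made up.
v_a = [0.0, 0.5, 0.0]
v_i = [0.9, 0.1, 0.0]
v_ai = composite(v_a, v_i, alpha=0.6, beta=0.4)
print([round(x, 2) for x in v_ai])  # [0.36, 0.34, 0.0]
```

With α + β = 1 the composite stays inside the range spanned by the two endpoints, which keeps it comparable to monophthong vectors under `cosine`.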

Sub-project D: Chat Nav (P3) — NOT STARTED

Trivial — add sidebar link in PhonoLex web to governed chat.

[ ] Add "Chat" nav item in sidebar (bottom of panel)
[ ] Dev-gate: disabled in production, enabled via env flag
[ ] Link to dashboard URL

Sub-project B2: Phoneme Inclusion Constraint (NEW — P1)

New governor mechanism: boost tokens containing target phonemes to achieve a minimum coverage rate in generated text. Enables therapy-targeted content where specific sounds must APPEAR, not just be excluded.

Mechanism: a LogitBoost that raises the probability of compliant tokens containing target phonemes. Coverage is tracked as a running fraction of the generated text, and boost strength grows the further coverage falls below the target.
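
One way this could work is sketched below, under several assumptions: coverage is counted per generated token, the boost ramps linearly with the shortfall, and `max_boost` is a free tuning parameter. `CoverageTracker` and `apply_boost` are hypothetical names, not the existing API in packages/governors/.

```python
from dataclasses import dataclass


@dataclass
class CoverageTracker:
    """Running share of generated tokens that contain a target phoneme."""
    target: float  # desired minimum coverage, must be > 0
    hits: int = 0
    total: int = 0

    def update(self, contains_target: bool) -> None:
        self.total += 1
        self.hits += int(contains_target)

    @property
    def coverage(self) -> float:
        return self.hits / self.total if self.total else 0.0

    def boost(self, max_boost: float = 5.0) -> float:
        """Linear ramp: 0 at or above target, max_boost at zero coverage."""
        deficit = max(0.0, self.target - self.coverage)
        return max_boost * deficit / self.target


def apply_boost(logits, has_target, boost):
    """Add boost to the logits of tokens flagged as containing a target phoneme.

    has_target is assumed to be True only for tokens that already pass
    every other active constraint.
    """
    return [l + boost if flag else l for l, flag in zip(logits, has_target)]


tracker = CoverageTracker(target=0.30)
for flag in [True, False, False, False]:
    tracker.update(flag)
print(tracker.coverage)           # 0.25
print(round(tracker.boost(), 2))  # 0.83
```

A linear ramp is the simplest choice. A steeper schedule would pull coverage back faster but may hurt fluency.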

[ ] Design inclusion constraint: Include(phonemes={"s"}, min_coverage=0.30)
[ ] Implement as LogitBoost mechanism in packages/governors/
[ ] Add coverage tracking to GovernorContext
[ ] Add constraint profile support in dashboard
[ ] Test: generate text with 30% /s/ coverage target
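
The design and test items above might look like the following sketch. It assumes coverage is measured per word against a word-to-phonemes lookup, and that words missing from the lookup count as not containing the target. Only the `Include(phonemes=..., min_coverage=...)` signature comes from the plan. The `coverage` helper and the toy lexicon are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Include:
    """Inclusion constraint: target phonemes plus a minimum coverage rate."""
    phonemes: frozenset
    min_coverage: float

    def __post_init__(self):
        # Accept any iterable of phonemes and store it as a frozenset.
        object.__setattr__(self, "phonemes", frozenset(self.phonemes))
        if not 0.0 < self.min_coverage <= 1.0:
            raise ValueError("min_coverage must be in (0, 1]")


def coverage(words, lexicon, phonemes):
    """Fraction of words whose pronunciation has at least one target phoneme."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if phonemes & set(lexicon.get(w, ())))
    return hits / len(words)


# Toy lexicon (hypothetical); real pronunciations would come from the pipeline.
lexicon = {
    "the": ["ð", "ə"],
    "sun": ["s", "ʌ", "n"],
    "sees": ["s", "i", "z"],
    "bird": ["b", "ɝ", "d"],
}
constraint = Include(phonemes={"s"}, min_coverage=0.30)
rate = coverage("the sun sees the bird".split(), lexicon, constraint.phonemes)
print(rate)  # 0.4
```

A test for the 30% /s/ target could then reduce to `rate >= constraint.min_coverage` on generated text.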

Sub-project C: Content Catalog (P2) — NOT STARTED

Pre-generated compliant content as a companion to Text Analysis. Batch-generate text with constraint profiles, select for quality and compliance, bundle for download.

Depends on: B2 (phoneme inclusion) for coverage-rate profiles

[ ] Define constraint profiles for clinical use cases:
  Phoneme exclusion bundles (per-sound: /ɹ/-free, /θ/-free, etc.)
  Phoneme inclusion bundles (high /s/ density, high /ɹ/ density, etc.)
  Minimal pair therapy passages
  Maximal opposition passages
  Multiple opposition passages
[ ] Batch generation pipeline — run profiles through governor + T5Gemma
[ ] Quality selection — fluency scoring, compliance verification, diversity
[ ] Storage — D1 table or static JSON catalog
[ ] "Generated Text" tool in PhonoLex web — browse/search/filter/download
[ ] Bundle formats: sentence bundles, passage bundles, CSV export
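
The compliance step of quality selection could be sketched as a filter over generated candidates. The sketch covers exclusion profiles only and assumes whitespace tokenization, a word-to-phonemes lookup, and that unknown words pass. A stricter catalog might reject them instead. Function names and the toy lexicon are hypothetical, and fluency scoring is out of scope here.

```python
def violations(text, lexicon, excluded):
    """Return the words in text whose pronunciation has an excluded phoneme."""
    words = text.lower().split()
    return [w for w in words if excluded & set(lexicon.get(w, ()))]


def select_compliant(candidates, lexicon, excluded):
    """Keep candidates with no violations, dropping case-insensitive repeats."""
    kept, seen = [], set()
    for text in candidates:
        key = " ".join(text.lower().split())
        if key in seen or violations(text, lexicon, excluded):
            continue
        seen.add(key)
        kept.append(text)
    return kept


# Toy lexicon and candidates (hypothetical).
lexicon = {
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
    "red": ["ɹ", "ɛ", "d"],
}
candidates = ["The cat sat", "The red cat", "the cat sat"]
print(select_compliant(candidates, lexicon, excluded={"ɹ"}))  # ['The cat sat']
```

Punctuation handling is omitted. Real text would need the same tokenization the governor uses.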

Sub-project E: Dashboard UX Overhaul (P4) — NOT STARTED

Depends on: B2 (inclusion constraint UI) and C (content catalog patterns)

[ ] Dynamic constraint builder — full IPA phoneme picker, norm sliders
[ ] Real-time constraint feedback — show active constraints, blocked/boosted tokens
[ ] Constraint profile management — save/load custom profiles
[ ] Generation parameter controls — temperature, top-k, max tokens

Dependency Graph

A (pipeline) ✅
├── F (learned vectors) ✅
├── D (chat nav) → trivial, do anytime
└── B2 (phoneme inclusion) → new constraint mechanism
    └── C (content catalog) → depends on B2 for coverage profiles
        └── E (dashboard UX) → depends on B2 + C patterns

Priority Order

D — Chat nav link (trivial, do first)
B2 — Phoneme inclusion constraint (enables C)
C — Content catalog (the product deliverable)
E — Dashboard UX overhaul (polish)
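
As a sanity check, the priority order can be tested against the dependency graph. The sketch below encodes the graph by hand, mapping each sub-project to its prerequisites, and uses the standard-library `graphlib` module (Python 3.9+). The helper `respects_dependencies` is illustrative.

```python
from graphlib import TopologicalSorter

# Each sub-project mapped to the sub-projects it depends on.
deps = {
    "A": set(),
    "F": {"A"},
    "D": {"A"},
    "B2": {"A"},
    "C": {"B2"},
    "E": {"B2", "C"},
}


def respects_dependencies(order, deps, done=()):
    """True if every item in order comes after all of its prerequisites."""
    seen = set(done)
    for item in order:
        if not deps[item] <= seen:
            return False
        seen.add(item)
    return True


planned = ["D", "B2", "C", "E"]
print(respects_dependencies(planned, deps, done={"A", "F"}))  # True

# Any valid full order, for comparison.
print(list(TopologicalSorter(deps).static_order()))
```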

Fixed Bugs This Session

arpa_to_ipa.json: removed stress diacritics (ˈ, ˌ) — stress tracked as integers
CMU loader: includes all pronunciation variants (THE → ðə, ðʌ, ði)
Governor lookup: three-tier attribution (whole-word exact, subword morpheme-filtered, unknown best-effort)
Governor lookup: morpheme boundary filtering prevents cross-morpheme contamination
Governor lookup: letter-name filtering prevents CMU abbreviation pollution
Syllabification: is_vowel() works with clean IPA (no diacritics to strip)
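
The first two fixes can be illustrated with a small sketch that turns ARPAbet phones into diacritic-free IPA while keeping stress as integers. The three tiny tables are a hypothetical subset chosen to reproduce the THE variants listed above. They are not the contents of arpa_to_ipa.json, and `transcribe` is not the loader's real function.

```python
import re

# Hypothetical subset of an ARPAbet-to-IPA table.
CONSONANTS = {"DH": "ð"}
VOWELS = {"AH": "ʌ", "IY": "i"}
REDUCED = {"AH": "ə"}  # unstressed AH rendered as schwa

_PHONE = re.compile(r"([A-Z]+)([012])?")


def transcribe(phones):
    """Return (ipa_string, stress_levels); no stress marks in the IPA."""
    ipa, stresses = [], []
    for phone in phones:
        match = _PHONE.fullmatch(phone)
        if match is None:
            raise ValueError(f"not an ARPAbet phone: {phone!r}")
        base, digit = match.groups()
        if digit is None:  # consonants carry no stress digit
            ipa.append(CONSONANTS[base])
            continue
        stress = int(digit)
        stresses.append(stress)
        if stress == 0 and base in REDUCED:
            ipa.append(REDUCED[base])
        else:
            ipa.append(VOWELS[base])
    return "".join(ipa), stresses


# Toy pronunciation variants for THE, kept as a list rather than one entry.
variants = [["DH", "AH0"], ["DH", "AH1"], ["DH", "IY0"]]
for phones in variants:
    print(transcribe(phones))
```

Because the IPA string carries no ˈ or ˌ, a vowel check can compare symbols directly without stripping diacritics first.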