
Next Session — v5 Constrained Generation

Where we are

On branch feature/governed-generation-ui. The dynamic governor with the word-aware reranker is working for exclusion and inclusion; the other constraint types are wired but untested.

Tested and working:

  • exclude — zero leaks, word-aware reranker + GUARD retry
  • include — per-phoneme calibrated boost, self-regulating coverage, coverage % UI

Wired but untested (11 remaining):

  1. exclude_clusters
  2. bound (35 filterable norms)
  3. complexity_wcm
  4. complexity_syllables
  5. complexity_shapes
  6. msh
  7. vocab_boost
  8. vocab_only
  9. boost_minpair
  10. boost_maxopp
  11. thematic

Clinical value assessment — what to prioritize

Tier 1: Essential for SLPs

  • exclude — "No /ɹ/ for this client." Core use case. Done.
  • include + coverage % — "Practice /b/ at 20%." Done.
  • msh — Motor Speech Hierarchy. "Only stage 2-3 sounds." Clinicians use this directly.
  • bound: aoa — "Words a 5-year-old would know." Very common clinical target.
  • complexity_syllables — "Only 1-2 syllable words." Standard clinical target.

Tier 2: Useful but secondary

  • exclude_clusters — "Allow /s/ in singletons but not clusters." Some clinicians want this.
  • complexity_wcm — More nuanced than syllable count. Researchers and advanced clinicians.
  • bound: concreteness — Concrete words are easier to visualize/teach.
  • vocab_only — Restrict to specific word lists (Ogden basic, GSL). ELL contexts.
  • thematic — "Words about animals." Themed therapy sessions.

Tier 3: Questionable — may not produce useful output

  • complexity_shapes — CV/CVC/CCVC. Very granular. May over-constrain and produce garbage.
  • vocab_boost — Soft targeting of word lists. Overlaps with thematic and include.
  • boost_minpair — Minimal pairs are a lookup/selection tool, not a generation constraint.
  • boost_maxopp — Same issue. Maximal opposition is a contrastive therapy approach.
  • bound: frequency/log_frequency — Raw frequency too opaque and restrictive.
  • bound: sensorimotor norms — Very niche, sparse data coverage.

Unified phoneme targeting: coverage 0% = exclude

Collapse exclude + include into a single "Phoneme Targeting" section. Each phoneme gets a coverage slider (0-50%). Coverage 0% routes to the hard exclude path (reranker penalty + GUARD, zero tolerance). Coverage >0% routes to the soft include boost. One mental model, one section, one slider per phoneme.
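
The routing rule can be sketched as follows. This is an illustrative sketch, not the actual reranker/checker API — the type and function names (`PhonemeConstraint`, `route_constraint`) are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PhonemeConstraint:
    phoneme: str       # e.g. "ɹ"
    coverage: float    # slider value, 0.0-0.5

def route_constraint(c: PhonemeConstraint) -> dict:
    """Coverage 0% routes to the hard exclude path; >0% to the soft boost."""
    if c.coverage == 0.0:
        # Hard block: reranker penalty + GUARD retry, zero tolerance.
        return {"mode": "exclude", "phoneme": c.phoneme}
    # Soft include boost with self-regulating coverage toward the target %.
    return {"mode": "include", "phoneme": c.phoneme, "target": c.coverage}
```

One slider per phoneme means the UI never has to present exclude and include as separate concepts; the backend decides the path from the value alone.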

UI redesign: organized constraint categories

Reorganize the constraint UI into clear categories:

  1. Phoneme Targeting — unified section, two modes:
     • Exclude mode (coverage 0% = hard block)
     • Include mode (coverage 5-50% = soft boost with self-regulating coverage)
  2. Complexity — four controls in one section:
     • Max syllable count
     • Max WCM
     • Allowed syllable shapes
     • MSH stage
  3. Psycholinguistic Bounds — curated to norms that don't break function words:
     • Safe for generation: familiarity, frequency, imageability, semantic_diversity, socialness, prevalence
     • Use with care (some function word failures): concreteness, AoA (Glasgow not Kuperman), valence, arousal, dominance
     • REMOVE from generation UI: all phonotactic probs, elp_lexical_decision_rt, aoa_kuperman, BoI, sensorimotor norms, phoneme_count, wcm_score (handled by Complexity section)
     • All norms remain available in analysis/lookup tools
  4. Themed Vocabulary — semantic fields + word lists, composable:
     • Seed words define the semantic field (USF associations)
     • Word lists constrain the pool (Ogden, AVL, GSL, etc.)
     • VocabOnly mode (hard restrict to list + stop words + punctuation always)
     • VocabBoost mode (soft encourage)
     • AVL supported out of the box
     • e.g. /theme animals ogden_basic = animal words from Ogden list
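
The VocabOnly membership rule above ("hard restrict to list + stop words + punctuation always") can be sketched like this; the function name and signature are illustrative, not the checker's actual API:

```python
import string

def vocab_only_allows(word: str, word_list: set[str], stop_words: set[str]) -> bool:
    """Hard-restrict mode: word-list members, stop words, and punctuation
    always pass; everything else is rejected."""
    if word and all(ch in string.punctuation for ch in word):
        return True  # punctuation always allowed
    w = word.lower()
    return w in word_list or w in stop_words
```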

Adjustable output length / complexity level — a low AoA bound should produce shorter sentences and simpler structure. Auto-adjust max_new_tokens and punctuation-boost aggressiveness based on the active constraints, or expose user-settable "reading level" bins.
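
A minimal sketch of the "reading level" bin idea — bin edges and parameter values here are placeholders, not tuned values:

```python
from typing import Optional

def reading_level_params(max_aoa: Optional[float]) -> dict:
    """Map an active AoA bound onto generation-length and punctuation
    settings. All thresholds below are illustrative."""
    if max_aoa is not None and max_aoa <= 6.0:
        return {"max_new_tokens": 40, "punct_boost_scale": 1.5}
    if max_aoa is not None and max_aoa <= 9.0:
        return {"max_new_tokens": 60, "punct_boost_scale": 1.2}
    return {"max_new_tokens": 90, "punct_boost_scale": 1.0}
```

Whether this is derived automatically from constraints or exposed as explicit bins is the open design question.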

Constraint sliders should use a percentile scale for opaque metrics (e.g. 25th-75th percentile frequency) instead of raw values (e.g. 0.5-50). Raw values remain available in analysis/expert mode. All _percentile columns are already in D1.
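
Mapping a percentile slider back to raw bounds is a simple lookup once the column is sorted; a sketch (the real _percentile columns in D1 make this precomputed, so this function is only illustrative):

```python
def percentile_bounds(sorted_raw: list[float], lo_pct: float, hi_pct: float) -> tuple[float, float]:
    """Convert a percentile range (e.g. 25-75) into raw metric bounds
    by indexing into a sorted column of raw values."""
    n = len(sorted_raw)
    lo = sorted_raw[min(n - 1, int(lo_pct / 100 * n))]
    hi = sorted_raw[min(n - 1, int(hi_pct / 100 * n))]
    return lo, hi
```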

Power-user features via slash commands in the prompt field. The command parser/registry lives in the git history of the deleted dashboard frontend — port it to packages/web/frontend/src/.

Spec: docs/superpowers/specs/2026-03-16-governed-chat-command-language-design.md
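
The command-language shape (see the /theme example above) could look like the following registry sketch. This is a hypothetical illustration, not the parser from git history — names (`COMMANDS`, `parse_prompt`) and the args-end-at-next-slash rule are assumptions; the real grammar is in the spec:

```python
COMMANDS: dict = {}

def command(name: str):
    """Register a slash command handler under its name."""
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register

@command("theme")
def theme_cmd(args: list[str]) -> dict:
    # e.g. /theme animals ogden_basic -> thematic constraint + word list
    return {"constraint": "thematic", "args": args}

def parse_prompt(prompt: str) -> tuple[list[dict], str]:
    """Pull leading /commands off the prompt; the remainder is free text.
    Each command consumes tokens until the next /command."""
    constraints, words = [], prompt.split()
    while words and words[0].startswith("/"):
        name = words.pop(0)[1:]
        args = []
        while words and not words[0].startswith("/"):
            args.append(words.pop(0))
        if name in COMMANDS:
            constraints.append(COMMANDS[name](args))
    return constraints, " ".join(words)
```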

Key files

  • packages/governors/src/phonolex_governors/generation/reranker.py — word-aware reranker with partial word reconstruction, self-regulating include coverage
  • packages/governors/src/phonolex_governors/checking/checker.py — all check types (exclude, MSH, bounds, complexity, vocab-only, clusters)
  • packages/generation/server/model.py — generate_with_checking(), penalty escalation, punctuation boost
  • packages/generation/server/governor.py — build_checker_config(), build_boost_processor()
  • packages/generation/server/word_norms.py — word-level norms (105K), vocab memberships, frequency-weighted phoneme natural rates
  • packages/generation/server/routes/generate.py — /generate-single with include coverage stats
  • packages/web/frontend/src/components/tools/GovernedGenerationTool/OutputCard.tsx — compliance + include highlighting
  • packages/web/frontend/src/components/tools/GovernedGenerationTool/PhonemeConstraints.tsx — coverage % slider (no strength)

Key tuning values

  • Penalty schedule: [15, 30, 60, 100, 100]
  • Include boost: min(2.5 * sqrt(ln(target/natural)), 10.0) per-phoneme, SUBTLEX frequency-weighted natural rates
  • Punctuation boost: 2.0 baseline + 2.0/word over 12 words
  • Temperature: 0.6, repetition_penalty: 1.3, top_k: 50, top_p: 0.9
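
The two boost formulas above transcribe directly to code; the only addition is a guard when the target rate is at or below the natural rate (an assumption — the source formula is undefined there because ln goes negative):

```python
import math

def include_boost(target: float, natural: float) -> float:
    """Per-phoneme include boost: min(2.5 * sqrt(ln(target/natural)), 10.0),
    with SUBTLEX frequency-weighted natural rates supplied by the caller."""
    if target <= natural:
        return 0.0  # assumption: no boost needed at/below the natural rate
    return min(2.5 * math.sqrt(math.log(target / natural)), 10.0)

def punctuation_boost(words_emitted: int) -> float:
    """2.0 baseline + 2.0 per word beyond 12 words."""
    return 2.0 + 2.0 * max(0, words_emitted - 12)
```

For example, targeting 20% coverage of a phoneme whose natural rate is 2% gives a boost of about 3.79, comfortably under the 10.0 cap.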

How to start the servers

# Backend (FastAPI + T5Gemma, ~40s cold start)
cd packages/generation
nohup uv run uvicorn server.main:app --host 0.0.0.0 --port 8000 --log-level debug > /tmp/phonolex-backend.log 2>&1 &

# Frontend (React, instant)
cd packages/web/frontend
npm run dev

What NOT to do

READ THE EXISTING CODE BEFORE TOUCHING ANYTHING.

  • The UI is packages/web/frontend/. There is no dashboard frontend. Do not create one.
  • DO NOT reimplement algorithms that already exist. The reranker, checker, word norms, boost calibration, coverage tracking — all exist and work. Read reranker.py, checker.py, word_norms.py, model.py before writing anything.
  • DO NOT create new packages or directories unless explicitly asked. The code lives where it lives.
  • Don't boost compliant word-start tokens in the reranker — causes fragmentation by biasing toward new words over continuations
  • Don't use dictionary phoneme rates for boost calibration — use SUBTLEX frequency-weighted rates
  • Don't rebuild the governor lookup for phoneme enforcement — G2P handles that now
  • Don't use static per-token masks — the dynamic paradigm replaces them