ThematicConstraint — Association-Backed Semantic Field Boost¶
2026-03-17 — Design spec for thematic constraint using cognitive association graph
Problem¶
The governor engine has no way to steer generation toward a semantic theme. Clinicians frequently need themed content: "generate text about animals," "keep the topic focused on food." Currently this relies entirely on the model's response to the prompt, with no constraint-level enforcement.
PhonoLex has 1M+ cognitive association edges (SWOW, USF) that capture human free-association norms — exactly the data needed to define semantic fields. This data is fully loaded in the web app but completely disconnected from the governor pipeline.
Solution¶
ThematicConstraint — a static LogitBoost that defines a semantic field from exemplar words, sweeps the vocabulary for all words associated with that field above a threshold, and boosts those tokens proportionally to their association strength.
Cognitive Model: Exemplar Theory¶
The constraint uses an exemplar-based approach: seed words are instances of the category. A word belongs to the field if it is strongly associated with any exemplar (max aggregation). This is appropriate for open-ended therapeutic themes where the clinician provides representative examples, not a formal definition.
Future extension (noted, not built): Prototype mode using mean aggregation across exemplars for tighter, more focused fields. Same scoring pipeline, different aggregation function.
Parameters¶
- `seed_words: list[str]` — exemplar words defining the semantic field
- `strength: float = 1.5` — scales the association-derived weights
- `threshold: float = 0.02` — minimum field score for inclusion (filters noise)
Scoring¶
field_score(word) = max(assoc(word, exemplar) for exemplar in seed_words)
boost(token) = field_score(token_word) * strength if field_score >= threshold
= 0.0 otherwise
Where assoc(a, b) looks up the pre-merged association graph (SWOW + USF max-merged at load time into a single canonical-keyed dict). Both directions are checked via canonical key ordering. See Data Loading section for details.
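A minimal executable sketch of the scoring rule above (assuming `assoc` is any callable returning 0.0 for unknown pairs; the function names here are illustrative, not the engine API):

```python
def field_score(word: str, seed_words: list[str], assoc) -> float:
    # Exemplar aggregation: a word's field score is its strongest
    # association to any seed (max), not an average.
    return max((assoc(word, seed) for seed in seed_words), default=0.0)

def boost(word: str, seed_words: list[str], assoc,
          strength: float = 1.5, threshold: float = 0.02) -> float:
    # Below-threshold words receive no boost at all (hard cutoff, not a taper).
    score = field_score(word, seed_words, assoc)
    return score * strength if score >= threshold else 0.0
```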
Association Data Sources¶
| Dataset | Score range | Edge count | What it measures |
|---|---|---|---|
| SWOW-EN | 0.0–1.0 (proportion of responses) | ~500K | Free association: "what word comes to mind?" |
| USF | 0.0–1.0 (proportion of participants) | ~72K | Free association (Nelson et al. 2004) |
Both use identical 0–1 normalized scales representing response proportions. They are directly commensurable without rescaling. Merged via max — for each word pair, the strongest known association wins.
Mechanism¶
Static LogitBoost only. No coverage mode — the user dials strength up or down to control thematic influence. This matches the clinical intent: "nudge toward this theme" not "exactly N% themed tokens."
Data Loading and Build-Time Architecture¶
Association Graph¶
The association graph is loaded once at server startup, alongside the existing lookup:
# Canonical ordering: word1 < word2 (alphabetical)
assoc_graph: dict[tuple[str, str], float]
Building the graph:
1. Load SWOW via load_swow() from phonolex_data.loaders — for each (cue, target, strength), canonicalize the key and insert with max if key exists
2. Load USF via load_free_association() — for each (cue, target, usf_forward), merge with max against existing entries
Result: one graph with the strongest known association for every word pair.
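The two-step merge can be sketched as follows (assuming each loader yields `(cue, target, strength)` tuples; the exact loader output shape is an assumption, and `build_assoc_graph_from` is a hypothetical name):

```python
def build_assoc_graph_from(*edge_sources) -> dict[tuple[str, str], float]:
    # Canonical key (alphabetical ordering) makes the graph symmetric.
    graph: dict[tuple[str, str], float] = {}
    for edges in edge_sources:
        for cue, target, strength in edges:
            key = (min(cue, target), max(cue, target))
            # Max-merge: the strongest known association for the pair wins.
            if strength > graph.get(key, 0.0):
                graph[key] = strength
    return graph
```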
Lookup function:
def assoc_strength(graph, a, b):
    key = (min(a, b), max(a, b))
    return graph.get(key, 0.0)
Collapsing both orderings into a single canonical key means association is treated as symmetric for field membership, even though the underlying data is directional.
Server Integration¶
model.py already loads the lookup at startup. Add a parallel load of the association graph:
_assoc_graph: dict[tuple[str, str], float] | None = None

def load_model():
    global _assoc_graph  # module-level singleton, set once at startup
    ...
    _assoc_graph = build_assoc_graph()  # SWOW + USF merged
A new get_assoc_graph() accessor, parallel to get_lookup().
Build Kwargs¶
build_governor() passes assoc_graph as a new kwarg when building constraints:
kwargs = dict(lookup=lookup, vocab_size=vocab_size, assoc_graph=assoc_graph)
ThematicConstraint.build() reads kwargs["assoc_graph"]. All other constraints ignore it.
Kwarg threading: assoc_graph flows through the full chain: governor_cache.get_processor(assoc_graph=) → build_logits_processor(assoc_graph=) → build_governor(assoc_graph=) → Governor.from_constraints(**kwargs) → ThematicConstraint.build(**kwargs). Each function in the chain accepts assoc_graph=None as an optional parameter and includes it in kwargs when present. The graph is a server-lifetime singleton (never changes), so it does not affect the GovernorCache hash — the cache key remains constraint-list-only.
Memory estimate: ~500K SWOW edges + ~72K USF edges → ~550K canonical pairs after deduplication. Each entry is (str, str) → float with Python dict overhead. Estimated ~80–120MB. The server already runs T5Gemma 9B (bfloat16, ~18GB on MPS). The graph fits comfortably in the remaining CPU memory.
At Build Time¶
ThematicConstraint.build():
1. For each token in the lookup, get the token's word (from entry.get("word") — stripped, lowercased)
2. Compute field_score = max(assoc_strength(graph, word, seed) for seed in seed_words)
3. If field_score >= threshold, set weights[token_id] = field_score
4. Return LogitBoost(weights, scale=strength)
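The four steps above can be sketched with NumPy weights (the `{token_id: {"word": ...}}` lookup shape follows the spec; `score_vocabulary` is a hypothetical helper, and the final `LogitBoost(weights, scale=strength)` construction is omitted to keep the sketch self-contained):

```python
import numpy as np

def score_vocabulary(lookup: dict[int, dict],
                     graph: dict[tuple[str, str], float],
                     seed_words: list[str], vocab_size: int,
                     threshold: float = 0.02) -> np.ndarray:
    weights = np.zeros(vocab_size, dtype=np.float32)
    for token_id, entry in lookup.items():
        word = (entry.get("word") or "").strip().lower()
        if not word:
            continue
        # Max over exemplars; subwords absent from the graph score 0.0.
        score = max(
            (graph.get((min(word, s), max(word, s)), 0.0) for s in seed_words),
            default=0.0,
        )
        if score >= threshold:
            weights[token_id] = score
    return weights
```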
Edge Cases¶
Seed words not in graph: If a seed word has zero edges in the association graph (very rare word, proper noun, etc.), it contributes nothing to the field. The constraint still works from the other seeds. If all seeds are absent, every token gets weight 0 — effectively a no-op. The server should emit a warning in the generation response.
Warning mechanism: ThematicConstraint.build() returns the LogitBoost along with a count of non-zero weights. If zero, build_governor() collects the warning message. The route handler includes it in the response via a new warnings: list[str] | None = None field on AssistantResponse and SingleGenerationResponse. The frontend checks response.assistant.warnings and emits system messages for each.
Subword tokens: BPE subword fragments (e.g., "##ing", "▁un") have word values in the lookup but are absent from the association graph. They naturally receive zero field scores and are excluded — this is the intended filtering behavior. No special subword handling is needed.
Tokens not in lookup: Same as every other constraint — no word mapping, no boost, pass through.
Threshold at 0.0: All words with any association to any seed are included. May produce very broad boosting — allowed but not recommended.
Empty seed_words: Validation error at the schema level. At least one seed word required.
Schema and API¶
Engine constraint — new file packages/governors/src/diffusion_governors/thematic.py:
class ThematicConstraint(Constraint):
    def __init__(self, seed_words: list[str], strength: float = 1.5, threshold: float = 0.02):
        self.seed_words = seed_words
        self.strength = strength
        self.threshold = threshold

    @property
    def mechanism_kind(self) -> str:
        return "boost"

    def build(self, **kwargs) -> Mechanism:
        assoc_graph = kwargs.get("assoc_graph")
        if assoc_graph is None:
            raise ValueError("ThematicConstraint requires assoc_graph in build kwargs")
        # ... score tokens, build LogitBoost
Dashboard schema in schemas.py:
class ThematicConstraint(BaseModel):
    type: Literal["thematic"] = "thematic"
    seed_words: list[str]
    strength: float = 1.5
    threshold: float = 0.02

    @field_validator("seed_words")
    @classmethod
    def validate_seed_words(cls, v):
        if not v:
            raise ValueError("At least one seed word is required")
        return v
Constraint union: Add ThematicConstraint to the discriminated union.
_to_dg_constraint in governor.py:
elif isinstance(c, ThematicConstraint):
    return DGThematic(seed_words=c.seed_words, strength=c.strength, threshold=c.threshold)
Frontend¶
Command Syntax¶
- `/theme <word>... [strength]` — set theme with optional strength override
- `/theme dog cat bird` → strength 1.5 (default)
- `/theme dog cat bird 3.0` → strength 3.0
- `/remove theme` → remove
Parser: Detect whether the last argument is a number (strength override) or another seed word. If it parses as a float and there is at least 1 preceding arg, treat as strength. Otherwise treat all args as seeds. This means /theme dog 3.0 sets seed=["dog"] with strength=3.0, while /theme dog uses default strength.
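The last-argument heuristic can be sketched in Python, even though the real implementation lives in the TypeScript parser (`parse_theme` here is illustrative only):

```python
def parse_theme(args: list[str]) -> tuple[list[str], float]:
    # Default strength matches the constraint default.
    seeds, strength = args, 1.5
    if len(args) >= 2:
        try:
            # Last arg parses as a number and at least one seed precedes it:
            # treat it as a strength override.
            strength = float(args[-1])
            seeds = args[:-1]
        except ValueError:
            pass
    return seeds, strength
```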
threshold is not exposed in the command syntax — the default 0.02 is always used. Adjustable via direct API only.
StoreEntry¶
One entry per /theme command — the seeds are a group, not individual entries:
| { type: "theme"; seedWords: string[]; strength: number }
Compiler¶
const themes = entries.filter((e) => e.type === "theme");
for (const t of themes) {
  result.push({ type: "thematic", seed_words: t.seedWords, strength: t.strength });
}
ConstraintBar¶
- Chip label: `"dog, cat, bird"` (seed words joined)
- Chip category: `"theme"` — new color
- matchFields: `undefined` — remove by type only (triggers the existing "remove all of type" path in the store's `remove` method)
Registry¶
Add "theme" to VERBS and add help text entry.
Future Extensions (noted, not built)¶
- Prototype mode: Mean aggregation across exemplars instead of max. Tighter fields. Same scoring pipeline, different aggregation function. Could be a parameter: `mode: "exemplar" | "prototype"`.
- Autocomplete for seed words: Reuse the CommandAutocomplete component infrastructure. The known word set would be words appearing in SWOW/USF, exposed via a `/api/theme-vocab` endpoint or bundled at startup. Same pattern as the IPA keyboard but with vocabulary instead of phonemes.
- Multi-hop expansion: Follow associates of associates with decayed strength. Reaches more words but introduces computed scores. Deferred pending coherence assessment.
- Coverage mode: Track thematic token rate across the session, boost proportionally to the gap. The `_CoverageMechanism` infrastructure exists, but thematic coverage needs a clear clinical use case before adding complexity.
Files Affected¶
Engine (packages/governors/src/diffusion_governors/)¶
- Create: `thematic.py` — `ThematicConstraint` class, `assoc_strength()` helper, graph-to-boost scoring
- Modify: `__init__.py` — add `ThematicConstraint` to imports and `__all__`
Dashboard server (packages/dashboard/server/)¶
- Modify: `model.py` — load association graph at startup, add `get_assoc_graph()` accessor
- Modify: `schemas.py` — add `ThematicConstraint` schema, update `Constraint` union
- Modify: `governor.py` — add `_to_dg_constraint` case for ThematicConstraint; pass `assoc_graph` to `build_governor()` kwargs
- Modify: `routes/generate.py` — pass `assoc_graph` to `governor_cache.get_processor()`
Dashboard frontend (packages/dashboard/frontend/src/)¶
- Modify: `types.ts` — add `ThematicConstraint` interface, `StoreEntry` variant, update unions
- Modify: `commands/parser.ts` — add `parseTheme` function
- Modify: `commands/registry.ts` — add `"theme"` to VERBS
- Modify: `commands/compiler.ts` — add theme compilation
- Modify: `components/ConstraintBar/index.tsx` — add theme chip handling
- Modify: `store/constraintStore.ts` — minor: handle new entry type (if needed)
Tests¶
- Engine: test field scoring (single seed, multi seed, threshold, no-match seeds, empty graph)
- Server: test schema validation, _to_dg_constraint mapping
- Frontend: parser tests for `/theme`, compiler tests
New data dependency¶
- `build_assoc_graph()` utility function (in `model.py` or a separate `associations.py` module in the server)
- Loads from `phonolex_data.loaders.load_swow()` and `phonolex_data.loaders.load_free_association()`
- Loaded once at server startup, stored as a module-level singleton
What This Does NOT Change¶
- Existing constraints — untouched
- Lookup format — no changes to `build_lookup.py` or the lookup JSON
- Web app association routes — untouched (D1-based, separate system)
- Governor composition order — gates → boosts → projections, unchanged
- Coverage tracking — ThematicConstraint is static-only, does not use CoverageTracker