ThematicConstraint — Association-Backed Semantic Field Boost¶
2026-03-17 — Design spec for thematic constraint using cognitive association graph
Problem¶
The governor engine has no way to steer generation toward a semantic theme. Clinicians frequently need themed content: "generate text about animals," "keep the topic focused on food." Currently this relies entirely on the model's response to the prompt, with no constraint-level enforcement.
PhonoLex has 1M+ cognitive association edges (SWOW, USF) that capture human free-association norms — exactly the data needed to define semantic fields. This data is fully loaded in the web app but completely disconnected from the governor pipeline.
Solution¶
ThematicConstraint — a static LogitBoost that defines a semantic field from exemplar words, sweeps the vocabulary for all words associated with that field above a threshold, and boosts those tokens proportionally to their association strength.
Cognitive Model: Exemplar Theory¶
The constraint uses an exemplar-based approach: seed words are instances of the category. A word belongs to the field if it is strongly associated with any exemplar (max aggregation). This is appropriate for open-ended therapeutic themes where the clinician provides representative examples, not a formal definition.
Future extension (noted, not built): Prototype mode using mean aggregation across exemplars for tighter, more focused fields. Same scoring pipeline, different aggregation function.
Parameters¶
- `seed_words: list[str]` — exemplar words defining the semantic field
- `strength: float = 1.5` — scales the association-derived weights
- `threshold: float = 0.02` — minimum field score for inclusion (filters noise)
Scoring¶
field_score(word) = max(assoc(word, exemplar) for exemplar in seed_words)
boost(token) = field_score(token_word) * strength if field_score >= threshold
= 0.0 otherwise
Where assoc(a, b) looks up the pre-merged association graph (SWOW + USF max-merged at load time into a single canonical-keyed dict). Both directions are checked via canonical key ordering. See Data Loading section for details.
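A minimal executable sketch of the scoring rule above (assuming `assoc` is any callable returning 0.0 for unknown pairs; the function names here are illustrative, not the engine API):

```python
def field_score(word: str, seed_words: list[str], assoc) -> float:
    # Exemplar aggregation: a word's field score is its strongest
    # association to any seed (max), not an average.
    return max((assoc(word, seed) for seed in seed_words), default=0.0)

def boost(word: str, seed_words: list[str], assoc,
          strength: float = 1.5, threshold: float = 0.02) -> float:
    # Below-threshold words receive no boost at all (hard cutoff, not a taper).
    score = field_score(word, seed_words, assoc)
    return score * strength if score >= threshold else 0.0
```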
Association Data Sources¶
| Dataset | Score range | Edge count | What it measures |
|---|---|---|---|
| SWOW-EN | 0.0–1.0 (proportion of responses) | ~500K | Free association: "what word comes to mind?" |
| USF | 0.0–1.0 (proportion of participants) | ~72K | Free association (Nelson et al. 2004) |
Both use identical 0–1 normalized scales representing response proportions. They are directly commensurable without rescaling. Merged via max — for each word pair, the strongest known association wins.
Mechanism¶
Static LogitBoost only. No coverage mode — the user dials strength up or down to control thematic influence. This matches the clinical intent: "nudge toward this theme" not "exactly N% themed tokens."
Data Loading and Build-Time Architecture¶
Association Graph¶
The association graph is loaded once at server startup, alongside the existing lookup:
# Canonical ordering: word1 < word2 (alphabetical)
assoc_graph: dict[tuple[str, str], float]
Building the graph:
1. Load SWOW via load_swow() from phonolex_data.loaders — for each (cue, target, strength), canonicalize the key and insert with max if key exists
2. Load USF via load_free_association() — for each (cue, target, usf_forward), merge with max against existing entries
Result: one graph with the strongest known association for every word pair.
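The two-step merge can be sketched as follows (assuming each loader yields `(cue, target, strength)` tuples; the exact loader output shape is an assumption, and `build_assoc_graph_from` is a hypothetical name):

```python
def build_assoc_graph_from(*edge_sources) -> dict[tuple[str, str], float]:
    # Canonical key (alphabetical ordering) makes the graph symmetric.
    graph: dict[tuple[str, str], float] = {}
    for edges in edge_sources:
        for cue, target, strength in edges:
            key = (min(cue, target), max(cue, target))
            # Max-merge: the strongest known association for the pair wins.
            if strength > graph.get(key, 0.0):
                graph[key] = strength
    return graph
```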
Lookup function:
def assoc_strength(graph, a, b):
    key = (min(a, b), max(a, b))
    return graph.get(key, 0.0)
Collapsing both orderings into a single canonical key means association is treated as symmetric for field membership, even though the underlying data is directional.
Server Integration¶
model.py already loads the lookup at startup. Add a parallel load of the association graph:
_assoc_graph: dict[tuple[str, str], float] | None = None

def load_model():
    global _assoc_graph  # module-level singleton, set once at startup
    ...
    _assoc_graph = build_assoc_graph()  # SWOW + USF merged
A new get_assoc_graph() accessor, parallel to get_lookup().
Build Kwargs¶
build_governor() passes assoc_graph as a new kwarg when building constraints:
kwargs = dict(lookup=lookup, vocab_size=vocab_size, assoc_graph=assoc_graph)
ThematicConstraint.build() reads kwargs["assoc_graph"]. All other constraints ignore it.
Kwarg threading: assoc_graph flows through the full chain: governor_cache.get_processor(assoc_graph=) → build_logits_processor(assoc_graph=) → build_governor(assoc_graph=) → Governor.from_constraints(**kwargs) → ThematicConstraint.build(**kwargs). Each function in the chain accepts assoc_graph=None as an optional parameter and includes it in kwargs when present. The graph is a server-lifetime singleton (never changes), so it does not affect the GovernorCache hash — the cache key remains constraint-list-only.
Memory estimate: ~500K SWOW edges + ~72K USF edges → ~550K canonical pairs after deduplication. Each entry is (str, str) → float with Python dict overhead. Estimated ~80–120MB. The server already runs T5Gemma 9B (bfloat16, ~18GB on MPS). The graph fits comfortably in the remaining CPU memory.
At Build Time¶
ThematicConstraint.build():
1. For each token in the lookup, get the token's word (from entry.get("word") — stripped, lowercased)
2. Compute field_score = max(assoc_strength(graph, word, seed) for seed in seed_words)
3. If field_score >= threshold, set weights[token_id] = field_score
4. Return LogitBoost(weights, scale=strength)
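The four steps above can be sketched with NumPy weights (the `{token_id: {"word": ...}}` lookup shape follows the spec; `score_vocabulary` is a hypothetical helper, and the final `LogitBoost(weights, scale=strength)` construction is omitted to keep the sketch self-contained):

```python
import numpy as np

def score_vocabulary(lookup: dict[int, dict],
                     graph: dict[tuple[str, str], float],
                     seed_words: list[str], vocab_size: int,
                     threshold: float = 0.02) -> np.ndarray:
    weights = np.zeros(vocab_size, dtype=np.float32)
    for token_id, entry in lookup.items():
        word = (entry.get("word") or "").strip().lower()
        if not word:
            continue
        # Max over exemplars; subwords absent from the graph score 0.0.
        score = max(
            (graph.get((min(word, s), max(word, s)), 0.0) for s in seed_words),
            default=0.0,
        )
        if score >= threshold:
            weights[token_id] = score
    return weights
```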
Edge Cases¶
Seed words not in graph: If a seed word has zero edges in the association graph (very rare word, proper noun, etc.), it contributes nothing to the field. The constraint still works from the other seeds. If all seeds are absent, every token gets weight 0 — effectively a no-op. The server should emit a warning in the generation response.
Warning mechanism: ThematicConstraint.build() returns the LogitBoost along with a count of non-zero weights. If zero, build_governor() collects the warning message. The route handler includes it in the response via a new warnings: list[str] | None = None field on AssistantResponse and SingleGenerationResponse. The frontend checks response.assistant.warnings and emits system messages for each.
Subword tokens: BPE subword fragments (e.g., "##ing", "▁un") have word values in the lookup but are absent from the association graph. They naturally receive zero field scores and are excluded — this is the intended filtering behavior. No special subword handling is needed.
Tokens not in lookup: Same as every other constraint — no word mapping, no boost, pass through.
Threshold at 0.0: All words with any association to any seed are included. May produce very broad boosting — allowed but not recommended.
Empty seed_words: Validation error at the schema level. At least one seed word required.
Schema and API¶
Engine constraint — new file packages/governors/src/diffusion_governors/thematic.py:
class ThematicConstraint(Constraint):
    def __init__(self, seed_words: list[str], strength: float = 1.5, threshold: float = 0.02):
        self.seed_words = seed_words
        self.strength = strength
        self.threshold = threshold

    @property
    def mechanism_kind(self) -> str:
        return "boost"

    def build(self, **kwargs) -> Mechanism:
        assoc_graph = kwargs.get("assoc_graph")
        if assoc_graph is None:
            raise ValueError("ThematicConstraint requires assoc_graph in build kwargs")
        # ... score tokens, build LogitBoost
Dashboard schema in schemas.py:
class ThematicConstraint(BaseModel):
    type: Literal["thematic"] = "thematic"
    seed_words: list[str]
    strength: float = 1.5
    threshold: float = 0.02

    @field_validator("seed_words")
    @classmethod
    def validate_seed_words(cls, v):
        if not v:
            raise ValueError("At least one seed word is required")
        return v
Constraint union: Add ThematicConstraint to the discriminated union.
_to_dg_constraint in governor.py:
elif isinstance(c, ThematicConstraint):
    return DGThematic(seed_words=c.seed_words, strength=c.strength, threshold=c.threshold)
Frontend¶
Command Syntax¶
- `/theme <word>... [strength]` — set theme with optional strength override
- `/theme dog cat bird` → strength 1.5 (default)
- `/theme dog cat bird 3.0` → strength 3.0
- `/remove theme` → remove
Parser: Detect whether the last argument is a number (strength override) or another seed word. If it parses as a float and there is at least 1 preceding arg, treat as strength. Otherwise treat all args as seeds. This means /theme dog 3.0 sets seed=["dog"] with strength=3.0, while /theme dog uses default strength.
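The last-argument heuristic can be sketched in Python, even though the real implementation lives in the TypeScript parser (`parse_theme` here is illustrative only):

```python
def parse_theme(args: list[str]) -> tuple[list[str], float]:
    # Default strength matches the constraint default.
    seeds, strength = args, 1.5
    if len(args) >= 2:
        try:
            # Last arg parses as a number and at least one seed precedes it:
            # treat it as a strength override.
            strength = float(args[-1])
            seeds = args[:-1]
        except ValueError:
            pass
    return seeds, strength
```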
threshold is not exposed in the command syntax — the default 0.02 is always used. Adjustable via direct API only.
StoreEntry¶
One entry per /theme command — the seeds are a group, not individual entries:
| { type: "theme"; seedWords: string[]; strength: number }
Compiler¶
const themes = entries.filter((e) => e.type === "theme");
for (const t of themes) {
  result.push({ type: "thematic", seed_words: t.seedWords, strength: t.strength });
}
ConstraintBar¶
- Chip label: `"dog, cat, bird"` (seed words joined)
- Chip category: `"theme"` — new color
- matchFields: `undefined` — remove by type only (triggers the existing "remove all of type" path in the store's `remove` method)
Registry¶
Add "theme" to VERBS and add help text entry.
Future Extensions (noted, not built)¶
- Prototype mode: Mean aggregation across exemplars instead of max. Tighter fields. Same scoring pipeline, different aggregation function. Could be a parameter: `mode: "exemplar" | "prototype"`.
- Autocomplete for seed words: Reuse the CommandAutocomplete component infrastructure. The known word set would be words appearing in SWOW/USF, exposed via a `/api/theme-vocab` endpoint or bundled at startup. Same pattern as the IPA keyboard but with vocabulary instead of phonemes.
- Multi-hop expansion: Follow associates of associates with decayed strength. Reaches more words but introduces computed scores. Deferred pending coherence assessment.
- Coverage mode: Track thematic token rate across the session, boost proportionally to the gap. The `_CoverageMechanism` infrastructure exists, but thematic coverage needs a clear clinical use case before adding complexity.
Files Affected¶
Engine (packages/governors/src/diffusion_governors/)¶
- Create: `thematic.py` — `ThematicConstraint` class, `assoc_strength()` helper, graph-to-boost scoring
- Modify: `__init__.py` — add `ThematicConstraint` to imports and `__all__`
Dashboard server (packages/dashboard/server/)¶
- Modify: `model.py` — load association graph at startup, add `get_assoc_graph()` accessor
- Modify: `schemas.py` — add `ThematicConstraint` schema, update `Constraint` union
- Modify: `governor.py` — add `_to_dg_constraint` case for ThematicConstraint; pass `assoc_graph` to `build_governor()` kwargs
- Modify: `routes/generate.py` — pass `assoc_graph` to `governor_cache.get_processor()`
Dashboard frontend (packages/dashboard/frontend/src/)¶
- Modify: `types.ts` — add `ThematicConstraint` interface, `StoreEntry` variant, update unions
- Modify: `commands/parser.ts` — add `parseTheme` function
- Modify: `commands/registry.ts` — add `"theme"` to VERBS
- Modify: `commands/compiler.ts` — add theme compilation
- Modify: `components/ConstraintBar/index.tsx` — add theme chip handling
- Modify: `store/constraintStore.ts` — minor: handle new entry type (if needed)
Tests¶
- Engine: test field scoring (single seed, multi seed, threshold, no-match seeds, empty graph)
- Server: test schema validation, _to_dg_constraint mapping
- Frontend: parser tests for `/theme`, compiler tests
New data dependency¶
- `build_assoc_graph()` utility function (in `model.py` or a separate `associations.py` module in the server)
- Loads from `phonolex_data.loaders.load_swow()` and `phonolex_data.loaders.load_free_association()`
- Loaded once at server startup, stored as a module-level singleton
What This Does NOT Change¶
- Existing constraints — untouched
- Lookup format — no changes to `build_lookup.py` or the lookup JSON
- Web app association routes — untouched (D1-based, separate system)
- Governor composition order — gates → boosts → projections, unchanged
- Coverage tracking — ThematicConstraint is static-only, does not use CoverageTracker