Runtime Word-Data Layer — Design (PHON-93, rescoped)

Date: 2026-05-05
Status: Spec (pending user review)
Branch: feature/phon-93-trie-centric-rebuild (off release/v5.2.0)
Tickets: PHON-93 (rescoped; supersedes the trie-centric editor scope. Editor + CFG enumerator move to a sibling ticket created at writing-plans handoff per feedback_verify_jira_state.md)

Problem

PhonoLex's runtime word data is currently fragmented across three independently-maintained paths to the same source-of-truth:

Workers API (production, phonolex.com) — queries Cloudflare D1 over HTTP
v6 generation server — reads four build-time JSON dumps (norms_dump.json 243 MB, vocab_dump.json 0.5 MB, phoneme_rates.json 1 KB, assoc_graph.json)
PHON-64 spike (and any future Python research) — opens the miniflare D1 SQLite directly, hand-writes a SELECT, hand-defines a LexEntry dataclass with ~22 cherry-picked columns

Each path ships its own subset of schema knowledge. packages/generation/scripts/build_runtime_data.py maintains a PROPERTY_COLUMNS list as a manual subset of packages/web/workers/scripts/config.py's PropertyDef records — the source-of-truth lives in config.py but consumers don't see it directly. PHON-88's freq-band columns landed in D1 weeks ago but haven't propagated to build_runtime_data.py. This is a sync-drift bug already in production.

Recent data work (PHON-72/73/76/81/82/83/84/85/86/87/88) added ~98 fields to D1 across a 4-table split (words, word_properties, word_percentiles, word_freq_bands). The fragmented runtime contract is collapsing under that growth — and won't survive the further additions PHON-93's downstream tickets require (DEP corpus reannotation, selectional preferences).

Goal: unify the runtime word-data layer behind a single columnar artifact and a single in-process store. Make the source-of-truth canonical, machine-derivable, and schema-driven. Eliminate the three independent paths.

Scope

In:
- phonolex_data.runtime: new submodule containing schema, emit_parquet, emit_d1_sql, WordStore
- data/runtime/words.parquet + data/runtime/edges.parquet: LFS-tracked canonical artifacts
- selectional.parquet schema definition (population deferred; corpus DEP reannotation is sibling ticket B)
- scripts/d1-seed.sql exits LFS, derived from Parquet at CI time
- v6 generation server migrates off the four JSON dumps onto WordStore
- phonolex_governors.generation.trie.VocabTrie swaps the Python dict-trie for a marisa-trie-backed implementation, API preserved
- Schema-as-code: PropertyDef records drive Parquet schema + (optional) Pydantic row model

Out (sibling tickets):
- C: Editor + CFG enumerator (mlm_iterative_editor, argstruc_enumerator, joint_mask_pll, per-request prefix structures). PHON-93 ends at WordStore.subset(expr).get_column("word").to_list().
- B: Corpus DEP reannotation (extends PHON-72 with spaCy DEP labels; populates selectional.parquet)
- D: Frontend word-data cards (deferred per user direction)
- E: Workers API rewire / D1 replacement (D1 stays as Workers' serving cache, regenerable)
- v7 production integration (downstream)

Methodology principle

Source-of-truth migrates from D1 to Parquet. D1 stays as the Workers-API serving cache — it's a shape that fits Cloudflare's runtime, not a canonical store. Per CI, scripts/d1-seed.sql derives from data/runtime/*.parquet and is regenerable on demand; LFS shrinks from 274 MB (current d1-seed.sql) to ~50 MB (Parquet artifacts).

Datasets ship as columnar. Polars + Parquet is the standard 2026 shape for ML-adjacent reference data (HF Datasets / FineWeb-Edu / ConceptNet modern releases). Dict-of-dataclasses is the legacy Python shape; it doesn't compose with the rest of the ecosystem (vector ops, expression predicates, lazy eval, schema enforcement, compression).

Schema-as-code. config.py's PropertyDef records are the schema authority. Both Parquet schema and (where wanted) a Pydantic row model are generated from those records. There is no PROPERTY_COLUMNS list to maintain.
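As a rough illustration of the schema-as-code idea, the sketch below derives a column schema from a list of property records. It is a minimal sketch under stated assumptions: the PropertyDef shape shown here is a hypothetical stand-in (the real record in config.py is not reproduced in this spec), the kind vocabulary and the dtype-name mapping are invented for the example, and dtypes are carried as names rather than Polars objects so the block runs on the standard library alone.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the PropertyDef records in config.py; the real
# record's field names and types are not shown in this spec.
@dataclass(frozen=True)
class PropertyDef:
    column: str
    kind: str            # "int", "float", or "str" (assumed vocabulary)
    nullable: bool = True


# Assumed mapping from PropertyDef kinds to Polars dtype names.
DTYPE_NAMES = {"int": "Int32", "float": "Float32", "str": "String"}


def derive_schema(defs):
    """Return {column: dtype_name}, rejecting duplicate or unknown entries."""
    schema = {"word": "String"}  # key column, assumed present in every row
    for d in defs:
        if d.column in schema:
            raise ValueError(f"duplicate column: {d.column}")
        if d.kind not in DTYPE_NAMES:
            raise ValueError(f"unknown kind {d.kind!r} for {d.column}")
        schema[d.column] = DTYPE_NAMES[d.kind]
    return schema


# Illustrative records only; not the real property list.
PROPERTY_DEFS = [
    PropertyDef("frequency", "float"),
    PropertyDef("syllable_count", "int", nullable=False),
]
SCHEMA = derive_schema(PROPERTY_DEFS)
```

Because the schema is computed rather than listed, adding a record to the property list is the only step needed for a new column to appear downstream.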

Quality-filtered subsets are queries, not schema. The "canon" subset (words with coverage across N+ datasets) is a Polars filter expression over words.parquet, not a separate artifact and not a hardcoded count. Recent data work (PHON-72/73/76/81/82/83/84/85/86/87/88) has expanded coverage, so any historical figure (e.g., the "~44K canon words" referenced in older docs) is stale and the count is larger going forward. CLAUDE.md and other docs that quote a fixed count should be updated when this lands.

Architectural inversions from the trie-centric founding memo

The trie-centric findings memo (packages/generation/research/2026-05-05-trie-centric-rebuild/findings-and-scope.md) framed PHON-93's central abstraction as a tagged trie: every constraint dimension (spec compliance, selectional preferences, per-position locks) becomes a tag dimension on a single per-request trie. PHON-93's rescope inverts that framing: the unifying abstraction shifts from tagged trie to Polars expressions over Parquet. The Jira ticket's three named revisions need updating:

| Memo's revision | PHON-93 rescope |
| --- | --- |
| #1: PHON-72's full 1.06M-doc sample is the corpus | Unchanged. Holds. |
| #2: Selectional preferences become per-slot trie tagging, NOT a separate D1 table | Reverted. Selectional preferences become a separate Parquet table (`selectional.parquet`) joined per request via Polars. PHON-92's original "table" framing was structurally closer to right. |
| #3: PPL as tiebreak becomes joint-mask MLM-PLL as the editor's optimization target | Unchanged, but moves to sibling ticket C (editor scope). |

The trie_tagger module the memo proposed (~150 LOC, NEW) is retired from PHON-93's scope. Constraint composition happens via Polars boolean algebra on Parquet columns, not via tag-dimension composition on a trie.

Mapping: each role the founding memo gave the trie

| Founding-memo role | New home |
| --- | --- |
| 1. Lexicon hydration | `WordStore.from_parquet()` at startup |
| 2. Spec compliance (per-spec tag) | `WordStore.subset(spec_expr)`: Polars filter expression |
| 3. Selectional preferences (per-slot tag) | `selectional.parquet` joined per request; columns expose count + PPMI |
| 4. Per-position locking | Editor-internal small per-position word/token-id sets (ticket C) |
| 5. CFG enumeration walk | `argstruc_enumerator` queries `WordStore.subset(slot_expr)` for typed terminals (ticket C) |
| 6. MLM editor logit-intersection | Editor's per-request prefix structure (boolean dict-trie) (ticket C) |
| 7. Coherence ranking | Joint-mask MLM-PLL, editor-internal (ticket C) |
| 8. Diversity (anti-rep, temperature) | Editor-internal (ticket C) |

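PHON-93's surface covers roles 1-3 directly (role 3 as schema only until ticket B populates selectional.parquet) and feeds roles 4-6 by exposing query primitives. Roles 7-8 are editor-internal and live entirely in ticket C.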

Binary vs continuous, resolved

PHON-92's memo §3 made the substantive decision: continuous PPMI per (verb, role, filler) triple, with an enumeration-time admission threshold at PMI ≥ 0 (mathematically calibrated: positive log-ratio = above chance) and a coverage gate (c(v, r, *) ≥ 50 to trust zero-PPMI rejections). PISA-vector continuous score was rejected partly because it requires per-verb learned thresholds; PMI's threshold is structural, not learned.

The trie-centric memo's §5.1 left this as an "open question" only because it wasn't sure how to translate continuous PMI into trie tag dimensions. The columnar pivot dissolves the question:

selectional.parquet
   columns: verb, role, filler, count_v_r_f, count_v_r_star, ppmi
   storage: continuous PPMI per (V, role, F) triple — PHON-92's decision intact

Consumer-side flexibility surfaces all three thresholding strategies PHON-92 named, plus direct use of the continuous score:

| Query | Polars expression |
| --- | --- |
| Default admission (PMI ≥ 0) | `filter(pl.col("ppmi") > 0)` |
| Tighter threshold | `filter(pl.col("ppmi") > 1.0)` |
| Coverage-gated rejection | `filter((pl.col("ppmi") > 0) \| (pl.col("count_v_r_star") >= 50))` |
| Continuous bias term | read `pl.col("ppmi")` directly, add α·ppmi to LM logit |
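To make the row-level policy concrete, here is one reading of the coverage gate and the continuous bias term in plain Python rather than Polars. It is a sketch under assumptions, not PHON-92's implementation: it reads the gate as "a zero-PPMI rejection is trusted only when c(v, r, *) ≥ 50, so thinly covered pairs are admitted by default", and the function names and the α value are illustrative.

```python
MIN_COVERAGE = 50  # c(v, r, *) needed before a zero-PPMI rejection is trusted


def is_admitted(ppmi: float, count_v_r_star: int) -> bool:
    """Admit on positive PPMI, or when coverage is too thin to trust a rejection."""
    if ppmi > 0:
        return True
    # Zero PPMI: reject only if the (verb, role) pair is well covered.
    return count_v_r_star < MIN_COVERAGE


def biased_logit(lm_logit: float, ppmi: float, alpha: float = 0.5) -> float:
    """Continuous alternative: add alpha * ppmi to the LM logit (alpha is assumed)."""
    return lm_logit + alpha * ppmi
```

Keeping the threshold and the bias as separate consumers of the same stored value is what lets the storage format stay continuous.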

The trie itself stays bounded to word-membership (boolean) — but only because the continuous data is preserved upstream in selectional.parquet. A boolean trie without selectional.parquet would discard PHON-92's signal at storage and would still be wrong; the columnar pivot is what makes boolean acceptable. This distinction matters for downstream readers.

PHON-93 ships selectional.parquet's schema; population (DEP corpus reannotation) is sibling ticket B.

Architecture

data/{norms,cmu,mappings}
        │
        ▼
build_words()          [unchanged]
        │
        ▼
emit_parquet()         [NEW — phonolex_data.runtime]
        │
        ▼
data/runtime/words.parquet  (LFS, canonical, ~30-60 MB)
data/runtime/edges.parquet  (LFS, canonical, ~5-10 MB)
data/runtime/selectional.parquet  (LFS, schema only — populated by ticket B)
        │
   ┌────┴────────────────────────────┐
   ▼                                 ▼
emit_d1_sql()              WordStore (Polars)
   │                                 │
   ▼                                 ▼
scripts/d1-seed.sql        editor + CFG enumerator (ticket C)
 (CI artifact, not LFS)    v6 generation server (migrates here)
   │                                 │
   ▼                                 ▼
wrangler d1 execute        per-request structures built from
   │                       WordStore.subset(...) word lists
   ▼
Cloudflare D1
(Workers serving cache,
 unchanged at request time)

Components

All in phonolex_data.runtime:

| Module | Role |
| --- | --- |
| `schema` | Codegen from `config.py` `PropertyDef` records → Polars schema dict + optional Pydantic row model. Single source of schema truth. |
| `emit_parquet` | Pipeline records → `words.parquet` + `edges.parquet`. Called at data-build time. |
| `emit_d1_sql` | Parquet → D1 DDL + INSERT statements. CI build step, output not LFS-tracked. |
| `WordStore` | Polars `pl.DataFrame` wrapper exposing the locked query set + `dict[str, int]` word→row_idx for O(1) `get`. |

Components elsewhere:

| Module | Change |
| --- | --- |
| `phonolex_governors.generation.trie.VocabTrie` | API preserved; underlying storage swaps Python dict-trie → marisa-trie + parallel `dict[node_id, (banned, total)]` for per-request tag counts. See §B below. |
| `packages/generation/scripts/build_runtime_data.py` | Deleted. Four JSON dumps go away. |
| `packages/generation/server/word_norms.py` | Migrates to read from `WordStore` (loaded from Parquet at server cold-start). |
| PHON-64 spike's `lexicon.py` | Deleted. Was a hand-rolled D1-SQLite reader; replaced by `WordStore`. |
| `packages/web/workers/scripts/export-to-d1.py` | Refactored: emit Parquet first; SQL emission moves to `emit_d1_sql` called from CI on the Parquet artifact. |

Three deletions. The fragmented runtime contract collapses to one canonical artifact + one runtime store.

`WordStore` API

from __future__ import annotations

from pathlib import Path

import polars as pl


class WordStore:
    @classmethod
    def from_parquet(cls, path: Path) -> "WordStore": ...

    def get(self, word: str) -> dict | None: ...          # O(1) via word→row_idx hash
    def subset(self, expr: pl.Expr) -> pl.DataFrame: ...  # Polars filter
    def prefix(self, prefix: str) -> list[str]: ...       # words starting with prefix
    def iterate_typed(self, expr: pl.Expr) -> list[str]: ... # word list for CFG slot
    def is_admitted(self, word: str, expr: pl.Expr) -> bool: ...  # row lookup + expr eval

    @property
    def df(self) -> pl.DataFrame: ...                      # escape hatch for ad-hoc queries

The five queries match the consumer set locked during brainstorm (editor + CFG enumerator + v6 server). df is the escape hatch for one-off needs the API doesn't anticipate.

`selectional.parquet` schema (defined here, populated by ticket B)

| column | type | notes |
| --- | --- | --- |
| `verb` | str | head verb (lemmatized) |
| `role` | str | dependency relation (`nsubj`, `dobj`, `iobj`, `obl`, …) |
| `filler` | str | argument lemma |
| `count_v_r_f` | u32 | observed `c(v, r, f)` |
| `count_v_r_star` | u32 | `c(v, r, *)` for the coverage gate |
| `ppmi` | f32 | per-role PPMI with add-α smoothing (α = 0.01), sparsity floor min_count=5 |

PHON-92's formula:

P̂(f | v, r) = (c(v, r, f) + α) / (c(v, r, *) + α · |F_r|)
P̂(f | r)    = (c(*, r, f) + α) / (c(*, r, *) + α · |F_r|)
PPMI(v, r, f) = max(0, log₂( P̂(f | v, r) / P̂(f | r) ))

Triples with c(v, r, f) < 5 are not stored (sparsity floor — evidence-of-absence).
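The formula and the sparsity floor can be exercised on a toy corpus. The sketch below is illustrative only: the function name is hypothetical, and |F_r| is taken to be the number of distinct fillers observed for role r, which this spec does not state explicitly.

```python
import math
from collections import Counter


def ppmi_table(triples, alpha=0.01, min_count=5):
    """Map (verb, role, filler) -> PPMI for triples at or above the sparsity floor."""
    c_vrf = Counter(triples)
    c_vr = Counter()   # c(v, r, *)
    c_rf = Counter()   # c(*, r, f)
    c_r = Counter()    # c(*, r, *)
    fillers = {}       # role -> distinct fillers, used as |F_r| (assumption)
    for (v, r, f), n in c_vrf.items():
        c_vr[v, r] += n
        c_rf[r, f] += n
        c_r[r] += n
        fillers.setdefault(r, set()).add(f)

    rows = {}
    for (v, r, f), n in c_vrf.items():
        if n < min_count:
            continue  # sparsity floor: triple is not stored
        size = len(fillers[r])
        p_f_given_vr = (n + alpha) / (c_vr[v, r] + alpha * size)
        p_f_given_r = (c_rf[r, f] + alpha) / (c_r[r] + alpha * size)
        rows[v, r, f] = max(0.0, math.log2(p_f_given_vr / p_f_given_r))
    return rows


# Toy observations, not real corpus counts.
observations = (
    [("eat", "dobj", "apple")] * 8
    + [("eat", "dobj", "idea")] * 2
    + [("think", "dobj", "idea")] * 8
    + [("think", "dobj", "apple")] * 6
)
table = ppmi_table(observations)
```

In this toy corpus the (eat, dobj, idea) triple falls under the floor and is dropped, while (think, dobj, apple) is stored with a PPMI of zero because its ratio is below chance.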

§B — `VocabTrie` swap to marisa-trie

The basic Python dict-trie at packages/governors/src/phonolex_governors/generation/trie.py swaps to a marisa-trie-backed implementation.

Why included in PHON-93:
- Static global trie at v6 startup (built once from 126K words, lifetime of the process) is exactly marisa-trie's strength: 50-100× memory reduction, sub-µs query, sub-ms build at this scale.
- Lands cleanly alongside the data-layer work: both are runtime-infrastructure cleanups; low coupling between them but they share the same surface area.
- Per-request small structures (500-2K word spec lexicons) stay as Python dict-tries. marisa-trie is wasted at that grain because rebuilding 100×/day on small lists makes the static-rebuild cost dominate.

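Wrinkle: marisa-trie is static (immutable after build). Current tag(banned_words) mutates per-request banned_below / total_below counts on every node. Solution: marisa-trie + parallel dict[node_id, (banned, total)] updated on tag(). The trie stays read-only; the parallel dict stores per-request counts; dead_end_ratio() resolves the node in marisa-trie, then reads its counts from the dict. One point to confirm at impl: the Python marisa-trie wrapper assigns integer IDs to stored keys, so prefix nodes that are not themselves words may need a different dict key (for example, the prefix string). API contract preserved end-to-end.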
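The split between an immutable word structure and mutable per-request counts can be sketched without marisa-trie at all. In the block below a frozenset stands in for the static trie, prefix strings stand in for node identifiers, and dead_end_ratio is assumed to mean banned / total below a prefix; the class name and these semantics are illustrative assumptions, not the existing VocabTrie contract.

```python
from collections import defaultdict


class TaggedStaticTrie:
    """Immutable word set plus per-request counts kept outside it."""

    def __init__(self, words):
        self._words = frozenset(words)   # stand-in for the static marisa-trie
        self._total = defaultdict(int)   # prefix -> number of words at or below it
        for w in self._words:
            for i in range(len(w) + 1):
                self._total[w[:i]] += 1
        self._banned = {}                # per-request: prefix -> banned words below it

    def tag(self, banned_words):
        """Rebuild banned counts for one request; the word set is untouched."""
        counts = defaultdict(int)
        for w in set(banned_words) & self._words:
            for i in range(len(w) + 1):
                counts[w[:i]] += 1
        self._banned = counts

    def dead_end_ratio(self, prefix):
        total = self._total.get(prefix, 0)
        if total == 0:
            return 1.0  # nothing below this prefix (assumed convention)
        return self._banned.get(prefix, 0) / total


trie = TaggedStaticTrie(["cat", "cab", "dog"])
trie.tag(["cat"])
```

Because tag() replaces the counts dict wholesale, a new request never sees the previous request's bans, and the static structure can be shared across requests.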

License: marisa-trie wrapper is MIT; C++ core is dual-licensed BSD-2 / LGPL-2.1. Pick BSD-2 at install time to avoid LGPL contamination of the proprietary build. Documented option in the wrapper. Consistent with the data-license audit's "permissive only" standard.

Data flow

Build-time / CI:
1. data/{norms,cmu,mappings} → phonolex_data.pipeline.build_words() → records (unchanged)
2. records → phonolex_data.runtime.emit_parquet() → data/runtime/{words,edges}.parquet (LFS-tracked)
3. data/runtime/{words,edges}.parquet → phonolex_data.runtime.emit_d1_sql() → scripts/d1-seed.sql (CI artifact, not LFS)
4. wrangler d1 execute phonolex --file scripts/d1-seed.sql → Cloudflare D1

CI trigger: data/{norms,cmu,mappings}/** change → CI rebuilds Parquet → derives d1-seed.sql → seeds D1. Existing dorny/paths-filter pattern in .github/workflows/deploy.yml extends naturally.
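The rows → SQL derivation step can be sketched with the standard library alone. This is a minimal sketch, not the emit_d1_sql implementation: emit_sql, quote, and SQL_TYPES are hypothetical names, the rows are plain dicts rather than a Parquet read, and an in-memory SQLite database stands in for D1 as a quick ingestibility check.

```python
import sqlite3

# Assumed mapping from schema dtype names to SQLite column types.
SQL_TYPES = {"String": "TEXT", "Int32": "INTEGER", "Float32": "REAL"}


def quote(value):
    """Render one Python value as a SQL literal."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return repr(value)
    return "'" + str(value).replace("'", "''") + "'"


def emit_sql(table, schema, rows):
    """Yield a CREATE TABLE statement followed by one INSERT per row."""
    cols = list(schema)
    col_defs = ", ".join(f'"{c}" {SQL_TYPES[schema[c]]}' for c in cols)
    yield f'CREATE TABLE "{table}" ({col_defs});'
    col_list = ", ".join(f'"{c}"' for c in cols)
    for row in rows:
        values = ", ".join(quote(row.get(c)) for c in cols)
        yield f'INSERT INTO "{table}" ({col_list}) VALUES ({values});'


# Toy schema and rows for illustration.
schema = {"word": "String", "frequency": "Float32"}
rows = [{"word": "o'clock", "frequency": 3.5}, {"word": "cat", "frequency": None}]
script = "\n".join(emit_sql("words", schema, rows))

# Round-trip through SQLite to confirm the script is ingestible.
conn = sqlite3.connect(":memory:")
conn.executescript(script)
loaded = conn.execute("SELECT word, frequency FROM words ORDER BY word").fetchall()
```

Loading the generated script into a throwaway SQLite database is also a cheap way to run the row-count and value-parity asserts before the seed reaches wrangler.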

Runtime (generation server cold-start):
1. Load data/runtime/{words,edges}.parquet (baked into Docker image at build)
2. WordStore.from_parquet() → pl.DataFrame + word→row_idx hash
3. Marisa-trie built from df["word"].to_list() for v6 Reranker's global-trie role

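Per-request (editor + CFG, ticket C; shape only):
1. Compose Polars expression from constraint set
2. WordStore.subset(expr).get_column("word").to_list() → filtered word list
3. Build per-request boolean dict-trie from word list
4. (Once selectional.parquet is populated) restrict selectional to the slot's verb and role, then WordStore.subset(spec_expr).join(slot_rows, left_on="word", right_on="filler").filter(pl.col("ppmi") > 0) → admitted list. Here slot_rows is a placeholder name, and the join key is assumed to be word = filler, since verb and filler are selectional.parquet columns.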
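The per-request boolean dict-trie built from a word list can be as small as the sketch below. It is illustrative only: it indexes by character, whereas the editor in ticket C may index by token id, and the function names and the empty-string end marker are assumptions.

```python
END = ""  # end-of-word marker; cannot collide with a single-character key


def build_trie(words):
    """Nest one dict per character; END marks a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(trie, word):
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


def next_chars(trie, prefix):
    """Characters that can extend prefix toward some admitted word."""
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return set()
    return {k for k in node if k != END}


# Toy word list standing in for a WordStore.subset(...) result.
request_trie = build_trie(["cat", "cab", "car", "dog"])
```

A structure like this is cheap enough to rebuild for every request, which is why only the static global trie is a candidate for marisa-trie.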

Error handling

| Scenario | Behavior |
| --- | --- |
| Parquet schema diverges from `PropertyDef` | `WordStore.from_parquet()` fails fast, names missing/extra columns |
| Parquet missing at server startup | Hard fail, server doesn't start (vs lazy-fail at query time) |
| Parquet older than `data/{norms,cmu,mappings}/*` source | CI hook rebuilds; in dev mode, log an explicit warning and continue (warn, not fail) |
| `WordStore.subset(...)` returns empty | Empty result, no error; caller decides |
| `selectional.parquet` missing or empty | Treat as "no selectional data available"; downstream consumers skip the join, fall back to spec-only filter (ticket C policy) |
| `d1-seed.sql` derivation fails in CI | Block deploy, surface schema diff in failure message |
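The fail-fast schema check in the first row might look like the following sketch. The exception class and function name are hypothetical, and the check compares column names only; a fuller version would presumably compare dtypes as well.

```python
class SchemaMismatch(RuntimeError):
    """Raised at load time when the artifact's columns differ from the expected schema."""


def check_columns(expected, actual):
    """Raise SchemaMismatch naming every missing and extra column."""
    missing = sorted(set(expected) - set(actual))
    extra = sorted(set(actual) - set(expected))
    if missing or extra:
        raise SchemaMismatch(
            f"words.parquet schema mismatch: missing={missing} extra={extra}"
        )
```

Reporting both lists in one message means a single failed start is enough to see the full drift between the PropertyDef records and the artifact.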

Testing

1. Schema codegen: PropertyDef → expected Polars schema (snapshot test against tests/runtime/test_schema_snapshot.json)
2. Parquet roundtrip: records → emit_parquet → WordStore.from_parquet → query returns expected values; tests N=100 sampled words across ~10 fields
3. D1-derivation regression: emit_d1_sql output diffs against the current hand-built d1-seed.sql shape (DDL + first 100 INSERT rows). Allow column-order differences; assert row count + value parity.
4. Marisa-trie API parity: existing VocabTrie test suite passes on the new implementation. Single pytest -k vocab_trie command should pass against both the old and new impl during the migration window.
5. v6 server migration: existing generation-server tests pass after build_runtime_data.py removal and word_norms.py rewire.
6. CI integration: data-file change triggers CI Parquet rebuild → row count + schema asserts → deploy.

Migration notes

Order of operations (single PR, but staged commits):

1. Add phonolex_data.runtime submodule (schema, emit_parquet, emit_d1_sql, WordStore) with tests. Doesn't touch any caller yet.
2. Generate data/runtime/{words,edges}.parquet from current data. Verify roundtrip + D1-derivation parity.
3. Add Parquet artifacts to LFS (git lfs track 'data/runtime/*.parquet').
4. Migrate packages/generation/server/word_norms.py to WordStore. Delete build_runtime_data.py. Tests pass.
5. Swap VocabTrie to marisa-trie + parallel-dict tag counts. Tests pass.
6. Refactor export-to-d1.py to defer SQL emission to emit_d1_sql on the Parquet artifact.
7. Update .github/workflows/*.yml for the new CI flow.
8. Last: git lfs untrack 'scripts/d1-seed.sql', remove the file from LFS history (or accept residual LFS storage for prior commits; discuss before doing the history rewrite).

Backup before LFS removal: first snapshot the current d1-seed.sql to scripts/d1-seed.sql.preflight-backup-$(date +%Y%m%d), stored outside LFS. Verify the new CI-derived seed produces a working D1 in staging before committing the untrack.

v6 server cutover — single PR migrates v6 onto WordStore and deletes the four JSON dumps. No parallel-ship phase; per feedback_no_either_or and "I'm not here for bridge solutions," the dumps are the monster being killed.

Open questions for the impl phase

Polars LazyFrame vs DataFrame at runtime — eager load is simpler and the dataset is small (~150K rows × ~98 cols ≈ ~30-60 MB compressed → ~150 MB in-memory). LazyFrame buys nothing here. Tentative: eager DataFrame. Confirm at impl.
Schema migration policy — when a new PropertyDef lands in config.py, does the Parquet rebuild auto-trigger in CI? Tentative: yes, on any change to packages/web/workers/scripts/config.py. CI hook to add.
Phoneme features parquet? — packages/features/ ships learned per-phoneme feature vectors. Currently loaded into the Worker's similarity route at cold start; the generation server doesn't use them yet. Out of scope unless a v7 editor strategy needs them.
Edges schema — single file or per-type? — current decision (§A): one edges.parquet with a type column. Edge types stay extensible without changing words. Confirm during impl.
DuckDB layer — DuckDB-on-Parquet for ad-hoc SQL analysis is a free ergonomic win (no runtime dep — DuckDB CLI / notebook usage only). Worth documenting in the runbook but not a code dep.

Risk register

R1 — Polars dependency footprint. Adds ~30 MB wheel to phonolex_data. Severity: low. Mitigation: pin a specific Polars major version; track wheel size in CI.

R2 — Marisa-trie static-immutability assumption. If a future v6 feature needs to mutate the global trie post-build (incremental word additions, online learning), the static structure breaks. Severity: low — no such feature exists or is planned. Falsification trigger: any PR adding mutable global-trie operations.

R3 — D1 seed regression during CI derivation. First runs of emit_d1_sql may produce a seed that wrangler can't ingest, or that produces a D1 with subtly different row count / schema vs the current hand-built version. Severity: medium. Mitigation: regression test (test §3) catches this pre-deploy.

R4 — LFS untracking is irreversible-ish. Untracking d1-seed.sql stops new LFS uploads, but the objects referenced by prior commits stay in LFS storage; removing them requires a history rewrite. Severity: medium. Mitigation: keep the LFS pointer for a release cycle; only do the history rewrite when staging proves stable. Backup snapshot before any rewrite.

R5 — v6 server tests don't fully cover JSON-dump → WordStore migration. Some word_norms.py paths may have implicit JSON-shape assumptions not surfaced by tests. Severity: medium. Mitigation: run the generation server end-to-end on the staging environment with a real spec before merging; the existing eval-harness-v1 provides a smoke check.

R6 — Schema codegen brittleness. Driving Polars schema from PropertyDef introspection assumes PropertyDef fields stay in sync with what Parquet expects (types, nullability). Severity: low. Mitigation: snapshot test for the codegen output; CI fails if schema drifts unexpectedly.

Decision-recommendation

GO. All architectural decisions resolved during brainstorm:

Columnar (Polars + Parquet) over dict-of-dataclasses
Parquet canonical, D1 derived in CI from Parquet
selectional.parquet schema defined here, population deferred to ticket B
v6 server migrates in same ticket (single moving target)
d1-seed.sql exits LFS as a CI artifact (with snapshot backup)
Trie editor/CFG work scoped out (sibling ticket C); per-request prefix structures stay as Python dict-tries
Marisa-trie swap for the static global VocabTrie included as §B

Estimate: 6-8 working days for v1 implementation. Milestones in the migration order above.

Branch: feature/phon-93-trie-centric-rebuild off release/v5.2.0. Sibling tickets (B, C) branch separately.

Per feedback_verify_jira_state.md: at writing-plans handoff, JQL the next free PHON-XX before reserving sibling-ticket numbers. Also: edit PHON-93's Jira description to reflect this rescope (the original "trie-centric editor rebuild" framing is superseded by "runtime word-data layer"; editor scope moves to the new sibling ticket).