Runtime Word-Data Layer: Design (PHON-93, rescoped)
Date: 2026-05-05
Status: Spec — pending user review
Branch: feature/phon-93-trie-centric-rebuild (off release/v5.2.0)
Tickets: PHON-93 (rescoped — supersedes the trie-centric editor scope; editor + CFG enumerator move to a sibling ticket created at writing-plans handoff per feedback_verify_jira_state.md)
Problem
PhonoLex's runtime word data is currently fragmented across three independently maintained paths to the same source of truth:
- Workers API (production, phonolex.com): queries Cloudflare D1 over HTTP
- v6 generation server: reads four build-time JSON dumps (norms_dump.json 243 MB, vocab_dump.json 0.5 MB, phoneme_rates.json 1 KB, assoc_graph.json)
- PHON-64 spike (and any future Python research): opens the miniflare D1 SQLite directly, hand-writes a SELECT, hand-defines a LexEntry dataclass with ~22 cherry-picked columns
Each path ships its own subset of schema knowledge. packages/generation/scripts/build_runtime_data.py maintains a PROPERTY_COLUMNS list as a manual subset of packages/web/workers/scripts/config.py's PropertyDef records — the source-of-truth lives in config.py but consumers don't see it directly. PHON-88's freq-band columns landed in D1 weeks ago but haven't propagated to build_runtime_data.py. This is a sync-drift bug already in production.
Recent data work (PHON-72/73/76/81/82/83/84/85/86/87/88) added ~98 fields to D1 across a 4-table split (words, word_properties, word_percentiles, word_freq_bands). The fragmented runtime contract is collapsing under that growth — and won't survive the further additions PHON-93's downstream tickets require (DEP corpus reannotation, selectional preferences).
Goal: unify the runtime word-data layer behind a single columnar artifact and a single in-process store. Make the source-of-truth canonical, machine-derivable, and schema-driven. Eliminate the three independent paths.
Scope
In:
- phonolex_data.runtime — new submodule containing schema, emit_parquet, emit_d1_sql, WordStore
- data/runtime/words.parquet + data/runtime/edges.parquet — LFS-tracked canonical artifacts
- selectional.parquet schema definition (population deferred — corpus DEP reannotation is sibling ticket B)
- scripts/d1-seed.sql exits LFS, derived from Parquet at CI time
- v6 generation server migrates off the four JSON dumps onto WordStore
- phonolex_governors.generation.trie.VocabTrie swaps the Python dict-trie for a marisa-trie-backed implementation, API preserved
- Schema-as-code: PropertyDef records drive Parquet schema + (optional) Pydantic row model
Out (sibling tickets):
- C — Editor + CFG enumerator (mlm_iterative_editor, argstruc_enumerator, joint_mask_pll, per-request prefix structures). PHON-93 ends at WordStore.subset(expr).get_column("word").to_list().
- B — Corpus DEP reannotation (extends PHON-72 with spaCy DEP labels; populates selectional.parquet)
- D — Frontend word-data cards (deferred per user direction)
- E — Workers API rewire / D1 replacement (D1 stays as Workers' serving cache, regenerable)
- v7 production integration (downstream)
Methodology principle
Source-of-truth migrates from D1 to Parquet. D1 stays as the Workers-API serving cache: a shape that fits Cloudflare's runtime, not a canonical store. In CI, scripts/d1-seed.sql is derived from data/runtime/*.parquet and is regenerable on demand; LFS shrinks from 274 MB (current d1-seed.sql) to ~50 MB (Parquet artifacts).
Datasets ship as columnar. Polars + Parquet is the standard 2026 shape for ML-adjacent reference data (HF Datasets / FineWeb-Edu / ConceptNet modern releases). Dict-of-dataclasses is legacy Python; it doesn't compose with the rest of the ecosystem (vector ops, expression predicates, lazy eval, schema enforcement, compression).
Schema-as-code. config.py's PropertyDef records are the schema authority. Both Parquet schema and (where wanted) a Pydantic row model are generated from those records. There is no PROPERTY_COLUMNS list to maintain.
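A minimal sketch of the codegen direction, under stated assumptions: the PropertyDef below is a simplified, hypothetical stand-in with only name, dtype, and nullable fields (the real records in config.py may carry more), and the dtype mapping and function name are illustrative. It emits Polars dtype names as strings so the derivation step can be shown without importing Polars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyDef:
    """Hypothetical, simplified stand-in for the records in config.py."""
    name: str
    dtype: str            # one of: "str", "int", "float", "bool"
    nullable: bool = True


# Assumed mapping from PropertyDef dtypes to Polars dtype names.
POLARS_DTYPE_NAMES = {"str": "String", "int": "Int32", "float": "Float32", "bool": "Boolean"}


def polars_schema_names(props: list[PropertyDef]) -> dict[str, str]:
    """Derive an ordered {column: Polars dtype name} mapping from PropertyDefs."""
    schema: dict[str, str] = {"word": "String"}  # key column, assumed always present
    for prop in props:
        if prop.name in schema:
            raise ValueError(f"duplicate column: {prop.name}")
        schema[prop.name] = POLARS_DTYPE_NAMES[prop.dtype]
    return schema


# Illustrative records only; real property names come from config.py.
PROPS = [
    PropertyDef("frequency", "float"),
    PropertyDef("syllable_count", "int", nullable=False),
]
SCHEMA = polars_schema_names(PROPS)
```

Because the column list is computed from the records, adding a PropertyDef changes the schema without any second list to edit, which is the property the paragraph above relies on.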
Quality-filtered subsets are queries, not schema. The "canon" subset (words with coverage across N+ datasets) is a Polars filter expression over words.parquet, not a separate artifact and not a hardcoded count. Recent data work (PHON-72/73/76/81/82/83/84/85/86/87/88) has expanded coverage, so any historical figure (e.g., the "~44K canon words" referenced in older docs) is stale and the count is larger going forward. CLAUDE.md and other docs that quote a fixed count should be updated when this lands.
Architectural inversions from the trie-centric founding memo
The trie-centric findings memo (packages/generation/research/2026-05-05-trie-centric-rebuild/findings-and-scope.md) framed PHON-93's central abstraction as a tagged trie: every constraint dimension (spec compliance, selectional preferences, per-position locks) becomes a tag dimension on a single per-request trie. PHON-93's rescope inverts that framing: the unifying abstraction shifts from tagged trie to Polars expressions over Parquet. The Jira ticket's three named revisions need updating:
| Memo's revision | PHON-93 rescope |
|---|---|
| #1 — PHON-72's full 1.06M-doc sample is the corpus | Unchanged. Holds. |
| #2 — Selectional preferences become per-slot trie tagging, NOT a separate D1 table | Reverted. Selectional preferences become a separate Parquet table (selectional.parquet) joined per request via Polars. PHON-92's original "table" framing was structurally closer to right. |
| #3 — PPL as tiebreak becomes joint-mask MLM-PLL as the editor's optimization target | Unchanged, but moves to sibling ticket C (editor scope). |
The trie_tagger module the memo proposed (~150 LOC, NEW) is retired from PHON-93's scope. Constraint composition happens via Polars boolean algebra on Parquet columns, not via tag-dimension composition on a trie.
Mapping: each role the founding memo gave the trie
| Founding-memo role | New home |
|---|---|
| 1. Lexicon hydration | WordStore.from_parquet() at startup |
| 2. Spec compliance (per-spec tag) | WordStore.subset(spec_expr) — Polars filter expression |
| 3. Selectional preferences (per-slot tag) | selectional.parquet joined per request; columns expose count + PPMI |
| 4. Per-position locking | Editor-internal small per-position word/token-id sets — ticket C |
| 5. CFG enumeration walk | argstruc_enumerator queries WordStore.subset(slot_expr) for typed terminals — ticket C |
| 6. MLM editor logit-intersection | Editor's per-request prefix structure (boolean dict-trie) — ticket C |
| 7. Coherence ranking | Joint-mask MLM-PLL, editor-internal — ticket C |
| 8. Diversity (anti-rep, temperature) | Editor-internal — ticket C |
PHON-93's surface covers 1-3 directly; feeds 4-6 by exposing query primitives.
Binary vs continuous, resolved
PHON-92's memo §3 made the substantive decision: continuous PPMI per (verb, role, filler) triple, with an enumeration-time admission threshold at PMI ≥ 0 (mathematically calibrated: positive log-ratio = above chance) and a coverage gate (c(v, r, *) ≥ 50 to trust zero-PPMI rejections). PISA-vector continuous score was rejected partly because it requires per-verb learned thresholds; PMI's threshold is structural, not learned.
The trie-centric memo's §5.1 left this as an "open question" only because it wasn't sure how to translate continuous PMI into trie tag dimensions. The columnar pivot dissolves the question:
selectional.parquet
columns: verb, role, filler, count_v_r_f, count_v_r_star, ppmi
storage: continuous PPMI per (V, role, F) triple — PHON-92's decision intact
Consumer-side flexibility surfaces all three thresholding strategies PHON-92 named:
| Query | Polars expression |
|---|---|
| Default admission (above chance, PMI > 0; stored PPMI cannot separate PMI = 0 from PMI < 0) | filter(pl.col("ppmi") > 0) |
| Tighter threshold | filter(pl.col("ppmi") > 1.0) |
| Coverage-gated rejection (reject zero-PPMI rows only when c(v, r, *) ≥ 50) | filter((pl.col("ppmi") > 0) \| (pl.col("count_v_r_star") < 50)) |
| Continuous bias term | read pl.col("ppmi") directly, add α·ppmi to LM logit |
The trie itself stays bounded to word-membership (boolean) — but only because the continuous data is preserved upstream in selectional.parquet. A boolean trie without selectional.parquet would discard PHON-92's signal at storage and would still be wrong; the columnar pivot is what makes boolean acceptable. This distinction matters for downstream readers.
PHON-93 ships selectional.parquet's schema; population (DEP corpus reannotation) is sibling ticket B.
Architecture

    data/{norms,cmu,mappings}
            │
            ▼
    build_words()                      [unchanged]
            │
            ▼
    emit_parquet()                     [NEW, phonolex_data.runtime]
            │
            ▼
    data/runtime/words.parquet         (LFS, canonical, ~30-60 MB)
    data/runtime/edges.parquet         (LFS, canonical, ~5-10 MB)
    data/runtime/selectional.parquet   (LFS, schema only; populated by ticket B)
            │
       ┌────┴────────────────────────────┐
       ▼                                 ▼
    emit_d1_sql()                     WordStore (Polars)
       │                                 │
       ▼                                 ▼
    scripts/d1-seed.sql               editor + CFG enumerator (ticket C)
    (CI artifact, not LFS)            v6 generation server (migrates here)
       │                                 │
       ▼                                 ▼
    wrangler d1 execute               per-request structures built from
       │                              WordStore.subset(...).to_list()
       ▼
    Cloudflare D1
    (Workers serving cache,
     unchanged at request time)

Components
All in phonolex_data.runtime:
| Module | Role |
|---|---|
| schema | Codegen from config.py PropertyDef records → Polars schema dict + optional Pydantic row model. Single source of schema truth. |
| emit_parquet | Pipeline records → words.parquet + edges.parquet. Called at data-build time. |
| emit_d1_sql | Parquet → D1 DDL + INSERT statements. CI build step, output not LFS-tracked. |
| WordStore | Polars pl.DataFrame wrapper exposing the locked query set + dict[str, int] word→row_idx for O(1) get. |
Components elsewhere:
| Module | Change |
|---|---|
| phonolex_governors.generation.trie.VocabTrie | API preserved; underlying storage swaps Python dict-trie → marisa-trie + parallel dict[node_id, (banned, total)] for per-request tag counts. See §B below. |
| packages/generation/scripts/build_runtime_data.py | Deleted. Four JSON dumps go away. |
| packages/generation/server/word_norms.py | Migrates to read from WordStore (loaded from Parquet at server cold-start). |
| PHON-64 spike's lexicon.py | Deleted. Was a hand-rolled D1-SQLite reader; replaced by WordStore. |
| packages/web/workers/scripts/export-to-d1.py | Refactored: emit Parquet first; SQL emission moves to emit_d1_sql called from CI on the Parquet artifact. |
Three deletions: build_runtime_data.py, the four JSON dumps it produced, and the PHON-64 spike's lexicon.py. The fragmented runtime contract collapses to one canonical artifact + one runtime store.
WordStore API

    from pathlib import Path

    import polars as pl


    class WordStore:
        @classmethod
        def from_parquet(cls, path: Path) -> "WordStore": ...

        def get(self, word: str) -> dict | None: ...                  # O(1) via word→row_idx hash
        def subset(self, expr: pl.Expr) -> pl.DataFrame: ...          # Polars filter
        def prefix(self, prefix: str) -> list[str]: ...               # words starting with prefix
        def iterate_typed(self, expr: pl.Expr) -> list[str]: ...      # word list for CFG slot
        def is_admitted(self, word: str, expr: pl.Expr) -> bool: ...  # row lookup + expr eval

        @property
        def df(self) -> pl.DataFrame: ...                             # escape hatch for ad-hoc queries

The five queries match the consumer set locked during brainstorm (editor + CFG enumerator + v6 server). df is the escape hatch for one-off needs the API doesn't anticipate.
selectional.parquet schema (defined here, populated by ticket B)
| column | type | notes |
|---|---|---|
| verb | str | head verb (lemmatized) |
| role | str | dependency relation (nsubj, dobj, iobj, obl, …) |
| filler | str | argument lemma |
| count_v_r_f | u32 | observed c(v, r, f) |
| count_v_r_star | u32 | c(v, r, *) for the coverage gate |
| ppmi | f32 | per-role PPMI with add-α=0.01 smoothing, sparsity floor min_count=5 |
PHON-92's formula:
P̂(f | v, r) = (c(v, r, f) + α) / (c(v, r, *) + α · |F_r|)
P̂(f | r) = (c(*, r, f) + α) / (c(*, r, *) + α · |F_r|)
PPMI(v, r, f) = max(0, log₂( P̂(f | v, r) / P̂(f | r) ))
Triples with c(v, r, f) < 5 are not stored (sparsity floor — evidence-of-absence).
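The formula can be worked through on toy counts. This is a sketch under two assumptions the text does not settle: |F_r| is taken as the number of distinct fillers observed in role r, and the marginals are computed over all triples before the sparsity floor is applied.

```python
import math
from collections import Counter

ALPHA = 0.01      # add-alpha smoothing from the formula above
MIN_COUNT = 5     # sparsity floor: rarer triples are not stored


def ppmi_rows(counts: dict[tuple[str, str, str], int]) -> list[dict]:
    """Compute smoothed per-role PPMI rows from raw (verb, role, filler) counts."""
    c_vr: Counter = Counter()   # c(v, r, *)
    c_rf: Counter = Counter()   # c(*, r, f)
    c_r: Counter = Counter()    # c(*, r, *)
    fillers_by_role: dict[str, set[str]] = {}
    for (v, r, f), n in counts.items():
        c_vr[(v, r)] += n
        c_rf[(r, f)] += n
        c_r[r] += n
        fillers_by_role.setdefault(r, set()).add(f)

    rows = []
    for (v, r, f), n in counts.items():
        if n < MIN_COUNT:
            continue  # below the sparsity floor: not stored
        size_f = len(fillers_by_role[r])  # |F_r| (assumed: distinct observed fillers)
        p_f_given_vr = (n + ALPHA) / (c_vr[(v, r)] + ALPHA * size_f)
        p_f_given_r = (c_rf[(r, f)] + ALPHA) / (c_r[r] + ALPHA * size_f)
        ppmi = max(0.0, math.log2(p_f_given_vr / p_f_given_r))
        rows.append({"verb": v, "role": r, "filler": f, "count_v_r_f": n,
                     "count_v_r_star": c_vr[(v, r)], "ppmi": ppmi})
    return rows


# Invented counts for illustration.
COUNTS = {
    ("eat", "dobj", "apple"): 30,
    ("eat", "dobj", "idea"): 5,
    ("ponder", "dobj", "idea"): 30,
    ("ponder", "dobj", "apple"): 2,
}
ROWS = {(r["verb"], r["filler"]): r for r in ppmi_rows(COUNTS)}
```

With these counts, (eat, apple) scores above chance, (eat, idea) is clipped to zero because "idea" is more typical of the role overall than of "eat", and (ponder, apple) is dropped by the floor.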
§B: VocabTrie swap to marisa-trie
The basic Python dict-trie at packages/governors/src/phonolex_governors/generation/trie.py swaps to a marisa-trie-backed implementation.
Why included in PHON-93:
- Static global trie at v6 startup (built once from 126K words, lifetime of the process) is exactly marisa-trie's strength: 50-100× memory reduction, sub-µs query, sub-ms build at this scale.
- Lands cleanly alongside the data-layer work: both are runtime-infrastructure cleanups; low coupling between them but they share the same surface area.
- Per-request small structures (500-2K word spec lexicons) stay as Python dict-tries; marisa-trie is wasted at that grain because rebuilding 100×/day on small lists makes the static-rebuild cost dominate.
Wrinkle: marisa-trie is static (immutable after build). The current tag(banned_words) mutates per-request banned_below / total_below counts on every node. Solution: marisa-trie + parallel dict[node_id, (banned, total)] updated on tag(). The parallel dict stores per-request counts; dead_end_ratio() queries marisa-trie for the node, then the dict for counts. Caveat to confirm at impl: the marisa-trie Python wrapper exposes integer IDs for stored keys, not for interior prefix nodes, so node_id may need to be the prefix string itself. API contract preserved end-to-end.
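The static-structure-plus-side-dict pattern can be sketched without the marisa-trie dependency. Everything here is an illustrative assumption: a sorted tuple stands in for the immutable trie, the side dict is keyed by prefix string rather than a node ID, the class names are hypothetical, and an unknown prefix is treated as a full dead end.

```python
from bisect import bisect_left


class StaticVocab:
    """Immutable sorted word list standing in for a marisa-trie (assumption)."""

    def __init__(self, words: list[str]) -> None:
        self._words = tuple(sorted(set(words)))

    def keys(self, prefix: str = "") -> list[str]:
        """All words starting with `prefix`, located by binary search."""
        start = bisect_left(self._words, prefix)
        out = []
        for w in self._words[start:]:
            if not w.startswith(prefix):
                break
            out.append(w)
        return out


class TaggedVocab:
    """Per-request counts live in a side dict; the static structure is never mutated."""

    def __init__(self, vocab: StaticVocab) -> None:
        self._vocab = vocab
        self._counts: dict[str, tuple[int, int]] = {}  # prefix -> (banned, total)

    def tag(self, banned_words: set[str]) -> None:
        self._counts.clear()
        for word in self._vocab.keys():
            is_banned = int(word in banned_words)
            for end in range(len(word) + 1):  # every prefix, including "" (root)
                banned, total = self._counts.get(word[:end], (0, 0))
                self._counts[word[:end]] = (banned + is_banned, total + 1)

    def dead_end_ratio(self, prefix: str) -> float:
        banned, total = self._counts.get(prefix, (0, 0))
        # Assumed convention: no words below the prefix counts as a dead end.
        return banned / total if total else 1.0


vocab = StaticVocab(["cat", "car", "cart", "dog"])
tagged = TaggedVocab(vocab)
tagged.tag({"car", "cart"})
```

This tag() walks the whole vocabulary, which is only meant to show where the mutable state lives; a real implementation would presumably update counts only along the banned words' prefixes.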
License: marisa-trie wrapper is MIT; C++ core is dual-licensed BSD-2 / LGPL-2.1. Pick BSD-2 at install time to avoid LGPL contamination of the proprietary build. Documented option in the wrapper. Consistent with the data-license audit's "permissive only" standard.
Data flow
Build-time / CI:
1. data/{norms,cmu,mappings} → phonolex_data.pipeline.build_words() → records (unchanged)
2. records → phonolex_data.runtime.emit_parquet() → data/runtime/{words,edges}.parquet (LFS-tracked)
3. data/runtime/{words,edges}.parquet → phonolex_data.runtime.emit_d1_sql() → scripts/d1-seed.sql (CI artifact, not LFS)
4. wrangler d1 execute phonolex --file scripts/d1-seed.sql → Cloudflare D1
CI trigger: data/{norms,cmu,mappings}/** change → CI rebuilds Parquet → derives d1-seed.sql → seeds D1. Existing dorny/paths-filter pattern in .github/workflows/deploy.yml extends naturally.
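Step 3 of the build-time flow can be sketched as a pure function from schema plus rows to SQL text. This is an assumption-heavy illustration: the real emit_d1_sql would read the Parquet artifact, whereas this version takes a schema dict and row dicts directly; the type mapping, the word primary key, and the table name are hypothetical. Since D1 is SQLite-based, the standard-library sqlite3 module serves as a local ingest check.

```python
import sqlite3

# Assumed mapping from Polars dtype names to SQLite column types.
SQL_TYPES = {"String": "TEXT", "Int32": "INTEGER", "UInt32": "INTEGER", "Float32": "REAL"}


def sql_literal(value) -> str:
    """Render a Python value as a SQL literal, escaping single quotes."""
    if value is None:
        return "NULL"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return repr(value)


def emit_d1_sql(table: str, schema: dict[str, str], rows: list[dict]) -> str:
    """Emit DDL plus one INSERT per row; `word` is assumed to be the primary key."""
    cols = ", ".join(
        f'"{name}" {SQL_TYPES[dtype]}' + (" PRIMARY KEY" if name == "word" else "")
        for name, dtype in schema.items()
    )
    stmts = [f'CREATE TABLE "{table}" ({cols});']
    names = ", ".join(f'"{n}"' for n in schema)
    for row in rows:
        values = ", ".join(sql_literal(row[n]) for n in schema)
        stmts.append(f'INSERT INTO "{table}" ({names}) VALUES ({values});')
    return "\n".join(stmts)


# Toy schema and rows; the second word exercises quote escaping and NULLs.
schema = {"word": "String", "frequency": "Float32"}
rows = [{"word": "cat", "frequency": 4.9}, {"word": "o'clock", "frequency": None}]
seed = emit_d1_sql("words", schema, rows)

conn = sqlite3.connect(":memory:")
conn.executescript(seed)
loaded = conn.execute("SELECT word, frequency FROM words ORDER BY word").fetchall()
conn.close()
```

Deriving the column list from the same schema mapping that drives the Parquet write is what keeps the seed from drifting away from the artifact.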
Runtime (generation server cold-start):
1. Load data/runtime/{words,edges}.parquet (baked into Docker image at build)
2. WordStore.from_parquet() → pl.DataFrame + word→row_idx hash
3. Marisa-trie built from df["word"].to_list() for v6 Reranker's global-trie role
Per-request (editor + CFG, ticket C — shape only):
1. Compose Polars expression from constraint set
2. WordStore.subset(expr) → filtered word list
3. Build per-request boolean dict-trie from word list
4. (Once selectional.parquet is populated) WordStore.subset(spec_expr).join(selectional, on=("verb","filler")).filter(pl.col("ppmi") > 0) → admitted list
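Step 3 above, building a per-request boolean dict-trie from the filtered word list, might look like the following sketch. The helper names and the end-of-word sentinel are hypothetical, and the hard-coded list stands in for the output of WordStore.subset(expr).

```python
END = "$"  # sentinel key marking a complete word (assumed not to occur in words)


def build_trie(words: list[str]) -> dict:
    """Nested-dict trie; membership only, no per-node payload."""
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(trie: dict, word: str) -> bool:
    """True if `word` was inserted as a complete word."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


def has_prefix(trie: dict, prefix: str) -> bool:
    """True if at least one inserted word starts with `prefix`."""
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return False
    return True


# Stand-in for WordStore.subset(expr).get_column("word").to_list().
admitted = ["cat", "cart", "dog"]
trie = build_trie(admitted)
```

At 500-2K words the build is a few thousand dict insertions, which fits the document's point that small per-request structures do not need a static trie.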
Error handling
| Scenario | Behavior |
|---|---|
| Parquet schema diverges from PropertyDef | WordStore.from_parquet() fails fast, names missing/extra columns |
| Parquet missing at server startup | Hard fail, server doesn't start (vs lazy-fail at query time) |
| Parquet older than data/{norms,cmu,mappings}/* source | CI hook rebuilds; in dev mode, log an explicit warning instead of failing |
| WordStore.subset(...) returns empty | Empty result, no error; caller decides |
| selectional.parquet missing or empty | Treat as "no selectional data available"; downstream consumers skip the join, fall back to spec-only filter (ticket C policy) |
| d1-seed.sql derivation fails in CI | Block deploy, surface schema diff in failure message |
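The first row of the table (fail fast, naming missing and extra columns) can be sketched as a load-time check. The exception and function names are hypothetical, and schemas are represented as plain {column: dtype name} dicts for illustration.

```python
class SchemaMismatchError(RuntimeError):
    """Raised at load time when the artifact and the expected schema disagree."""


def check_schema(expected: dict[str, str], actual: dict[str, str]) -> None:
    """Raise with a message naming every missing, extra, or retyped column."""
    missing = sorted(set(expected) - set(actual))
    extra = sorted(set(actual) - set(expected))
    wrong_type = sorted(
        name for name in set(expected) & set(actual) if expected[name] != actual[name]
    )
    if missing or extra or wrong_type:
        raise SchemaMismatchError(
            f"words.parquet schema mismatch: missing={missing} extra={extra} "
            f"wrong_type={wrong_type}"
        )


# Toy schemas: the artifact lacks one expected column and carries one unknown one.
EXPECTED = {"word": "String", "frequency": "Float32"}
ON_DISK = {"word": "String", "freq_band": "Int32"}

message = ""
try:
    check_schema(EXPECTED, ON_DISK)
except SchemaMismatchError as exc:
    message = str(exc)
```

Running the check inside from_parquet(), before any query is served, is what turns schema drift into a startup failure rather than a wrong answer at request time.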
Testing
- Schema codegen: PropertyDef → expected Polars schema (snapshot test against tests/runtime/test_schema_snapshot.json)
- Parquet roundtrip: records → emit_parquet → WordStore.from_parquet → query returns expected values; tests N=100 sampled words across ~10 fields
- D1-derivation regression: emit_d1_sql output diffs against the current hand-built d1-seed.sql shape (DDL + first 100 INSERT rows). Allow column-order differences; assert row count + value parity.
- Marisa-trie API parity: existing VocabTrie test suite passes on the new implementation. A single pytest -k vocab_trie command should pass against both the old and new impl during the migration window.
- v6 server migration: existing generation-server tests pass after build_runtime_data.py removal and word_norms.py rewire.
- CI integration: data-file change triggers CI Parquet rebuild → row count + schema asserts → deploy.
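The D1-derivation regression check (column-order differences allowed, row count and values must match) could be expressed roughly as follows. The helper names are hypothetical and the rows are toy data; a real test would load both sides from the old seed and the derived seed.

```python
def normalize(columns: list[str], rows: list[tuple]) -> list[tuple]:
    """Reorder each row's values by sorted column name, then sort the rows."""
    order = sorted(range(len(columns)), key=lambda i: columns[i])
    # Sort by repr so rows containing None still compare without errors.
    return sorted((tuple(row[i] for i in order) for row in rows), key=repr)


def assert_parity(old_cols, old_rows, new_cols, new_rows) -> None:
    """Fail if the two tables differ in anything other than column or row order."""
    assert sorted(old_cols) == sorted(new_cols), "column sets differ"
    assert len(old_rows) == len(new_rows), "row counts differ"
    assert normalize(old_cols, old_rows) == normalize(new_cols, new_rows), "values differ"


# Same data, different column and row order.
OLD_COLS = ["word", "frequency"]
OLD_ROWS = [("cat", 4.9), ("dog", None)]
NEW_COLS = ["frequency", "word"]
NEW_ROWS = [(None, "dog"), (4.9, "cat")]

assert_parity(OLD_COLS, OLD_ROWS, NEW_COLS, NEW_ROWS)
```

Normalizing both sides before comparing keeps the test sensitive to value drift while ignoring the ordering changes the migration is expected to introduce.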
Migration notes
Order of operations (single PR, but staged commits):
- Add phonolex_data.runtime submodule (schema, emit_parquet, emit_d1_sql, WordStore) with tests. Doesn't touch any caller yet.
- Generate data/runtime/{words,edges}.parquet from current data. Verify roundtrip + D1-derivation parity.
- Add Parquet artifacts to LFS (git lfs track 'data/runtime/*.parquet').
- Migrate packages/generation/server/word_norms.py to WordStore. Delete build_runtime_data.py. Tests pass.
- Swap VocabTrie to marisa-trie + parallel-dict tag counts. Tests pass.
- Refactor export-to-d1.py to defer SQL emission to emit_d1_sql on the Parquet artifact.
- Update .github/workflows/*.yml for the new CI flow.
- Last: git lfs untrack 'scripts/d1-seed.sql', remove the file from LFS history (or accept residual LFS storage for prior commits; discuss before doing the history rewrite).
Backup before LFS removal: snapshot the current d1-seed.sql to scripts/d1-seed.sql.preflight-backup-$(date +%Y%m%d) outside LFS first (a bare $(date) would put spaces in the filename). Verify the new CI-derived seed produces a working D1 in staging before committing the untrack.
v6 server cutover — single PR migrates v6 onto WordStore and deletes the four JSON dumps. No parallel-ship phase; per feedback_no_either_or and "I'm not here for bridge solutions," the dumps are the monster being killed.
Open questions for the impl phase
- Polars LazyFrame vs DataFrame at runtime: eager load is simpler and the dataset is small (~150K rows × ~98 cols ≈ ~30-60 MB compressed → ~150 MB in-memory). LazyFrame buys nothing here. Tentative: eager DataFrame. Confirm at impl.
- Schema migration policy: when a new PropertyDef lands in config.py, does the Parquet rebuild auto-trigger in CI? Tentative: yes, on any change to packages/web/workers/scripts/config.py. CI hook to add.
- Phoneme features parquet? packages/features/ ships learned per-phoneme feature vectors. Currently loaded into the Worker's similarity route at cold start; the generation server doesn't use them yet. Out of scope unless a v7 editor strategy needs them.
- Edges schema, single file or per-type? Current decision (§A): one edges.parquet with a type column. Edge types stay extensible without changing words. Confirm during impl.
- DuckDB layer: DuckDB-on-Parquet for ad-hoc SQL analysis is a free ergonomic win (no runtime dep; DuckDB CLI / notebook usage only). Worth documenting in the runbook but not a code dep.
Risk register
R1 — Polars dependency footprint. Adds ~30 MB wheel to phonolex_data. Severity: low. Mitigation: pin a specific Polars major version; track wheel size in CI.
R2 — Marisa-trie static-immutability assumption. If a future v6 feature needs to mutate the global trie post-build (incremental word additions, online learning), the static structure breaks. Severity: low — no such feature exists or is planned. Falsification trigger: any PR adding mutable global-trie operations.
R3 — D1 seed regression during CI derivation. First runs of emit_d1_sql may produce a seed that wrangler can't ingest, or that produces a D1 with subtly different row count / schema vs the current hand-built version. Severity: medium. Mitigation: regression test (test §3) catches this pre-deploy.
R4 — LFS untracking is irreversible-ish. Removing d1-seed.sql from LFS while leaving it in git history requires a history rewrite. Severity: medium. Mitigation: keep the LFS pointer for a release cycle; only do the history rewrite when staging proves stable. Backup snapshot before any rewrite.
R5 — v6 server tests don't fully cover JSON-dump → WordStore migration. Some word_norms.py paths may have implicit JSON-shape assumptions not surfaced by tests. Severity: medium. Mitigation: run the generation server end-to-end on the staging environment with a real spec before merging; the existing eval-harness-v1 provides a smoke check.
R6 — Schema codegen brittleness. Driving Polars schema from PropertyDef introspection assumes PropertyDef fields stay in sync with what Parquet expects (types, nullability). Severity: low. Mitigation: snapshot test for the codegen output; CI fails if schema drifts unexpectedly.
Decision-recommendation
GO. All architectural decisions resolved during brainstorm:
- Columnar (Polars + Parquet) over dict-of-dataclasses
- Parquet canonical, D1 derived in CI from Parquet
- selectional.parquet schema defined here, population deferred to ticket B
- v6 server migrates in same ticket (single moving target)
- d1-seed.sql exits LFS as a CI artifact (with snapshot backup)
- Trie editor/CFG work scoped out (sibling ticket C); per-request prefix structures stay as Python dict-tries
- Marisa-trie swap for the static global VocabTrie included as §B
Estimate: 6-8 working days for v1 implementation. Milestones in the migration order above.
Branch: feature/phon-93-trie-centric-rebuild off release/v5.2.0. Sibling tickets (B, C) branch separately.
Per feedback_verify_jira_state.md: at writing-plans handoff, JQL the next free PHON-XX before reserving sibling-ticket numbers. Also: edit PHON-93's Jira description to reflect this rescope (the original "trie-centric editor rebuild" framing is superseded by "runtime word-data layer"; editor scope moves to the new sibling ticket).