Runtime Word-Data Layer — Design (PHON-93, rescoped)

Date: 2026-05-05
Status: Spec, pending user review
Branch: feature/phon-93-trie-centric-rebuild (off release/v5.2.0)
Tickets: PHON-93 (rescoped; supersedes the trie-centric editor scope. The editor + CFG enumerator move to a sibling ticket created at writing-plans handoff per feedback_verify_jira_state.md)


Problem

PhonoLex's runtime word data is currently fragmented across three independently-maintained paths to the same source-of-truth:

  • Workers API (production, phonolex.com) — queries Cloudflare D1 over HTTP
  • v6 generation server — reads four build-time JSON dumps (norms_dump.json 243 MB, vocab_dump.json 0.5 MB, phoneme_rates.json 1 KB, assoc_graph.json)
  • PHON-64 spike (and any future Python research) — opens the miniflare D1 SQLite directly, hand-writes a SELECT, hand-defines a LexEntry dataclass with ~22 cherry-picked columns

Each path ships its own subset of schema knowledge. packages/generation/scripts/build_runtime_data.py maintains a PROPERTY_COLUMNS list as a manual subset of packages/web/workers/scripts/config.py's PropertyDef records — the source-of-truth lives in config.py but consumers don't see it directly. PHON-88's freq-band columns landed in D1 weeks ago but haven't propagated to build_runtime_data.py. This is a sync-drift bug already in production.

Recent data work (PHON-72/73/76/81/82/83/84/85/86/87/88) added ~98 fields to D1 across a 4-table split (words, word_properties, word_percentiles, word_freq_bands). The fragmented runtime contract is collapsing under that growth — and won't survive the further additions PHON-93's downstream tickets require (DEP corpus reannotation, selectional preferences).

Goal: unify the runtime word-data layer behind a single columnar artifact and a single in-process store. Make the source-of-truth canonical, machine-derivable, and schema-driven. Eliminate the three independent paths.


Scope

In:

  • phonolex_data.runtime: new submodule containing schema, emit_parquet, emit_d1_sql, WordStore
  • data/runtime/words.parquet + data/runtime/edges.parquet: LFS-tracked canonical artifacts
  • selectional.parquet schema definition (population deferred; corpus DEP reannotation is sibling ticket B)
  • scripts/d1-seed.sql exits LFS, derived from Parquet at CI time
  • v6 generation server migrates off the four JSON dumps onto WordStore
  • phonolex_governors.generation.trie.VocabTrie swaps the Python dict-trie for a marisa-trie-backed implementation, API preserved
  • Schema-as-code: PropertyDef records drive Parquet schema + (optional) Pydantic row model

Out (sibling tickets):

  • C: Editor + CFG enumerator (mlm_iterative_editor, argstruc_enumerator, joint_mask_pll, per-request prefix structures). PHON-93 ends at WordStore.subset(expr).get_column("word").to_list().
  • B: Corpus DEP reannotation (extends PHON-72 with spaCy DEP labels; populates selectional.parquet)
  • D: Frontend word-data cards (deferred per user direction)
  • E: Workers API rewire / D1 replacement (D1 stays as Workers' serving cache, regenerable)
  • v7 production integration (downstream)


Methodology principle

Source-of-truth migrates from D1 to Parquet. D1 stays as the Workers-API serving cache: it's a shape that fits Cloudflare's runtime, not a canonical store. In CI, scripts/d1-seed.sql is derived from data/runtime/*.parquet and is regenerable on demand; LFS shrinks from 274 MB (current d1-seed.sql) to ~50 MB (Parquet artifacts).

Datasets ship as columnar. Polars + Parquet is the standard 2026 shape for ML-adjacent reference data (HF Datasets / FineWeb-Edu / ConceptNet modern releases). Dict-of-dataclasses is legacy Python; it doesn't compose with the rest of the ecosystem (vector ops, expression predicates, lazy eval, schema enforcement, compression).

Schema-as-code. config.py's PropertyDef records are the schema authority. Both Parquet schema and (where wanted) a Pydantic row model are generated from those records. There is no PROPERTY_COLUMNS list to maintain.
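
A minimal sketch of what "schema generated from PropertyDef records" could look like. The PropertyDef fields, the kind vocabulary, the kind-to-dtype mapping, and the two property names are all assumptions made for illustration; the real record lives in config.py and may be shaped differently. Dtype names are kept as strings so the sketch runs without Polars installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyDef:
    """Hypothetical stand-in; the real record in config.py may differ."""
    name: str
    kind: str  # assumed vocabulary: "float", "int", "str", "bool"


# Assumed mapping from kind to a Polars dtype name (strings, so no Polars import).
KIND_TO_DTYPE = {
    "float": "Float32",
    "int": "UInt32",
    "str": "String",
    "bool": "Boolean",
}


def derive_schema(props: list[PropertyDef]) -> dict[str, str]:
    """Return column -> dtype name, with the 'word' key column first."""
    schema = {"word": "String"}
    for prop in props:
        if prop.name in schema:
            raise ValueError(f"duplicate column: {prop.name}")
        if prop.kind not in KIND_TO_DTYPE:
            raise ValueError(f"unknown kind {prop.kind!r} for column {prop.name}")
        schema[prop.name] = KIND_TO_DTYPE[prop.kind]
    return schema


# Illustrative property names only; not taken from config.py.
props = [PropertyDef("aoa", "float"), PropertyDef("syllable_count", "int")]
schema = derive_schema(props)
print(schema)
```

Because the column list is derived rather than hand-maintained, a new PropertyDef shows up in the schema without anyone editing a second list, which is the drift this section is trying to remove.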

Quality-filtered subsets are queries, not schema. The "canon" subset (words with coverage across N+ datasets) is a Polars filter expression over words.parquet, not a separate artifact and not a hardcoded count. Recent data work (PHON-72/73/76/81/82/83/84/85/86/87/88) has expanded coverage, so any historical figure (e.g., the "~44K canon words" referenced in older docs) is stale and the count is larger going forward. CLAUDE.md and other docs that quote a fixed count should be updated when this lands.


Architectural inversions from the trie-centric founding memo

The trie-centric findings memo (packages/generation/research/2026-05-05-trie-centric-rebuild/findings-and-scope.md) framed PHON-93's central abstraction as a tagged trie: every constraint dimension (spec compliance, selectional preferences, per-position locks) becomes a tag dimension on a single per-request trie. PHON-93's rescope inverts that framing: the unifying abstraction shifts from tagged trie to Polars expressions over Parquet. The Jira ticket's three named revisions need updating:

| Memo's revision | PHON-93 rescope |
|---|---|
| #1: PHON-72's full 1.06M-doc sample is the corpus | Unchanged. Holds. |
| #2: Selectional preferences become per-slot trie tagging, NOT a separate D1 table | Reverted. Selectional preferences become a separate Parquet table (selectional.parquet) joined per request via Polars. PHON-92's original "table" framing was structurally closer to right. |
| #3: PPL as tiebreak becomes joint-mask MLM-PLL as the editor's optimization target | Unchanged, but moves to sibling ticket C (editor scope). |

The trie_tagger module the memo proposed (~150 LOC, NEW) is retired from PHON-93's scope. Constraint composition happens via Polars boolean algebra on Parquet columns, not via tag-dimension composition on a trie.

Mapping: each role the founding memo gave the trie

| Founding-memo role | New home |
|---|---|
| 1. Lexicon hydration | WordStore.from_parquet() at startup |
| 2. Spec compliance (per-spec tag) | WordStore.subset(spec_expr), a Polars filter expression |
| 3. Selectional preferences (per-slot tag) | selectional.parquet joined per request; columns expose count + PPMI |
| 4. Per-position locking | Editor-internal small per-position word/token-id sets (ticket C) |
| 5. CFG enumeration walk | argstruc_enumerator queries WordStore.subset(slot_expr) for typed terminals (ticket C) |
| 6. MLM editor logit-intersection | Editor's per-request prefix structure (boolean dict-trie) (ticket C) |
| 7. Coherence ranking | Joint-mask MLM-PLL, editor-internal (ticket C) |
| 8. Diversity (anti-rep, temperature) | Editor-internal (ticket C) |

PHON-93's surface covers 1-3 directly; feeds 4-6 by exposing query primitives.


Binary vs continuous, resolved

PHON-92's memo §3 made the substantive decision: continuous PPMI per (verb, role, filler) triple, with an enumeration-time admission threshold at PMI ≥ 0 (mathematically calibrated: positive log-ratio = above chance) and a coverage gate (c(v, r, *) ≥ 50 to trust zero-PPMI rejections). PISA-vector continuous score was rejected partly because it requires per-verb learned thresholds; PMI's threshold is structural, not learned.

The trie-centric memo's §5.1 left this as an "open question" only because it wasn't sure how to translate continuous PMI into trie tag dimensions. The columnar pivot dissolves the question:

selectional.parquet
   columns: verb, role, filler, count_v_r_f, count_v_r_star, ppmi
   storage: continuous PPMI per (V, role, F) triple — PHON-92's decision intact

Consumer-side flexibility surfaces all three thresholding strategies PHON-92 named:

| Query | Polars expression |
|---|---|
| Default admission (PMI ≥ 0) | filter(pl.col("ppmi") > 0) |
| Tighter threshold | filter(pl.col("ppmi") > 1.0) |
| Coverage-gated rejection | filter((pl.col("ppmi") > 0) \| (pl.col("count_v_r_star") < 50)) |
| Continuous bias term | read pl.col("ppmi") directly, add α·ppmi to LM logit |
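
The boolean logic behind these strategies can be checked by hand with a plain-Python mirror of the predicates. This is a sketch, not the Polars code path: it reads the coverage gate as "trust a zero-PPMI rejection only when c(v, r, *) ≥ 50", so a row is rejected only when its PPMI is zero and its coverage is at least 50. The function names, the sample fillers, and their numbers are made up for illustration.

```python
COVERAGE_MIN = 50  # c(v, r, *) needed before a zero-PPMI rejection is trusted


def admit_default(ppmi: float) -> bool:
    """Default admission: strictly above-chance association."""
    return ppmi > 0.0


def admit_tight(ppmi: float, threshold: float = 1.0) -> bool:
    """Tighter threshold on the same column."""
    return ppmi > threshold


def admit_coverage_gated(ppmi: float, count_v_r_star: int) -> bool:
    """Reject only when PPMI is zero AND coverage is high enough to trust that."""
    return ppmi > 0.0 or count_v_r_star < COVERAGE_MIN


def biased_logit(lm_logit: float, ppmi: float, alpha: float) -> float:
    """Continuous use: add alpha * ppmi to the LM logit instead of filtering."""
    return lm_logit + alpha * ppmi


# Made-up rows for one (verb, role) pair.
rows = [
    {"filler": "apple", "ppmi": 2.1, "count_v_r_star": 400},
    {"filler": "idea", "ppmi": 0.0, "count_v_r_star": 400},    # trusted rejection
    {"filler": "kumquat", "ppmi": 0.0, "count_v_r_star": 12},  # too sparse to reject
]
admitted = [
    r["filler"] for r in rows
    if admit_coverage_gated(r["ppmi"], r["count_v_r_star"])
]
print(admitted)
```

All four consumers read the same stored column, so the choice of strategy stays with the caller and needs no schema change.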

The trie itself stays bounded to word-membership (boolean) — but only because the continuous data is preserved upstream in selectional.parquet. A boolean trie without selectional.parquet would discard PHON-92's signal at storage and would still be wrong; the columnar pivot is what makes boolean acceptable. This distinction matters for downstream readers.

PHON-93 ships selectional.parquet's schema; population (DEP corpus reannotation) is sibling ticket B.


Architecture

data/{norms,cmu,mappings}
        │
        ▼
build_words()          [unchanged]
        │
        ▼
emit_parquet()         [NEW — phonolex_data.runtime]
        │
        ▼
data/runtime/words.parquet  (LFS, canonical, ~30-60 MB)
data/runtime/edges.parquet  (LFS, canonical, ~5-10 MB)
data/runtime/selectional.parquet  (LFS, schema only — populated by ticket B)
        │
   ┌────┴────────────────────────────┐
   ▼                                 ▼
emit_d1_sql()              WordStore (Polars)
   │                                 │
   ▼                                 ▼
scripts/d1-seed.sql        editor + CFG enumerator (ticket C)
 (CI artifact, not LFS)    v6 generation server (migrates here)
   │                                 │
   ▼                                 ▼
wrangler d1 execute        per-request structures built from
   │                       WordStore.subset(...).to_list()
   ▼
Cloudflare D1
(Workers serving cache,
 unchanged at request time)

Components

All in phonolex_data.runtime:

| Module | Role |
|---|---|
| schema | Codegen from config.py PropertyDef records → Polars schema dict + optional Pydantic row model. Single source of schema truth. |
| emit_parquet | Pipeline records → words.parquet + edges.parquet. Called at data-build time. |
| emit_d1_sql | Parquet → D1 DDL + INSERT statements. CI build step, output not LFS-tracked. |
| WordStore | Polars pl.DataFrame wrapper exposing the locked query set + dict[str, int] word→row_idx for O(1) get. |

Components elsewhere:

| Module | Change |
|---|---|
| phonolex_governors.generation.trie.VocabTrie | API preserved; underlying storage swaps Python dict-trie → marisa-trie + parallel dict[node_id, (banned, total)] for per-request tag counts. See §B below. |
| packages/generation/scripts/build_runtime_data.py | Deleted. Four JSON dumps go away. |
| packages/generation/server/word_norms.py | Migrates to read from WordStore (loaded from Parquet at server cold-start). |
| PHON-64 spike's lexicon.py | Deleted. Was a hand-rolled D1-SQLite reader; replaced by WordStore. |
| packages/web/workers/scripts/export-to-d1.py | Refactored: emit Parquet first; SQL emission moves to emit_d1_sql called from CI on the Parquet artifact. |

Three deletions. The fragmented runtime contract collapses to one canonical artifact + one runtime store.

WordStore API

from pathlib import Path

import polars as pl


class WordStore:
    @classmethod
    def from_parquet(cls, path: Path) -> "WordStore": ...

    def get(self, word: str) -> dict | None: ...          # O(1) via word→row_idx hash
    def subset(self, expr: pl.Expr) -> pl.DataFrame: ...  # Polars filter
    def prefix(self, prefix: str) -> list[str]: ...       # words starting with prefix
    def iterate_typed(self, expr: pl.Expr) -> list[str]: ... # word list for CFG slot
    def is_admitted(self, word: str, expr: pl.Expr) -> bool: ...  # row lookup + expr eval

    @property
    def df(self) -> pl.DataFrame: ...                      # escape hatch for ad-hoc queries

The five queries match the consumer set locked during brainstorm (editor + CFG enumerator + v6 server). df is the escape hatch for one-off needs the API doesn't anticipate.

selectional.parquet schema (defined here, populated by ticket B)

| column | type | notes |
|---|---|---|
| verb | str | head verb (lemmatized) |
| role | str | dependency relation (nsubj, dobj, iobj, obl, …) |
| filler | str | argument lemma |
| count_v_r_f | u32 | observed c(v, r, f) |
| count_v_r_star | u32 | c(v, r, *) for the coverage gate |
| ppmi | f32 | per-role PPMI with add-α=0.01 smoothing, sparsity floor min_count=5 |

PHON-92's formula:

P̂(f | v, r) = (c(v, r, f) + α) / (c(v, r, *) + α · |F_r|)
P̂(f | r)    = (c(*, r, f) + α) / (c(*, r, *) + α · |F_r|)
PPMI(v, r, f) = max(0, log₂( P̂(f | v, r) / P̂(f | r) ))

Triples with c(v, r, f) < 5 are not stored (sparsity floor — evidence-of-absence).
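
As a worked instance of the formula, the sketch below computes PPMI for two invented count settings, one above chance and one below. The counts and the filler-vocabulary size |F_r| are made up for illustration; only α = 0.01 and the min_count = 5 floor come from the schema notes above.

```python
import math

ALPHA = 0.01   # add-alpha smoothing constant
MIN_COUNT = 5  # sparsity floor: triples below this are not stored


def ppmi(c_vrf: int, c_vr_star: int, c_star_rf: int, c_star_r_star: int,
         n_fillers: int, alpha: float = ALPHA) -> float:
    """PPMI(v, r, f) = max(0, log2(P(f | v, r) / P(f | r))) with add-alpha smoothing."""
    p_f_given_vr = (c_vrf + alpha) / (c_vr_star + alpha * n_fillers)
    p_f_given_r = (c_star_rf + alpha) / (c_star_r_star + alpha * n_fillers)
    return max(0.0, math.log2(p_f_given_vr / p_f_given_r))


def is_stored(c_vrf: int) -> bool:
    """Apply the sparsity floor before a triple is written."""
    return c_vrf >= MIN_COUNT


# Invented counts: the filler takes 20 of the verb's 200 role slots,
# but only 500 of 100,000 slots for that role corpus-wide.
above_chance = ppmi(c_vrf=20, c_vr_star=200, c_star_rf=500,
                    c_star_r_star=100_000, n_fillers=10_000)

# Same filler, but the verb uses it far less often than the corpus does.
below_chance = ppmi(c_vrf=5, c_vr_star=10_000, c_star_rf=500,
                    c_star_r_star=100_000, n_fillers=10_000)

print(round(above_chance, 2), below_chance)
```

The second case shows why the stored column never goes negative: a below-chance ratio has a negative log, which the max(0, ...) clamp maps to zero.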


§B — VocabTrie swap to marisa-trie

The basic Python dict-trie at packages/governors/src/phonolex_governors/generation/trie.py swaps to a marisa-trie-backed implementation.

Why included in PHON-93:

  • Static global trie at v6 startup (built once from 126K words, lifetime of the process) is exactly marisa-trie's strength: 50-100× memory reduction, sub-µs query, sub-ms build at this scale.
  • Lands cleanly alongside the data-layer work: both are runtime-infrastructure cleanups; low coupling between them but they share the same surface area.
  • Per-request small structures (500-2K word spec lexicons) stay as Python dict-tries. marisa-trie is wasted at that grain because rebuilding 100×/day on small lists makes the static-rebuild cost dominate.

Wrinkle: marisa-trie is static (immutable after build). Current tag(banned_words) mutates per-request banned_below / total_below counts on every node. Solution: marisa-trie + parallel dict[node_id, (banned, total)] updated on tag(). Marisa-trie exposes node IDs; the parallel dict stores per-request counts; dead_end_ratio() queries marisa-trie for the node, then the dict for counts. API contract preserved end-to-end.
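
The split between an immutable index and mutable per-request counts can be sketched in pure Python. Three things here are assumptions rather than the planned implementation: a sorted tuple stands in for marisa-trie, counts are keyed by prefix string rather than by node ID (whether the binding exposes IDs for prefixes that are not themselves stored keys should be confirmed at impl), and counts are filled lazily on first query instead of eagerly in tag(). The class names and the empty-prefix-set convention are also hypothetical.

```python
from bisect import bisect_left


class StaticPrefixIndex:
    """Immutable stand-in for the static trie: a sorted tuple of words."""

    def __init__(self, words):
        self._words = tuple(sorted(set(words)))

    def keys(self, prefix: str) -> list[str]:
        """All stored words starting with prefix (contiguous in sorted order)."""
        i = bisect_left(self._words, prefix)
        out = []
        while i < len(self._words) and self._words[i].startswith(prefix):
            out.append(self._words[i])
            i += 1
        return out


class TaggedVocab:
    """Static index built once, plus per-request (banned, total) counts."""

    def __init__(self, words):
        self._index = StaticPrefixIndex(words)          # never mutated after build
        self._banned: set[str] = set()                  # per-request state
        self._counts: dict[str, tuple[int, int]] = {}   # prefix -> (banned, total)

    def tag(self, banned_words) -> None:
        """Start a new request: replace the banned set and drop cached counts."""
        self._banned = set(banned_words)
        self._counts.clear()

    def dead_end_ratio(self, prefix: str) -> float:
        """Fraction of words below prefix that are banned (1.0 if none exist)."""
        if prefix not in self._counts:
            below = self._index.keys(prefix)
            banned = sum(1 for w in below if w in self._banned)
            self._counts[prefix] = (banned, len(below))
        banned, total = self._counts[prefix]
        return banned / total if total else 1.0


vocab = TaggedVocab(["cat", "car", "cart", "dog"])
vocab.tag(["car", "cart"])
print(vocab.dead_end_ratio("ca"), vocab.dead_end_ratio("d"))
```

Only the side dictionary changes between requests, so the expensive structure is built once per process while tag() stays cheap.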

License: marisa-trie wrapper is MIT; C++ core is dual-licensed BSD-2 / LGPL-2.1. Pick BSD-2 at install time to avoid LGPL contamination of the proprietary build. Documented option in the wrapper. Consistent with the data-license audit's "permissive only" standard.


Data flow

Build-time / CI:

  1. data/{norms,cmu,mappings} → phonolex_data.pipeline.build_words() → records (unchanged)
  2. records → phonolex_data.runtime.emit_parquet() → data/runtime/{words,edges}.parquet (LFS-tracked)
  3. data/runtime/{words,edges}.parquet → phonolex_data.runtime.emit_d1_sql() → scripts/d1-seed.sql (CI artifact, not LFS)
  4. wrangler d1 execute phonolex --file scripts/d1-seed.sql → Cloudflare D1
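
Step 3 amounts to turning typed rows into DDL plus INSERT statements. The sketch below shows that shape using plain dict rows in place of a Parquet read, and checks the output by loading it into an in-memory SQLite database (D1 is SQLite-based, so this is a reasonable local smoke check, not a substitute for wrangler). The function names, the table layout, and the sample rows are hypothetical; non-finite floats are not handled.

```python
import sqlite3


def sql_literal(value) -> str:
    """Render a Python value as a SQLite literal."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return repr(value)
    return "'" + str(value).replace("'", "''") + "'"  # escape single quotes


def emit_sql(table: str, schema: dict[str, str], rows: list[dict]) -> str:
    """Build CREATE TABLE + one INSERT per row, columns in schema order."""
    cols = list(schema)
    ddl_cols = ", ".join(f'"{c}" {schema[c]}' for c in cols)
    lines = [f'CREATE TABLE "{table}" ({ddl_cols});']
    col_list = ", ".join(f'"{c}"' for c in cols)
    for row in rows:
        values = ", ".join(sql_literal(row.get(c)) for c in cols)
        lines.append(f'INSERT INTO "{table}" ({col_list}) VALUES ({values});')
    return "\n".join(lines) + "\n"


# Illustrative schema and rows; not the real words table.
schema = {"word": "TEXT PRIMARY KEY", "aoa": "REAL"}
rows = [{"word": "cat", "aoa": 3.5}, {"word": "o'clock", "aoa": None}]
seed = emit_sql("words", schema, rows)

conn = sqlite3.connect(":memory:")
conn.executescript(seed)
loaded = conn.execute("SELECT word, aoa FROM words ORDER BY word").fetchall()
conn.close()
print(loaded)
```

Deriving the seed from one schema dict keeps the DDL and the INSERT column lists from drifting apart, which is the same property the Parquet schema is meant to guarantee upstream.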

CI trigger: data/{norms,cmu,mappings}/** change → CI rebuilds Parquet → derives d1-seed.sql → seeds D1. Existing dorny/paths-filter pattern in .github/workflows/deploy.yml extends naturally.

Runtime (generation server cold-start):

  1. Load data/runtime/{words,edges}.parquet (baked into Docker image at build)
  2. WordStore.from_parquet() → pl.DataFrame + word→row_idx hash
  3. Marisa-trie built from df["word"].to_list() for v6 Reranker's global-trie role

Per-request (editor + CFG, ticket C; shape only):

  1. Compose Polars expression from constraint set
  2. WordStore.subset(expr) → filtered word list
  3. Build per-request boolean dict-trie from word list
  4. (Once selectional.parquet is populated) WordStore.subset(spec_expr).join(selectional, on=("verb","filler")).filter(pl.col("ppmi") > 0) → admitted list
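
Step 3 of the per-request flow, building a boolean dict-trie from a word list, might look like the sketch below. It is character-level for readability; the editor in ticket C may well key on token IDs instead. The END sentinel and the helper names are assumptions, and the sentinel is assumed never to occur inside a word.

```python
END = "$"  # end-of-word marker; assumed never to appear in a word


def build_trie(words) -> dict:
    """Nested dicts, one level per character; END marks a complete word."""
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def _walk(trie: dict, text: str):
    """Follow text from the root; return the node reached, or None."""
    node = trie
    for ch in text:
        nxt = node.get(ch)
        if not isinstance(nxt, dict):
            return None
        node = nxt
    return node


def contains(trie: dict, word: str) -> bool:
    """True only if word itself was inserted."""
    node = _walk(trie, word)
    return node is not None and END in node


def has_prefix(trie: dict, prefix: str) -> bool:
    """True if at least one inserted word starts with prefix."""
    return _walk(trie, prefix) is not None


# In the real flow the list would come from WordStore.subset(expr).
trie = build_trie(["cat", "car", "cart"])
print(contains(trie, "car"), contains(trie, "ca"), has_prefix(trie, "ca"))
```

The trie carries membership only; any continuous score stays in the columnar tables and is applied before the word list reaches this step.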


Error handling

| Scenario | Behavior |
|---|---|
| Parquet schema diverges from PropertyDef | WordStore.from_parquet() fails fast, names missing/extra columns |
| Parquet missing at server startup | Hard fail, server doesn't start (vs lazy-fail at query time) |
| Parquet older than data/{norms,cmu,mappings}/* source | CI hook rebuilds; in dev mode, log an explicit warning instead of failing |
| WordStore.subset(...) returns empty | Empty result, no error; caller decides |
| selectional.parquet missing or empty | Treat as "no selectional data available"; downstream consumers skip the join, fall back to spec-only filter (ticket C policy) |
| d1-seed.sql derivation fails in CI | Block deploy, surface schema diff in failure message |
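
The fail-fast schema check in the first row could be as small as a set difference over column names plus a dtype comparison. The sketch below assumes both schemas are available as column → dtype-name mappings; the exception class, function names, and column names are hypothetical.

```python
class SchemaMismatch(Exception):
    """Raised at load time when the artifact and the expected schema disagree."""


def schema_diff(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Name the columns that are missing, extra, or have a different dtype."""
    shared = expected.keys() & actual.keys()
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "extra": sorted(actual.keys() - expected.keys()),
        "dtype_mismatch": sorted(c for c in shared if expected[c] != actual[c]),
    }


def check_schema(expected: dict[str, str], actual: dict[str, str]) -> None:
    """Raise with the full diff in the message; return silently when schemas agree."""
    diff = schema_diff(expected, actual)
    if any(diff.values()):
        raise SchemaMismatch(
            f"missing={diff['missing']} extra={diff['extra']} "
            f"dtype_mismatch={diff['dtype_mismatch']}"
        )


# Illustrative schemas: one column missing, one extra, one with a changed dtype.
expected = {"word": "String", "aoa": "Float32", "syllable_count": "UInt32"}
actual = {"word": "String", "zipf": "Float32", "syllable_count": "Int64"}
diff = schema_diff(expected, actual)
print(diff)
```

Running the check at startup rather than at first query is what turns a silent wrong-column read into the hard fail the table asks for.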

Testing

  • Schema codegen: PropertyDef → expected Polars schema (snapshot test against tests/runtime/test_schema_snapshot.json)
  • Parquet roundtrip: records → emit_parquet → WordStore.from_parquet → query returns expected values; tests N=100 sampled words across ~10 fields
  • D1-derivation regression: emit_d1_sql output diffs against the current hand-built d1-seed.sql shape (DDL + first 100 INSERT rows). Allow column-order differences; assert row count + value parity.
  • Marisa-trie API parity: existing VocabTrie test suite passes on the new implementation. A single pytest -k vocab_trie command should pass against both the old and new impl during the migration window.
  • v6 server migration: existing generation-server tests pass after build_runtime_data.py removal and word_norms.py rewire.
  • CI integration: data-file change triggers CI Parquet rebuild → row count + schema asserts → deploy.
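
One way to express the D1-derivation parity check (row count and value parity, column order ignored) is to load both seeds into in-memory SQLite and compare normalized snapshots. The helper name and the two tiny seeds are invented for illustration; the real test would read the hand-built and CI-derived d1-seed.sql files.

```python
import sqlite3


def table_snapshot(seed_sql: str, table: str):
    """Load a seed, then return (sorted column names, rows reordered and sorted)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(seed_sql)
    cur = conn.execute(f'SELECT * FROM "{table}"')
    cols = [d[0] for d in cur.description]
    order = sorted(range(len(cols)), key=lambda i: cols[i])  # canonical column order
    rows = [tuple(row[i] for i in order) for row in cur.fetchall()]
    conn.close()
    # Sort by repr so rows containing NULL (None) still compare cleanly.
    return [cols[i] for i in order], sorted(rows, key=repr)


# Two invented seeds with the same data but different column and row order.
old_seed = """
CREATE TABLE words (word TEXT PRIMARY KEY, aoa REAL, zipf REAL);
INSERT INTO words VALUES ('cat', 3.5, 4.9);
INSERT INTO words VALUES ('dog', NULL, 5.1);
"""
new_seed = """
CREATE TABLE words (zipf REAL, word TEXT PRIMARY KEY, aoa REAL);
INSERT INTO words (zipf, word, aoa) VALUES (5.1, 'dog', NULL);
INSERT INTO words (zipf, word, aoa) VALUES (4.9, 'cat', 3.5);
"""

old_snapshot = table_snapshot(old_seed, "words")
new_snapshot = table_snapshot(new_seed, "words")
print(old_snapshot == new_snapshot)
```

Comparing loaded tables instead of raw SQL text keeps the test insensitive to formatting and statement order, which a line diff of the two seed files would not be.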

Migration notes

Order of operations (single PR, but staged commits):

  1. Add phonolex_data.runtime submodule (schema, emit_parquet, emit_d1_sql, WordStore) with tests. Doesn't touch any caller yet.
  2. Generate data/runtime/{words,edges}.parquet from current data. Verify roundtrip + D1-derivation parity.
  3. Add Parquet artifacts to LFS (git lfs track 'data/runtime/*.parquet').
  4. Migrate packages/generation/server/word_norms.py to WordStore. Delete build_runtime_data.py. Tests pass.
  5. Swap VocabTrie to marisa-trie + parallel-dict tag counts. Tests pass.
  6. Refactor export-to-d1.py to defer SQL emission to emit_d1_sql on the Parquet artifact.
  7. Update .github/workflows/*.yml for the new CI flow.
  8. Last: git lfs untrack 'scripts/d1-seed.sql', remove the file from LFS history (or accept residual LFS storage for prior commits — discuss before doing the history rewrite).

Backup before LFS removal: snapshot the current d1-seed.sql to scripts/d1-seed.sql.preflight-backup-$(date +%Y%m%d) outside LFS first. Verify the new CI-derived seed produces a working D1 in staging before committing the untrack.

v6 server cutover — single PR migrates v6 onto WordStore and deletes the four JSON dumps. No parallel-ship phase; per feedback_no_either_or and "I'm not here for bridge solutions," the dumps are the monster being killed.


Open questions for the impl phase

  1. Polars LazyFrame vs DataFrame at runtime — eager load is simpler and the dataset is small (~150K rows × ~98 cols ≈ ~30-60 MB compressed → ~150 MB in-memory). LazyFrame buys nothing here. Tentative: eager DataFrame. Confirm at impl.
  2. Schema migration policy — when a new PropertyDef lands in config.py, does the Parquet rebuild auto-trigger in CI? Tentative: yes, on any change to packages/web/workers/scripts/config.py. CI hook to add.
  3. Phoneme features parquet? packages/features/ ships learned per-phoneme feature vectors. Currently loaded into the Worker's similarity route at cold start; the generation server doesn't use them yet. Out of scope unless a v7 editor strategy needs them.
  4. Edges schema — single file or per-type? — current decision (§A): one edges.parquet with a type column. Edge types stay extensible without changing words. Confirm during impl.
  5. DuckDB layer — DuckDB-on-Parquet for ad-hoc SQL analysis is a free ergonomic win (no runtime dep — DuckDB CLI / notebook usage only). Worth documenting in the runbook but not a code dep.

Risk register

R1 — Polars dependency footprint. Adds ~30 MB wheel to phonolex_data. Severity: low. Mitigation: pin a specific Polars major version; track wheel size in CI.

R2 — Marisa-trie static-immutability assumption. If a future v6 feature needs to mutate the global trie post-build (incremental word additions, online learning), the static structure breaks. Severity: low — no such feature exists or is planned. Falsification trigger: any PR adding mutable global-trie operations.

R3 — D1 seed regression during CI derivation. First runs of emit_d1_sql may produce a seed that wrangler can't ingest, or that produces a D1 with subtly different row count / schema vs the current hand-built version. Severity: medium. Mitigation: regression test (test §3) catches this pre-deploy.

R4 — LFS untracking is irreversible-ish. Removing d1-seed.sql from LFS while leaving it in git history requires a history rewrite. Severity: medium. Mitigation: keep the LFS pointer for a release cycle; only do the history rewrite when staging proves stable. Backup snapshot before any rewrite.

R5 — v6 server tests don't fully cover JSON-dump → WordStore migration. Some word_norms.py paths may have implicit JSON-shape assumptions not surfaced by tests. Severity: medium. Mitigation: run the generation server end-to-end on the staging environment with a real spec before merging; the existing eval-harness-v1 provides a smoke check.

R6 — Schema codegen brittleness. Driving Polars schema from PropertyDef introspection assumes PropertyDef fields stay in sync with what Parquet expects (types, nullability). Severity: low. Mitigation: snapshot test for the codegen output; CI fails if schema drifts unexpectedly.


Decision-recommendation

GO. All architectural decisions resolved during brainstorm:

  • Columnar (Polars + Parquet) over dict-of-dataclasses
  • Parquet canonical, D1 derived in CI from Parquet
  • selectional.parquet schema defined here, population deferred to ticket B
  • v6 server migrates in same ticket (single moving target)
  • d1-seed.sql exits LFS as a CI artifact (with snapshot backup)
  • Trie editor/CFG work scoped out (sibling ticket C); per-request prefix structures stay as Python dict-tries
  • Marisa-trie swap for the static global VocabTrie included as §B

Estimate: 6-8 working days for v1 implementation. Milestones in the migration order above.

Branch: feature/phon-93-trie-centric-rebuild off release/v5.2.0. Sibling tickets (B, C) branch separately.

Per feedback_verify_jira_state.md: at writing-plans handoff, JQL the next free PHON-XX before reserving sibling-ticket numbers. Also: edit PHON-93's Jira description to reflect this rescope (the original "trie-centric editor rebuild" framing is superseded by "runtime word-data layer"; editor scope moves to the new sibling ticket).