Runtime Word-Data Layer — Implementation Plan (PHON-93)

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Unify PhonoLex's runtime word-data layer behind a Polars+Parquet columnar store, make Parquet the canonical source of truth (the D1 seed SQL is derived from it in CI), and swap VocabTrie's underlying storage to marisa-trie.

Architecture: A new phonolex_data.runtime submodule provides four pieces: schema (codegen from PropertyDef), WordStore (Polars-backed query store), emit_parquet (records → Parquet), and emit_d1_sql (Parquet → seed SQL). Three deletions: build_runtime_data.py, the four JSON dumps, and PHON-64's lexicon.py. The v6 generation server migrates to WordStore; the VocabTrie API is preserved with marisa-trie underneath.

Tech Stack: Python 3.10+, Polars (new dep on phonolex_data), marisa-trie BSD-2 build (new dep on phonolex_governors), pytest, Cloudflare Workers Wrangler (existing).

Spec: docs/superpowers/specs/2026-05-05-phon-93-runtime-word-data-layer-design.md

Branch: feature/phon-93-trie-centric-rebuild (off release/v5.2.0)

Jira: PHON-93 (rescoped). At handoff: edit description to match this plan; create sibling tickets B (corpus DEP reannotation) and C (editor + CFG enumerator) per feedback_verify_jira_state.md.


File Structure

packages/data/
├── pyproject.toml                                      [modify: add polars dep]
├── src/phonolex_data/runtime/                          [NEW submodule]
│   ├── __init__.py                                     [NEW: public API]
│   ├── schema.py                                       [NEW: PropertyDef → Polars schema]
│   ├── store.py                                        [NEW: WordStore class]
│   ├── emit_parquet.py                                 [NEW: records → Parquet]
│   └── emit_d1_sql.py                                  [NEW: Parquet → D1 seed SQL]
└── tests/runtime/                                      [NEW test dir]
    ├── __init__.py
    ├── test_schema.py
    ├── test_store.py
    ├── test_emit_parquet.py
    └── test_emit_d1_sql.py

packages/governors/
├── pyproject.toml                                      [modify: add marisa-trie dep]
├── src/phonolex_governors/generation/trie.py          [rewrite: marisa-trie backend, API preserved]
└── tests/test_vocab_trie.py                            [NEW or extended: API contract tests]

packages/generation/
├── server/word_norms.py                                [rewrite: read from WordStore]
├── server/main.py                                      [modify: load WordStore at startup]
└── scripts/build_runtime_data.py                       [DELETE]

packages/web/workers/scripts/
└── export-to-d1.py                                     [refactor: emit Parquet first, defer SQL]

data/runtime/                                           [NEW LFS-tracked dir]
├── words.parquet                                       [generated artifact, LFS]
├── edges.parquet                                       [generated artifact, LFS]
└── selectional.parquet                                 [empty schema-only artifact, LFS]

scripts/d1-seed.sql                                     [exit LFS as CI build artifact]
.gitattributes                                          [modify LFS tracking]

.github/workflows/
├── deploy.yml                                          [modify: derive d1-seed.sql in CI]
└── deploy-staging.yml                                  [modify: same]

packages/generation/runtime_data/                       [DELETE 4 JSON dumps after migration]

Task 1: Add Polars and marisa-trie dependencies

Files:
- Modify: packages/data/pyproject.toml
- Modify: packages/governors/pyproject.toml

  • [ ] Step 1: Add polars to phonolex_data deps

Edit packages/data/pyproject.toml:

dependencies = [
    "openpyxl>=3.0",
    "polars>=1.0",
]
  • [ ] Step 2: Add marisa-trie to phonolex_governors deps

Edit packages/governors/pyproject.toml. Use the BSD-2 build to avoid LGPL contamination:

dependencies = [
    "phonolex-data",
    "marisa-trie>=1.2",
]

(The marisa-trie PyPI package is an MIT-licensed wrapper; the bundled C++ library is dual-licensed BSD-2-Clause / LGPL-2.1, so it can be used under the BSD-2 terms. No additional flag is needed at install time per current upstream packaging.)

  • [ ] Step 3: Sync workspace

Run: uv pip install -e packages/data -e packages/governors
Expected: both packages resolve and install with the new deps.

  • [ ] Step 4: Verify imports work

Run: python -c "import polars; import marisa_trie; print(polars.__version__, marisa_trie.__version__)"
Expected: prints two version numbers.
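If one of the packages turns out not to expose a `__version__` attribute, the same check can read the installed distribution metadata instead. This is a minimal sketch, assuming the distribution names `polars` and `marisa-trie` from the dependency lists above; `installed_version` is a hypothetical helper, not part of the plan.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str) -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report both new dependencies without importing them.
for name in ("polars", "marisa-trie"):
    print(name, installed_version(name) or "NOT INSTALLED")
```

Because it never imports the packages, this reports a missing dependency by name rather than failing on the first import.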

  • [ ] Step 5: Commit
git add packages/data/pyproject.toml packages/governors/pyproject.toml
git commit -m "PHON-93: add polars + marisa-trie deps for runtime data layer"

Task 2: Schema codegen from PropertyDef

Using the PropertyDef records in packages/web/workers/scripts/config.py, generate a Polars schema dict for words.parquet. Columns that PropertyDef does not cover (word, phonemes, syllables, etc.) are hard-coded.
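The merge rule can be sketched without Polars: core columns are laid down first, and PropertyDef-derived columns fill in around them without overriding a core definition. In this sketch `FakePropertyDef` is a hypothetical stand-in for the real PropertyDef (only `is_integer` is modeled), and dtypes are plain strings rather than Polars types.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FakePropertyDef:
    """Hypothetical stand-in for config.PropertyDef; only is_integer is modeled."""
    id: str
    is_integer: bool = False


def build_schema(core: dict[str, str], props: dict[str, FakePropertyDef]) -> dict[str, str]:
    """Core columns first; property columns are added unless core already defines them."""
    schema = dict(core)
    for prop_id, prop in props.items():
        if prop_id in schema:
            continue  # the hard-coded core definition wins
        schema[prop_id] = "Int32" if prop.is_integer else "Float32"
    return schema


core = {"word": "Utf8", "phoneme_count": "Int32", "is_monomorphemic": "Boolean"}
props = {
    "aoa": FakePropertyDef("aoa"),
    "phoneme_count": FakePropertyDef("phoneme_count", is_integer=True),
    "is_monomorphemic": FakePropertyDef("is_monomorphemic"),
}
schema = build_schema(core, props)
print(schema)
```

The point of the precedence is that a property such as is_monomorphemic keeps its core Boolean type instead of being widened to Float32.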

Files:
- Create: packages/data/src/phonolex_data/runtime/__init__.py
- Create: packages/data/src/phonolex_data/runtime/schema.py
- Create: packages/data/tests/runtime/__init__.py
- Create: packages/data/tests/runtime/test_schema.py

  • [ ] Step 1: Write failing test for schema codegen

Create packages/data/tests/runtime/test_schema.py:

import polars as pl
from phonolex_data.runtime.schema import (
    words_schema,
    edges_schema,
    selectional_schema,
)


def test_words_schema_includes_word_column():
    schema = words_schema()
    assert "word" in schema
    assert schema["word"] == pl.Utf8


def test_words_schema_has_phonology_column():
    schema = words_schema()
    assert "has_phonology" in schema
    assert schema["has_phonology"] == pl.Boolean


def test_words_schema_aoa_is_float():
    schema = words_schema()
    assert "aoa" in schema
    assert schema["aoa"] == pl.Float32


def test_words_schema_phoneme_count_is_integer():
    schema = words_schema()
    assert schema["phoneme_count"] == pl.Int32


def test_words_schema_phonemes_is_list_of_str():
    schema = words_schema()
    assert schema["phonemes"] == pl.List(pl.Utf8)


def test_edges_schema_includes_source_target():
    schema = edges_schema()
    assert "source" in schema
    assert "target" in schema
    assert schema["source"] == pl.Utf8
    assert schema["target"] == pl.Utf8


def test_selectional_schema_has_ppmi_column():
    schema = selectional_schema()
    assert "ppmi" in schema
    assert schema["ppmi"] == pl.Float32
    assert schema["verb"] == pl.Utf8
    assert schema["role"] == pl.Utf8
    assert schema["filler"] == pl.Utf8
    assert schema["count_v_r_f"] == pl.UInt32
    assert schema["count_v_r_star"] == pl.UInt32
  • [ ] Step 2: Run test to verify it fails

Run: uv run python -m pytest packages/data/tests/runtime/test_schema.py -v
Expected: ImportError for phonolex_data.runtime.schema.

  • [ ] Step 3: Create runtime package init

Create packages/data/src/phonolex_data/runtime/__init__.py:

"""Runtime word-data layer.

Polars-backed columnar store for PhonoLex word records. Parquet on disk,
DataFrame in memory. Replaces the legacy 4-JSON-dump runtime contract.
"""

from phonolex_data.runtime.schema import (
    words_schema,
    edges_schema,
    selectional_schema,
)

# WordStore, emit_parquet, and emit_d1_sql do not exist yet. Importing them here
# would make every `phonolex_data.runtime.*` import fail, including the schema
# tests in Step 5. Add each re-export in the task that creates its module
# (Tasks 3, 5, and 6).
__all__ = [
    "words_schema",
    "edges_schema",
    "selectional_schema",
]
  • [ ] Step 4: Implement schema module

Create packages/data/src/phonolex_data/runtime/schema.py:

"""Codegen Polars schemas from `PropertyDef` records.

The Workers config (`packages/web/workers/scripts/config.py`) defines
`PropertyDef` records as the single source of property-schema truth.
This module derives Polars schemas from those records.
"""

from __future__ import annotations

from pathlib import Path
import sys
from typing import Mapping

import polars as pl

# Ensure config.py is importable. It lives in workers/scripts; add its parent.
_WORKERS_SCRIPTS = Path(__file__).resolve().parents[5] / "packages/web/workers/scripts"
if str(_WORKERS_SCRIPTS) not in sys.path:
    sys.path.insert(0, str(_WORKERS_SCRIPTS))

from config import PROPERTY_MAP, PropertyDef  # noqa: E402


def _propertydef_to_polars_dtype(p: PropertyDef) -> pl.DataType:
    """Map a PropertyDef to a Polars dtype.

    Integer properties (per PropertyDef.is_integer) → Int32.
    All other numeric norms → Float32 (matches scale precision needed).
    """
    if p.is_integer:
        return pl.Int32
    return pl.Float32


# Hard-coded core columns not covered by PropertyDef. These are the
# WordRecord identity + phonological + structural fields.
_CORE_WORDS_COLUMNS: dict[str, pl.DataType] = {
    "word": pl.Utf8,
    "has_phonology": pl.Boolean,
    "ipa": pl.Utf8,
    "phonemes": pl.List(pl.Utf8),
    "phonemes_str": pl.Utf8,           # pipe-delimited form for D1 LIKE-pattern queries
    "phoneme_count": pl.Int32,
    "syllables": pl.List(pl.Struct({"phonemes": pl.List(pl.Utf8), "stress": pl.Int32})),
    "syllable_count": pl.Int32,
    "initial_phoneme": pl.Utf8,
    "final_phoneme": pl.Utf8,
    "wcm_score": pl.Int32,
    "root": pl.Utf8,
    # Placeholder: an empty Struct declares no fields (it does not accept arbitrary
    # keys), so list the real variant fields here before emitting.
    "variants": pl.List(pl.Struct({})),
    "vocab_memberships": pl.List(pl.Utf8),
    "is_monomorphemic": pl.Boolean,    # MorphyNet-derived bool
}


def words_schema() -> Mapping[str, pl.DataType]:
    """Schema for words.parquet — core columns + all PropertyDef-defined norms."""
    schema: dict[str, pl.DataType] = dict(_CORE_WORDS_COLUMNS)
    for prop_id, prop_def in PROPERTY_MAP.items():
        if prop_id in schema:  # is_monomorphemic, etc. — already in core
            continue
        schema[prop_id] = _propertydef_to_polars_dtype(prop_def)
    return schema


def edges_schema() -> Mapping[str, pl.DataType]:
    """Schema for edges.parquet — association graph edges."""
    return {
        "source": pl.Utf8,
        "target": pl.Utf8,
        "edge_sources": pl.List(pl.Utf8),
        "usf_forward": pl.Float32,
        "usf_backward": pl.Float32,
        "men_relatedness": pl.Float32,
        "simlex_similarity": pl.Float32,
        "simlex_pos": pl.Utf8,
        "wordsim_relatedness": pl.Float32,
        "spp_first_priming": pl.Float32,
        "spp_other_priming": pl.Float32,
        "spp_fas": pl.Float32,
        "spp_lsa": pl.Float32,
        "eccc_consistency": pl.Float32,
        "eccc_n_instances": pl.Int32,
        "eccc_phoneme_distance": pl.Float32,
    }


def selectional_schema() -> Mapping[str, pl.DataType]:
    """Schema for selectional.parquet — per-(verb, role, filler) PPMI.

    Schema only; population is sibling ticket B (corpus DEP reannotation).
    """
    return {
        "verb": pl.Utf8,
        "role": pl.Utf8,
        "filler": pl.Utf8,
        "count_v_r_f": pl.UInt32,
        "count_v_r_star": pl.UInt32,
        "ppmi": pl.Float32,
    }
  • [ ] Step 5: Run tests to verify they pass

Run: uv run python -m pytest packages/data/tests/runtime/test_schema.py -v
Expected: 7 PASSED.
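If the run fails with a ModuleNotFoundError for config instead, the parents[5] hop count in schema.py is the first thing to check. The sketch below walks the same path arithmetic with a hypothetical checkout at /repo; only the relative layout is taken from the plan. It also assumes an in-repo (editable) install, since from an installed wheel the workers scripts directory would not sit five levels above schema.py.

```python
from pathlib import PurePosixPath

# Hypothetical checkout location; only the relative layout matters.
schema_py = PurePosixPath("/repo/packages/data/src/phonolex_data/runtime/schema.py")

# parents[0] is the containing directory, so parents[5] is five levels above it.
repo_root = schema_py.parents[5]
workers_scripts = repo_root / "packages/web/workers/scripts"

for depth, parent in enumerate(schema_py.parents):
    print(depth, parent)
print("config.py is expected under:", workers_scripts)
```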

  • [ ] Step 6: Commit
git add packages/data/src/phonolex_data/runtime/__init__.py \
        packages/data/src/phonolex_data/runtime/schema.py \
        packages/data/tests/runtime/__init__.py \
        packages/data/tests/runtime/test_schema.py
git commit -m "PHON-93: schema codegen from PropertyDef → Polars schemas"

Task 3: WordStore class — load + get + df

WordStore wraps a Polars DataFrame and provides O(1) get via a word→row_idx dict built at load time. The subset, prefix, iterate_typed, and is_admitted queries follow in Task 4.
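The lookup idea is independent of Polars and can be sketched with plain lists: build the word→row index once, then serve each get with a single dict probe followed by positional reads from every column. `TinyColumnStore` is a hypothetical stand-in used only for illustration.

```python
from __future__ import annotations

from typing import Any


class TinyColumnStore:
    """Hypothetical stdlib stand-in for WordStore: a dict of equal-length columns."""

    def __init__(self, columns: dict[str, list[Any]]):
        self._columns = columns
        # Built once at load time; each lookup afterwards is one dict probe.
        self._word_to_idx = {w: i for i, w in enumerate(columns["word"])}

    def get(self, word: str) -> dict[str, Any] | None:
        idx = self._word_to_idx.get(word)
        if idx is None:
            return None
        # Assemble the row by reading position idx from every column.
        return {name: values[idx] for name, values in self._columns.items()}


store = TinyColumnStore({
    "word": ["cat", "dog", "fish"],
    "aoa": [3.0, 3.5, 4.0],
})
print(store.get("dog"))
print(store.get("zebra"))
```

The trade-off is one extra dict the size of the vocabulary in exchange for avoiding a column scan on every lookup.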

Files:
- Create: packages/data/src/phonolex_data/runtime/store.py
- Create: packages/data/tests/runtime/test_store.py

  • [ ] Step 1: Write failing tests for WordStore basics

Create packages/data/tests/runtime/test_store.py:

from pathlib import Path

import polars as pl
import pytest

from phonolex_data.runtime.store import WordStore


@pytest.fixture
def tiny_df() -> pl.DataFrame:
    return pl.DataFrame({
        "word": ["cat", "dog", "fish"],
        "pos": ["NOUN", "NOUN", "NOUN"],
        "aoa": [3.0, 3.5, 4.0],
        "frequency": [100.0, 80.0, 30.0],
    })


@pytest.fixture
def store(tiny_df: pl.DataFrame) -> WordStore:
    return WordStore(tiny_df)


def test_store_exposes_dataframe(store: WordStore, tiny_df: pl.DataFrame):
    assert store.df.equals(tiny_df)


def test_store_word_count(store: WordStore):
    assert store.word_count == 3


def test_get_returns_row_dict(store: WordStore):
    row = store.get("cat")
    assert row is not None
    assert row["word"] == "cat"
    assert row["aoa"] == 3.0


def test_get_returns_none_for_missing_word(store: WordStore):
    assert store.get("zebra") is None


def test_from_parquet_roundtrip(tmp_path: Path, tiny_df: pl.DataFrame):
    p = tmp_path / "tiny.parquet"
    tiny_df.write_parquet(p)
    store = WordStore.from_parquet(p)
    assert store.word_count == 3
    assert store.get("dog")["frequency"] == 80.0


def test_from_parquet_missing_file_hard_fails(tmp_path: Path):
    with pytest.raises(FileNotFoundError):
        WordStore.from_parquet(tmp_path / "missing.parquet")
  • [ ] Step 2: Run tests to verify they fail

Run: uv run python -m pytest packages/data/tests/runtime/test_store.py -v
Expected: ImportError for WordStore.

  • [ ] Step 3: Implement WordStore basics

Create packages/data/src/phonolex_data/runtime/store.py:

"""Polars-backed word-data store.

Replaces 4 JSON dumps (norms_dump, vocab_dump, phoneme_rates, assoc_graph)
with a single in-memory DataFrame + word→row_idx hash for O(1) get.
"""

from __future__ import annotations

from pathlib import Path
from typing import Any

import polars as pl


class WordStore:
    """Polars-backed runtime word-data layer.

    The DataFrame holds all `PropertyDef`-driven columns plus core
    phonological/structural fields. `get(word)` is O(1) via a precomputed
    word→row_idx hash; `subset(expr)` is a Polars filter; `prefix(str)` and
    `iterate_typed(expr)` materialize lists for downstream consumers.
    """

    def __init__(self, df: pl.DataFrame):
        self._df = df
        self._word_to_idx: dict[str, int] = {
            w: i for i, w in enumerate(df.get_column("word").to_list())
        }

    @classmethod
    def from_parquet(cls, path: Path) -> "WordStore":
        if not path.exists():
            raise FileNotFoundError(
                f"WordStore Parquet artifact not found at {path}. "
                f"Run `phonolex-data emit-parquet` to generate."
            )
        return cls(pl.read_parquet(path))

    @property
    def df(self) -> pl.DataFrame:
        return self._df

    @property
    def word_count(self) -> int:
        return self._df.height

    def get(self, word: str) -> dict[str, Any] | None:
        idx = self._word_to_idx.get(word)
        if idx is None:
            return None
        return self._df.row(idx, named=True)
  • [ ] Step 4: Run tests to verify they pass

Run: uv run python -m pytest packages/data/tests/runtime/test_store.py -v
Expected: 6 PASSED.
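One property of the index is worth knowing: the dict comprehension keeps the last row index for a repeated word, so get would silently hide earlier duplicates. The plan does not state whether words.parquet guarantees unique words. If it should, a guard along these lines could run at load time; `duplicate_words` is a hypothetical helper.

```python
from collections import Counter


def duplicate_words(words: list[str]) -> list[str]:
    """Return the words that occur more than once, sorted."""
    return sorted(w for w, n in Counter(words).items() if n > 1)


words = ["cat", "dog", "cat"]
index = {w: i for i, w in enumerate(words)}  # same construction as WordStore
print(index["cat"])            # the later row replaces the earlier one
print(duplicate_words(words))
```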

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/runtime/store.py \
        packages/data/tests/runtime/test_store.py
git commit -m "PHON-93: WordStore basic load + get + df"

Task 4: WordStore queries — subset, prefix, iterate_typed, is_admitted

Files:
- Modify: packages/data/src/phonolex_data/runtime/store.py
- Modify: packages/data/tests/runtime/test_store.py

  • [ ] Step 1: Append query tests

Append to packages/data/tests/runtime/test_store.py:

def test_subset_filters_with_expression(store: WordStore):
    result = store.subset(pl.col("aoa") < 4.0)
    assert result.height == 2
    assert set(result.get_column("word").to_list()) == {"cat", "dog"}


def test_subset_returns_dataframe_for_chaining(store: WordStore):
    result = store.subset(pl.col("frequency") > 50.0)
    assert isinstance(result, pl.DataFrame)


def test_prefix_returns_words_starting_with(store: WordStore):
    assert sorted(store.prefix("c")) == ["cat"]
    assert sorted(store.prefix("d")) == ["dog"]
    assert sorted(store.prefix("xyz")) == []


def test_prefix_empty_string_returns_all(store: WordStore):
    assert sorted(store.prefix("")) == ["cat", "dog", "fish"]


def test_iterate_typed_returns_word_list(store: WordStore):
    words = store.iterate_typed(pl.col("frequency") > 50.0)
    assert sorted(words) == ["cat", "dog"]
    assert isinstance(words, list)


def test_is_admitted_true_for_matching_word(store: WordStore):
    assert store.is_admitted("cat", pl.col("aoa") < 4.0) is True


def test_is_admitted_false_for_non_matching_word(store: WordStore):
    assert store.is_admitted("fish", pl.col("aoa") < 4.0) is False


def test_is_admitted_false_for_missing_word(store: WordStore):
    assert store.is_admitted("zebra", pl.col("aoa") < 4.0) is False
  • [ ] Step 2: Run tests to verify they fail

Run: uv run python -m pytest packages/data/tests/runtime/test_store.py -v
Expected: AttributeError for subset, prefix, iterate_typed, is_admitted.

  • [ ] Step 3: Implement query methods

Append to packages/data/src/phonolex_data/runtime/store.py inside the WordStore class:

    def subset(self, expr: pl.Expr) -> pl.DataFrame:
        """Filter the store by a Polars expression. Returns a DataFrame for chaining."""
        return self._df.filter(expr)

    def prefix(self, prefix: str) -> list[str]:
        """All words starting with the given prefix. Empty string returns all words."""
        if prefix == "":
            return self._df.get_column("word").to_list()
        return (
            self._df
            .filter(pl.col("word").str.starts_with(prefix))
            .get_column("word")
            .to_list()
        )

    def iterate_typed(self, expr: pl.Expr) -> list[str]:
        """Filter and materialize as a word list. Used by CFG enumerator slot fills."""
        return self._df.filter(expr).get_column("word").to_list()

    def is_admitted(self, word: str, expr: pl.Expr) -> bool:
        """Check whether a specific word satisfies the expression."""
        idx = self._word_to_idx.get(word)
        if idx is None:
            return False
        return self._df.slice(idx, 1).filter(expr).height > 0
  • [ ] Step 4: Run tests to verify they pass

Run: uv run python -m pytest packages/data/tests/runtime/test_store.py -v
Expected: 14 PASSED.

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/runtime/store.py \
        packages/data/tests/runtime/test_store.py
git commit -m "PHON-93: WordStore queries (subset, prefix, iterate_typed, is_admitted)"

Task 5: emit_parquet — records → words.parquet + edges.parquet

Convert LexicalDatabase from phonolex_data.pipeline into Polars DataFrames and write Parquet artifacts.

Files:
- Create: packages/data/src/phonolex_data/runtime/emit_parquet.py
- Create: packages/data/tests/runtime/test_emit_parquet.py

  • [ ] Step 1: Write failing test for emit_parquet

Create packages/data/tests/runtime/test_emit_parquet.py:

from pathlib import Path

import polars as pl
import pytest

from phonolex_data.pipeline.schema import EdgeRecord, LexicalDatabase, WordRecord
from phonolex_data.runtime.emit_parquet import emit_parquet
from phonolex_data.runtime.store import WordStore


@pytest.fixture
def tiny_db() -> LexicalDatabase:
    return LexicalDatabase(
        words={
            "cat": WordRecord(
                word="cat",
                has_phonology=True,
                phonemes=["k", "æ", "t"],
                phoneme_count=3,
                syllable_count=1,
                aoa=3.0,
                frequency=100.0,
                vocab_memberships={"basic"},
            ),
            "dog": WordRecord(
                word="dog",
                has_phonology=True,
                phonemes=["d", "ɔ", "ɡ"],
                phoneme_count=3,
                syllable_count=1,
                aoa=3.5,
                frequency=80.0,
                vocab_memberships=set(),
            ),
        },
        edges=[
            EdgeRecord(
                source="cat",
                target="dog",
                edge_sources=["USF"],
                usf_forward=0.5,
            ),
        ],
    )


def test_emit_parquet_writes_words_file(tmp_path: Path, tiny_db: LexicalDatabase):
    emit_parquet(tiny_db, tmp_path)
    assert (tmp_path / "words.parquet").exists()


def test_emit_parquet_writes_edges_file(tmp_path: Path, tiny_db: LexicalDatabase):
    emit_parquet(tiny_db, tmp_path)
    assert (tmp_path / "edges.parquet").exists()


def test_emit_parquet_writes_empty_selectional_file(tmp_path: Path, tiny_db: LexicalDatabase):
    emit_parquet(tiny_db, tmp_path)
    sel_path = tmp_path / "selectional.parquet"
    assert sel_path.exists()
    df = pl.read_parquet(sel_path)
    assert df.height == 0
    assert "ppmi" in df.columns


def test_emit_parquet_roundtrip_via_wordstore(tmp_path: Path, tiny_db: LexicalDatabase):
    emit_parquet(tiny_db, tmp_path)
    store = WordStore.from_parquet(tmp_path / "words.parquet")
    assert store.word_count == 2
    cat = store.get("cat")
    assert cat["aoa"] == 3.0
    assert cat["phonemes"] == ["k", "æ", "t"]
    assert sorted(cat["vocab_memberships"]) == ["basic"]


def test_emit_parquet_edges_roundtrip(tmp_path: Path, tiny_db: LexicalDatabase):
    emit_parquet(tiny_db, tmp_path)
    edges = pl.read_parquet(tmp_path / "edges.parquet")
    assert edges.height == 1
    row = edges.row(0, named=True)
    assert row["source"] == "cat"
    assert row["target"] == "dog"
    assert row["usf_forward"] == 0.5
  • [ ] Step 2: Run tests to verify they fail

Run: uv run python -m pytest packages/data/tests/runtime/test_emit_parquet.py -v
Expected: ImportError for emit_parquet.

  • [ ] Step 3: Implement emit_parquet

Create packages/data/src/phonolex_data/runtime/emit_parquet.py:

"""Emit Parquet artifacts from a LexicalDatabase.

Generates `words.parquet`, `edges.parquet`, and an empty schema-only
`selectional.parquet`. These are the LFS-tracked canonical artifacts.
"""

from __future__ import annotations

import dataclasses
from pathlib import Path

import polars as pl

from phonolex_data.pipeline.schema import LexicalDatabase
from phonolex_data.runtime.schema import (
    edges_schema,
    selectional_schema,
    words_schema,
)


def _word_record_to_row(record) -> dict:
    """Convert a WordRecord to a dict matching the words_schema."""
    row = {f.name: getattr(record, f.name) for f in dataclasses.fields(record)}
    # Set membership → sorted list (Parquet doesn't have set type)
    if isinstance(row.get("vocab_memberships"), set):
        row["vocab_memberships"] = sorted(row["vocab_memberships"])
    # Pipe-delimited phoneme string for D1 LIKE-pattern queries
    phonemes = row.get("phonemes") or []
    row["phonemes_str"] = ("|" + "|".join(phonemes) + "|") if phonemes else None
    return row


def _edge_record_to_row(record) -> dict:
    return {f.name: getattr(record, f.name) for f in dataclasses.fields(record)}


def emit_parquet(db: LexicalDatabase, output_dir: Path) -> None:
    """Write words.parquet, edges.parquet, and empty selectional.parquet."""
    output_dir.mkdir(parents=True, exist_ok=True)

    # Words
    rows = [_word_record_to_row(r) for r in db.words.values()]
    words_df = pl.DataFrame(rows, schema=words_schema(), strict=False)
    words_df.write_parquet(output_dir / "words.parquet", compression="zstd")

    # Edges
    edge_rows = [_edge_record_to_row(e) for e in db.edges]
    edges_df = pl.DataFrame(edge_rows, schema=edges_schema(), strict=False)
    edges_df.write_parquet(output_dir / "edges.parquet", compression="zstd")

    # Selectional — schema only, populated by ticket B
    sel_df = pl.DataFrame(schema=selectional_schema())
    sel_df.write_parquet(output_dir / "selectional.parquet", compression="zstd")
  • [ ] Step 4: Run tests to verify they pass

Run: uv run python -m pytest packages/data/tests/runtime/test_emit_parquet.py -v
Expected: 5 PASSED.
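The leading and trailing pipes in phonemes_str are what make LIKE-pattern queries match whole phonemes rather than substrings. The sketch below shows the difference using stdlib sqlite3 as a stand-in for D1; the two-column table and the example words are assumptions made for illustration.

```python
from __future__ import annotations

import sqlite3


def to_phonemes_str(phonemes: list[str]) -> str | None:
    """Pipe-delimited form with leading and trailing pipes, e.g. |k|æ|t|."""
    return ("|" + "|".join(phonemes) + "|") if phonemes else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, phonemes_str TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [
        ("cat", to_phonemes_str(["k", "æ", "t"])),
        ("chin", to_phonemes_str(["tʃ", "ɪ", "n"])),
    ],
)

# Delimited match: only words containing /t/ as a whole phoneme.
delimited = [r[0] for r in conn.execute(
    "SELECT word FROM words WHERE phonemes_str LIKE ? ORDER BY word", ("%|t|%",)
)]
# Undelimited match: also hits the "t" inside the affricate "tʃ".
loose = [r[0] for r in conn.execute(
    "SELECT word FROM words WHERE phonemes_str LIKE ? ORDER BY word", ("%t%",)
)]
print(delimited, loose)
```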

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/runtime/emit_parquet.py \
        packages/data/tests/runtime/test_emit_parquet.py
git commit -m "PHON-93: emit_parquet writes words/edges/selectional"

Task 6: emit_d1_sql — Parquet → D1 seed SQL with parity to existing

The current export-to-d1.py builds the seed SQL directly from records. We re-route: emit Parquet first, then derive SQL from Parquet. This task implements the Parquet-to-SQL path with a regression test for parity.

Files:
- Create: packages/data/src/phonolex_data/runtime/emit_d1_sql.py
- Create: packages/data/tests/runtime/test_emit_d1_sql.py

  • [ ] Step 1: Capture a baseline section of the current d1-seed.sql

The full file is ~274MB. We test against the DDL section + first N INSERT rows. Run:

head -200 packages/web/workers/scripts/d1-seed.sql > /tmp/d1-seed-head-200.sql

This snapshot is for reference during impl; not committed.
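For a quick structural comparison between the snapshot and newly generated SQL, the CREATE TABLE names can be extracted and diffed. This is a rough sketch: `table_names` is a hypothetical helper, and the two SQL strings are made-up samples, not content from the real d1-seed.sql.

```python
import re

# Matches CREATE TABLE [IF NOT EXISTS] name, with optional quoting around the name.
_CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?[\"`]?(\w+)[\"`]?",
    re.IGNORECASE,
)


def table_names(sql: str) -> list[str]:
    """Table names from CREATE TABLE statements, in order of appearance."""
    return _CREATE_TABLE.findall(sql)


baseline = "CREATE TABLE words (word TEXT PRIMARY KEY);\nCREATE TABLE edges (source TEXT);"
generated = "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY);"

# Tables present in the baseline but absent from the generated output.
missing = [t for t in table_names(baseline) if t not in table_names(generated)]
print(missing)
```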

  • [ ] Step 2: Write failing tests for emit_d1_sql

Create packages/data/tests/runtime/test_emit_d1_sql.py:

from pathlib import Path

import polars as pl
import pytest

from phonolex_data.pipeline.schema import EdgeRecord, LexicalDatabase, WordRecord
from phonolex_data.runtime.emit_parquet import emit_parquet
from phonolex_data.runtime.emit_d1_sql import emit_d1_sql


@pytest.fixture
def parquet_dir(tmp_path: Path) -> Path:
    db = LexicalDatabase(
        words={
            "cat": WordRecord(
                word="cat", has_phonology=True,
                phonemes=["k", "æ", "t"], phoneme_count=3,
                syllable_count=1, aoa=3.0, frequency=100.0,
            ),
        },
        edges=[
            EdgeRecord(source="cat", target="dog", edge_sources=["USF"], usf_forward=0.5),
        ],
    )
    emit_parquet(db, tmp_path)
    return tmp_path


def test_emit_d1_sql_writes_file(tmp_path: Path, parquet_dir: Path):
    out = tmp_path / "out.sql"
    emit_d1_sql(parquet_dir, out)
    assert out.exists()


def test_emit_d1_sql_contains_create_table_words(tmp_path: Path, parquet_dir: Path):
    out = tmp_path / "out.sql"
    emit_d1_sql(parquet_dir, out)
    sql = out.read_text()
    assert "CREATE TABLE words" in sql
    assert "word TEXT PRIMARY KEY" in sql


def test_emit_d1_sql_contains_create_table_word_properties(tmp_path: Path, parquet_dir: Path):
    out = tmp_path / "out.sql"
    emit_d1_sql(parquet_dir, out)
    sql = out.read_text()
    assert "CREATE TABLE word_properties" in sql


def test_emit_d1_sql_contains_create_table_edges(tmp_path: Path, parquet_dir: Path):
    out = tmp_path / "out.sql"
    emit_d1_sql(parquet_dir, out)
    sql = out.read_text()
    assert "CREATE TABLE edges" in sql


def test_emit_d1_sql_inserts_words(tmp_path: Path, parquet_dir: Path):
    out = tmp_path / "out.sql"
    emit_d1_sql(parquet_dir, out)
    sql = out.read_text()
    assert "INSERT INTO words" in sql
    assert "'cat'" in sql


def test_emit_d1_sql_inserts_edges(tmp_path: Path, parquet_dir: Path):
    out = tmp_path / "out.sql"
    emit_d1_sql(parquet_dir, out)
    sql = out.read_text()
    assert "INSERT INTO edges" in sql
    assert "'cat'" in sql
    assert "'dog'" in sql
  • [ ] Step 3: Run tests to verify they fail

Run: uv run python -m pytest packages/data/tests/runtime/test_emit_d1_sql.py -v
Expected: ImportError for emit_d1_sql.

  • [ ] Step 4: Read the current export-to-d1.py for SQL shape reference

Run: head -100 packages/web/workers/scripts/export-to-d1.py

The current script generates DDL + INSERT statements directly from WordRecord instances. We mirror that shape but read from Parquet.

  • [ ] Step 5: Implement emit_d1_sql

Create packages/data/src/phonolex_data/runtime/emit_d1_sql.py:

"""Emit a D1 seed SQL file from Parquet artifacts.

Reads `words.parquet` + `edges.parquet`, generates DDL for the words,
word_properties, and edges tables, and emits INSERT statements in chunks.
The remaining PHON-88 tables (word_percentiles, word_freq_bands) and the
auxiliary tables in the existing seed are not emitted by this v0. Output
matches the shape `wrangler d1 execute --file` ingests.

This replaces the direct records→SQL path in `export-to-d1.py`. Parquet
becomes canonical; SQL is a CI-derived artifact.
"""

from __future__ import annotations

from pathlib import Path
from typing import Iterable

import polars as pl


# DDL for the words table (core columns only). word_properties, _percentiles,
# _freq_bands DDL is generated dynamically from the words.parquet schema.
_WORDS_DDL = """
CREATE TABLE words (
  word TEXT PRIMARY KEY,
  has_phonology INTEGER NOT NULL DEFAULT 1,
  ipa TEXT,
  phonemes TEXT,
  phonemes_str TEXT,
  syllables TEXT,
  phoneme_count INTEGER,
  syllable_count INTEGER,
  initial_phoneme TEXT,
  final_phoneme TEXT,
  root TEXT,
  variants TEXT,
  vocab_memberships TEXT
);
""".strip()

_EDGES_DDL = """
CREATE TABLE edges (
  source TEXT NOT NULL,
  target TEXT NOT NULL,
  edge_sources TEXT NOT NULL,
  usf_forward REAL,
  usf_backward REAL,
  men_relatedness REAL,
  eccc_consistency REAL,
  eccc_n_instances INTEGER,
  eccc_phoneme_distance REAL,
  spp_first_priming REAL,
  spp_other_priming REAL,
  spp_fas REAL,
  spp_lsa REAL,
  simlex_similarity REAL,
  simlex_pos TEXT,
  wordsim_relatedness REAL,
  PRIMARY KEY (source, target)
);
""".strip()


_CORE_WORDS_FIELDS = (
    "word", "has_phonology", "ipa", "phonemes", "phonemes_str", "syllables",
    "phoneme_count", "syllable_count", "initial_phoneme", "final_phoneme",
    "root", "variants", "vocab_memberships",
)


def _sql_escape(value) -> str:
    """SQL-quote a single Python value for INSERT statements."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "1" if value else "0"
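    if isinstance(value, (int, float)):
        # NaN and ±inf have no SQL literal; emit NULL instead of 'nan'/'inf'
        if isinstance(value, float) and (value != value or value in (float("inf"), float("-inf"))):
            return "NULL"
        return repr(value)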
    if isinstance(value, list):
        # JSON-encode lists to TEXT — matches existing export-to-d1.py shape
        import json
        return "'" + json.dumps(value).replace("'", "''") + "'"
    text = str(value)
    return "'" + text.replace("'", "''") + "'"


def _emit_inserts(
    table: str,
    columns: Iterable[str],
    rows: Iterable[tuple],
    chunk_size: int = 200,
) -> Iterable[str]:
    """Yield INSERT statements in chunks."""
    cols = tuple(columns)
    col_list = "(" + ", ".join(cols) + ")"
    chunk: list[str] = []
    for row in rows:
        values = "(" + ", ".join(_sql_escape(v) for v in row) + ")"
        chunk.append(values)
        if len(chunk) >= chunk_size:
            yield f"INSERT INTO {table} {col_list} VALUES\n" + ",\n".join(chunk) + ";"
            chunk = []
    if chunk:
        yield f"INSERT INTO {table} {col_list} VALUES\n" + ",\n".join(chunk) + ";"


def emit_d1_sql(parquet_dir: Path, output_path: Path) -> None:
    """Generate D1 seed SQL from Parquet artifacts."""
    words_df = pl.read_parquet(parquet_dir / "words.parquet")
    edges_df = pl.read_parquet(parquet_dir / "edges.parquet")

    # Partition columns: core (words table) vs property columns (word_properties + percentiles + freq_bands)
    core_cols = [c for c in _CORE_WORDS_FIELDS if c in words_df.columns]
    property_cols = [c for c in words_df.columns if c not in _CORE_WORDS_FIELDS]

    parts: list[str] = []

    # Words table DDL + INSERT
    parts.append(_WORDS_DDL)
    parts.extend(_emit_inserts(
        "words",
        core_cols,
        words_df.select(core_cols).iter_rows(),
    ))

    # Word properties — DDL with all property cols, INSERT
    if property_cols:
        prop_ddl_cols = "\n  ".join(f"{c} REAL," for c in property_cols)
        parts.append(
            f"CREATE TABLE word_properties (\n  word TEXT PRIMARY KEY,\n  {prop_ddl_cols}\n  FOREIGN KEY(word) REFERENCES words(word)\n);"
        )
        parts.extend(_emit_inserts(
            "word_properties",
            ["word"] + property_cols,
            words_df.select(["word"] + property_cols).iter_rows(),
        ))

    # Edges DDL + INSERT
    parts.append(_EDGES_DDL)
    parts.extend(_emit_inserts(
        "edges",
        edges_df.columns,
        edges_df.iter_rows(),
    ))

    output_path.write_text("\n\n".join(parts) + "\n")

Note: This is a v0 emit that handles core words + word_properties + edges. The full PHON-88 4-table split (word_percentiles, word_freq_bands) needs the same partition logic with explicit column-set boundaries; see Task 11 for the full parity work against the existing export-to-d1.py. This task validates the basic emission shape.
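
The chunked emission above can be tried in isolation. The following is a minimal sketch, not the module's code: `sql_literal` and `emit_inserts` are hypothetical stand-ins for `_sql_escape` and `_emit_inserts`, and because only the tail of `_sql_escape` is shown in this plan, the handling of None, booleans, and numbers here is an assumption. The `length` column is also made up for the demo.

```python
import json
from typing import Iterable, Iterator


def sql_literal(value: object) -> str:
    """Render a Python value as a SQLite literal (hypothetical helper)."""
    if value is None:
        return "NULL"  # assumption: missing values are emitted as NULL
    if isinstance(value, bool):
        return "1" if value else "0"  # assumption: booleans stored as integers
    if isinstance(value, (int, float)):
        return repr(value)  # assumption: finite numbers only
    if isinstance(value, (list, dict)):
        value = json.dumps(value)  # nested values are JSON-encoded to TEXT
    # A single quote inside a SQL string literal is escaped by doubling it.
    return "'" + str(value).replace("'", "''") + "'"


def emit_inserts(
    table: str,
    columns: Iterable[str],
    rows: Iterable[tuple],
    chunk_size: int = 200,
) -> Iterator[str]:
    """Yield multi-row INSERT statements, at most chunk_size rows each."""
    header = f"INSERT INTO {table} ({', '.join(columns)}) VALUES\n"
    chunk: list[str] = []
    for row in rows:
        chunk.append("(" + ", ".join(sql_literal(v) for v in row) + ")")
        if len(chunk) >= chunk_size:
            yield header + ",\n".join(chunk) + ";"
            chunk = []
    if chunk:  # flush the final partial chunk
        yield header + ",\n".join(chunk) + ";"


rows = [("cat", 3), ("o'clock", 6), ("dog", None)]
statements = list(emit_inserts("words", ["word", "length"], rows, chunk_size=2))
for statement in statements:
    print(statement)
```

With `chunk_size=2`, three rows produce two statements: a full chunk and a final partial one. The apostrophe in `o'clock` comes out as `'o''clock'`.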

  • [ ] Step 6: Run tests to verify they pass

Run:

uv run python -m pytest packages/data/tests/runtime/test_emit_d1_sql.py -v

Expected: 6 PASSED.

  • [ ] Step 7: Commit
git add packages/data/src/phonolex_data/runtime/emit_d1_sql.py \
        packages/data/tests/runtime/test_emit_d1_sql.py
git commit -m "PHON-93: emit_d1_sql v0 — words/word_properties/edges DDL + INSERT"

Task 7: Generate canonical Parquet artifacts + LFS-track

Files:
- Create: data/runtime/words.parquet (LFS)
- Create: data/runtime/edges.parquet (LFS)
- Create: data/runtime/selectional.parquet (LFS, schema-only)
- Modify: .gitattributes

  • [ ] Step 1: Track Parquet artifacts in LFS

Run:

git lfs track "data/runtime/*.parquet"

Verify .gitattributes now contains the line:

data/runtime/*.parquet filter=lfs diff=lfs merge=lfs -text

  • [ ] Step 2: Generate the canonical Parquet artifacts

Create a one-shot script packages/data/scripts/build_runtime_parquet.py:

"""Build canonical Parquet artifacts from current PhonoLex data."""

from pathlib import Path

from phonolex_data.pipeline.words import build_words
from phonolex_data.pipeline.edges import build_edges
from phonolex_data.pipeline.schema import LexicalDatabase
from phonolex_data.runtime.emit_parquet import emit_parquet


def main():
    print("Building word records…")
    words = build_words()
    print(f"  {len(words)} words")
    print("Building edges…")
    edges = build_edges(words)
    print(f"  {len(edges)} edges")
    db = LexicalDatabase(words=words, edges=edges)
    out = Path(__file__).resolve().parents[3] / "data/runtime"
    print(f"Emitting Parquet to {out}…")
    emit_parquet(db, out)
    print("Done.")


if __name__ == "__main__":
    main()

Run:

mkdir -p data/runtime
uv run python packages/data/scripts/build_runtime_parquet.py
ls -la data/runtime/

Expected: words.parquet (~30-60 MB), edges.parquet (~5-10 MB), selectional.parquet (~1 KB schema-only). Sizes will not match exactly; verify they're in the right ballpark.

  • [ ] Step 3: Verify artifacts roundtrip via WordStore

Run:

uv run python -c "
from pathlib import Path
from phonolex_data.runtime.store import WordStore
import polars as pl

store = WordStore.from_parquet(Path('data/runtime/words.parquet'))
print(f'Words: {store.word_count}')
edges = pl.read_parquet('data/runtime/edges.parquet')
print(f'Edges: {edges.height}')
sel = pl.read_parquet('data/runtime/selectional.parquet')
print(f'Selectional: {sel.height} (expected 0, schema-only)')
"

Expected: ~150K words, ~70K edges, 0 selectional rows.

  • [ ] Step 4: Commit Parquet artifacts (LFS)
git add .gitattributes
git add data/runtime/words.parquet data/runtime/edges.parquet data/runtime/selectional.parquet
git add packages/data/scripts/build_runtime_parquet.py
git commit -m "PHON-93: generate canonical words/edges/selectional Parquet artifacts (LFS)"

Verify LFS pointers:

git lfs ls-files | grep "data/runtime"

Expected: three files listed.


Task 8: Migrate v6 generation server to WordStore

packages/generation/server/word_norms.py currently reads from the 4 JSON dumps. Migrate to read from WordStore (loaded from data/runtime/words.parquet baked into the Docker image).

Files:
- Modify: packages/generation/server/word_norms.py
- Modify: packages/generation/server/main.py (load WordStore at startup)
- Modify: packages/generation/Dockerfile (copy data/runtime/*.parquet into image)

  • [ ] Step 1: Read current word_norms.py to understand consumer surface

Run:

head -100 packages/generation/server/word_norms.py

Note which methods are called externally (search the rest of packages/generation/server/ for imports of word_norms).

  • [ ] Step 2: Write a snapshot test capturing current behavior

Create packages/generation/server/tests/test_word_norms_migration.py:

"""Snapshot test for the word_norms migration to WordStore.

Captures current behavior on a known-good word set so we can verify the
WordStore-backed implementation matches.
"""

import pytest


SAMPLE_WORDS = ["cat", "dog", "tree", "compute", "exhibit"]


@pytest.mark.parametrize("word", SAMPLE_WORDS)
def test_word_lookup_returns_record(word: str):
    """Each sample word returns a non-empty record with norms."""
    from server.word_norms import WordNorms  # whatever the current API is
    norms = WordNorms()
    record = norms.get(word)
    assert record is not None
    # Confirm a few representative fields are present
    assert "frequency" in record or "log_frequency" in record


def test_word_lookup_returns_none_for_unknown():
    from server.word_norms import WordNorms
    norms = WordNorms()
    assert norms.get("zzqxzx") is None

Run:

cd packages/generation && uv run python -m pytest server/tests/test_word_norms_migration.py -v

Expected: PASS against the current 4-JSON-dump implementation. (If the current API differs, adjust the test to match; the goal is a snapshot of present behavior.)

  • [ ] Step 3: Rewrite word_norms.py to use WordStore

Replace packages/generation/server/word_norms.py:

"""Word-level norms loader, backed by WordStore.

Replaces the legacy 4-JSON-dump runtime. WordStore loads from
`data/runtime/words.parquet` baked into the Docker image at build time.
"""

from __future__ import annotations

from pathlib import Path

import polars as pl

from phonolex_data.runtime.store import WordStore

# Resolve runtime data directory. In Docker: /app/data/runtime/.
# In local dev: packages/generation/runtime_data/ for backward-compat
# fallback, or repo-root data/runtime/ if present.
# word_norms.py lives in packages/generation/server/, so two .parent hops
# reach packages/generation and two more reach the repo root.
_GENERATION_DIR = Path(__file__).resolve().parent.parent
_DOCKER_RUNTIME_DIR = Path("/app/data/runtime")
_RUNTIME_DATA_DIR = _GENERATION_DIR / "runtime_data"
_REPO_RUNTIME_DIR = _GENERATION_DIR.parent.parent / "data/runtime"


def _resolve_words_parquet() -> Path:
    candidates = (
        _DOCKER_RUNTIME_DIR / "words.parquet",
        _RUNTIME_DATA_DIR / "words.parquet",
        _REPO_RUNTIME_DIR / "words.parquet",
    )
    for candidate in candidates:
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        "words.parquet not found. Looked in: "
        + ", ".join(str(c.parent) for c in candidates)
        + ". Run `phonolex-data emit-parquet` or rebuild the Docker image."
    )


class WordNorms:
    """Per-word norms accessor. Single global instance per server process."""

    _instance: "WordNorms | None" = None

    def __init__(self):
        self._store = WordStore.from_parquet(_resolve_words_parquet())

    @classmethod
    def instance(cls) -> "WordNorms":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def get(self, word: str) -> dict | None:
        return self._store.get(word)

    @property
    def df(self) -> pl.DataFrame:
        return self._store.df

    @property
    def store(self) -> WordStore:
        return self._store
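
The resolve-then-fail-fast lookup used by `_resolve_words_parquet` can be tried against a throwaway directory. This is a sketch under assumptions: `resolve_first_existing` is a hypothetical helper, and the empty file stands in for a real Parquet artifact.

```python
import tempfile
from pathlib import Path


def resolve_first_existing(candidates: list[Path]) -> Path:
    """Return the first candidate that exists, or raise listing every path tried."""
    for candidate in candidates:
        if candidate.exists():
            return candidate
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"words.parquet not found. Looked in: {tried}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    legacy = root / "runtime_data" / "words.parquet"  # deliberately absent
    canonical = root / "data" / "runtime" / "words.parquet"
    canonical.parent.mkdir(parents=True)
    canonical.write_bytes(b"")  # placeholder only, not a real Parquet file

    # The absent legacy path is skipped and the canonical one is returned.
    found_is_canonical = resolve_first_existing([legacy, canonical]) == canonical

    # With no existing candidate the lookup raises instead of returning None.
    try:
        resolve_first_existing([legacy])
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(found_is_canonical, missing_raises)
```

Raising at resolution time is what lets the startup warm-up surface a missing artifact immediately rather than on the first request.
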
  • [ ] Step 4: Run snapshot test against new impl

Run:

cd packages/generation && uv run python -m pytest server/tests/test_word_norms_migration.py -v

Expected: PASS. The migrated module returns the same records.

  • [ ] Step 5: Update generation server main.py to load WordStore at startup

Find the existing startup hook in packages/generation/server/main.py:

grep -n "startup\|@app.on_event\|lifespan" packages/generation/server/main.py

Add a startup line that warms WordNorms:

# In the lifespan/startup handler:
from server.word_norms import WordNorms
WordNorms.instance()  # eager-load WordStore at startup, fail fast if Parquet missing
  • [ ] Step 6: Update Dockerfile to bake Parquet into image

Edit packages/generation/Dockerfile. Find any line that copies runtime_data/ and replace with copying data/runtime/*.parquet from the repo:

# Copy Parquet artifacts (replaces 4-JSON-dump pattern)
COPY data/runtime/*.parquet /app/data/runtime/
  • [ ] Step 7: Run all generation server tests

Run:

cd packages/generation && uv run python -m pytest server/tests/ -v

Expected: PASS. (If any pre-existing tests reference the JSON dumps directly, update them to use WordNorms.instance() or WordStore directly.)

  • [ ] Step 8: Commit
git add packages/generation/server/word_norms.py \
        packages/generation/server/main.py \
        packages/generation/server/tests/test_word_norms_migration.py \
        packages/generation/Dockerfile
git commit -m "PHON-93: migrate v6 word_norms.py to WordStore (Polars-backed)"

Task 9: Delete build_runtime_data.py and the 4 JSON dumps

Files:
- Delete: packages/generation/scripts/build_runtime_data.py
- Delete: packages/generation/runtime_data/norms_dump.json
- Delete: packages/generation/runtime_data/vocab_dump.json
- Delete: packages/generation/runtime_data/phoneme_rates.json
- Delete: packages/generation/runtime_data/assoc_graph.json

  • [ ] Step 1: Verify no remaining callers of build_runtime_data.py

Run:

git grep -l "build_runtime_data" -- ':!docs/' ':!*.md'

Expected: empty (or only the file itself). If any other file references it, those need updating first.

  • [ ] Step 2: Delete the script
git rm packages/generation/scripts/build_runtime_data.py
  • [ ] Step 3: Delete the JSON dumps
git rm packages/generation/runtime_data/norms_dump.json \
       packages/generation/runtime_data/vocab_dump.json \
       packages/generation/runtime_data/phoneme_rates.json \
       packages/generation/runtime_data/assoc_graph.json

If runtime_data/ is now empty, git rm it too.

  • [ ] Step 4: Verify generation server tests still pass

Run:

cd packages/generation && uv run python -m pytest server/tests/ -v

Expected: PASS.

  • [ ] Step 5: Commit
git commit -m "PHON-93: delete build_runtime_data.py + 4 JSON dumps (replaced by WordStore + Parquet)"
  • [ ] Step 6: Delete PHON-64 spike's lexicon.py if present on this branch

The spec calls for deleting PHON-64's hand-rolled D1-SQLite reader. Check if it's tracked on this branch:

ls packages/generation/research/2026-05-04-phon-64-combinatorial-spike/lexicon.py 2>/dev/null

If the file exists:

git rm packages/generation/research/2026-05-04-phon-64-combinatorial-spike/lexicon.py
git commit -m "PHON-93: delete PHON-64 lexicon.py (replaced by WordStore)"

If it doesn't exist on this branch (it lives only on the spike branch), skip — no-op.


Task 10: Swap VocabTrie to marisa-trie + parallel-dict tag counts

API preserved: tag(), walk_to(), dead_end_ratio(), is_banned_word(), has_prefix(), word_count. Underlying storage swaps from Python dict-trie to marisa-trie + a dict[prefix_str, (total, banned)] for tag counts.
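
The parallel-dict idea can be shown without marisa-trie at all. The sketch below is a plain-Python illustration, assuming the same word list as the contract-test fixture; `prefix_counts` and `dead_end_ratio` are hypothetical free functions, not the `VocabTrie` methods themselves.

```python
def prefix_counts(words: list[str], banned: set[str]) -> dict[str, tuple[int, int]]:
    """Map every prefix (including "") to (total words below, banned words below)."""
    counts: dict[str, list[int]] = {}
    for word in words:
        is_banned = word in banned
        # Every prefix of the word, from "" up to the word itself, gets credited.
        for i in range(len(word) + 1):
            entry = counts.setdefault(word[:i], [0, 0])
            entry[0] += 1
            if is_banned:
                entry[1] += 1
    return {prefix: (total, hits) for prefix, (total, hits) in counts.items()}


def dead_end_ratio(counts: dict[str, tuple[int, int]], prefix: str) -> float:
    """Fraction of words below the prefix that are banned; 1.0 for unknown prefixes."""
    total, hits = counts.get(prefix, (0, 0))
    return 1.0 if total == 0 else hits / total


words = ["cat", "cap", "car", "dog", "dot", "dote"]
counts = prefix_counts(words, banned={"cat"})

print(counts["ca"])                  # (3, 1)
print(counts["dot"])                 # (2, 0)
print(dead_end_ratio(counts, "zz"))  # 1.0
```

The dict holds one entry per distinct prefix, so its size tracks the number of trie nodes rather than the number of words. That is worth keeping in mind when estimating the memory saved by the storage swap.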

Files:
- Modify: packages/governors/src/phonolex_governors/generation/trie.py
- Possibly modify: packages/governors/tests/test_reranker_v2.py (only if it uses internal TrieNode shape; the public API tests should pass unchanged)

  • [ ] Step 1: Inventory existing trie tests

Run:

ls packages/governors/tests/ | grep -i trie
git grep -l "VocabTrie\|TrieNode" packages/governors/

Capture which methods are exercised by existing tests. If a test_vocab_trie.py exists, it's the contract; otherwise the contract lives in test_reranker_v2.py.

  • [ ] Step 2: Write/extend the API contract test

Create or extend packages/governors/tests/test_vocab_trie.py:

"""API contract tests for VocabTrie. The public API must hold across
storage backends (Python dict-trie → marisa-trie + parallel-dict)."""

import pytest

from phonolex_governors.generation.trie import VocabTrie


@pytest.fixture
def trie() -> VocabTrie:
    return VocabTrie(["cat", "cap", "car", "dog", "dot", "dote"])


def test_word_count(trie: VocabTrie):
    assert trie.word_count == 6


def test_has_prefix_true_for_existing(trie: VocabTrie):
    assert trie.has_prefix("ca") is True
    assert trie.has_prefix("d") is True
    assert trie.has_prefix("cat") is True


def test_has_prefix_false_for_nonexistent(trie: VocabTrie):
    assert trie.has_prefix("zz") is False
    assert trie.has_prefix("cax") is False


def test_is_banned_word_default_false(trie: VocabTrie):
    assert trie.is_banned_word("cat") is False


def test_tag_marks_word_as_banned(trie: VocabTrie):
    trie.tag({"cat", "dog"})
    assert trie.is_banned_word("cat") is True
    assert trie.is_banned_word("dog") is True
    assert trie.is_banned_word("cap") is False


def test_dead_end_ratio_zero_when_nothing_banned(trie: VocabTrie):
    trie.tag(set())
    assert trie.dead_end_ratio("ca") == 0.0


def test_dead_end_ratio_one_when_all_banned_below(trie: VocabTrie):
    trie.tag({"cat", "cap", "car"})
    assert trie.dead_end_ratio("ca") == 1.0


def test_dead_end_ratio_partial(trie: VocabTrie):
    trie.tag({"cat"})
    # 1 banned out of 3 ca-prefixed words
    assert trie.dead_end_ratio("ca") == pytest.approx(1.0 / 3.0)


def test_dead_end_ratio_unknown_prefix_returns_one(trie: VocabTrie):
    trie.tag(set())
    assert trie.dead_end_ratio("zz") == 1.0


def test_tag_is_idempotent(trie: VocabTrie):
    trie.tag({"cat"})
    ratio1 = trie.dead_end_ratio("ca")
    trie.tag({"cat"})
    ratio2 = trie.dead_end_ratio("ca")
    assert ratio1 == ratio2


def test_retag_replaces_previous_bans(trie: VocabTrie):
    trie.tag({"cat", "dog"})
    trie.tag({"cap"})
    assert trie.is_banned_word("cat") is False
    assert trie.is_banned_word("cap") is True
  • [ ] Step 3: Run tests against existing implementation

Run:

uv run python -m pytest packages/governors/tests/test_vocab_trie.py -v

Expected: PASS. Establishes the contract baseline.

  • [ ] Step 4: Rewrite trie.py with marisa-trie backend

Replace packages/governors/src/phonolex_governors/generation/trie.py:

"""Vocabulary trie for dead-end detection in constrained generation.

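Backed by marisa-trie (LOUDS-encoded, static, ~50-100x memory reduction
over the previous Python dict-trie for the trie structure itself). The
parallel per-prefix count dict still holds one entry per distinct prefix,
so measure total memory before relying on that figure. API preserved
end-to-end.

Performance (126K words):
    Build: one static marisa-trie build plus an initial tag() pass
    Tag:   O(total chars); walks all words, builds per-prefix counts
    Lookup: sub-µs (dict lookup on per-prefix counts)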
"""

from __future__ import annotations

import marisa_trie


class VocabTrie:
    """Static vocabulary trie with per-request tag counts.

    Build once from the full word list, then call ``tag()`` with each
    constraint set's banned list to update per-prefix counts. The trie
    structure is immutable; only the parallel `_counts` dict changes.
    """

    def __init__(self, words: list[str]):
        words_lower = [w.lower() for w in words]
        # Deduplicate, preserving order, so word_count and per-prefix totals
        # count each word once
        unique_words = list(dict.fromkeys(words_lower))
        self._trie = marisa_trie.Trie(unique_words)
        self._words: list[str] = unique_words
        self._banned: frozenset[str] = frozenset()
        # Per-prefix counts: prefix → (total, banned)
        self._counts: dict[str, tuple[int, int]] = {}
        # Initialize with no bans so dead_end_ratio works before tag()
        self.tag(set())

    @property
    def word_count(self) -> int:
        return len(self._words)

    def tag(self, banned_words: set[str]) -> None:
        """Update per-prefix counts for the given banned set."""
        banned_lower = {w.lower() for w in banned_words}
        self._banned = frozenset(banned_lower)

        # Recompute per-prefix counts
        counts: dict[str, list[int]] = {}
        for word in self._words:
            is_banned = word in banned_lower
            for i in range(len(word) + 1):
                prefix = word[:i]
                if prefix not in counts:
                    counts[prefix] = [0, 0]
                counts[prefix][0] += 1
                if is_banned:
                    counts[prefix][1] += 1
        self._counts = {p: (t, b) for p, (t, b) in counts.items()}

    def has_prefix(self, prefix: str) -> bool:
        prefix = prefix.lower()
        # marisa-trie iterkeys returns at least one match if prefix exists
        try:
            next(iter(self._trie.iterkeys(prefix)))
            return True
        except StopIteration:
            return False

    def is_banned_word(self, word: str) -> bool:
        return word.lower() in self._banned

    def dead_end_ratio(self, prefix: str) -> float:
        prefix = prefix.lower()
        counts = self._counts.get(prefix)
        if counts is None or counts[0] == 0:
            return 1.0
        total, banned = counts
        return banned / total

    def walk_to(self, prefix: str) -> "_PrefixView | None":
        """Return a view object for the prefix, or None if not present."""
        if not self.has_prefix(prefix):
            return None
        return _PrefixView(self, prefix.lower())


class _PrefixView:
    """Backwards-compat shim for callers that expected a TrieNode-shaped object.

    Exposes `is_end`, `total_below`, `banned_below`. Does NOT expose `children`
    — if any caller iterates `children`, that caller needs migration to use
    `VocabTrie.has_prefix()` / `dead_end_ratio()` directly.
    """

    __slots__ = ("_trie", "_prefix")

    def __init__(self, trie: VocabTrie, prefix: str):
        self._trie = trie
        self._prefix = prefix

    @property
    def is_end(self) -> bool:
        # Membership test on the marisa trie; scanning the _words list is O(n)
        return self._prefix in self._trie._trie

    @property
    def total_below(self) -> int:
        counts = self._trie._counts.get(self._prefix)
        return counts[0] if counts else 0

    @property
    def banned_below(self) -> int:
        counts = self._trie._counts.get(self._prefix)
        return counts[1] if counts else 0
  • [ ] Step 5: Run trie contract tests

Run:

uv run python -m pytest packages/governors/tests/test_vocab_trie.py -v

Expected: PASS for all 11 contract tests.

  • [ ] Step 6: Run existing governor tests

Run:

uv run python -m pytest packages/governors/tests/ -v

Expected: PASS for all tests. If any test references TrieNode directly or accesses node.children, update it to use the public API (has_prefix, dead_end_ratio, walk_to).

  • [ ] Step 7: Run generation server tests for cross-package validation

Run:

cd packages/generation && uv run python -m pytest server/tests/ -v

Expected: PASS.

  • [ ] Step 8: Commit
git add packages/governors/src/phonolex_governors/generation/trie.py \
        packages/governors/tests/test_vocab_trie.py
git commit -m "PHON-93: swap VocabTrie to marisa-trie + parallel-dict tag counts"

Task 11: Refactor export-to-d1.py to chain through Parquet

The current export-to-d1.py builds the seed SQL directly from records. New flow: emit Parquet first, then call emit_d1_sql on the Parquet artifacts. The goal is parity with the current d1-seed.sql.

Files:
- Modify: packages/web/workers/scripts/export-to-d1.py
- Modify: packages/data/src/phonolex_data/runtime/emit_d1_sql.py (extend for full PHON-88 4-table split parity)

  • [ ] Step 1: Read current export-to-d1.py to understand its DDL output

Run:

wc -l packages/web/workers/scripts/export-to-d1.py
grep -n "CREATE TABLE\|INSERT INTO" packages/web/workers/scripts/export-to-d1.py | head -20

Capture every table the existing script emits. Per the spec, that's at least: words, word_properties, word_percentiles, word_freq_bands, edges, minimal_pairs, phonemes, phoneme_dots, components, word_syllables.

  • [ ] Step 2: Add parity test against current d1-seed.sql

Create packages/data/tests/runtime/test_d1_parity.py:

"""Parity test: emit_d1_sql output matches current d1-seed.sql shape.

This is a regression guard — the new derivation path must produce a
seed SQL that wrangler ingests identically to the existing one.
"""

import re
from pathlib import Path

import pytest


REPO_ROOT = Path(__file__).resolve().parents[4]
EXISTING_SEED = REPO_ROOT / "packages/web/workers/scripts/d1-seed.sql"


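def _extract_create_tables(sql_text: str) -> set[str]:
    """Extract the set of table names from CREATE TABLE statements."""
    # Tolerate "IF NOT EXISTS" so the capture is the table name, not "IF"
    return set(re.findall(r"CREATE TABLE (?:IF NOT EXISTS )?(\w+)", sql_text))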


@pytest.mark.slow
def test_emit_d1_sql_matches_existing_table_set(tmp_path: Path):
    """The set of tables emitted matches the existing seed."""
    if not EXISTING_SEED.exists():
        pytest.skip("Existing d1-seed.sql not present; cannot run parity test")

    # Generate via the new path
    runtime_dir = REPO_ROOT / "data/runtime"
    if not (runtime_dir / "words.parquet").exists():
        pytest.skip("data/runtime/words.parquet not present; run build_runtime_parquet.py first")

    from phonolex_data.runtime.emit_d1_sql import emit_d1_sql
    out = tmp_path / "new-seed.sql"
    emit_d1_sql(runtime_dir, out)

    existing_tables = _extract_create_tables(EXISTING_SEED.read_text())
    new_tables = _extract_create_tables(out.read_text())

    # All existing tables must be present in the new emission
    missing = existing_tables - new_tables
    assert not missing, f"New emission missing tables: {missing}"

Run:

uv run python -m pytest packages/data/tests/runtime/test_d1_parity.py -v

Expected: FAIL (most tables not yet emitted by emit_d1_sql v0).
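
The table-set comparison at the heart of the parity test can be sketched on inline SQL. The snippet below is illustrative only: the two SQL strings and their column names are invented, and `table_names` is a hypothetical helper that also tolerates `IF NOT EXISTS`, lowercase keywords, and quoted names, which the real seed may or may not use.

```python
import re

# Optional IF NOT EXISTS and an optional opening quote or backtick before the name.
_CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?[\"`]?(\w+)",
    re.IGNORECASE,
)


def table_names(sql_text: str) -> set[str]:
    """Return the set of table names declared by CREATE TABLE statements."""
    return set(_CREATE_TABLE.findall(sql_text))


existing_sql = """
CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY);
CREATE TABLE edges (source TEXT, target TEXT);
CREATE TABLE minimal_pairs (a TEXT, b TEXT);
"""

new_sql = """
CREATE TABLE words (word TEXT PRIMARY KEY);
create table edges (source TEXT, target TEXT);
"""

# Tables the existing seed declares that the new emission lacks.
missing = table_names(existing_sql) - table_names(new_sql)
print(sorted(missing))  # ['minimal_pairs']
```

A set difference in this direction only guards against dropped tables; extra tables in the new emission pass silently, which matches the assertion in the test above.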

  • [ ] Step 3: Extend emit_d1_sql to cover the full 4-table split + auxiliary tables

The v0 implementation in Task 6 covers words, word_properties, edges. Extend it with:
- word_percentiles (per-property percentile columns)
- word_freq_bands (PHON-88 freq-band granular columns)
- minimal_pairs (existing)
- phonemes (phoneme features)
- phoneme_dots (similarity dots)
- components (syllable components)
- word_syllables (per-word syllable index)

For minimal_pairs, phonemes, phoneme_dots, components, word_syllables — these come from the derived field of LexicalDatabase. They need their own Parquet artifacts OR they live in a separate emit step.

Decision: keep derived data (minimal_pairs, phonemes, phoneme_dots, components, word_syllables) on the records→SQL path for now. Only words.parquet and edges.parquet are Parquet-derived. The full unification of every table to Parquet is a follow-up — it's more scope than this ticket should absorb.

Restructure export-to-d1.py:

# packages/web/workers/scripts/export-to-d1.py

from pathlib import Path

from phonolex_data.runtime.emit_parquet import emit_parquet
from phonolex_data.runtime.emit_d1_sql import emit_d1_sql
from phonolex_data.pipeline.words import build_words
from phonolex_data.pipeline.edges import build_edges
from phonolex_data.pipeline.derived import build_derived
from phonolex_data.pipeline.schema import LexicalDatabase

# ... existing imports for derived-table SQL emission ...


def main():
    # 1. Build records (unchanged)
    words = build_words()
    edges = build_edges(words)
    derived = build_derived(words)
    db = LexicalDatabase(words=words, edges=edges, derived=derived)

    # 2. Emit Parquet artifacts (new canonical path)
    runtime_dir = Path(__file__).resolve().parents[4] / "data/runtime"
    emit_parquet(db, runtime_dir)

    # 3. Emit SQL: words/word_properties/edges from Parquet, rest from records
    seed_path = Path(__file__).resolve().parent / "d1-seed.sql"
    emit_d1_sql(runtime_dir, seed_path)

    # 4. Append derived-table SQL (existing logic) to seed_path
    _append_derived_sql(db.derived, seed_path)
    # ... existing minimal_pairs, phonemes, phoneme_dots, components, word_syllables ...


def _append_derived_sql(derived, seed_path):
    # Lift the existing derived-table emission code from the original script.
    # No new logic — just relocated.
    ...
  • [ ] Step 4: Run the full export pipeline

Run:

cd packages/web/workers
uv run python scripts/export-to-d1.py

Expected: produces scripts/d1-seed.sql with all tables. Size should be in the same range as the existing seed (~270 MB).

  • [ ] Step 5: Run parity test

Run:

uv run python -m pytest packages/data/tests/runtime/test_d1_parity.py -v

Expected: PASS, with all existing tables present in the new emission.

  • [ ] Step 6: Verify wrangler ingests the new seed

Run:

cd packages/web/workers
npx wrangler d1 execute phonolex --local --file scripts/d1-seed.sql

Expected: succeeds. (If errors, the SQL has a syntax issue — fix and re-run.)

  • [ ] Step 7: Run Workers tests

Run:

cd packages/web/workers && npm test

Expected: PASS.

  • [ ] Step 8: Commit
git add packages/web/workers/scripts/export-to-d1.py \
        packages/data/src/phonolex_data/runtime/emit_d1_sql.py \
        packages/data/tests/runtime/test_d1_parity.py
git commit -m "PHON-93: refactor export-to-d1 to chain through Parquet for words/edges"

Task 12: Update CI workflows

Trigger Parquet rebuild on data file changes; derive d1-seed.sql in CI.

Files:
- Modify: .github/workflows/deploy.yml
- Modify: .github/workflows/deploy-staging.yml

  • [ ] Step 1: Read current deploy.yml to find data-pipeline trigger

Run: cat .github/workflows/deploy.yml

Identify the existing dorny/paths-filter block that triggers data work.

  • [ ] Step 2: Extend the paths-filter to include the runtime module

In .github/workflows/deploy.yml, modify the relevant filter:

- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      data:
        - 'data/norms/**'
        - 'data/cmu/**'
        - 'data/mappings/**'
        - 'packages/web/workers/scripts/config.py'
        - 'packages/data/src/phonolex_data/runtime/**'
  • [ ] Step 3: Add a step that rebuilds Parquet + seed SQL

After data-pipeline steps that build records, add:

- name: Rebuild runtime Parquet
  if: steps.changes.outputs.data == 'true'
  run: |
    uv run python packages/data/scripts/build_runtime_parquet.py

- name: Derive d1-seed.sql from Parquet
  if: steps.changes.outputs.data == 'true'
  run: |
    cd packages/web/workers
    uv run python scripts/export-to-d1.py
  • [ ] Step 4: Apply the same updates to deploy-staging.yml

Mirror the same changes in .github/workflows/deploy-staging.yml.

  • [ ] Step 5: Verify YAML syntax

Run:

python -c "import yaml; yaml.safe_load(open('.github/workflows/deploy.yml'))"
python -c "import yaml; yaml.safe_load(open('.github/workflows/deploy-staging.yml'))"

Expected: no errors.

  • [ ] Step 6: Push to a PR-style branch and observe staging CI
git push origin feature/phon-93-trie-centric-rebuild

Open a draft PR against release/v5.2.0. Verify the CI runs and the new Parquet-rebuild step executes correctly.

  • [ ] Step 7: Commit
git add .github/workflows/deploy.yml .github/workflows/deploy-staging.yml
git commit -m "PHON-93: CI workflow — rebuild Parquet + derive d1-seed.sql on data changes"

Task 13: Backup d1-seed.sql + LFS untrack

Last task — only execute after staging has run successfully on the new CI flow with Parquet-derived seed.

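Files:
- Modify: .gitattributes (remove d1-seed.sql LFS tracking)
- Modify: .gitignore (ignore the CI-derived d1-seed.sql)
- Modify: docs/superpowers/specs/2026-05-05-phon-93-runtime-word-data-layer-design.md (backup-location note)
- Create (local only, not committed): /tmp/d1-seed.sql.preflight-backup-$(date +%Y-%m-%d).gz (snapshot)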

  • [ ] Step 1: Verify staging is green on Parquet-derived seed

Confirm the most recent staging deploy used the new CI flow (Parquet rebuild → SQL derivation → wrangler ingest) and the staging Workers API answers correctly.

curl https://staging-api.phonolex.com/api/property-metadata | jq '.categories | length'

Expected: returns the right number of categories. Spot-check a few endpoints.

  • [ ] Step 2: Snapshot the current d1-seed.sql outside LFS
gzip -c packages/web/workers/scripts/d1-seed.sql > /tmp/d1-seed.sql.preflight-backup-$(date +%Y-%m-%d).gz
ls -la /tmp/d1-seed.sql.preflight-backup-*.gz

Expected: ~30-40 MB compressed snapshot.
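
Before untracking, it may be worth confirming that the snapshot decompresses to exactly the bytes of the original seed. The following is a minimal standard-library sketch; the file contents are throwaway demo data and `sha256_of` is a hypothetical helper, not part of the plan's tooling.

```python
import gzip
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, *, compressed: bool = False) -> str:
    """Stream a file (through gzip if compressed) and return its SHA-256 hex digest."""
    opener = gzip.open if compressed else open
    digest = hashlib.sha256()
    with opener(path, "rb") as fh:
        # Read in 1 MiB blocks so a large seed never sits fully in memory.
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    seed = Path(tmp) / "d1-seed.sql"
    seed.write_text("CREATE TABLE words (word TEXT PRIMARY KEY);\n", encoding="utf-8")

    # Equivalent of `gzip -c seed > backup`.
    backup = Path(tmp) / "d1-seed.sql.preflight-backup.gz"
    with open(seed, "rb") as src, gzip.open(backup, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # The backup is sound if its decompressed bytes hash to the same digest.
    backup_matches = sha256_of(seed) == sha256_of(backup, compressed=True)

print(backup_matches)
```

Pointing the two paths at the real seed and the `/tmp` snapshot would give the same check for the actual backup.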

Record the backup location by appending a note to the spec (do NOT commit the backup file itself; it is too big):

echo "Preflight backup of d1-seed.sql before LFS untrack:
  /tmp/d1-seed.sql.preflight-backup-$(date +%Y-%m-%d).gz" \
  >> docs/superpowers/specs/2026-05-05-phon-93-runtime-word-data-layer-design.md

(Optional: copy to a longer-term location like ~/Backups/. Local responsibility.)

  • [ ] Step 3: Remove d1-seed.sql LFS tracking

Edit .gitattributes. Remove or comment out the line:

packages/web/workers/scripts/d1-seed.sql filter=lfs diff=lfs merge=lfs -text

(If the seed wasn't tracked at the file level but via a *.sql pattern, scope the change carefully.)

  • [ ] Step 4: Migrate the file out of LFS storage
git lfs untrack 'packages/web/workers/scripts/d1-seed.sql'
git rm --cached packages/web/workers/scripts/d1-seed.sql
echo "packages/web/workers/scripts/d1-seed.sql" >> .gitignore

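Now d1-seed.sql is local-only (CI-derived). Its absence from the repo is OK because the Task 12 CI flow derives it from Parquet. Note that those CI steps are gated on steps.changes.outputs.data == 'true', so the seed is regenerated only when the data filter fires, not on every deploy.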

  • [ ] Step 5: Verify locally that re-running the export pipeline regenerates the seed
cd packages/web/workers
rm -f scripts/d1-seed.sql
uv run python scripts/export-to-d1.py
ls -la scripts/d1-seed.sql

Expected: file regenerated. Wrangler ingest still works.

  • [ ] Step 6: Commit
git add .gitattributes .gitignore \
        docs/superpowers/specs/2026-05-05-phon-93-runtime-word-data-layer-design.md
git commit -m "PHON-93: untrack d1-seed.sql from LFS — derived in CI from Parquet"

Note: This commit removes the LFS pointer; previous LFS history remains. A later cleanup ticket can do git lfs prune and history rewrite if storage cost matters. Per spec §R4, defer the history rewrite until a release cycle has passed cleanly.


Task 14: Final verification + Jira hygiene

  • [ ] Step 1: Run all tests across the repo
uv run python -m pytest packages/data/tests -v
uv run python -m pytest packages/governors/tests -v
cd packages/generation && uv run python -m pytest server/tests/ -v
cd ../web/workers && npm test

Expected: all tests pass.

  • [ ] Step 2: Verify nothing references the deleted JSON dumps or build_runtime_data.py
git grep -l "norms_dump\|vocab_dump\|phoneme_rates\|assoc_graph\|build_runtime_data" -- ':!docs/' ':!*.md'

Expected: empty. (docs/ and *.md are excluded by the pathspec, so check those separately if they should be updated.)

  • [ ] Step 3: Verify Parquet artifacts are in LFS
git lfs ls-files | grep "data/runtime"

Expected: 3 files listed (words, edges, selectional).

  • [ ] Step 4: Verify d1-seed.sql is no longer in LFS
git lfs ls-files | grep "d1-seed.sql"

Expected: empty.

  • [ ] Step 5: Query Jira (JQL) for the next free PHON-XX before opening sibling tickets

Per feedback_verify_jira_state.md, before reserving ticket numbers for sibling B (corpus DEP reannotation) and C (editor + CFG enumerator):

# Use the Atlassian MCP to run the JQL: "project = PHON ORDER BY key DESC"
# (no status filter: the highest key may belong to a Done ticket).
# Note the highest current PHON-XX; sibling tickets get the next free numbers.

Per the spec's decision-recommendation: edit PHON-93's description to reflect this rescope. Open siblings B and C with the scope language from the spec.

  • [ ] Step 6: Update CLAUDE.md to remove stale references

Remove or update any reference to:
- ~44K canon words (use a queryable-subset framing instead)
- 35 properties (now ~98+ via PHON-88)
- 4 JSON dumps (norms_dump.json, etc.; replaced by WordStore + Parquet)
- Generation Runtime Data Contract section (rewrite for new layer)

git add CLAUDE.md
git commit -m "PHON-93: update CLAUDE.md for runtime word-data layer + Parquet-canonical"
  • [ ] Step 7: Push final branch
git push origin feature/phon-93-trie-centric-rebuild

Open PR against release/v5.2.0.


Open implementation questions (resolve at impl time, not blocking)

  1. WordRecord dataclass coverage of all 98 PHON-88 fields: verify when you reach Task 5 that dataclasses.fields(WordRecord) covers everything export-to-d1.py currently writes to D1. If gaps exist, add fields to WordRecord (it's the canonical record shape; emit_parquet should not reach into other dataclasses).

  2. Polars schema strict mode: pl.DataFrame(rows, schema=..., strict=False) is used in emit_parquet. Decide at impl whether to flip to strict mode (catches type errors at write time) once schemas are stable.

  3. Variants column shape: WordRecord.variants is list[dict] of pronunciation variants. Polars wants a struct schema for nested types. Tentative: pl.List(pl.Struct({})) is loose; tighten to a specific struct schema during impl.

  4. Edge schema fields: EdgeRecord has more fields than the current edges_schema() covers (e.g., eccc_* family). Audit at impl and align.

  5. CI runtime budget: Parquet rebuild + SQL derivation runs the full build_words() pipeline. If CI runtime grows too long, consider caching the records by data-input hash. Out of scope unless impl time becomes prohibitive.
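
If the caching idea in item 5 is ever picked up, the cache key could be a digest over the data inputs. The following is a sketch under assumptions, not a decided design: `input_hash` is a hypothetical helper, and the directory names and file contents are demo stand-ins for the real data inputs.

```python
import hashlib
import tempfile
from pathlib import Path


def input_hash(root: Path, patterns: list[str]) -> str:
    """Hash the relative paths and contents of every file matched under root."""
    # Sort so the digest does not depend on filesystem enumeration order.
    files = sorted({p for pat in patterns for p in root.glob(pat) if p.is_file()})
    digest = hashlib.sha256()
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between name and content
        digest.update(path.read_bytes())  # fine for a sketch; stream large files
        digest.update(b"\0")
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "norms").mkdir()
    (root / "norms" / "a.csv").write_text("word,score\ncat,1\n", encoding="utf-8")
    (root / "cmu").mkdir()
    (root / "cmu" / "dict.txt").write_text("CAT  K AE1 T\n", encoding="utf-8")

    patterns = ["norms/**/*", "cmu/**/*"]
    first = input_hash(root, patterns)
    second = input_hash(root, patterns)  # unchanged inputs give the same key
    (root / "norms" / "a.csv").write_text("word,score\ncat,2\n", encoding="utf-8")
    changed = input_hash(root, patterns)  # editing one input changes the key

print(first == second, first != changed)
```

Including the relative path in the digest means a rename invalidates the cache even when the bytes are unchanged, which errs on the side of rebuilding.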