Skip to content

D1-only Generation + Drop CSP + Unified Tables — Design Spec

Date: 2026-05-15 (revised) Status: Design — pending plan Owner: Jared Neumann Branch: refactor/d1-only-drop-csp off origin/develop (425d992f, post-hotfix)

Problem

Three categories of misalignment to fix in one pass:

  1. Live /api/sentences runs in a FastAPI container, not the Worker. The container reads parquet files directly. Everything else in the Worker reads D1. Running a separate Python service for a couple of polars joins is architectural debt.
  2. The CSP synthetic generation stack is deprecated dead code. packages/generators/src/phonolex_generators/csp/ solver + reranker, the FastAPI server, the entire CSP→reranker dependency chain (spaCy + en_core_web_sm + sentence-transformers + bge-base + torch + transformers + LightGBM) ships to staging but has zero live consumers.
  3. The words and pairs D1 tables are split by v5.2 POS filter. D1's words has only the 47K NOUN/VERB/ADJ/ADV content-POS rows; the other ~78K phonology-bearing rows (PROPN/PRON/closed-class) live in words_full.parquet (read directly by the container). Same for pairs (60K) vs pairs_full (642K). The Worker can't see the full vocabulary today. The right shape is one unified table per concept with an is_canonical flag.

Goals

  1. One service: the Worker. All /api/* endpoints — including /api/sentences — served from the Worker, reading D1. No container, no FastAPI, no Durable Object generation binding.
  2. Unified D1 tables. words carries all ~125K phonology-bearing rows with is_canonical INTEGER flagging the 47K content-POS subset. pairs carries all ~642K rows with is_canonical flagging the ~60K content-POS minimal pairs. The words_full / pairs_full parquet artifacts disappear.
  3. Drop dead Python. packages/generators/ (csp + MLM editor + all submodules), packages/governors/ (orphan after generators dies), packages/generation/ (FastAPI + Dockerfile + tests). Drop spaCy + sentence-transformers + torch + transformers + lightgbm from any remaining pyproject.
  4. Drop dead data. selectional.parquet, skeletons.parquet, naturalness_scorer_head.pt, naturalness_reference.npy, naturalness_reference_meta.jsonl, reranker_v2.pkl. Drop the 7 spaCy-derived morph columns (number, person, tense, verb_form, mood, aspect, degree) from the words schema entirely — only consumer was CSP solver agreement, which is gone.
  5. One file in LFS: d1-seed.sql. All parquet/pkl/pt/npy/jsonl files in data/runtime/ become gitignored developer-local build cache.
  6. CI runs no Python pipeline. Deploy workflow pulls LFS, applies seed via wrangler d1 execute --remote, deploys Worker + Pages. No model downloads, no container build, no source-TSV access.

Non-goals

  • Not rebuilding the data pipeline itself. build_lexical_database() and build_runtime_parquet.py remain the local dev tools that produce the artifacts. They just stop running in CI.
  • Not removing /api/similarity/search or unhooking sound similarity from Lookup. Sound similarity in Lookup is an info feature per user direction; the endpoint stays live.
  • Not redesigning corpus retrieval semantics. Same hard_filter_expr shape (constraints applied to the canonical vocabulary), same surface-keyed membership match, same pre-computed naturalness_score ordering. Just moved from Polars container to Worker SQL.
  • Not adding the Lookup inline-phoneme-breakdown enhancement. Feature creep; see memory/project_lookup_inline_phoneme_breakdown.md for the followup note.
  • Not touching packages/features/ (Bayesian learned feature vectors). Used at parquet-build time to compute feature_distance in pairs.parquet. Stays.

Empirical input

User direction (this session, 2026-05-15): - "we should be using d1 tables and drop all the csp only stuff" - "there is no csp solver in the ui; i told you to unhook it. It's dead." - "why do we have 9000 files in lfs, and not 1: d1" - "do we not have full words in d1 already? if not, we should, and membership to the canonical word list is a column?" - "sound similarity is an info feature [in Lookup]"

Scan results (verified against current code):

Find Action
phonolex_generators imported only by packages/generation/research/** + csp submodule + packages/generation/server/corpus.py:33 Delete the package once generation/ + csp/ are gone
phonolex_governors imported only by packages/generators/src/.../editor/** + tests Orphan after generators dies; delete
/api/similarity/search consumed by LookupTool.tsx:174,234 Live — keep
Worker route routes/generation.ts Proxy to dead container; delete
packages/data/pyproject.toml has spacy Needed only by _populate_morph_features → orphan once CSP solver is gone; drop
data/runtime/*.parquet LFS-tracked Drop — local build cache; only seed ships

Design

1. Unified words table

Build pipeline change (packages/data/src/phonolex_data/pipeline/words.py):

  • Drop the post-v5.2 POS filter that removes non-content-POS rows from words.parquet
  • Add is_canonical: bool = False to WordRecord; set is_canonical = (pos in {"NOUN","VERB","ADJ","ADV"}) after POS is populated
  • Drop the entire _populate_morph_features(words) function and its call site
  • words.parquet grows from ~47K to ~125K rows; gains one column (is_canonical); loses 7 columns (number/person/tense/verb_form/mood/aspect/degree)

D1 schema change (packages/data/src/phonolex_data/runtime/emit_d1_sql.py):

CREATE TABLE words (
  word TEXT PRIMARY KEY,
  has_phonology INTEGER NOT NULL DEFAULT 1,
  is_canonical INTEGER NOT NULL DEFAULT 0,
  ipa TEXT,
  phonemes TEXT,
  phonemes_str TEXT,
  syllables TEXT,
  phoneme_count INTEGER,
  syllable_count INTEGER,
  initial_phoneme TEXT,
  final_phoneme TEXT,
  root TEXT,
  pos TEXT,
  pos_alt TEXT,
  all_pos TEXT,
  all_freqs TEXT,
  variants TEXT,
  cv_shape TEXT
);
CREATE INDEX idx_words_is_canonical ON words (is_canonical);

Add is_canonical to _CORE_WORDS_FIELDS. The 7 morph columns vanish from the schema entirely.

word_properties, word_freq_bands, word_percentiles rows only exist for is_canonical = 1 words (the 78K non-canonical rows don't have norm data). Either: - (a) Emit word_properties rows only for canonical words. Non-canonical rows have no corresponding row in these tables. INNER JOIN naturally filters. - (b) Emit NULL rows for non-canonical. LEFT JOIN works; INNER JOIN excludes them.

Choose (a) — saves space, matches data reality.

Worker route changes: every route that today implicitly assumes words means "content-POS words" needs WHERE w.is_canonical = 1:

Route File Status
POST /api/words/search routes/words.ts Add wordsWhere.push('w.is_canonical = 1')
POST /api/words/word-list routes/words.ts Same
POST /api/words/batch routes/words.ts Same
POST /api/words/norms-dump routes/words.ts Same
GET /api/words/:word routes/words.ts No change — looking up by primary key; phonology-full words are valid lookup targets
POST /api/similarity/search routes/similarity.ts No change — sound similarity over the full vocabulary is the right behavior
POST /api/contrastive/... routes/contrastive.ts Depends on route — verify per-route

The compileWordFilter helper (extracted as part of this work — see §3) carries the is_canonical clause so all consumers get it consistently.

2. Unified pairs table

Same pattern. Drop pairs_full.parquet. The build pipeline emits one pairs.parquet (~642K rows) with is_canonical flagging the ~60K content-POS subset.

CREATE TABLE pairs (
  word1 TEXT NOT NULL,
  word2 TEXT NOT NULL,
  is_canonical INTEGER NOT NULL DEFAULT 0,
  phoneme1 TEXT NOT NULL,
  phoneme2 TEXT NOT NULL,
  position INTEGER NOT NULL,
  position_type TEXT NOT NULL,
  feature_distance REAL NOT NULL,
  sonorant_diff REAL NOT NULL
);
CREATE INDEX idx_pairs_phonemes ON pairs (phoneme1, phoneme2);
CREATE INDEX idx_pairs_is_canonical ON pairs (is_canonical);

Contrastive Sets tool browsing → WHERE is_canonical = 1. Corpus retrieval contrastive matching → no filter. Same shape as words.

3. Corpus retrieval in the Worker

New D1 tables:

CREATE TABLE corpus_sentences_index (
  sentence_id INTEGER PRIMARY KEY,
  text TEXT NOT NULL,
  source TEXT NOT NULL,
  source_record_id TEXT,
  n_tokens INTEGER,
  n_content_in_vocab INTEGER,
  n_content_oov INTEGER,
  naturalness_score REAL
);
CREATE INDEX idx_corpus_index_source ON corpus_sentences_index (source);
CREATE INDEX idx_corpus_index_naturalness ON corpus_sentences_index (naturalness_score DESC);

CREATE TABLE corpus_sentences (
  sentence_id INTEGER NOT NULL,
  surface TEXT NOT NULL,
  is_content INTEGER NOT NULL,
  PRIMARY KEY (sentence_id, surface)
);
CREATE INDEX idx_corpus_mem_surface ON corpus_sentences (surface);

New Worker route POST /api/sentences (packages/web/workers/src/routes/sentences.ts):

  1. Compile constraints to filter the canonical vocabulary (WHERE words.is_canonical = 1) — the user-facing constraint applies to content-POS words.
  2. Find sentences where every is_content = 1 membership row's surface is in the filtered word set. Closed-class / PROPN / PRON tokens in those sentences don't need to satisfy the constraint — they're skipped (they're is_content = 0).
  3. Apply Exclude-phonemes against the FULL vocabulary (words without the canonical filter) — exclusion is a phonological constraint, not a vocabulary-class constraint.
  4. Order surviving sentences by naturalness_score DESC NULLS LAST, take top_k.

SQL skeleton:

WITH canonical_surviving AS (
  SELECT w.word FROM words w
  INNER JOIN word_properties wp ON w.word = wp.word
  WHERE w.is_canonical = 1
    AND <pattern + norm + cv_shape clauses>
),
matching_sentences AS (
  SELECT cs.sentence_id
  FROM corpus_sentences cs
  WHERE cs.is_content = 1
  GROUP BY cs.sentence_id
  HAVING SUM(CASE WHEN cs.surface NOT IN (SELECT word FROM canonical_surviving) THEN 1 ELSE 0 END) = 0
)
SELECT csi.* FROM corpus_sentences_index csi
INNER JOIN matching_sentences USING (sentence_id)
ORDER BY csi.naturalness_score DESC NULLS LAST
LIMIT ?

Exclude-phonemes adds a second NOT EXISTS clause against words (no canonical filter) to ensure no token's phonemes_str contains the excluded phonemes.

Shared with /api/words/search: extract compileWordFilter(body) from routes/words.ts into lib/wordFilter.ts. Both routes consume it.

4. Worker route + container retirement

Delete: - packages/web/workers/src/routes/generation.ts (the proxy) - packages/web/workers/src/containers/generation.ts (the GenerationServer DO class) - packages/web/workers/src/__tests__/generation.test.ts

Update wrangler.toml: - Remove [[containers]] blocks (production + staging) - Remove [[durable_objects.bindings]] for GENERATION_SERVICE (production + staging) - Add [[migrations]] with tag = "v2" and deleted_classes = ["GenerationServer"] to release Cloudflare-side DO storage cleanly (production + staging) - Remove GENERATION_SERVICE from Env typing in src/types.ts

Index.ts: remove the generation route mount; add the new /api/sentences route mount.

5. Delete dead Python

git rm -r packages/generators/
git rm -r packages/governors/
git rm -r packages/generation/

Drop the corresponding workspace entries from root pyproject.toml. Drop spaCy + sentence-transformers + torch + transformers + lightgbm from packages/data/pyproject.toml (the only remaining Python package). The whole ML chain disappears.

uv sync --all-packages after the cleanup confirms the resolved graph stays tight.

6. Delete dead data

git rm data/runtime/selectional.parquet
git rm data/runtime/skeletons.parquet
git rm data/runtime/naturalness_scorer_head.pt
git rm data/runtime/naturalness_reference.npy
git rm data/runtime/naturalness_reference_meta.jsonl
git rm data/runtime/reranker_v2.pkl
git rm data/runtime/words_full.parquet
git rm data/runtime/pairs_full.parquet

The remaining data/runtime/*.parquet (words, pairs, edges, corpus_sentences{,_index}) become regular regen targets and stop being tracked at all — see §7.

7. LFS scope reduction

.gitattributes reduces to one line:

packages/web/workers/scripts/d1-seed.sql filter=lfs diff=lfs merge=lfs -text

.gitignore gets:

# Local build cache. Rebuild via:
#   uv run python packages/data/scripts/build_runtime_parquet.py
#   uv run python packages/web/workers/scripts/export-to-d1.py
# Only d1-seed.sql is committed (LFS).
data/runtime/*.parquet
data/runtime/*.pt
data/runtime/*.pkl
data/runtime/*.npy
data/runtime/*.jsonl

The remaining parquets are git rm'd from tracking but the developer-local files survive on disk because they're now gitignored.

8. CI deploy workflow

.github/workflows/deploy-staging.yml and .github/workflows/deploy.yml:

  • Detect Data Changes paths-filter narrows to a single watch: packages/web/workers/scripts/d1-seed.sql.
  • The rebuild step is already gone (hotfix PR #112 landed).
  • No Python install in the data path of the workflow. Only uv + Python for MkDocs.
  • No container build step. No Setup Docker, no model downloads, no spaCy install.

9. CLAUDE.md rewrite

Major updates to multiple sections: - "Architecture" — replace CSP-Phase-1-and-2 description with corpus-retrieval-from-D1 - "Generation Runtime Data Contract" — drop selectional/skeletons/naturalness references; describe unified words+pairs with is_canonical; describe corpus_sentences in D1 - "Project Structure" — remove packages/generation/, packages/generators/, packages/governors/ - "Dev Setup" — drop the generation server start command and the heavy build dep description; the local dev flow is build_runtime_parquet.py + export-to-d1.py + wrangler dev - "Gotchas" — drop references to the FastAPI server, the container cold-start

10. Frontend

The "Sentences" tool (packages/web/frontend/src/components/tools/GovernedGenerationTool/): - Drop any "first request takes ~60s" cold-start copy in index.tsx (no more container). - Confirm tool name + description still make sense — it's pure corpus retrieval now. The "Governed Generation" name may need a rename in a followup; this PR keeps it.

File map

Created: - packages/web/workers/src/routes/sentences.ts - packages/web/workers/src/lib/wordFilter.ts - packages/web/workers/src/__tests__/sentences.test.ts

Modified: - packages/data/src/phonolex_data/pipeline/words.py (drop POS filter; add is_canonical; remove _populate_morph_features call + function) - packages/data/src/phonolex_data/pipeline/schema.py (drop 7 morph fields from WordRecord; add is_canonical) - packages/data/src/phonolex_data/pipeline/derived.py (drop pairs canonical filter; emit one pairs.parquet with is_canonical) - packages/data/src/phonolex_data/runtime/schema.py (drop morph columns; add is_canonical to _CORE_WORDS_COLUMNS) - packages/data/src/phonolex_data/runtime/emit_d1_sql.py (DDL changes + corpus_sentences emit + is_canonical wiring) - packages/data/pyproject.toml (drop spaCy / sentence-transformers / torch / transformers) - packages/web/workers/scripts/export-to-d1.py (DROP TABLE additions + new emit calls + drop morph filter logic if any) - packages/web/workers/src/routes/words.ts (extract compileWordFilter; add is_canonical = 1 to needed routes) - packages/web/workers/src/routes/contrastive.ts (add is_canonical = 1 where browsing canonical pairs) - packages/web/workers/src/routes/similarity.ts (no change — similarity over full vocab) - packages/web/workers/src/types.ts (drop GENERATION_SERVICE; add is_canonical?: number to WordRow) - packages/web/workers/src/index.ts (route mount changes) - packages/web/workers/wrangler.toml (drop containers, DO bindings; add v2 migration) - packages/web/workers/src/__tests__/api.test.ts (account for unified words; rename/update as needed) - .gitattributes - .gitignore - .github/workflows/deploy-staging.yml - .github/workflows/deploy.yml - packages/web/frontend/src/components/tools/GovernedGenerationTool/index.tsx (drop cold-start copy) - CLAUDE.md - pyproject.toml (root workspace — drop deleted packages)

Deleted (entire trees): - packages/generators/ - packages/governors/ - packages/generation/ - packages/web/workers/src/routes/generation.ts - packages/web/workers/src/containers/generation.ts - packages/web/workers/src/__tests__/generation.test.ts

Deleted (single files): - data/runtime/selectional.parquet - data/runtime/skeletons.parquet - data/runtime/naturalness_scorer_head.pt - data/runtime/naturalness_reference.npy - data/runtime/naturalness_reference_meta.jsonl - data/runtime/reranker_v2.pkl - data/runtime/words_full.parquet - data/runtime/pairs_full.parquet

Plus: data/runtime/words.parquet, data/runtime/pairs.parquet, data/runtime/edges.parquet, data/runtime/corpus_sentences.parquet, data/runtime/corpus_sentences_index.parquet are git rm'd from tracking but kept on disk (gitignored as build cache).

Risks

  1. D1 seed size grows ~50%. Going from 47K → 125K words rows (with the ~78K new rows having compact data — phonemes_str + cv_shape + NULL norms) adds maybe 5 MB. Pairs going 60K → 642K is the bigger jump — ~90 MB. Plus corpus_sentences (~15 MB) + corpus_sentences_index (~12 MB). Total seed: ~250 MB → ~370 MB. Chunks at 40 MB → 10 chunks. Still works given the Task-13 chunking fix.

  2. words.is_canonical = 1 filter must be applied EVERYWHERE that today implicitly assumes canonical scope. This is the biggest correctness risk. Plan: route the filter through compileWordFilter (extracted lib) so every consumer gets it for free; audit each route that doesn't use the helper.

  3. DurableObject retirement. Cloudflare requires explicit deleted_classes migration to release DO storage. Get the tag = "v2" migration right or the deploy errors. Test the migration via wrangler deploy --dry-run on staging before production.

  4. Build pipeline produces unified words.parquet with NULL norms for 78K rows. Polars handles this naturally — NULL values just don't survive the WHERE is_canonical=1 filter when norm-bounds are applied. But: extra_word_props=derived.percentiles in export-to-d1.py is keyed by word; need to skip non-canonical words when emitting word_properties / word_percentiles / word_freq_bands to avoid NULL rows that bloat D1.

  5. Local dev parquet artifacts move from LFS to gitignored. Existing collaborators with checkouts have the LFS parquets locally. After this PR merges they need to regenerate (or the parquets stay on disk because git won't auto-delete files that become gitignored). Document the regen flow in CLAUDE.md.

  6. Container cold-start went from ~25s to ~0s. Users may notice the latency drop. Frontend cold-start copy must be removed; if any messaging adapts to "first request slow", it's now misleading.

Open questions

None — user has signed off on items 1–16 from the previous summary; item 17 (Lookup inline phoneme breakdown) is filed as a followup memory.

References

  • Hotfix PR (already merged): https://github.com/neumanns-workshop/PhonoLex/pull/112
  • Memory: memory/project_csp_and_reranker_deprecated.md, memory/project_ci_deploy_paradigm.md, memory/feedback_dont_narrate_deprecated_paradigms.md, memory/project_lookup_inline_phoneme_breakdown.md