D1-only Generation + Drop CSP + Unified Tables — Design Spec¶
Date: 2026-05-15 (revised)
Status: Design — pending plan
Owner: Jared Neumann
Branch: refactor/d1-only-drop-csp off origin/develop (425d992f, post-hotfix)
Problem¶
Three categories of misalignment to fix in one pass:
- Live
/api/sentencesruns in a FastAPI container, not the Worker. The container reads parquet files directly. Everything else in the Worker reads D1. Running a separate Python service for a couple of polars joins is architectural debt. - The CSP synthetic generation stack is deprecated dead code.
packages/generators/src/phonolex_generators/csp/solver + reranker, the FastAPI server, the entire CSP→reranker dependency chain (spaCy + en_core_web_sm + sentence-transformers + bge-base + torch + transformers + LightGBM) ships to staging but has zero live consumers. - The
wordsandpairsD1 tables are split by v5.2 POS filter. D1'swordshas only the 47K NOUN/VERB/ADJ/ADV content-POS rows; the other ~78K phonology-bearing rows (PROPN/PRON/closed-class) live inwords_full.parquet(read directly by the container). Same forpairs(60K) vspairs_full(642K). The Worker can't see the full vocabulary today. The right shape is one unified table per concept with anis_canonicalflag.
Goals¶
- One service: the Worker. All
/api/*endpoints — including/api/sentences— served from the Worker, reading D1. No container, no FastAPI, no Durable Object generation binding. - Unified D1 tables.
wordscarries all ~125K phonology-bearing rows withis_canonical INTEGERflagging the 47K content-POS subset.pairscarries all ~642K rows withis_canonicalflagging the ~60K content-POS minimal pairs. Thewords_full/pairs_fullparquet artifacts disappear. - Drop dead Python.
packages/generators/(csp + MLM editor + all submodules),packages/governors/(orphan after generators dies),packages/generation/(FastAPI + Dockerfile + tests). Drop spaCy + sentence-transformers + torch + transformers + lightgbm from any remaining pyproject. - Drop dead data.
selectional.parquet,skeletons.parquet,naturalness_scorer_head.pt,naturalness_reference.npy,naturalness_reference_meta.jsonl,reranker_v2.pkl. Drop the 7 spaCy-derived morph columns (number,person,tense,verb_form,mood,aspect,degree) from the words schema entirely — only consumer was CSP solver agreement, which is gone. - One file in LFS:
d1-seed.sql. All parquet/pkl/pt/npy/jsonl files indata/runtime/become gitignored developer-local build cache. - CI runs no Python pipeline. Deploy workflow pulls LFS, applies seed via
wrangler d1 execute --remote, deploys Worker + Pages. No model downloads, no container build, no source-TSV access.
Non-goals¶
- Not rebuilding the data pipeline itself.
build_lexical_database()andbuild_runtime_parquet.pyremain the local dev tools that produce the artifacts. They just stop running in CI. - Not removing
/api/similarity/searchor unhooking sound similarity from Lookup. Sound similarity in Lookup is an info feature per user direction; the endpoint stays live. - Not redesigning corpus retrieval semantics. Same
hard_filter_exprshape (constraints applied to the canonical vocabulary), same surface-keyed membership match, same pre-computed naturalness_score ordering. Just moved from Polars container to Worker SQL. - Not adding the Lookup inline-phoneme-breakdown enhancement. Feature creep; see
memory/project_lookup_inline_phoneme_breakdown.mdfor the followup note. - Not touching
packages/features/(Bayesian learned feature vectors). Used at parquet-build time to computefeature_distancein pairs.parquet. Stays.
Empirical input¶
User direction (this session, 2026-05-15): - "we should be using d1 tables and drop all the csp only stuff" - "there is no csp solver in the ui; i told you to unhook it. It's dead." - "why do we have 9000 files in lfs, and not 1: d1" - "do we not have full words in d1 already? if not, we should, and membership to the canonical word list is a column?" - "sound similarity is an info feature [in Lookup]"
Scan results (verified against current code):
| Find | Action |
|---|---|
phonolex_generators imported only by packages/generation/research/** + csp submodule + packages/generation/server/corpus.py:33 |
Delete the package once generation/ + csp/ are gone |
phonolex_governors imported only by packages/generators/src/.../editor/** + tests |
Orphan after generators dies; delete |
/api/similarity/search consumed by LookupTool.tsx:174,234 |
Live — keep |
Worker route routes/generation.ts |
Proxy to dead container; delete |
packages/data/pyproject.toml has spacy |
Needed only by _populate_morph_features → orphan once CSP solver is gone; drop |
data/runtime/*.parquet LFS-tracked |
Drop — local build cache; only seed ships |
Design¶
1. Unified words table¶
Build pipeline change (packages/data/src/phonolex_data/pipeline/words.py):
- Drop the post-v5.2 POS filter that removes non-content-POS rows from words.parquet
- Add
is_canonical: bool = FalsetoWordRecord; setis_canonical = (pos in {"NOUN","VERB","ADJ","ADV"})after POS is populated - Drop the entire
_populate_morph_features(words)function and its call site - words.parquet grows from ~47K to ~125K rows; gains one column (
is_canonical); loses 7 columns (number/person/tense/verb_form/mood/aspect/degree)
D1 schema change (packages/data/src/phonolex_data/runtime/emit_d1_sql.py):
CREATE TABLE words (
word TEXT PRIMARY KEY,
has_phonology INTEGER NOT NULL DEFAULT 1,
is_canonical INTEGER NOT NULL DEFAULT 0,
ipa TEXT,
phonemes TEXT,
phonemes_str TEXT,
syllables TEXT,
phoneme_count INTEGER,
syllable_count INTEGER,
initial_phoneme TEXT,
final_phoneme TEXT,
root TEXT,
pos TEXT,
pos_alt TEXT,
all_pos TEXT,
all_freqs TEXT,
variants TEXT,
cv_shape TEXT
);
CREATE INDEX idx_words_is_canonical ON words (is_canonical);
Add is_canonical to _CORE_WORDS_FIELDS. The 7 morph columns vanish from the schema entirely.
word_properties, word_freq_bands, word_percentiles rows only exist for is_canonical = 1 words (the 78K non-canonical rows don't have norm data). Either:
- (a) Emit word_properties rows only for canonical words. Non-canonical rows have no corresponding row in these tables. INNER JOIN naturally filters.
- (b) Emit NULL rows for non-canonical. LEFT JOIN works; INNER JOIN excludes them.
Choose (a) — saves space, matches data reality.
Worker route changes: every route that today implicitly assumes words means "content-POS words" needs WHERE w.is_canonical = 1:
| Route | File | Status |
|---|---|---|
POST /api/words/search |
routes/words.ts |
Add wordsWhere.push('w.is_canonical = 1') |
POST /api/words/word-list |
routes/words.ts |
Same |
POST /api/words/batch |
routes/words.ts |
Same |
POST /api/words/norms-dump |
routes/words.ts |
Same |
GET /api/words/:word |
routes/words.ts |
No change — looking up by primary key; phonology-full words are valid lookup targets |
POST /api/similarity/search |
routes/similarity.ts |
No change — sound similarity over the full vocabulary is the right behavior |
POST /api/contrastive/... |
routes/contrastive.ts |
Depends on route — verify per-route |
The compileWordFilter helper (extracted as part of this work — see §3) carries the is_canonical clause so all consumers get it consistently.
2. Unified pairs table¶
Same pattern. Drop pairs_full.parquet. The build pipeline emits one pairs.parquet (~642K rows) with is_canonical flagging the ~60K content-POS subset.
CREATE TABLE pairs (
word1 TEXT NOT NULL,
word2 TEXT NOT NULL,
is_canonical INTEGER NOT NULL DEFAULT 0,
phoneme1 TEXT NOT NULL,
phoneme2 TEXT NOT NULL,
position INTEGER NOT NULL,
position_type TEXT NOT NULL,
feature_distance REAL NOT NULL,
sonorant_diff REAL NOT NULL
);
CREATE INDEX idx_pairs_phonemes ON pairs (phoneme1, phoneme2);
CREATE INDEX idx_pairs_is_canonical ON pairs (is_canonical);
Contrastive Sets tool browsing → WHERE is_canonical = 1. Corpus retrieval contrastive matching → no filter. Same shape as words.
3. Corpus retrieval in the Worker¶
New D1 tables:
CREATE TABLE corpus_sentences_index (
sentence_id INTEGER PRIMARY KEY,
text TEXT NOT NULL,
source TEXT NOT NULL,
source_record_id TEXT,
n_tokens INTEGER,
n_content_in_vocab INTEGER,
n_content_oov INTEGER,
naturalness_score REAL
);
CREATE INDEX idx_corpus_index_source ON corpus_sentences_index (source);
CREATE INDEX idx_corpus_index_naturalness ON corpus_sentences_index (naturalness_score DESC);
CREATE TABLE corpus_sentences (
sentence_id INTEGER NOT NULL,
surface TEXT NOT NULL,
is_content INTEGER NOT NULL,
PRIMARY KEY (sentence_id, surface)
);
CREATE INDEX idx_corpus_mem_surface ON corpus_sentences (surface);
New Worker route POST /api/sentences (packages/web/workers/src/routes/sentences.ts):
- Compile constraints to filter the canonical vocabulary (
WHERE words.is_canonical = 1) — the user-facing constraint applies to content-POS words. - Find sentences where every
is_content = 1membership row'ssurfaceis in the filtered word set. Closed-class / PROPN / PRON tokens in those sentences don't need to satisfy the constraint — they're skipped (they'reis_content = 0). - Apply Exclude-phonemes against the FULL vocabulary (
wordswithout the canonical filter) — exclusion is a phonological constraint, not a vocabulary-class constraint. - Order surviving sentences by
naturalness_score DESC NULLS LAST, taketop_k.
SQL skeleton:
WITH canonical_surviving AS (
SELECT w.word FROM words w
INNER JOIN word_properties wp ON w.word = wp.word
WHERE w.is_canonical = 1
AND <pattern + norm + cv_shape clauses>
),
matching_sentences AS (
SELECT cs.sentence_id
FROM corpus_sentences cs
WHERE cs.is_content = 1
GROUP BY cs.sentence_id
HAVING SUM(CASE WHEN cs.surface NOT IN (SELECT word FROM canonical_surviving) THEN 1 ELSE 0 END) = 0
)
SELECT csi.* FROM corpus_sentences_index csi
INNER JOIN matching_sentences USING (sentence_id)
ORDER BY csi.naturalness_score DESC NULLS LAST
LIMIT ?
Exclude-phonemes adds a second NOT EXISTS clause against words (no canonical filter) to ensure no token's phonemes_str contains the excluded phonemes.
Shared with /api/words/search: extract compileWordFilter(body) from routes/words.ts into lib/wordFilter.ts. Both routes consume it.
4. Worker route + container retirement¶
Delete:
- packages/web/workers/src/routes/generation.ts (the proxy)
- packages/web/workers/src/containers/generation.ts (the GenerationServer DO class)
- packages/web/workers/src/__tests__/generation.test.ts
Update wrangler.toml:
- Remove [[containers]] blocks (production + staging)
- Remove [[durable_objects.bindings]] for GENERATION_SERVICE (production + staging)
- Add [[migrations]] with tag = "v2" and deleted_classes = ["GenerationServer"] to release Cloudflare-side DO storage cleanly (production + staging)
- Remove GENERATION_SERVICE from Env typing in src/types.ts
Index.ts: remove the generation route mount; add the new /api/sentences route mount.
5. Delete dead Python¶
git rm -r packages/generators/
git rm -r packages/governors/
git rm -r packages/generation/
Drop the corresponding workspace entries from root pyproject.toml. Drop spaCy + sentence-transformers + torch + transformers + lightgbm from packages/data/pyproject.toml (the only remaining Python package). The whole ML chain disappears.
uv sync --all-packages after the cleanup confirms the resolved graph stays tight.
6. Delete dead data¶
git rm data/runtime/selectional.parquet
git rm data/runtime/skeletons.parquet
git rm data/runtime/naturalness_scorer_head.pt
git rm data/runtime/naturalness_reference.npy
git rm data/runtime/naturalness_reference_meta.jsonl
git rm data/runtime/reranker_v2.pkl
git rm data/runtime/words_full.parquet
git rm data/runtime/pairs_full.parquet
The remaining data/runtime/*.parquet (words, pairs, edges, corpus_sentences{,_index}) become regular regen targets and stop being tracked at all — see §7.
7. LFS scope reduction¶
.gitattributes reduces to one line:
packages/web/workers/scripts/d1-seed.sql filter=lfs diff=lfs merge=lfs -text
.gitignore gets:
# Local build cache. Rebuild via:
# uv run python packages/data/scripts/build_runtime_parquet.py
# uv run python packages/web/workers/scripts/export-to-d1.py
# Only d1-seed.sql is committed (LFS).
data/runtime/*.parquet
data/runtime/*.pt
data/runtime/*.pkl
data/runtime/*.npy
data/runtime/*.jsonl
The remaining parquets are git rm'd from tracking but the developer-local files survive on disk because they're now gitignored.
8. CI deploy workflow¶
.github/workflows/deploy-staging.yml and .github/workflows/deploy.yml:
Detect Data Changespaths-filter narrows to a single watch:packages/web/workers/scripts/d1-seed.sql.- The rebuild step is already gone (hotfix PR #112 landed).
- No Python install in the data path of the workflow. Only uv + Python for MkDocs.
- No container build step. No
Setup Docker, no model downloads, no spaCy install.
9. CLAUDE.md rewrite¶
Major updates to multiple sections:
- "Architecture" — replace CSP-Phase-1-and-2 description with corpus-retrieval-from-D1
- "Generation Runtime Data Contract" — drop selectional/skeletons/naturalness references; describe unified words+pairs with is_canonical; describe corpus_sentences in D1
- "Project Structure" — remove packages/generation/, packages/generators/, packages/governors/
- "Dev Setup" — drop the generation server start command and the heavy build dep description; the local dev flow is build_runtime_parquet.py + export-to-d1.py + wrangler dev
- "Gotchas" — drop references to the FastAPI server, the container cold-start
10. Frontend¶
The "Sentences" tool (packages/web/frontend/src/components/tools/GovernedGenerationTool/):
- Drop any "first request takes ~60s" cold-start copy in index.tsx (no more container).
- Confirm tool name + description still make sense — it's pure corpus retrieval now. The "Governed Generation" name may need a rename in a followup; this PR keeps it.
File map¶
Created:
- packages/web/workers/src/routes/sentences.ts
- packages/web/workers/src/lib/wordFilter.ts
- packages/web/workers/src/__tests__/sentences.test.ts
Modified:
- packages/data/src/phonolex_data/pipeline/words.py (drop POS filter; add is_canonical; remove _populate_morph_features call + function)
- packages/data/src/phonolex_data/pipeline/schema.py (drop 7 morph fields from WordRecord; add is_canonical)
- packages/data/src/phonolex_data/pipeline/derived.py (drop pairs canonical filter; emit one pairs.parquet with is_canonical)
- packages/data/src/phonolex_data/runtime/schema.py (drop morph columns; add is_canonical to _CORE_WORDS_COLUMNS)
- packages/data/src/phonolex_data/runtime/emit_d1_sql.py (DDL changes + corpus_sentences emit + is_canonical wiring)
- packages/data/pyproject.toml (drop spaCy / sentence-transformers / torch / transformers)
- packages/web/workers/scripts/export-to-d1.py (DROP TABLE additions + new emit calls + drop morph filter logic if any)
- packages/web/workers/src/routes/words.ts (extract compileWordFilter; add is_canonical = 1 to needed routes)
- packages/web/workers/src/routes/contrastive.ts (add is_canonical = 1 where browsing canonical pairs)
- packages/web/workers/src/routes/similarity.ts (no change — similarity over full vocab)
- packages/web/workers/src/types.ts (drop GENERATION_SERVICE; add is_canonical?: number to WordRow)
- packages/web/workers/src/index.ts (route mount changes)
- packages/web/workers/wrangler.toml (drop containers, DO bindings; add v2 migration)
- packages/web/workers/src/__tests__/api.test.ts (account for unified words; rename/update as needed)
- .gitattributes
- .gitignore
- .github/workflows/deploy-staging.yml
- .github/workflows/deploy.yml
- packages/web/frontend/src/components/tools/GovernedGenerationTool/index.tsx (drop cold-start copy)
- CLAUDE.md
- pyproject.toml (root workspace — drop deleted packages)
Deleted (entire trees):
- packages/generators/
- packages/governors/
- packages/generation/
- packages/web/workers/src/routes/generation.ts
- packages/web/workers/src/containers/generation.ts
- packages/web/workers/src/__tests__/generation.test.ts
Deleted (single files):
- data/runtime/selectional.parquet
- data/runtime/skeletons.parquet
- data/runtime/naturalness_scorer_head.pt
- data/runtime/naturalness_reference.npy
- data/runtime/naturalness_reference_meta.jsonl
- data/runtime/reranker_v2.pkl
- data/runtime/words_full.parquet
- data/runtime/pairs_full.parquet
Plus: data/runtime/words.parquet, data/runtime/pairs.parquet, data/runtime/edges.parquet, data/runtime/corpus_sentences.parquet, data/runtime/corpus_sentences_index.parquet are git rm'd from tracking but kept on disk (gitignored as build cache).
Risks¶
-
D1 seed size grows ~50%. Going from 47K → 125K words rows (with the ~78K new rows having compact data — phonemes_str + cv_shape + NULL norms) adds maybe 5 MB. Pairs going 60K → 642K is the bigger jump — ~90 MB. Plus corpus_sentences (~15 MB) + corpus_sentences_index (~12 MB). Total seed: ~250 MB → ~370 MB. Chunks at 40 MB → 10 chunks. Still works given the Task-13 chunking fix.
-
words.is_canonical = 1filter must be applied EVERYWHERE that today implicitly assumes canonical scope. This is the biggest correctness risk. Plan: route the filter throughcompileWordFilter(extracted lib) so every consumer gets it for free; audit each route that doesn't use the helper. -
DurableObject retirement. Cloudflare requires explicit
deleted_classesmigration to release DO storage. Get thetag = "v2"migration right or the deploy errors. Test the migration viawrangler deploy --dry-runon staging before production. -
Build pipeline produces unified words.parquet with NULL norms for 78K rows. Polars handles this naturally — NULL values just don't survive the WHERE is_canonical=1 filter when norm-bounds are applied. But:
extra_word_props=derived.percentilesinexport-to-d1.pyis keyed by word; need to skip non-canonical words when emittingword_properties/word_percentiles/word_freq_bandsto avoid NULL rows that bloat D1. -
Local dev parquet artifacts move from LFS to gitignored. Existing collaborators with checkouts have the LFS parquets locally. After this PR merges they need to regenerate (or the parquets stay on disk because git won't auto-delete files that become gitignored). Document the regen flow in CLAUDE.md.
-
Container cold-start went from ~25s to ~0s. Users may notice the latency drop. Frontend cold-start copy must be removed; if any messaging adapts to "first request slow", it's now misleading.
Open questions¶
None — user has signed off on items 1–16 from the previous summary; item 17 (Lookup inline phoneme breakdown) is filed as a followup memory.
References¶
- Hotfix PR (already merged): https://github.com/neumanns-workshop/PhonoLex/pull/112
- Memory:
memory/project_csp_and_reranker_deprecated.md,memory/project_ci_deploy_paradigm.md,memory/feedback_dont_narrate_deprecated_paradigms.md,memory/project_lookup_inline_phoneme_breakdown.md