D1-only + Drop CSP + Unified Tables: Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task.
Goal: Move corpus retrieval into the Worker reading D1; unify words + pairs into single tables with is_canonical flag; delete the entire CSP/MLM/governor/generation-server Python stack; reduce LFS to a single tracked file.
Spec: docs/superpowers/specs/2026-05-15-d1-only-drop-csp-design.md.
Branch: refactor/d1-only-drop-csp off origin/develop @ 425d992f.
Task 1: Unify words.parquet (add is_canonical, drop morph features)
Files:
- Modify: packages/data/src/phonolex_data/pipeline/words.py
- Modify: packages/data/src/phonolex_data/pipeline/schema.py
- Modify: packages/data/src/phonolex_data/runtime/schema.py
- [ ] Step 1: In `pipeline/schema.py`, on `WordRecord`:
  - Add `is_canonical: bool = False`
  - Remove the 7 morph fields: `number`, `person`, `tense`, `verb_form`, `mood`, `aspect`, `degree`
- [ ] Step 2: In `pipeline/words.py`:
  - Find the post-v5.2 POS filter (search for `pos in {"NOUN"` or `is_content_pos` or similar; it is the filter that drops PROPN/PRON rows). Remove it so all phonology-bearing rows survive.
  - After POS is populated on each WordRecord, set `record.is_canonical = (record.pos in {"NOUN", "VERB", "ADJ", "ADV"})`.
  - Delete the entire `_populate_morph_features(words)` function and its call site (search for `_populate_morph_features(`). Drop the spacy import.
- [ ] Step 3: In `runtime/schema.py`, in `_CORE_WORDS_COLUMNS`:
  - Add `"is_canonical": pl.Boolean` right after `"has_phonology"`
  - Remove the 7 morph columns
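As a rough illustration of Steps 1 and 2, the flag can be derived from POS alone while every row is kept. This is a minimal sketch, not the real pipeline code: the `WordRecord` below is a stripped-down stand-in with only `word`, `pos` and `is_canonical`, and `mark_canonical` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import List, Optional

# Content-POS set named in Step 2.
CANONICAL_POS = {"NOUN", "VERB", "ADJ", "ADV"}


@dataclass
class WordRecord:
    """Stand-in for the real WordRecord; only the fields needed here."""
    word: str
    pos: Optional[str] = None
    is_canonical: bool = False


def mark_canonical(records: List[WordRecord]) -> List[WordRecord]:
    """Flag content-POS rows in place; no rows are dropped."""
    for record in records:
        record.is_canonical = record.pos in CANONICAL_POS
    return records


records = mark_canonical([
    WordRecord("cat", "NOUN"),
    WordRecord("London", "PROPN"),
    WordRecord("she", "PRON"),
    WordRecord("run", "VERB"),
])
canonical_words = [r.word for r in records if r.is_canonical]
```

The point of the sketch is that the old POS filter becomes a column: PROPN/PRON rows stay in the table and are merely flagged as non-canonical.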
- [ ] Step 4: Regenerate parquet locally:
cd /Users/jneumann/Repos/PhonoLex
uv run python packages/data/scripts/build_runtime_parquet.py 2>&1 | tail -10
uv run python -c "
import polars as pl
df = pl.read_parquet('data/runtime/words.parquet')
print('rows:', df.height)
print('is_canonical in cols:', 'is_canonical' in df.columns)
print('morph cols (should be 0):', sum(c in df.columns for c in ['number','person','tense','verb_form','mood','aspect','degree']))
print('canonical=1 count:', df.filter(pl.col('is_canonical')).height)
print('canonical=0 count:', df.filter(~pl.col('is_canonical')).height)
"
Expected: ~125,756 total; ~47,384 canonical=1; ~78K canonical=0; zero morph columns.
- [ ] Step 5: Run data tests:
uv run python -m pytest packages/data/tests/ \
--ignore=packages/data/tests/test_datasets.py \
--ignore=packages/data/tests/test_new_loaders.py 2>&1 | tail -10
If any test asserts on the old POS-filtered shape or morph columns, fix the test to match the new schema (don't relax assertions).
- [ ] Step 6: Commit:
git add packages/data/ data/runtime/words.parquet
git commit -m "feat(data): unified words.parquet with is_canonical; drop morph features

words.parquet now carries all ~125K phonology-bearing entries (was 47K
after the v5.2 POS filter). is_canonical column flags the 47K NOUN/VERB/
ADJ/ADV content-POS subset. Removed _populate_morph_features and the
7 spaCy-derived morph columns (number/person/tense/verb_form/mood/aspect/
degree); the only consumer was the now-deprecated CSP solver.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 2: Unify pairs.parquet (drop pairs_full, add is_canonical)
Files:
- Modify: packages/data/src/phonolex_data/pipeline/derived.py (or wherever minimal_pairs / pairs are computed)
- Modify: packages/data/src/phonolex_data/runtime/emit_parquet.py (the pairs + pairs_full emit functions)
- [ ] Step 1: Inspect current pairs emit logic:
grep -n "pairs\|_compute_minimal_pairs\|emit_pairs" packages/data/src/phonolex_data/pipeline/derived.py packages/data/src/phonolex_data/runtime/emit_parquet.py | head -20
Find where pairs.parquet gets filtered to content-POS-only vs pairs_full.parquet (which keeps all). The build pipeline computes the full set, then filters for pairs.parquet.
- [ ] Step 2: Change the emit logic to produce ONE pairs.parquet (all ~642K rows) with an `is_canonical` column. Drop the separate `emit_pairs_full_parquet` function and the `pairs_full.parquet` artifact.
  - Each pair row's `is_canonical` is set to True iff BOTH word1 and word2 are in the canonical (is_canonical=1) subset of words.
  - The two emit functions collapse into one.
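The pair-level rule from Step 2 can be sketched in a few lines. This is an illustration under assumptions: real pair rows carry more fields than `word1` and `word2`, and `flag_pairs` is a hypothetical name, not a function in `emit_parquet.py`.

```python
from typing import Dict, Iterable, List, Set, Tuple


def flag_pairs(
    pairs: Iterable[Tuple[str, str]],
    canonical_words: Set[str],
) -> List[Dict[str, object]]:
    """Return one row per pair; is_canonical is True only if BOTH members are canonical."""
    return [
        {
            "word1": w1,
            "word2": w2,
            "is_canonical": w1 in canonical_words and w2 in canonical_words,
        }
        for w1, w2 in pairs
    ]


# Toy data: "Ben" and "Tim" stand in for proper nouns, "him" for a pronoun.
canonical = {"cat", "bat", "pen"}
rows = flag_pairs([("cat", "bat"), ("pen", "Ben"), ("him", "Tim")], canonical)
flags = [row["is_canonical"] for row in rows]
```

All three pairs survive in the output; only the first is flagged, which is what lets a single table replace the old pairs / pairs_full split.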
- [ ] Step 3: Update CLAUDE.md sections that reference pairs_full as a separate artifact (commit covered in T12).
- [ ] Step 4: Regenerate + verify:
uv run python packages/data/scripts/build_runtime_parquet.py 2>&1 | tail -5
uv run python -c "
import polars as pl
df = pl.read_parquet('data/runtime/pairs.parquet')
print('rows:', df.height)
print('is_canonical in cols:', 'is_canonical' in df.columns)
print('canonical=1 count:', df.filter(pl.col('is_canonical')).height)
import os
print('pairs_full exists:', os.path.exists('data/runtime/pairs_full.parquet'))
"
Expected: ~642K rows, ~60K canonical, pairs_full.parquet does not exist.
- [ ] Step 5: Commit:
git add packages/data/ data/runtime/pairs.parquet
[ -f data/runtime/pairs_full.parquet ] && git rm data/runtime/pairs_full.parquet
git commit -m "feat(data): unified pairs.parquet with is_canonical; drop pairs_full

pairs.parquet now carries all ~642K minimal-pair rows. is_canonical
column flags the ~60K pairs where both words are in the canonical subset.
pairs_full.parquet retired; replaced by WHERE is_canonical = 0/1 filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 3: emit_d1_sql (DDL changes + corpus_sentences emit)
Files:
- Modify: packages/data/src/phonolex_data/runtime/emit_d1_sql.py
- Modify: packages/web/workers/scripts/export-to-d1.py
- [ ] Step 1: In `emit_d1_sql.py`:
  - Add `"is_canonical"` to `_CORE_WORDS_FIELDS` (right after `"has_phonology"`)
  - Update `_WORDS_DDL` to include `is_canonical INTEGER NOT NULL DEFAULT 0` and `CREATE INDEX idx_words_is_canonical ON words (is_canonical);`
  - Add `"is_canonical"` to `_PAIRS_FIELDS`
  - Update `_PAIRS_DDL` to include `is_canonical INTEGER NOT NULL DEFAULT 0` and `CREATE INDEX idx_pairs_is_canonical ON pairs (is_canonical);`
  - Drop the 7 morph cols from `_partition_property_cols` if they're whitelisted there
  - Add new constants `_CORPUS_INDEX_DDL`, `_CORPUS_INDEX_FIELDS`, `_CORPUS_MEMBERSHIP_DDL`, `_CORPUS_MEMBERSHIP_FIELDS` as in the spec §3
- [ ] Step 2: Update `_normalise_pairs_row` (or add a new `_normalise_pairs_row_with_canonical`) so the emitted row tuple matches the new 9-field shape (existing 8 + is_canonical).
- [ ] Step 3: Add new emit functions at the bottom of `emit_d1_sql.py`:
def emit_corpus_sentences_d1(parquet_dir: Path, output_path: Path) -> None:
    """Append corpus_sentences_index + corpus_sentences DDL + INSERTs."""
    idx_path = parquet_dir / "corpus_sentences_index.parquet"
    mem_path = parquet_dir / "corpus_sentences.parquet"
    if not idx_path.exists() or not mem_path.exists():
        print(f" WARNING: corpus sentence parquets missing; skipping")
        return
    idx_df = pl.read_parquet(idx_path)
    mem_df = pl.read_parquet(mem_path)
    with output_path.open("a", encoding="utf-8") as fh:
        fh.write(_CORPUS_INDEX_DDL + "\n\n")
        for stmt in _emit_inserts("corpus_sentences_index", _CORPUS_INDEX_FIELDS,
                                  (tuple(r[f] for f in _CORPUS_INDEX_FIELDS)
                                   for r in idx_df.iter_rows(named=True))):
            fh.write(stmt + "\n\n")
        fh.write(_CORPUS_MEMBERSHIP_DDL + "\n\n")
        for stmt in _emit_inserts("corpus_sentences", _CORPUS_MEMBERSHIP_FIELDS,
                                  (tuple(r[f] for f in _CORPUS_MEMBERSHIP_FIELDS)
                                   for r in mem_df.iter_rows(named=True))):
            fh.write(stmt + "\n\n")
    print(f" corpus_sentences_index: {idx_df.height:,} rows")
    print(f" corpus_sentences: {mem_df.height:,} rows")
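To make the Step 1 DDL change concrete, here is a small sketch that applies an `is_canonical` column and its index to an in-memory SQLite database. Python's `sqlite3` module is used as a local stand-in for D1 (which is SQLite-based). Only the `is_canonical` column definition and the index statement come from the plan; the other columns are placeholders, not the real `_WORDS_DDL`.

```python
import sqlite3

# Only is_canonical and its index come from Step 1; the remaining columns
# are placeholders standing in for the real _WORDS_DDL.
WORDS_DDL = """
CREATE TABLE words (
    word TEXT PRIMARY KEY,
    has_phonology INTEGER NOT NULL DEFAULT 0,
    is_canonical INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX idx_words_is_canonical ON words (is_canonical);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(WORDS_DDL)
conn.executemany(
    "INSERT INTO words (word, has_phonology, is_canonical) VALUES (?, ?, ?)",
    [("cat", 1, 1), ("run", 1, 1), ("London", 1, 0)],
)
# A row inserted without the flag falls back to DEFAULT 0 (non-canonical).
conn.execute("INSERT INTO words (word, has_phonology) VALUES ('she', 1)")

canonical = conn.execute(
    "SELECT COUNT(*) FROM words WHERE is_canonical = 1"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM words").fetchone()[0]
```

Storing the flag as `INTEGER NOT NULL DEFAULT 0` means a row is non-canonical unless the emitter sets it explicitly, which errs on the side of hiding rows from canonical-scope routes rather than leaking them.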
- [ ] Step 4: In `emit_d1_sql.py`'s main `emit_d1_sql()` function, the `word_properties` / `word_freq_bands` / `word_percentiles` emitters currently iterate over all words in words.parquet. Update each to skip rows where `is_canonical = 0` (those rows don't have norm data; emitting NULL bloats D1 with no value). The simplest path: filter words_df by `pl.col("is_canonical")` before computing `_prop_rows()` / `_band_rows()` / `_pct_rows()`.
- [ ] Step 5: In `export-to-d1.py`:
  - Add `DROP TABLE IF EXISTS corpus_sentences;` and `DROP TABLE IF EXISTS corpus_sentences_index;` to the drop block (in reverse-dependency order: drop children before parents).
  - Remove `DROP TABLE IF EXISTS minimal_pairs;` (legacy; we have `pairs`).
  - After the existing `emit_d1_sql(...)` call, add:
    from phonolex_data.runtime.emit_d1_sql import emit_corpus_sentences_d1
    emit_corpus_sentences_d1(RUNTIME_DIR, OUTPUT_PATH)
- [ ] Step 6: Regenerate the seed:
uv run python packages/web/workers/scripts/export-to-d1.py 2>&1 | tail -10
grep "^CREATE TABLE" packages/web/workers/scripts/d1-seed.sql
Expected: 13 CREATE TABLE statements (was 11 — added corpus_sentences + corpus_sentences_index; pairs_full is gone; words_full is gone; words/pairs gain is_canonical column).
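If the bare count from `grep` is off, it helps to see which table names are actually present. The following is a hypothetical helper, not part of the repo; it assumes unquoted table names and `CREATE TABLE` at the start of a line, as the `grep` pattern above does.

```python
import re
from typing import List

# Matches "CREATE TABLE name" or "CREATE TABLE IF NOT EXISTS name" at line start.
CREATE_TABLE_RE = re.compile(
    r"^CREATE TABLE\s+(?:IF NOT EXISTS\s+)?([A-Za-z_][A-Za-z0-9_]*)",
    re.MULTILINE,
)


def created_tables(seed_sql: str) -> List[str]:
    """Return table names in the order their CREATE TABLE statements appear."""
    return CREATE_TABLE_RE.findall(seed_sql)


# Toy seed text; the real d1-seed.sql is far larger.
sample = """\
DROP TABLE IF EXISTS words;
CREATE TABLE words (word TEXT PRIMARY KEY, is_canonical INTEGER NOT NULL DEFAULT 0);
CREATE INDEX idx_words_is_canonical ON words (is_canonical);
CREATE TABLE pairs (word1 TEXT, word2 TEXT, is_canonical INTEGER NOT NULL DEFAULT 0);
"""
tables = created_tables(sample)
```

Against the real seed, one would read the file with `Path(...).read_text()` and check that `corpus_sentences` and `corpus_sentences_index` appear while `pairs_full` does not.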
- [ ] Step 7: Re-chunk + apply locally:
uv run python packages/web/workers/scripts/chunk-seed-sql.py 2>&1 | tail -10
cd packages/web/workers
for i in $(seq -w 0 12); do
f="scripts/d1-chunks/chunk_${i}.sql"
[ -f "$f" ] || break
echo "=== chunk_${i} ==="
npx wrangler d1 execute phonolex --local --file "$f" 2>&1 | grep -E "ERROR|executed successfully"
done
# Verify row counts
for t in words pairs corpus_sentences_index corpus_sentences; do
npx wrangler d1 execute phonolex --local \
--command "SELECT COUNT(*) FROM $t;" 2>&1 | grep -A1 '"COUNT'
done
# Verify is_canonical column
npx wrangler d1 execute phonolex --local \
--command "SELECT COUNT(*) AS canonical FROM words WHERE is_canonical = 1;" 2>&1 | grep -A1 'canonical'
Expected: all chunks succeed; counts match parquet; canonical=47K of 125K words.
- [ ] Step 8: Commit:
cd /Users/jneumann/Repos/PhonoLex
git add packages/data/src/phonolex_data/runtime/emit_d1_sql.py \
packages/web/workers/scripts/export-to-d1.py \
packages/web/workers/scripts/d1-seed.sql \
packages/web/workers/scripts/d1-chunks/
git commit -m "feat(d1): is_canonical on words/pairs + corpus_sentences{,_index} tables

emit_d1_sql learns the new schema: is_canonical column on words + pairs;
two new corpus retrieval tables (index + membership) backed by the
existing corpus_sentences*.parquet artifacts. Skip non-canonical rows
when emitting word_properties / word_freq_bands / word_percentiles
(no norm data for them).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 4: Extract compileWordFilter helper
Files:
- Create: packages/web/workers/src/lib/wordFilter.ts
- Modify: packages/web/workers/src/routes/words.ts
- [ ] Step 1: Read the current pattern + filter + exclude_phonemes + cv_shape + similar_to handling in `routes/words.ts` (lines ~100-150 currently). This is what gets lifted.
- [ ] Step 2: Create `lib/wordFilter.ts`:
import type { Pattern, WordSearchBody } from '../types';
import { buildPatternClauses } from './patterns';
import { partitionFilterColumns, isWordsTableColumn } from './queries';
import { normalizePhoneme } from './normalize';

export interface CompiledFilter {
  wordsWhere: string[];
  propsWhere: string[];
  params: unknown[];
  needsMedialPostFilter: boolean;
  medialSequences: string[][];
}

/**
 * Compile a search body into SQL WHERE-clause fragments + bind params.
 * Used by /api/words/search and /api/sentences; both apply the same
 * filter semantics to a different result projection.
 *
 * The `canonical_only` option appends `w.is_canonical = 1` to wordsWhere
 * so the caller doesn't have to remember. Default true (Word Lists and
 * corpus retrieval both want canonical scope for content-POS constraints).
 */
export function compileWordFilter(
  body: WordSearchBody,
  opts: { canonical_only?: boolean } = {},
): CompiledFilter {
  const canonicalOnly = opts.canonical_only ?? true;
  // ... port the constraint-translation block from routes/words.ts verbatim ...
  // Add at the start of wordsWhere:
  //   if (canonicalOnly) wordsWhere.push('w.is_canonical = 1');
}
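The contract of the helper (WHERE fragments plus positional bind params, canonical by default) can be sketched language-neutrally. The Python below is an illustration only, not a port of `routes/words.ts`: it handles a single constraint, and the `w.cv_shape` column name is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class CompiledFilter:
    words_where: List[str] = field(default_factory=list)
    props_where: List[str] = field(default_factory=list)
    params: List[Any] = field(default_factory=list)


def compile_word_filter(body: Dict[str, Any], canonical_only: bool = True) -> CompiledFilter:
    """Translate a search body into WHERE fragments + bind params (cv_shape only)."""
    out = CompiledFilter()
    if canonical_only:
        # Goes first so every caller gets canonical scope unless it opts out.
        out.words_where.append("w.is_canonical = 1")
    shapes = body.get("cv_shape") or []
    if shapes:
        placeholders = ", ".join("?" for _ in shapes)
        # Assumed column name; the real schema may differ.
        out.words_where.append(f"w.cv_shape IN ({placeholders})")
        out.params.extend(shapes)
    return out


default = compile_word_filter({"cv_shape": ["CVC", "CV"]})
full_vocab = compile_word_filter({}, canonical_only=False)
```

Because callers bind `params` positionally, each parameter is appended at the same moment as its placeholder. With a single shared `params` list, that only lines up if the fragments appear in the final SQL in the same order they were compiled, which is worth checking when porting the real block.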
- [ ] Step 3: Update `routes/words.ts` to import and call `compileWordFilter(body)`. Delete the inline duplication. Routes that need full vocab (none for now) pass `{ canonical_only: false }`.
- [ ] Step 4: Re-run worker tests:
cd packages/web/workers && npx tsc --noEmit && npm test 2>&1 | tail -10
Expected: all tests pass (this is a refactor — same behavior).
- [ ] Step 5: Commit:
git add packages/web/workers/src/lib/wordFilter.ts packages/web/workers/src/routes/words.ts
git commit -m "refactor(workers): extract compileWordFilter; default is_canonical=1

No behavior change for /api/words/search (the canonical filter matches
the implicit assumption today: words.parquet was already
content-POS-filtered). The new /api/sentences endpoint will reuse this
helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 5: Worker /api/sentences route (D1-backed)
Files:
- Create: packages/web/workers/src/routes/sentences.ts
- Modify: packages/web/workers/src/index.ts
- Create: packages/web/workers/src/__tests__/sentences.test.ts
- [ ] Step 1: Write tests first (TDD). Create `__tests__/sentences.test.ts`:
import { describe, expect, it } from 'vitest';
import { SELF } from 'cloudflare:test';

describe('POST /api/sentences', () => {
  it('returns sentences ordered by naturalness desc', async () => {
    const res = await SELF.fetch('http://localhost/api/sentences', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ top_k: 5 }),
    });
    expect([200, 500]).toContain(res.status);
    if (res.status === 200) {
      const body = (await res.json()) as { items: Array<{ naturalness_score: number | null; sentence_id: number; text: string }> };
      expect(Array.isArray(body.items)).toBe(true);
      const scores = body.items.map((i) => i.naturalness_score ?? -Infinity);
      for (let i = 1; i < scores.length; i++) {
        expect(scores[i]).toBeLessThanOrEqual(scores[i - 1]);
      }
    }
  });

  it('intersects with cv_shape filter', async () => {
    const res = await SELF.fetch('http://localhost/api/sentences', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ cv_shape: ['CVC'], top_k: 5 }),
    });
    expect([200, 500]).toContain(res.status);
  });

  it('returns valid top_k cap', async () => {
    const res = await SELF.fetch('http://localhost/api/sentences', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ top_k: 1000 }), // server caps at 500
    });
    expect([200, 500]).toContain(res.status);
    if (res.status === 200) {
      const body = (await res.json()) as { items: unknown[] };
      expect(body.items.length).toBeLessThanOrEqual(500);
    }
  });
});
- [ ] Step 2: Implement `routes/sentences.ts`:
import { Hono } from 'hono';
import type { Env, WordSearchBody } from '../types';
import { compileWordFilter } from '../lib/wordFilter';

const sentences = new Hono<{ Bindings: Env }>();

interface SentencesBody extends WordSearchBody {
  top_k?: number;
}

sentences.post('/', async (c) => {
  const body = await c.req.json<SentencesBody>();
  const topK = Math.min(Math.max(body.top_k ?? 50, 1), 500);

  // Default: filter words to canonical scope (content-POS) for the user's
  // constraint matching. Sentences are matched on their is_content=1 tokens.
  const { wordsWhere, propsWhere, params } = compileWordFilter(body);
  const wordsWhereSQL = wordsWhere.length ? wordsWhere.join(' AND ') : '1=1';
  const propsWhereSQL = propsWhere.length ? propsWhere.join(' AND ') : '1=1';

  const sql = `
    WITH canonical_surviving AS (
      SELECT w.word
      FROM words w
      INNER JOIN word_properties wp ON w.word = wp.word
      WHERE ${wordsWhereSQL} AND ${propsWhereSQL}
    ),
    matching_sentences AS (
      SELECT cs.sentence_id
      FROM corpus_sentences cs
      WHERE cs.is_content = 1
      GROUP BY cs.sentence_id
      HAVING SUM(CASE WHEN cs.surface NOT IN (SELECT word FROM canonical_surviving) THEN 1 ELSE 0 END) = 0
    )
    SELECT csi.sentence_id, csi.text, csi.source, csi.source_record_id,
           csi.n_tokens, csi.n_content_in_vocab, csi.naturalness_score
    FROM corpus_sentences_index csi
    INNER JOIN matching_sentences USING (sentence_id)
    ORDER BY csi.naturalness_score DESC NULLS LAST
    LIMIT ?
  `;

  const { results } = await c.env.DB.prepare(sql)
    .bind(...params, topK)
    .all<{
      sentence_id: number;
      text: string;
      source: string;
      source_record_id: string | null;
      n_tokens: number;
      n_content_in_vocab: number;
      naturalness_score: number | null;
    }>();

  return c.json({ items: results, total: results.length });
});

export default sentences;
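The core of the route is the GROUP BY / HAVING step: a sentence matches only if none of its content tokens falls outside the surviving word set. The sketch below replays that logic against an in-memory SQLite database, with Python's `sqlite3` as a local stand-in for D1. The tables are cut down to the columns the query needs, the `cv_shape` column on `words` is an assumption, and `ORDER BY ... IS NULL, ... DESC` is swapped in for `NULLS LAST` so the sketch also runs on older SQLite builds.

```python
import sqlite3


def clamp_top_k(top_k=None, default=50, lo=1, hi=500):
    """Mirror of Math.min(Math.max(top_k ?? 50, 1), 500) in the route."""
    return min(max(default if top_k is None else top_k, lo), hi)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (word TEXT PRIMARY KEY, cv_shape TEXT,
                    is_canonical INTEGER NOT NULL DEFAULT 0);
CREATE TABLE corpus_sentences_index (sentence_id INTEGER PRIMARY KEY,
                                     text TEXT, naturalness_score REAL);
CREATE TABLE corpus_sentences (sentence_id INTEGER, surface TEXT, is_content INTEGER);
""")
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    ("cat", "CVC", 1), ("sat", "CVC", 1), ("dog", "CVC", 1), ("garden", "CVCCVC", 1),
])
conn.executemany("INSERT INTO corpus_sentences_index VALUES (?, ?, ?)", [
    (1, "The cat sat.", 0.4),
    (2, "The dog sat.", 0.9),
    (3, "The cat sat in the garden.", 0.7),
])
conn.executemany("INSERT INTO corpus_sentences VALUES (?, ?, ?)", [
    (1, "the", 0), (1, "cat", 1), (1, "sat", 1),
    (2, "the", 0), (2, "dog", 1), (2, "sat", 1),
    (3, "the", 0), (3, "cat", 1), (3, "sat", 1),
    (3, "in", 0), (3, "the", 0), (3, "garden", 1),
])

SQL = """
WITH canonical_surviving AS (
    SELECT w.word FROM words w
    WHERE w.is_canonical = 1 AND w.cv_shape IN (?)
),
matching_sentences AS (
    SELECT cs.sentence_id
    FROM corpus_sentences cs
    WHERE cs.is_content = 1
    GROUP BY cs.sentence_id
    -- Count content tokens outside the surviving set; keep sentences with none.
    HAVING SUM(CASE WHEN cs.surface NOT IN (SELECT word FROM canonical_surviving)
                    THEN 1 ELSE 0 END) = 0
)
SELECT csi.sentence_id, csi.text, csi.naturalness_score
FROM corpus_sentences_index csi
INNER JOIN matching_sentences USING (sentence_id)
ORDER BY csi.naturalness_score IS NULL, csi.naturalness_score DESC
LIMIT ?
"""
rows = conn.execute(SQL, ("CVC", clamp_top_k(1000))).fetchall()
```

Sentence 3 is rejected because "garden" is a content token outside the CVC set, and the two survivors come back highest score first. One consequence of filtering on `is_content = 1` before grouping is that a sentence with no content tokens at all never reaches `matching_sentences`.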
- [ ] Step 3: Mount in `index.ts`. Remove the `generation` route import; add the `sentences` import:
import sentences from './routes/sentences';
// remove: import generation from './routes/generation';
app.route('/api/sentences', sentences);
// remove: app.route('/api/generation', generation);
- [ ] Step 4: Run tests + smoke:
cd packages/web/workers && npm test -- sentences 2>&1 | tail -10
npx wrangler dev > /tmp/wrangler.log 2>&1 &
sleep 8
curl -s -X POST http://localhost:8787/api/sentences \
-H 'Content-Type: application/json' \
-d '{"top_k": 5}' | python3 -m json.tool
curl -s -X POST http://localhost:8787/api/sentences \
-H 'Content-Type: application/json' \
-d '{"cv_shape": ["CVC"], "top_k": 5}' | python3 -m json.tool
pkill -f "wrangler dev"
Expected: 3 unit tests pass; both curl calls return sentences ordered by naturalness; the cv_shape query returns sentences containing only CVC content words.
- [ ] Step 5: Commit:
git add packages/web/workers/src/routes/sentences.ts \
packages/web/workers/src/index.ts \
packages/web/workers/src/__tests__/sentences.test.ts
git commit -m "feat(workers): /api/sentences served from D1 in the Worker

Replaces the FastAPI container proxy with a native Worker handler.
Reuses compileWordFilter so /api/sentences shares constraint semantics
with /api/words/search. Pre-computed naturalness_score on corpus_sentences_index
drives ordering, so no live reranker is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 6: Delete Worker generation proxy + container
Files:
- Delete: packages/web/workers/src/routes/generation.ts
- Delete: packages/web/workers/src/containers/generation.ts
- Delete: packages/web/workers/src/__tests__/generation.test.ts
- Modify: packages/web/workers/src/types.ts (drop GENERATION_SERVICE from Env)
- [ ] Step 1: Confirm no remaining imports:
grep -rn "GENERATION_SERVICE\|containers/generation\|routes/generation\|GenerationServer" \
packages/web/workers/src/ 2>&1 | head -10
- [ ] Step 2: Delete files + update types:
git rm packages/web/workers/src/routes/generation.ts
git rm packages/web/workers/src/containers/generation.ts
git rm packages/web/workers/src/__tests__/generation.test.ts
# Edit types.ts to remove GENERATION_SERVICE binding + GenerationServer import
- [ ] Step 3: Typecheck + tests:
cd packages/web/workers && npx tsc --noEmit && npm test 2>&1 | tail -10
- [ ] Step 4: Commit:
git commit -am "refactor(workers): delete generation proxy + GenerationServer DO class

Generation now lives entirely in the Worker via /api/sentences (T5).
The container proxy + DurableObject class are dead. wrangler.toml
container/migration entries get removed in T7.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 7: Drop containers from wrangler.toml + v2 DO migration
Files:
- Modify: packages/web/workers/wrangler.toml
- [ ] Step 1: Add the v2 migration tags BEFORE removing the bindings (so Cloudflare's DO state migrates cleanly):
[[migrations]]
tag = "v2"
deleted_classes = ["GenerationServer"]
[[env.staging.migrations]]
tag = "v2"
deleted_classes = ["GenerationServer"]
These go AFTER the existing v1 migration entries.
- [ ] Step 2: Remove the container + DO binding blocks:
  - Production `[[containers]]` block
  - Production `[[durable_objects.bindings]]` block for `GENERATION_SERVICE`
  - Staging `[[env.staging.containers]]` block
  - Staging `[[env.staging.durable_objects.bindings]]` block for `GENERATION_SERVICE`
- [ ] Step 3: Dry-run validate:
cd packages/web/workers && npx wrangler deploy --env staging --dry-run 2>&1 | tail -10
Expected: deploy plan mentions migration v2 retiring GenerationServer; no container image build.
- [ ] Step 4: Commit:
git add packages/web/workers/wrangler.toml
git commit -m "chore(workers): retire GenerationServer DurableObject + container

Adds v2 migration to release DO storage; removes container + binding
entries from production and staging. Next deploy builds no container
and releases the GENERATION_SERVICE namespace cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 8: Add is_canonical = 1 to existing routes that need it
Files:
- Modify: packages/web/workers/src/routes/words.ts (audit /api/words/word-list, /api/words/batch, /api/words/norms-dump)
- Modify: packages/web/workers/src/routes/contrastive.ts
- [ ] Step 1: Audit each route file that queries `words` or `pairs` directly (not via `compileWordFilter`):
grep -rn "FROM words\|FROM pairs\|words w\b\|pairs p\b" packages/web/workers/src/routes/ | head -20
For each match, decide:
- The route is presenting content-POS scope to the user → add WHERE w.is_canonical = 1 (or p.is_canonical = 1)
- The route is doing phonological work (similarity, exclusion) across full vocab → leave alone
Specifically:
- /api/words/search — uses compileWordFilter → already canonical=1 by default
- /api/words/word-list — likely canonical scope (clinical word lists) → add filter
- /api/words/batch — depends; if it's IN (?, ?, ...) lookup it doesn't need the filter
- /api/words/norms-dump — canonical scope
- /api/words/:word — single-word lookup; no filter
- /api/similarity/search — full vocab (sound similarity over any phonology-bearing word) → no filter
- /api/contrastive/* — Contrastive Sets tool; canonical scope for both words in a pair → WHERE p.is_canonical = 1
- [ ] Step 2: Apply the audit. For routes using inline SQL, add the clause manually. Routes that go through the helper are already covered.
- [ ] Step 3: Run tests:
cd packages/web/workers && npm test 2>&1 | tail -15
- [ ] Step 4: Manual smoke against local D1 to verify the canonical filter is applied correctly:
cd packages/web/workers && npx wrangler dev > /tmp/wrangler.log 2>&1 &
sleep 8
# word-list should return only content-POS words
curl -s -X POST http://localhost:8787/api/words/word-list \
-H 'Content-Type: application/json' \
-d '{"include_phonemes":["k"]}' | python3 -c "
import sys, json
data = json.load(sys.stdin)
print('count:', data.get('total'))
words = data.get('items', [])[:5]
print('sample POS:', [w.get('pos') for w in words])
"
# Similar query on /api/words/search (already through helper)
curl -s -X POST http://localhost:8787/api/words/search \
-H 'Content-Type: application/json' \
-d '{"cv_shape":["CVC"],"limit":5}' | python3 -c "
import sys, json
data = json.load(sys.stdin)
print('count:', data.get('total'))
print('sample POS:', [it.get('pos') for it in data.get('items', [])])
"
pkill -f "wrangler dev"
Expected: sample POS values are all NOUN/VERB/ADJ/ADV; no PROPN / PRON / DET / AUX.
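Rather than eyeballing the printed POS samples, the same check can be turned into an assertion. This is a hypothetical helper, assuming the response shape used by the smoke commands above (an `items` list whose entries carry a `pos` field); the sample payload is made up for illustration.

```python
import json
from typing import Any, Dict, List

CANONICAL_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def non_canonical_items(response_text: str) -> List[Dict[str, Any]]:
    """Return every item whose POS falls outside the content-POS set."""
    data = json.loads(response_text)
    return [it for it in data.get("items", []) if it.get("pos") not in CANONICAL_POS]


# Made-up payload with one deliberate offender (AUX).
sample = json.dumps({"total": 3, "items": [
    {"word": "cat", "pos": "NOUN"},
    {"word": "kick", "pos": "VERB"},
    {"word": "could", "pos": "AUX"},
]})
offenders = non_canonical_items(sample)
offending_words = [o["word"] for o in offenders]
```

In a real smoke run the `curl` output would be piped in on stdin, and an empty offenders list indicates the canonical filter is applied.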
- [ ] Step 5: Commit:
git add packages/web/workers/src/routes/
git commit -m "feat(workers): WHERE is_canonical = 1 on routes presenting content-POS scope

Word Lists, Custom Word Lists, norms-dump, and contrastive routes get
the canonical filter. Sound similarity and single-word lookup keep
full-vocabulary scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 9: Delete packages/generators/, packages/governors/, packages/generation/
Files:
- Delete entire dirs
- Modify: root pyproject.toml (drop workspace entries)
- [ ] Step 1: Confirm no live importers:
grep -rn "from phonolex_generators\|import phonolex_generators\|from phonolex_governors\|import phonolex_governors" \
packages/ --include="*.py" 2>&1 | grep -vE "packages/(generators|governors|generation)/" | head -10
Expected: no live consumers outside the deleted dirs (the corpus.py reference is in packages/generation/ which is also being deleted).
- [ ] Step 2: Delete:
git rm -r packages/generators/
git rm -r packages/governors/
git rm -r packages/generation/
- [ ] Step 3: Update root pyproject.toml — find workspace members and remove the three:
grep -n "members\|packages/generation\|packages/generators\|packages/governors" pyproject.toml
Remove the entries.
- [ ] Step 4: Drop the now-orphan Python ML deps from `packages/data/pyproject.toml`. Check what's there:
cat packages/data/pyproject.toml | grep -E "spacy|sentence-transformers|torch|transformers|lightgbm|en_core"
Drop spaCy (no more morph features). Anything else that's only consumed by deleted code → drop.
- [ ] Step 5: Re-sync:
uv sync --all-packages 2>&1 | tail -10
- [ ] Step 6: Run remaining tests:
uv run python -m pytest packages/data/tests/ \
--ignore=packages/data/tests/test_datasets.py \
--ignore=packages/data/tests/test_new_loaders.py 2>&1 | tail -10
Expected: all data tests pass (the morph feature test, if any, was removed in T1).
- [ ] Step 7: Commit:
git add pyproject.toml packages/data/pyproject.toml uv.lock
git commit -am "chore: delete generators/, governors/, generation/ packages

All three are dead post-CSP-retirement. generators carries csp/ +
MLM editor + reranker; governors is its trie-checking dependency
chain; generation is the FastAPI server. No live consumer survives.

Also drops spaCy from packages/data/pyproject.toml; the only
consumer was _populate_morph_features (deleted in T1).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 10: LFS scope reduction + delete dead artifacts
Files:
- Modify: .gitattributes, .gitignore
- Delete: 6 dead data/runtime/ files + 2 retired (words_full, pairs_full)
- [ ] Step 1: Delete dead artifacts:
for f in selectional.parquet skeletons.parquet \
naturalness_scorer_head.pt naturalness_reference.npy \
naturalness_reference_meta.jsonl reranker_v2.pkl \
words_full.parquet pairs_full.parquet; do
[ -f "data/runtime/$f" ] && git rm "data/runtime/$f"
done
- [ ] Step 2: Update `.gitignore` by appending:
# data/runtime/ is local build cache. Rebuild via:
# uv run python packages/data/scripts/build_runtime_parquet.py
# uv run python packages/web/workers/scripts/export-to-d1.py
# Only d1-seed.sql is committed (LFS).
data/runtime/*.parquet
data/runtime/*.pt
data/runtime/*.pkl
data/runtime/*.npy
data/runtime/*.jsonl
- [ ] Step 3: LFS-untrack the remaining parquets:
git lfs untrack "data/runtime/*.parquet"
git lfs untrack "data/runtime/*.pkl"
git lfs untrack "data/runtime/corpus_sentences.parquet"
git lfs untrack "data/runtime/corpus_sentences_index.parquet"
After this, .gitattributes should contain only the d1-seed.sql LFS line.
- [ ] Step 4: Remove the tracked parquets from HEAD (they're now gitignored; on-disk copies survive):
git rm --cached data/runtime/words.parquet
git rm --cached data/runtime/pairs.parquet
[ -f data/runtime/edges.parquet ] && git rm --cached data/runtime/edges.parquet
[ -f data/runtime/corpus_sentences.parquet ] && git rm --cached data/runtime/corpus_sentences.parquet
[ -f data/runtime/corpus_sentences_index.parquet ] && git rm --cached data/runtime/corpus_sentences_index.parquet
- [ ] Step 5: Verify:
cat .gitattributes
ls data/runtime/ 2>&1 | head -10
git ls-files data/runtime/
Expected: .gitattributes only contains d1-seed.sql LFS line; data/runtime/ on disk has the regenerated parquets (untracked); git ls-files data/runtime/ shows no files.
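The `.gitattributes` expectation can also be checked mechanically. The sketch below is a hypothetical helper that lists every pattern still routed through LFS. The sample file contents are illustrative only, written in the `filter=lfs diff=lfs merge=lfs -text` form that `git lfs track` emits; the repo's actual file may differ.

```python
from typing import List


def lfs_patterns(gitattributes_text: str) -> List[str]:
    """Return the path patterns that carry the filter=lfs attribute."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


# Illustrative before/after contents, not copied from the repo.
before = """\
data/runtime/*.parquet filter=lfs diff=lfs merge=lfs -text
data/runtime/*.pkl filter=lfs diff=lfs merge=lfs -text
packages/web/workers/scripts/d1-seed.sql filter=lfs diff=lfs merge=lfs -text
"""
after = "packages/web/workers/scripts/d1-seed.sql filter=lfs diff=lfs merge=lfs -text\n"

remaining = lfs_patterns(after)
```

After Step 3 the helper should report exactly one pattern, the seed file.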
- [ ] Step 6: Commit:
git add .gitattributes .gitignore
git commit -m "chore: drop dead artifacts; LFS narrows to d1-seed.sql only

Removes selectional/skeletons/naturalness*/reranker_v2/words_full/
pairs_full from the tree. .gitattributes scopes to just the seed.
data/runtime/* is gitignored as developer-local build cache going
forward; rebuild via build_runtime_parquet.py + export-to-d1.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 11: Trim deploy workflows paths-filter
Files:
- Modify: .github/workflows/deploy-staging.yml
- Modify: .github/workflows/deploy.yml
- [ ] Step 1: Update the `data:` paths-filter in both workflows to:
data:
- 'packages/web/workers/scripts/d1-seed.sql'
Drop all the other path filters (packages/data/**, data/runtime/**, packages/web/workers/scripts/export-to-d1.py, packages/web/workers/scripts/config.py, etc.). The seed is the only thing CI touches.
- [ ] Step 2: Eyeball commit-message-only checks (no automated linter for actions YAML).
- [ ] Step 3: Commit:
git add .github/workflows/
git commit -m "ci: paths-filter only watches d1-seed.sql

After the D1-only refactor, the seed is the only artifact whose
change should trigger a re-seed. All other paths (source TSVs,
parquets, build scripts) are local-dev concerns now.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 12: CLAUDE.md rewrite + frontend cold-start copy
Files:
- Modify: CLAUDE.md
- Modify: packages/web/frontend/src/components/tools/GovernedGenerationTool/ (find cold-start copy)
- [ ] Step 1: Find cold-start copy:
grep -rn "60s\|cold start\|cold-start\|generation server\|first request" \
packages/web/frontend/src/components/ 2>&1 | head -10
Drop any warnings that no longer apply.
- [ ] Step 2: CLAUDE.md rewrites. There are several sections to update:
  - "Architecture" section: replace CSP-Phase-1-and-2 description with corpus-retrieval-from-D1
  - "Generation Runtime Data Contract" section: drop selectional/skeletons/naturalness references; describe the unified words+pairs schema with is_canonical; describe corpus_sentences in D1
  - "Project Structure" tree: remove `packages/generation/`, `packages/generators/`, `packages/governors/`
  - "Dev Setup": drop the generation server start command + the heavy ML dep descriptions; flow is `build_runtime_parquet.py` + `export-to-d1.py` + `wrangler dev` (local) or LFS + wrangler deploy (CI)
  - "Gotchas": drop references to the FastAPI server, container cold-start, spaCy / morph features

This is a real rewrite. ~30% of CLAUDE.md changes.
- [ ] Step 3: Commit:
git add packages/web/frontend/src/ CLAUDE.md
git commit -m "docs: rewrite CLAUDE.md + frontend copy for D1-only architecture

CSP/reranker/container described as retired; unified words+pairs with
is_canonical described as the new schema; corpus retrieval described
as Worker-D1 not container-Polars. Frontend drops the 60s cold-start
warning (no more container).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"
Task 13: E2E smoke + push + PR
- [ ] Step 1: Full local CI-equivalent run:
cd /Users/jneumann/Repos/PhonoLex
(cd packages/web/workers && npx tsc --noEmit && npm test 2>&1 | tail -10)
(cd packages/web/frontend && npm run type-check 2>&1 | tail -5)
(cd packages/web/frontend && npm run lint 2>&1 | tail -5)
(cd packages/web/frontend && npm run build 2>&1 | tail -5)
uv run python -m pytest packages/data/tests/ \
--ignore=packages/data/tests/test_datasets.py \
--ignore=packages/data/tests/test_new_loaders.py 2>&1 | tail -10
Expected: all green.
- [ ] Step 2: End-to-end Worker smoke:
cd packages/web/workers && npx wrangler dev > /tmp/wrangler.log 2>&1 &
sleep 8
# Sentences (new D1-backed)
curl -s -X POST http://localhost:8787/api/sentences \
-H 'Content-Type: application/json' \
-d '{"top_k": 5}' | python3 -m json.tool | head -25
# Existing endpoints (verify is_canonical filter works correctly)
curl -s -X POST http://localhost:8787/api/words/search \
-H 'Content-Type: application/json' \
-d '{"cv_shape":["CVC"],"limit":5}' | python3 -m json.tool | head -30
# Generation routes are gone
curl -s -o /dev/null -w "/api/generation/generate-sentences: %{http_code}\n" \
http://localhost:8787/api/generation/generate-sentences
curl -s -o /dev/null -w "/api/generation/sentences: %{http_code}\n" \
http://localhost:8787/api/generation/sentences
pkill -f "wrangler dev"
Expected: /api/sentences returns sentences; /api/words/search returns content-POS words with CVC shape; /api/generation/* returns 404.
- [ ] Step 3: Push:
git push -u origin refactor/d1-only-drop-csp 2>&1 | tail -10
- [ ] Step 4: Open PR:
gh pr create --base develop --head refactor/d1-only-drop-csp \
--title "refactor: D1-only generation; drop CSP + unify words/pairs" \
--body-file - <<'EOF'
## Summary
- Moves /api/sentences into the Worker (D1-backed); no more FastAPI container
- Unifies `words` (47K → 125K rows) and `pairs` (60K → 642K rows) with an
`is_canonical` column flagging the content-POS subset
- Adds `corpus_sentences{,_index}` D1 tables
- Deletes `packages/generators/`, `packages/governors/`, `packages/generation/`
- Drops spaCy + sentence-transformers + torch + transformers + lightgbm
- Drops the 7 spaCy-derived morph columns from words schema
- Reduces LFS to only `d1-seed.sql`; everything else in `data/runtime/` is
local build cache (gitignored)
- CI paths-filter narrows to just the seed
## Why
The CSP synthetic generation paradigm was already deprecated; the live UI
only uses corpus retrieval. The whole stack was dragging an unused ML
dependency chain (~2 GB container, 30s cold start) into the deploy. The
canonical-vs-full split was hidden behind two parquet files the container
read directly. Consolidating into one Worker + D1 surface gets us:
- One service to deploy
- Sub-100ms cold start
- One LFS file
- No model downloads in CI
- Full vocabulary visible to the Worker (no more container-only data)
## Test plan
- [ ] Worker tests green
- [ ] Frontend lint + build green
- [ ] Python data tests green (CI scope)
- [ ] /api/sentences live in dev with corpus retrieval
- [ ] /api/words/search returns only content-POS words by default (is_canonical=1)
- [ ] /api/similarity/search and /api/words/:word still work over the full vocabulary
- [ ] /api/generation/* returns 404
- [ ] Staging deploy succeeds without building a container
🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
Closing checklist
- [ ] 13 tasks committed
- [ ] Local CI-equivalent checks green
- [ ] /api/sentences works against D1
- [ ] /api/words/search correctly filters to canonical
- [ ] /api/generation/* removed
- [ ] .gitattributes lists only d1-seed.sql
- [ ] No generators/, governors/, generation/ in HEAD
- [ ] No spaCy / morph features in HEAD
- [ ] PR opened against develop
- [ ] CI green on the pushed branch
- [ ] Deploy Staging succeeds (no container build, no Python pipeline)