D1-only + Drop CSP + Unified Tables — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task.

Goal: Move corpus retrieval into the Worker reading D1; unify words + pairs into single tables with is_canonical flag; delete the entire CSP/MLM/governor/generation-server Python stack; reduce LFS to a single tracked file.

Spec: docs/superpowers/specs/2026-05-15-d1-only-drop-csp-design.md. Branch: refactor/d1-only-drop-csp off origin/develop @ 425d992f.


Task 1: Unify words.parquet — add is_canonical, drop morph features

Files: - Modify: packages/data/src/phonolex_data/pipeline/words.py - Modify: packages/data/src/phonolex_data/pipeline/schema.py - Modify: packages/data/src/phonolex_data/runtime/schema.py

  • [ ] Step 1: In pipeline/schema.py WordRecord:
  • Add is_canonical: bool = False
  • Remove the 7 morph fields: number, person, tense, verb_form, mood, aspect, degree

  • [ ] Step 2: In pipeline/words.py:

  • Find the post-v5.2 POS filter (search for pos in {"NOUN" or is_content_pos or similar — the filter that drops PROPN/PRON rows). Remove it so all phonology-bearing rows survive.
  • After POS is populated on each WordRecord, set record.is_canonical = (record.pos in {"NOUN", "VERB", "ADJ", "ADV"}).
  • Delete the entire _populate_morph_features(words) function and its call site (search for "_populate_morph_features("). Drop the spacy import.
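
The flagging rule in Step 2 can be sketched as below. This is a minimal illustration, not the pipeline's code: the WordRecord here is a stand-in dataclass carrying only the two relevant fields, and CANONICAL_POS and flag_canonical are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

# Content-POS tags that define the canonical subset (from Step 2).
CANONICAL_POS = frozenset({"NOUN", "VERB", "ADJ", "ADV"})


@dataclass
class WordRecord:
    """Stand-in for the real pipeline WordRecord; only the relevant fields."""
    word: str
    pos: Optional[str] = None
    is_canonical: bool = False


def flag_canonical(records: List[WordRecord]) -> List[WordRecord]:
    """Set is_canonical on every record; no record is dropped."""
    for record in records:
        record.is_canonical = record.pos in CANONICAL_POS
    return records


records = flag_canonical([
    WordRecord("cat", "NOUN"),
    WordRecord("London", "PROPN"),
    WordRecord("she", "PRON"),
    WordRecord("quickly", "ADV"),
    WordRecord("zzz", None),  # POS unknown: kept, but not canonical
])
```

The point of the sketch is that the old filter becomes a flag: every row survives, and downstream consumers choose their scope by reading is_canonical.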

  • [ ] Step 3: In runtime/schema.py _CORE_WORDS_COLUMNS:

  • Add "is_canonical": pl.Boolean right after "has_phonology"
  • Remove the 7 morph columns

  • [ ] Step 4: Regenerate parquet locally:

cd /Users/jneumann/Repos/PhonoLex
uv run python packages/data/scripts/build_runtime_parquet.py 2>&1 | tail -10
uv run python -c "
import polars as pl
df = pl.read_parquet('data/runtime/words.parquet')
print('rows:', df.height)
print('is_canonical in cols:', 'is_canonical' in df.columns)
print('morph cols (should be 0):', sum(c in df.columns for c in ['number','person','tense','verb_form','mood','aspect','degree']))
print('canonical=1 count:', df.filter(pl.col('is_canonical')).height)
print('canonical=0 count:', df.filter(~pl.col('is_canonical')).height)
"

Expected: ~125,756 total; ~47,384 canonical=1; ~78K canonical=0; zero morph columns.

  • [ ] Step 5: Run data tests:
uv run python -m pytest packages/data/tests/ \
  --ignore=packages/data/tests/test_datasets.py \
  --ignore=packages/data/tests/test_new_loaders.py 2>&1 | tail -10

If any test asserts on the old POS-filtered shape or morph columns, fix the test to match the new schema (don't relax assertions).

  • [ ] Step 6: Commit:
git add packages/data/ data/runtime/words.parquet
git commit -m "feat(data): unified words.parquet with is_canonical; drop morph features

words.parquet now carries all ~125K phonology-bearing entries (was 47K
after the v5.2 POS filter). is_canonical column flags the 47K NOUN/VERB/
ADJ/ADV content-POS subset. Removed _populate_morph_features and the
7 spaCy-derived morph columns (number/person/tense/verb_form/mood/aspect/
degree) — only consumer was the now-deprecated CSP solver.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 2: Unify pairs.parquet — drop pairs_full, add is_canonical

Files: - Modify: packages/data/src/phonolex_data/pipeline/derived.py (or wherever minimal_pairs / pairs are computed) - Modify: packages/data/src/phonolex_data/runtime/emit_parquet.py (the pairs + pairs_full emit functions)

  • [ ] Step 1: Inspect current pairs emit logic:
grep -n "pairs\|_compute_minimal_pairs\|emit_pairs" packages/data/src/phonolex_data/pipeline/derived.py packages/data/src/phonolex_data/runtime/emit_parquet.py | head -20

Find where pairs.parquet gets filtered to content-POS-only vs pairs_full.parquet (which keeps all). The build pipeline computes the full set, then filters for pairs.parquet.

  • [ ] Step 2: Change the emit logic to produce ONE pairs.parquet (all ~642K rows) with is_canonical column. Drop the separate emit_pairs_full_parquet function and the pairs_full.parquet artifact.

  • Each pair row's is_canonical is set to True iff BOTH word1 and word2 are in the canonical (is_canonical=1) subset of words.

  • The two emit functions collapse into one.
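
The pair-level rule above (canonical only when both members are canonical) might look like the following. It is a sketch over plain tuples and a set; the real emit code works on the pipeline's own structures, and flag_pairs plus the toy data are hypothetical.

```python
def flag_pairs(pairs, canonical_words):
    """Yield each pair with is_canonical=True only when both members are canonical."""
    for word1, word2 in pairs:
        yield {
            "word1": word1,
            "word2": word2,
            "is_canonical": word1 in canonical_words and word2 in canonical_words,
        }


# Toy data: "bat" and "cat" are content words; "that" is not.
canonical_words = {"bat", "cat"}
pairs = [("bat", "cat"), ("bat", "that"), ("that", "hat")]
rows = list(flag_pairs(pairs, canonical_words))
```

All three input pairs are emitted; only the first is flagged, which is the property that lets one pairs.parquet stand in for both old artifacts.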

  • [ ] Step 3: Update CLAUDE.md sections that reference pairs_full as a separate artifact (commit covered in T12).

  • [ ] Step 4: Regenerate + verify:

uv run python packages/data/scripts/build_runtime_parquet.py 2>&1 | tail -5
uv run python -c "
import polars as pl
df = pl.read_parquet('data/runtime/pairs.parquet')
print('rows:', df.height)
print('is_canonical in cols:', 'is_canonical' in df.columns)
print('canonical=1 count:', df.filter(pl.col('is_canonical')).height)
import os
print('pairs_full exists:', os.path.exists('data/runtime/pairs_full.parquet'))
"

Expected: ~642K rows, ~60K canonical, pairs_full.parquet does not exist.

  • [ ] Step 5: Commit:
git add packages/data/ data/runtime/pairs.parquet
[ -f data/runtime/pairs_full.parquet ] && git rm data/runtime/pairs_full.parquet
git commit -m "feat(data): unified pairs.parquet with is_canonical; drop pairs_full

pairs.parquet now carries all ~642K minimal-pair rows. is_canonical
column flags the ~60K pairs where both words are in the canonical subset.
pairs_full.parquet retired — replaced by WHERE is_canonical = 0/1 filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 3: emit_d1_sql — DDL changes + corpus_sentences emit

Files: - Modify: packages/data/src/phonolex_data/runtime/emit_d1_sql.py - Modify: packages/web/workers/scripts/export-to-d1.py

  • [ ] Step 1: In emit_d1_sql.py:
  • Add "is_canonical" to _CORE_WORDS_FIELDS (right after "has_phonology")
  • Update _WORDS_DDL to include is_canonical INTEGER NOT NULL DEFAULT 0 and CREATE INDEX idx_words_is_canonical ON words (is_canonical);
  • Add "is_canonical" to _PAIRS_FIELDS
  • Update _PAIRS_DDL to include is_canonical INTEGER NOT NULL DEFAULT 0 and CREATE INDEX idx_pairs_is_canonical ON pairs (is_canonical);
  • Drop the 7 morph cols from _partition_property_cols if they're whitelisted there
  • Add new constants _CORPUS_INDEX_DDL, _CORPUS_INDEX_FIELDS, _CORPUS_MEMBERSHIP_DDL, _CORPUS_MEMBERSHIP_FIELDS as in the spec §3
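
To sanity-check the DDL shape from Step 1 before touching the emitter, the column and index can be exercised against an in-memory SQLite database. The table below is a trimmed, hypothetical stand-in (the real _WORDS_DDL has many more columns); only the is_canonical column definition and the index statement come from this plan.

```python
import sqlite3

# Trimmed, hypothetical DDL: the real _WORDS_DDL has many more columns.
WORDS_DDL = """
CREATE TABLE words (
    word TEXT PRIMARY KEY,
    has_phonology INTEGER NOT NULL DEFAULT 1,
    is_canonical INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX idx_words_is_canonical ON words (is_canonical);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(WORDS_DDL)
conn.executemany(
    "INSERT INTO words (word, is_canonical) VALUES (?, ?)",
    [("cat", 1), ("run", 1), ("she", 0)],
)
# A row inserted without the column falls back to DEFAULT 0.
conn.execute("INSERT INTO words (word) VALUES ('london')")

canonical = conn.execute(
    "SELECT COUNT(*) FROM words WHERE is_canonical = 1"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM words").fetchone()[0]
index_names = [row[1] for row in conn.execute("PRAGMA index_list('words')")]
```

The DEFAULT 0 means any emitter path that forgets to write the column produces non-canonical rows rather than failing, so the row-count checks in Step 7 are what actually catch a missed field.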

  • [ ] Step 2: Update _normalise_pairs_row (or add a new _normalise_pairs_row_with_canonical) so the emitted row tuple matches the new 9-field shape (existing 8 + is_canonical).

  • [ ] Step 3: Add new emit functions at the bottom of emit_d1_sql.py:

def emit_corpus_sentences_d1(parquet_dir: Path, output_path: Path) -> None:
    """Append corpus_sentences_index + corpus_sentences DDL + INSERTs."""
    idx_path = parquet_dir / "corpus_sentences_index.parquet"
    mem_path = parquet_dir / "corpus_sentences.parquet"
    if not idx_path.exists() or not mem_path.exists():
        print("  WARNING: corpus sentence parquets missing; skipping")
        return
    idx_df = pl.read_parquet(idx_path)
    mem_df = pl.read_parquet(mem_path)
    with output_path.open("a", encoding="utf-8") as fh:
        fh.write(_CORPUS_INDEX_DDL + "\n\n")
        for stmt in _emit_inserts("corpus_sentences_index", _CORPUS_INDEX_FIELDS,
                                  (tuple(r[f] for f in _CORPUS_INDEX_FIELDS)
                                   for r in idx_df.iter_rows(named=True))):
            fh.write(stmt + "\n\n")
        fh.write(_CORPUS_MEMBERSHIP_DDL + "\n\n")
        for stmt in _emit_inserts("corpus_sentences", _CORPUS_MEMBERSHIP_FIELDS,
                                  (tuple(r[f] for f in _CORPUS_MEMBERSHIP_FIELDS)
                                   for r in mem_df.iter_rows(named=True))):
            fh.write(stmt + "\n\n")
    print(f"  corpus_sentences_index: {idx_df.height:,} rows")
    print(f"  corpus_sentences: {mem_df.height:,} rows")
  • [ ] Step 4: In emit_d1_sql.py's main emit_d1_sql() function, the word_properties / word_freq_bands / word_percentiles emitters currently iterate over all words in words.parquet. Update each to skip rows where is_canonical = 0 (those rows don't have norm data; emitting NULL bloats D1 with no value). The simplest path: filter words_df by pl.col("is_canonical") before computing _prop_rows() / _band_rows() / _pct_rows().

  • [ ] Step 5: In export-to-d1.py:

  • Add DROP TABLE IF EXISTS corpus_sentences;, DROP TABLE IF EXISTS corpus_sentences_index; to the drop block (in reverse-dependency order — drop children before parents).
  • Remove DROP TABLE IF EXISTS minimal_pairs; (legacy; we have pairs).
  • After the existing emit_d1_sql(...) call, add:

    from phonolex_data.runtime.emit_d1_sql import emit_corpus_sentences_d1
    emit_corpus_sentences_d1(RUNTIME_DIR, OUTPUT_PATH)
    
  • [ ] Step 6: Regenerate the seed:

uv run python packages/web/workers/scripts/export-to-d1.py 2>&1 | tail -10
grep "^CREATE TABLE" packages/web/workers/scripts/d1-seed.sql

Expected: 13 CREATE TABLE statements (was 11 — added corpus_sentences + corpus_sentences_index; pairs_full is gone; words_full is gone; words/pairs gain is_canonical column).
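
If the grep output is hard to eyeball, the table names can be pulled out and compared as a list. This is an optional helper sketch; created_tables is a hypothetical name, and the regex assumes each CREATE TABLE statement starts at the beginning of a line, as the grep above does.

```python
import re

# Matches the table name in lines such as "CREATE TABLE words (" or
# "CREATE TABLE IF NOT EXISTS words (". The exact DDL style is an assumption.
_CREATE_TABLE = re.compile(
    r"^CREATE TABLE (?:IF NOT EXISTS )?(\w+)", flags=re.MULTILINE
)


def created_tables(seed_sql: str) -> list:
    """Return table names in the order their CREATE TABLE statements appear."""
    return _CREATE_TABLE.findall(seed_sql)


sample = """\
CREATE TABLE words (word TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS corpus_sentences_index (sentence_id INTEGER);
INSERT INTO words VALUES ('cat');
CREATE TABLE corpus_sentences (sentence_id INTEGER);
"""
tables = created_tables(sample)
```

Against the real seed, pass the contents of packages/web/workers/scripts/d1-seed.sql and check that corpus_sentences and corpus_sentences_index are present and pairs_full is absent.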

  • [ ] Step 7: Re-chunk + apply locally:
uv run python packages/web/workers/scripts/chunk-seed-sql.py 2>&1 | tail -10
cd packages/web/workers
for i in $(seq -w 0 12); do
  f="scripts/d1-chunks/chunk_${i}.sql"
  [ -f "$f" ] || break
  echo "=== chunk_${i} ==="
  npx wrangler d1 execute phonolex --local --file "$f" 2>&1 | grep -E "ERROR|executed successfully"
done

# Verify row counts
for t in words pairs corpus_sentences_index corpus_sentences; do
  npx wrangler d1 execute phonolex --local \
    --command "SELECT COUNT(*) FROM $t;" 2>&1 | grep -A1 '"COUNT'
done

# Verify is_canonical column
npx wrangler d1 execute phonolex --local \
  --command "SELECT COUNT(*) AS canonical FROM words WHERE is_canonical = 1;" 2>&1 | grep -A1 'canonical'

Expected: all chunks succeed; counts match parquet; canonical=47K of 125K words.

  • [ ] Step 8: Commit:
cd /Users/jneumann/Repos/PhonoLex
git add packages/data/src/phonolex_data/runtime/emit_d1_sql.py \
        packages/web/workers/scripts/export-to-d1.py \
        packages/web/workers/scripts/d1-seed.sql \
        packages/web/workers/scripts/d1-chunks/
git commit -m "feat(d1): is_canonical on words/pairs + corpus_sentences{,_index} tables

emit_d1_sql learns the new schema: is_canonical column on words + pairs;
two new corpus retrieval tables (index + membership) backed by the
existing corpus_sentences*.parquet artifacts. Skip non-canonical rows
when emitting word_properties / word_freq_bands / word_percentiles
(no norm data for them).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 4: Extract compileWordFilter helper

Files: - Create: packages/web/workers/src/lib/wordFilter.ts - Modify: packages/web/workers/src/routes/words.ts

  • [ ] Step 1: Read the current pattern + filter + exclude_phonemes + cv_shape + similar_to handling in routes/words.ts (lines ~100-150 currently). This is what gets lifted.

  • [ ] Step 2: Create lib/wordFilter.ts:

import type { Pattern, WordSearchBody } from '../types';
import { buildPatternClauses } from './patterns';
import { partitionFilterColumns, isWordsTableColumn } from './queries';
import { normalizePhoneme } from './normalize';

export interface CompiledFilter {
  wordsWhere: string[];
  propsWhere: string[];
  params: unknown[];
  needsMedialPostFilter: boolean;
  medialSequences: string[][];
}

/**
 * Compile a search body into SQL WHERE-clause fragments + bind params.
 * Used by /api/words/search and /api/sentences — both apply the same
 * filter semantics to a different result projection.
 *
 * The `canonical_only` option appends `w.is_canonical = 1` to wordsWhere
 * so the caller doesn't have to remember. Default true (Word Lists and
 * corpus retrieval both want canonical scope for content-POS constraints).
 */
export function compileWordFilter(
  body: WordSearchBody,
  opts: { canonical_only?: boolean } = {},
): CompiledFilter {
  const canonicalOnly = opts.canonical_only ?? true;
  // ... port the constraint-translation block from routes/words.ts verbatim ...
  // Add at the start of wordsWhere:
  //   if (canonicalOnly) wordsWhere.push('w.is_canonical = 1');
}
  • [ ] Step 3: Update routes/words.ts to import and call compileWordFilter(body). Delete the inline duplication. Routes that need full vocab (none for now) pass { canonical_only: false }.
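
The contract the helper has to keep is that WHERE fragments and bind params stay in lockstep, with the canonical clause added unless the caller opts out. A language-neutral sketch of that contract in Python follows; it is not a port of the TypeScript, and the w.cv_shape column name is an assumption made for illustration.

```python
def compile_word_filter(body, canonical_only=True):
    """Return (where_fragments, params) for a search body.

    Each '?' placeholder in a fragment has exactly one entry in params,
    in the same left-to-right order.
    """
    words_where = []
    params = []
    if canonical_only:
        words_where.append("w.is_canonical = 1")
    shapes = body.get("cv_shape") or []
    if shapes:
        # Assumed column name; one placeholder per requested shape.
        placeholders = ", ".join("?" for _ in shapes)
        words_where.append(f"w.cv_shape IN ({placeholders})")
        params.extend(shapes)
    return words_where, params


default_scope = compile_word_filter({"cv_shape": ["CVC", "CV"]})
full_vocab = compile_word_filter({}, canonical_only=False)
```

Because the canonical clause carries no placeholder, turning it on or off never shifts the position of any bound value, which is what makes the default safe to flip per route.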

  • [ ] Step 4: Re-run worker tests:

cd packages/web/workers && npx tsc --noEmit && npm test 2>&1 | tail -10

Expected: all tests pass (this is a refactor — same behavior).

  • [ ] Step 5: Commit:
git add packages/web/workers/src/lib/wordFilter.ts packages/web/workers/src/routes/words.ts
git commit -m "refactor(workers): extract compileWordFilter; default is_canonical=1

No behavior change for /api/words/search (the canonical filter matches
the implicit assumption today — words.parquet was already content-POS-
filtered). The new /api/sentences endpoint will reuse this helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 5: Worker /api/sentences route (D1-backed)

Files: - Create: packages/web/workers/src/routes/sentences.ts - Modify: packages/web/workers/src/index.ts - Create: packages/web/workers/src/__tests__/sentences.test.ts

  • [ ] Step 1: Write tests first (TDD). Create __tests__/sentences.test.ts:
import { describe, expect, it } from 'vitest';
import { SELF } from 'cloudflare:test';

describe('POST /api/sentences', () => {
  it('returns sentences ordered by naturalness desc', async () => {
    const res = await SELF.fetch('http://localhost/api/sentences', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ top_k: 5 }),
    });
    expect([200, 500]).toContain(res.status);
    if (res.status === 200) {
      const body = (await res.json()) as { items: Array<{ naturalness_score: number | null; sentence_id: number; text: string }> };
      expect(Array.isArray(body.items)).toBe(true);
      const scores = body.items.map((i) => i.naturalness_score ?? -Infinity);
      for (let i = 1; i < scores.length; i++) {
        expect(scores[i]).toBeLessThanOrEqual(scores[i - 1]);
      }
    }
  });

  it('intersects with cv_shape filter', async () => {
    const res = await SELF.fetch('http://localhost/api/sentences', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ cv_shape: ['CVC'], top_k: 5 }),
    });
    expect([200, 500]).toContain(res.status);
  });

  it('caps top_k at 500', async () => {
    const res = await SELF.fetch('http://localhost/api/sentences', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ top_k: 1000 }),  // server caps at 500
    });
    expect([200, 500]).toContain(res.status);
    if (res.status === 200) {
      const body = (await res.json()) as { items: unknown[] };
      expect(body.items.length).toBeLessThanOrEqual(500);
    }
  });
});
  • [ ] Step 2: Implement routes/sentences.ts:
import { Hono } from 'hono';
import type { Env, WordSearchBody } from '../types';
import { compileWordFilter } from '../lib/wordFilter';

const sentences = new Hono<{ Bindings: Env }>();

interface SentencesBody extends WordSearchBody {
  top_k?: number;
}

sentences.post('/', async (c) => {
  const body = await c.req.json<SentencesBody>();
  const topK = Math.min(Math.max(body.top_k ?? 50, 1), 500);

  // Default: filter words to canonical scope (content-POS) for the user's
  // constraint matching. Sentences are matched on their is_content=1 tokens.
  const { wordsWhere, propsWhere, params } = compileWordFilter(body);

  const wordsWhereSQL = wordsWhere.length ? wordsWhere.join(' AND ') : '1=1';
  const propsWhereSQL = propsWhere.length ? propsWhere.join(' AND ') : '1=1';

  const sql = `
    WITH canonical_surviving AS (
      SELECT w.word
      FROM words w
      INNER JOIN word_properties wp ON w.word = wp.word
      WHERE ${wordsWhereSQL} AND ${propsWhereSQL}
    ),
    matching_sentences AS (
      SELECT cs.sentence_id
      FROM corpus_sentences cs
      WHERE cs.is_content = 1
      GROUP BY cs.sentence_id
      HAVING SUM(CASE WHEN cs.surface NOT IN (SELECT word FROM canonical_surviving) THEN 1 ELSE 0 END) = 0
    )
    SELECT csi.sentence_id, csi.text, csi.source, csi.source_record_id,
           csi.n_tokens, csi.n_content_in_vocab, csi.naturalness_score
    FROM corpus_sentences_index csi
    INNER JOIN matching_sentences USING (sentence_id)
    ORDER BY csi.naturalness_score DESC NULLS LAST
    LIMIT ?
  `;

  const { results } = await c.env.DB.prepare(sql)
    .bind(...params, topK)
    .all<{
      sentence_id: number;
      text: string;
      source: string;
      source_record_id: string | null;
      n_tokens: number;
      n_content_in_vocab: number;
      naturalness_score: number | null;
    }>();

  return c.json({ items: results, total: results.length });
});

export default sentences;
  • [ ] Step 3: Mount in index.ts. Remove the generation route import; add the sentences import:
import sentences from './routes/sentences';
// remove: import generation from './routes/generation';
app.route('/api/sentences', sentences);
// remove: app.route('/api/generation', generation);
  • [ ] Step 4: Run tests + smoke:
cd packages/web/workers && npm test -- sentences 2>&1 | tail -10
npx wrangler dev > /tmp/wrangler.log 2>&1 &
sleep 8
curl -s -X POST http://localhost:8787/api/sentences \
  -H 'Content-Type: application/json' \
  -d '{"top_k": 5}' | python3 -m json.tool
curl -s -X POST http://localhost:8787/api/sentences \
  -H 'Content-Type: application/json' \
  -d '{"cv_shape": ["CVC"], "top_k": 5}' | python3 -m json.tool
pkill -f "wrangler dev"

Expected: 3 unit tests pass; both curl calls return sentences ordered by naturalness; the cv_shape query returns sentences containing only CVC content words.
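
The matching rule in the route's SQL (a sentence qualifies only if none of its content tokens fall outside the surviving word set) can be checked in isolation with Python's sqlite3 module. The tables are trimmed to the columns the query touches, the data is invented, and length(word) = ? is only a stand-in for whatever compileWordFilter emits. NULLS LAST needs SQLite 3.30 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (word TEXT PRIMARY KEY, is_canonical INTEGER NOT NULL DEFAULT 0);
CREATE TABLE corpus_sentences (sentence_id INTEGER, surface TEXT, is_content INTEGER);
CREATE TABLE corpus_sentences_index (
    sentence_id INTEGER PRIMARY KEY, text TEXT, naturalness_score REAL
);
INSERT INTO words VALUES ('cat', 1), ('sat', 1), ('mat', 1), ('elephant', 1);
INSERT INTO corpus_sentences VALUES
    (1, 'the', 0), (1, 'cat', 1), (1, 'sat', 1),
    (2, 'the', 0), (2, 'elephant', 1), (2, 'sat', 1),
    (3, 'a', 0), (3, 'mat', 1);
INSERT INTO corpus_sentences_index VALUES
    (1, 'the cat sat', 0.4), (2, 'the elephant sat', 0.9), (3, 'a mat', NULL);
""")

# Stand-in for the compiled filter: keep only three-letter canonical words.
sql = """
WITH canonical_surviving AS (
    SELECT word FROM words WHERE is_canonical = 1 AND length(word) = ?
),
matching_sentences AS (
    SELECT sentence_id FROM corpus_sentences
    WHERE is_content = 1
    GROUP BY sentence_id
    HAVING SUM(CASE WHEN surface NOT IN (SELECT word FROM canonical_surviving)
                    THEN 1 ELSE 0 END) = 0
)
SELECT csi.sentence_id, csi.text
FROM corpus_sentences_index csi
INNER JOIN matching_sentences USING (sentence_id)
ORDER BY csi.naturalness_score DESC NULLS LAST
LIMIT ?
"""
rows = conn.execute(sql, (3, 10)).fetchall()
```

Sentence 2 has the highest score but is rejected because "elephant" fails the filter; sentence 3 has a NULL score and sorts last. Note that a sentence with no is_content = 1 rows never forms a group, so it cannot match.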

  • [ ] Step 5: Commit:
git add packages/web/workers/src/routes/sentences.ts \
        packages/web/workers/src/index.ts \
        packages/web/workers/src/__tests__/sentences.test.ts
git commit -m "feat(workers): /api/sentences served from D1 in the Worker

Replaces the FastAPI container proxy with a native Worker handler.
Reuses compileWordFilter so /api/sentences shares constraint semantics
with /api/words/search. Pre-computed naturalness_score on corpus_sentences_index
drives ordering — no live reranker needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 6: Delete Worker generation proxy + container

Files: - Delete: packages/web/workers/src/routes/generation.ts - Delete: packages/web/workers/src/containers/generation.ts - Delete: packages/web/workers/src/__tests__/generation.test.ts - Modify: packages/web/workers/src/types.ts (drop GENERATION_SERVICE from Env)

  • [ ] Step 1: Confirm no remaining imports:
grep -rn "GENERATION_SERVICE\|containers/generation\|routes/generation\|GenerationServer" \
  packages/web/workers/src/ 2>&1 | head -10
  • [ ] Step 2: Delete files + update types:
git rm packages/web/workers/src/routes/generation.ts
git rm packages/web/workers/src/containers/generation.ts
git rm packages/web/workers/src/__tests__/generation.test.ts
# Edit types.ts to remove GENERATION_SERVICE binding + GenerationServer import
  • [ ] Step 3: Typecheck + tests:
cd packages/web/workers && npx tsc --noEmit && npm test 2>&1 | tail -10
  • [ ] Step 4: Commit:
git commit -am "refactor(workers): delete generation proxy + GenerationServer DO class

Generation now lives entirely in the Worker via /api/sentences (T5).
The container proxy + DurableObject class are dead. wrangler.toml
container/migration entries get removed in T7.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 7: Drop containers from wrangler.toml + v2 DO migration

Files: - Modify: packages/web/workers/wrangler.toml

  • [ ] Step 1: Add the v2 migration tags BEFORE removing the bindings (so Cloudflare's DO state migrates cleanly):
[[migrations]]
tag = "v2"
deleted_classes = ["GenerationServer"]

[[env.staging.migrations]]
tag = "v2"
deleted_classes = ["GenerationServer"]

These go AFTER the existing v1 migration entries.

  • [ ] Step 2: Remove the container + DO binding blocks:

  • Production [[containers]] block

  • Production [[durable_objects.bindings]] block for GENERATION_SERVICE
  • Staging [[env.staging.containers]] block
  • Staging [[env.staging.durable_objects.bindings]] block for GENERATION_SERVICE

  • [ ] Step 3: Dry-run validate:

cd packages/web/workers && npx wrangler deploy --env staging --dry-run 2>&1 | tail -10

Expected: deploy plan mentions migration v2 retiring GenerationServer; no container image build.

  • [ ] Step 4: Commit:
git add packages/web/workers/wrangler.toml
git commit -m "chore(workers): retire GenerationServer DurableObject + container

Adds v2 migration to release DO storage; removes container + binding
entries from production and staging. Next deploy builds no container
and releases the GENERATION_SERVICE namespace cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 8: Add is_canonical = 1 to existing routes that need it

Files: - Modify: packages/web/workers/src/routes/words.ts (audit /api/words/word-list, /api/words/batch, /api/words/norms-dump) - Modify: packages/web/workers/src/routes/contrastive.ts

  • [ ] Step 1: Audit each route file that queries words or pairs directly (not via compileWordFilter):
grep -rn "FROM words\|FROM pairs\|words w\b\|pairs p\b" packages/web/workers/src/routes/ | head -20

For each match, decide:
  • The route is presenting content-POS scope to the user → add WHERE w.is_canonical = 1 (or p.is_canonical = 1)
  • The route is doing phonological work (similarity, exclusion) across full vocab → leave alone

Specifically:
  • /api/words/search: uses compileWordFilter → already canonical=1 by default
  • /api/words/word-list: likely canonical scope (clinical word lists) → add filter
  • /api/words/batch: depends; if it's an IN (?, ?, ...) lookup, it doesn't need the filter
  • /api/words/norms-dump: canonical scope
  • /api/words/:word: single-word lookup; no filter
  • /api/similarity/search: full vocab (sound similarity over any phonology-bearing word) → no filter
  • /api/contrastive/*: Contrastive Sets tool; canonical scope for both words in a pair → WHERE p.is_canonical = 1

  • [ ] Step 2: Apply the audit. For routes using inline SQL, add the clause manually. For routes via the helper, they're already covered.

  • [ ] Step 3: Run tests:

cd packages/web/workers && npm test 2>&1 | tail -15
  • [ ] Step 4: Manual smoke against local D1 to verify the canonical filter is applied correctly:
cd packages/web/workers && npx wrangler dev > /tmp/wrangler.log 2>&1 &
sleep 8
# word-list should return only content-POS words
curl -s -X POST http://localhost:8787/api/words/word-list \
  -H 'Content-Type: application/json' \
  -d '{"include_phonemes":["k"]}' | python3 -c "
import sys, json
data = json.load(sys.stdin)
print('count:', data.get('total'))
words = data.get('items', [])[:5]
print('sample POS:', [w.get('pos') for w in words])
"
# Similar query on /api/words/search (already through helper)
curl -s -X POST http://localhost:8787/api/words/search \
  -H 'Content-Type: application/json' \
  -d '{"cv_shape":["CVC"],"limit":5}' | python3 -c "
import sys, json
data = json.load(sys.stdin)
print('count:', data.get('total'))
print('sample POS:', [it.get('pos') for it in data.get('items', [])])
"
pkill -f "wrangler dev"

Expected: sample POS values are all NOUN/VERB/ADJ/ADV; no PROPN / PRON / DET / AUX.

  • [ ] Step 5: Commit:
git add packages/web/workers/src/routes/
git commit -m "feat(workers): WHERE is_canonical = 1 on routes presenting content-POS scope

Word Lists, Custom Word Lists, norms-dump, and contrastive routes get
the canonical filter. Sound similarity and single-word lookup keep
full-vocabulary scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 9: Delete packages/generators/, packages/governors/, packages/generation/

Files: - Delete entire dirs - Modify: root pyproject.toml (drop workspace entries)

  • [ ] Step 1: Confirm no live importers:
grep -rn "from phonolex_generators\|import phonolex_generators\|from phonolex_governors\|import phonolex_governors" \
  packages/ --include="*.py" 2>&1 | grep -vE "packages/(generators|governors|generation)/" | head -10

Expected: no live consumers outside the deleted dirs (the corpus.py reference is in packages/generation/ which is also being deleted).

  • [ ] Step 2: Delete:
git rm -r packages/generators/
git rm -r packages/governors/
git rm -r packages/generation/
  • [ ] Step 3: Update root pyproject.toml — find workspace members and remove the three:
grep -n "members\|packages/generation\|packages/generators\|packages/governors" pyproject.toml

Remove the entries.

  • [ ] Step 4: Drop the now-orphan Python ML deps from packages/data/pyproject.toml. Check what's there:
grep -E "spacy|sentence-transformers|torch|transformers|lightgbm|en_core" packages/data/pyproject.toml

Drop spaCy (no more morph features). Anything else that's only consumed by deleted code → drop.

  • [ ] Step 5: Re-sync:
uv sync --all-packages 2>&1 | tail -10
  • [ ] Step 6: Run remaining tests:
uv run python -m pytest packages/data/tests/ \
  --ignore=packages/data/tests/test_datasets.py \
  --ignore=packages/data/tests/test_new_loaders.py 2>&1 | tail -10

Expected: all data tests pass (any test that asserted on morph features was already updated in T1).

  • [ ] Step 7: Commit:
git add pyproject.toml packages/data/pyproject.toml uv.lock
git commit -am "chore: delete generators/, governors/, generation/ packages

All three are dead post-CSP-retirement. generators carries csp/ +
MLM editor + reranker; governors is its trie-checking dependency
chain; generation is the FastAPI server. No live consumer survives.

Also drops spaCy from packages/data/pyproject.toml — the only
consumer was _populate_morph_features (deleted in T1).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 10: LFS scope reduction + delete dead artifacts

Files: - Modify: .gitattributes, .gitignore - Delete: 6 dead data/runtime/ files + 2 retired (words_full, pairs_full)

  • [ ] Step 1: Delete dead artifacts:
for f in selectional.parquet skeletons.parquet \
         naturalness_scorer_head.pt naturalness_reference.npy \
         naturalness_reference_meta.jsonl reranker_v2.pkl \
         words_full.parquet pairs_full.parquet; do
  [ -f "data/runtime/$f" ] && git rm "data/runtime/$f"
done
  • [ ] Step 2: Update .gitignore — append:
# data/runtime/ is local build cache. Rebuild via:
#   uv run python packages/data/scripts/build_runtime_parquet.py
#   uv run python packages/web/workers/scripts/export-to-d1.py
# Only d1-seed.sql is committed (LFS).
data/runtime/*.parquet
data/runtime/*.pt
data/runtime/*.pkl
data/runtime/*.npy
data/runtime/*.jsonl
  • [ ] Step 3: LFS-untrack the remaining parquets:
git lfs untrack "data/runtime/*.parquet"
git lfs untrack "data/runtime/*.pkl"
git lfs untrack "data/runtime/corpus_sentences.parquet"
git lfs untrack "data/runtime/corpus_sentences_index.parquet"

After this, .gitattributes should contain only the d1-seed.sql LFS line.

  • [ ] Step 4: Remove the tracked parquets from the index (they're now gitignored; on-disk copies survive):
git rm --cached data/runtime/words.parquet
git rm --cached data/runtime/pairs.parquet
[ -f data/runtime/edges.parquet ] && git rm --cached data/runtime/edges.parquet
[ -f data/runtime/corpus_sentences.parquet ] && git rm --cached data/runtime/corpus_sentences.parquet
[ -f data/runtime/corpus_sentences_index.parquet ] && git rm --cached data/runtime/corpus_sentences_index.parquet
  • [ ] Step 5: Verify:
cat .gitattributes
ls data/runtime/ 2>&1 | head -10
git ls-files data/runtime/

Expected: .gitattributes only contains d1-seed.sql LFS line; data/runtime/ on disk has the regenerated parquets (untracked); git ls-files data/runtime/ shows no files.
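
The .gitattributes expectation can also be checked mechanically. The sketch below lists every pattern that still carries filter=lfs; lfs_patterns is a hypothetical helper, and the example line follows the form that git lfs track writes.

```python
def lfs_patterns(gitattributes_text: str) -> list:
    """Return the path patterns that carry filter=lfs in a .gitattributes body."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


example = (
    "# LFS\n"
    "packages/web/workers/scripts/d1-seed.sql filter=lfs diff=lfs merge=lfs -text\n"
)
found = lfs_patterns(example)
```

After Step 3, running this over the real .gitattributes should return a single entry, the d1-seed.sql path. The parser splits on whitespace, so it assumes no pattern contains spaces.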

  • [ ] Step 6: Commit:
git add .gitattributes .gitignore
git commit -m "chore: drop dead artifacts; LFS narrows to d1-seed.sql only

Removes selectional/skeletons/naturalness*/reranker_v2/words_full/
pairs_full from the tree. .gitattributes scopes to just the seed.
data/runtime/* is gitignored as developer-local build cache going
forward; rebuild via build_runtime_parquet.py + export-to-d1.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 11: Trim deploy workflows paths-filter

Files: - Modify: .github/workflows/deploy-staging.yml - Modify: .github/workflows/deploy.yml

  • [ ] Step 1: Update data: paths-filter in both workflows to:
            data:
              - 'packages/web/workers/scripts/d1-seed.sql'

Drop all the other path filters (packages/data/**, data/runtime/**, packages/web/workers/scripts/export-to-d1.py, packages/web/workers/scripts/config.py, etc.). The seed is the only thing CI touches.

  • [ ] Step 2: Review the workflow diff by eye; there is no automated linter for the Actions YAML.

  • [ ] Step 3: Commit:

git add .github/workflows/
git commit -m "ci: paths-filter only watches d1-seed.sql

After the D1-only refactor, the seed is the only artifact whose
change should trigger a re-seed. All other paths (source TSVs,
parquets, build scripts) are local-dev concerns now.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 12: CLAUDE.md rewrite + frontend cold-start copy

Files: - Modify: CLAUDE.md - Modify: packages/web/frontend/src/components/tools/GovernedGenerationTool/ (find cold-start copy)

  • [ ] Step 1: Find cold-start copy:
grep -rn "60s\|cold start\|cold-start\|generation server\|first request" \
  packages/web/frontend/src/components/ 2>&1 | head -10

Drop any warnings that no longer apply.

  • [ ] Step 2: CLAUDE.md rewrites — there are several sections to update:

  • "Architecture" section: replace CSP-Phase-1-and-2 description with corpus-retrieval-from-D1

  • "Generation Runtime Data Contract" section: drop selectional/skeletons/naturalness references; describe the unified words+pairs schema with is_canonical; describe corpus_sentences in D1
  • "Project Structure" tree: remove packages/generation/, packages/generators/, packages/governors/
  • "Dev Setup": drop the generation server start command + the heavy ML dep descriptions; flow is build_runtime_parquet.py + export-to-d1.py + wrangler dev (local) or LFS + wrangler deploy (CI)
  • "Gotchas": drop references to the FastAPI server, container cold-start, spaCy / morph features

This is a real rewrite. ~30% of CLAUDE.md changes.

  • [ ] Step 3: Commit:
git add packages/web/frontend/src/ CLAUDE.md
git commit -m "docs: rewrite CLAUDE.md + frontend copy for D1-only architecture

CSP/reranker/container described as retired; unified words+pairs with
is_canonical described as the new schema; corpus retrieval described
as Worker-D1 not container-Polars. Frontend drops the 60s cold-start
warning (no more container).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>"

Task 13: E2E smoke + push + PR

  • [ ] Step 1: Full local CI-equivalent run:
cd /Users/jneumann/Repos/PhonoLex

(cd packages/web/workers && npx tsc --noEmit && npm test 2>&1 | tail -10)
(cd packages/web/frontend && npm run type-check 2>&1 | tail -5)
(cd packages/web/frontend && npm run lint 2>&1 | tail -5)
(cd packages/web/frontend && npm run build 2>&1 | tail -5)

uv run python -m pytest packages/data/tests/ \
  --ignore=packages/data/tests/test_datasets.py \
  --ignore=packages/data/tests/test_new_loaders.py 2>&1 | tail -10

Expected: all green.

  • [ ] Step 2: End-to-end Worker smoke:
cd packages/web/workers && npx wrangler dev > /tmp/wrangler.log 2>&1 &
sleep 8

# Sentences (new D1-backed)
curl -s -X POST http://localhost:8787/api/sentences \
  -H 'Content-Type: application/json' \
  -d '{"top_k": 5}' | python3 -m json.tool | head -25

# Existing endpoints (verify is_canonical filter works correctly)
curl -s -X POST http://localhost:8787/api/words/search \
  -H 'Content-Type: application/json' \
  -d '{"cv_shape":["CVC"],"limit":5}' | python3 -m json.tool | head -30

# Generation routes are gone
curl -s -o /dev/null -w "/api/generation/generate-sentences: %{http_code}\n" \
  http://localhost:8787/api/generation/generate-sentences
curl -s -o /dev/null -w "/api/generation/sentences: %{http_code}\n" \
  http://localhost:8787/api/generation/sentences

pkill -f "wrangler dev"

Expected: /api/sentences returns sentences; /api/words/search returns content-POS words with CVC shape; /api/generation/* returns 404.
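
For the /api/sentences part of the smoke test, the ordering and cap can be verified on the JSON body rather than by reading it. This is a sketch of such a check, mirroring the assertions in the vitest file; check_sentences_response is a hypothetical name, and the payloads are invented.

```python
import json


def check_sentences_response(payload: dict, top_k: int) -> list:
    """Return a list of problems found in an /api/sentences response body."""
    items = payload.get("items")
    if not isinstance(items, list):
        return ["'items' is missing or not a list"]
    problems = []
    if len(items) > min(top_k, 500):  # the route caps top_k at 500
        problems.append(f"got {len(items)} items, more than the cap")
    # Treat a null score as lowest, matching ORDER BY ... DESC NULLS LAST.
    scores = [
        float("-inf") if it.get("naturalness_score") is None
        else it["naturalness_score"]
        for it in items
    ]
    if any(later > earlier for earlier, later in zip(scores, scores[1:])):
        problems.append("items are not ordered by naturalness_score descending")
    return problems


good = json.loads(
    '{"items": [{"naturalness_score": 0.9}, {"naturalness_score": 0.2},'
    ' {"naturalness_score": null}]}'
)
bad = json.loads('{"items": [{"naturalness_score": 0.1}, {"naturalness_score": 0.8}]}')
```

In the smoke script, the curl output could be piped to a small wrapper that calls json.load(sys.stdin) and prints whatever this function returns; an empty list means the response looks right.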

  • [ ] Step 3: Push:
git push -u origin refactor/d1-only-drop-csp 2>&1 | tail -10
  • [ ] Step 4: Open PR:
gh pr create --base develop --head refactor/d1-only-drop-csp \
  --title "refactor: D1-only generation; drop CSP + unify words/pairs" \
  --body-file - <<'EOF'
## Summary
- Moves /api/sentences into the Worker (D1-backed); no more FastAPI container
- Unifies `words` (47K → 125K rows) and `pairs` (60K → 642K rows) with an
  `is_canonical` column flagging the content-POS subset
- Adds `corpus_sentences{,_index}` D1 tables
- Deletes `packages/generators/`, `packages/governors/`, `packages/generation/`
- Drops spaCy + sentence-transformers + torch + transformers + lightgbm
- Drops the 7 spaCy-derived morph columns from words schema
- Reduces LFS to only `d1-seed.sql`; everything else in `data/runtime/` is
  local build cache (gitignored)
- CI paths-filter narrows to just the seed

## Why
The CSP synthetic generation paradigm was already deprecated; the live UI
only uses corpus retrieval. The whole stack was dragging an unused ML
dependency chain (~2 GB container, 30s cold start) into the deploy. The
canonical-vs-full split was hidden behind two parquet files the container
read directly. Consolidating into one Worker + D1 surface gets us:
- One service to deploy
- Sub-100ms cold start
- One LFS file
- No model downloads in CI
- Full vocabulary visible to the Worker (no more container-only data)

## Test plan
- [ ] Worker tests green
- [ ] Frontend lint + build green
- [ ] Python data tests green (CI scope)
- [ ] /api/sentences live in dev with corpus retrieval
- [ ] /api/words/search returns only content-POS words by default (is_canonical=1)
- [ ] /api/similarity/search and /api/words/:word still work over the full vocabulary
- [ ] /api/generation/* returns 404
- [ ] Staging deploy succeeds without building a container

🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF

Closing checklist

  • [ ] All 13 tasks complete (T1 through T12 committed; T13 pushed)
  • [ ] Local CI-equivalent checks green
  • [ ] /api/sentences works against D1
  • [ ] /api/words/search correctly filters to canonical
  • [ ] /api/generation/* removed
  • [ ] .gitattributes lists only d1-seed.sql
  • [ ] No generators/, governors/, generation/ in HEAD
  • [ ] No spaCy / morph features in HEAD
  • [ ] PR opened against develop
  • [ ] CI green on the pushed branch
  • [ ] Deploy Staging succeeds (no container build, no Python pipeline)