Corpus retrieval alongside CSP synthetic generation
Date: 2026-05-11
Branch: Sibling feature branch off develop (once feature/csp-iteration lands)
or off feature/csp-iteration directly if work begins before that merges.
Status: Design — ready for implementation planning
Summary
Add a real-corpus retrieval service that ships attested sentences from clean
licensed corpora (CoLA-pos, UD-EWT, GUM, CHILDES adult-input, Tatoeba-en)
filtered by the same constraint schema CSP uses. Surface the results in a
two-section response — corpus matches first, synthetic matches below —
served from a new orchestrator endpoint /api/sentences that runs corpus
retrieval and CSP in parallel.
Corpus retrieval reuses CSP's existing constraint code path 1:1: same
hard_filter_expr, same _load_pairs_for_request. The only corpus-specific
addition is a group-by-sentence aggregation that requires every in-vocab
content word in the sentence to satisfy the filter — exact parity with the
way CSP requires every filler and the verb to satisfy hard constraints.
Value: real attested text reads naturally by definition, augments synthetic output on permissive queries, and gives clinicians/researchers the variety CSP can't produce. No reranker uncertainty on the corpus side; the cost is one offline ingest pipeline and ~50ms cold-start.
Terminology rule applied throughout: real-world attested text is "corpus" and CSP-generated text is "synthetic". The same two terms are used in schemas, types, docs, and UI section headers, with no separate clinician-friendly vocabulary layered on top.
Background
PhonoLex v5.2 ships governed generation as constraint-driven CSP enumeration
plus Qwen3-Embedding-0.6B cosine-to-reference-corpus reranking. The reference
corpus is 22,899 sentences from CoLA-positive + UD English-EWT + GUM,
embedded once and shipped as data/runtime/naturalness_reference.npy
(PHON-110).
Those sentences are already on disk in
data/runtime/naturalness_reference_meta.jsonl as {sentence, source}
records. They were built to score synthetic output. They can also be
output: attested, well-formed English that satisfies a constraint combo is
the simplest possible answer to a generation request.
The CSP architecture treats every constraint as a row-level predicate over
the lexicon (packages/generators/src/phonolex_generators/csp/constraints.py).
A sentence is just a tuple of words; if we precompute one row per
(sentence, in-vocab content word), the exact same predicate machinery
applies — group by sentence_id, require all rows pass, return matching
sentences.
Architecture
Ingest (offline, packages/data/scripts/build_corpus_sentences.py):

    CoLA-pos + UD-EWT + GUM + CHILDES-adult + Tatoeba-en
      ↓ tokenize, POS tag, lemmatize (spaCy en_core_web_sm)
      ↓ keep only content tokens (NOUN/VERB/ADJ/ADV)
      ↓ join lemma → words.parquet (47K) → drop OOV content tokens
      ↓ keep sentences where n_content_in_vocab ≥ 2
      ↓ profanity filter (default on)
      ↓ Qwen3-Embedding naturalness pre-score (self-row excluded from ref matmul)
    data/runtime/corpus_sentences.parquet: per-(sentence, word) rows
    data/runtime/corpus_sentences_index.parquet: per-sentence header rows
    (both LFS-tracked)

Runtime (generation server cold-start):

    WordStore.from_parquet() + corpus.load_corpus(runtime_dir)
      ↓ Polars LazyFrames held in process (~50ms extra cold-start, ~50MB extra RAM)
    + existing CSP solver, skeletons, Qwen reranker (unchanged)

Request (Worker → FastAPI):

    POST /api/sentences { constraints, top_k_corpus, top_k_synthetic, include_synthetic }
      ↓ parallel:
          corpus.match_corpus(...): Polars filter, ~50-100ms warm
          csp.solve(...) → realize → rerank: existing path, ~3-7s warm
      ↓ envelope:
          { corpus_matches: [...], synthetic_matches: [...],
            corpus_skipped_reason, synthetic_skipped_reason, elapsed_ms }

The two paths share constraint dataclasses, share the WordStore, share
constraint validation. They differ only in what they enumerate over (corpus
sentences vs CSP-realized candidates).
Data model
Two new Parquet files at data/runtime/, both LFS-tracked.
corpus_sentences_index.parquet: per-sentence header
~500K–1M rows (sized by source intake).

| Column | Type | Notes |
|---|---|---|
| sentence_id | u32 | dense int, primary key |
| text | str | sentence as it appears in source |
| source | enum-str | one of cola, ud_ewt, gum, childes, tatoeba |
| source_record_id | str? | provenance traceback (nullable) |
| n_tokens | u8 | whitespace tokens, retained range 5–25 |
| n_content_in_vocab | u8 | content tokens that joined words.parquet |
| n_content_oov | u8 | content tokens that did not join |
| naturalness_score | f32? | mean top-K cosine vs naturalness_reference.npy; self-row excluded |

corpus_sentences.parquet: per-(sentence, in-vocab content word) rows
~1–3M rows after lemma-join filtering.

| Column | Type | Notes |
|---|---|---|
| sentence_id | u32 | FK to index |
| position | u8 | content-word position in the sentence (0-indexed) |
| surface | str | token as it appears (e.g. ran) |
| lemma | str | spaCy lemma joined to words.parquet (e.g. run) |
| pos | enum-str | one of NOUN, VERB, ADJ, ADV |
| phonemes_str | str | inlined from words.parquet, pipe-delimited |
| ~167 norm cols | varied | inlined from words.parquet: AoA, frequency_, percentile_, etc. |

Why denormalize the 167 norm cols? Because every query becomes a single
Polars hard_filter_expr against this table — no runtime join with
words.parquet needed. Storage cost is ~1–2 GB Parquet; query latency drops
from join-bound to scan-bound. Cold-start cost is a single pl.scan_parquet.
Join key: lemma (not surface). words.parquet stores lemmas; "ran"
joins as "run". The result-display surface uses text verbatim from the
corpus — subject-verb agreement is already correct in attested text.
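
The lemma join described above can be sketched in plain Python. This is an illustration only, not the ingest script: the Token type, the vocab dict (a stand-in for words.parquet keyed by lemma), and the choice to count position over all content tokens, including OOV ones, are assumptions made for the example.

```python
from dataclasses import dataclass

CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for one parsed token (surface form, lemma, POS tag)."""
    surface: str
    lemma: str
    pos: str


def join_content_tokens(tokens, vocab):
    """Return (rows, n_content_in_vocab, n_content_oov) for one sentence.

    vocab maps lemma -> dict of inlined columns (stand-in for words.parquet).
    """
    rows, n_oov, position = [], 0, 0
    for tok in tokens:
        if tok.pos not in CONTENT_POS:
            continue  # function words and punctuation are dropped
        entry = vocab.get(tok.lemma)  # join on lemma, never on surface
        if entry is None:
            n_oov += 1
        else:
            rows.append({"position": position, "surface": tok.surface,
                         "lemma": tok.lemma, "pos": tok.pos, **entry})
        position += 1  # assumption: position counts every content token
    return rows, len(rows), n_oov


tokens = [
    Token("The", "the", "DET"),
    Token("dog", "dog", "NOUN"),
    Token("ran", "run", "VERB"),
    Token("zorbly", "zorbly", "ADV"),  # made-up OOV word
]
vocab = {"dog": {"phonemes_str": "d|ɔ|ɡ"}, "run": {"phonemes_str": "ɹ|ʌ|n"}}
rows, n_in_vocab, n_oov = join_content_tokens(tokens, vocab)
keep_sentence = n_in_vocab >= 2  # retention rule from the ingest pipeline
```

Here "ran" joins through its lemma "run", while the stored surface stays "ran" so the displayed sentence is untouched.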
Query path
Module: packages/generation/server/corpus.py
Public surface:
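
    from dataclasses import dataclass
    from pathlib import Path

    import polars as pl

    # Constraint and CorpusMatch come from the shared CSP / server schemas.

    @dataclass(frozen=True)
    class CorpusStore:
        index_lf: pl.LazyFrame  # corpus_sentences_index.parquet (per-sentence headers)
        words_lf: pl.LazyFrame  # corpus_sentences.parquet (per-content-word rows)

    def load_corpus(runtime_dir: Path) -> CorpusStore: ...

    def match_corpus(
        store: CorpusStore,
        constraints: list[Constraint],
        pairs_df: pl.DataFrame,  # reuse the same pairs frame CSP loads at cold-start
        top_k: int,
    ) -> list[CorpusMatch]: ...
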
Hard constraints (Exclude / Bound / Pattern) — exact reuse of CSP's
hard_filter_expr(constraints) from packages/generators/src/phonolex_generators/csp/constraints.py.
The schema of corpus_sentences.parquet matches words.parquet on the
columns that filter touches (phonemes_str + all norm cols), so the
expression evaluates row-wise without modification.
After per-row pass/fail, group by sentence_id and require all rows pass:
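
    # Assumes hard_filter_expr returns None when there are no hard constraints.
    # `expr or pl.lit(True)` would raise: a Polars Expr has no truth value.
    expr = hard_filter_expr(constraints)
    if expr is None:
        expr = pl.lit(True)
    words_lf = words_lf.with_columns(passes=expr)
    ok_sids = (
        words_lf
        .group_by("sentence_id")
        .agg(pl.col("passes").all().alias("ok"))
        .filter(pl.col("ok"))
        .select("sentence_id")
    )
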
Parity rule: every in-vocab content word in the sentence must pass the filter, exactly as every CSP filler and the verb must pass. Function-word scaffolding (DET, ADP, etc.) is ignored on both sides — CSP synthesizes it, the corpus pipeline drops it.
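
To make the parity rule concrete, here is a plain-Python restatement of the "every row must pass" aggregation on toy data. It is not the Polars path; the row dicts and the excludes_phoneme predicate are hypothetical stand-ins for word rows and for an Exclude constraint.

```python
from collections import defaultdict


def passing_sentence_ids(word_rows, predicate):
    """Ids of sentences whose every in-vocab content-word row satisfies predicate."""
    verdicts = defaultdict(lambda: True)
    for row in word_rows:
        sid = row["sentence_id"]
        # One failing row is enough to reject the whole sentence.
        verdicts[sid] = verdicts[sid] and predicate(row)
    return sorted(sid for sid, ok in verdicts.items() if ok)


def excludes_phoneme(phoneme):
    """Hypothetical stand-in for an Exclude constraint on phonemes_str."""
    def check(row):
        return phoneme not in row["phonemes_str"].split("|")
    return check


rows = [
    {"sentence_id": 0, "lemma": "dog", "phonemes_str": "d|ɔ|ɡ"},
    {"sentence_id": 0, "lemma": "run", "phonemes_str": "ɹ|ʌ|n"},
    {"sentence_id": 1, "lemma": "cat", "phonemes_str": "k|æ|t"},
    {"sentence_id": 1, "lemma": "sit", "phonemes_str": "s|ɪ|t"},
]
print(passing_sentence_ids(rows, excludes_phoneme("s")))  # [0]
```

Sentence 1 is rejected because a single content word ("sit") contains the excluded phoneme, which is the adversarial case the acceptance criteria call for.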
Contrastive (Minpair / Maxopp) — exact reuse of
_load_pairs_for_request(constraint, pairs_df, filtered_spec) from
packages/generators/src/phonolex_generators/csp/skeleton.py. That function
handles phoneme1/phoneme2 orientation, position_type filtering, and
sonorant_diff thresholding. We accept its output and check:
# Sentence passes iff ∃ pair row (w1, w2) such that
# {w1, w2} ⊆ sentence_content_lemmas (after hard-filter).
The constraint's slots field is ignored on corpus matching — real
sentences carry no role tags. Frontend annotates this on results when slots
were set.
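
The contrastive check reduces to a set-containment test per sentence. The sketch below assumes the pair rows have already been reduced to (w1, w2) lemma tuples and that sentences are given as a dict of sentence_id to content lemmas; both shapes are illustrative and not the actual output format of _load_pairs_for_request.

```python
def sentence_has_pair(content_lemmas, pairs):
    """True iff some pair (w1, w2) has both members among the sentence's lemmas."""
    lemmas = set(content_lemmas)
    return any(w1 in lemmas and w2 in lemmas for w1, w2 in pairs)


def contrastive_sentence_ids(lemmas_by_sentence, pairs):
    """Filter hard-filter survivors down to sentences containing a full pair."""
    return sorted(
        sid for sid, lemmas in lemmas_by_sentence.items()
        if sentence_has_pair(lemmas, pairs)
    )


# Toy data: hypothetical minimal pairs and two surviving sentences.
pairs = [("pat", "bat"), ("ship", "sheep")]
survivors = {
    0: ["pat", "bat", "hit"],  # contains both members of ("pat", "bat")
    1: ["ship", "sail"],       # only one member of ("ship", "sheep")
}
print(contrastive_sentence_ids(survivors, pairs))  # [0]
```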
Multopp — not supported on single sentences (it's an N+1-sentence
property). Orchestrator returns empty corpus_matches with
corpus_skipped_reason: "multopp_paragraph_only".
Validation parity — CSP enforces the at-most-one-contrastive and
Multopp-paragraph-only rules inline at the top of solve() in
packages/generators/src/phonolex_generators/csp/solver.py. The
orchestrator endpoint runs the same checks once before dispatching to both
paths, so the corpus path inherits CSP's validation; lift those inline
checks into a shared validate_constraints(constraints) helper as part of
this work.
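
One possible shape for the shared helper is sketched below. The Constraint class with a kind tag, the error type, and the decision to count Multopp toward the at-most-one-contrastive limit are all assumptions for illustration; the real constraint dataclasses and the checks currently inline in solve() may differ. How the Multopp paragraph-only rule is surfaced (validation error or skip reason) is left to the caller here.

```python
from dataclasses import dataclass

# Hypothetical kind tags; the real constraint types live in csp/constraints.py.
CONTRASTIVE_KINDS = frozenset({"minpair", "maxopp", "multopp"})


@dataclass(frozen=True)
class Constraint:
    kind: str


class ConstraintValidationError(ValueError):
    """Raised once, before either the corpus or the CSP path is dispatched."""


def has_multopp(constraints):
    """True iff any constraint is a Multopp (paragraph-only) constraint."""
    return any(c.kind == "multopp" for c in constraints)


def validate_constraints(constraints):
    """Enforce the at-most-one-contrastive rule shared by both paths."""
    contrastive = [c for c in constraints if c.kind in CONTRASTIVE_KINDS]
    if len(contrastive) > 1:
        kinds = ", ".join(c.kind for c in contrastive)
        raise ConstraintValidationError(
            f"at most one contrastive constraint is allowed, got: {kinds}"
        )
```

Running the check once in the orchestrator means neither path has to re-validate, and both see identical rejections.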
Ranking — sort survivors by naturalness_score desc, take top_k.
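
A minimal sketch of the ranking step, using dicts in place of the real match objects. Dropping rows whose naturalness_score is null is an assumption made here because the index column is nullable while CorpusMatch.naturalness_score is a plain float.

```python
import heapq


def rank_by_naturalness(matches, top_k):
    """Return the top_k matches by naturalness_score, highest first."""
    scored = [m for m in matches if m["naturalness_score"] is not None]
    # nlargest returns results already sorted in descending order.
    return heapq.nlargest(top_k, scored, key=lambda m: m["naturalness_score"])


matches = [
    {"text": "The dog ran home.", "naturalness_score": 0.62},
    {"text": "Dogs run fast.", "naturalness_score": 0.71},
    {"text": "A dog runs.", "naturalness_score": None},
]
top = rank_by_naturalness(matches, top_k=2)
```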
Orchestrator endpoint
New endpoint: POST /api/sentences in packages/generation/server/routes/sentences.py.
The existing /api/generate-sentences and /api/generate-paragraphs
endpoints stay untouched as the pure CSP paths.
Request

    class SentencesRequest(BaseModel):
        constraints: list[ConstraintIn]  # exact same union as /api/generate-sentences
        band: str = "all"                # CSP band; corpus ignores it
        top_k_corpus: int = 10
        top_k_synthetic: int = 10
        include_synthetic: bool = True   # set False to skip CSP entirely

Response

    class CorpusMatch(BaseModel):
        text: str
        source: Literal["cola", "ud_ewt", "gum", "childes", "tatoeba"]
        naturalness_score: float
        n_content_in_vocab: int

    class SyntheticMatch(BaseModel):
        # Rename of the existing SentenceCandidate dataclass in
        # packages/generation/server/schemas.py. Fields unchanged: sentence,
        # composite_score, axis_scores, verb, fillers, skeleton, etc.
        # If the rename is deferred (see "Open decisions" §1), `SyntheticMatch`
        # is a `TypeAlias = SentenceCandidate` for the orchestrator's response,
        # so the wire format and field names still read "synthetic" externally.
        ...  # fields elided; see SentenceCandidate

    class SentencesResponse(BaseModel):
        corpus_matches: list[CorpusMatch]
        synthetic_matches: list[SyntheticMatch]
        corpus_skipped_reason: Literal["multopp_paragraph_only"] | None = None
        synthetic_skipped_reason: Literal["disabled_by_caller"] | None = None
        elapsed_ms: dict[str, int]

Concurrency

    async def post_sentences(req: SentencesRequest) -> SentencesResponse:
        corpus_task = asyncio.create_task(asyncio.to_thread(match_corpus, ...))
        if req.include_synthetic and not has_multopp(req.constraints):
            synthetic_task = asyncio.create_task(asyncio.to_thread(run_csp, ...))
        else:
            synthetic_task = None
        corpus_matches = await corpus_task
        synthetic_matches = await synthetic_task if synthetic_task else []
        ...

CSP errors are caught and surfaced via synthetic_skipped_reason; corpus
results ship regardless.
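
The error-isolation behaviour can be sketched with the standard library alone. The stub functions, the dict envelope, and the "csp_error" reason string are hypothetical; the response schema above does not yet define a skip reason for CSP failures, so the real label is an open detail.

```python
import asyncio
import time


def match_corpus_stub(constraints):
    """Stand-in for the corpus path; returns one attested sentence."""
    return ["The dog ran home."]


def run_csp_stub(constraints):
    """Stand-in for the CSP path; simulates a solver failure."""
    raise RuntimeError("solver failed")


async def gather_sections(constraints, corpus_fn, csp_fn):
    """Run both paths in worker threads; a CSP failure never blocks corpus results."""
    started = time.perf_counter()
    corpus_task = asyncio.create_task(asyncio.to_thread(corpus_fn, constraints))
    synthetic_task = asyncio.create_task(asyncio.to_thread(csp_fn, constraints))

    corpus_matches = await corpus_task
    synthetic_matches, skipped = [], None
    try:
        synthetic_matches = await synthetic_task
    except Exception:
        skipped = "csp_error"  # hypothetical reason string, not in the schema yet

    return {
        "corpus_matches": corpus_matches,
        "synthetic_matches": synthetic_matches,
        "synthetic_skipped_reason": skipped,
        "elapsed_ms": {"total": int((time.perf_counter() - started) * 1000)},
    }


result = asyncio.run(gather_sections([], match_corpus_stub, run_csp_stub))
```

Both tasks are created before either is awaited, so the slow CSP path overlaps with the fast corpus path instead of running after it.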
Worker proxy
packages/web/workers/src/routes/generation.ts adds a sibling proxy route
that forwards /api/sentences to GENERATION_SERVER_URL. Same
auth/timeout/error shape as the existing CSP proxy. No edge orchestration.
Ingest pipeline
packages/data/scripts/build_corpus_sentences.py — produces both Parquet
files. Idempotent, deterministic for a fixed source set, runs in CI on
data-path changes.
Per-source loaders:

    from collections.abc import Iterator

    # Each loader yields (text, source_record_id) pairs.
    def load_cola_positive() -> Iterator[tuple[str, str]]: ...
    def load_ud_ewt() -> Iterator[tuple[str, str]]: ...
    def load_gum() -> Iterator[tuple[str, str]]: ...
    def load_childes_adult() -> Iterator[tuple[str, str]]: ...  # MOT/FAT/INV speakers, English locales
    def load_tatoeba_english() -> Iterator[tuple[str, str]]: ...  # CC-BY 2.0 FR dump

The first three reuse loaders from
packages/data/scripts/build_naturalness_reference.py. CHILDES reuses the
TalkBank XML reading path PHON-94 already validated. Tatoeba pulls from the
official CC-BY dump.
Pipeline stages (single pass, streaming):
- Sentence-level pre-filter: 5–25 whitespace tokens, ASCII-printable, dedup by lowercased text (within and across sources).
- spaCy parse: en_core_web_sm, nlp.pipe(batch_size=256, n_process=1). POS + lemma only (no full parse).
- Content-token extraction: keep {NOUN, VERB, ADJ, ADV} only; drop punctuation, function words, numerals, spaCy-flagged named entities.
- Lemma → words.parquet join: in-memory {lemma → row} dict; drop tokens that don't join.
- Sentence retention: keep iff n_content_in_vocab ≥ 2.
- Profanity filter: denylist of ~200 explicit terms; default on, toggleable via --no-profanity-filter for research use.
- Emit: one index row + N word rows, denormalizing norm cols inline.
- Naturalness pre-score: second pass. Embed each surviving sentence with Qwen3-Embedding-0.6B (matches the model used by the reranker) and take the mean top-K cosine vs naturalness_reference.npy. If the sentence text matches a ref-matrix row verbatim, exclude that row from the calculation. Batched 32 on CPU; ~10 min for 1M sentences.
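
The naturalness pre-score with self-row exclusion can be illustrated in pure Python on toy vectors. This is a sketch of the arithmetic only: the production path would use the embedding model and a matrix product, and the value of K is not specified in this document, so k is a free parameter here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def naturalness_score(text, vector, ref_texts, ref_vectors, k):
    """Mean of the top-k cosines against the reference set.

    Reference rows whose text equals `text` verbatim are skipped, so a sentence
    that is itself in the reference corpus does not score against itself.
    """
    sims = [
        cosine(vector, ref_vec)
        for ref_text, ref_vec in zip(ref_texts, ref_vectors)
        if ref_text != text
    ]
    top = sorted(sims, reverse=True)[:k]
    return sum(top) / len(top) if top else None


# Toy reference set: sentence "a" is also a reference row.
ref_texts = ["a", "b", "c"]
ref_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = naturalness_score("a", [1.0, 0.0], ref_texts, ref_vectors, k=1)
```

Without the exclusion, the toy sentence would score a perfect 1.0 against its own row; with it, the best remaining neighbour determines the score.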
Runtime cost: ~30 min one-time on a developer laptop. CI does the same work conditional on data-path changes.
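
The first pipeline stage (length window, ASCII-printable check, case-insensitive dedup) can be sketched as a streaming generator. The function name and defaults are illustrative; only the 5–25 token window and the lowercased dedup key come from the stage description.

```python
def prefilter(sentences, min_tokens=5, max_tokens=25):
    """Yield sentences that pass the sentence-level pre-filter, in input order."""
    seen = set()
    for text in sentences:
        n_tokens = len(text.split())  # whitespace tokens
        if not (min_tokens <= n_tokens <= max_tokens):
            continue
        if not (text.isascii() and text.isprintable()):
            continue
        key = text.lower()  # dedup within and across sources
        if key in seen:
            continue
        seen.add(key)
        yield text


raw = [
    "The dog ran all the way home.",
    "the dog ran all the way home.",          # duplicate after lowercasing
    "Too short.",                             # under 5 tokens
    "Café tables were set out early today.",  # non-ASCII character
]
kept = list(prefilter(raw))
```

Sharing one seen set across all source loaders is what makes the dedup apply across sources as well as within them.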
LFS tracking: add both outputs to .gitattributes patterns next to the
existing runtime artifacts.
Frontend
File: the existing Generation page (path to confirm during planning;
packages/web/frontend/src/pages/Generate.tsx or equivalent).
Change the page to:
1. Post to /api/sentences instead of /api/generate-sentences.
2. Render two stacked sections: Corpus matches first, Synthetic matches
below, separated by a section divider.

    <ConstraintBuilder ... />
    <RunButton onClick={runSentences} />
    {result && (
      <>
        <ResultSection
          title="Corpus matches"
          subtitle={
            result.corpus_skipped_reason === 'multopp_paragraph_only'
              ? 'Multiple opposition is a paragraph property — see Synthetic matches below.'
              : `${result.corpus_matches.length} attested sentences match your constraints`
          }
          items={result.corpus_matches.map(renderCorpusMatch)}
        />
        <SectionDivider />
        <ResultSection
          title="Synthetic matches"
          subtitle={`${result.synthetic_matches.length} generated alternatives`}
          items={result.synthetic_matches.map(renderSynthetic)}
        />
      </>
    )}

Per-item rendering:
- CorpusMatch — sentence text + source pill (CoLA / UD-EWT / GUM /
CHILDES / Tatoeba) + naturalness score on hover. No regenerate
affordance (attested text is fixed).
- SyntheticMatch — existing rendering. Composite score, axis breakdown,
(verb, fillers, skeleton) on hover.
Empty-state copy:
- corpus_matches = [] and no skip reason → "No attested sentences match
these constraints. Try the synthetic matches below."
- corpus_skipped_reason = multopp_paragraph_only → "Multiple opposition is
a paragraph property — see Synthetic matches below."
- synthetic_matches = [] → "No synthetic matches — try loosening your
constraints."
Constraint UI — no changes. The builder already produces the constraint
shape /api/sentences consumes.
Open decisions (defer to implementation planning)
- SentenceCandidate → SyntheticMatch rename. Per the corpus/synthetic rule, rename packages/generation/server/schemas.py:SentenceCandidate across CSP, server, frontend types. Scope as part of this ticket or as a separate prior refactor.
- slots field UX. When a Minpair/Maxopp constraint has slots set, the corpus section ignores it (no role tags on attested text). Show a note ("Slot restrictions are dropped on corpus matches") or hide the slots picker entirely on the page.
- Loading state. Single spinner with two timing pills below it, or SSE-streamed partial results so corpus matches appear before CSP finishes. The simple wait-for-both is fine for v1; SSE is a nice-to-have.
- Content gating denylist source. Pick a maintained denylist (e.g. better-profanity Python lib) vs hand-curated list. Maintained list reduces upkeep; hand-curated lets us tune for clinical register.
Non-goals
- Paragraph orchestration. /api/generate-paragraphs stays CSP-only; no /api/paragraphs orchestrator in v1. Real-corpus paragraphs would require paragraph segmentation across sources and a clean answer for what "corpus paragraph match" means under non-Multopp constraints, which is out of scope.
- Unified ranking. We do not merge corpus + synthetic into one ranked list. Different scoring signals (Qwen-cosine vs CSP composite); two sections keeps the user's mental model honest.
- Custom corpora. v1 ships with the five fixed sources. User-uploaded or domain-specific corpora are a separate product surface.
- Pattern relaxation. "Every content word starts with /s/" is strict parity with CSP and will rarely have corpus hits. We accept this; the synthetic section covers strict pattern queries.
File layout
New:
- packages/data/scripts/build_corpus_sentences.py — ingest pipeline
- packages/generation/server/corpus.py — runtime loader + match_corpus
- packages/generation/server/routes/sentences.py — /api/sentences
orchestrator
- data/runtime/corpus_sentences.parquet (LFS)
- data/runtime/corpus_sentences_index.parquet (LFS)
- packages/generation/server/tests/test_corpus.py
- packages/generation/server/tests/test_sentences_orchestrator.py
Modified:
- packages/generation/server/main.py — load CorpusStore at cold-start
- packages/generation/server/schemas.py — add CorpusMatch,
SentencesRequest, SentencesResponse; (optionally) rename
SentenceCandidate → SyntheticMatch
- packages/web/workers/src/routes/generation.ts — add /api/sentences
proxy
- packages/web/frontend/src/pages/Generate.tsx (or equivalent) — two
stacked sections, post to /api/sentences
- packages/web/frontend/src/types/governance.ts — CorpusMatch,
SyntheticMatch, SentencesResponse types
- .gitattributes — LFS patterns for new Parquet outputs
- CLAUDE.md — note the new endpoint + data contract artifacts
Acceptance criteria
- Ingest produces both Parquets with the documented schemas; n_content_in_vocab ≥ 2 rule applied; profanity filter on by default; naturalness pre-score computed with self-row exclusion.
- match_corpus returns sentences satisfying every CSP constraint type with exact parity to CSP's hard_filter_expr and _load_pairs_for_request. Adversarial test: a sentence where one content word violates Exclude/Bound/Pattern is excluded; otherwise it passes.
- Orchestrator endpoint runs corpus + CSP in parallel; CSP failure does not block corpus response.
- Multopp routes return corpus_matches: [] with corpus_skipped_reason: "multopp_paragraph_only".
- Frontend renders two stacked sections with corpus first, synthetic below, both ranked within themselves. Empty-state copy renders per §Frontend.
- Worker proxy forwards /api/sentences to the generation server with the same auth/timeout shape as /api/generate-sentences.
- Cold-start cost ≤ 200ms beyond current CSP cold-start (estimate; measured at first benchmark). Warm corpus query latency ≤ 200ms for typical Exclude+Bound queries; ≤ 500ms for contrastive queries. If actuals overshoot, file a tuning ticket: sub-100ms is the design target, sub-second is the acceptance floor.