Corpus retrieval alongside CSP synthetic generation

Date: 2026-05-11
Branch: Sibling feature branch off develop (once feature/csp-iteration lands), or off feature/csp-iteration directly if work begins before that merges.
Status: Design, ready for implementation planning

Summary

Add a real-corpus retrieval service that ships attested sentences from clean licensed corpora (CoLA-pos, UD-EWT, GUM, CHILDES adult-input, Tatoeba-en) filtered by the same constraint schema CSP uses. Surface the results in a two-section response — corpus matches first, synthetic matches below — served from a new orchestrator endpoint /api/sentences that runs corpus retrieval and CSP in parallel.

Corpus retrieval reuses CSP's existing constraint code path 1:1: same hard_filter_expr, same _load_pairs_for_request. The only corpus-specific addition is a group-by-sentence aggregation that requires every in-vocab content word in the sentence to satisfy the filter — exact parity with the way CSP requires every filler and the verb to satisfy hard constraints.

Value: real attested text reads naturally by definition, augments synthetic output on permissive queries, and gives clinicians/researchers the variety CSP can't produce. No reranker uncertainty on the corpus side; the cost is one offline ingest pipeline and ~50ms cold-start.

Terminology rule applied throughout: real-world attested text is "corpus" and CSP-generated text is "synthetic". These two terms are used in schemas, types, docs, and UI section headers; there is no separate clinician-friendly vocabulary.

Background

PhonoLex v5.2 ships governed generation as constraint-driven CSP enumeration plus Qwen3-Embedding-0.6B cosine-to-reference-corpus reranking. The reference corpus is 22,899 sentences from CoLA-positive + UD English-EWT + GUM, embedded once and shipped as data/runtime/naturalness_reference.npy (PHON-110).

Those sentences are already on disk in data/runtime/naturalness_reference_meta.jsonl as {sentence, source} records. They were built to score synthetic output. They can also be output: attested, well-formed English that satisfies a constraint combo is the simplest possible answer to a generation request.

The CSP architecture treats every constraint as a row-level predicate over the lexicon (packages/generators/src/phonolex_generators/csp/constraints.py). A sentence is just a tuple of words; if we precompute one row per (sentence, in-vocab content word), the exact same predicate machinery applies — group by sentence_id, require all rows pass, return matching sentences.

Architecture

Ingest (offline, packages/data/scripts/build_corpus_sentences.py):
  CoLA-pos + UD-EWT + GUM + CHILDES-adult + Tatoeba-en
    ↓ tokenize, POS tag, lemmatize (spaCy en_core_web_sm)
    ↓ keep only content tokens (NOUN/VERB/ADJ/ADV)
    ↓ join lemma → words.parquet (47K) → drop OOV content tokens
    ↓ keep sentences where n_content_in_vocab ≥ 2
    ↓ profanity filter (default on)
    ↓ Qwen3-Embedding naturalness pre-score (self-row excluded from ref matmul)
  data/runtime/corpus_sentences.parquet         — per-(sentence, word) rows
  data/runtime/corpus_sentences_index.parquet   — per-sentence header rows
  (both LFS-tracked)

Runtime (generation server cold-start):
  WordStore.from_parquet()  + corpus.load(runtime_dir)
    ↓ Polars LazyFrames held in process (~50ms extra cold-start, ~50MB extra RAM)
  + existing CSP solver, skeletons, Qwen reranker (unchanged)

Request (Worker → FastAPI):
  POST /api/sentences  { constraints, top_k_corpus, top_k_synthetic, include_synthetic }
    ↓ parallel:
        corpus.match_corpus(...)        — Polars filter, ~50-100ms warm
        csp.solve(...) → realize → rerank  — existing path, ~3-7s warm
    ↓ envelope:
        { corpus_matches: [...], synthetic_matches: [...],
          corpus_skipped_reason, synthetic_skipped_reason, elapsed_ms }

The two paths share constraint dataclasses, share the WordStore, share constraint validation. They differ only in what they enumerate over (corpus sentences vs CSP-realized candidates).

Data model

Two new Parquet files at data/runtime/, both LFS-tracked.

corpus_sentences_index.parquet — per-sentence header

~500K–1M rows (sized by source intake).

Column              Type      Notes
sentence_id         u32       dense int, primary key
text                str       sentence as it appears in source
source              enum-str  cola | ud_ewt | gum | childes | tatoeba
source_record_id    str?      provenance traceback (nullable)
n_tokens            u8        whitespace tokens, retained range 5–25
n_content_in_vocab  u8        content tokens that joined words.parquet
n_content_oov       u8        content tokens that did not join
naturalness_score   f32?      mean top-K cosine vs naturalness_reference.npy; self-row excluded
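
The naturalness_score definition (mean top-K cosine against the reference matrix, skipping a verbatim self-match) can be sketched in plain Python. This is an illustration only: the function names, the list-based vectors, and the default k are hypothetical (the value of K is not stated here), and the real pipeline would presumably use matrix operations over naturalness_reference.npy.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def naturalness_score(
    query_vec: Sequence[float],
    query_text: str,
    ref_vecs: Sequence[Sequence[float]],
    ref_texts: Sequence[str],
    k: int = 5,  # hypothetical default; the design does not fix K
) -> Optional[float]:
    """Mean of the top-k cosines against the reference rows.

    Reference rows whose text equals query_text verbatim are skipped, so a
    sentence that is itself in the reference set cannot match itself at 1.0.
    Returns None when no reference rows remain (the column is nullable).
    """
    sims = [
        cosine(query_vec, vec)
        for vec, text in zip(ref_vecs, ref_texts)
        if text != query_text
    ]
    if not sims:
        return None
    top = sorted(sims, reverse=True)[:k]
    return sum(top) / len(top)
```

Without the exclusion, every CoLA, UD-EWT, and GUM sentence that is also a reference row would pick up a guaranteed 1.0 in its top-K and outrank CHILDES and Tatoeba sentences for reasons unrelated to naturalness.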

corpus_sentences.parquet — per-(sentence, in-vocab content word) rows

~1–3M rows after lemma-join filtering.

Column          Type      Notes
sentence_id     u32       FK to index
position        u8        content-word position in the sentence (0-indexed)
surface         str       token as it appears (e.g. ran)
lemma           str       spaCy lemma joined to words.parquet (e.g. run)
pos             enum-str  NOUN | VERB | ADJ | ADV
phonemes_str    str       inlined from words.parquet, pipe-delimited
~167 norm cols  varied    inlined from words.parquet: AoA, frequency_, percentile_, etc.

Why denormalize the 167 norm cols? Because every query becomes a single Polars hard_filter_expr against this table — no runtime join with words.parquet needed. Storage cost is ~1–2 GB Parquet; query latency drops from join-bound to scan-bound. Cold-start cost is a single pl.scan_parquet.

Join key: lemma (not surface). words.parquet stores lemmas; "ran" joins as "run". The result-display surface uses text verbatim from the corpus — subject-verb agreement is already correct in attested text.
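
A minimal sketch of the lemma join for one sentence, assuming tokens arrive as (surface, lemma, pos) triples and the lexicon is an in-memory dict keyed by lemma. The function name and the choice to number position over in-vocab content words only are assumptions made for illustration.

```python
from typing import Iterable, Mapping

CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def join_content_tokens(
    tokens: Iterable[tuple[str, str, str]],  # (surface, lemma, pos)
    lexicon: Mapping[str, dict],             # lemma -> columns inlined from words.parquet
) -> tuple[list[dict], int]:
    """Return (in-vocab word rows, number of OOV content tokens)."""
    rows: list[dict] = []
    n_oov = 0
    for surface, lemma, pos in tokens:
        if pos not in CONTENT_POS:
            continue  # function words and punctuation never reach the join
        entry = lexicon.get(lemma)
        if entry is None:
            n_oov += 1  # content token whose lemma is not in the lexicon
            continue
        rows.append(
            {
                "position": len(rows),  # assumption: counts in-vocab words only
                "surface": surface,     # kept verbatim, e.g. "ran"
                "lemma": lemma,         # join key, e.g. "run"
                "pos": pos,
                **entry,                # denormalized lexicon columns
            }
        )
    return rows, n_oov
```

The caller would keep the sentence only when `len(rows) >= 2`, and would store `len(rows)` and `n_oov` as n_content_in_vocab and n_content_oov on the index row.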

Query path

Module: packages/generation/server/corpus.py

Public surface:

@dataclass(frozen=True)
class CorpusStore:
    index_lf: pl.LazyFrame
    words_lf: pl.LazyFrame

def load_corpus(runtime_dir: Path) -> CorpusStore: ...

def match_corpus(
    store: CorpusStore,
    constraints: list[Constraint],
    pairs_df: pl.DataFrame,  # reuse the same pairs frame CSP loads at cold-start
    top_k: int,
) -> list[CorpusMatch]: ...

Hard constraints (Exclude / Bound / Pattern) — exact reuse of CSP's hard_filter_expr(constraints) from packages/generators/src/phonolex_generators/csp/constraints.py. The schema of corpus_sentences.parquet matches words.parquet on the columns that filter touches (phonemes_str + all norm cols), so the expression evaluates row-wise without modification.

After per-row pass/fail, group by sentence_id and require all rows pass:

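expr = hard_filter_expr(constraints)
if expr is None:  # assumes None means "no hard constraints"; `expr or ...` would raise, since a Polars Expr has no truth value
    expr = pl.lit(True)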
words_lf = words_lf.with_columns(passes=expr)
ok_sids = (
    words_lf
    .group_by("sentence_id")
    .agg(pl.col("passes").all().alias("ok"))
    .filter(pl.col("ok"))
    .select("sentence_id")
)

Parity rule: every in-vocab content word in the sentence must pass the filter, exactly as every CSP filler and the verb must pass. Function-word scaffolding (DET, ADP, etc.) is ignored on both sides — CSP synthesizes it, the corpus pipeline drops it.
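
To make the parity rule concrete, here is a toy walk-through in plain Python rather than Polars. The predicate is a stand-in for an Exclude constraint on /s/; the rows, phoneme strings, and function names are invented for illustration and are not the real lexicon data or the CSP predicate code.

```python
from collections import defaultdict
from typing import Callable, Iterable


def passing_sentence_ids(
    rows: Iterable[dict],
    predicate: Callable[[dict], bool],
) -> set[int]:
    """Sentence ids for which every word row satisfies the predicate."""
    verdicts: dict[int, bool] = defaultdict(lambda: True)
    for row in rows:
        sid = row["sentence_id"]
        # One failing row is enough to reject the whole sentence.
        verdicts[sid] = verdicts[sid] and predicate(row)
    return {sid for sid, ok in verdicts.items() if ok}


# Toy data: sentence 1 contains a word with /s/, sentence 2 does not.
rows = [
    {"sentence_id": 1, "lemma": "dog", "phonemes_str": "d|o|g"},
    {"sentence_id": 1, "lemma": "sit", "phonemes_str": "s|i|t"},
    {"sentence_id": 2, "lemma": "dog", "phonemes_str": "d|o|g"},
    {"sentence_id": 2, "lemma": "run", "phonemes_str": "r|u|n"},
]


def excludes_s(row: dict) -> bool:
    """Stand-in for an Exclude constraint: the word must not contain /s/."""
    return "s" not in row["phonemes_str"].split("|")


ok = passing_sentence_ids(rows, excludes_s)
```

Sentence 1 is rejected because one of its two content words fails, even though the other passes; only sentence 2 survives.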

Contrastive (Minpair / Maxopp) — exact reuse of _load_pairs_for_request(constraint, pairs_df, filtered_spec) from packages/generators/src/phonolex_generators/csp/skeleton.py. That function handles phoneme1/phoneme2 orientation, position_type filtering, and sonorant_diff thresholding. We accept its output and check:

# Sentence passes iff ∃ pair row (w1, w2) such that
# {w1, w2} ⊆ sentence_content_lemmas (after hard-filter).
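
The existence check above can be sketched as a small set-membership test. It assumes the pair rows can be reduced to (w1, w2) tuples of lemma strings, which is a guess about the pairs frame layout; the function name is hypothetical.

```python
from typing import Iterable


def sentence_has_contrast_pair(
    content_lemmas: Iterable[str],
    pairs: Iterable[tuple[str, str]],
) -> bool:
    """True iff some pair (w1, w2) has both members among the sentence's lemmas."""
    lemmas = set(content_lemmas)
    return any(w1 in lemmas and w2 in lemmas for w1, w2 in pairs)
```

For example, a sentence whose surviving content lemmas are {"pat", "bat", "sit"} passes against the pair ("pat", "bat"), while {"pat", "sit"} does not.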

The constraint's slots field is ignored on corpus matching — real sentences carry no role tags. Frontend annotates this on results when slots were set.

Multopp — not supported on single sentences (it's an N+1-sentence property). Orchestrator returns empty corpus_matches with corpus_skipped_reason: "multopp_paragraph_only".

Validation parity — CSP enforces the at-most-one-contrastive and Multopp-paragraph-only rules inline at the top of solve() in packages/generators/src/phonolex_generators/csp/solver.py. The orchestrator endpoint runs the same checks once before dispatching to both paths, so the corpus path inherits CSP's validation; lift those inline checks into a shared validate_constraints(constraints) helper as part of this work.
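
One possible shape for the shared helper, as a sketch only. The ConstraintStub class, its kind tags, and the error type are hypothetical stand-ins for the real constraint dataclasses, and it assumes Multopp is routed separately rather than counted toward the at-most-one-contrastive rule.

```python
from dataclasses import dataclass
from typing import Sequence

CONTRASTIVE_KINDS = {"minpair", "maxopp"}  # assumed tags for Minpair / Maxopp


@dataclass(frozen=True)
class ConstraintStub:
    """Hypothetical stand-in for the real constraint dataclasses."""
    kind: str


class ConstraintValidationError(ValueError):
    """Raised when a constraint set breaks a cross-constraint rule."""


def has_multopp(constraints: Sequence[ConstraintStub]) -> bool:
    """True if any constraint is a Multopp (used for routing, not rejection)."""
    return any(c.kind == "multopp" for c in constraints)


def validate_constraints(constraints: Sequence[ConstraintStub]) -> None:
    """Enforce the at-most-one-contrastive rule once, before dispatch."""
    n_contrastive = sum(c.kind in CONTRASTIVE_KINDS for c in constraints)
    if n_contrastive > 1:
        raise ConstraintValidationError(
            f"at most one contrastive constraint allowed, got {n_contrastive}"
        )
```

Running this once in the orchestrator means neither path has to repeat the checks, and a rejected request fails before any corpus scan or CSP solve starts.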

Ranking — sort survivors by naturalness_score desc, take top_k.
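
A small sketch of the ranking step. Since naturalness_score is nullable in the index schema, this version sorts missing scores last; that tie-break is an assumption, as is the function name.

```python
import heapq


def top_k_by_naturalness(survivors: list[dict], top_k: int) -> list[dict]:
    """Return up to top_k rows, highest naturalness_score first."""

    def sort_key(row: dict) -> float:
        score = row.get("naturalness_score")
        # Assumption: rows without a score rank below every scored row.
        return float("-inf") if score is None else score

    return heapq.nlargest(top_k, survivors, key=sort_key)
```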

Orchestrator endpoint

New endpoint: POST /api/sentences in packages/generation/server/routes/sentences.py.

The existing /api/generate-sentences and /api/generate-paragraphs endpoints stay untouched as the pure CSP paths.

Request

class SentencesRequest(BaseModel):
    constraints: list[ConstraintIn]      # exact same union as /api/generate-sentences
    band: str = "all"                    # CSP band; corpus ignores it
    top_k_corpus: int = 10
    top_k_synthetic: int = 10
    include_synthetic: bool = True       # set False to skip CSP entirely

Response

class CorpusMatch(BaseModel):
    text: str
    source: Literal["cola", "ud_ewt", "gum", "childes", "tatoeba"]
    naturalness_score: float
    n_content_in_vocab: int

class SyntheticMatch(BaseModel):
    # Rename of the existing SentenceCandidate dataclass in
    # packages/generation/server/schemas.py. Fields unchanged: sentence,
    # composite_score, axis_scores, verb, fillers, skeleton, etc.
    # If the rename is deferred (see "Open decisions" §1), `SyntheticMatch`
    # is a `TypeAlias = SentenceCandidate` for the orchestrator's response,
    # so the wire format and field names still read "synthetic" externally.
    ...  # fields elided here; unchanged from SentenceCandidate

class SentencesResponse(BaseModel):
    corpus_matches: list[CorpusMatch]
    synthetic_matches: list[SyntheticMatch]
    corpus_skipped_reason: Literal["multopp_paragraph_only"] | None = None
    synthetic_skipped_reason: Literal["disabled_by_caller"] | None = None
    elapsed_ms: dict[str, int]

Concurrency

async def post_sentences(req: SentencesRequest) -> SentencesResponse:
    corpus_task = asyncio.create_task(asyncio.to_thread(match_corpus, ...))
    if req.include_synthetic and not has_multopp(req.constraints):
        synthetic_task = asyncio.create_task(asyncio.to_thread(run_csp, ...))
    else:
        synthetic_task = None
    corpus_matches = await corpus_task
    synthetic_matches = await synthetic_task if synthetic_task else []
    ...

CSP errors are caught and surfaced via synthetic_skipped_reason, which needs an error value alongside "disabled_by_caller" in the Literal above; corpus results ship regardless.
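
The error-isolation behavior can be sketched with stub workers. Everything named here is illustrative: the stub functions, the "csp_error" reason string, and the "corpus" / "synthetic" keys in elapsed_ms are assumptions, not part of the existing schema.

```python
import asyncio
import time


def match_corpus_stub() -> list[str]:
    """Stand-in for the corpus path."""
    return ["An attested sentence."]


def run_csp_stub() -> list[str]:
    """Stand-in for the CSP path; fails to show error isolation."""
    raise RuntimeError("solver failed")


async def timed(fn):
    """Run a blocking function in a worker thread and time it in ms."""
    start = time.perf_counter()
    result = await asyncio.to_thread(fn)
    return result, int((time.perf_counter() - start) * 1000)


async def orchestrate(include_synthetic: bool = True) -> dict:
    # Both tasks are created before either is awaited, so they overlap.
    corpus_task = asyncio.create_task(timed(match_corpus_stub))
    synthetic_task = (
        asyncio.create_task(timed(run_csp_stub)) if include_synthetic else None
    )

    corpus_matches, corpus_ms = await corpus_task
    elapsed_ms = {"corpus": corpus_ms}

    synthetic_matches: list[str] = []
    synthetic_skipped_reason = None
    if synthetic_task is None:
        synthetic_skipped_reason = "disabled_by_caller"
    else:
        try:
            synthetic_matches, synthetic_ms = await synthetic_task
            elapsed_ms["synthetic"] = synthetic_ms
        except Exception:
            # Hypothetical reason value; corpus results still ship.
            synthetic_skipped_reason = "csp_error"

    return {
        "corpus_matches": corpus_matches,
        "synthetic_matches": synthetic_matches,
        "synthetic_skipped_reason": synthetic_skipped_reason,
        "elapsed_ms": elapsed_ms,
    }


result = asyncio.run(orchestrate())
```

Because the try/except wraps only the await on the synthetic task, a solver exception leaves corpus_matches intact and the response still validates.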

Worker proxy

packages/web/workers/src/routes/generation.ts adds a sibling proxy route that forwards /api/sentences to GENERATION_SERVER_URL. Same auth/timeout/error shape as the existing CSP proxy. No edge orchestration.

Ingest pipeline

packages/data/scripts/build_corpus_sentences.py — produces both Parquet files. Idempotent, deterministic for a fixed source set, runs in CI on data-path changes.

Per-source loaders:

def load_cola_positive() -> Iterator[tuple[str, str]]:   # (text, source_record_id)
def load_ud_ewt() -> Iterator[tuple[str, str]]:
def load_gum() -> Iterator[tuple[str, str]]:
def load_childes_adult() -> Iterator[tuple[str, str]]:   # MOT/FAT/INV speakers, English locales
def load_tatoeba_english() -> Iterator[tuple[str, str]]: # CC-BY 2.0 FR dump

The first three reuse loaders from packages/data/scripts/build_naturalness_reference.py. CHILDES reuses the TalkBank XML reading path PHON-94 already validated. Tatoeba pulls from the official CC-BY dump.

Pipeline stages (single pass, streaming):

  1. Sentence-level pre-filter — 5–25 whitespace tokens, ASCII-printable, dedup by lowercased text (within and across sources).
  2. spaCy parse — en_core_web_sm, nlp.pipe(batch_size=256, n_process=1). POS + lemma only (no full parse).
  3. Content-token extraction — keep {NOUN, VERB, ADJ, ADV} only; drop punctuation, function words, numerals, spaCy-flagged named entities.
  4. Lemma → words.parquet join — in-memory {lemma → row} dict; drop tokens that don't join.
  5. Sentence retention — keep iff n_content_in_vocab ≥ 2.
  6. Profanity filter — denylist of ~200 explicit terms; default on, toggleable via --no-profanity-filter for research use.
  7. Emit — one index row + N word rows, denormalizing norm cols inline.
  8. Naturalness pre-score — second pass: embed each surviving sentence with Qwen3-Embedding-0.6B (matches the model used by the reranker), mean top-K cosine vs naturalness_reference.npy. If the sentence text matches a ref-matrix row verbatim, exclude that row from the calculation. Batched 32 on CPU; ~10 min for 1M sentences.
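
Stage 1 of the list above can be sketched as a streaming generator. The function names are hypothetical; the sketch assumes records arrive as (text, source_record_id) tuples, as the loader signatures suggest, and that "ASCII-printable" means the range from space to tilde.

```python
from typing import Iterable, Iterator

MIN_TOKENS, MAX_TOKENS = 5, 25


def is_ascii_printable(text: str) -> bool:
    """True if every character lies in the printable ASCII range (space to ~)."""
    return all(" " <= ch <= "~" for ch in text)


def prefilter(
    records: Iterable[tuple[str, str]],  # (text, source_record_id)
) -> Iterator[tuple[str, str]]:
    """Yield records that pass the length, charset, and dedup checks."""
    seen: set[str] = set()
    for text, record_id in records:
        n_tokens = len(text.split())  # whitespace tokens
        if not MIN_TOKENS <= n_tokens <= MAX_TOKENS:
            continue
        if not is_ascii_printable(text):
            continue
        key = text.lower()
        if key in seen:
            continue  # duplicate of a sentence already emitted
        seen.add(key)
        yield text, record_id
```

Chaining all five loaders into a single call (for example with itertools.chain) shares one seen set, which gives the cross-source dedup; the first source in the chain wins when a sentence appears in several.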

Runtime cost: ~30 min one-time on a developer laptop. CI does the same work conditional on data-path changes.

LFS tracking: add both outputs to .gitattributes patterns next to the existing runtime artifacts.

Frontend

File: the existing Generation page (path to confirm during planning; packages/web/frontend/src/pages/Generate.tsx or equivalent).

Change the page to:

  1. Post to /api/sentences instead of /api/generate-sentences.
  2. Render two stacked sections: Corpus matches first, Synthetic matches below, separated by a section divider.

<ConstraintBuilder ... />
<RunButton onClick={runSentences} />

{result && (
  <>
    <ResultSection
      title="Corpus matches"
      subtitle={
        result.corpus_skipped_reason === 'multopp_paragraph_only'
          ? 'Multiple opposition is a paragraph property — see Synthetic matches below.'
          : `${result.corpus_matches.length} attested sentences match your constraints`
      }
      items={result.corpus_matches.map(renderCorpusMatch)}
    />
    <SectionDivider />
    <ResultSection
      title="Synthetic matches"
      subtitle={`${result.synthetic_matches.length} generated alternatives`}
      items={result.synthetic_matches.map(renderSynthetic)}
    />
  </>
)}

Per-item rendering:

  • CorpusMatch — sentence text + source pill (CoLA / UD-EWT / GUM / CHILDES / Tatoeba) + naturalness score on hover. No regenerate affordance (attested text is fixed).
  • SyntheticMatch — existing rendering. Composite score, axis breakdown, (verb, fillers, skeleton) on hover.

Empty-state copy:

  • corpus_matches = [] and no skip reason → "No attested sentences match these constraints. Try the synthetic matches below."
  • corpus_skipped_reason = multopp_paragraph_only → "Multiple opposition is a paragraph property — see Synthetic matches below."
  • synthetic_matches = [] → "No synthetic matches — try loosening your constraints."

Constraint UI — no changes. The builder already produces the constraint shape /api/sentences consumes.

Open decisions (defer to implementation planning)

  1. SentenceCandidate → SyntheticMatch rename. Per the corpus/synthetic rule, rename packages/generation/server/schemas.py:SentenceCandidate across CSP, server, frontend types. Scope as part of this ticket or as a separate prior refactor.
  2. slots field UX. When a Minpair/Maxopp constraint has slots set, the corpus section ignores it (no role tags on attested text). Show a note ("Slot restrictions are dropped on corpus matches") or hide the slots picker entirely on the page.
  3. Loading state. Single spinner with two timing pills below it, or SSE-streamed partial results so corpus matches appear before CSP finishes. The simple wait-for-both is fine for v1; SSE is a nice-to-have.
  4. Content gating denylist source. Pick a maintained denylist (e.g. better-profanity Python lib) vs hand-curated list. Maintained list reduces upkeep; hand-curated lets us tune for clinical register.

Non-goals

  • Paragraph orchestration. /api/generate-paragraphs stays CSP-only; no /api/paragraphs orchestrator in v1. Real-corpus paragraphs would require paragraph segmentation across sources and a clean answer for what "corpus paragraph match" means under non-Multopp constraints — out of scope.
  • Unified ranking. We do not merge corpus + synthetic into one ranked list. Different scoring signals (Qwen-cosine vs CSP composite); two sections keeps the user's mental model honest.
  • Custom corpora. v1 ships with the five fixed sources. User-uploaded or domain-specific corpora are a separate product surface.
  • Pattern relaxation. "Every content word starts with /s/" is strict parity with CSP and will rarely have corpus hits. We accept this — the synthetic section covers strict pattern queries.

File layout

New:

  • packages/data/scripts/build_corpus_sentences.py — ingest pipeline
  • packages/generation/server/corpus.py — runtime loader + match_corpus
  • packages/generation/server/routes/sentences.py — /api/sentences orchestrator
  • data/runtime/corpus_sentences.parquet (LFS)
  • data/runtime/corpus_sentences_index.parquet (LFS)
  • packages/generation/server/tests/test_corpus.py
  • packages/generation/server/tests/test_sentences_orchestrator.py

Modified:

  • packages/generation/server/main.py — load CorpusStore at cold-start
  • packages/generation/server/schemas.py — add CorpusMatch, SentencesRequest, SentencesResponse; (optionally) rename SentenceCandidate → SyntheticMatch
  • packages/web/workers/src/routes/generation.ts — add /api/sentences proxy
  • packages/web/frontend/src/pages/Generate.tsx (or equivalent) — two stacked sections, post to /api/sentences
  • packages/web/frontend/src/types/governance.ts — CorpusMatch, SyntheticMatch, SentencesResponse types
  • .gitattributes — LFS patterns for new Parquet outputs
  • CLAUDE.md — note the new endpoint + data contract artifacts

Acceptance criteria

  1. Ingest produces both Parquets with the documented schemas; n_content_in_vocab ≥ 2 rule applied; profanity filter on by default; naturalness pre-score computed with self-row exclusion.
  2. match_corpus returns sentences satisfying every CSP constraint type with exact parity to CSP's hard_filter_expr and _load_pairs_for_request. Adversarial test: a sentence where one content word violates Exclude/Bound/Pattern is excluded; otherwise it passes.
  3. Orchestrator endpoint runs corpus + CSP in parallel; CSP failure does not block corpus response.
  4. Multopp routes return corpus_matches: [] with corpus_skipped_reason: "multopp_paragraph_only".
  5. Frontend renders two stacked sections with corpus first, synthetic below, both ranked within themselves. Empty-state copy renders per §Frontend.
  6. Worker proxy forwards /api/sentences to the generation server with the same auth/timeout shape as /api/generate-sentences.
  7. Cold-start cost ≤ 200ms beyond current CSP cold-start (estimate; measured at first benchmark). Warm corpus query latency ≤ 200ms for typical Exclude+Bound queries; ≤ 500ms for contrastive queries. If actuals overshoot, file a tuning ticket — sub-100ms is the design target, sub-second is the acceptance floor.