v5.2.1 handoff — Sentences as a first-class clinical surface¶

Branch: polish/v5.2.1-ui-copy → develop → main Predecessor: the v5.1 handoff narrated the CSP solver + Qwen3-Embedding reranker that was retired in v5.2.0 (PR #113). See memory/project_csp_and_reranker_deprecated.md for the post-mortem; that work is snapshot at tag archive/csp-generation-v5.2.

What v5.2.1 ships¶

Sentences becomes a first-class clinical surface — same constraint vocabulary as Word Lists / Contrast Sets / Lookup, plus a per-result word-highlight overlay that makes triage workable. The 50+ commits between develop and the polish branch cluster into four buckets:

1. Sentences route now actually filters¶

Contrastive constraints wired — minpair / maxopp / multopp self-joins through the pairs table. Previously the route silently dropped these at constraintsToBody, meaning the UI's contrastive controls did nothing for as long as the D1-only refactor had been live. See memory/feedback_silent_drops_are_regressions.md.
CONTAINS_MEDIAL enforces edge exclusion in SQL — no more word-edge matches under medial-only.
NULL-fail for percentile bounds — content surfaces with NULL percentile (zero evidence in the relevant frequency band) fail the bound. Raw norm bounds keep NULL-pass.
Tiered ranking — global by per-query match_count DESC (multi-hit sentences before single-hit, regardless of source), source-interleaved within tier by rarity_score.
Per-result highlight payload — {include_surfaces, pair_surfaces} per match; CorpusRow renders via WordHighlighter compliance mode.

2. Corpus quality gates¶

SLP content-suitability gate (in-house V/A norms + AFINN NEG_3/4/5 buckets) — drops violence/conflict content
PROPN cap of 2 per sentence + English-frequency threshold (5wpm in FineWeb-Edu) substituting for langdetect
uh verbal filler + Spanish loanword denylist + letter-spelled-word (X-Y-Z) rejection
Parataxis dependency rejection (catches single-ROOT run-on patterns)
spaCy contraction handling — stem + suffix glued back to whole-word contraction surfaces (don't, won't, it's)
CHILDES + PhonBank retired (CHAT-transcript artifacts unfit for SLP)
Cross-source identical-text merge with aggressive normalization
Leading-quote / leading-dash strip on the persisted text column

Corpus: 342,976 → 235,866 sentences (-31%), almost entirely from upstream quality gates rather than corpus-source removal.

3. Developmental frequency rework¶

freq_age_adult → freq_age_all (renamed, repointed to FineWeb-Edu frequency)
freq_age_2y/5y/8y/12y now aggregate child PRODUCTION (CHILDES + PhonBank prod channels), not caregiver INPUT
naturalness_score retired end-to-end; rarity_score is the persisted ranking signal
Frequency-class percentile properties treat value=0 as NULL (zero-occurrence words no longer cluster at the 57th percentile)

4. UI polish¶

Chip onDelete bug — store gained removeAt(index) so chip deletes target exactly the clicked entry (previously partial-match field comparison could collapse sibling constraints)
WordHighlighter overlay normalised on the result cards (strips trailing punctuation so "d." in highlight set matches "d" in rendered text)
IPA keyboard icon spacing in ContrastiveSection
Tool descriptions in AppHeader drawer updated for Sentences

R&D workstreams (where the loop closes)¶

Workstream	State	Endgame role
Audio Detection (PHON-44/53/55)	Active — transcriber model trained	Diagnostic input + progress feedback
Curriculum Recommender	Concept	Diagnostic profile → graded sequence of targets, delivered through the live tools. Successor framing for "Content Catalog."
Governed Generation	Paused (CSP retired in v5.2.0)	Returns as R&D when curricula need synthetic material the corpus can't supply
Adaptive Loop	Concept	Glue closing diagnostic → curriculum → feedback

Migration / API breaking changes¶

Consumers of /api/sentences and /api/property-metadata will see:

naturalness_score removed from response; rarity_score + match_count + highlights added
freq_age_adult renamed to freq_age_all
Percentile bounds now exclude ~5-15× more sentences each (NULL-fail semantics)
freq_age_* numeric ranks have shifted (child PROD source vs caregiver INPUT source — different distributions)
Contrastive constraints actually filter results (previously dropped silently)

Verification quick-commands¶

# Corpus + worker + frontend tests
uv run python -m pytest packages/data/tests/ -q
cd packages/web/workers && npm test
cd ../frontend && npm test && npx tsc --noEmit

# Smoke the route
curl -s -X POST http://localhost:8787/api/sentences \
  -H 'Content-Type: application/json' \
  -d '{"constraints":[{"type":"contrastive_minpair","phoneme1":"b","phoneme2":"d","position":"initial"}],"top_k":5}'

Open items worth thinking about¶

Curriculum recommender scope — what the API surface looks like, where it lives in the existing tool grid
Audio detection integration — how diagnostic profiles flow from the audio workstream into curriculum recommendations
Inverted index for constraint cardinality — large pre-computed coverage tables (~29M rows) staged in research/2026-05-21-corpus-filter-audit/ could speed up cardinality preview ("4,000 sentences before AND-cascade"). Deferred until the CTE path shows latency issues at production scale.
Coverage badges in UI — surface "X sentences match this constraint" cardinality so the AND-cascade math is visible to the clinician before they run the query.