Skip to content

v5.2.1 handoff — Sentences as a first-class clinical surface

Branch: polish/v5.2.1-ui-copy → develop → main Predecessor: the v5.1 handoff narrated the CSP solver + Qwen3-Embedding reranker that was retired in v5.2.0 (PR #113). See memory/project_csp_and_reranker_deprecated.md for the post-mortem; that work is snapshot at tag archive/csp-generation-v5.2.

What v5.2.1 ships

Sentences becomes a first-class clinical surface — same constraint vocabulary as Word Lists / Contrast Sets / Lookup, plus a per-result word-highlight overlay that makes triage workable. The 50+ commits between develop and the polish branch cluster into four buckets:

1. Sentences route now actually filters

  • Contrastive constraints wired — minpair / maxopp / multopp self-joins through the pairs table. Previously the route silently dropped these at constraintsToBody, meaning the UI's contrastive controls did nothing for as long as the D1-only refactor had been live. See memory/feedback_silent_drops_are_regressions.md.
  • CONTAINS_MEDIAL enforces edge exclusion in SQL — no more word-edge matches under medial-only.
  • NULL-fail for percentile bounds — content surfaces with NULL percentile (zero evidence in the relevant frequency band) fail the bound. Raw norm bounds keep NULL-pass.
  • Tiered ranking — global by per-query match_count DESC (multi-hit sentences before single-hit, regardless of source), source-interleaved within tier by rarity_score.
  • Per-result highlight payload{include_surfaces, pair_surfaces} per match; CorpusRow renders via WordHighlighter compliance mode.

2. Corpus quality gates

  • SLP content-suitability gate (in-house V/A norms + AFINN NEG_3/4/5 buckets) — drops violence/conflict content
  • PROPN cap of 2 per sentence + English-frequency threshold (5wpm in FineWeb-Edu) substituting for langdetect
  • uh verbal filler + Spanish loanword denylist + letter-spelled-word (X-Y-Z) rejection
  • Parataxis dependency rejection (catches single-ROOT run-on patterns)
  • spaCy contraction handling — stem + suffix glued back to whole-word contraction surfaces (don't, won't, it's)
  • CHILDES + PhonBank retired (CHAT-transcript artifacts unfit for SLP)
  • Cross-source identical-text merge with aggressive normalization
  • Leading-quote / leading-dash strip on the persisted text column

Corpus: 342,976 → 235,866 sentences (-31%), almost entirely from upstream quality gates rather than corpus-source removal.

3. Developmental frequency rework

  • freq_age_adultfreq_age_all (renamed, repointed to FineWeb-Edu frequency)
  • freq_age_2y/5y/8y/12y now aggregate child PRODUCTION (CHILDES + PhonBank prod channels), not caregiver INPUT
  • naturalness_score retired end-to-end; rarity_score is the persisted ranking signal
  • Frequency-class percentile properties treat value=0 as NULL (zero-occurrence words no longer cluster at the 57th percentile)

4. UI polish

  • Chip onDelete bug — store gained removeAt(index) so chip deletes target exactly the clicked entry (previously partial-match field comparison could collapse sibling constraints)
  • WordHighlighter overlay normalised on the result cards (strips trailing punctuation so "d." in highlight set matches "d" in rendered text)
  • IPA keyboard icon spacing in ContrastiveSection
  • Tool descriptions in AppHeader drawer updated for Sentences

R&D workstreams (where the loop closes)

Workstream State Endgame role
Audio Detection (PHON-44/53/55) Active — transcriber model trained Diagnostic input + progress feedback
Curriculum Recommender Concept Diagnostic profile → graded sequence of targets, delivered through the live tools. Successor framing for "Content Catalog."
Governed Generation Paused (CSP retired in v5.2.0) Returns as R&D when curricula need synthetic material the corpus can't supply
Adaptive Loop Concept Glue closing diagnostic → curriculum → feedback

Migration / API breaking changes

Consumers of /api/sentences and /api/property-metadata will see:

  • naturalness_score removed from response; rarity_score + match_count + highlights added
  • freq_age_adult renamed to freq_age_all
  • Percentile bounds now exclude ~5-15× more sentences each (NULL-fail semantics)
  • freq_age_* numeric ranks have shifted (child PROD source vs caregiver INPUT source — different distributions)
  • Contrastive constraints actually filter results (previously dropped silently)

Verification quick-commands

# Corpus + worker + frontend tests
uv run python -m pytest packages/data/tests/ -q
cd packages/web/workers && npm test
cd ../frontend && npm test && npx tsc --noEmit

# Smoke the route
curl -s -X POST http://localhost:8787/api/sentences \
  -H 'Content-Type: application/json' \
  -d '{"constraints":[{"type":"contrastive_minpair","phoneme1":"b","phoneme2":"d","position":"initial"}],"top_k":5}'

Open items worth thinking about

  • Curriculum recommender scope — what the API surface looks like, where it lives in the existing tool grid
  • Audio detection integration — how diagnostic profiles flow from the audio workstream into curriculum recommendations
  • Inverted index for constraint cardinality — large pre-computed coverage tables (~29M rows) staged in research/2026-05-21-corpus-filter-audit/ could speed up cardinality preview ("4,000 sentences before AND-cascade"). Deferred until the CTE path shows latency issues at production scale.
  • Coverage badges in UI — surface "X sentences match this constraint" cardinality so the AND-cascade math is visible to the clinician before they run the query.