
Sentences

Retrieve naturalistic English sentences that satisfy phoneme, CV-shape, frequency, and contrastive constraints — for use as in-context practice material, clinical examples, or assessment prompts. Sentences are drawn from a curated ~236K-sentence corpus (CoLA, UD English-EWT, GUM, Tatoeba, OpenSubtitles) gated for SLP suitability.

How It Works

The Sentences tool runs the same constraint vocabulary you use elsewhere (phoneme patterns, CV shapes, percentile-bounded developmental frequency, contrastive minimal pairs) against an indexed corpus of attested sentences. A returned sentence is one where every active constraint is satisfied — phoneme rules by at least one word, contrastive rules by both members of a pair witness, bound rules by every content word in the sentence.

Sentences are ranked in three steps. First, by per-query match count: sentences with more constraint-satisfying words rank higher. Second, within each match-count tier, by rarity score, a static signal favoring sentences that carry rarer phonological constraints. Finally, results are interleaved by source so they don't pile up from one corpus.
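The ranking order can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the `Hit` record, its field names, and the round-robin interleaving strategy are all assumptions made for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    # Hypothetical result record; field names are illustrative only.
    text: str
    source: str        # corpus the sentence came from
    match_count: int   # constraint-satisfying words for this query
    rarity: float      # static rarity score (higher = rarer)


def interleave_by_source(hits):
    """Round-robin across sources, keeping each source's internal order."""
    queues = defaultdict(deque)
    for hit in hits:
        queues[hit.source].append(hit)
    ordered = []
    while queues:
        for source in list(queues):  # snapshot, so deleting keys is safe
            ordered.append(queues[source].popleft())
            if not queues[source]:
                del queues[source]
    return ordered


def rank(hits):
    """Tier by match count, sort by rarity within a tier, then interleave."""
    tiers = defaultdict(list)
    for hit in hits:
        tiers[hit.match_count].append(hit)
    ranked = []
    for count in sorted(tiers, reverse=True):
        tier = sorted(tiers[count], key=lambda h: h.rarity, reverse=True)
        ranked.extend(interleave_by_source(tier))
    return ranked
```

In this sketch a sentence with a higher match count always outranks one with a lower count, whatever its rarity; rarity and source interleaving only reorder sentences inside the same tier.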

Each result is rendered with a per-word highlight overlay — the words that earned the sentence its slot get a blue underline. For contrastive constraints, both members of every witness pair are highlighted so the contrast is visible at a glance. Click any word to open its profile.

Constraint Types

Phoneme patterns

Same pattern vocabulary as Custom Word Lists:

Type             Meaning
STARTS_WITH      At least one word in the sentence has these phonemes at its onset
ENDS_WITH        At least one word has these phonemes at its coda
CONTAINS         At least one word contains these phonemes anywhere
CONTAINS_MEDIAL  At least one word contains these phonemes strictly between its first and last phoneme

Each pattern carries a mode: include (sentence must contain ≥1 matching word) or exclude (no word in the sentence may match). Multi-phoneme sequences are space-separated (e.g. "s t" matches /st/ clusters).
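As a rough illustration of the four pattern types and the two modes, the sketch below matches a space-separated pattern against words represented as lists of phoneme symbols. The function names and the phoneme-list representation are assumptions for the example, not the tool's internals.

```python
def word_matches(phonemes, kind, pattern):
    """Check one word (a list of phoneme symbols) against one pattern.

    `pattern` is a space-separated phoneme sequence such as "s t".
    """
    target = pattern.split()
    n, k = len(phonemes), len(target)
    if k == 0 or k > n:
        return False
    if kind == "STARTS_WITH":
        return phonemes[:k] == target
    if kind == "ENDS_WITH":
        return phonemes[-k:] == target
    # Offsets at which the whole sequence occurs inside the word.
    starts = [i for i in range(n - k + 1) if phonemes[i:i + k] == target]
    if kind == "CONTAINS":
        return bool(starts)
    if kind == "CONTAINS_MEDIAL":
        # The match may touch neither the first nor the last phoneme.
        return any(i >= 1 and i + k <= n - 1 for i in starts)
    raise ValueError(f"unknown pattern type: {kind}")


def sentence_passes(words, kind, pattern, mode="include"):
    """include: at least one word matches; exclude: no word matches."""
    if mode not in ("include", "exclude"):
        raise ValueError(f"unknown mode: {mode}")
    hit = any(word_matches(w, kind, pattern) for w in words)
    return hit if mode == "include" else not hit
```

With "stop" transcribed as `["s", "t", "ɑ", "p"]`, the pattern "s t" satisfies STARTS_WITH and CONTAINS but not CONTAINS_MEDIAL, because the match includes the word's first phoneme.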

CV shape

Filter for sentences containing words of specified CV shapes (e.g. CVC, CCVCC, CV-CV). Include / exclude semantics match phoneme patterns.
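For readers unfamiliar with CV notation, the sketch below derives a shape string from a syllabified transcription. It assumes that the hyphen in a shape such as CV-CV marks a syllable boundary, and it uses a small hypothetical vowel inventory; neither detail is confirmed by this page.

```python
# Hypothetical vowel inventory for illustration; the tool's own symbol set
# may differ.
VOWELS = {
    "i", "ɪ", "ɛ", "æ", "ɑ", "ɔ", "ʊ", "u", "ʌ", "ə", "ɝ", "ɚ",
    "eɪ", "aɪ", "ɔɪ", "aʊ", "oʊ",
}


def cv_shape(syllables):
    """Map a word (a list of syllables, each a list of phonemes) to its shape."""
    return "-".join(
        "".join("V" if p in VOWELS else "C" for p in syllable)
        for syllable in syllables
    )


def sentence_has_shape(words, shape):
    """Include-mode check: at least one word has the requested shape."""
    return any(cv_shape(w) == shape for w in words)
```

Under these assumptions "cat" (`[["k", "æ", "t"]]`) maps to CVC and "baby" (`[["b", "eɪ"], ["b", "i"]]`) maps to CV-CV.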

Contrastive pairs

Two sentence-level variants, both witnessed via self-join through the pairs table — the sentence must contain both members of at least one matching pair:

  • Minimal pair — both members of a pair where the chosen two phonemes are the only difference (e.g. b/d initial: a sentence with both brain and drain)
  • Maximal opposition (Gierut 1989) — minimal pair where the two phonemes also cross the sonorant class (e.g. p/m)

Multiple opposition is not offered here: requiring a single sentence to witness one substitute against several target phonemes at once almost never occurs in attested text. Use the Contrast Sets tool for multiple-opposition word pairs.

Match count for contrastive rules = the number of witness pairs the sentence carries (more witnesses = higher rank within tier).
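The witness requirement can be sketched as a lookup against a pairs table: a pair counts only when both of its members occur in the sentence. The `Pair` record and the tiny in-memory table below are hypothetical stand-ins for the real pairs table.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    # Hypothetical row of the pairs table.
    word_a: str
    word_b: str
    phoneme_a: str
    phoneme_b: str


# Toy table for illustration only.
PAIRS = [
    Pair("brain", "drain", "b", "d"),
    Pair("big", "dig", "b", "d"),
    Pair("pat", "mat", "p", "m"),
]


def witness_pairs(sentence_words, phoneme_x, phoneme_y, pairs=PAIRS):
    """Return every pair for the x/y contrast with both members in the sentence."""
    present = {w.lower() for w in sentence_words}
    wanted = {phoneme_x, phoneme_y}
    return [
        p for p in pairs
        if {p.phoneme_a, p.phoneme_b} == wanted
        and p.word_a in present
        and p.word_b in present
    ]
```

The length of the returned list plays the role of the match count described above: a sentence with no witness is not returned, and more witnesses mean a higher rank within the tier.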

Psycholinguistic bounds

Restrict by percentile or raw threshold on any of the curated norms (AoA, concreteness, valence, arousal, familiarity, frequency at age band, etc.).

Percentile bounds use NULL-fail semantics: a content word lacking the relevant percentile data fails the bound. These are the right semantics for the question "is this word in the X-y vocabulary band at all?", since a word with no freq_age_2y evidence isn't a 2-year-old's word.

Raw norm bounds use NULL-pass semantics: rating-scale norms (concreteness, valence) can have missing rater coverage; a word with no value gets the benefit of the doubt.

Frequency-class properties (freq_age_*, frequency) treat value=0 as NULL for percentile purposes — a word never occurring in the corpus isn't a "57th-percentile word" (which is where zero-tied entries used to land).
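The three NULL rules above can be summarized in a short sketch. Function names, the dictionary-per-word representation, and the 0 to 100 percentile scale are assumptions for illustration.

```python
def percentile_ok(percentile, lo=0.0, hi=100.0):
    """NULL-fail: a missing percentile never satisfies the bound."""
    return percentile is not None and lo <= percentile <= hi


def raw_ok(value, lo=float("-inf"), hi=float("inf")):
    """NULL-pass: a missing raw norm gets the benefit of the doubt."""
    return value is None or lo <= value <= hi


def frequency_for_percentile(raw_count):
    """Frequency-class values of 0 count as NULL before percentiles are assigned."""
    return None if raw_count is None or raw_count == 0 else raw_count


def sentence_passes_bound(content_words, key, check):
    """A bound holds only if every content word in the sentence satisfies it."""
    return all(check(word.get(key)) for word in content_words)
```

A usage example under the same assumptions: a sentence containing one word with no `freq_age_2y_percentile` value fails a 2y percentile bound, while the same missing value on a raw concreteness bound would pass.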

Word Frequency age bands

The age-band picker resolves to specific *_percentile columns at submission time:

Band  Source
All   FineWeb-Edu general corpus (derived frequency; the general-purpose reference)
2y    Child production at 12-36mo (CHILDES + PhonBank prod channels)
5y    Child production at 36-72mo
8y    Child production at 72-108mo
12y   Child production at 108-144mo

Age bands reflect what children of that age actually produce, not what adults say to them. (The previous build aggregated caregiver input, which surfaced adult vocabulary as a "2y filter" — fixed in v5.2.1.)
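The band-to-column resolution might look like the mapping below. Only `freq_age_2y`, `frequency`, and the `*_percentile` suffix appear on this page; the remaining property names and the function itself are hypothetical, extrapolated from that naming pattern.

```python
# Band label -> frequency property. Names other than "freq_age_2y" and
# "frequency" are guesses that follow the same pattern.
BAND_TO_PROPERTY = {
    "All": "frequency",
    "2y": "freq_age_2y",
    "5y": "freq_age_5y",
    "8y": "freq_age_8y",
    "12y": "freq_age_12y",
}


def percentile_column(band):
    """Resolve an age-band label to the percentile column it filters on."""
    try:
        return f"{BAND_TO_PROPERTY[band]}_percentile"
    except KeyError:
        raise ValueError(f"unknown age band: {band!r}") from None
```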

Workflow

  1. Compose constraints in the left panel — phoneme patterns, CV shapes, bounds, contrastive rules
  2. Run retrieval — top results returned in tens of ms; constraints apply as AND across rule types
  3. Triage with the overlay — highlighted words show why each sentence was returned; the source-pill on each card shows which corpus contributed
  4. Click any word to open its full profile (phonology, norms, similar words, etc.)

Notes on the corpus

The corpus underwent multiple rounds of SLP-targeted curation:

  • Cross-source identical-text merge (e.g. a sentence appearing in both Tatoeba and OpenSubtitles shows both pills)
  • Distressing-content gate (in-house valence + arousal norms + AFINN strong-negative buckets)
  • PROPN cap of 2 per sentence + English-frequency threshold for proper nouns (substitutes for langdetect)
  • Letter-spelled-word rejection (R-o-b-a-r-d patterns)
  • Spanish-loanword denylist + verbal-filler (uh) rejection
  • Parataxis dependency rejection (run-on patterns)
  • spaCy contraction handling — surfaces are stored as whole-word contractions (don't, won't, it's)
  • CHILDES + PhonBank conversational transcripts retired 2026-05-25 (CHAT-transcript artifacts produced locally-plausible but globally-broken sentences for SLP material)
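Three of the gates listed above are simple enough to sketch. The example below assumes sentences arrive as pre-tagged (surface, UPOS) token pairs; the regular expression, the function name, and the filler set (only "uh" is named above) are illustrative guesses rather than the curation pipeline's actual rules.

```python
import re

# Letter-spelled words such as "R-o-b-a-r-d": three or more single letters
# joined by hyphens.
LETTER_SPELLED = re.compile(r"\b(?:[A-Za-z]-){2,}[A-Za-z]\b")

# Only "uh" is named in the notes above; a real list is likely longer.
FILLERS = {"uh"}

MAX_PROPN = 2  # proper-noun cap per sentence


def passes_gates(tokens):
    """tokens: list of (surface, upos) pairs for one candidate sentence."""
    text = " ".join(surface for surface, _ in tokens)
    if LETTER_SPELLED.search(text):
        return False
    if any(surface.lower() in FILLERS for surface, _ in tokens):
        return False
    if sum(1 for _, upos in tokens if upos == "PROPN") > MAX_PROPN:
        return False
    return True
```

Ordinary hyphenated compounds such as "well-to-do" are not caught by this pattern, because it requires single letters between the hyphens.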

See docs/data-derivation-manifest.md and the v5.2.1 CHANGELOG entry for full details.

Related tools

  • Custom Word Lists — same constraint vocabulary applied to the word lexicon rather than sentence corpus
  • Contrast Sets — browse all minimal-pair / maximal-opposition / multiple-opposition pairs in the lexicon (Sentences offers the minimal-pair and maximal-opposition subset at the sentence level; multiple opposition is Contrast Sets only)
  • Lookup — open the full profile of any word found in a Sentences result