Word Lists SLP Curation — Design Spec

Date: 2026-05-14
Status: Design (pending plan)
Owner: Jared Neumann
Brainstorming session: 2026-05-14 conversation, branch feature/phon-116-naturalness-scorer (will branch fresh)

Problem

The current Word Lists tool (packages/web/frontend/src/components/Builder.tsx) exposes 30 surfaced filterable properties across 8 categories, three pattern types, and a phoneme exclusion list. It was built as a "power tool" for researcher-grade exploration, but the empirical SLP audience — confirmed via Reddit corpus analysis (~/Repos/speech-community-analysis) and the SBIR survey draft — uses a much narrower vocabulary of clinical concepts.

Two adjacent problems compound:

  1. Sound similarity is a separate tool (PhonologicalSimilarityTool.tsx, currently unwired from the tool registry per the App_new.tsx:117 comment "PHON-117: Sound Similarity is being consolidated into Word Lists"). It needs a home, and the natural one is as one more composable rule inside Word Lists.
  2. Syllable shape (CV pattern) appears repeatedly in SLP discussion (cluster reduction, cycles approach, apraxia progression, complexity approach — Reddit clusters 459/480/491/2611, 1921, plus SBIR Q10 row "phonological / phonotactic profile"). It is latent in the data — the syllabification module already produces Syllable objects with onset/nucleus/coda lists — but never emitted as a queryable column.

Goals

  1. Curate the platform UI down to ~14 clinically relevant properties across 4 groups, with SLP-language labels.
  2. Fold sound similarity in as one composable rule alongside patterns, exclusions, and property bounds. All rules AND together.
  3. Surface syllable shape (CV pattern) as a new derived column + categorical rule UI primitive.
  4. Keep the API researcher-grade — separation of concerns. Researchers get the full property set; clinicians get the curated subset.
  5. No new datasets, no new metrics — pure curation + one derived column (cv_shape) + one aggregated headline (freq_age_adult) from existing raw band data.

Non-goals

  • Building a "Researcher mode" UI toggle. The API is the researcher surface; we don't need an in-platform power-user mode in this pass.
  • Designing higher-level query presets ("articulation /s/-initial CVC age 5"). Compelling UX moment but out of scope; revisit after curation lands.
  • Touching Contrastive Sets, Sentences (governed generation), Text Analysis, or Lookup. Single-tool surgery.
  • Adding a POS top-level filter chip. The data is there (v5.2 is NOUN/VERB/ADJ/ADV) but adding it as a new categorical filter exceeds the "just curate" constraint. Flag as a follow-up.
  • Adding new pattern types beyond what the data layer already supports. (CONTAINS_MEDIAL added in PHON-110 may need a small frontend catch-up but doesn't earn its own design section.)

Empirical SLP signals driving the curation

From the Reddit SLP corpus (92,842 thread-context units, ~/Repos/speech-community-analysis/data/reports/codebook_v0.1.md) and the SBIR survey draft (~/Repos/speech-community-analysis/phonolex_slp_survey_v01.md):

  • Cluster 412 (747 threads): Speech Therapy Materials and Resources — second-largest cluster overall; the explicit materials-prep pain point.
  • Per-phoneme articulation clusters: 459 (/r/), 480 (/s/), 491 (velar), 428 (/l/), 458 (R challenges). Direct evidence that SLPs frame Word Lists work as "target this phoneme."
  • Cluster 1921: phonological treatment approaches — minimal pairs, maximal opposition, cycles approach. (Most served by Contrastive Sets, but Word Lists feeds the input.)
  • Cluster 2611: complexity approach — Gierut/Storkel target selection via WCM.
  • Cluster 427 (387 threads): Clinical Goal Writing and Implementation — measurable goals feed measurable word lists.
  • Cluster 386/387: Adult Cognitive Therapy + Adult Cog-Comm Materials — separate audience requiring adult-band vocab norms.
  • SBIR Q10 mapping:
      • "Cannot find enough items at the right phonological / phonotactic profile" → phoneme position + exclusion + shape + length
      • "Cannot find enough items at the right grade / vocabulary level" → AoA + developmental frequency by age band
      • "Materials don't avoid trauma-related content for sensitive students" → valence + arousal
      • "Not enough variety to maintain student engagement across sessions" → variety covered by composability + sound similarity for generalization probes

Design

1. Audience and segmentation

Platform UI is curated for three SLP segments:

  1. Pediatric articulation / phonology
  2. Pediatric language / literacy
  3. Adult aphasia / dysarthria

Researchers / grad students use the API. This is a clean separation of concerns: platform = clinical, API = researcher-grade.

2. Composition model

Word Lists is a list of rules that AND together. Any rule can be empty/skipped. Five rule types:

| Rule type | Existing? | Notes |
|---|---|---|
| Phoneme position pattern | yes | STARTS_WITH / ENDS_WITH / CONTAINS / CONTAINS_MEDIAL; multiple patterns AND |
| Exclude phonemes | yes | Single blacklist of IPA phonemes |
| Similar to anchor word | yes (separate tool) | Folded in; optional anchor + preset/weights + threshold |
| Property bounds | yes | Curated to 13 numeric properties (see §3) |
| Shape (CV pattern) | NO | New categorical rule against new cv_shape derived column |
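
The AND-composition described above can be sketched as a list of predicates in which empty rules are skipped. This is a minimal illustration only: the Word record, its field names, and the helper functions are hypothetical and do not reflect the real schema or the Workers query path.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical word record; field names are illustrative, not the real schema.
@dataclass(frozen=True)
class Word:
    text: str
    phonemes: tuple
    cv_shape: str
    syllable_count: int


Rule = Callable[[Word], bool]


def starts_with(phoneme: str) -> Rule:
    """Phoneme position pattern: word begins with the given phoneme."""
    return lambda w: bool(w.phonemes) and w.phonemes[0] == phoneme


def exclude_phonemes(blocked: set) -> Rule:
    """Exclusion rule: word contains none of the blocked phonemes."""
    return lambda w: not blocked.intersection(w.phonemes)


def shape_in(shapes: set) -> Rule:
    """Categorical rule: OR within the set of shapes."""
    return lambda w: w.cv_shape in shapes


def bounds(attr: str, lo: float, hi: float) -> Rule:
    """Property bounds: inclusive min/max on a numeric attribute."""
    return lambda w: lo <= getattr(w, attr) <= hi


def build_list(words: list, rules: list) -> list:
    """Keep words that satisfy every active rule; None means 'rule skipped'."""
    active = [r for r in rules if r is not None]
    return [w for w in words if all(r(w) for r in active)]


WORDS = [
    Word("cat", ("k", "æ", "t"), "CVC", 1),
    Word("spring", ("s", "p", "ɹ", "ɪ", "ŋ"), "CCCVC", 1),
    Word("sun", ("s", "ʌ", "n"), "CVC", 1),
]

# /s/-initial AND shape CVC; the skipped (None) rule has no effect.
result = build_list(WORDS, [starts_with("s"), None, shape_in({"CVC"})])
print([w.text for w in result])  # ['sun']
```

An empty rule list returns every word, which matches the "any rule can be empty/skipped" behavior.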

3. Curated property surface

Final platform surface: 4 groups, 14 properties.

| Group label (proposed) | Current group | Properties | Why kept |
|---|---|---|---|
| Word Shape | Phonological Complexity | syllable_count, phoneme_count, wcm_score, cv_shape ← new | Complexity approach, apraxia progression, cluster work, cycles |
| Age Appropriateness | Lexical + Developmental Frequency (merged) | aoa, freq_age_2y, freq_age_5y, freq_age_8y, freq_age_12y, freq_age_adult ← new headline | Age-stratified vocab is THE bottleneck SLPs name |
| Imagery & Familiarity | Semantic Properties | concreteness, familiarity | Picture stim viability + "does the kid know it" |
| Emotional Tone | Affective Properties | valence, arousal | SBIR Q10 trauma-sensitive content; aphasia stim work |

Dropped from the platform UI (still accessible via API):

| Group | Dropped properties | Why dropped |
|---|---|---|
| Phonotactic Probability | phono_prob_avg, positional_prob_avg, neighborhood_density, str_phono_prob_avg, str_positional_prob_avg, str_neighborhood_density | Researcher precision (4 decimals on BPP/PSP); ND borderline-clinical but no Reddit/SBIR signal |
| Lexical (stragglers) | frequency (raw), contextual_diversity, pos_dominant_freq, log_frequency | Raw freq redundant with Zipf; CD duplicates freq; POS dominance is a researcher metric; log_frequency subsumed by freq_age_adult in the new Age Appropriateness group |
| Cognitive/Embodied | iconicity, boi, socialness, semd_topic, n_topics_for_word | Research norms; no clinical workflow signal |
| Morphological | morpheme_count, n_prefixes, n_suffixes | Niche literacy use; not in empirical signal |

Already retired (verify removal in this pass): freq_cyplex_*, semantic_diversity alias, semd_vn, semd_h13 (all retired in PHON-117); Lancaster sensorimotor (NO-GO 2026-05-12); ELP RT (PHON-71/75).

4. Two derived data additions

These are not new datasets — strict aggregations / derivations of existing raw data:

4a. cv_shape — whole-word CV skeleton

Derived from existing Syllable objects (packages/data/src/phonolex_data/phonology/syllabification.py).

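# sketch for pipeline/words.py
# (import path follows the syllabification.py location given above)
from phonolex_data.phonology.syllabification import Syllable


def compute_cv_shape(syllables: list[Syllable]) -> str:
    """Return the whole-word CV skeleton, one hyphen-separated part per syllable."""
    # Each onset/coda consonant contributes one "C"; the nucleus is always a
    # single "V", so a diphthong still counts as one vowel.
    return "-".join(
        "C" * len(syl.onset) + "V" + "C" * len(syl.coda) for syl in syllables
    )

# examples (dots separate phonemes, not syllables):
#   "cat"     /k.æ.t/         → "CVC"
#   "spring"  /s.p.ɹ.ɪ.ŋ/     → "CCCVC"
#   "kitten"  /k.ɪ.t.ə.n/     → "CVC-VC"  (would be "CV-CVC" if the syllabifier maximizes onsets; confirm against actual output)
#   "boat"    /b.oʊ.t/        → "CVC"  (diphthong = single V, already in VOWELS set)

  • New column cv_shape: str on words.parquet (lives on the words table, not word_properties, because it's the first string-typed platform property).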
  • New PropertyDef in both packages/web/workers/src/config/properties.ts and packages/web/workers/scripts/config.py. Marked with a new kind: 'numeric' | 'categorical' field on PropertyDef; default 'numeric' for backwards compat; cv_shape is 'categorical'.
  • platform_visible: true.
  • API filter: cv_shape accepts either an exact string match or a comma-separated list (OR within the list).
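
As a rough sketch of the filter semantics in the last bullet, the parameter can be split on commas and matched with OR semantics. The function names are hypothetical, and treating an empty parameter as "rule inactive" is an assumption carried over from the composition model, not something the route is confirmed to do.

```python
def parse_cv_shape_filter(raw: str) -> set:
    """Split a comma-separated cv_shape parameter into a set of shapes.

    A single exact value ("CVC") and a list ("CVC,CV-CV") go through the
    same path; surrounding whitespace and empty entries are dropped.
    """
    return {part.strip() for part in raw.split(",") if part.strip()}


def matches_cv_shape(cv_shape: str, raw_filter: str) -> bool:
    """OR within the list; an empty filter is assumed to leave the rule inactive."""
    wanted = parse_cv_shape_filter(raw_filter)
    return not wanted or cv_shape in wanted


print(sorted(parse_cv_shape_filter("CVC, CV-CV")))  # ['CV-CV', 'CVC']
print(matches_cv_shape("CV-CV", "CVC,CV-CV"))       # True
print(matches_cv_shape("CCVC", "CVC,CV-CV"))        # False
```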

4b. freq_age_adult — adult developmental-frequency headline

Mean of existing wpm_b4 + wpm_b5 raw band cols (FineWeb-Edu grade-banded; b4/b5 are the high-school/college tail). Mirrors the existing 4 headline aggregations in DEV_FREQ_HEADLINES.

# pseudocode in pipeline/words.py, alongside the existing freq_age_2y/5y/8y/12y computations
def compute_freq_age_adult(row) -> float:
    # Mean of the two adult-tail bands; per the helper's name, a missing band counts as 0.
    return mean_missing_as_zero(row.wpm_b4, row.wpm_b5)

  • New entry in DEV_FREQ_HEADLINES array (both properties.ts and config.py).
  • Scale 0–50000 wpm, use_log_scale: true, mirrors siblings.
  • platform_visible: true.
  • log_frequency drops out of the platform surface (still in the API).
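
The spec names mean_missing_as_zero without defining it. The sketch below shows one plausible reading, assuming "missing" means None or NaN and that such values are counted as 0 rather than skipped. The pipeline's real helper may differ, so this is an illustration, not its implementation.

```python
import math
from types import SimpleNamespace
from typing import Optional


def mean_missing_as_zero(*values: Optional[float]) -> float:
    """Arithmetic mean in which None/NaN contribute 0 (assumed behavior)."""
    cleaned = [0.0 if v is None or math.isnan(v) else float(v) for v in values]
    return sum(cleaned) / len(cleaned) if cleaned else 0.0


def compute_freq_age_adult(row) -> float:
    """Adult headline: mean of the b4 and b5 words-per-million bands."""
    return mean_missing_as_zero(row.wpm_b4, row.wpm_b5)


# Illustrative rows with made-up band values.
both_bands = SimpleNamespace(wpm_b4=10.0, wpm_b5=30.0)
one_band = SimpleNamespace(wpm_b4=10.0, wpm_b5=None)

print(compute_freq_age_adult(both_bands))  # 20.0
print(compute_freq_age_adult(one_band))    # 5.0
```

Under this reading a word attested in only one band is pulled toward zero instead of inheriting the single band's value, which is worth confirming against how the four sibling headlines behave.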

5. UI surface

Builder.tsx restructures into 5 accordion sections (preserves the existing <Accordion> pattern; reuses PropertySlider, PhonemePickerDialog, WordListTable):

┌──────────────────────────────────────────────────────────────┐
│ ▼ Phoneme rules                              (default open)  │
│   • Pattern matching (existing, +CONTAINS_MEDIAL catch-up)   │
│   • Exclude phonemes (existing)                              │
│   • Similar to ____ anchor word (new fold-in)                │
├──────────────────────────────────────────────────────────────┤
│ ▼ Word Shape                                 (default open)  │
│   • syllable_count slider                                    │
│   • phoneme_count slider                                     │
│   • wcm_score slider                                         │
│   • CV shape chip picker (new categorical rule)              │
├──────────────────────────────────────────────────────────────┤
│ ▶ Age Appropriateness    (6 sliders, collapsed)              │
├──────────────────────────────────────────────────────────────┤
│ ▶ Imagery & Familiarity  (2 sliders, collapsed)              │
├──────────────────────────────────────────────────────────────┤
│ ▶ Emotional Tone         (2 sliders, collapsed)              │
└──────────────────────────────────────────────────────────────┘
                                                  [ Build list ]

5a. "Similar to" rule — lift from PhonologicalSimilarityTool

Located inside the Phoneme rules accordion, as one more composable rule. It does not get its own section because the requirement is that it be "equally composable" with the other rules, and section parity would visually elevate it.

┌─ Similar to ──────────────────────────────────────┐
│ Anchor word: [snake               ]               │
│                                                   │
│ Preset:                                           │
│ [ Rhymes ● ][ Alliteration ][ Assonance ]         │
│ [ Consonance ][ Balanced  ]                       │
│                                                   │
│ Match strength: [ High (0.85) ▼ ]                 │
│                                                   │
│ ▶ Advanced (component weights + position)         │ ← collapsed disclosure
└───────────────────────────────────────────────────┘

Lift directly from PhonologicalSimilarityTool.tsx:

  • PRESETS array (Rhymes / Balanced / Alliteration / Assonance / Consonance): lines 47–83
  • Labeled-bucket threshold select (Very High / High / Medium / Low / Very Low): lines 292–305
  • Position + syllableCount coupled selects with disable-on-all-or-medial logic: lines 176–214

Behavior:

  • Empty anchor → rule inactive (no similarity backend call, no result intersection)
  • Active anchor → backend hit; results AND-intersected with property+pattern+exclude results
  • Result ordering: when the similarity rule is active, sort defaults to similarity desc; otherwise word-name asc (existing WordListTable defaultSort prop)

5b. CV shape rule — categorical chip picker

New reusable component <CategoricalRule> (planned, generic for future categorical filters):

┌─ CV shape ────────────────────────────────────────┐
│ Common shapes:                                    │
│ [ V ][ CV ][ VC ][ CVC ● ][ CCV ][ CCVC ]         │
│ [ CVCC ][ CCVCC ][ CV-CV ][ CV-CVC ][ CCV-CV ]    │
│                                                   │
│ Custom: [ CVCV-CVC      ]    [+ Add]              │
│                                                   │
│ Active: CVC, CV-CV  ✕                             │
└───────────────────────────────────────────────────┘

  • Multi-select chips (OR semantics within rule, AND with rest of query)
  • Common-shapes preset list covers apraxia progression + cluster work + cycles staples
  • Free-text "Custom" input + Add button — accepts any sequence matching ^[CV]+(-[CV]+)*$
  • "Active" line shows the current OR'd selection with individual remove chips

The <CategoricalRule> component is reusable: same pattern would serve a future POS chip rule.

6. Platform / API separation mechanism

Two-tier visibility for properties. Preserve existing surfaced semantics; add platform_visible.

// packages/web/workers/src/config/properties.ts
export interface PropertyDef {
  // ... existing fields including surfaced
  surfaced?: boolean;          // unchanged: false = D1-only, not in /api/property-metadata
  platform_visible?: boolean;  // NEW: true = shown in platform UI; default undefined (= API-only)
  kind?: 'numeric' | 'categorical';  // NEW: default 'numeric'; cv_shape is 'categorical'
}

Metadata route gains a query param:

GET /api/property-metadata                  → full surfaced set (researcher; current behavior)
GET /api/property-metadata?surface=platform → curated 14-property platform subset

getSurfacedCategories() stays as-is. Add:

export function getPlatformCategories(): PropertyCategory[] {
  // filter to surfaced && platform_visible === true; drop empty groups
}

Frontend usePropertyMetadata hook calls ?surface=platform by default. Mirror updates in packages/web/workers/scripts/config.py.

14 properties get platform_visible: true:

  • Word Shape: syllable_count, phoneme_count, wcm_score, cv_shape
  • Age Appropriateness: aoa, freq_age_2y, freq_age_5y, freq_age_8y, freq_age_12y, freq_age_adult
  • Imagery & Familiarity: concreteness, familiarity
  • Emotional Tone: valence, arousal

All other surfaced properties get no platform_visible flag (i.e., API-only by default).
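
For the config.py mirror, the platform filter might look like the sketch below. The dataclass shapes are simplified stand-ins for the real PropertyDef and PropertyCategory, and treating an unset surfaced flag as "surfaced" is an assumption based on the "false = D1-only" comment above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PropertyDef:
    id: str
    surfaced: bool = True                    # False = D1-only (assumed default: surfaced)
    platform_visible: Optional[bool] = None  # None = API-only
    kind: str = "numeric"                    # "numeric" or "categorical"


@dataclass
class PropertyCategory:
    label: str
    properties: list = field(default_factory=list)


def get_platform_categories(categories: list) -> list:
    """Keep surfaced, platform-visible properties and drop empty groups."""
    result = []
    for cat in categories:
        visible = [
            p for p in cat.properties
            if p.surfaced and p.platform_visible is True
        ]
        if visible:
            result.append(PropertyCategory(cat.label, visible))
    return result


CATEGORIES = [
    PropertyCategory("Word Shape", [
        PropertyDef("syllable_count", platform_visible=True),
        PropertyDef("cv_shape", platform_visible=True, kind="categorical"),
    ]),
    PropertyCategory("Phonotactic Probability", [
        PropertyDef("phono_prob_avg"),        # API-only: no platform flag
        PropertyDef("neighborhood_density"),
    ]),
    PropertyCategory("Emotional Tone", [
        PropertyDef("valence", platform_visible=True),
        PropertyDef("arousal", platform_visible=True),
    ]),
]

platform = get_platform_categories(CATEGORIES)
print([c.label for c in platform])  # ['Word Shape', 'Emotional Tone']
```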

7. Sound similarity fold-in mechanism (C1: single combined endpoint)

POST /api/words/search gains an optional similar_to block:

interface WordSearchRequest {
  patterns?: Pattern[];
  exclude_phonemes?: string[];
  cv_shape?: string[];        // OR within array
  // ... existing min_/max_ filter params
  similar_to?: {
    word: string;
    weights: { onset: number; nucleus: number; coda: number };
    threshold: number;
    position: 'all' | 'initial' | 'final' | 'medial';
    syllable_count: number;
  };
}

Server flow when similar_to is present:

  1. Run the existing similarity scan to produce a { word → similarity_score } map above threshold
  2. Run the existing filter+pattern query against the words table
  3. Intersect by word name
  4. Sort intersection by similarity desc
  5. Return top N with the similarity field populated on each row

Server flow when similar_to is absent: unchanged from today.
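
Steps 3 to 5 of the similar_to flow can be sketched as follows, written in Python for illustration even though the route itself lives in the TypeScript Workers code. The function name, the row shape, and the alphabetical tie-break are assumptions, and the scores are made up.

```python
def intersect_with_similarity(filtered_rows: list, similarity_scores: dict, limit: int) -> list:
    """Intersect filter results with the similarity map and rank by score.

    filtered_rows: rows from the filter+pattern query, each with a "word" key.
    similarity_scores: { word: score } for words already above the threshold.
    """
    hits = [
        {**row, "similarity": similarity_scores[row["word"]]}
        for row in filtered_rows
        if row["word"] in similarity_scores
    ]
    # Similarity descending; ties broken by word ascending (an assumption).
    hits.sort(key=lambda r: (-r["similarity"], r["word"]))
    return hits[:limit]


# Made-up example data for an anchor such as "snake".
rows = [{"word": "rake"}, {"word": "snack"}, {"word": "cake"}]
scores = {"cake": 0.91, "rake": 0.91, "lake": 0.88}

top = intersect_with_similarity(rows, scores, limit=10)
print([(r["word"], r["similarity"]) for r in top])
# [('cake', 0.91), ('rake', 0.91)]
```

"snack" drops out because it is not in the similarity map, and "lake" drops out because it failed the filter query, which is the AND behavior described in §5a.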

Frontend apiClient updates: findSimilarWords retires; searchWords accepts the optional similar_to block. Existing direct /api/similarity/search route stays in place for backward compat (no UI uses it anymore once Builder migrates; can be removed in a v5.3 cleanup).

Implementation summary (deferred to plan)

Files touched (anticipated, for planning purposes only):

Data layer

  • packages/data/src/phonolex_data/pipeline/words.py: add cv_shape derivation + freq_age_adult headline
  • packages/data/src/phonolex_data/runtime/schema.py: register new columns

API layer

  • packages/web/workers/src/config/properties.ts: add platform_visible, kind, cv_shape PropertyDef, freq_age_adult PropertyDef, getPlatformCategories() function
  • packages/web/workers/scripts/config.py: mirror
  • packages/web/workers/src/routes/meta.ts (or wherever property-metadata is served): ?surface=platform handling
  • packages/web/workers/src/routes/words.ts: accept similar_to block in /api/words/search; intersect logic
  • packages/web/workers/src/routes/similarity.ts: left in place for backward compat (slated for v5.3 removal)
  • packages/web/workers/src/types.ts: update WordSearchRequest, Word types

Frontend

  • packages/web/frontend/src/components/Builder.tsx: restructure 5 accordions; wire new components
  • packages/web/frontend/src/components/shared/SimilarToRule.tsx: NEW; lifted from PhonologicalSimilarityTool.tsx
  • packages/web/frontend/src/components/shared/CategoricalRule.tsx: NEW; reusable chip + custom-input picker
  • packages/web/frontend/src/components/tools/PhonologicalSimilarityTool.tsx: DELETE
  • packages/web/frontend/src/services/phonolexApi.ts (or wherever apiClient lives): update searchWords signature
  • packages/web/frontend/src/hooks/usePropertyMetadata.ts: call ?surface=platform
  • packages/web/frontend/src/App_new.tsx: update Word Lists description to remove the "rhyming and sound-similarity workflows" sub-mention (now part of the unified tool)

Tests

  • packages/data/tests/: cv_shape derivation correctness across a sample of CMU words; freq_age_adult aggregation matches sibling pattern
  • packages/web/workers/test/: ?surface=platform filtering; similar_to intersection logic with mocked similarity backend
  • packages/web/frontend/src/test/: Builder.tsx restructure smoke; CategoricalRule + SimilarToRule unit

Data regeneration

  • Regenerate data/runtime/words.parquet via uv run python packages/data/scripts/build_runtime_parquet.py
  • Regenerate d1-seed.sql via packages/web/workers/scripts/export-to-d1.py
  • Apply migration to local + staging + prod D1

Risks

  1. Loss aversion from researchers who used the dropped properties: mitigated by API parity. The 30→14 cut is UI-only; nothing leaves the data layer. Frame the change as "platform got SLP-focused", not "we removed properties."
  2. cv_shape interpretation edge cases — syllabic consonants (/n̩/ in "button"), affricates (/tʃ/ is one C or two?), diphthongs. The existing syllabifier already collapses diphthongs to single V (verified — they're in the VOWELS set at syllabification.py:39). Affricates are single phonemes in CMU/IPA mappings → single C. Syllabic consonants are rare in CMU dict outputs and would parse as schwa+C. Document the rules; not blocking.
  3. D1 column count headroom — D1 hard limit is 100 cols per table (CLAUDE.md gotcha). words table currently has phonemes_str and a few string cols; adding cv_shape should fit comfortably. Verify before write.
  4. PHON-117 retirement leakage — frontend may still be showing retired properties via stale metadata cache. The migration to ?surface=platform ensures only platform_visible: true properties render; retired properties (surfaced: false) were already gone but worth verifying in QA.
  5. First categorical rule UI primitive: <CategoricalRule> is new infrastructure. Component design should be opinionated enough to be reusable (POS, future hypotheticals) without being a generic form-builder.
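
For risk 3, the "verify before write" step can be sketched as a column count against any SQLite copy of the schema, since D1 is SQLite-based. The function name and the in-memory table are illustrative, and the 100-column figure is the limit quoted above.

```python
import sqlite3

D1_MAX_COLUMNS = 100  # per-table limit quoted in risk 3


def column_headroom(conn: sqlite3.Connection, table: str, adding: int = 1) -> int:
    """Columns still free after adding `adding` new ones to `table`."""
    # PRAGMA table_info returns one row per column of the table.
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return D1_MAX_COLUMNS - (len(cols) + adding)


# Stand-in schema; the real words table has more columns than this.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (word TEXT PRIMARY KEY, phonemes_str TEXT, syllable_count INTEGER)"
)

print(column_headroom(conn, "words"))  # 96
```

A negative result would mean the cv_shape column does not fit and needs a different home.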

Open follow-ups (deferred — not part of this design)

  • POS top-level chip filter (NOUN/VERB/ADJ/ADV) — data is there but adding the surface exceeds "just curate." Worth filing.
  • Starter query presets ("articulation /s/-initial CVC age 5", "rhymes for X", "early intervention core vocab") — compelling UX moment; revisit after curation lands.
  • Researcher mode toggle for in-platform researchers — currently they use the API; if SBIR or downstream work surfaces demand, build a toggle.
  • /api/similarity/search removal: keep for backward compat until the v5.3 cleanup, then drop once no consumer remains.
  • Retired-property visibility audit: verify the frontend isn't still showing freq_cyplex_*, semd_vn, semd_h13 via stale state.

References

  • Current Builder: packages/web/frontend/src/components/Builder.tsx
  • Current property surface: packages/web/workers/src/config/properties.ts
  • Hidden similarity tool: packages/web/frontend/src/components/tools/PhonologicalSimilarityTool.tsx
  • Similarity route: packages/web/workers/src/routes/similarity.ts
  • Tool registry: packages/web/frontend/src/App_new.tsx
  • Reddit SLP corpus analysis: ~/Repos/speech-community-analysis/data/reports/codebook_v0.1.md, ~/Repos/speech-community-analysis/data/reports/phase1_memo.md
  • SBIR SLP survey draft: ~/Repos/speech-community-analysis/phonolex_slp_survey_v01.md
  • PHON-117 (Sound Similarity into Word Lists) — referenced in App_new.tsx:117 comment