Word Lists SLP Curation: Design Spec

Date: 2026-05-14
Status: Design (pending plan)
Owner: Jared Neumann
Brainstorming session: 2026-05-14 conversation, branch feature/phon-116-naturalness-scorer (will branch fresh)

Problem

The current Word Lists tool (packages/web/frontend/src/components/Builder.tsx) exposes 30 surfaced filterable properties across 8 categories, three pattern types, and a phoneme exclusion list. It was built as a "power tool" for researcher-grade exploration, but the empirical SLP audience — confirmed via Reddit corpus analysis (~/Repos/speech-community-analysis) and the SBIR survey draft — uses a much narrower vocabulary of clinical concepts.

Two adjacent problems compound:

Sound similarity is a separate tool (PhonologicalSimilarityTool.tsx, currently unwired from the tool registry per the App_new.tsx:117 comment "PHON-117: Sound Similarity is being consolidated into Word Lists"). It needs a home, and the natural one is as one more composable rule inside Word Lists.
Syllable shape (CV pattern) appears repeatedly in SLP discussion (cluster reduction, cycles approach, apraxia progression, complexity approach — Reddit clusters 459/480/491/2611, 1921, plus SBIR Q10 row "phonological / phonotactic profile"). It is latent in the data — the syllabification module already produces Syllable objects with onset/nucleus/coda lists — but never emitted as a queryable column.

Goals

Curate the platform UI down to ~14 clinically relevant properties across 4 groups, with SLP-language labels.
Fold sound similarity in as one composable rule alongside patterns, exclusions, and property bounds. All rules AND together.
Surface syllable shape (CV pattern) as a new derived column + categorical rule UI primitive.
Keep the API researcher-grade — separation of concerns. Researchers get the full property set; clinicians get the curated subset.
No new datasets, no new metrics — pure curation + one derived column (cv_shape) + one aggregated headline (freq_age_adult) from existing raw band data.

Non-goals

Building a "Researcher mode" UI toggle. The API is the researcher surface; we don't need an in-platform power-user mode in this pass.
Designing higher-level query presets ("articulation /s/-initial CVC age 5"). Compelling UX moment but out of scope; revisit after curation lands.
Touching Contrastive Sets, Sentences (governed generation), Text Analysis, or Lookup. Single-tool surgery.
Adding a POS top-level filter chip. The data is there (v5.2 is NOUN/VERB/ADJ/ADV) but adding it as a new categorical filter exceeds the "just curate" constraint. Flag as a follow-up.
Adding new pattern types beyond what the data layer already supports. (CONTAINS_MEDIAL added in PHON-110 may need a small frontend catch-up but doesn't earn its own design section.)

Empirical SLP signals driving the curation

From the Reddit SLP corpus (92,842 thread-context units, ~/Repos/speech-community-analysis/data/reports/codebook_v0.1.md) and the SBIR survey draft (~/Repos/speech-community-analysis/phonolex_slp_survey_v01.md):

Cluster 412 (747 threads): Speech Therapy Materials and Resources — second-largest cluster overall; the explicit materials-prep pain point.
Per-phoneme articulation clusters: 459 (/r/), 480 (/s/), 491 (velar), 428 (/l/), 458 (R challenges). Direct evidence that SLPs frame Word Lists work as "target this phoneme."
Cluster 1921: phonological treatment approaches (minimal pairs, maximal opposition, cycles approach). Mostly served by Contrastive Sets, but Word Lists feeds the input.
Cluster 2611: complexity approach — Gierut/Storkel target selection via WCM.
Cluster 427 (387 threads): Clinical Goal Writing and Implementation — measurable goals feed measurable word lists.
Clusters 386/387: Adult Cognitive Therapy and Adult Cog-Comm Materials, a separate audience requiring adult-band vocab norms.
SBIR Q10 mapping:
"Cannot find enough items at the right phonological / phonotactic profile" → phoneme position + exclusion + shape + length
"Cannot find enough items at the right grade / vocabulary level" → AoA + developmental frequency by age band
"Materials don't avoid trauma-related content for sensitive students" → valence + arousal
"Not enough variety to maintain student engagement across sessions" → variety covered by composability + sound similarity for generalization probes

Design

1. Audience and segmentation

Platform UI is curated for three SLP segments:

Pediatric articulation / phonology
Pediatric language / literacy
Adult aphasia / dysarthria

Researchers / grad students use the API. This is a clean separation of concerns: platform = clinical, API = researcher-grade.

2. Composition model

Word Lists is a list of rules that AND together. Any rule can be empty/skipped. Five rule types:

| Rule type | Existing? | Notes |
|---|---|---|
| Phoneme position pattern | yes | STARTS_WITH / ENDS_WITH / CONTAINS / CONTAINS_MEDIAL; multiple patterns AND |
| Exclude phonemes | yes | Single blacklist of IPA phonemes |
| Similar to anchor word | yes (separate tool) | Folded in; optional anchor + preset/weights + threshold |
| Property bounds | yes | Curated to 13 numeric properties (see §3) |
| Shape (CV pattern) | no (new) | New categorical rule against new `cv_shape` derived column |

3. Curated property surface

Final platform surface: 4 groups, 14 properties.

| Group label (proposed) | Current group | Properties | Why kept |
|---|---|---|---|
| Word Shape | Phonological Complexity | `syllable_count`, `phoneme_count`, `wcm_score`, `cv_shape` ← new | Complexity approach, apraxia progression, cluster work, cycles |
| Age Appropriateness | Lexical + Developmental Frequency (merged) | `aoa`, `freq_age_2y`, `freq_age_5y`, `freq_age_8y`, `freq_age_12y`, `freq_age_adult` ← new headline | Age-stratified vocab is THE bottleneck SLPs name |
| Imagery & Familiarity | Semantic Properties | `concreteness`, `familiarity` | Picture stim viability + "does the kid know it" |
| Emotional Tone | Affective Properties | `valence`, `arousal` | SBIR Q10 trauma-sensitive content; aphasia stim work |

Dropped from the platform UI (still accessible via API):

| Group | Dropped properties | Why dropped |
|---|---|---|
| Phonotactic Probability | `phono_prob_avg`, `positional_prob_avg`, `neighborhood_density`, `str_phono_prob_avg`, `str_positional_prob_avg`, `str_neighborhood_density` | Researcher precision (4 decimals on BPP/PSP); ND borderline-clinical but no Reddit/SBIR signal |
| Lexical (stragglers) | `frequency` (raw), `contextual_diversity`, `pos_dominant_freq`, `log_frequency` | Raw freq redundant with Zipf; CD duplicates freq; POS dominance is a researcher metric; log_frequency subsumed by `freq_age_adult` in the new Age group |
| Cognitive/Embodied | `iconicity`, `boi`, `socialness`, `semd_topic`, `n_topics_for_word` | Research norms; no clinical workflow signal |
| Morphological | `morpheme_count`, `n_prefixes`, `n_suffixes` | Niche literacy use; not in empirical signal |

Already retired (verify removal in this pass): freq_cyplex_*, semantic_diversity alias, semd_vn, semd_h13 (all retired in PHON-117); Lancaster sensorimotor (NO-GO 2026-05-12); ELP RT (PHON-71/75).

4. Two derived data additions

These are not new datasets — strict aggregations / derivations of existing raw data:

4a. `cv_shape`: whole-word CV skeleton

Derived from existing Syllable objects (packages/data/src/phonolex_data/phonology/syllabification.py).

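# sketch for pipeline/words.py
# Import path assumed from the module location given above.
from phonolex_data.phonology.syllabification import Syllable


def compute_cv_shape(syllables: list[Syllable]) -> str:
    """Return the whole-word CV skeleton, one hyphen-separated part per syllable."""
    # Each syllable contributes exactly one "V" for its nucleus (a diphthong is
    # a single vowel phoneme), flanked by one "C" per onset and coda consonant.
    return "-".join(
        "C" * len(syl.onset) + "V" + "C" * len(syl.coda) for syl in syllables
    )

# examples:
#   "cat"     /k.æ.t/         → "CVC"
#   "spring"  /s.p.ɹ.ɪ.ŋ/     → "CCCVC"
#   "kitten"  /k.ɪ.t.ə.n/     → "CVC-VC"  (or "CV-CVC" if the syllabifier assigns /t/ to the second onset; confirm against its output)
#   "boat"    /b.oʊ.t/        → "CVC"  (diphthong = single V, already in VOWELS set)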

New column cv_shape: str on words.parquet (lives on words table not word_properties because it's the first string-typed platform property).
New PropertyDef in both packages/web/workers/src/config/properties.ts and packages/web/workers/scripts/config.py. Marked with a new kind: 'numeric' | 'categorical' field on PropertyDef; default 'numeric' for backwards compat; cv_shape is 'categorical'.
platform_visible: true.
API filter: cv_shape accepts either an exact string match or a comma-separated list (OR within the list).
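
The filter semantics in the last bullet (an exact match, or OR within a comma-separated list) can be sketched as follows. This is a minimal illustration rather than the worker's actual code: `parseCvShapeParam` and `matchesCvShape` are hypothetical helper names, and trimming whitespace and dropping empty entries are assumptions.

```typescript
// Hypothetical helpers illustrating the cv_shape filter semantics.
// Assumption: whitespace around entries is ignored and empty entries are dropped.

/** Split a raw query value such as "CVC,CV-CV" into individual shapes. */
function parseCvShapeParam(raw: string | undefined): string[] {
  if (raw === undefined) return [];
  return raw
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

/** A word passes when no shapes are requested, or when its shape equals any requested shape (OR). */
function matchesCvShape(wordShape: string, requested: string[]): boolean {
  return requested.length === 0 || requested.indexOf(wordShape) !== -1;
}
```

A single exact value is just the one-element case of the list, so both accepted input forms go through the same code path.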

4b. `freq_age_adult`: adult developmental-frequency headline

Mean of the existing wpm_b4 and wpm_b5 raw band columns (FineWeb-Edu grade-banded; b4/b5 are the high-school/college tail). Mirrors the existing 4 headline aggregations in DEV_FREQ_HEADLINES.

# pseudocode in pipeline/words.py — alongside the existing freq_age_2y/5y/8y/12y computations
def compute_freq_age_adult(row) -> float:
    return mean_missing_as_zero(row.wpm_b4, row.wpm_b5)

New entry in DEV_FREQ_HEADLINES array (both properties.ts and config.py).
Scale 0–50000 wpm, use_log_scale: true, mirrors siblings.
platform_visible: true.
log_frequency drops out of the platform surface (still in the API).

5. UI surface

Builder.tsx restructures into 5 accordion sections (preserves the existing <Accordion> pattern; reuses PropertySlider, PhonemePickerDialog, WordListTable):

┌──────────────────────────────────────────────────────────────┐
│ ▼ Phoneme rules                              (default open)  │
│   • Pattern matching (existing, +CONTAINS_MEDIAL catch-up)   │
│   • Exclude phonemes (existing)                              │
│   • Similar to ____ anchor word (new fold-in)                │
├──────────────────────────────────────────────────────────────┤
│ ▼ Word Shape                                 (default open)  │
│   • syllable_count slider                                    │
│   • phoneme_count slider                                     │
│   • wcm_score slider                                         │
│   • CV shape chip picker (new categorical rule)              │
├──────────────────────────────────────────────────────────────┤
│ ▶ Age Appropriateness    (6 sliders, collapsed)              │
├──────────────────────────────────────────────────────────────┤
│ ▶ Imagery & Familiarity  (2 sliders, collapsed)              │
├──────────────────────────────────────────────────────────────┤
│ ▶ Emotional Tone         (2 sliders, collapsed)              │
└──────────────────────────────────────────────────────────────┘
                                                  [ Build list ]

5a. "Similar to" rule: lift from PhonologicalSimilarityTool

Located inside the Phoneme rules accordion as one more composable rule. It is not a separate section because the rule is meant to be "equally composable" with the others; a dedicated section would visually elevate it above them.

┌─ Similar to ──────────────────────────────────────┐
│ Anchor word: [snake               ]               │
│                                                   │
│ Preset:                                           │
│ [ Rhymes ● ][ Alliteration ][ Assonance ]         │
│ [ Consonance ][ Balanced  ]                       │
│                                                   │
│ Match strength: [ High (0.85) ▼ ]                 │
│                                                   │
│ ▶ Advanced (component weights + position)         │ ← collapsed disclosure
└───────────────────────────────────────────────────┘

Lift directly from PhonologicalSimilarityTool.tsx:
- PRESETS array (Rhymes / Balanced / Alliteration / Assonance / Consonance): lines 47–83
- Labeled-bucket threshold select (Very High / High / Medium / Low / Very Low): lines 292–305
- Position + syllableCount coupled selects with disable-on-all-or-medial logic: lines 176–214

Behavior:
- Empty anchor → rule inactive (no similarity backend call, no result intersection)
- Active anchor → backend hit; results AND-intersected with property+pattern+exclude results
- Result ordering: when the similarity rule is active, sort defaults to similarity desc; otherwise word-name asc (existing WordListTable defaultSort prop)
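
The activation and ordering rules above can be sketched as two small functions. Everything named here (`SimilarToState`, `ListSort`, `isSimilarityActive`, `defaultSortFor`) is hypothetical, as is the shape of the sort value; treating a whitespace-only anchor as empty is also an assumption.

```typescript
// Hypothetical UI state for the rule; field names are illustrative only.
interface SimilarToState {
  anchor: string;
  weights: { onset: number; nucleus: number; coda: number };
  threshold: number;
}

// Assumed shape for a default sort; the real defaultSort prop may differ.
type ListSort =
  | { key: "similarity"; direction: "desc" }
  | { key: "word"; direction: "asc" };

/** An empty (or whitespace-only, by assumption) anchor means the rule is inactive. */
function isSimilarityActive(state: SimilarToState): boolean {
  return state.anchor.trim().length > 0;
}

/** Similarity descending when the rule is active, otherwise word name ascending. */
function defaultSortFor(state: SimilarToState): ListSort {
  return isSimilarityActive(state)
    ? { key: "similarity", direction: "desc" }
    : { key: "word", direction: "asc" };
}
```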

5b. CV shape rule: categorical chip picker

New reusable component <CategoricalRule> (planned, generic for future categorical filters):

┌─ CV shape ────────────────────────────────────────┐
│ Common shapes:                                    │
│ [ V ][ CV ][ VC ][ CVC ● ][ CCV ][ CCVC ]         │
│ [ CVCC ][ CCVCC ][ CV-CV ][ CV-CVC ][ CCV-CV ]    │
│                                                   │
│ Custom: [ CV-CV-CVC     ]    [+ Add]              │
│                                                   │
│ Active: CVC, CV-CV  ✕                             │
└───────────────────────────────────────────────────┘

Multi-select chips (OR semantics within rule, AND with rest of query)
Common-shapes preset list covers apraxia progression + cluster work + cycles staples
Free-text "Custom" input + Add button — accepts any sequence matching ^[CV]+(-[CV]+)*$
"Active" line shows the current OR'd selection with individual remove chips

The <CategoricalRule> component is reusable: same pattern would serve a future POS chip rule.
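
The custom-input check can be sketched with the pattern quoted above. One point worth weighing during planning: the derivation in §4a emits exactly one V per syllable, so the permissive pattern also accepts inputs such as a CVCV syllable that no word can have, and a rule built from them would silently match nothing. The stricter pattern below is a suggestion and not part of the design; `isValidCustomShape` and `isReachableShape` are hypothetical names, and trimming the input is an assumption.

```typescript
// Pattern quoted from the design: runs of C/V letters separated by hyphens.
const CUSTOM_SHAPE_PATTERN = /^[CV]+(-[CV]+)*$/;

// Assumption (not in the design): require exactly one V per syllable, which is
// the only form the cv_shape derivation can produce.
const REACHABLE_SHAPE_PATTERN = /^C*VC*(-C*VC*)*$/;

/** Accepts anything the design's pattern accepts (after trimming whitespace). */
function isValidCustomShape(input: string): boolean {
  return CUSTOM_SHAPE_PATTERN.test(input.trim());
}

/** Accepts only shapes that some word could actually have. */
function isReachableShape(input: string): boolean {
  return REACHABLE_SHAPE_PATTERN.test(input.trim());
}
```

If the permissive pattern is kept, the UI could use the stricter check to warn that a shape will never match rather than rejecting it outright.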

6. Platform / API separation mechanism

Two-tier visibility for properties. Preserve existing surfaced semantics; add platform_visible.

// packages/web/workers/src/config/properties.ts
export interface PropertyDef {
  // ... existing fields including surfaced
  surfaced?: boolean;          // unchanged: false = D1-only, not in /api/property-metadata
  platform_visible?: boolean;  // NEW: true = shown in platform UI; default undefined (= API-only)
  kind?: 'numeric' | 'categorical';  // NEW: default 'numeric'; cv_shape is 'categorical'
}

Metadata route gains a query param:

GET /api/property-metadata                  → full surfaced set (researcher; current behavior)
GET /api/property-metadata?surface=platform → curated 14-property platform subset
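
A minimal sketch of how the route might read the new parameter, using the standard URL API. `MetadataSurface` and `resolveSurface` are hypothetical names, and falling back to the full set for any unrecognized value is an assumption chosen to keep today's behavior as the default.

```typescript
type MetadataSurface = "full" | "platform";

/** Read the optional `surface` query parameter from a request URL. */
function resolveSurface(requestUrl: string): MetadataSurface {
  const surface = new URL(requestUrl).searchParams.get("surface");
  // Assumption: anything other than "platform" keeps the current full-set behavior.
  return surface === "platform" ? "platform" : "full";
}
```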

getSurfacedCategories() stays as-is. Add:

export function getPlatformCategories(): PropertyCategory[] {
  // filter to surfaced && platform_visible === true; drop empty groups
}

Frontend usePropertyMetadata hook calls ?surface=platform by default. Mirror updates in packages/web/workers/scripts/config.py.

14 properties get platform_visible: true:
- Word Shape: syllable_count, phoneme_count, wcm_score, cv_shape
- Age Appropriateness: aoa, freq_age_2y, freq_age_5y, freq_age_8y, freq_age_12y, freq_age_adult
- Imagery & Familiarity: concreteness, familiarity
- Emotional Tone: valence, arousal

All other surfaced properties get no platform_visible flag (i.e., API-only by default).
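
The getPlatformCategories() stub in this section could be filled in along the following lines. The interfaces are trimmed-down stand-ins (the real PropertyDef and PropertyCategory carry more fields), all three names ending in `Sketch` are hypothetical, the function takes its categories as an argument for testability, and treating an omitted `surfaced` as true is an assumption to confirm against properties.ts.

```typescript
// Stand-in shapes for illustration; not the real definitions in properties.ts.
interface PropertyDefSketch {
  id: string;
  surfaced?: boolean;
  platform_visible?: boolean;
  kind?: "numeric" | "categorical";
}

interface PropertyCategorySketch {
  id: string;
  label: string;
  properties: PropertyDefSketch[];
}

/** Keep surfaced, platform-visible properties and drop groups left empty. */
function getPlatformCategoriesSketch(
  categories: PropertyCategorySketch[],
): PropertyCategorySketch[] {
  return categories
    .map((category) => ({
      ...category,
      properties: category.properties.filter(
        // Assumption: `surfaced` counts as true unless explicitly false.
        (prop) => prop.surfaced !== false && prop.platform_visible === true,
      ),
    }))
    .filter((category) => category.properties.length > 0);
}
```

Because the filter builds new category objects, the full researcher-facing list is left untouched and both surfaces can be served from the same source array.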

7. Sound similarity fold-in mechanism (C1: single combined endpoint)

POST /api/words/search gains an optional similar_to block:

interface WordSearchRequest {
  patterns?: Pattern[];
  exclude_phonemes?: string[];
  cv_shape?: string[];        // OR within array
  // ... existing min_/max_ filter params
  similar_to?: {
    word: string;
    weights: { onset: number; nucleus: number; coda: number };
    threshold: number;
    position: 'all' | 'initial' | 'final' | 'medial';
    syllable_count: number;
  };
}

Server flow when similar_to is present:
1. Run the existing similarity scan to produce a { word → similarity_score } map above threshold
2. Run the existing filter+pattern query against the words table
3. Intersect by word name
4. Sort intersection by similarity desc
5. Return top N with similarity field populated on each row

Server flow when similar_to is absent: unchanged from today.
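
Steps 3 to 5 of the similar_to flow can be sketched as a pure function over the two intermediate results. The row types and `intersectBySimilarity` are hypothetical; the alphabetical tie-break is an assumption added so the output is deterministic.

```typescript
// Illustrative row shapes; the real types live in types.ts.
interface FilteredWordRow {
  word: string;
}

interface ScoredWordRow extends FilteredWordRow {
  similarity: number;
}

/**
 * Intersect filter results with the similarity map, sort by similarity
 * descending, and keep the top `limit` rows.
 */
function intersectBySimilarity(
  filteredRows: FilteredWordRow[],
  scoresByWord: ReadonlyMap<string, number>, // already restricted to scores above threshold
  limit: number,
): ScoredWordRow[] {
  const scored: ScoredWordRow[] = [];
  for (const row of filteredRows) {
    const similarity = scoresByWord.get(row.word);
    if (similarity !== undefined) {
      scored.push({ ...row, similarity });
    }
  }
  // Assumption: ties are broken alphabetically.
  scored.sort((a, b) => b.similarity - a.similarity || a.word.localeCompare(b.word));
  return scored.slice(0, limit);
}
```

A Map is used instead of a plain object for the scores because real words such as "constructor" would otherwise collide with inherited object properties.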

Frontend apiClient updates: findSimilarWords retires; searchWords accepts the optional similar_to block. Existing direct /api/similarity/search route stays in place for backward compat (no UI uses it anymore once Builder migrates; can be removed in a v5.3 cleanup).

Implementation summary (deferred to plan)

Files touched (anticipated, for planning purposes only):

Data layer
- packages/data/src/phonolex_data/pipeline/words.py: add cv_shape derivation + freq_age_adult headline
- packages/data/src/phonolex_data/runtime/schema.py: register new columns

API layer
- packages/web/workers/src/config/properties.ts: add platform_visible, kind, cv_shape PropertyDef, freq_age_adult PropertyDef, getPlatformCategories() function
- packages/web/workers/scripts/config.py: mirror
- packages/web/workers/src/routes/meta.ts (or wherever property-metadata is served): ?surface=platform handling
- packages/web/workers/src/routes/words.ts: accept similar_to block in /api/words/search; intersect logic
- packages/web/workers/src/routes/similarity.ts: left in place for backward compat (slated for v5.3 removal)
- packages/web/workers/src/types.ts: update WordSearchRequest, Word types

Frontend
- packages/web/frontend/src/components/Builder.tsx: restructure 5 accordions; wire new components
- packages/web/frontend/src/components/shared/SimilarToRule.tsx: NEW; lifted from PhonologicalSimilarityTool.tsx
- packages/web/frontend/src/components/shared/CategoricalRule.tsx: NEW; reusable chip + custom-input picker
- packages/web/frontend/src/components/tools/PhonologicalSimilarityTool.tsx: DELETE
- packages/web/frontend/src/services/phonolexApi.ts (or wherever apiClient lives): update searchWords signature
- packages/web/frontend/src/hooks/usePropertyMetadata.ts: call ?surface=platform
- packages/web/frontend/src/App_new.tsx: update Word Lists description to remove the "rhyming and sound-similarity workflows" sub-mention (now part of the unified tool)

Tests
- packages/data/tests/: cv_shape derivation correctness across a sample of CMU words; freq_age_adult aggregation matches sibling pattern
- packages/web/workers/test/: ?surface=platform filtering; similar_to intersection logic with mocked similarity backend
- packages/web/frontend/src/test/: Builder.tsx restructure smoke; CategoricalRule + SimilarToRule unit

Data regeneration
- Regenerate data/runtime/words.parquet via uv run python packages/data/scripts/build_runtime_parquet.py
- Regenerate d1-seed.sql via packages/web/workers/scripts/export-to-d1.py
- Apply migration to local + staging + prod D1

Risks

Loss aversion from researchers who used the dropped properties: mitigated by API parity. The 30→14 cut is UI-only; nothing leaves the data layer. Frame the change as "platform got SLP-focused" not "we removed properties."
cv_shape interpretation edge cases — syllabic consonants (/n̩/ in "button"), affricates (/tʃ/ is one C or two?), diphthongs. The existing syllabifier already collapses diphthongs to single V (verified — they're in the VOWELS set at syllabification.py:39). Affricates are single phonemes in CMU/IPA mappings → single C. Syllabic consonants are rare in CMU dict outputs and would parse as schwa+C. Document the rules; not blocking.
D1 column count headroom — D1 hard limit is 100 cols per table (CLAUDE.md gotcha). words table currently has phonemes_str and a few string cols; adding cv_shape should fit comfortably. Verify before write.
PHON-117 retirement leakage — frontend may still be showing retired properties via stale metadata cache. The migration to ?surface=platform ensures only platform_visible: true properties render; retired properties (surfaced: false) were already gone but worth verifying in QA.
First categorical rule UI primitive — CategoricalRule is new infrastructure. Component design should be opinionated enough to be reusable (POS, future hypotheticals) without being a generic form-builder.

Open follow-ups (deferred; not part of this design)

POS top-level chip filter (NOUN/VERB/ADJ/ADV) — data is there but adding the surface exceeds "just curate." Worth filing.
Starter query presets ("articulation /s/-initial CVC age 5", "rhymes for X", "early intervention core vocab") — compelling UX moment; revisit after curation lands.
Researcher mode toggle for in-platform researchers — currently they use the API; if SBIR or downstream work surfaces demand, build a toggle.
/api/similarity/search removal — keep for backward compat through v5.3, then drop once no consumer remains.
Retired-property visibility audit: verify the frontend isn't still showing freq_cyplex_*, semd_vn, semd_h13 via stale state.

References

Current Builder: packages/web/frontend/src/components/Builder.tsx
Current property surface: packages/web/workers/src/config/properties.ts
Hidden similarity tool: packages/web/frontend/src/components/tools/PhonologicalSimilarityTool.tsx
Similarity route: packages/web/workers/src/routes/similarity.ts
Tool registry: packages/web/frontend/src/App_new.tsx
Reddit SLP corpus analysis: ~/Repos/speech-community-analysis/data/reports/codebook_v0.1.md, ~/Repos/speech-community-analysis/data/reports/phase1_memo.md
SBIR SLP survey draft: ~/Repos/speech-community-analysis/phonolex_slp_survey_v01.md
PHON-117 (Sound Similarity into Word Lists) — referenced in App_new.tsx:117 comment