Word Lists SLP Curation: Design Spec
Date: 2026-05-14
Status: Design — pending plan
Owner: Jared Neumann
Brainstorming session: 2026-05-14 conversation, branch feature/phon-116-naturalness-scorer (will branch fresh)
Problem
The current Word Lists tool (packages/web/frontend/src/components/Builder.tsx) exposes 30 surfaced filterable properties across 8 categories, three pattern types, and a phoneme exclusion list. It was built as a "power tool" for researcher-grade exploration, but the empirical SLP audience — confirmed via Reddit corpus analysis (~/Repos/speech-community-analysis) and the SBIR survey draft — uses a much narrower vocabulary of clinical concepts.
Two adjacent problems compound:
- Sound similarity is a separate tool (PhonologicalSimilarityTool.tsx, currently unwired from the tool registry per the App_new.tsx:117 comment "PHON-117: Sound Similarity is being consolidated into Word Lists"). It needs a home, and the natural one is as one more composable rule inside Word Lists.
- Syllable shape (CV pattern) appears repeatedly in SLP discussion (cluster reduction, cycles approach, apraxia progression, complexity approach: Reddit clusters 459/480/491/2611, 1921, plus SBIR Q10 row "phonological / phonotactic profile"). It is latent in the data (the syllabification module already produces Syllable objects with onset/nucleus/coda lists) but is never emitted as a queryable column.
Goals
- Curate the platform UI down to ~14 clinically relevant properties across 4 groups, with SLP-language labels.
- Fold sound similarity in as one composable rule alongside patterns, exclusions, and property bounds. All rules AND together.
- Surface syllable shape (CV pattern) as a new derived column + categorical rule UI primitive.
- Keep the API researcher-grade — separation of concerns. Researchers get the full property set; clinicians get the curated subset.
- No new datasets, no new metrics: pure curation + one derived column (cv_shape) + one aggregated headline (freq_age_adult) from existing raw band data.
Non-goals
- Building a "Researcher mode" UI toggle. The API is the researcher surface; we don't need an in-platform power-user mode in this pass.
- Designing higher-level query presets ("articulation /s/-initial CVC age 5"). Compelling UX moment but out of scope; revisit after curation lands.
- Touching Contrastive Sets, Sentences (governed generation), Text Analysis, or Lookup. Single-tool surgery.
- Adding a POS top-level filter chip. The data is there (v5.2 is NOUN/VERB/ADJ/ADV) but adding it as a new categorical filter exceeds the "just curate" constraint. Flag as a follow-up.
- Adding new pattern types beyond what the data layer already supports. (CONTAINS_MEDIAL added in PHON-110 may need a small frontend catch-up but doesn't earn its own design section.)
Empirical SLP signals driving the curation
From the Reddit SLP corpus (92,842 thread-context units, ~/Repos/speech-community-analysis/data/reports/codebook_v0.1.md) and the SBIR survey draft (~/Repos/speech-community-analysis/phonolex_slp_survey_v01.md):
- Cluster 412 (747 threads): Speech Therapy Materials and Resources — second-largest cluster overall; the explicit materials-prep pain point.
- Per-phoneme articulation clusters: 459 (/r/), 480 (/s/), 491 (velar), 428 (/l/), 458 (R challenges). Direct evidence that SLPs frame Word Lists work as "target this phoneme."
- Cluster 1921: phonological treatment approaches — minimal pairs, maximal opposition, cycles approach. (Most served by Contrastive Sets, but Word Lists feeds the input.)
- Cluster 2611: complexity approach — Gierut/Storkel target selection via WCM.
- Cluster 427 (387 threads): Clinical Goal Writing and Implementation — measurable goals feed measurable word lists.
- Clusters 386/387: Adult Cognitive Therapy + Adult Cog-Comm Materials, a separate audience requiring adult-band vocab norms.
- SBIR Q10 mapping:
  - "Cannot find enough items at the right phonological / phonotactic profile" → phoneme position + exclusion + shape + length
  - "Cannot find enough items at the right grade / vocabulary level" → AoA + developmental frequency by age band
  - "Materials don't avoid trauma-related content for sensitive students" → valence + arousal
  - "Not enough variety to maintain student engagement across sessions" → variety covered by composability + sound similarity for generalization probes
Design
1. Audience and segmentation
Platform UI is curated for three SLP segments:
- Pediatric articulation / phonology
- Pediatric language / literacy
- Adult aphasia / dysarthria
Researchers / grad students use the API. This is a clean separation of concerns: platform = clinical, API = researcher-grade.
2. Composition model
Word Lists is a list of rules that AND together. Any rule can be empty/skipped. Five rule types:
| Rule type | Existing? | Notes |
|---|---|---|
| Phoneme position pattern | yes | STARTS_WITH / ENDS_WITH / CONTAINS / CONTAINS_MEDIAL; multiple patterns AND |
| Exclude phonemes | yes | Single blacklist of IPA phonemes |
| Similar to anchor word | yes (separate tool) | Folded in; optional anchor + preset/weights + threshold |
| Property bounds | yes | Curated to 13 numeric properties (see §3) |
| Shape (CV pattern) | NO | New categorical rule against new cv_shape derived column |
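The AND-composition can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in rather than the actual Word Lists data model: the `Word` dataclass, its fields, and the rule-builder helpers are invented for illustration. The point it shows is that an empty rule contributes no predicate, so it cannot narrow the result.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for one row of the words table.
@dataclass(frozen=True)
class Word:
    text: str
    phonemes: tuple[str, ...]
    cv_shape: str

Rule = Callable[[Word], bool]

def starts_with(phoneme: Optional[str]) -> Optional[Rule]:
    """Phoneme position rule; returns None when the rule is empty."""
    if not phoneme:
        return None
    return lambda w: bool(w.phonemes) and w.phonemes[0] == phoneme

def exclude_phonemes(blacklist: set[str]) -> Optional[Rule]:
    """Exclusion rule; an empty blacklist means the rule is skipped."""
    if not blacklist:
        return None
    return lambda w: blacklist.isdisjoint(w.phonemes)

def shape_in(shapes: set[str]) -> Optional[Rule]:
    """Categorical rule: OR within the set, AND with every other rule."""
    if not shapes:
        return None
    return lambda w: w.cv_shape in shapes

def apply_rules(words: list[Word], rules: list[Optional[Rule]]) -> list[Word]:
    """Keep the words that satisfy every active (non-empty) rule."""
    active = [r for r in rules if r is not None]
    return [w for w in words if all(r(w) for r in active)]

WORDS = [
    Word("cat", ("k", "æ", "t"), "CVC"),
    Word("sun", ("s", "ʌ", "n"), "CVC"),
    Word("spring", ("s", "p", "ɹ", "ɪ", "ŋ"), "CCCVC"),
]

result = apply_rules(
    WORDS,
    [starts_with("s"), exclude_phonemes(set()), shape_in({"CVC", "CV"})],
)
print([w.text for w in result])  # ['sun']
```

With no active rules at all, `apply_rules` returns the full list, which matches the "any rule can be empty/skipped" behavior described above.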
3. Curated property surface
Final platform surface: 4 groups, 14 properties.
| Group label (proposed) | Current group | Properties | Why kept |
|---|---|---|---|
| Word Shape | Phonological Complexity | syllable_count, phoneme_count, wcm_score, cv_shape ← new | Complexity approach, apraxia progression, cluster work, cycles |
| Age Appropriateness | Lexical + Developmental Frequency (merged) | aoa, freq_age_2y, freq_age_5y, freq_age_8y, freq_age_12y, freq_age_adult ← new headline | Age-stratified vocab is THE bottleneck SLPs name |
| Imagery & Familiarity | Semantic Properties | concreteness, familiarity | Picture stim viability + "does the kid know it" |
| Emotional Tone | Affective Properties | valence, arousal | SBIR Q10 trauma-sensitive content; aphasia stim work |
Dropped from the platform UI (still accessible via API):
| Group | Dropped properties | Why dropped |
|---|---|---|
| Phonotactic Probability | phono_prob_avg, positional_prob_avg, neighborhood_density, str_phono_prob_avg, str_positional_prob_avg, str_neighborhood_density | Researcher precision (4 decimals on BPP/PSP); ND borderline-clinical but no Reddit/SBIR signal |
| Lexical (stragglers) | frequency (raw), contextual_diversity, pos_dominant_freq, log_frequency | Raw freq redundant with Zipf; CD duplicates freq; POS dominance is a researcher metric; log_frequency subsumed by freq_age_adult in the new Age group |
| Cognitive/Embodied | iconicity, boi, socialness, semd_topic, n_topics_for_word | Research norms; no clinical workflow signal |
| Morphological | morpheme_count, n_prefixes, n_suffixes | Niche literacy use; not in empirical signal |
Already retired (verify removal in this pass): freq_cyplex_*, semantic_diversity alias, semd_vn, semd_h13 (all retired in PHON-117); Lancaster sensorimotor (NO-GO 2026-05-12); ELP RT (PHON-71/75).
4. Two derived data additions
These are not new datasets, only strict aggregations / derivations of existing raw data:
4a. cv_shape: whole-word CV skeleton
Derived from existing Syllable objects (packages/data/src/phonolex_data/phonology/syllabification.py).
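# sketch for pipeline/words.py
# (import path inferred from the module location given above)
from phonolex_data.phonology.syllabification import Syllable

def compute_cv_shape(syllables: list[Syllable]) -> str:
    """Build the whole-word CV skeleton, one hyphen-separated part per syllable."""
    parts = []
    for syl in syllables:
        # One "C" per onset/coda consonant; the nucleus always counts as a single "V".
        parts.append("C" * len(syl.onset) + "V" + "C" * len(syl.coda))
    return "-".join(parts)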
# examples:
# "cat" /k.æ.t/ → "CVC"
# "spring" /s.p.ɹ.ɪ.ŋ/ → "CCCVC"
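# "kitten" /k.ɪ.t.ə.n/ → "CV-CVC" if the syllabifier maximizes onsets (/kɪ.tən/); "CVC-VC" only if /t/ is assigned to the first coda (verify against actual output)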
# "boat" /b.oʊ.t/ → "CVC" (diphthong = single V, already in VOWELS set)
- New column cv_shape: str on words.parquet (lives on the words table, not word_properties, because it's the first string-typed platform property).
- New PropertyDef in both packages/web/workers/src/config/properties.ts and packages/web/workers/scripts/config.py. Marked with a new kind: 'numeric' | 'categorical' field on PropertyDef; default 'numeric' for backwards compat; cv_shape is 'categorical'. platform_visible: true.
- API filter: cv_shape accepts either an exact string match or a comma-separated list (OR within the list).
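The filter semantics in the last bullet can be sketched as follows. This is an illustrative Python version, not the worker's actual implementation (which lives in TypeScript); the function names and the generated SQL fragment are assumptions made for the example.

```python
def parse_cv_shape_filter(raw: str) -> list[str]:
    """Split an exact value or a comma-separated list into individual shapes."""
    return [part.strip() for part in raw.split(",") if part.strip()]

def matches_cv_shape(word_shape: str, raw_filter: str) -> bool:
    """True when the word's shape equals any requested shape (OR within the list)."""
    return word_shape in parse_cv_shape_filter(raw_filter)

def cv_shape_sql(raw: str) -> tuple[str, list[str]]:
    """Build a parameterized WHERE fragment (hypothetical; for illustration)."""
    shapes = parse_cv_shape_filter(raw)
    if not shapes:
        # Empty filter: the rule is inactive and must not narrow the query.
        return "1 = 1", []
    placeholders = ", ".join("?" for _ in shapes)
    return f"cv_shape IN ({placeholders})", shapes

print(parse_cv_shape_filter("CVC"))             # ['CVC']
print(parse_cv_shape_filter("CVC, CV-CV"))      # ['CVC', 'CV-CV']
print(matches_cv_shape("CV-CV", "CVC,CV-CV"))   # True
print(cv_shape_sql("CVC,CV-CV"))                # ('cv_shape IN (?, ?)', ['CVC', 'CV-CV'])
```

A single exact value is just the one-element case of the list, so both accepted forms go through the same code path.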
4b. freq_age_adult: adult developmental-frequency headline
Mean of existing wpm_b4 + wpm_b5 raw band cols (FineWeb-Edu grade-banded; b4/b5 are the high-school/college tail). Mirrors the existing 4 headline aggregations in DEV_FREQ_HEADLINES.
# pseudocode in pipeline/words.py, alongside the existing freq_age_2y/5y/8y/12y computations
def compute_freq_age_adult(row) -> float:
    return mean_missing_as_zero(row.wpm_b4, row.wpm_b5)
- New entry in DEV_FREQ_HEADLINES array (both properties.ts and config.py).
- Scale 0–50000 wpm, use_log_scale: true, mirrors siblings. platform_visible: true. log_frequency drops out of the platform surface (still in the API).
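The pseudocode above calls mean_missing_as_zero without defining it. The sketch below shows one plausible reading of that name, under the assumption that missing band values (None or NaN) count as 0.0 and the divisor stays the number of bands. The real helper in the pipeline may behave differently, for example by skipping missing values, so this is worth checking against the sibling headline computations.

```python
import math
from typing import Optional

def mean_missing_as_zero(*values: Optional[float]) -> float:
    """Mean over all bands, counting None/NaN as 0.0 (assumed semantics)."""
    if not values:
        return 0.0
    cleaned = [0.0 if v is None or math.isnan(v) else float(v) for v in values]
    # The divisor is the band count, so a missing band pulls the mean down.
    return sum(cleaned) / len(cleaned)

print(mean_missing_as_zero(12.0, 8.0))            # 10.0
print(mean_missing_as_zero(12.0, None))           # 6.0
print(mean_missing_as_zero(None, float("nan")))   # 0.0
```

Under this reading, a word attested only in one of the two adult bands gets half that band's wpm rather than the full value.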
5. UI surface
Builder.tsx restructures into 5 accordion sections (preserves the existing <Accordion> pattern; reuses PropertySlider, PhonemePickerDialog, WordListTable):
┌──────────────────────────────────────────────────────────────┐
│ ▼ Phoneme rules (default open) │
│ • Pattern matching (existing, +CONTAINS_MEDIAL catch-up) │
│ • Exclude phonemes (existing) │
│ • Similar to ____ anchor word (new fold-in) │
├──────────────────────────────────────────────────────────────┤
│ ▼ Word Shape (default open) │
│ • syllable_count slider │
│ • phoneme_count slider │
│ • wcm_score slider │
│ • CV shape chip picker (new categorical rule) │
├──────────────────────────────────────────────────────────────┤
│ ▶ Age Appropriateness (6 sliders, collapsed) │
├──────────────────────────────────────────────────────────────┤
│ ▶ Imagery & Familiarity (2 sliders, collapsed) │
├──────────────────────────────────────────────────────────────┤
│ ▶ Emotional Tone (2 sliders, collapsed) │
└──────────────────────────────────────────────────────────────┘
[ Build list ]
5a. "Similar to" rule: lift from PhonologicalSimilarityTool
Located inside the Phoneme rules accordion as one more composable rule. It does not get its own section: the rule is meant to be equally composable with the others, and section parity would visually elevate it.
┌─ Similar to ──────────────────────────────────────┐
│ Anchor word: [snake ] │
│ │
│ Preset: │
│ [ Rhymes ● ][ Alliteration ][ Assonance ] │
│ [ Consonance ][ Balanced ] │
│ │
│ Match strength: [ High (0.85) ▼ ] │
│ │
│ ▶ Advanced (component weights + position) │ ← collapsed disclosure
└───────────────────────────────────────────────────┘
Lift directly from PhonologicalSimilarityTool.tsx:
- PRESETS array (Rhymes / Balanced / Alliteration / Assonance / Consonance) — lines 47–83
- Labeled-bucket threshold select (Very High / High / Medium / Low / Very Low) — lines 292–305
- Position + syllableCount coupled selects with disable-on-all-or-medial logic — lines 176–214
Behavior:
- Empty anchor → rule inactive (no similarity backend call, no result intersection)
- Active anchor → backend hit; results AND-intersected with property+pattern+exclude results
- Result ordering: when similarity rule is active, sort defaults to similarity desc; otherwise word-name asc (existing WordListTable defaultSort prop)
5b. CV shape rule: categorical chip picker
New reusable component <CategoricalRule> (planned, generic for future categorical filters):
┌─ CV shape ────────────────────────────────────────┐
│ Common shapes: │
│ [ V ][ CV ][ VC ][ CVC ● ][ CCV ][ CCVC ] │
│ [ CVCC ][ CCVCC ][ CV-CV ][ CV-CVC ][ CCV-CV ] │
│ │
│ Custom: [ CVCV-CVC ] [+ Add] │
│ │
│ Active: CVC, CV-CV ✕ │
└───────────────────────────────────────────────────┘
- Multi-select chips (OR semantics within rule, AND with rest of query)
- Common-shapes preset list covers apraxia progression + cluster work + cycles staples
- Free-text "Custom" input + Add button: accepts any sequence matching ^[CV]+(-[CV]+)*$
- "Active" line shows the current OR'd selection with individual remove chips
The <CategoricalRule> component is reusable: same pattern would serve a future POS chip rule.
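The Custom input validation can be sketched in Python as below. The first pattern is the one quoted in the list above; the stricter pattern and both function names are hypothetical additions for illustration, not part of the design.

```python
import re

# Pattern quoted from the Custom input rule.
CV_SHAPE_RE = re.compile(r"^[CV]+(-[CV]+)*$")

# Hypothetical stricter variant: exactly one V per syllable, which is the only
# form the cv_shape derivation in section 4a can emit.
ONE_NUCLEUS_RE = re.compile(r"C*VC*(-C*VC*)*")

def is_valid_custom_shape(text: str) -> bool:
    """Validate free-text input against the quoted pattern."""
    return CV_SHAPE_RE.fullmatch(text) is not None

def is_derivable_shape(text: str) -> bool:
    """Validate against the stricter one-nucleus-per-syllable variant."""
    return ONE_NUCLEUS_RE.fullmatch(text) is not None

for candidate in ["CVC", "CV-CVC", "CVCV-CVC", "CC", "CV-", "cvc", "CV--CV", ""]:
    print(repr(candidate), is_valid_custom_shape(candidate), is_derivable_shape(candidate))
```

One thing the sketch surfaces: the quoted pattern accepts inputs such as "CC" or "CVCV-CVC" that a one-V-per-syllable derivation would never produce, so those selections would simply match no words. Whether to tighten the pattern or leave it permissive is a design choice left to the plan.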
6. Platform / API separation mechanism
Two-tier visibility for properties. Preserve existing surfaced semantics; add platform_visible.
// packages/web/workers/src/config/properties.ts
export interface PropertyDef {
  // ... existing fields including surfaced
  surfaced?: boolean;          // unchanged: false = D1-only, not in /api/property-metadata
  platform_visible?: boolean;  // NEW: true = shown in platform UI; default undefined (= API-only)
  kind?: 'numeric' | 'categorical';  // NEW: default 'numeric'; cv_shape is 'categorical'
}
Metadata route gains a query param:
GET /api/property-metadata → full surfaced set (researcher; current behavior)
GET /api/property-metadata?surface=platform → curated 14-property platform subset
getSurfacedCategories() stays as-is. Add:
export function getPlatformCategories(): PropertyCategory[] {
  // filter to surfaced && platform_visible === true; drop empty groups
}
Frontend usePropertyMetadata hook calls ?surface=platform by default. Mirror updates in packages/web/workers/scripts/config.py.
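For the config.py mirror, the filter could look roughly like the Python sketch below. The dataclass shapes are assumptions made for the example (the real PropertyDef and PropertyCategory carry more fields), and get_platform_categories is a hypothetical Python-side name for the getPlatformCategories() function described above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PropertyDef:
    id: str
    surfaced: bool = True                    # False = D1-only
    platform_visible: Optional[bool] = None  # None = API-only (the default)
    kind: str = "numeric"                    # or "categorical"

@dataclass
class PropertyCategory:
    label: str
    properties: list[PropertyDef] = field(default_factory=list)

def get_platform_categories(categories: list[PropertyCategory]) -> list[PropertyCategory]:
    """Keep surfaced, platform-visible properties and drop groups left empty."""
    result = []
    for cat in categories:
        kept = [p for p in cat.properties if p.surfaced and p.platform_visible is True]
        if kept:
            result.append(PropertyCategory(cat.label, kept))
    return result

CATEGORIES = [
    PropertyCategory("Word Shape", [
        PropertyDef("syllable_count", platform_visible=True),
        PropertyDef("cv_shape", platform_visible=True, kind="categorical"),
    ]),
    PropertyCategory("Phonotactic Probability", [
        PropertyDef("phono_prob_avg"),           # no flag: API-only
    ]),
    PropertyCategory("Semantic Properties", [
        PropertyDef("concreteness", platform_visible=True),
        PropertyDef("semd_vn", surfaced=False),  # retired: never shown
    ]),
]

platform = get_platform_categories(CATEGORIES)
print([(c.label, [p.id for p in c.properties]) for c in platform])
# [('Word Shape', ['syllable_count', 'cv_shape']), ('Semantic Properties', ['concreteness'])]
```

Checking `platform_visible is True` rather than truthiness keeps the "undefined means API-only" default explicit, and the `surfaced` check means a retired property cannot reappear even if it is flagged by mistake.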
14 properties get platform_visible: true:
- Word Shape: syllable_count, phoneme_count, wcm_score, cv_shape
- Age Appropriateness: aoa, freq_age_2y, freq_age_5y, freq_age_8y, freq_age_12y, freq_age_adult
- Imagery & Familiarity: concreteness, familiarity
- Emotional Tone: valence, arousal
All other surfaced properties get no platform_visible flag (i.e., API-only by default).
7. Sound similarity fold-in mechanism (C1: single combined endpoint)
POST /api/words/search gains an optional similar_to block:
interface WordSearchRequest {
  patterns?: Pattern[];
  exclude_phonemes?: string[];
  cv_shape?: string[];  // OR within array
  // ... existing min_/max_ filter params
  similar_to?: {
    word: string;
    weights: { onset: number; nucleus: number; coda: number };
    threshold: number;
    position: 'all' | 'initial' | 'final' | 'medial';
    syllable_count: number;
  };
}
Server flow when similar_to is present:
1. Run the existing similarity scan to produce a { word → similarity_score } map above threshold
2. Run the existing filter+pattern query against words table
3. Intersect by word name
4. Sort intersection by similarity desc
5. Return top N with similarity field populated on each row
Server flow when similar_to is absent:
- Unchanged from today
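The two server flows can be sketched as a single function. This is an illustrative Python version only (the route itself is in the TypeScript worker); the function name, the row shape, the scores, and the tie-break on word name are all assumptions made for the example.

```python
from typing import Optional

def search_words(
    filtered_rows: list[dict],
    similarity_scores: Optional[dict[str, float]] = None,
    limit: int = 50,
) -> list[dict]:
    """Combine the filter+pattern result with an optional similarity map.

    filtered_rows: output of the existing filter+pattern query (step 2).
    similarity_scores: word -> score map already cut at the threshold (step 1),
        or None when the request has no similar_to block.
    """
    if similarity_scores is None:
        # similar_to absent: behavior unchanged.
        return filtered_rows[:limit]
    # Step 3: intersect by word name, attaching the score to each surviving row.
    hits = [
        {**row, "similarity": similarity_scores[row["word"]]}
        for row in filtered_rows
        if row["word"] in similarity_scores
    ]
    # Step 4: similarity descending; word name breaks ties (an assumption).
    hits.sort(key=lambda r: (-r["similarity"], r["word"]))
    # Step 5: top N.
    return hits[:limit]

FILTERED = [{"word": "cake"}, {"word": "lake"}, {"word": "snail"}]
SCORES = {"rake": 0.95, "lake": 0.91, "cake": 0.88}  # made-up scores for anchor "snake"

ranked = search_words(FILTERED, SCORES)
print([(r["word"], r["similarity"]) for r in ranked])  # [('lake', 0.91), ('cake', 0.88)]
```

In the example, "rake" scores highest but is dropped because it did not pass the property and pattern filters, and "snail" is dropped because it fell below the similarity threshold. That is the AND-intersection described in the steps above.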
Frontend apiClient updates: findSimilarWords retires; searchWords accepts the optional similar_to block. Existing direct /api/similarity/search route stays in place for backward compat (no UI uses it anymore once Builder migrates; can be removed in a v5.3 cleanup).
Implementation summary (deferred to plan)
Files touched (anticipated, for planning purposes only):
Data layer
- packages/data/src/phonolex_data/pipeline/words.py — add cv_shape derivation + freq_age_adult headline
- packages/data/src/phonolex_data/runtime/schema.py — register new columns
API layer
- packages/web/workers/src/config/properties.ts — add platform_visible, kind, cv_shape PropertyDef, freq_age_adult PropertyDef, getPlatformCategories() function
- packages/web/workers/scripts/config.py — mirror
- packages/web/workers/src/routes/meta.ts (or wherever property-metadata is served) — ?surface=platform handling
- packages/web/workers/src/routes/words.ts — accept similar_to block in /api/words/search; intersect logic
- packages/web/workers/src/routes/similarity.ts — left in place for backward compat (slated for v5.3 removal)
- packages/web/workers/src/types.ts — update WordSearchRequest, Word types
Frontend
- packages/web/frontend/src/components/Builder.tsx — restructure 5 accordions; wire new components
- packages/web/frontend/src/components/shared/SimilarToRule.tsx — NEW; lifted from PhonologicalSimilarityTool.tsx
- packages/web/frontend/src/components/shared/CategoricalRule.tsx — NEW; reusable chip + custom-input picker
- packages/web/frontend/src/components/tools/PhonologicalSimilarityTool.tsx — DELETE
- packages/web/frontend/src/services/phonolexApi.ts (or wherever apiClient lives) — update searchWords signature
- packages/web/frontend/src/hooks/usePropertyMetadata.ts — call ?surface=platform
- packages/web/frontend/src/App_new.tsx — update Word Lists description to remove the "rhyming and sound-similarity workflows" sub-mention (now part of the unified tool)
Tests
- packages/data/tests/ — cv_shape derivation correctness across a sample of CMU words; freq_age_adult aggregation matches sibling pattern
- packages/web/workers/test/ — ?surface=platform filtering; similar_to intersection logic with mocked similarity backend
- packages/web/frontend/src/test/ — Builder.tsx restructure smoke; CategoricalRule + SimilarToRule unit
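A sketch of what the cv_shape derivation checks could look like. The Syllable dataclass here is a stand-in with the onset/nucleus/coda lists described in section 4a, not the real class from the syllabification module, and the syllables are built by hand rather than produced by the syllabifier, so these checks cover the derivation only.

```python
from dataclasses import dataclass, field

@dataclass
class Syllable:  # stand-in for the real Syllable class
    onset: list[str] = field(default_factory=list)
    nucleus: list[str] = field(default_factory=list)
    coda: list[str] = field(default_factory=list)

def compute_cv_shape(syllables: list[Syllable]) -> str:
    """Same derivation as section 4a, repeated so the sketch is self-contained."""
    return "-".join(
        "C" * len(s.onset) + "V" + "C" * len(s.coda) for s in syllables
    )

def test_monosyllables():
    assert compute_cv_shape([Syllable(["k"], ["æ"], ["t"])]) == "CVC"
    assert compute_cv_shape([Syllable(["s", "p", "ɹ"], ["ɪ"], ["ŋ"])]) == "CCCVC"
    # A diphthong is one nucleus entry, so it still counts as a single V.
    assert compute_cv_shape([Syllable(["b"], ["oʊ"], ["t"])]) == "CVC"

def test_multisyllable_and_edges():
    baby = [Syllable(["b"], ["eɪ"]), Syllable(["b"], ["i"])]
    assert compute_cv_shape(baby) == "CV-CV"
    assert compute_cv_shape([Syllable(nucleus=["aɪ"])]) == "V"
    assert compute_cv_shape([]) == ""

test_monosyllables()
test_multisyllable_and_edges()
print("cv_shape checks passed")
```

Checks against real CMU words would additionally need the syllabifier's output as input, which is where the edge cases listed under Risks would show up.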
Data regeneration
- Regenerate data/runtime/words.parquet via uv run python packages/data/scripts/build_runtime_parquet.py
- Regenerate d1-seed.sql via packages/web/workers/scripts/export-to-d1.py
- Apply migration to local + staging + prod D1
Risks
- Loss aversion from researchers who used the dropped properties: mitigated by API parity. The 30→14 cut is UI-only; nothing leaves the data layer. Frame the change as "platform got SLP-focused" not "we removed properties."
- cv_shape interpretation edge cases: syllabic consonants (/n̩/ in "button"), affricates (is /tʃ/ one C or two?), diphthongs. The existing syllabifier already collapses diphthongs to single V (verified: they're in the VOWELS set at syllabification.py:39). Affricates are single phonemes in CMU/IPA mappings → single C. Syllabic consonants are rare in CMU dict outputs and would parse as schwa+C. Document the rules; not blocking.
- D1 column count headroom: D1 hard limit is 100 cols per table (CLAUDE.md gotcha). The words table currently has phonemes_str and a few string cols; adding cv_shape should fit comfortably. Verify before write.
- PHON-117 retirement leakage: frontend may still be showing retired properties via stale metadata cache. The migration to ?surface=platform ensures only platform_visible: true properties render; retired properties (surfaced: false) were already gone but worth verifying in QA.
- First categorical rule UI primitive: CategoricalRule is new infrastructure. Component design should be opinionated enough to be reusable (POS, future hypotheticals) without being a generic form-builder.
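The "verify before write" step for column headroom could be done with a quick check like the one below. It is a sketch using Python's stdlib sqlite3 against an in-memory table with a made-up three-column schema; the real check would run against a local copy of the D1 database, and the helper name is hypothetical.

```python
import sqlite3

D1_MAX_COLUMNS = 100  # per-table limit quoted in the risk above

def column_headroom(conn: sqlite3.Connection, table: str) -> int:
    """Return how many more columns the table can take before hitting the limit."""
    # Table name is interpolated directly; fine for a trusted local check only.
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return D1_MAX_COLUMNS - len(cols)

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema; the real words table has many more columns.
conn.execute(
    "CREATE TABLE words (word TEXT PRIMARY KEY, phonemes_str TEXT, syllable_count INTEGER)"
)
before = column_headroom(conn, "words")
conn.execute("ALTER TABLE words ADD COLUMN cv_shape TEXT")
after = column_headroom(conn, "words")
print(before, after)  # 97 96
conn.close()
```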
Open follow-ups (deferred, not part of this design)
- POS top-level chip filter (NOUN/VERB/ADJ/ADV) — data is there but adding the surface exceeds "just curate." Worth filing.
- Starter query presets ("articulation /s/-initial CVC age 5", "rhymes for X", "early intervention core vocab") — compelling UX moment; revisit after curation lands.
- Researcher mode toggle for in-platform researchers — currently they use the API; if SBIR or downstream work surfaces demand, build a toggle.
- /api/similarity/search removal: keep for backward compat through v5.3, then drop once no consumer remains.
- Retired-property visibility audit: verify frontend isn't still showing freq_cyplex_*, semd_vn, semd_h13 via stale state.
References
- Current Builder: packages/web/frontend/src/components/Builder.tsx
- Current property surface: packages/web/workers/src/config/properties.ts
- Hidden similarity tool: packages/web/frontend/src/components/tools/PhonologicalSimilarityTool.tsx
- Similarity route: packages/web/workers/src/routes/similarity.ts
- Tool registry: packages/web/frontend/src/App_new.tsx
- Reddit SLP corpus analysis: ~/Repos/speech-community-analysis/data/reports/codebook_v0.1.md, ~/Repos/speech-community-analysis/data/reports/phase1_memo.md
- SBIR SLP survey draft: ~/Repos/speech-community-analysis/phonolex_slp_survey_v01.md
- PHON-117 (Sound Similarity into Word Lists), referenced in App_new.tsx:117 comment