Integrated Lexical Database Pipeline Design¶
2026-03-13 — Replace pickle-based pipeline with direct data assembly from raw datasets
Goal¶
Build the integrated lexical database directly from 25 raw source datasets through phonolex_data loaders and phonology utils, eliminating the cognitive_graph_v1.1_empirical.pkl intermediate artifact. The pipeline produces a LexicalDatabase object that any consumer (D1 export, governor lookup builder, future tools) can use.
Context¶
The monorepo migration created packages/data/ with loaders for 18 datasets and phonology utilities (syllabification, WCM, normalization). The current export-to-d1.py reads a pre-built pickle for all word data and edges. That pickle was built by scripts no longer in the codebase, and 4 of 7 edge source datasets had no loaders.
We've now acquired 8 additional datasets (MEN, WordSim-353, SPP, ECCC, MorphoLex, Prevalence, CYP-LEX, IPhOD) and need a pipeline that builds everything from source.
Approach¶
Focused pipeline modules in packages/data/src/phonolex_data/pipeline/, each with a single responsibility. A thin orchestrator assembles the full database. Consumers apply their own downstream logic (SQL export, tokenizer-based lookup building, API-level filtering).
No vocabulary filtering in the pipeline or database. The word universe is the union of all source datasets — any word appearing in CMU dict, SUBTLEX, prevalence, or any other dataset gets a record. Words with CMU entries have full phonological data; words without CMU entries have null phonological fields but carry whatever norms/frequency data their source datasets provide. Filtering is a query-time / consumer concern.
1. Pipeline Modules¶
Module: schema.py — Data Contract¶
Shared types consumed by all pipeline stages and downstream consumers.
from dataclasses import dataclass

@dataclass
class WordRecord:
    word: str
    has_phonology: bool  # True for CMU dict words, False for norm-only words

    # Phonological fields — populated for CMU dict words, None/empty for norm-only words
    ipa: str | None
    phonemes: list[str]  # empty list for non-CMU words
    phoneme_count: int | None
    syllables: list[dict]  # [{onset, nucleus, coda, stress}], empty for non-CMU
    syllable_count: int | None
    initial_phoneme: str | None
    final_phoneme: str | None
    wcm_score: int | None

    # Norms — all optional, None means no data
    frequency: float | None  # SUBTLEX
    log_frequency: float | None
    contextual_diversity: float | None
    prevalence: float | None  # Brysbaert (NEW)
    aoa: float | None  # Glasgow AoA
    aoa_kuperman: float | None
    imageability: float | None
    familiarity: float | None
    concreteness: float | None
    size: float | None
    valence: float | None
    arousal: float | None
    dominance: float | None
    iconicity: float | None
    boi: float | None
    socialness: float | None
    auditory: float | None
    visual: float | None
    haptic: float | None
    gustatory: float | None
    olfactory: float | None
    interoceptive: float | None
    hand_arm: float | None
    foot_leg: float | None
    head: float | None
    mouth: float | None
    torso: float | None
    elp_lexical_decision_rt: float | None
    semantic_diversity: float | None

    # Morphology (NEW — MorphoLex)
    morpheme_count: int | None
    is_monomorphemic: bool | None
    n_prefixes: int | None
    n_suffixes: int | None
    morphological_segmentation: str | None  # e.g., "un|break|able"

    # Phonotactic probability (IPhOD replaces Vitevitch & Luce JSON loader)
    neighborhood_density: int | None  # unsDENS from IPhOD
    phono_prob_avg: float | None  # unsBPAV (unstressed biphone average)
    positional_prob_avg: float | None  # unsPOSPAV (unstressed positional segment avg)

    # IPhOD stressed variants
    str_phono_prob_avg: float | None  # strBPAV (stressed biphone average)
    str_positional_prob_avg: float | None  # strPOSPAV (stressed positional segment avg)
    str_neighborhood_density: int | None  # strDENS

    # Child frequency (NEW — CYP-LEX)
    freq_cyplex_7_9: float | None  # Zipf frequency, age 7-9
    freq_cyplex_10_12: float | None  # Zipf frequency, age 10-12
    freq_cyplex_13: float | None  # Zipf frequency, age 13+

    # Vocab memberships
    vocab_memberships: set[str]

@dataclass
class EdgeRecord:
    source: str
    target: str
    edge_sources: list[str]  # ["SWOW", "USF", ...]
    swow_strength: float | None
    usf_forward: float | None
    usf_backward: float | None
    men_relatedness: float | None
    simlex_similarity: float | None
    simlex_pos: str | None  # POS from SimLex-999; requires updating load_simlex()
    wordsim_relatedness: float | None

    # SPP — priming effects (RT differences) between related and unrelated conditions
    spp_first_priming: float | None  # first_priming_overall (1st associate)
    spp_other_priming: float | None  # other_priming_overall (other associate)
    spp_fas: float | None  # firstassoc_fas (forward association strength)
    spp_lsa: float | None  # firstassoc_lsa (latent semantic analysis similarity)

    # ECCC — aggregated speech-in-noise confusion data
    eccc_consistency: float | None  # proportion of listeners giving same confusion response
    eccc_n_instances: int | None  # number of confusion instances for this pair
    eccc_phoneme_distance: float | None

@dataclass
class DerivedData:
    percentiles: dict[str, dict[str, float | None]]  # word → {prop_percentile: val}
    minimal_pairs: list[tuple]  # (w1, w2, p1, p2, pos, pos_type)
    phoneme_data: dict[str, dict]  # ipa → {type, features}
    phoneme_norms: dict[str, float]  # ipa → norm_sq
    phoneme_dots: list[tuple]  # (ipa1, ipa2, dot)
    components: list[dict]  # [{id, type, phonemes}]
    word_syllable_data: dict  # word → [{onset_key, nucleus_key, coda_key}]
    component_key_to_id: dict

@dataclass
class LexicalDatabase:
    words: dict[str, WordRecord]
    edges: list[EdgeRecord]
    derived: DerivedData
    phoible_vectors: dict  # Raw PHOIBLE data for dot products
Module: words.py — Assemble Word Records¶
build_words() → dict[str, WordRecord]
Steps:
1. Load CMU dict via cmudict_to_phono() — base IPA + phoneme list → seed initial word records with full phonological data
2. Syllabify each word via phonolex_data.phonology.syllabification
3. Compute WCM via phonolex_data.phonology.wcm
4. Normalize IPA via phonolex_data.phonology.normalize
5. Load and merge all norm datasets:
- Existing (11): Warriner, Kuperman, Glasgow, Concreteness, Sensorimotor, Semantic Diversity, Socialness, BOI, SUBTLEX, ELP, Iconicity
- New (4): Prevalence, MorphoLex, CYP-LEX, IPhOD
- For each norm dataset, if a word doesn't exist yet in the word dict, create a new WordRecord with null phonological fields and populate only the norm data
6. Load vocab list memberships (Ogden, AFINN, stop words, Swadesh, Roget, GSL, AVL)
7. Return dict[str, WordRecord] — union of all source datasets, unfiltered
Each norm loader returns dict[str, dict[str, Any]] (word → property dict). Merging is a simple update loop: for each word in every dataset, get or create a WordRecord and merge in the properties.
New loaders needed:
- load_prevalence() — reads data/norms/prevalence/English_Word_Prevalences.xlsx
- load_morpholex() — reads data/norms/morpholex/MorphoLEX_en.xlsx, returns segmentation + counts
- load_cyplex() — reads data/norms/cyplex/CYPLEX_all_age_bands.csv (combined file with all 3 bands); maps CYPLEX79_log → freq_cyplex_7_9, CYPLEX1012_log → freq_cyplex_10_12, CYPLEX13_log → freq_cyplex_13
- load_iphod() — reads data/norms/iphod/IPhOD2_Words.txt (tab-delimited), returns density + phonotactic measures
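A sketch of load_iphod() under the IPhOD column names quoted in the schema comments (unsDENS, unsBPAV, unsPOSPAV, strBPAV, strPOSPAV, strDENS); the "Word" header name and the lowercase normalization are assumptions beyond what this doc states:

```python
import csv
import io

# IPhOD column -> (WordRecord field, type); column names from the schema comments
IPHOD_COLUMNS = {
    "unsDENS": ("neighborhood_density", int),
    "unsBPAV": ("phono_prob_avg", float),
    "unsPOSPAV": ("positional_prob_avg", float),
    "strBPAV": ("str_phono_prob_avg", float),
    "strPOSPAV": ("str_positional_prob_avg", float),
    "strDENS": ("str_neighborhood_density", int),
}

def load_iphod(fileobj) -> dict[str, dict]:
    """Read the tab-delimited IPhOD word file into word -> property dict."""
    out: dict[str, dict] = {}
    for row in csv.DictReader(fileobj, delimiter="\t"):
        word = row["Word"].lower()  # "Word" header is an assumption
        out[word] = {field: cast(row[col]) for col, (field, cast) in IPHOD_COLUMNS.items()}
    return out

# Tiny inline sample standing in for data/norms/iphod/IPhOD2_Words.txt
sample = (
    "Word\tunsDENS\tunsBPAV\tunsPOSPAV\tstrBPAV\tstrPOSPAV\tstrDENS\n"
    "cat\t32\t0.0041\t0.112\t0.0038\t0.104\t30\n"
)
norms = load_iphod(io.StringIO(sample))
```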
Module: edges.py — Assemble Edge Records¶
build_edges(words: dict[str, WordRecord]) → list[EdgeRecord]
Steps:
1. Load association datasets via loaders:
- load_swow() → {cue: {response: strength}}
- load_free_association() → {cue: {target: forward_strength}}
- load_simlex() → [(w1, w2, similarity, pos)] (update existing to also return POS column)
- load_men() → [(w1, w2, relatedness)] (NEW)
- load_wordsim() → [(w1, w2, relatedness)] (NEW)
- load_spp() → [(target, prime, first_priming, other_priming, fas, lsa)] (NEW — from spp_ldt_item_analysis.xlsx)
- load_eccc() → [(target, confusion, consistency, n_instances, phoneme_distance)] (NEW — aggregated per pair from confusionCorpus_v1.2.csv)
2. Build edge index: dict[tuple[str, str], EdgeRecord] keyed by sorted word pair
3. For each dataset, iterate pairs and merge into the index:
- If pair exists: add the new attributes, append to edge_sources
- If new: create EdgeRecord with this dataset's attributes, others as None
4. Return list[EdgeRecord]
The words argument is passed so edges can be filtered to only include words that exist in the lexicon (avoiding dangling references).
Module: derived.py — Compute Derived Data¶
build_derived(words, phoible_vectors) → DerivedData
Same logic as the current export-to-d1.py sections 3-8. Apart from percentiles, all derived computations operate only on words where has_phonology=True — norm-only words are skipped since they have no phonemes, syllables, or IPA.
- Percentiles: bisect_right(sorted_vals, val) / N * 100 for all percentile-eligible properties (computed over all words that have each property, regardless of has_phonology)
- Minimal pairs: group by phoneme count, compare same-length words for single-phoneme differences (phonology-only words)
- Phoneme data: extract from PHOIBLE vectors, classify vowel/consonant, decode features
- Phoneme dot products: pairwise dot products of 76-dimensional feature vectors
- Syllable components: extract unique onset/nucleus/coda tuples, assign IDs, build word-syllable mappings (phonology-only words)
Module: __init__.py — Orchestrator¶
def build_lexical_database() -> LexicalDatabase:
    """Build the complete integrated lexical database from raw datasets."""
    phoible = load_phoible()
    words = build_words()
    edges = build_edges(words)
    derived = build_derived(words, phoible)
    return LexicalDatabase(words=words, edges=edges, derived=derived, phoible_vectors=phoible)
2. New Loaders¶
Added to packages/data/src/phonolex_data/loaders/:
norms.py (extend existing)¶
- load_prevalence() → dict[str, dict] — word → {prevalence: float}. Note: reads the Pknown column (proportion 0-1), NOT the Prevalence column (which is log-scale)
- load_iphod() → dict[str, dict] — word → {neighborhood_density, phono_prob_avg, positional_prob_avg, str_phono_prob_avg, str_positional_prob_avg, str_neighborhood_density}. Replaces the existing load_phonotactic_probability() from phoible.py (Vitevitch & Luce JSON); that loader should be deprecated.
associations.py (extend existing)¶
- load_simlex() — update return type from list[tuple[str, str, float]] to list[tuple[str, str, float, str]] to include the POS column (needed for EdgeRecord.simlex_pos)
- load_men() → list[tuple[str, str, float]] — (word1, word2, relatedness)
- load_wordsim() → list[tuple[str, str, float]] — (word1, word2, relatedness)
- load_spp() → list[tuple] — (target, prime, first_priming, other_priming, fas, lsa). Reads spp_ldt_item_analysis.xlsx; columns: target, prime_1st_assoc, first_priming_overall, other_priming_overall, firstassoc_fas, firstassoc_lsa
- load_eccc() → list[tuple] — (target, confusion, consistency, n_instances, phoneme_distance). Aggregated per (target, confusion) pair from confusionCorpus_v1.2.csv; the Consistency column is a raw listener count (e.g. 9), NOT a proportion — divide by N-Listeners to get a proportion (0-1)
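A hedged sketch of the load_eccc() aggregation: the Consistency / N-Listeners division is as stated above, but the Target and Response header names and the aggregation function (a mean of per-row proportions) are assumptions, and phoneme_distance is omitted for brevity:

```python
import csv
import io
from collections import defaultdict

def load_eccc(fileobj) -> list[tuple]:
    """Aggregate per (target, confusion) pair: mean consistency proportion + instance count."""
    acc: dict[tuple[str, str], list[float]] = defaultdict(list)
    for row in csv.DictReader(fileobj):
        key = (row["Target"], row["Response"])  # header names are assumptions
        # Consistency is a raw listener count, NOT a proportion
        acc[key].append(int(row["Consistency"]) / int(row["N-Listeners"]))
    return [
        (target, confusion, sum(props) / len(props), len(props))
        for (target, confusion), props in acc.items()
    ]

# Tiny inline sample standing in for confusionCorpus_v1.2.csv
sample = "Target,Response,Consistency,N-Listeners\ncat,bat,9,20\ncat,bat,11,20\n"
pairs = load_eccc(io.StringIO(sample))
```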
New file: morphology.py¶
load_morpholex() → dict[str, dict] — word → {morpheme_count, n_prefixes, n_suffixes, is_monomorphemic, segmentation}
New file: child_frequency.py¶
load_cyplex() → dict[str, dict] — word → {freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13}
3. Consumer Changes¶
packages/web/workers/scripts/export-to-d1.py¶
Becomes a thin SQL writer:
1. from phonolex_data.pipeline import build_lexical_database
2. db = build_lexical_database()
3. Write SQL using existing INSERT generation logic — no vocabulary filtering
4. For norm-only words (has_phonology=False): write SQL NULLs for phonological columns (not empty strings or zeros)
5. Only write word_syllables rows for words with has_phonology=True
6. Delete all pickle imports, GRAPH_PATH, WCM computation, norm loading — all now in the pipeline
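The NULL rule in step 4 can be illustrated with a small literal formatter; sql_literal here is a hypothetical helper, not the existing INSERT generation code:

```python
def sql_literal(value) -> str:
    """Render a Python value as a SQL literal; None becomes NULL, never '' or 0."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check before str/number handling
        return "1" if value else "0"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)

# A norm-only word: phonological columns render as NULL, norm columns keep data
row = {"word": "zyzzyva", "has_phonology": False, "ipa": None, "prevalence": 0.41}
values = ", ".join(sql_literal(v) for v in row.values())
# values == "'zyzzyva', 0, NULL, 0.41"
```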
The config.py property definitions (PROPERTY_MAP, FILTERABLE_PROPERTIES, PERCENTILE_PROPERTIES) stay in workers/scripts/ since they define the D1 schema contract.
packages/dashboard/scripts/build_lookup.py¶
Already fixed to use phonolex_data.loaders. Could optionally call build_lexical_database() for the word data instead of loading datasets individually, but this is not required for 4.0.0 — it works as-is after the import fixes.
4. D1 Schema Changes¶
Existing column constraint changes¶
The current schema declares NOT NULL on phonological columns (ipa, phonemes, phonemes_str, syllables, phoneme_count, syllable_count). These must become nullable to support norm-only words. The full CREATE TABLE is regenerated by export-to-d1.py, so this is a change in the seed SQL, not a migration.
New columns on the words table¶
-- Phonology flag for easy consumer filtering
has_phonology INTEGER NOT NULL DEFAULT 1, -- NEW: 1 for CMU words, 0 for norm-only
-- Prevalence (Brysbaert)
prevalence REAL,
-- Morphology (MorphoLex) — morpheme_count, is_monomorphemic, n_prefixes, n_suffixes already exist
morphological_segmentation TEXT, -- NEW: "un|break|able"
-- Phonotactic probability (IPhOD replaces Vitevitch & Luce for phono_prob_avg, positional_prob_avg)
neighborhood_density INTEGER, -- NEW
str_phono_prob_avg REAL, -- NEW (stressed biphone average)
str_positional_prob_avg REAL, -- NEW (stressed positional segment avg)
str_neighborhood_density INTEGER, -- NEW (stressed neighborhood density)
-- Child frequency (CYP-LEX)
freq_cyplex_7_9 REAL, -- NEW
freq_cyplex_10_12 REAL, -- NEW
freq_cyplex_13 REAL, -- NEW
The prevalence column already exists in the current schema (from Brysbaert Prevalence in the pickle) but wasn't populated via a loader. Now it will be.
FILTERABLE_PROPERTIES and PROPERTY_MAP in config.py and properties.ts are updated to include the new columns.
TypeScript type changes¶
WordRow and WordResponse in workers/src/types.ts must make phonological fields nullable (ipa: string | null, phonemes: string | null, etc.). This is a breaking API contract change — all 5 tool routes and the frontend components that display phonological data must handle null phonological fields gracefully (e.g., show norm data but omit phonological sections for norm-only words).
5. What Changes Downstream¶
- D1 schema for edges table — same columns, same structure (no change)
- Hono API routes — phonological tool routes (similarity, contrastive, patterns) must add WHERE has_phonology = 1 to exclude norm-only words. Norm-based queries (word lists filtered by frequency, AoA, etc.) work on all words.
- TypeScript types — WordRow and WordResponse phonological fields become nullable; frontend components handle this gracefully
- React frontend — dynamically renders from property metadata (no change to rendering logic), but Lookup and word detail views should handle norm-only words (show available data, omit phonological sections)
- Similarity algorithm — still uses PHOIBLE dot products; norm-only words are naturally excluded since they have no word_syllables rows
- G2P alignment pipeline — separate concern, not part of this work (no change)
6. Success Criteria¶
- build_lexical_database() runs to completion and returns a LexicalDatabase with all 25 datasets merged
- export-to-d1.py produces a d1-seed.sql with the updated schema (nullable phonological columns, has_phonology flag, new columns)
- Word count is substantially higher than before (union of all datasets, estimated ~150K+ words vs previous ~44K)
- Words split correctly: has_phonology=1 words have full IPA/phonemes/syllables; has_phonology=0 words have NULL phonological fields and at least one non-null norm value
- Edge count is comparable or higher (7 edge sources, same merge logic)
- All 5 PhonoLex web tools work with the new seed — phonological tools operate on has_phonology=1 words, norm-based queries include all words
- build_lookup.py works unchanged (already fixed imports)
- No pickle files referenced anywhere in the codebase