
Integrated Lexical Database Pipeline Design

2026-03-13 — Replace pickle-based pipeline with direct data assembly from raw datasets


Goal

Build the integrated lexical database directly from 25 raw source datasets through phonolex_data loaders and phonology utils, eliminating the cognitive_graph_v1.1_empirical.pkl intermediate artifact. The pipeline produces a LexicalDatabase object that any consumer (D1 export, governor lookup builder, future tools) can use.

Context

The monorepo migration created packages/data/ with loaders for 18 datasets and phonology utilities (syllabification, WCM, normalization). The current export-to-d1.py reads a pre-built pickle for all word data and edges. That pickle was built by scripts no longer in the codebase, and 4 of 7 edge source datasets had no loaders.

We've now acquired 8 additional datasets (MEN, WordSim-353, SPP, ECCC, MorphoLex, Prevalence, CYP-LEX, IPhOD) and need a pipeline that builds everything from source.

Approach

Focused pipeline modules in packages/data/src/phonolex_data/pipeline/, each with a single responsibility. A thin orchestrator assembles the full database. Consumers apply their own downstream logic (SQL export, tokenizer-based lookup building, API-level filtering).

No vocabulary filtering in the pipeline or database. The word universe is the union of all source datasets — any word appearing in CMU dict, SUBTLEX, prevalence, or any other dataset gets a record. Words with CMU entries have full phonological data; words without CMU entries have null phonological fields but carry whatever norms/frequency data their source datasets provide. Filtering is a query-time / consumer concern.


1. Pipeline Modules

Module: schema.py — Data Contract

Shared types consumed by all pipeline stages and downstream consumers.

from dataclasses import dataclass

@dataclass
class WordRecord:
    word: str
    has_phonology: bool            # True for CMU dict words, False for norm-only words
    # Phonological fields — populated for CMU dict words, None/empty for norm-only words
    ipa: str | None
    phonemes: list[str]            # empty list for non-CMU words
    phoneme_count: int | None
    syllables: list[dict]          # [{onset, nucleus, coda, stress}], empty for non-CMU
    syllable_count: int | None
    initial_phoneme: str | None
    final_phoneme: str | None
    wcm_score: int | None
    # Norms — all optional, None means no data
    frequency: float | None        # SUBTLEX
    log_frequency: float | None
    contextual_diversity: float | None
    prevalence: float | None       # Brysbaert (NEW)
    aoa: float | None              # Glasgow AoA
    aoa_kuperman: float | None
    imageability: float | None
    familiarity: float | None
    concreteness: float | None
    size: float | None
    valence: float | None
    arousal: float | None
    dominance: float | None
    iconicity: float | None
    boi: float | None
    socialness: float | None
    auditory: float | None
    visual: float | None
    haptic: float | None
    gustatory: float | None
    olfactory: float | None
    interoceptive: float | None
    hand_arm: float | None
    foot_leg: float | None
    head: float | None
    mouth: float | None
    torso: float | None
    elp_lexical_decision_rt: float | None
    semantic_diversity: float | None
    # Morphology (NEW — MorphoLex)
    morpheme_count: int | None
    is_monomorphemic: bool | None
    n_prefixes: int | None
    n_suffixes: int | None
    morphological_segmentation: str | None  # e.g., "un|break|able"
    # Phonotactic probability (IPhOD replaces Vitevitch & Luce JSON loader)
    neighborhood_density: int | None   # unsDENS from IPhOD
    phono_prob_avg: float | None       # unsBPAV (unstressed biphone average)
    positional_prob_avg: float | None  # unsPOSPAV (unstressed positional segment avg)
    # IPhOD stressed variants
    str_phono_prob_avg: float | None   # strBPAV (stressed biphone average)
    str_positional_prob_avg: float | None  # strPOSPAV (stressed positional segment avg)
    str_neighborhood_density: int | None   # strDENS
    # Child frequency (NEW — CYP-LEX)
    freq_cyplex_7_9: float | None   # Zipf frequency, age 7-9
    freq_cyplex_10_12: float | None # Zipf frequency, age 10-12
    freq_cyplex_13: float | None    # Zipf frequency, age 13+
    # Vocab memberships
    vocab_memberships: set[str]

@dataclass
class EdgeRecord:
    source: str
    target: str
    edge_sources: list[str]        # ["SWOW", "USF", ...]
    swow_strength: float | None
    usf_forward: float | None
    usf_backward: float | None
    men_relatedness: float | None
    simlex_similarity: float | None
    simlex_pos: str | None         # POS from SimLex-999; requires updating load_simlex()
    wordsim_relatedness: float | None
    # SPP — priming effects (RT differences) between related and unrelated conditions
    spp_first_priming: float | None    # first_priming_overall (1st associate)
    spp_other_priming: float | None    # other_priming_overall (other associate)
    spp_fas: float | None              # firstassoc_fas (forward association strength)
    spp_lsa: float | None              # firstassoc_lsa (latent semantic analysis similarity)
    # ECCC — aggregated speech-in-noise confusion data
    eccc_consistency: float | None     # proportion of listeners giving same confusion response
    eccc_n_instances: int | None       # number of confusion instances for this pair
    eccc_phoneme_distance: float | None

@dataclass
class DerivedData:
    percentiles: dict[str, dict[str, float | None]]  # word → {prop_percentile: val}
    minimal_pairs: list[tuple]     # (w1, w2, p1, p2, pos, pos_type)
    phoneme_data: dict[str, dict]  # ipa → {type, features}
    phoneme_norms: dict[str, float]  # ipa → norm_sq
    phoneme_dots: list[tuple]      # (ipa1, ipa2, dot)
    components: list[dict]         # [{id, type, phonemes}]
    word_syllable_data: dict       # word → [{onset_key, nucleus_key, coda_key}]
    component_key_to_id: dict

@dataclass
class LexicalDatabase:
    words: dict[str, WordRecord]
    edges: list[EdgeRecord]
    derived: DerivedData
    phoible_vectors: dict          # Raw PHOIBLE data for dot products

Module: words.py — Assemble Word Records

build_words() → dict[str, WordRecord]

Steps:

  1. Load CMU dict via cmudict_to_phono() — base IPA + phoneme list → seed initial word records with full phonological data
  2. Syllabify each word via phonolex_data.phonology.syllabification
  3. Compute WCM via phonolex_data.phonology.wcm
  4. Normalize IPA via phonolex_data.phonology.normalize
  5. Load and merge all norm datasets:
     • Existing (11): Warriner, Kuperman, Glasgow, Concreteness, Sensorimotor, Semantic Diversity, Socialness, BOI, SUBTLEX, ELP, Iconicity
     • New (4): Prevalence, MorphoLex, CYP-LEX, IPhOD
     • For each norm dataset, if a word doesn't exist yet in the word dict, create a new WordRecord with null phonological fields and populate only the norm data
  6. Load vocab list memberships (Ogden, AFINN, stop words, Swadesh, Roget, GSL, AVL)
  7. Return dict[str, WordRecord] — union of all source datasets, unfiltered

Each norm loader returns dict[str, dict[str, Any]] (word → property dict). Merging is a simple update loop: for each word in every dataset, get or create a WordRecord and merge in the properties.

New loaders needed:

  • load_prevalence() — reads data/norms/prevalence/English_Word_Prevalences.xlsx
  • load_morpholex() — reads data/norms/morpholex/MorphoLEX_en.xlsx, returns segmentation + counts
  • load_cyplex() — reads data/norms/cyplex/CYPLEX_all_age_bands.csv (combined file with all 3 bands); maps CYPLEX79_log → freq_cyplex_7_9, CYPLEX1012_log → freq_cyplex_10_12, CYPLEX13_log → freq_cyplex_13
  • load_iphod() — reads data/norms/iphod/IPhOD2_Words.txt (tab-delimited), returns density + phonotactic measures
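A minimal sketch of the IPhOD parsing step. The source column names (unsDENS, unsBPAV, unsPOSPAV, strBPAV, strPOSPAV, strDENS) are taken from the schema.py field comments above; the `Word` header name and exact file layout are assumptions to verify against the real file.

```python
import csv
from io import StringIO

# Assumed IPhOD column names, per the field comments in schema.py.
IPHOD_COLUMNS = {
    "unsDENS": "neighborhood_density",
    "unsBPAV": "phono_prob_avg",
    "unsPOSPAV": "positional_prob_avg",
    "strBPAV": "str_phono_prob_avg",
    "strPOSPAV": "str_positional_prob_avg",
    "strDENS": "str_neighborhood_density",
}

def parse_iphod(text: str) -> dict[str, dict]:
    """Parse tab-delimited IPhOD rows into word -> measure dicts."""
    out: dict[str, dict] = {}
    for row in csv.DictReader(StringIO(text), delimiter="\t"):
        word = row["Word"].lower()
        out[word] = {
            field: (int(row[col]) if field.endswith("density") else float(row[col]))
            for col, field in IPHOD_COLUMNS.items()
        }
    return out

# Tiny inline sample standing in for IPhOD2_Words.txt.
sample = (
    "Word\tunsDENS\tunsBPAV\tunsPOSPAV\tstrBPAV\tstrPOSPAV\tstrDENS\n"
    "CAT\t32\t0.012\t0.045\t0.011\t0.043\t30\n"
)
measures = parse_iphod(sample)
```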

Module: edges.py — Assemble Edge Records

build_edges(words: dict[str, WordRecord]) → list[EdgeRecord]

Steps:

  1. Load association datasets via loaders:
     • load_swow() → {cue: {response: strength}}
     • load_free_association() → {cue: {target: forward_strength}}
     • load_simlex() → [(w1, w2, similarity, pos)] (update existing to also return POS column)
     • load_men() → [(w1, w2, relatedness)] (NEW)
     • load_wordsim() → [(w1, w2, relatedness)] (NEW)
     • load_spp() → [(target, prime, first_priming, other_priming, fas, lsa)] (NEW — from spp_ldt_item_analysis.xlsx)
     • load_eccc() → [(target, confusion, consistency, n_instances, phoneme_distance)] (NEW — aggregated per pair from confusionCorpus_v1.2.csv)
  2. Build edge index: dict[tuple[str, str], EdgeRecord] keyed by the sorted word pair
  3. For each dataset, iterate pairs and merge into the index:
     • If the pair exists: add the new attributes, append to edge_sources
     • If new: create an EdgeRecord with this dataset's attributes, others as None
  4. Return list[EdgeRecord]

The words argument is passed so edges can be filtered to only include words that exist in the lexicon (avoiding dangling references).

Module: derived.py — Compute Derived Data

build_derived(words, phoible_vectors) → DerivedData

Same logic as the current export-to-d1.py sections 3-8. Phonology-dependent derived computations operate only on words where has_phonology=True — norm-only words are skipped since they have no phonemes, syllables, or IPA.

  • Percentiles: bisect_right(sorted_vals, val) / N * 100 for all percentile-eligible properties (computed over all words that have each property, regardless of has_phonology)
  • Minimal pairs: group by phoneme count, compare same-length words for single-phoneme differences (phonology-only words)
  • Phoneme data: extract from PHOIBLE vectors, classify vowel/consonant, decode features
  • Phoneme dot products: pairwise dot products of 76-dimensional feature vectors
  • Syllable components: extract unique onset/nucleus/coda tuples, assign IDs, build word-syllable mappings (phonology-only words)

Module: __init__.py — Orchestrator

def build_lexical_database() -> LexicalDatabase:
    """Build the complete integrated lexical database from raw datasets."""
    phoible = load_phoible()
    words = build_words()
    edges = build_edges(words)
    derived = build_derived(words, phoible)
    return LexicalDatabase(words=words, edges=edges, derived=derived, phoible_vectors=phoible)

2. New Loaders

Added to packages/data/src/phonolex_data/loaders/:

norms.py (extend existing)

  • load_prevalence() → dict[str, dict] — word → {prevalence: float}. Note: reads the Pknown column (proportion 0-1), NOT the Prevalence column (which is log-scale)
  • load_iphod() → dict[str, dict] — word → {neighborhood_density, phono_prob_avg, positional_prob_avg, str_phono_prob_avg, str_positional_prob_avg, str_neighborhood_density}. Replaces the existing load_phonotactic_probability() from phoible.py (Vitevitch & Luce JSON); that loader should be deprecated.

associations.py (extend existing)

  • load_simlex() — update return type from list[tuple[str, str, float]] to list[tuple[str, str, float, str]] to include POS column (needed for EdgeRecord.simlex_pos)
  • load_men() → list[tuple[str, str, float]] — (word1, word2, relatedness)
  • load_wordsim() → list[tuple[str, str, float]] — (word1, word2, relatedness)
  • load_spp() → list[tuple] — (target, prime, first_priming, other_priming, fas, lsa). Reads spp_ldt_item_analysis.xlsx; columns: target, prime_1st_assoc, first_priming_overall, other_priming_overall, firstassoc_fas, firstassoc_lsa
  • load_eccc() → list[tuple] — (target, confusion, consistency, n_instances, phoneme_distance). Aggregated per (target, confusion) pair from confusionCorpus_v1.2.csv; Consistency column is a raw listener count (e.g. 9), NOT a proportion — divide by N-Listeners to get proportion (0-1)
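
The per-pair aggregation and count-to-proportion normalization for ECCC can be sketched as below. The row keys ('target', 'confusion', 'consistency', 'n_listeners', 'phoneme_distance') are illustrative names, not the exact CSV headers of confusionCorpus_v1.2.csv.

```python
from collections import defaultdict

def aggregate_eccc(rows: list[dict]) -> list[tuple]:
    """Aggregate raw confusion rows into per-(target, confusion) tuples.

    The raw Consistency value is a listener count, so each row is divided
    by its listener total to get a 0-1 proportion before averaging.
    """
    groups: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for row in rows:
        groups[(row["target"], row["confusion"])].append(row)
    out = []
    for (target, confusion), items in groups.items():
        proportion = sum(r["consistency"] / r["n_listeners"] for r in items) / len(items)
        distance = sum(r["phoneme_distance"] for r in items) / len(items)
        out.append((target, confusion, proportion, len(items), distance))
    return out

rows = [
    {"target": "bat", "confusion": "pat", "consistency": 9, "n_listeners": 10, "phoneme_distance": 1.0},
    {"target": "bat", "confusion": "pat", "consistency": 6, "n_listeners": 10, "phoneme_distance": 1.0},
]
edges = aggregate_eccc(rows)
```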

New file: morphology.py

  • load_morpholex() → dict[str, dict] — word → {morpheme_count, n_prefixes, n_suffixes, is_monomorphemic, segmentation}

New file: child_frequency.py

  • load_cyplex() → dict[str, dict] — word → {freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13}

3. Consumer Changes

packages/web/workers/scripts/export-to-d1.py

Becomes a thin SQL writer:

  1. from phonolex_data.pipeline import build_lexical_database
  2. db = build_lexical_database()
  3. Write SQL using the existing INSERT generation logic — no vocabulary filtering
  4. For norm-only words (has_phonology=False): write SQL NULLs for phonological columns (not empty strings or zeros)
  5. Only write word_syllables rows for words with has_phonology=True
  6. Delete all pickle imports, GRAPH_PATH, WCM computation, and norm loading — all now live in the pipeline
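Step 4's NULL handling is the easy thing to get wrong. A hypothetical rendering helper (not the script's existing INSERT generation logic) that maps Python None to SQL NULL rather than '' or 0:

```python
def sql_literal(value) -> str:
    """Render a Python value as a SQLite literal; None becomes NULL, not '' or 0."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"

def insert_row(table: str, columns: list[str], values: list) -> str:
    rendered = ", ".join(sql_literal(v) for v in values)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({rendered});"

# A norm-only word: has_phonology=0, phonological columns NULL.
stmt = insert_row("words", ["word", "has_phonology", "ipa"], ["axolotl", False, None])
```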

The config.py property definitions (PROPERTY_MAP, FILTERABLE_PROPERTIES, PERCENTILE_PROPERTIES) stay in workers/scripts/ since they define the D1 schema contract.

packages/dashboard/scripts/build_lookup.py

Already fixed to use phonolex_data.loaders. Could optionally call build_lexical_database() for the word data instead of loading datasets individually, but this is not required for 4.0.0 — it works as-is after the import fixes.


4. D1 Schema Changes

Existing column constraint changes

The current schema declares NOT NULL on phonological columns (ipa, phonemes, phonemes_str, syllables, phoneme_count, syllable_count). These must become nullable to support norm-only words. The full CREATE TABLE is regenerated by export-to-d1.py, so this is a change in the seed SQL, not a migration.

New columns on the words table

-- Phonology flag for easy consumer filtering
has_phonology INTEGER NOT NULL DEFAULT 1,  -- NEW: 1 for CMU words, 0 for norm-only
-- Prevalence (Brysbaert)
prevalence REAL,
-- Morphology (MorphoLex) — morpheme_count, is_monomorphemic, n_prefixes, n_suffixes already exist
morphological_segmentation TEXT,  -- NEW: "un|break|able"
-- Phonotactic probability (IPhOD replaces Vitevitch & Luce for phono_prob_avg, positional_prob_avg)
neighborhood_density INTEGER,     -- NEW
str_phono_prob_avg REAL,          -- NEW (stressed biphone average)
str_positional_prob_avg REAL,     -- NEW (stressed positional segment avg)
str_neighborhood_density INTEGER, -- NEW (stressed neighborhood density)
-- Child frequency (CYP-LEX)
freq_cyplex_7_9 REAL,            -- NEW
freq_cyplex_10_12 REAL,          -- NEW
freq_cyplex_13 REAL,             -- NEW

The prevalence column already exists in the current schema (from Brysbaert Prevalence in the pickle) but wasn't populated via a loader. Now it will be.

FILTERABLE_PROPERTIES and PROPERTY_MAP in config.py and properties.ts are updated to include the new columns.

TypeScript type changes

WordRow and WordResponse in workers/src/types.ts must make phonological fields nullable (ipa: string | null, phonemes: string | null, etc.). This is a breaking API contract change — all 5 tool routes and the frontend components that display phonological data must handle null phonological fields gracefully (e.g., show norm data but omit phonological sections for norm-only words).


5. What Changes Downstream

  • D1 schema for edges table — same columns, same structure (no change)
  • Hono API routes — phonological tool routes (similarity, contrastive, patterns) must add WHERE has_phonology = 1 to exclude norm-only words. Norm-based queries (word lists filtered by frequency, AoA, etc.) work on all words.
  • TypeScript types — WordRow and WordResponse phonological fields become nullable; frontend components handle them gracefully
  • React frontend — dynamically renders from property metadata (no change to rendering logic), but Lookup and word detail views should handle norm-only words (show available data, omit phonological sections)
  • Similarity algorithm — still uses PHOIBLE dot products; norm-only words naturally excluded since they have no word_syllables rows
  • G2P alignment pipeline — separate concern, not part of this work (no change)

6. Success Criteria

  1. build_lexical_database() runs to completion and returns a LexicalDatabase with all 25 datasets merged
  2. export-to-d1.py produces a d1-seed.sql with the updated schema (nullable phonological columns, has_phonology flag, new columns)
  3. Word count is substantially higher than before (union of all datasets, estimated ~150K+ words vs previous ~44K)
  4. Words split correctly: has_phonology=1 words have full IPA/phonemes/syllables; has_phonology=0 words have NULL phonological fields and at least one non-null norm value
  5. Edge count is comparable or higher (7 edge sources, same merge logic)
  6. All 5 PhonoLex web tools work with the new seed — phonological tools operate on has_phonology=1 words, norm-based queries include all words
  7. build_lookup.py works unchanged (already fixed imports)
  8. No pickle files referenced anywhere in the codebase