
Integrated Lexical Database Pipeline — Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Build the integrated lexical database directly from 25 raw source datasets, eliminating the pickle dependency.

Architecture: Focused pipeline modules in packages/data/src/phonolex_data/pipeline/ (schema, words, edges, derived, orchestrator). New loaders for 8 datasets added to packages/data/src/phonolex_data/loaders/. Consumer export-to-d1.py becomes a thin SQL writer calling build_lexical_database().

Tech Stack: Python 3.10+, openpyxl, numpy, pytest. D1 seed SQL output. TypeScript types in Hono workers.

Spec: docs/superpowers/specs/2026-03-13-integrated-lexical-database-pipeline-design.md
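To make the "thin SQL writer" role concrete, here is a minimal self-contained sketch of the intended consumer shape. The build_lexical_database() signature, the WordRecord fields, and to_seed_sql() are assumptions from this plan (stubbed stand-ins, not the real pipeline API):

```python
from dataclasses import dataclass


@dataclass
class WordRecord:  # stand-in for phonolex_data.pipeline.schema.WordRecord
    word: str
    has_phonology: bool


def build_lexical_database() -> list[WordRecord]:  # stand-in orchestrator
    return [WordRecord("cat", True), WordRecord("café", False)]


def to_seed_sql(records: list[WordRecord]) -> str:
    """Render records as D1 seed SQL — the 'thin SQL writer' role of export-to-d1.py."""
    values = ",\n".join(f"('{r.word}', {int(r.has_phonology)})" for r in records)
    return f"INSERT INTO words (word, has_phonology) VALUES\n{values};"


print(to_seed_sql(build_lexical_database()))
```

The real writer would iterate the full LexicalDatabase (words, edges, derived data) and emit parameterized statements, but the shape stays the same: one call into the pipeline, then pure SQL rendering.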


File Structure

New files to create:

  • packages/data/src/phonolex_data/loaders/morphology.py — load_morpholex()
  • packages/data/src/phonolex_data/loaders/child_frequency.py — load_cyplex()
  • packages/data/src/phonolex_data/pipeline/__init__.py — orchestrator (build_lexical_database())
  • packages/data/src/phonolex_data/pipeline/schema.py — WordRecord, EdgeRecord, DerivedData, LexicalDatabase dataclasses
  • packages/data/src/phonolex_data/pipeline/words.py — build_words()
  • packages/data/src/phonolex_data/pipeline/edges.py — build_edges()
  • packages/data/src/phonolex_data/pipeline/derived.py — build_derived()
  • packages/data/tests/test_new_loaders.py — tests for 8 new loaders + simlex update
  • packages/data/tests/test_pipeline.py — tests for pipeline modules

Existing files to modify:

  • packages/data/src/phonolex_data/loaders/norms.py — add load_prevalence(), load_iphod()
  • packages/data/src/phonolex_data/loaders/associations.py — add load_men(), load_wordsim(), load_spp(), load_eccc(), update load_simlex()
  • packages/data/src/phonolex_data/loaders/__init__.py — export new functions
  • packages/web/workers/scripts/export-to-d1.py — rewrite to use pipeline
  • packages/web/workers/scripts/config.py — add new PropertyDefs, update edge types
  • packages/web/workers/src/types.ts — nullable phonological fields, new columns
  • packages/web/workers/src/config/properties.ts — add new property categories

Chunk 1: New Norm Loaders

Task 1: load_prevalence()

Files:
  • Modify: packages/data/src/phonolex_data/loaders/norms.py
  • Test: packages/data/tests/test_new_loaders.py

Context: Reads data/norms/prevalence/English_Word_Prevalences.xlsx. The proportion column is Pknown (0-1), NOT Prevalence (which is a probit-transformed score, not a proportion). ~62K words. Follow pattern of existing norm loaders (e.g., load_warriner()).

  • [ ] Step 1: Write the failing test
# packages/data/tests/test_new_loaders.py
"""Tests for newly added loaders (prevalence, iphod, morpholex, cyplex, men, wordsim, spp, eccc)."""

from __future__ import annotations

import pytest


def test_load_prevalence():
    from phonolex_data.loaders import load_prevalence

    result = load_prevalence()
    assert isinstance(result, dict)
    assert len(result) > 50000  # ~62K words

    # Spot check a common word
    assert "the" in result
    entry = result["the"]
    assert "prevalence" in entry
    assert 0.0 <= entry["prevalence"] <= 1.0
    # "the" should be known by nearly everyone
    assert entry["prevalence"] > 0.9
  • [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_prevalence -v
Expected: FAIL (ImportError — load_prevalence not yet defined)

  • [ ] Step 3: Implement load_prevalence()

Add to packages/data/src/phonolex_data/loaders/norms.py:

def load_prevalence(path: str | Path | None = None) -> dict[str, dict[str, float]]:
    """Load Brysbaert et al. (2019) word prevalence norms.

    Returns:
        {word: {prevalence: float}}  — proportion of people who know the word (0-1)
    """
    openpyxl = require_openpyxl()
    path = Path(path) if path else get_data_dir() / "norms" / "prevalence" / "English_Word_Prevalences.xlsx"
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    ws = wb.active

    result: dict[str, dict[str, float]] = {}
    header = None
    for row in ws.iter_rows(values_only=True):
        if header is None:
            header = [str(c).strip() if c else "" for c in row]
            continue
        word = row[header.index("Word")]
        if not word or not isinstance(word, str):
            continue
        try:
            prevalence = float(row[header.index("Pknown")])
            result[word.strip().lower()] = {"prevalence": prevalence}
        except (ValueError, TypeError):
            continue
    wb.close()
    return result

Add to packages/data/src/phonolex_data/loaders/__init__.py:

from phonolex_data.loaders.norms import load_prevalence
(Add to both the import and the __all__ list if one exists.)

  • [ ] Step 4: Run test to verify it passes

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_prevalence -v
Expected: PASS

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/loaders/norms.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_prevalence() loader for Brysbaert word prevalence norms"

Task 2: load_iphod()

Files:
  • Modify: packages/data/src/phonolex_data/loaders/norms.py
  • Test: packages/data/tests/test_new_loaders.py

Context: Reads data/norms/iphod/IPhOD2_Words.txt (tab-delimited). Key columns: Word, unsDENS (int), unsBPAV (float), unsPOSPAV (float), strDENS (int), strBPAV (float), strPOSPAV (float). ~54K words. Some words have multiple pronunciation rows — take the first occurrence. Replaces the old load_phonotactic_probability() from phoible.py.

  • [ ] Step 1: Write the failing test
# Add to packages/data/tests/test_new_loaders.py

def test_load_iphod():
    from phonolex_data.loaders import load_iphod

    result = load_iphod()
    assert isinstance(result, dict)
    assert len(result) > 30000  # ~54K unique words

    # Spot check
    assert "cat" in result
    entry = result["cat"]
    expected_keys = {
        "neighborhood_density", "phono_prob_avg", "positional_prob_avg",
        "str_neighborhood_density", "str_phono_prob_avg", "str_positional_prob_avg",
    }
    assert set(entry.keys()) == expected_keys
    assert isinstance(entry["neighborhood_density"], int)
    assert isinstance(entry["phono_prob_avg"], float)
  • [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_iphod -v
Expected: FAIL

  • [ ] Step 3: Implement load_iphod()

Add to packages/data/src/phonolex_data/loaders/norms.py:

def load_iphod(path: str | Path | None = None) -> dict[str, dict[str, float | int]]:
    """Load IPhOD2 phonotactic probability and neighborhood density norms.

    Replaces load_phonotactic_probability() (Vitevitch & Luce JSON).

    Returns:
        {word: {neighborhood_density, phono_prob_avg, positional_prob_avg,
                str_neighborhood_density, str_phono_prob_avg, str_positional_prob_avg}}
    """
    path = Path(path) if path else get_data_dir() / "norms" / "iphod" / "IPhOD2_Words.txt"
    result: dict[str, dict[str, float | int]] = {}
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            word = (row.get("Word") or "").strip().lower()  # guard against short rows (None values)
            if not word or word in result:  # take first pronunciation only
                continue
            try:
                result[word] = {
                    "neighborhood_density": int(float(row["unsDENS"])),
                    "phono_prob_avg": float(row["unsBPAV"]),
                    "positional_prob_avg": float(row["unsPOSPAV"]),
                    "str_neighborhood_density": int(float(row["strDENS"])),
                    "str_phono_prob_avg": float(row["strBPAV"]),
                    "str_positional_prob_avg": float(row["strPOSPAV"]),
                }
            except (ValueError, KeyError):
                continue
    return result

Export from __init__.py.

  • [ ] Step 4: Run test to verify it passes

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_iphod -v
Expected: PASS

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/loaders/norms.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_iphod() loader for IPhOD2 phonotactic probability norms"

Task 3: load_morpholex()

Files:
  • Create: packages/data/src/phonolex_data/loaders/morphology.py
  • Test: packages/data/tests/test_new_loaders.py

Context: Reads data/norms/morpholex/MorphoLEX_en.xlsx. Important: The first sheet ("Presentation") is a legend — data is spread across 30 PRS-signature sheets (e.g., '0-1-0', '1-1-1'). Skip the first sheet and the last 3 sheets ('All prefixes', 'All suffixes', 'All roots'). Columns per data sheet: Word, MorphoLexSegm (segmentation using {<>()} brackets like {(dark)}>ness>), Nmorph, PRS_signature (comma-separated P,R,S counts, e.g., "0,1,1"). There are NO nPrefix/nSuffix columns — derive from PRS_signature. ~70K words across all sheets. Segmentation uses curly braces {}, angle brackets <>, and parens () — strip all bracket types to get morpheme segments.

  • [ ] Step 1: Write the failing test
# Add to packages/data/tests/test_new_loaders.py

def test_load_morpholex():
    from phonolex_data.loaders import load_morpholex

    result = load_morpholex()
    assert isinstance(result, dict)
    assert len(result) > 50000  # ~70K words

    # Check a known polymorphemic word
    # "unbreakable" may not be in the dataset, pick a common word
    # MorphoLex has "darkness" → (dark)ness
    if "darkness" in result:
        entry = result["darkness"]
        expected_keys = {
            "morpheme_count", "n_prefixes", "n_suffixes",
            "is_monomorphemic", "morphological_segmentation",
        }
        assert set(entry.keys()) == expected_keys
        assert entry["morpheme_count"] >= 2
        assert entry["is_monomorphemic"] is False
        assert isinstance(entry["morphological_segmentation"], str)
        assert "|" in entry["morphological_segmentation"]

    # Check a monomorphemic word
    if "cat" in result:
        assert result["cat"]["is_monomorphemic"] is True
        assert result["cat"]["morpheme_count"] == 1
  • [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_morpholex -v
Expected: FAIL

  • [ ] Step 3: Implement load_morpholex()

Create packages/data/src/phonolex_data/loaders/morphology.py:

"""Morphology loaders."""

from __future__ import annotations

import re
from pathlib import Path

from phonolex_data.loaders._helpers import get_data_dir, require_openpyxl


def _parse_morpholex_segmentation(segm: str) -> str:
    """Convert MorphoLex bracket notation to pipe-delimited.

    MorphoLex uses {}, <>, and () brackets:
      '{(cat)}' → 'cat'
      '{(dark)}>ness>' → 'dark|ness'
      '{<un<(break)>able>}' → 'un|break|able'

    Strategy: strip all bracket types, keep text segments.
    """
    segments = re.findall(r"[^<>(){}]+", segm)
    segments = [s.strip() for s in segments if s.strip()]
    return "|".join(segments) if segments else segm


def load_morpholex(path: str | Path | None = None) -> dict[str, dict]:
    """Load MorphoLex-en morphological segmentation data.

    Data is spread across 30 PRS-signature sheets (skip first 'Presentation'
    sheet and last 3 summary sheets). Derive prefix/suffix counts from
    PRS_signature column (comma-separated P,R,S counts).

    Returns:
        {word: {morpheme_count, n_prefixes, n_suffixes,
                is_monomorphemic, morphological_segmentation}}
    """
    openpyxl = require_openpyxl()
    path = Path(path) if path else get_data_dir() / "norms" / "morpholex" / "MorphoLEX_en.xlsx"
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)

    # Skip first sheet (Presentation) and last 3 (All prefixes/suffixes/roots)
    data_sheets = wb.worksheets[1:-3] if len(wb.worksheets) > 4 else wb.worksheets[1:]

    result: dict[str, dict] = {}
    for ws in data_sheets:
        header = None
        for row in ws.iter_rows(values_only=True):
            if header is None:
                header = [str(c).strip() if c else "" for c in row]
                if "Word" not in header:
                    break  # not a data sheet
                continue
            try:
                word_val = row[header.index("Word")]
                if not word_val or not isinstance(word_val, str):
                    continue
                word = word_val.strip().lower()
                if word in result:
                    continue  # first occurrence wins

                segm_raw = str(row[header.index("MorphoLexSegm")] or "")
                segmentation = _parse_morpholex_segmentation(segm_raw)
                morpheme_count = len(segmentation.split("|")) if segmentation else 1

                # Derive prefix/suffix counts from PRS_signature (e.g., "0,1,1")
                prs = str(row[header.index("PRS_signature")] or "0,1,0")
                prs_parts = prs.split(",")
                n_prefixes = int(prs_parts[0]) if len(prs_parts) >= 1 else 0
                n_suffixes = int(prs_parts[2]) if len(prs_parts) >= 3 else 0

                result[word] = {
                    "morpheme_count": morpheme_count,
                    "n_prefixes": n_prefixes,
                    "n_suffixes": n_suffixes,
                    "is_monomorphemic": morpheme_count == 1,
                    "morphological_segmentation": segmentation,
                }
            except (ValueError, KeyError, IndexError):
                continue
    wb.close()
    return result

Export from __init__.py.

  • [ ] Step 4: Run test to verify it passes

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_morpholex -v
Expected: PASS

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/loaders/morphology.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_morpholex() loader for MorphoLex-en morphological segmentation"

Task 4: load_cyplex()

Files:
  • Create: packages/data/src/phonolex_data/loaders/child_frequency.py
  • Test: packages/data/tests/test_new_loaders.py

Context: Reads data/norms/cyplex/CYPLEX_all_age_bands.csv. Columns include Word, CYPLEX79_log, CYPLEX1012_log, CYPLEX13_log (Zipf-scale log frequencies). ~91K words (union across age bands). BOM-encoded CSV (starts with \ufeff).
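The BOM note matters in practice: decoding with plain utf-8 leaves "\ufeff" glued to the first header name, so DictReader lookups for "Word" silently miss. A minimal self-contained demonstration (synthetic two-line CSV standing in for the real CYP-LEX file):

```python
import csv
import io

# Synthetic BOM-prefixed CSV bytes, shaped like CYPLEX_all_age_bands.csv
raw = "\ufeffWord,CYPLEX79_log\nthe,6.2\n".encode("utf-8")

# Plain utf-8 keeps the BOM as part of the first header name
plain = csv.DictReader(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8"))
assert plain.fieldnames[0] == "\ufeffWord"  # row.get("Word") would return None

# utf-8-sig strips the BOM, so the header is clean
sig = csv.DictReader(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig"))
row = next(sig)
assert row["Word"] == "the"
```

This is why the loader below opens the file with encoding="utf-8-sig".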

  • [ ] Step 1: Write the failing test
# Add to packages/data/tests/test_new_loaders.py

def test_load_cyplex():
    from phonolex_data.loaders import load_cyplex

    result = load_cyplex()
    assert isinstance(result, dict)
    assert len(result) > 50000

    assert "the" in result
    entry = result["the"]
    expected_keys = {"freq_cyplex_7_9", "freq_cyplex_10_12", "freq_cyplex_13"}
    assert set(entry.keys()) == expected_keys
    # "the" should have high Zipf frequency in all bands
    for key in expected_keys:
        val = entry[key]
        assert val is None or isinstance(val, float)
  • [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_cyplex -v
Expected: FAIL

  • [ ] Step 3: Implement load_cyplex()

Create packages/data/src/phonolex_data/loaders/child_frequency.py:

"""Child frequency loaders."""

from __future__ import annotations

import csv
from pathlib import Path

from phonolex_data.loaders._helpers import get_data_dir


def load_cyplex(path: str | Path | None = None) -> dict[str, dict[str, float | None]]:
    """Load CYP-LEX child frequency norms (all 3 age bands).

    Reads CYPLEX_all_age_bands.csv — maps CYPLEX79_log, CYPLEX1012_log,
    CYPLEX13_log to freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13.

    Returns:
        {word: {freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13}}
    """
    path = (
        Path(path) if path
        else get_data_dir() / "norms" / "cyplex" / "CYPLEX_all_age_bands.csv"
    )
    column_map = {
        "CYPLEX79_log": "freq_cyplex_7_9",
        "CYPLEX1012_log": "freq_cyplex_10_12",
        "CYPLEX13_log": "freq_cyplex_13",
    }
    result: dict[str, dict[str, float | None]] = {}
    with open(path, encoding="utf-8-sig") as f:  # utf-8-sig handles BOM
        reader = csv.DictReader(f)
        for row in reader:
            word = (row.get("Word") or "").strip().lower()  # guard against short rows (None values)
            if not word:
                continue
            entry: dict[str, float | None] = {}
            for src_col, dest_key in column_map.items():
                raw = (row.get(src_col) or "").strip()
                try:
                    entry[dest_key] = float(raw) if raw else None
                except ValueError:
                    entry[dest_key] = None
            result[word] = entry
    return result

Export from __init__.py.

  • [ ] Step 4: Run test to verify it passes

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_cyplex -v
Expected: PASS

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/loaders/child_frequency.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_cyplex() loader for CYP-LEX child frequency norms"

Chunk 2: New Association Loaders

Task 5: load_men()

Files:
  • Modify: packages/data/src/phonolex_data/loaders/associations.py
  • Test: packages/data/tests/test_new_loaders.py

Context: Reads data/norms/men/MEN-TR-3k.txt. Space-delimited, NO header row. Format: word1 word2 score. 3,000 pairs. Scores range 0-50.

  • [ ] Step 1: Write the failing test
# Add to packages/data/tests/test_new_loaders.py

def test_load_men():
    from phonolex_data.loaders import load_men

    result = load_men()
    assert isinstance(result, list)
    assert len(result) == 3000

    w1, w2, score = result[0]
    assert isinstance(w1, str)
    assert isinstance(w2, str)
    assert isinstance(score, float)
    assert 0.0 <= score <= 50.0
  • [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_men -v
Expected: FAIL

  • [ ] Step 3: Implement load_men()

Add to packages/data/src/phonolex_data/loaders/associations.py:

def load_men(path: str | Path | None = None) -> list[tuple[str, str, float]]:
    """Load MEN semantic relatedness dataset (Bruni et al. 2014).

    Returns:
        [(word1, word2, relatedness_score), ...]  — 3,000 pairs, scores 0-50
    """
    path = Path(path) if path else get_data_dir() / "norms" / "men" / "MEN-TR-3k.txt"
    result: list[tuple[str, str, float]] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) != 3:
                continue
            try:
                result.append((parts[0].lower(), parts[1].lower(), float(parts[2])))
            except ValueError:
                continue
    return result

Export from __init__.py.

  • [ ] Step 4: Run test to verify it passes

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_men -v
Expected: PASS

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/loaders/associations.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_men() loader for MEN semantic relatedness dataset"

Task 6: load_wordsim()

Files:
  • Modify: packages/data/src/phonolex_data/loaders/associations.py
  • Test: packages/data/tests/test_new_loaders.py

Context: Reads data/norms/wordsim353/combined.csv. CSV with header. Columns: Word 1, Word 2, Human (mean). 353 pairs. Scores 0-10. Note: there's also a combined.tab but we use the CSV.

  • [ ] Step 1: Write the failing test
def test_load_wordsim():
    from phonolex_data.loaders import load_wordsim

    result = load_wordsim()
    assert isinstance(result, list)
    assert 340 <= len(result) <= 360  # ~353 pairs

    w1, w2, score = result[0]
    assert isinstance(w1, str)
    assert isinstance(w2, str)
    assert isinstance(score, float)
    assert 0.0 <= score <= 10.0
  • [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_wordsim -v
Expected: FAIL

  • [ ] Step 3: Implement load_wordsim()

Add to packages/data/src/phonolex_data/loaders/associations.py:

def load_wordsim(path: str | Path | None = None) -> list[tuple[str, str, float]]:
    """Load WordSim-353 semantic relatedness dataset (Finkelstein et al. 2002).

    Returns:
        [(word1, word2, relatedness_score), ...]  — ~353 pairs, scores 0-10
    """
    path = Path(path) if path else get_data_dir() / "norms" / "wordsim353" / "combined.csv"
    result: list[tuple[str, str, float]] = []
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                result.append((
                    row["Word 1"].strip().lower(),
                    row["Word 2"].strip().lower(),
                    float(row["Human (mean)"]),
                ))
            except (ValueError, KeyError):
                continue
    return result

Export from __init__.py.

  • [ ] Step 4: Run test to verify it passes

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_wordsim -v
Expected: PASS

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/loaders/associations.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_wordsim() loader for WordSim-353 dataset"

Task 7: load_spp()

Files:
  • Modify: packages/data/src/phonolex_data/loaders/associations.py
  • Test: packages/data/tests/test_new_loaders.py

Context: Reads data/norms/spp/spp_ldt_item_analysis.xlsx. Actual xlsx format. Key columns: target, prime_1st_assoc, first_priming_overall, other_priming_overall, firstassoc_fas, firstassoc_lsa. 1,661 rows. SPP measures priming effects — the relationship is (prime → target). Values can be negative (inhibition).

  • [ ] Step 1: Write the failing test
def test_load_spp():
    from phonolex_data.loaders import load_spp

    result = load_spp()
    assert isinstance(result, list)
    assert len(result) > 1500  # ~1,661 pairs

    target, prime, first_priming, other_priming, fas, lsa = result[0]
    assert isinstance(target, str)
    assert isinstance(prime, str)
    # Priming values can be negative
    assert isinstance(first_priming, (float, type(None)))
    assert isinstance(fas, (float, type(None)))
  • [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_spp -v
Expected: FAIL

  • [ ] Step 3: Implement load_spp()

Add to packages/data/src/phonolex_data/loaders/associations.py:

def load_spp(path: str | Path | None = None) -> list[tuple]:
    """Load Semantic Priming Project dataset (Hutchison et al. 2013).

    Reads spp_ldt_item_analysis.xlsx.

    Returns:
        [(target, prime, first_priming_overall, other_priming_overall,
          firstassoc_fas, firstassoc_lsa), ...]
    """
    from phonolex_data.loaders._helpers import require_openpyxl

    openpyxl = require_openpyxl()
    path = (
        Path(path) if path
        else get_data_dir() / "norms" / "spp" / "spp_ldt_item_analysis.xlsx"
    )
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    ws = wb.active

    result: list[tuple] = []
    header = None
    for row in ws.iter_rows(values_only=True):
        if header is None:
            header = [str(c).strip() if c else "" for c in row]
            continue
        try:
            target_raw = row[header.index("target")]
            target = str(target_raw).strip().lower() if target_raw else ""
            prime_raw = row[header.index("prime_1st_assoc")]
            prime = str(prime_raw).strip().lower() if prime_raw else ""
            if not target or not prime:
                continue

            def _float_or_none(col_name: str) -> float | None:
                idx = header.index(col_name)
                val = row[idx]
                if val is None or str(val).strip() == "":
                    return None
                return float(val)

            result.append((
                target,
                prime,
                _float_or_none("first_priming_overall"),
                _float_or_none("other_priming_overall"),
                _float_or_none("firstassoc_fas"),
                _float_or_none("firstassoc_lsa"),
            ))
        except (ValueError, KeyError):
            continue
    wb.close()
    return result

Export from __init__.py.

  • [ ] Step 4: Run test to verify it passes

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_spp -v
Expected: PASS

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/loaders/associations.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_spp() loader for Semantic Priming Project dataset"

Task 8: load_eccc()

Files:
  • Modify: packages/data/src/phonolex_data/loaders/associations.py
  • Test: packages/data/tests/test_new_loaders.py

Context: Reads data/norms/eccc/confusionCorpus_v1.2.csv. Columns: Target, Confusion, Consistency (raw listener count, NOT a proportion — e.g. 9 means 9 out of N listeners), N-Listeners (int, typically 15), Counts (string like "9 2 1 1 1 1"), Phoneme-distance (int). To get the proportion (0-1), divide Consistency by N-Listeners. Multiple rows per (Target, Confusion) pair across different conditions. Aggregate per unique (target, confusion) pair: mean of (Consistency/N-Listeners) proportions, sum of Consistency counts as total_instances, mean Phoneme-distance.
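A worked numeric example of the aggregation rule, using made-up rows (not real ECCC data): two condition rows for the same pair, with Consistency counts 9/15 and 6/15, should aggregate to a mean proportion of 0.5 and 15 total instances:

```python
# Hypothetical condition rows for one (target, confusion) pair:
# (Consistency count, N-Listeners, Phoneme-distance)
rows = [(9, 15, 2), (6, 15, 2)]

proportions = [c / n for c, n, _ in rows]               # [0.6, 0.4]
mean_consistency = sum(proportions) / len(proportions)  # mean of proportions, not of raw counts
total_instances = sum(c for c, _, _ in rows)            # sum of raw listener counts
mean_distance = sum(d for _, _, d in rows) / len(rows)

assert mean_consistency == 0.5
assert total_instances == 15
assert mean_distance == 2.0
```

Note the asymmetry: consistency is averaged as a proportion (divide each row by its own N-Listeners first), while total_instances sums the raw counts.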

  • [ ] Step 1: Write the failing test
def test_load_eccc():
    from phonolex_data.loaders import load_eccc

    result = load_eccc()
    assert isinstance(result, list)
    assert len(result) > 1000  # aggregated pairs

    target, confusion, consistency, n_instances, phoneme_distance = result[0]
    assert isinstance(target, str)
    assert isinstance(confusion, str)
    assert isinstance(consistency, float)
    assert 0.0 <= consistency <= 1.0
    assert isinstance(n_instances, int)
    assert n_instances >= 1
    assert isinstance(phoneme_distance, float)
  • [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_eccc -v
Expected: FAIL

  • [ ] Step 3: Implement load_eccc()

Add to packages/data/src/phonolex_data/loaders/associations.py:

def load_eccc(path: str | Path | None = None) -> list[tuple]:
    """Load ECCC speech-in-noise confusion corpus (Mondol & Bhatt 2023).

    Aggregates per (target, confusion) pair across conditions.

    Returns:
        [(target, confusion, mean_consistency, total_instances, mean_phoneme_distance), ...]
    """
    path = (
        Path(path) if path
        else get_data_dir() / "norms" / "eccc" / "confusionCorpus_v1.2.csv"
    )
    # Aggregate across conditions
    pair_data: dict[tuple[str, str], dict] = {}
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            target = (row.get("Target") or "").strip().lower()
            confusion = (row.get("Confusion") or "").strip().lower()
            if not target or not confusion or target == confusion:
                continue
            try:
                # Consistency is a raw listener count, NOT a proportion
                raw_consistency = int(row["Consistency"])
                n_listeners = int(row["N-Listeners"])
                if n_listeners == 0:
                    continue
                consistency = raw_consistency / n_listeners  # proportion 0-1
                phoneme_dist = float(row["Phoneme-distance"])
            except (ValueError, KeyError):
                continue

            key = (target, confusion)
            if key not in pair_data:
                pair_data[key] = {
                    "consistencies": [],
                    "raw_counts": [],
                    "distances": [],
                }
            pair_data[key]["consistencies"].append(consistency)
            pair_data[key]["raw_counts"].append(raw_consistency)
            pair_data[key]["distances"].append(phoneme_dist)

    result: list[tuple] = []
    for (target, confusion), data in pair_data.items():
        mean_consistency = sum(data["consistencies"]) / len(data["consistencies"])
        total_instances = sum(data["raw_counts"])
        mean_distance = sum(data["distances"]) / len(data["distances"])
        result.append((target, confusion, mean_consistency, total_instances, mean_distance))

    return result

Export from __init__.py.

  • [ ] Step 4: Run test to verify it passes

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_eccc -v
Expected: PASS

  • [ ] Step 5: Commit
git add packages/data/src/phonolex_data/loaders/associations.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_eccc() loader for ECCC speech-in-noise confusion corpus"

Task 9: Update load_simlex() to return POS

Files:
  • Modify: packages/data/src/phonolex_data/loaders/associations.py
  • Modify: packages/data/tests/test_datasets.py (update existing test)
  • Test: packages/data/tests/test_new_loaders.py

Context: The existing load_simlex() returns list[tuple[str, str, float]]. SimLex-999.txt has a POS column (tab-delimited, values: N, V, A). Update to return list[tuple[str, str, float, str]] — 4th element is POS.

  • [ ] Step 1: Write the test for new return type
# Add to packages/data/tests/test_new_loaders.py

def test_load_simlex_with_pos():
    from phonolex_data.loaders import load_simlex

    result = load_simlex()
    assert isinstance(result, list)
    assert len(result) == 999

    w1, w2, score, pos = result[0]
    assert isinstance(w1, str)
    assert isinstance(w2, str)
    assert isinstance(score, float)
    assert isinstance(pos, str)
    assert pos in ("N", "V", "A")
  • [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_simlex_with_pos -v
Expected: FAIL (tuple has 3 elements, not 4)

  • [ ] Step 3: Update load_simlex()

In packages/data/src/phonolex_data/loaders/associations.py, update:

def load_simlex(path: str | Path | None = None) -> list[tuple[str, str, float, str]]:
    """Load SimLex-999 word similarity dataset (Hill et al. 2015).

    Returns:
        [(word1, word2, similarity_score, pos), ...]
    """
    path = Path(path) if path else get_data_dir() / "norms" / "SimLex-999.txt"
    result: list[tuple[str, str, float, str]] = []
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            result.append((
                row["word1"].strip().lower(),
                row["word2"].strip().lower(),
                float(row["SimLex999"]),
                row["POS"].strip(),
            ))
    return result
  • [ ] Step 4: Update existing test in test_datasets.py

The existing test_load_simlex in packages/data/tests/test_datasets.py expects 3-tuples. Update the unpacking:

# In packages/data/tests/test_datasets.py, find:
w1, w2, score = sl[0]
# Replace with:
w1, w2, score, pos = sl[0]
assert isinstance(pos, str)
assert pos in ("N", "V", "A")

Also update any assert len(result[0]) == 3 to assert len(result[0]) == 4.

- [ ] Step 5: Run both tests

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_simlex_with_pos packages/data/tests/test_datasets.py::test_load_simlex -v
Expected: PASS

- [ ] Step 6: Commit
git add packages/data/src/phonolex_data/loaders/associations.py packages/data/tests/test_new_loaders.py packages/data/tests/test_datasets.py
git commit -m "feat: update load_simlex() to return POS column"

Chunk 3: Pipeline Schema + Words Module

Task 10: Pipeline schema.py — Data Contract

Files:
- Create: packages/data/src/phonolex_data/pipeline/__init__.py (empty initially)
- Create: packages/data/src/phonolex_data/pipeline/schema.py
- Test: packages/data/tests/test_pipeline.py

Context: Defines the 4 dataclasses from the spec: WordRecord, EdgeRecord, DerivedData, LexicalDatabase. These are the data contract consumed by all pipeline stages and downstream consumers.

- [ ] Step 1: Write the test
# packages/data/tests/test_pipeline.py
"""Tests for the integrated lexical database pipeline."""

from __future__ import annotations


def test_word_record_creation():
    from phonolex_data.pipeline.schema import WordRecord

    # CMU word with full phonological data
    wr = WordRecord(
        word="cat",
        has_phonology=True,
        ipa="kæt",
        phonemes=["k", "æ", "t"],
        phoneme_count=3,
        syllables=[{"onset": ["k"], "nucleus": "æ", "coda": ["t"], "stress": 1}],
        syllable_count=1,
        initial_phoneme="k",
        final_phoneme="t",
        wcm_score=3,
    )
    assert wr.word == "cat"
    assert wr.has_phonology is True
    assert wr.frequency is None  # norms default to None

    # Norm-only word
    wr2 = WordRecord(word="café", has_phonology=False)
    assert wr2.ipa is None
    assert wr2.phonemes == []
    assert wr2.phoneme_count is None


def test_edge_record_creation():
    from phonolex_data.pipeline.schema import EdgeRecord

    er = EdgeRecord(
        source="cat",
        target="dog",
        edge_sources=["SWOW", "USF"],
        swow_strength=0.15,
        usf_forward=0.08,
    )
    assert er.source == "cat"
    assert er.edge_sources == ["SWOW", "USF"]
    assert er.men_relatedness is None  # defaults to None


def test_lexical_database_creation():
    from phonolex_data.pipeline.schema import LexicalDatabase, WordRecord, DerivedData

    db = LexicalDatabase(
        words={"cat": WordRecord(word="cat", has_phonology=True)},
        edges=[],
        derived=DerivedData(),
        phoible_vectors={},
    )
    assert "cat" in db.words
    assert db.edges == []
- [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_word_record_creation -v
Expected: FAIL (module not found)

- [ ] Step 3: Implement schema.py

Create packages/data/src/phonolex_data/pipeline/__init__.py:

"""Integrated lexical database pipeline."""

Create packages/data/src/phonolex_data/pipeline/schema.py:

"""Data contract for the integrated lexical database pipeline.

Shared types consumed by all pipeline stages and downstream consumers.
"""

from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WordRecord:
    """A single word with all phonological and psycholinguistic data."""

    word: str
    has_phonology: bool = False

    # Phonological fields — populated for CMU dict words, None/empty for norm-only
    ipa: str | None = None
    phonemes: list[str] = field(default_factory=list)
    phoneme_count: int | None = None
    syllables: list[dict] = field(default_factory=list)
    syllable_count: int | None = None
    initial_phoneme: str | None = None
    final_phoneme: str | None = None
    wcm_score: int | None = None

    # Norms — all optional, None means no data
    frequency: float | None = None
    log_frequency: float | None = None
    contextual_diversity: float | None = None
    prevalence: float | None = None
    aoa: float | None = None
    aoa_kuperman: float | None = None
    imageability: float | None = None
    familiarity: float | None = None
    concreteness: float | None = None
    size: float | None = None
    valence: float | None = None
    arousal: float | None = None
    dominance: float | None = None
    iconicity: float | None = None
    boi: float | None = None
    socialness: float | None = None
    auditory: float | None = None
    visual: float | None = None
    haptic: float | None = None
    gustatory: float | None = None
    olfactory: float | None = None
    interoceptive: float | None = None
    hand_arm: float | None = None
    foot_leg: float | None = None
    head: float | None = None
    mouth: float | None = None
    torso: float | None = None
    elp_lexical_decision_rt: float | None = None
    semantic_diversity: float | None = None

    # Morphology (MorphoLex)
    morpheme_count: int | None = None
    is_monomorphemic: bool | None = None
    n_prefixes: int | None = None
    n_suffixes: int | None = None
    morphological_segmentation: str | None = None

    # Phonotactic probability (IPhOD)
    neighborhood_density: int | None = None
    phono_prob_avg: float | None = None
    positional_prob_avg: float | None = None
    str_phono_prob_avg: float | None = None
    str_positional_prob_avg: float | None = None
    str_neighborhood_density: int | None = None

    # Child frequency (CYP-LEX)
    freq_cyplex_7_9: float | None = None
    freq_cyplex_10_12: float | None = None
    freq_cyplex_13: float | None = None

    # Vocab memberships
    vocab_memberships: set[str] = field(default_factory=set)


@dataclass
class EdgeRecord:
    """A relationship between two words from one or more association datasets."""

    source: str = ""
    target: str = ""
    edge_sources: list[str] = field(default_factory=list)

    swow_strength: float | None = None
    usf_forward: float | None = None
    usf_backward: float | None = None
    men_relatedness: float | None = None
    simlex_similarity: float | None = None
    simlex_pos: str | None = None
    wordsim_relatedness: float | None = None

    # SPP
    spp_first_priming: float | None = None
    spp_other_priming: float | None = None
    spp_fas: float | None = None
    spp_lsa: float | None = None

    # ECCC
    eccc_consistency: float | None = None
    eccc_n_instances: int | None = None
    eccc_phoneme_distance: float | None = None


@dataclass
class DerivedData:
    """Computed data derived from word records and PHOIBLE vectors."""

    percentiles: dict[str, dict[str, float | None]] = field(default_factory=dict)
    minimal_pairs: list[tuple] = field(default_factory=list)
    phoneme_data: dict[str, dict] = field(default_factory=dict)
    phoneme_norms: dict[str, float] = field(default_factory=dict)
    phoneme_dots: list[tuple] = field(default_factory=list)
    components: list[dict] = field(default_factory=list)
    word_syllable_data: dict = field(default_factory=dict)
    component_key_to_id: dict = field(default_factory=dict)
    property_ranges: dict[str, tuple[float, float]] = field(default_factory=dict)


@dataclass
class LexicalDatabase:
    """The complete integrated lexical database."""

    words: dict[str, WordRecord] = field(default_factory=dict)
    edges: list[EdgeRecord] = field(default_factory=list)
    derived: DerivedData = field(default_factory=DerivedData)
    phoible_vectors: dict = field(default_factory=dict)
- [ ] Step 4: Run tests

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py -v
Expected: PASS

- [ ] Step 5: Commit
git add packages/data/src/phonolex_data/pipeline/__init__.py packages/data/src/phonolex_data/pipeline/schema.py packages/data/tests/test_pipeline.py
git commit -m "feat: add pipeline schema — WordRecord, EdgeRecord, DerivedData, LexicalDatabase dataclasses"

Task 11: Pipeline words.py — Build Word Records

Files:
- Create: packages/data/src/phonolex_data/pipeline/words.py
- Test: packages/data/tests/test_pipeline.py

Context: build_words() loads CMU dict (phonological backbone), runs syllabification + WCM + normalization, then loads and merges all 15 norm datasets. Words not in CMU dict but present in norm datasets get has_phonology=False records. Returns dict[str, WordRecord].

Important: This function loads real datasets and takes a while. Tests should verify structure, not run the full pipeline. Use a focused integration test that checks a few known words.
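Because the slow tests are selected with -m slow, the marker should be registered so pytest doesn't emit PytestUnknownMarkWarning. A minimal sketch, assuming no registration exists yet (the file location is illustrative — a markers entry in pytest.ini or pyproject.toml works equally well):

```python
# packages/data/tests/conftest.py (illustrative location) — register the
# custom "slow" marker used by the integration tests in this plan.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "slow: integration tests that load full datasets"
    )
```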

Prerequisite fix: cmudict_to_phono() currently returns only {"phonemes": [...], "ipa": "..."} — it does NOT return stress information. The syllabifier requires PhonemeWithStress objects with stress markers. Before implementing build_words(), extend cmudict_to_phono() in packages/data/src/phonolex_data/loaders/cmudict.py to also return a "stress_pattern" list parallel to "phonemes". For each ARPAbet token, if it ends in 0/1/2 (vowels have stress digits), capture that digit as an int; otherwise None:

def cmudict_to_phono(
    cmu: dict[str, list[str]] | None = None,
    arpa_map: dict[str, str] | None = None,
) -> dict[str, dict[str, Any]]:
    """Convert raw CMUdict to PhonoFeatures-compatible format.

    Returns:
        {word: {"phonemes": [ipa, ...], "ipa": "...", "stress_pattern": [int|None, ...]}}
    """
    if cmu is None:
        cmu = load_cmudict()
    if arpa_map is None:
        arpa_map = load_arpa_to_ipa()

    result: dict[str, dict[str, Any]] = {}
    for word, arpa_phones in cmu.items():
        ipa_phones = []
        stress_pattern = []
        for p in arpa_phones:
            ipa = arpa_map.get(p) or arpa_map.get(p.rstrip("012"), p)
            ipa_phones.append(ipa)
            # Extract stress digit from ARPAbet vowel tokens (e.g., AE1 → 1)
            if p[-1:] in ("0", "1", "2"):
                stress_pattern.append(int(p[-1]))
            else:
                stress_pattern.append(None)
        result[word] = {
            "phonemes": ipa_phones,
            "ipa": "".join(ipa_phones),
            "stress_pattern": stress_pattern,
        }
    return result

Also add packages/data/src/phonolex_data/loaders/cmudict.py to the commit in Step 5.

- [ ] Step 1: Write the integration test
# Add to packages/data/tests/test_pipeline.py

import pytest


@pytest.mark.slow
def test_build_words_structure():
    """Integration test — loads real data, checks structure of result."""
    from phonolex_data.pipeline.words import build_words

    words = build_words()
    assert isinstance(words, dict)
    assert len(words) > 100000  # union of all datasets

    # A word that's in CMU dict should have phonology
    assert "cat" in words
    cat = words["cat"]
    assert cat.has_phonology is True
    assert cat.ipa is not None
    assert len(cat.phonemes) > 0
    assert cat.phoneme_count is not None
    assert cat.syllable_count is not None
    assert cat.wcm_score is not None

    # Check that norms merged correctly — "cat" should have frequency
    assert cat.frequency is not None

    # Check that at least some norm-only words exist (words in SUBTLEX but not CMU)
    norm_only = [w for w, r in words.items() if not r.has_phonology]
    assert len(norm_only) > 0, "Expected some norm-only words from SUBTLEX/prevalence"

    # Verify norm-only words have null phonological fields
    if norm_only:
        sample = words[norm_only[0]]
        assert sample.ipa is None
        assert sample.phonemes == []
        assert sample.phoneme_count is None
- [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_words_structure -v -m slow
Expected: FAIL

- [ ] Step 3: Implement build_words()

Create packages/data/src/phonolex_data/pipeline/words.py:

"""Assemble word records from CMU dict + all norm datasets."""

from __future__ import annotations

from phonolex_data.loaders import (
    cmudict_to_phono,
    load_warriner,
    load_kuperman,
    load_glasgow,
    load_concreteness,
    load_sensorimotor,
    load_semantic_diversity,
    load_socialness,
    load_boi,
    load_subtlex,
    load_elp,
    load_iconicity,
    load_prevalence,
    load_iphod,
    load_morpholex,
    load_cyplex,
    load_all_vocab,
)
from phonolex_data.phonology.syllabification import syllabify, PhonemeWithStress
from phonolex_data.phonology.wcm import compute_wcm
from phonolex_data.phonology.normalize import normalize_phoneme
from phonolex_data.pipeline.schema import WordRecord


# Map from norm loader key names → WordRecord field names
# Most are 1:1, listed here for clarity and to handle exceptions
_NORM_FIELD_MAP: dict[str, str] = {
    # Glasgow
    "aoa": "aoa",
    "imageability": "imageability",
    "familiarity": "familiarity",
    "size": "size",
    # Warriner
    "valence": "valence",
    "arousal": "arousal",
    "dominance": "dominance",
    # Kuperman
    "aoa_kuperman": "aoa_kuperman",
    # Concreteness
    "concreteness": "concreteness",
    # SUBTLEX
    "frequency": "frequency",
    "log_frequency": "log_frequency",
    "contextual_diversity": "contextual_diversity",
    # Sensorimotor
    "auditory": "auditory",
    "visual": "visual",
    "haptic": "haptic",
    "gustatory": "gustatory",
    "olfactory": "olfactory",
    "interoceptive": "interoceptive",
    "hand_arm": "hand_arm",
    "foot_leg": "foot_leg",
    "head": "head",
    "mouth": "mouth",
    "torso": "torso",
    # Others
    "semantic_diversity": "semantic_diversity",
    "socialness": "socialness",
    "boi": "boi",
    "lexical_decision_rt": "elp_lexical_decision_rt",  # load_elp() returns "lexical_decision_rt"
    "iconicity": "iconicity",
    "prevalence": "prevalence",
    # IPhOD
    "neighborhood_density": "neighborhood_density",
    "phono_prob_avg": "phono_prob_avg",
    "positional_prob_avg": "positional_prob_avg",
    "str_neighborhood_density": "str_neighborhood_density",
    "str_phono_prob_avg": "str_phono_prob_avg",
    "str_positional_prob_avg": "str_positional_prob_avg",
    # MorphoLex
    "morpheme_count": "morpheme_count",
    "n_prefixes": "n_prefixes",
    "n_suffixes": "n_suffixes",
    "is_monomorphemic": "is_monomorphemic",
    "morphological_segmentation": "morphological_segmentation",
    # CYP-LEX
    "freq_cyplex_7_9": "freq_cyplex_7_9",
    "freq_cyplex_10_12": "freq_cyplex_10_12",
    "freq_cyplex_13": "freq_cyplex_13",
}


def _build_phonological_record(phono_data: dict) -> WordRecord:
    """Create a WordRecord from cmudict_to_phono() output with syllabification + WCM."""
    ipa = phono_data.get("ipa", "")
    phonemes_raw = phono_data.get("phonemes", [])
    stress_pattern = phono_data.get("stress_pattern", [])

    # Build PhonemeWithStress list for syllabifier
    phonemes_with_stress = []
    for i, p in enumerate(phonemes_raw):
        stress = stress_pattern[i] if i < len(stress_pattern) else None
        phonemes_with_stress.append(PhonemeWithStress(phoneme=p, stress=stress))

    syllables_obj = syllabify(phonemes_with_stress)
    syllables = [
        {
            "onset": [str(p) for p in s.onset],
            "nucleus": str(s.nucleus),
            "coda": [str(p) for p in s.coda],
            "stress": s.stress,
        }
        for s in syllables_obj
    ]

    phonemes = [normalize_phoneme(p) for p in phonemes_raw]

    wcm = compute_wcm(phonemes, syllables)

    return WordRecord(
        word=phono_data.get("word", ""),
        has_phonology=True,
        ipa=ipa,
        phonemes=phonemes,
        phoneme_count=len(phonemes),
        syllables=syllables,
        syllable_count=len(syllables),
        initial_phoneme=phonemes[0] if phonemes else None,
        final_phoneme=phonemes[-1] if phonemes else None,
        wcm_score=wcm,
    )


def _merge_norms(words: dict[str, WordRecord], norm_data: dict[str, dict]) -> None:
    """Merge a norm dataset into word records. Creates norm-only records for new words."""
    for word, props in norm_data.items():
        if word not in words:
            words[word] = WordRecord(word=word, has_phonology=False)
        record = words[word]
        for src_key, value in props.items():
            dest_field = _NORM_FIELD_MAP.get(src_key)
            if dest_field and hasattr(record, dest_field):
                setattr(record, dest_field, value)


def build_words() -> dict[str, WordRecord]:
    """Build all word records from CMU dict + norm datasets.

    Returns dict[str, WordRecord] — union of all source datasets.
    """
    print("Loading CMU dict ...")
    cmu_phono = cmudict_to_phono()

    # Build phonological records from CMU dict
    print(f"  Syllabifying {len(cmu_phono):,} CMU entries ...")
    words: dict[str, WordRecord] = {}
    for word, phono_data in cmu_phono.items():
        phono_data["word"] = word
        try:
            words[word] = _build_phonological_record(phono_data)
        except Exception:
            # Skip words that fail syllabification
            continue

    print(f"  {len(words):,} words with phonological data")

    # Load and merge all norm datasets
    print("Loading norm datasets ...")
    norm_loaders = [
        ("Warriner", load_warriner),
        ("Kuperman", load_kuperman),
        ("Glasgow", load_glasgow),
        ("Concreteness", load_concreteness),
        ("Sensorimotor", load_sensorimotor),
        ("Semantic Diversity", load_semantic_diversity),
        ("Socialness", load_socialness),
        ("BOI", load_boi),
        ("SUBTLEX", load_subtlex),
        ("ELP", load_elp),
        ("Iconicity", load_iconicity),
        ("Prevalence", load_prevalence),
        ("IPhOD", load_iphod),
        ("MorphoLex", load_morpholex),
        ("CYP-LEX", load_cyplex),
    ]

    for name, loader in norm_loaders:
        try:
            data = loader()
            _merge_norms(words, data)
            print(f"  {name}: {len(data):,} entries")
        except Exception as e:
            print(f"  WARNING: {name} failed: {e}")

    # Load vocab list memberships
    print("Loading vocab lists ...")
    try:
        vocab_data = load_all_vocab()
        for word, memberships in vocab_data.items():
            if word in words:
                words[word].vocab_memberships = memberships
    except Exception as e:
        print(f"  WARNING: vocab lists failed: {e}")

    norm_only = sum(1 for r in words.values() if not r.has_phonology)
    print(f"Total: {len(words):,} words ({len(words) - norm_only:,} with phonology, {norm_only:,} norm-only)")

    return words
- [ ] Step 4: Run test

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_words_structure -v -m slow
Expected: PASS (may take 30-60 seconds to load all datasets)

- [ ] Step 5: Commit
git add packages/data/src/phonolex_data/pipeline/words.py packages/data/tests/test_pipeline.py
git commit -m "feat: add pipeline words module — build_words() assembles all word records"

Task 12: Pipeline edges.py — Build Edge Records

Files:
- Create: packages/data/src/phonolex_data/pipeline/edges.py
- Test: packages/data/tests/test_pipeline.py

Context: build_edges(words) loads 7 association datasets, builds an edge index keyed by sorted word pairs, and merges multiple sources per pair. Only includes edges where both words exist in the words dict.
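The sorted-pair canonicalization described above can be sketched in isolation. This is a plain-dict illustration of the intended merge behavior, not the module's actual code: ("dog", "cat") and ("cat", "dog") land on one record, and each contributing dataset is appended to edge_sources exactly once.

```python
# Canonicalize undirected pairs so both orderings merge into one record.
index = {}

def sorted_pair(w1, w2):
    return (w1, w2) if w1 <= w2 else (w2, w1)

for w1, w2, src in [("dog", "cat", "SWOW"), ("cat", "dog", "USF")]:
    key = sorted_pair(w1, w2)
    rec = index.setdefault(
        key, {"source": key[0], "target": key[1], "edge_sources": []}
    )
    if src not in rec["edge_sources"]:
        rec["edge_sources"].append(src)

assert list(index) == [("cat", "dog")]
assert index[("cat", "dog")]["edge_sources"] == ["SWOW", "USF"]
```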

- [ ] Step 1: Write the test
# Add to packages/data/tests/test_pipeline.py

@pytest.mark.slow
def test_build_edges_structure():
    """Integration test — builds edges from real association data."""
    from phonolex_data.pipeline.schema import WordRecord
    from phonolex_data.pipeline.edges import build_edges

    # Create a minimal word dict with known words
    words = {
        "cat": WordRecord(word="cat", has_phonology=True),
        "dog": WordRecord(word="dog", has_phonology=True),
        "happy": WordRecord(word="happy", has_phonology=True),
        "sad": WordRecord(word="sad", has_phonology=True),
        "old": WordRecord(word="old", has_phonology=True),
        "new": WordRecord(word="new", has_phonology=True),
    }

    edges = build_edges(words)
    assert isinstance(edges, list)
    # With only 6 words the edge count depends on the association data:
    # cat-dog is a very common pair, so a few SWOW/USF edges are likely,
    # but none are guaranteed. Rather than asserting a minimum count,
    # the structural checks below are gated on edges being non-empty.

    # Check EdgeRecord structure
    if edges:
        edge = edges[0]
        assert hasattr(edge, "source")
        assert hasattr(edge, "target")
        assert hasattr(edge, "edge_sources")
        assert isinstance(edge.edge_sources, list)
        assert len(edge.edge_sources) > 0
- [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_edges_structure -v -m slow
Expected: FAIL

- [ ] Step 3: Implement build_edges()

Create packages/data/src/phonolex_data/pipeline/edges.py:

"""Assemble edge records from association datasets."""

from __future__ import annotations

from phonolex_data.loaders import (
    load_swow,
    load_free_association,
    load_simlex,
    load_men,
    load_wordsim,
    load_spp,
    load_eccc,
)
from phonolex_data.pipeline.schema import EdgeRecord, WordRecord


def _sorted_pair(w1: str, w2: str) -> tuple[str, str]:
    return (w1, w2) if w1 <= w2 else (w2, w1)


def _get_or_create(
    index: dict[tuple[str, str], EdgeRecord],
    w1: str, w2: str,
) -> EdgeRecord:
    key = _sorted_pair(w1, w2)
    if key not in index:
        index[key] = EdgeRecord(source=key[0], target=key[1])
    return index[key]


def build_edges(words: dict[str, WordRecord]) -> list[EdgeRecord]:
    """Build all edge records from 7 association datasets.

    Only includes edges where both words exist in the words dict.
    """
    index: dict[tuple[str, str], EdgeRecord] = {}

    def _in_vocab(w: str) -> bool:
        return w in words

    # 1. SWOW
    print("Loading SWOW ...")
    try:
        swow = load_swow()
        for cue, responses in swow.items():
            if not _in_vocab(cue):
                continue
            for response, strength in responses.items():
                if _in_vocab(response) and cue != response:
                    edge = _get_or_create(index, cue, response)
                    if "SWOW" not in edge.edge_sources:
                        edge.edge_sources.append("SWOW")
                    edge.swow_strength = strength
        print(f"  SWOW: {sum(1 for e in index.values() if 'SWOW' in e.edge_sources):,} edges")
    except Exception as e:
        print(f"  WARNING: SWOW failed: {e}")

    # 2. USF (Free Association)
    print("Loading USF ...")
    try:
        usf = load_free_association()
        for cue, targets in usf.items():
            if not _in_vocab(cue):
                continue
            for target, strength in targets.items():
                if _in_vocab(target) and cue != target:
                    edge = _get_or_create(index, cue, target)
                    if "USF" not in edge.edge_sources:
                        edge.edge_sources.append("USF")
                    # Store directional: if cue < target, it's forward
                    key = _sorted_pair(cue, target)
                    if cue == key[0]:
                        edge.usf_forward = strength
                    else:
                        edge.usf_backward = strength
        print(f"  USF: {sum(1 for e in index.values() if 'USF' in e.edge_sources):,} edges")
    except Exception as e:
        print(f"  WARNING: USF failed: {e}")

    # 3. SimLex
    print("Loading SimLex ...")
    try:
        simlex = load_simlex()
        for w1, w2, score, pos in simlex:
            if _in_vocab(w1) and _in_vocab(w2):
                edge = _get_or_create(index, w1, w2)
                if "SimLex" not in edge.edge_sources:
                    edge.edge_sources.append("SimLex")
                edge.simlex_similarity = score
                edge.simlex_pos = pos
        print(f"  SimLex: {sum(1 for e in index.values() if 'SimLex' in e.edge_sources):,} edges")
    except Exception as e:
        print(f"  WARNING: SimLex failed: {e}")

    # 4. MEN
    print("Loading MEN ...")
    try:
        men = load_men()
        for w1, w2, score in men:
            if _in_vocab(w1) and _in_vocab(w2):
                edge = _get_or_create(index, w1, w2)
                if "MEN" not in edge.edge_sources:
                    edge.edge_sources.append("MEN")
                edge.men_relatedness = score
        print(f"  MEN: {sum(1 for e in index.values() if 'MEN' in e.edge_sources):,} edges")
    except Exception as e:
        print(f"  WARNING: MEN failed: {e}")

    # 5. WordSim-353
    print("Loading WordSim ...")
    try:
        wordsim = load_wordsim()
        for w1, w2, score in wordsim:
            if _in_vocab(w1) and _in_vocab(w2):
                edge = _get_or_create(index, w1, w2)
                if "WordSim" not in edge.edge_sources:
                    edge.edge_sources.append("WordSim")
                edge.wordsim_relatedness = score
        print(f"  WordSim: {sum(1 for e in index.values() if 'WordSim' in e.edge_sources):,} edges")
    except Exception as e:
        print(f"  WARNING: WordSim failed: {e}")

    # 6. SPP
    print("Loading SPP ...")
    try:
        spp = load_spp()
        for target, prime, first_priming, other_priming, fas, lsa in spp:
            if _in_vocab(target) and _in_vocab(prime):
                edge = _get_or_create(index, target, prime)
                if "SPP" not in edge.edge_sources:
                    edge.edge_sources.append("SPP")
                edge.spp_first_priming = first_priming
                edge.spp_other_priming = other_priming
                edge.spp_fas = fas
                edge.spp_lsa = lsa
        print(f"  SPP: {sum(1 for e in index.values() if 'SPP' in e.edge_sources):,} edges")
    except Exception as e:
        print(f"  WARNING: SPP failed: {e}")

    # 7. ECCC
    print("Loading ECCC ...")
    try:
        eccc = load_eccc()
        for target, confusion, consistency, n_instances, phon_dist in eccc:
            if _in_vocab(target) and _in_vocab(confusion):
                edge = _get_or_create(index, target, confusion)
                if "ECCC" not in edge.edge_sources:
                    edge.edge_sources.append("ECCC")
                edge.eccc_consistency = consistency
                edge.eccc_n_instances = n_instances
                edge.eccc_phoneme_distance = phon_dist
        print(f"  ECCC: {sum(1 for e in index.values() if 'ECCC' in e.edge_sources):,} edges")
    except Exception as e:
        print(f"  WARNING: ECCC failed: {e}")

    result = list(index.values())
    print(f"Total: {len(result):,} unique edges")
    return result
- [ ] Step 4: Run test

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_edges_structure -v -m slow
Expected: PASS

- [ ] Step 5: Commit
git add packages/data/src/phonolex_data/pipeline/edges.py packages/data/tests/test_pipeline.py
git commit -m "feat: add pipeline edges module — build_edges() merges 7 association datasets"

Chunk 4: Pipeline Derived + Orchestrator + Consumer Rewrite

Task 13: Pipeline derived.py — Compute Derived Data

Files:
- Create: packages/data/src/phonolex_data/pipeline/derived.py
- Test: packages/data/tests/test_pipeline.py

Context: Replicates sections 3-8 of export-to-d1.py: percentiles, phoneme index, syllable components, phoneme dot products, minimal pairs, property ranges. Operates only on has_phonology=True words for phonological computations. Percentiles computed across all words that have each property.

Reference: Read packages/web/workers/scripts/export-to-d1.py lines 176-376 for the exact logic to replicate.
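The percentile logic being replicated reduces to: sort all observed values for a property, then for each word's value take the count of values less than or equal to it, divided by N, times 100. A tiny worked instance of that formula:

```python
import bisect

# Percentile = (count of values <= val) / N * 100, computed against the
# sorted array of all observed values for one property (e.g. frequency).
values = sorted([10.0, 50.0, 100.0])
upper = bisect.bisect_right(values, 50.0)  # 2 of the 3 values are <= 50.0
pct = round(upper / len(values) * 100, 1)  # 66.7
```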

- [ ] Step 1: Write the test
# Add to packages/data/tests/test_pipeline.py

def test_build_derived_with_minimal_data():
    """Unit test with synthetic data — no file I/O."""
    from phonolex_data.pipeline.schema import WordRecord, DerivedData
    from phonolex_data.pipeline.derived import build_derived

    words = {
        "cat": WordRecord(
            word="cat", has_phonology=True, ipa="kæt",
            phonemes=["k", "æ", "t"], phoneme_count=3,
            syllables=[{"onset": ["k"], "nucleus": "æ", "coda": ["t"], "stress": 1}],
            syllable_count=1, frequency=100.0,
        ),
        "bat": WordRecord(
            word="bat", has_phonology=True, ipa="bæt",
            phonemes=["b", "æ", "t"], phoneme_count=3,
            syllables=[{"onset": ["b"], "nucleus": "æ", "coda": ["t"], "stress": 1}],
            syllable_count=1, frequency=50.0,
        ),
        "rare": WordRecord(word="rare", has_phonology=False, frequency=10.0),
    }

    # Minimal PHOIBLE vectors for testing
    phoible = {
        "76d": {
            "k": [1.0] * 76,
            "b": [0.5] * 76,
            "æ": [0.0] * 76,
            "t": [-0.5] * 76,
        },
        "feature_names": [f"f{i}" for i in range(38)],
    }

    derived = build_derived(words, phoible)
    assert isinstance(derived, DerivedData)

    # Percentiles should be computed for all words with frequency
    assert "cat" in derived.percentiles
    assert "rare" in derived.percentiles  # norm-only words get percentiles too
    assert "frequency_percentile" in derived.percentiles["cat"]

    # Minimal pairs: cat-bat differ in position 0 (k vs b)
    assert len(derived.minimal_pairs) >= 1
    mp = derived.minimal_pairs[0]
    assert mp[0] in ("bat", "cat")  # sorted
    assert mp[1] in ("bat", "cat")

    # Components extracted from phonology words only
    assert len(derived.components) > 0
    assert "cat" in derived.word_syllable_data
    assert "rare" not in derived.word_syllable_data  # norm-only excluded

    # Property ranges
    assert "frequency" in derived.property_ranges
- [ ] Step 2: Run test to verify it fails

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_derived_with_minimal_data -v
Expected: FAIL

- [ ] Step 3: Implement build_derived()

Create packages/data/src/phonolex_data/pipeline/derived.py:

"""Compute derived data: percentiles, minimal pairs, phoneme dots, syllable components."""

from __future__ import annotations

import bisect

import numpy as np

from phonolex_data.pipeline.schema import DerivedData, WordRecord


# Properties eligible for percentile computation.
# IMPORTANT: This list MUST match FILTERABLE_PROPERTIES in
# packages/web/workers/scripts/config.py — keep them in sync.
# These are defined here because config.py lives in a different package
# (workers/scripts) and cannot be cleanly imported from packages/data.
PERCENTILE_PROPERTIES = (
    "syllable_count", "phoneme_count", "wcm_score",
    "phono_prob_avg", "positional_prob_avg",
    "neighborhood_density", "str_phono_prob_avg", "str_positional_prob_avg",
    "str_neighborhood_density",
    "frequency", "log_frequency", "contextual_diversity", "prevalence",
    "aoa", "aoa_kuperman",
    "elp_lexical_decision_rt",
    "imageability", "familiarity", "concreteness", "size",
    "valence", "arousal", "dominance",
    "iconicity", "boi", "socialness", "semantic_diversity",
    "auditory", "visual", "haptic", "gustatory", "olfactory", "interoceptive",
    "hand_arm", "foot_leg", "head", "mouth", "torso",
    "morpheme_count", "n_prefixes", "n_suffixes",
    "freq_cyplex_7_9", "freq_cyplex_10_12", "freq_cyplex_13",
)

VOWEL_IPA = {
    "i", "ɪ", "e", "ɛ", "æ", "ɑ", "ɔ", "o", "ʊ", "u",
    "ʌ", "ə", "ɚ", "ɝ", "eɪ", "aɪ", "ɔɪ", "aʊ", "oʊ",
}


def _compute_percentiles(
    words: dict[str, WordRecord],
) -> dict[str, dict[str, float | None]]:
    """Compute percentiles for all percentile-eligible properties."""
    # Build sorted arrays
    sorted_arrays: dict[str, list[float]] = {}
    for prop_id in PERCENTILE_PROPERTIES:
        values = []
        for record in words.values():
            val = getattr(record, prop_id, None)
            if val is not None:
                values.append(float(val))
        values.sort()
        sorted_arrays[prop_id] = values

    # Compute per-word percentiles
    result: dict[str, dict[str, float | None]] = {}
    for word, record in words.items():
        pcts: dict[str, float | None] = {}
        for prop_id in PERCENTILE_PROPERTIES:
            val = getattr(record, prop_id, None)
            sorted_vals = sorted_arrays.get(prop_id, [])
            if val is not None and sorted_vals:
                upper = bisect.bisect_right(sorted_vals, float(val))
                pcts[f"{prop_id}_percentile"] = round((upper / len(sorted_vals)) * 100, 1)
            else:
                pcts[f"{prop_id}_percentile"] = None
        result[word] = pcts
    return result


def _compute_phoneme_data(
    phoible_vectors: dict,
) -> tuple[dict[str, dict], dict[str, float], list[tuple[str, str, float]]]:
    """Extract phoneme data, norms, and pairwise dot products from PHOIBLE vectors."""
    phoible_76d = phoible_vectors.get("76d", {})
    feature_names = phoible_vectors.get("feature_names", [])

    # Phoneme index
    phonemes_data: dict[str, dict] = {}
    for ipa, vec76 in phoible_76d.items():
        ptype = "vowel" if ipa in VOWEL_IPA else "consonant"
        features = {}
        for i, fname in enumerate(feature_names):
            idx = i * 2
            if idx + 1 < len(vec76):
                val = vec76[idx]
                if val > 0.5:
                    features[fname] = "+"
                elif val < -0.5:
                    features[fname] = "-"
                else:
                    features[fname] = "0"
        phonemes_data[ipa] = {"type": ptype, "features": features}

    # Phoneme norms (norm_sq)
    phoneme_norms: dict[str, float] = {}
    for ipa, vec76 in phoible_76d.items():
        v = np.array(vec76[:76], dtype=np.float32)
        phoneme_norms[ipa] = float(np.dot(v, v))

    # Pairwise dot products
    phoneme_ipa_list = sorted(phoible_76d.keys())
    phoneme_dots: list[tuple[str, str, float]] = []
    for i, ipa1 in enumerate(phoneme_ipa_list):
        v1 = np.array(phoible_76d[ipa1][:76], dtype=np.float32)
        for j in range(i + 1, len(phoneme_ipa_list)):
            ipa2 = phoneme_ipa_list[j]
            v2 = np.array(phoible_76d[ipa2][:76], dtype=np.float32)
            dot = float(np.dot(v1, v2))
            if dot != 0.0:
                phoneme_dots.append((ipa1, ipa2, dot))

    return phonemes_data, phoneme_norms, phoneme_dots


def _extract_syllable_components(
    words: dict[str, WordRecord],
) -> tuple[list[dict], dict, dict]:
    """Extract unique syllable components and word-syllable mappings.

    Only operates on words with has_phonology=True.
    """
    component_keys: set[tuple[str, tuple[str, ...]]] = set()
    word_syllable_data: dict[str, list[dict]] = {}

    for word, record in words.items():
        if not record.has_phonology or not record.syllables:
            continue

        word_syls = []
        for syl in record.syllables:
            onset_key = ("onset", tuple(syl.get("onset", [])))
            component_keys.add(onset_key)

            nuc_ipa = syl.get("nucleus", "")
            nuc_key = ("nucleus", (nuc_ipa,) if nuc_ipa else ())
            component_keys.add(nuc_key)

            coda_key = ("coda", tuple(syl.get("coda", [])))
            component_keys.add(coda_key)

            word_syls.append({
                "onset_key": onset_key,
                "nucleus_key": nuc_key,
                "coda_key": coda_key,
            })

        word_syllable_data[word] = word_syls

    # Assign IDs
    component_key_to_id: dict[tuple, int] = {}
    component_list: list[dict] = []
    for i, key in enumerate(sorted(component_keys)):
        ctype, phons = key
        component_key_to_id[key] = i
        component_list.append({"id": i, "type": ctype, "phonemes": list(phons)})

    return component_list, word_syllable_data, component_key_to_id


def _compute_minimal_pairs(
    words: dict[str, WordRecord],
) -> list[tuple[str, str, str, str, int, str]]:
    """Precompute minimal pairs — only for words with has_phonology=True."""
    by_length: dict[int, list[str]] = {}
    for word, record in words.items():
        if not record.has_phonology:
            continue
        length = record.phoneme_count or 0
        if length >= 2:
            by_length.setdefault(length, []).append(word)

    minimal_pairs: list[tuple[str, str, str, str, int, str]] = []
    for length, word_list in by_length.items():
        word_list.sort()
        for i in range(len(word_list)):
            w1 = word_list[i]
            p1 = words[w1].phonemes
            for j in range(i + 1, len(word_list)):
                w2 = word_list[j]
                p2 = words[w2].phonemes
                diff_count = 0
                diff_pos = -1
                diff_p1 = ""
                diff_p2 = ""
                for k in range(length):
                    if p1[k] != p2[k]:
                        diff_count += 1
                        diff_pos = k
                        diff_p1 = p1[k]
                        diff_p2 = p2[k]
                        if diff_count > 1:
                            break
                if diff_count != 1:
                    continue
                if diff_pos == 0:
                    pos_type = "initial"
                elif diff_pos == length - 1:
                    pos_type = "final"
                else:
                    pos_type = "medial"
                minimal_pairs.append((w1, w2, diff_p1, diff_p2, diff_pos, pos_type))

    return minimal_pairs


def _compute_property_ranges(
    words: dict[str, WordRecord],
) -> dict[str, tuple[float, float]]:
    """Compute min/max ranges for all filterable properties."""
    ranges: dict[str, tuple[float, float]] = {}
    for prop_id in PERCENTILE_PROPERTIES:
        values = [
            getattr(r, prop_id) for r in words.values()
            if getattr(r, prop_id, None) is not None
        ]
        if values:
            ranges[prop_id] = (float(min(values)), float(max(values)))
        else:
            ranges[prop_id] = (0.0, 0.0)
    return ranges


def build_derived(
    words: dict[str, WordRecord],
    phoible_vectors: dict,
) -> DerivedData:
    """Compute all derived data from word records and PHOIBLE vectors."""
    print("Computing percentiles ...")
    percentiles = _compute_percentiles(words)

    print("Building phoneme index + dot products ...")
    phoneme_data, phoneme_norms, phoneme_dots = _compute_phoneme_data(phoible_vectors)

    print("Extracting syllable components ...")
    components, word_syllable_data, component_key_to_id = _extract_syllable_components(words)

    print("Precomputing minimal pairs ...")
    minimal_pairs = _compute_minimal_pairs(words)

    print("Computing property ranges ...")
    property_ranges = _compute_property_ranges(words)

    print(f"  {len(phoneme_data)} phonemes, {len(phoneme_dots):,} dot products")
    print(f"  {len(components)} components, {len(minimal_pairs):,} minimal pairs")

    return DerivedData(
        percentiles=percentiles,
        minimal_pairs=minimal_pairs,
        phoneme_data=phoneme_data,
        phoneme_norms=phoneme_norms,
        phoneme_dots=phoneme_dots,
        components=components,
        word_syllable_data=word_syllable_data,
        component_key_to_id=component_key_to_id,
        property_ranges=property_ranges,
    )
- [ ] Step 4: Run test

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_derived_with_minimal_data -v
Expected: PASS

- [ ] Step 5: Commit
git add packages/data/src/phonolex_data/pipeline/derived.py packages/data/tests/test_pipeline.py
git commit -m "feat: add pipeline derived module — percentiles, minimal pairs, phoneme dots, syllable components"

Task 14: Pipeline Orchestrator + Integration Test

Files:
- Modify: packages/data/src/phonolex_data/pipeline/__init__.py
- Test: packages/data/tests/test_pipeline.py

- [ ] Step 1: Write the integration test
# Add to packages/data/tests/test_pipeline.py

@pytest.mark.slow
def test_build_lexical_database_integration():
    """Full integration test — builds entire database from raw datasets."""
    from phonolex_data.pipeline import build_lexical_database

    db = build_lexical_database()

    # Word count should be >100K (union of all datasets)
    assert len(db.words) > 100000

    # Should have both phonology and norm-only words
    phono_count = sum(1 for r in db.words.values() if r.has_phonology)
    norm_only_count = sum(1 for r in db.words.values() if not r.has_phonology)
    assert phono_count > 100000  # CMU dict has ~134K
    assert norm_only_count > 0

    # Edges
    assert len(db.edges) > 0

    # Derived data
    assert len(db.derived.percentiles) == len(db.words)
    assert len(db.derived.minimal_pairs) > 0
    assert len(db.derived.phoneme_data) > 0
    assert len(db.derived.phoneme_dots) > 0
    assert len(db.derived.components) > 0
    assert len(db.derived.word_syllable_data) > 0

    # PHOIBLE vectors
    assert "76d" in db.phoible_vectors
- [ ] Step 2: Implement orchestrator

Update packages/data/src/phonolex_data/pipeline/__init__.py:

"""Integrated lexical database pipeline.

Usage:
    from phonolex_data.pipeline import build_lexical_database
    db = build_lexical_database()
"""

from __future__ import annotations

from phonolex_data.loaders import load_phoible
from phonolex_data.pipeline.edges import build_edges
from phonolex_data.pipeline.derived import build_derived
from phonolex_data.pipeline.schema import LexicalDatabase
from phonolex_data.pipeline.words import build_words


def build_lexical_database() -> LexicalDatabase:
    """Build the complete integrated lexical database from raw datasets."""
    phoible = load_phoible()
    words = build_words()
    edges = build_edges(words)
    derived = build_derived(words, phoible)
    return LexicalDatabase(
        words=words,
        edges=edges,
        derived=derived,
        phoible_vectors=phoible,
    )
- [ ] Step 3: Run integration test

Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_lexical_database_integration -v -m slow
Expected: PASS (may take several minutes)

- [ ] Step 4: Commit
git add packages/data/src/phonolex_data/pipeline/__init__.py packages/data/tests/test_pipeline.py
git commit -m "feat: add pipeline orchestrator — build_lexical_database() assembles everything"

Task 15: Update config.py — New Properties + Edge Types

Files:
- Modify: packages/web/workers/scripts/config.py

Context: Add PropertyDefs for new properties (neighborhood_density, stressed IPhOD variants, CYP-LEX frequencies, morphological_segmentation, semantic_diversity, log_frequency). Update ECCC and SPP edge type strength_keys to match new EdgeRecord field names. Remove passes_vocabulary_filter() — no longer needed.

- [ ] Step 1: Add new property definitions

Add to config.py in the appropriate categories:

In PHONOTACTIC_PROBABILITY, add after positional_prob_avg:

        PropertyDef(
            id="neighborhood_density",
            label="Neighborhood Density",
            short_label="ND",
            source="IPhOD2 (Vaden et al., 2009)",
            description="Number of phonological neighbors",
            scale="0-50+",
            interpretation="Higher = more similar-sounding words",
            display_format=".0f",
            slider_step=1,
            is_integer=True,
        ),
        PropertyDef(
            id="str_phono_prob_avg",
            label="Biphone Probability — Stressed (Avg)",
            short_label="sBPP",
            source="IPhOD2 (Vaden et al., 2009)",
            description="Mean biphone probability with stress marking",
            scale="0-1",
            interpretation="Higher = more typical stressed sound sequences",
            display_format=".4f",
            slider_step=0.001,
        ),
        PropertyDef(
            id="str_positional_prob_avg",
            label="Positional Segment Probability — Stressed (Avg)",
            short_label="sPSP",
            source="IPhOD2 (Vaden et al., 2009)",
            description="Mean positional segment probability with stress marking",
            scale="0-1",
            interpretation="Higher = more common stressed phonemes in those positions",
            display_format=".4f",
            slider_step=0.001,
        ),
        PropertyDef(
            id="str_neighborhood_density",
            label="Neighborhood Density — Stressed",
            short_label="sND",
            source="IPhOD2 (Vaden et al., 2009)",
            description="Phonological neighbors accounting for stress",
            scale="0-50+",
            interpretation="Higher = more stress-matched neighbors",
            display_format=".0f",
            slider_step=1,
            is_integer=True,
        ),

Update source citations for phono_prob_avg and positional_prob_avg from "Vitevitch & Luce (2004)" to "IPhOD2 (Vaden et al., 2009)".

Add a new CHILD_FREQUENCY category after MORPHOLOGICAL_PROPERTIES:

CHILD_FREQUENCY = PropertyCategory(
    id="child_frequency",
    label="Child Frequency (CYP-LEX)",
    properties=(
        PropertyDef(
            id="freq_cyplex_7_9",
            label="Child Frequency (Age 7-9)",
            short_label="CF7",
            source="CYP-LEX (Sheridan & Jakobson, 2019)",
            description="Zipf-scale word frequency in children's media, ages 7-9",
            scale="0-7",
            interpretation="Higher = more frequent in age-band media",
            display_format=".2f",
            slider_step=0.1,
        ),
        PropertyDef(
            id="freq_cyplex_10_12",
            label="Child Frequency (Age 10-12)",
            short_label="CF10",
            source="CYP-LEX (Sheridan & Jakobson, 2019)",
            description="Zipf-scale word frequency in children's media, ages 10-12",
            scale="0-7",
            interpretation="Higher = more frequent in age-band media",
            display_format=".2f",
            slider_step=0.1,
        ),
        PropertyDef(
            id="freq_cyplex_13",
            label="Child Frequency (Age 13+)",
            short_label="CF13",
            source="CYP-LEX (Sheridan & Jakobson, 2019)",
            description="Zipf-scale word frequency in children's media, ages 13+",
            scale="0-7",
            interpretation="Higher = more frequent in age-band media",
            display_format=".2f",
            slider_step=0.1,
        ),
    ),
)

Add CHILD_FREQUENCY to PROPERTY_CATEGORIES tuple.

Also add missing PropertyDefs for existing-but-unfilterable properties that are now in PROPERTY_COLUMNS. If semantic_diversity and log_frequency are already defined in config.py, verify they're correct. If not, add them to the appropriate categories:
- log_frequency → FREQUENCY category (alongside frequency and contextual_diversity)
- semantic_diversity → PSYCHOLINGUISTIC or a new category

Note: morphological_segmentation is a TEXT column, not a numeric filter — it does NOT need a PropertyDef or slider. It's just a display column.

- [ ] Step 2: Update edge type definitions

Update SPP and ECCC in EDGE_TYPES:

    "ECCC": {
        "label": "Perceptual Confusability (ECCC)",
        "description": "Words confused by listeners in noise",
        "strength_key": "eccc_consistency",
    },
    "SPP": {
        "label": "Semantic Priming (SPP)",
        "description": "Priming effects in lexical decision and naming",
        "strength_key": "spp_first_priming",
    },

- [ ] Step 3: Remove passes_vocabulary_filter()

Delete the entire passes_vocabulary_filter() function — no longer used.

- [ ] Step 4: Commit
git add packages/web/workers/scripts/config.py
git commit -m "feat: add new property definitions for IPhOD, CYP-LEX; update edge type keys"

Task 16: Rewrite export-to-d1.py

Files:
- Modify: packages/web/workers/scripts/export-to-d1.py

Context: Replace the entire pickle-based pipeline with a thin SQL writer that calls build_lexical_database(). Keep the SQL generation format identical (same table structure, batch INSERT). Key changes: nullable phonological columns, has_phonology flag, new property columns, updated edge columns (SPP/ECCC renamed), SQL NULLs for norm-only words.

- [ ] Step 1: Rewrite export-to-d1.py

Replace the entire file. The new version:
1. Imports build_lexical_database from the pipeline
2. Calls it to get the LexicalDatabase
3. Writes SQL using the same batch INSERT pattern
4. Does no pickle loading, no WCM computation, no norm loading — all of that lives in the pipeline

Key structural changes from the old version:
- GRAPH_PATH → deleted
- compute_wcm() → deleted (now in the pipeline)
- passes_vocabulary_filter() → deleted (no filtering)
- PROPERTY_COLUMNS → updated to include new columns. Note: has_phonology goes in the base columns list (alongside word, ipa, etc.), NOT in PROPERTY_COLUMNS
- words table schema: phonological columns become nullable; add has_phonology INTEGER NOT NULL DEFAULT 1
- edges table schema: spp_priming_z/short/long → spp_first_priming/other_priming, eccc_confusability → eccc_consistency
- INSERT logic: use sql_val(None) (→ NULL) for norm-only word phonological fields
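A hedged sketch of the resulting words table shape — column names come from this plan, but the real schema in export-to-d1.py has many more property columns and is authoritative:

```sql
-- Sketch only: phonological columns become nullable; has_phonology flags them.
CREATE TABLE words (
  word TEXT PRIMARY KEY,
  ipa TEXT,                  -- NULL for norm-only words
  phonemes TEXT,             -- NULL for norm-only words
  phonemes_str TEXT,         -- NULL for norm-only words
  syllables TEXT,            -- NULL for norm-only words
  phoneme_count INTEGER,     -- NULL for norm-only words
  syllable_count INTEGER,    -- NULL for norm-only words
  has_phonology INTEGER NOT NULL DEFAULT 1
  -- ...property columns (wcm_score, frequency, ...) elided...
);
```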

The implementer MUST read the existing export-to-d1.py thoroughly before rewriting. Preserve:
- sql_val() and sql_json() helpers
- Batch size of 20
- Same INSERT format
- Same table ordering (words, edges, minimal_pairs, phonemes, phoneme_dots, components, word_syllables, metadata)
- Same index creation

Additional guidance:
- The word_syllables INSERT must iterate db.derived.word_syllable_data (which already excludes norm-only words) — do NOT loop over all words for syllable rows
- The old is_clean_edge() filter (rejects SWOW edges with tabs/newlines/length >100) is no longer needed in the exporter — verify that load_swow() in packages/data/src/phonolex_data/loaders/associations.py already handles this, or add filtering there
- For word INSERTs, access WordRecord fields via getattr(record, field_name) instead of dict .get()

The new PROPERTY_COLUMNS list (add to existing):

PROPERTY_COLUMNS = [
    "wcm_score", "frequency", "log_frequency",
    "contextual_diversity", "prevalence", "aoa",
    "aoa_kuperman", "elp_lexical_decision_rt",
    "phono_prob_avg", "positional_prob_avg",
    "neighborhood_density",
    "str_phono_prob_avg", "str_positional_prob_avg", "str_neighborhood_density",
    "imageability", "familiarity", "concreteness", "size",
    "valence", "arousal", "dominance",
    "iconicity", "boi", "socialness",
    "semantic_diversity",
    "auditory", "visual", "haptic",
    "gustatory", "olfactory", "interoceptive",
    "hand_arm", "foot_leg", "head", "mouth", "torso",
    "morpheme_count", "is_monomorphemic",
    "n_prefixes", "n_suffixes",
    "morphological_segmentation",
    "freq_cyplex_7_9", "freq_cyplex_10_12", "freq_cyplex_13",
]

The new edge columns:

EDGE_COLUMNS = [
    "source", "target", "edge_sources",
    "swow_strength", "usf_forward", "usf_backward",
    "men_relatedness",
    "eccc_consistency", "eccc_n_instances", "eccc_phoneme_distance",
    "spp_first_priming", "spp_other_priming",
    "spp_fas", "spp_lsa",
    "simlex_similarity", "simlex_pos",
    "wordsim_relatedness",
]


- [ ] Step 2: Run end-to-end

Run: cd /Users/jneumann/Repos/PhonoLex && uv run python packages/web/workers/scripts/export-to-d1.py
Expected: Produces packages/web/workers/scripts/d1-seed.sql with >100K words

- [ ] Step 3: Verify SQL structure
head -100 /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql
grep -c "INSERT INTO words" /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql
grep "has_phonology" /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql | head -5
- [ ] Step 4: Commit
git add packages/web/workers/scripts/export-to-d1.py
git commit -m "feat: rewrite export-to-d1.py to use pipeline — eliminates pickle dependency"

Task 17: Update TypeScript Types + Properties

Files:
- Modify: packages/web/workers/src/types.ts
- Modify: packages/web/workers/src/config/properties.ts

Context: Make phonological fields nullable in WordRow and WordResponse. Add has_phonology field. Add new property columns. Add new edge columns. Mirror the config.py property changes in properties.ts.

- [ ] Step 1: Update types.ts

In WordRow, change:

ipa: string;             →  ipa: string | null;
phonemes: string;        →  phonemes: string | null;
phonemes_str: string;    →  phonemes_str: string | null;
syllables: string;       →  syllables: string | null;
phoneme_count: number;   →  phoneme_count: number | null;
syllable_count: number;  →  syllable_count: number | null;

Add: has_phonology: number; (0 or 1)

In WordResponse, change:

ipa: string;             →  ipa: string | null;
phonemes: string[];      →  phonemes: string[] | null;
syllables: ...;          →  syllables: ... | null;
phoneme_count: number;   →  phoneme_count: number | null;
syllable_count: number;  →  syllable_count: number | null;

Add: has_phonology: boolean;

In EdgeRow, update the SPP and ECCC columns to match the new names:
- eccc_confusability → eccc_consistency
- spp_priming_z → spp_first_priming
- spp_priming_short → spp_other_priming
- Remove spp_priming_long
- Add: spp_fas, spp_lsa, eccc_n_instances, eccc_phoneme_distance

Also update EdgeResponse — it has the same old field names and must be updated identically.

After updating types, grep for all usages of the old field names across the entire packages/web/workers/src/ directory:

grep -rn "eccc_confusability\|spp_priming_z\|spp_priming_short\|spp_priming_long" packages/web/workers/src/
Update every occurrence found (route handlers, response builders, etc.).

- [ ] Step 2: Update properties.ts

Add the same new PropertyDefs as in config.py Task 15: neighborhood_density, stressed IPhOD variants, CYP-LEX frequencies. Add the new child_frequency category. Update phono_prob/positional_prob source to IPhOD2.

- [ ] Step 3: Verify TypeScript compiles

Run: cd /Users/jneumann/Repos/PhonoLex/packages/web/workers && npx tsc --noEmit
Expected: No errors (or fix any errors that arise from the nullable changes)

- [ ] Step 4: Commit
git add packages/web/workers/src/types.ts packages/web/workers/src/config/properties.ts
git commit -m "feat: update TypeScript types for nullable phonology + new properties"

Task 18: End-to-End Verification

Files: None created — verification only.

- [ ] Step 1: Run full pipeline

cd /Users/jneumann/Repos/PhonoLex && uv run python packages/web/workers/scripts/export-to-d1.py
Expected: Produces d1-seed.sql with stats printed (word count, edge count, etc.)

- [ ] Step 2: Verify word counts

# Count INSERT INTO words statements (each has up to 20 rows)
grep -c "INSERT INTO words" /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql
# Spot-check has_phonology values
grep "has_phonology" /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql | head -5
# Check file size (should be significantly larger than before)
ls -lh /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql
Expected: Substantially more INSERT INTO words statements than before (~44K/20 = 2,200 before; now >5,000)

- [ ] Step 3: Seed local D1 and verify
cd /Users/jneumann/Repos/PhonoLex/packages/web/workers
npx wrangler d1 execute phonolex --local --file scripts/d1-seed.sql
npx wrangler dev

Test in browser:
- Custom Word Lists: filter by frequency, verify results include both phonology and norm-only words
- Lookup: search for a word with phonology (e.g., "cat") — should show full data
- Lookup: search for a norm-only word — should show available norms, no phonological data
- Similarity: verify it still works (phonology words only)
- Text Analysis: paste text, verify percentile stats
- Contrastive Sets: verify minimal pairs load

- [ ] Step 4: Run all existing tests
cd /Users/jneumann/Repos/PhonoLex/packages/web/workers && npm test
cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/ -v
- [ ] Step 5: Verify no pickle references remain

grep -r "pickle" /Users/jneumann/Repos/PhonoLex/packages/ --include="*.py" --include="*.ts" -l
grep -r "cognitive_graph" /Users/jneumann/Repos/PhonoLex/packages/ --include="*.py" --include="*.ts" -l
Expected: No results (or only in archived/docs files)

- [ ] Step 6: Commit any fixes
git add -A
git commit -m "fix: address issues found in end-to-end verification"