# Integrated Lexical Database Pipeline — Implementation Plan
For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Build the integrated lexical database directly from 25 raw source datasets, eliminating the pickle dependency.
Architecture: Focused pipeline modules in packages/data/src/phonolex_data/pipeline/ (schema, words, edges, derived, orchestrator). New loaders for 8 datasets added to packages/data/src/phonolex_data/loaders/. Consumer export-to-d1.py becomes a thin SQL writer calling build_lexical_database().
Tech Stack: Python 3.10+, openpyxl, numpy, pytest. D1 seed SQL output. TypeScript types in Hono workers.
Spec: docs/superpowers/specs/2026-03-13-integrated-lexical-database-pipeline-design.md
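To make the "thin SQL writer" goal concrete, here is a minimal sketch of what the rewritten consumer could reduce to once `build_lexical_database()` exists. The `sql_quote()` helper, `word_insert()` name, and column layout are illustrative assumptions, not the final implementation:

```python
# Hypothetical sketch: export-to-d1.py shrinks to serializing pipeline records
# as D1 seed SQL. Names and columns here are assumptions for illustration.
def sql_quote(val) -> str:
    """Render a Python value as a SQL literal (None -> NULL, quotes escaped)."""
    if val is None:
        return "NULL"
    if isinstance(val, bool):  # check bool before int/str handling
        return "1" if val else "0"
    if isinstance(val, str):
        return "'" + val.replace("'", "''") + "'"
    return str(val)

def word_insert(row: dict) -> str:
    """Build one INSERT statement from a flat word-record dict."""
    cols = ", ".join(row)
    vals = ", ".join(sql_quote(v) for v in row.values())
    return f"INSERT INTO words ({cols}) VALUES ({vals});"

print(word_insert({"word": "cat", "prevalence": 0.98, "ipa": None}))
# INSERT INTO words (word, prevalence, ipa) VALUES ('cat', 0.98, NULL);
```

The real script would iterate `db.words.values()` and `db.edges` from the pipeline's `LexicalDatabase` and stream statements to the seed file.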
## File Structure
### New files to create
- `packages/data/src/phonolex_data/loaders/morphology.py` — `load_morpholex()`
- `packages/data/src/phonolex_data/loaders/child_frequency.py` — `load_cyplex()`
- `packages/data/src/phonolex_data/pipeline/__init__.py` — orchestrator (`build_lexical_database()`)
- `packages/data/src/phonolex_data/pipeline/schema.py` — `WordRecord`, `EdgeRecord`, `DerivedData`, `LexicalDatabase` dataclasses
- `packages/data/src/phonolex_data/pipeline/words.py` — `build_words()`
- `packages/data/src/phonolex_data/pipeline/edges.py` — `build_edges()`
- `packages/data/src/phonolex_data/pipeline/derived.py` — `build_derived()`
- `packages/data/tests/test_new_loaders.py` — tests for 8 new loaders + simlex update
- `packages/data/tests/test_pipeline.py` — tests for pipeline modules
### Existing files to modify
- `packages/data/src/phonolex_data/loaders/norms.py` — add `load_prevalence()`, `load_iphod()`
- `packages/data/src/phonolex_data/loaders/associations.py` — add `load_men()`, `load_wordsim()`, `load_spp()`, `load_eccc()`; update `load_simlex()`
- `packages/data/src/phonolex_data/loaders/__init__.py` — export new functions
- `packages/web/workers/scripts/export-to-d1.py` — rewrite to use pipeline
- `packages/web/workers/scripts/config.py` — add new PropertyDefs, update edge types
- `packages/web/workers/src/types.ts` — nullable phonological fields, new columns
- `packages/web/workers/src/config/properties.ts` — add new property categories
## Chunk 1: New Norm Loaders
### Task 1: `load_prevalence()`
Files:
- Modify: packages/data/src/phonolex_data/loaders/norms.py
- Test: packages/data/tests/test_new_loaders.py
Context: Reads data/norms/prevalence/English_Word_Prevalences.xlsx. The proportion column is Pknown (0-1), NOT Prevalence (which is log-scale). ~62K words. Follow pattern of existing norm loaders (e.g., load_warriner()).
- [ ] Step 1: Write the failing test
```python
# packages/data/tests/test_new_loaders.py
"""Tests for newly added loaders (prevalence, iphod, morpholex, cyplex, men, wordsim, spp, eccc)."""
from __future__ import annotations

import pytest


def test_load_prevalence():
    from phonolex_data.loaders import load_prevalence

    result = load_prevalence()
    assert isinstance(result, dict)
    assert len(result) > 50000  # ~62K words
    # Spot check a common word
    assert "the" in result
    entry = result["the"]
    assert "prevalence" in entry
    assert 0.0 <= entry["prevalence"] <= 1.0
    # "the" should be known by nearly everyone
    assert entry["prevalence"] > 0.9
```
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_prevalence -v
Expected: FAIL (ImportError — load_prevalence not yet defined)
- [ ] Step 3: Implement `load_prevalence()`
Add to packages/data/src/phonolex_data/loaders/norms.py:
```python
def load_prevalence(path: str | Path | None = None) -> dict[str, dict[str, float]]:
    """Load Brysbaert et al. (2019) word prevalence norms.

    Returns:
        {word: {prevalence: float}} — proportion of people who know the word (0-1)
    """
    openpyxl = require_openpyxl()
    path = Path(path) if path else get_data_dir() / "norms" / "prevalence" / "English_Word_Prevalences.xlsx"
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    ws = wb.active
    result: dict[str, dict[str, float]] = {}
    header = None
    for row in ws.iter_rows(values_only=True):
        if header is None:
            header = [str(c).strip() if c else "" for c in row]
            continue
        word = row[header.index("Word")]
        if not word or not isinstance(word, str):
            continue
        try:
            prevalence = float(row[header.index("Pknown")])
            result[word.strip().lower()] = {"prevalence": prevalence}
        except (ValueError, TypeError):
            continue
    wb.close()
    return result
```
Add to `packages/data/src/phonolex_data/loaders/__init__.py`:

```python
from phonolex_data.loaders.norms import load_prevalence
```

(Also append `load_prevalence` to the `__all__` list if one exists.)
- [ ] Step 4: Run test to verify it passes
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_prevalence -v
Expected: PASS
- [ ] Step 5: Commit
```shell
git add packages/data/src/phonolex_data/loaders/norms.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_prevalence() loader for Brysbaert word prevalence norms"
```
### Task 2: `load_iphod()`
Files:
- Modify: packages/data/src/phonolex_data/loaders/norms.py
- Test: packages/data/tests/test_new_loaders.py
Context: Reads data/norms/iphod/IPhOD2_Words.txt (tab-delimited). Key columns: Word, unsDENS (int), unsBPAV (float), unsPOSPAV (float), strDENS (int), strBPAV (float), strPOSPAV (float). ~54K words. Some words have multiple pronunciation rows — take the first occurrence. Replaces the old load_phonotactic_probability() from phoible.py.
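The "first occurrence wins" rule above can be illustrated with a tiny self-contained sketch (the words and density values are made up, not real IPhOD data):

```python
# IPhOD lists one row per pronunciation, so a word can appear multiple times;
# the loader keeps only the first row it encounters for each word.
rows = [("above", 7), ("above", 9), ("cat", 25)]  # (Word, unsDENS) — illustrative values
result: dict[str, int] = {}
for word, dens in rows:
    if word in result:
        continue  # first pronunciation wins
    result[word] = dens

assert result == {"above": 7, "cat": 25}
```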
- [ ] Step 1: Write the failing test
```python
# Add to packages/data/tests/test_new_loaders.py
def test_load_iphod():
    from phonolex_data.loaders import load_iphod

    result = load_iphod()
    assert isinstance(result, dict)
    assert len(result) > 30000  # ~54K unique words
    # Spot check
    assert "cat" in result
    entry = result["cat"]
    expected_keys = {
        "neighborhood_density", "phono_prob_avg", "positional_prob_avg",
        "str_neighborhood_density", "str_phono_prob_avg", "str_positional_prob_avg",
    }
    assert set(entry.keys()) == expected_keys
    assert isinstance(entry["neighborhood_density"], int)
    assert isinstance(entry["phono_prob_avg"], float)
```
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_iphod -v
Expected: FAIL
- [ ] Step 3: Implement `load_iphod()`
Add to packages/data/src/phonolex_data/loaders/norms.py:
```python
def load_iphod(path: str | Path | None = None) -> dict[str, dict[str, float | int]]:
    """Load IPhOD2 phonotactic probability and neighborhood density norms.

    Replaces load_phonotactic_probability() (Vitevitch & Luce JSON).

    Returns:
        {word: {neighborhood_density, phono_prob_avg, positional_prob_avg,
                str_neighborhood_density, str_phono_prob_avg, str_positional_prob_avg}}
    """
    path = Path(path) if path else get_data_dir() / "norms" / "iphod" / "IPhOD2_Words.txt"
    result: dict[str, dict[str, float | int]] = {}
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            word = row.get("Word", "").strip().lower()
            if not word or word in result:  # take first pronunciation only
                continue
            try:
                result[word] = {
                    "neighborhood_density": int(float(row["unsDENS"])),
                    "phono_prob_avg": float(row["unsBPAV"]),
                    "positional_prob_avg": float(row["unsPOSPAV"]),
                    "str_neighborhood_density": int(float(row["strDENS"])),
                    "str_phono_prob_avg": float(row["strBPAV"]),
                    "str_positional_prob_avg": float(row["strPOSPAV"]),
                }
            except (ValueError, KeyError):
                continue
    return result
```
Export from __init__.py.
- [ ] Step 4: Run test to verify it passes
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_iphod -v
Expected: PASS
- [ ] Step 5: Commit
```shell
git add packages/data/src/phonolex_data/loaders/norms.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_iphod() loader for IPhOD2 phonotactic probability norms"
```
### Task 3: `load_morpholex()`
Files:
- Create: packages/data/src/phonolex_data/loaders/morphology.py
- Test: packages/data/tests/test_new_loaders.py
Context: Reads data/norms/morpholex/MorphoLEX_en.xlsx. Important: The first sheet ("Presentation") is a legend — data is spread across 30 PRS-signature sheets (e.g., '0-1-0', '1-1-1'). Skip the first sheet and the last 3 sheets ('All prefixes', 'All suffixes', 'All roots'). Columns per data sheet: Word, MorphoLexSegm (segmentation using {<>()} brackets like {(dark)}>ness>), Nmorph, PRS_signature (comma-separated P,R,S counts, e.g., "0,1,1"). There are NO nPrefix/nSuffix columns — derive from PRS_signature. ~70K words across all sheets. Segmentation uses curly braces {}, angle brackets <>, and parens () — strip all bracket types to get morpheme segments.
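The two derivations described above — stripping the mixed bracket notation and splitting PRS_signature — can be sketched in a few self-contained lines (example values are illustrative, not taken from the file):

```python
import re

# Bracket stripping: keep every run of non-bracket characters as a morpheme segment.
segm = "{<un<(break)>able>}"
segments = [s for s in re.findall(r"[^<>(){}]+", segm) if s.strip()]
assert "|".join(segments) == "un|break|able"

# PRS_signature is "P,R,S" counts; prefixes and suffixes come from the 1st and 3rd fields.
prs = "1,1,1"  # e.g. un + break + able
p, r, s = (int(x) for x in prs.split(","))
assert (p, s) == (1, 1)  # n_prefixes, n_suffixes
```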
- [ ] Step 1: Write the failing test
```python
# Add to packages/data/tests/test_new_loaders.py
def test_load_morpholex():
    from phonolex_data.loaders import load_morpholex

    result = load_morpholex()
    assert isinstance(result, dict)
    assert len(result) > 50000  # ~70K words
    # Check a known polymorphemic word.
    # "unbreakable" may not be in the dataset; MorphoLex has "darkness" → (dark)ness
    if "darkness" in result:
        entry = result["darkness"]
        expected_keys = {
            "morpheme_count", "n_prefixes", "n_suffixes",
            "is_monomorphemic", "morphological_segmentation",
        }
        assert set(entry.keys()) == expected_keys
        assert entry["morpheme_count"] >= 2
        assert entry["is_monomorphemic"] is False
        assert isinstance(entry["morphological_segmentation"], str)
        assert "|" in entry["morphological_segmentation"]
    # Check a monomorphemic word
    if "cat" in result:
        assert result["cat"]["is_monomorphemic"] is True
        assert result["cat"]["morpheme_count"] == 1
```
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_morpholex -v
Expected: FAIL
- [ ] Step 3: Implement `load_morpholex()`
Create packages/data/src/phonolex_data/loaders/morphology.py:
"""Morphology loaders."""
from __future__ import annotations
import re
from pathlib import Path
from phonolex_data.loaders._helpers import get_data_dir, require_openpyxl
def _parse_morpholex_segmentation(segm: str) -> str:
"""Convert MorphoLex bracket notation to pipe-delimited.
MorphoLex uses {}, <>, and () brackets:
'{(cat)}' → 'cat'
'{(dark)}>ness>' → 'dark|ness'
'{<un<(break)>able>}' → 'un|break|able'
Strategy: strip all bracket types, keep text segments.
"""
segments = re.findall(r"[^<>(){}]+", segm)
segments = [s.strip() for s in segments if s.strip()]
return "|".join(segments) if segments else segm
def load_morpholex(path: str | Path | None = None) -> dict[str, dict]:
"""Load MorphoLex-en morphological segmentation data.
Data is spread across 30 PRS-signature sheets (skip first 'Presentation'
sheet and last 3 summary sheets). Derive prefix/suffix counts from
PRS_signature column (comma-separated P,R,S counts).
Returns:
{word: {morpheme_count, n_prefixes, n_suffixes,
is_monomorphemic, morphological_segmentation}}
"""
openpyxl = require_openpyxl()
path = Path(path) if path else get_data_dir() / "norms" / "morpholex" / "MorphoLEX_en.xlsx"
wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
# Skip first sheet (Presentation) and last 3 (All prefixes/suffixes/roots)
data_sheets = wb.worksheets[1:-3] if len(wb.worksheets) > 4 else wb.worksheets[1:]
result: dict[str, dict] = {}
for ws in data_sheets:
header = None
for row in ws.iter_rows(values_only=True):
if header is None:
header = [str(c).strip() if c else "" for c in row]
if "Word" not in header:
break # not a data sheet
continue
try:
word_val = row[header.index("Word")]
if not word_val or not isinstance(word_val, str):
continue
word = word_val.strip().lower()
if word in result:
continue # first occurrence wins
segm_raw = str(row[header.index("MorphoLexSegm")] or "")
segmentation = _parse_morpholex_segmentation(segm_raw)
morpheme_count = len(segmentation.split("|")) if segmentation else 1
# Derive prefix/suffix counts from PRS_signature (e.g., "0,1,1")
prs = str(row[header.index("PRS_signature")] or "0,1,0")
prs_parts = prs.split(",")
n_prefixes = int(prs_parts[0]) if len(prs_parts) >= 1 else 0
n_suffixes = int(prs_parts[2]) if len(prs_parts) >= 3 else 0
result[word] = {
"morpheme_count": morpheme_count,
"n_prefixes": n_prefixes,
"n_suffixes": n_suffixes,
"is_monomorphemic": morpheme_count == 1,
"morphological_segmentation": segmentation,
}
except (ValueError, KeyError, IndexError):
continue
wb.close()
return result
Export from __init__.py.
- [ ] Step 4: Run test to verify it passes
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_morpholex -v
Expected: PASS
- [ ] Step 5: Commit
```shell
git add packages/data/src/phonolex_data/loaders/morphology.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_morpholex() loader for MorphoLex-en morphological segmentation"
```
### Task 4: `load_cyplex()`
Files:
- Create: packages/data/src/phonolex_data/loaders/child_frequency.py
- Test: packages/data/tests/test_new_loaders.py
Context: Reads data/norms/cyplex/CYPLEX_all_age_bands.csv. Columns include Word, CYPLEX79_log, CYPLEX1012_log, CYPLEX13_log (Zipf-scale log frequencies). ~91K words (union across age bands). BOM-encoded CSV (starts with \ufeff).
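The BOM detail matters for the implementation below: with plain `utf-8`, the BOM sticks to the first header name and `row["Word"]` fails. A quick self-contained demonstration (the sample CSV content is made up):

```python
import csv
import io

# A two-line CSV with a leading BOM, as in CYPLEX_all_age_bands.csv
raw = "\ufeffWord,CYPLEX79_log\nthe,7.1\n".encode("utf-8")

# Plain utf-8 leaves the BOM glued to the first header field
rows = list(csv.DictReader(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")))
assert "\ufeffWord" in rows[0] and "Word" not in rows[0]

# utf-8-sig strips the BOM, so the header is clean
rows = list(csv.DictReader(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig")))
assert rows[0]["Word"] == "the"
```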
- [ ] Step 1: Write the failing test
```python
# Add to packages/data/tests/test_new_loaders.py
def test_load_cyplex():
    from phonolex_data.loaders import load_cyplex

    result = load_cyplex()
    assert isinstance(result, dict)
    assert len(result) > 50000
    assert "the" in result
    entry = result["the"]
    expected_keys = {"freq_cyplex_7_9", "freq_cyplex_10_12", "freq_cyplex_13"}
    assert set(entry.keys()) == expected_keys
    # Each band is either a Zipf-scale float or None (word absent from that band)
    for key in expected_keys:
        val = entry[key]
        assert val is None or isinstance(val, float)
```
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_cyplex -v
Expected: FAIL
- [ ] Step 3: Implement `load_cyplex()`
Create packages/data/src/phonolex_data/loaders/child_frequency.py:
"""Child frequency loaders."""
from __future__ import annotations
import csv
from pathlib import Path
from phonolex_data.loaders._helpers import get_data_dir
def load_cyplex(path: str | Path | None = None) -> dict[str, dict[str, float | None]]:
"""Load CYP-LEX child frequency norms (all 3 age bands).
Reads CYPLEX_all_age_bands.csv — maps CYPLEX79_log, CYPLEX1012_log,
CYPLEX13_log to freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13.
Returns:
{word: {freq_cyplex_7_9, freq_cyplex_10_12, freq_cyplex_13}}
"""
path = (
Path(path) if path
else get_data_dir() / "norms" / "cyplex" / "CYPLEX_all_age_bands.csv"
)
column_map = {
"CYPLEX79_log": "freq_cyplex_7_9",
"CYPLEX1012_log": "freq_cyplex_10_12",
"CYPLEX13_log": "freq_cyplex_13",
}
result: dict[str, dict[str, float | None]] = {}
with open(path, encoding="utf-8-sig") as f: # utf-8-sig handles BOM
reader = csv.DictReader(f)
for row in reader:
word = row.get("Word", "").strip().lower()
if not word:
continue
entry: dict[str, float | None] = {}
for src_col, dest_key in column_map.items():
raw = row.get(src_col, "").strip()
try:
entry[dest_key] = float(raw) if raw else None
except ValueError:
entry[dest_key] = None
result[word] = entry
return result
Export from __init__.py.
- [ ] Step 4: Run test to verify it passes
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_cyplex -v
Expected: PASS
- [ ] Step 5: Commit
```shell
git add packages/data/src/phonolex_data/loaders/child_frequency.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_cyplex() loader for CYP-LEX child frequency norms"
```
## Chunk 2: New Association Loaders
### Task 5: `load_men()`
Files:
- Modify: packages/data/src/phonolex_data/loaders/associations.py
- Test: packages/data/tests/test_new_loaders.py
Context: Reads data/norms/men/MEN-TR-3k.txt. Space-delimited, NO header row. Format: word1 word2 score. 3,000 pairs. Scores range 0-50.
- [ ] Step 1: Write the failing test
```python
# Add to packages/data/tests/test_new_loaders.py
def test_load_men():
    from phonolex_data.loaders import load_men

    result = load_men()
    assert isinstance(result, list)
    assert len(result) == 3000
    w1, w2, score = result[0]
    assert isinstance(w1, str)
    assert isinstance(w2, str)
    assert isinstance(score, float)
    assert 0.0 <= score <= 50.0
```
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_men -v
Expected: FAIL
- [ ] Step 3: Implement `load_men()`
Add to packages/data/src/phonolex_data/loaders/associations.py:
```python
def load_men(path: str | Path | None = None) -> list[tuple[str, str, float]]:
    """Load MEN semantic relatedness dataset (Bruni et al. 2014).

    Returns:
        [(word1, word2, relatedness_score), ...] — 3,000 pairs, scores 0-50
    """
    path = Path(path) if path else get_data_dir() / "norms" / "men" / "MEN-TR-3k.txt"
    result: list[tuple[str, str, float]] = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) != 3:
                continue
            try:
                result.append((parts[0].lower(), parts[1].lower(), float(parts[2])))
            except ValueError:
                continue
    return result
```
Export from __init__.py.
- [ ] Step 4: Run test to verify it passes
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_men -v
Expected: PASS
- [ ] Step 5: Commit
```shell
git add packages/data/src/phonolex_data/loaders/associations.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_men() loader for MEN semantic relatedness dataset"
```
### Task 6: `load_wordsim()`
Files:
- Modify: packages/data/src/phonolex_data/loaders/associations.py
- Test: packages/data/tests/test_new_loaders.py
Context: Reads data/norms/wordsim353/combined.csv. CSV with header. Columns: Word 1, Word 2, Human (mean). 353 pairs. Scores 0-10. Note: there's also a combined.tab but we use the CSV.
- [ ] Step 1: Write the failing test
```python
def test_load_wordsim():
    from phonolex_data.loaders import load_wordsim

    result = load_wordsim()
    assert isinstance(result, list)
    assert 340 <= len(result) <= 360  # ~353 pairs
    w1, w2, score = result[0]
    assert isinstance(w1, str)
    assert isinstance(w2, str)
    assert isinstance(score, float)
    assert 0.0 <= score <= 10.0
```
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_wordsim -v
Expected: FAIL
- [ ] Step 3: Implement `load_wordsim()`
Add to packages/data/src/phonolex_data/loaders/associations.py:
```python
def load_wordsim(path: str | Path | None = None) -> list[tuple[str, str, float]]:
    """Load WordSim-353 semantic relatedness dataset (Finkelstein et al. 2002).

    Returns:
        [(word1, word2, relatedness_score), ...] — ~353 pairs, scores 0-10
    """
    path = Path(path) if path else get_data_dir() / "norms" / "wordsim353" / "combined.csv"
    result: list[tuple[str, str, float]] = []
    with open(path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                result.append((
                    row["Word 1"].strip().lower(),
                    row["Word 2"].strip().lower(),
                    float(row["Human (mean)"]),
                ))
            except (ValueError, KeyError):
                continue
    return result
```
Export from __init__.py.
- [ ] Step 4: Run test to verify it passes
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_wordsim -v
Expected: PASS
- [ ] Step 5: Commit
```shell
git add packages/data/src/phonolex_data/loaders/associations.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_wordsim() loader for WordSim-353 dataset"
```
### Task 7: `load_spp()`
Files:
- Modify: packages/data/src/phonolex_data/loaders/associations.py
- Test: packages/data/tests/test_new_loaders.py
Context: Reads data/norms/spp/spp_ldt_item_analysis.xlsx. Actual xlsx format. Key columns: target, prime_1st_assoc, first_priming_overall, other_priming_overall, firstassoc_fas, firstassoc_lsa. 1,661 rows. SPP measures priming effects — the relationship is (prime → target). Values can be negative (inhibition).
- [ ] Step 1: Write the failing test
```python
def test_load_spp():
    from phonolex_data.loaders import load_spp

    result = load_spp()
    assert isinstance(result, list)
    assert len(result) > 1500  # ~1,661 pairs
    target, prime, first_priming, other_priming, fas, lsa = result[0]
    assert isinstance(target, str)
    assert isinstance(prime, str)
    # Priming values can be negative
    assert isinstance(first_priming, (float, type(None)))
    assert isinstance(fas, (float, type(None)))
```
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_spp -v
Expected: FAIL
- [ ] Step 3: Implement `load_spp()`
Add to packages/data/src/phonolex_data/loaders/associations.py:
```python
def load_spp(path: str | Path | None = None) -> list[tuple]:
    """Load Semantic Priming Project dataset (Hutchison et al. 2013).

    Reads spp_ldt_item_analysis.xlsx.

    Returns:
        [(target, prime, first_priming_overall, other_priming_overall,
          firstassoc_fas, firstassoc_lsa), ...]
    """
    from phonolex_data.loaders._helpers import require_openpyxl

    openpyxl = require_openpyxl()
    path = (
        Path(path) if path
        else get_data_dir() / "norms" / "spp" / "spp_ldt_item_analysis.xlsx"
    )
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    ws = wb.active
    result: list[tuple] = []
    header = None
    for row in ws.iter_rows(values_only=True):
        if header is None:
            header = [str(c).strip() if c else "" for c in row]
            continue
        try:
            target = str(row[header.index("target")]).strip().lower()
            prime_raw = row[header.index("prime_1st_assoc")]
            prime = str(prime_raw).strip().lower() if prime_raw else ""
            if not target or not prime:
                continue

            def _float_or_none(col_name: str) -> float | None:
                idx = header.index(col_name)
                val = row[idx]
                if val is None or str(val).strip() == "":
                    return None
                return float(val)

            result.append((
                target,
                prime,
                _float_or_none("first_priming_overall"),
                _float_or_none("other_priming_overall"),
                _float_or_none("firstassoc_fas"),
                _float_or_none("firstassoc_lsa"),
            ))
        except (ValueError, KeyError):
            continue
    wb.close()
    return result
```
Export from __init__.py.
- [ ] Step 4: Run test to verify it passes
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_spp -v
Expected: PASS
- [ ] Step 5: Commit
```shell
git add packages/data/src/phonolex_data/loaders/associations.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_spp() loader for Semantic Priming Project dataset"
```
### Task 8: `load_eccc()`
Files:
- Modify: packages/data/src/phonolex_data/loaders/associations.py
- Test: packages/data/tests/test_new_loaders.py
Context: Reads data/norms/eccc/confusionCorpus_v1.2.csv. Columns: Target, Confusion, Consistency (raw listener count, NOT a proportion — e.g. 9 means 9 out of N listeners), N-Listeners (int, typically 15), Counts (string like "9 2 1 1 1 1"), Phoneme-distance (int). To get the proportion (0-1), divide Consistency by N-Listeners. Multiple rows per (Target, Confusion) pair across different conditions. Aggregate per unique (target, confusion) pair: mean of (Consistency/N-Listeners) proportions, sum of Consistency counts as total_instances, mean Phoneme-distance.
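The aggregation rule above can be checked with a worked example on made-up rows for a single (target, confusion) pair:

```python
# Two conditions for one pair: (Consistency, N-Listeners, Phoneme-distance).
# Values are illustrative, not real ECCC rows.
rows = [(9, 15, 1), (12, 15, 2)]

props = [c / n for c, n, _ in rows]                  # 0.6, 0.8 — per-row proportions
mean_consistency = sum(props) / len(props)           # mean proportion
total_instances = sum(c for c, _, _ in rows)         # summed raw listener counts
mean_distance = sum(d for _, _, d in rows) / len(rows)

assert round(mean_consistency, 2) == 0.7
assert total_instances == 21
assert mean_distance == 1.5
```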
- [ ] Step 1: Write the failing test
```python
def test_load_eccc():
    from phonolex_data.loaders import load_eccc

    result = load_eccc()
    assert isinstance(result, list)
    assert len(result) > 1000  # aggregated pairs
    target, confusion, consistency, n_instances, phoneme_distance = result[0]
    assert isinstance(target, str)
    assert isinstance(confusion, str)
    assert isinstance(consistency, float)
    assert 0.0 <= consistency <= 1.0
    assert isinstance(n_instances, int)
    assert n_instances >= 1
    assert isinstance(phoneme_distance, float)
```
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_eccc -v
Expected: FAIL
- [ ] Step 3: Implement `load_eccc()`
Add to packages/data/src/phonolex_data/loaders/associations.py:
```python
def load_eccc(path: str | Path | None = None) -> list[tuple]:
    """Load ECCC speech-in-noise confusion corpus (Mondol & Bhatt 2023).

    Aggregates per (target, confusion) pair across conditions.

    Returns:
        [(target, confusion, mean_consistency, total_instances, mean_phoneme_distance), ...]
    """
    path = (
        Path(path) if path
        else get_data_dir() / "norms" / "eccc" / "confusionCorpus_v1.2.csv"
    )
    # Aggregate across conditions
    pair_data: dict[tuple[str, str], dict] = {}
    with open(path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            target = row.get("Target", "").strip().lower()
            confusion = row.get("Confusion", "").strip().lower()
            if not target or not confusion or target == confusion:
                continue
            try:
                # Consistency is a raw listener count, NOT a proportion
                raw_consistency = int(row["Consistency"])
                n_listeners = int(row["N-Listeners"])
                if n_listeners == 0:
                    continue
                consistency = raw_consistency / n_listeners  # proportion 0-1
                phoneme_dist = float(row["Phoneme-distance"])
            except (ValueError, KeyError):
                continue
            key = (target, confusion)
            if key not in pair_data:
                pair_data[key] = {
                    "consistencies": [],
                    "raw_counts": [],
                    "distances": [],
                }
            pair_data[key]["consistencies"].append(consistency)
            pair_data[key]["raw_counts"].append(raw_consistency)
            pair_data[key]["distances"].append(phoneme_dist)
    result: list[tuple] = []
    for (target, confusion), data in pair_data.items():
        mean_consistency = sum(data["consistencies"]) / len(data["consistencies"])
        total_instances = sum(data["raw_counts"])
        mean_distance = sum(data["distances"]) / len(data["distances"])
        result.append((target, confusion, mean_consistency, total_instances, mean_distance))
    return result
```
Export from __init__.py.
- [ ] Step 4: Run test to verify it passes
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_eccc -v
Expected: PASS
- [ ] Step 5: Commit
```shell
git add packages/data/src/phonolex_data/loaders/associations.py packages/data/src/phonolex_data/loaders/__init__.py packages/data/tests/test_new_loaders.py
git commit -m "feat: add load_eccc() loader for ECCC speech-in-noise confusion corpus"
```
### Task 9: Update `load_simlex()` to return POS
Files:
- Modify: packages/data/src/phonolex_data/loaders/associations.py
- Modify: packages/data/tests/test_datasets.py (update existing test)
- Test: packages/data/tests/test_new_loaders.py
Context: The existing load_simlex() returns list[tuple[str, str, float]]. SimLex-999.txt has a POS column (tab-delimited, values: N, V, A). Update to return list[tuple[str, str, float, str]] — 4th element is POS.
- [ ] Step 1: Write the test for new return type
```python
# Add to packages/data/tests/test_new_loaders.py
def test_load_simlex_with_pos():
    from phonolex_data.loaders import load_simlex

    result = load_simlex()
    assert isinstance(result, list)
    assert len(result) == 999
    w1, w2, score, pos = result[0]
    assert isinstance(w1, str)
    assert isinstance(w2, str)
    assert isinstance(score, float)
    assert isinstance(pos, str)
    assert pos in ("N", "V", "A")
```
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_simlex_with_pos -v
Expected: FAIL (tuple has 3 elements, not 4)
- [ ] Step 3: Update `load_simlex()`
In packages/data/src/phonolex_data/loaders/associations.py, update:
```python
def load_simlex(path: str | Path | None = None) -> list[tuple[str, str, float, str]]:
    """Load SimLex-999 word similarity dataset (Hill et al. 2015).

    Returns:
        [(word1, word2, similarity_score, pos), ...]
    """
    path = Path(path) if path else get_data_dir() / "norms" / "SimLex-999.txt"
    result: list[tuple[str, str, float, str]] = []
    with open(path) as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            result.append((
                row["word1"].strip().lower(),
                row["word2"].strip().lower(),
                float(row["SimLex999"]),
                row["POS"].strip(),
            ))
    return result
```
- [ ] Step 4: Update existing test in `test_datasets.py`
The existing test_load_simlex in packages/data/tests/test_datasets.py expects 3-tuples. Update the unpacking:
```python
# In packages/data/tests/test_datasets.py, find:
w1, w2, score = sl[0]
# Replace with:
w1, w2, score, pos = sl[0]
assert isinstance(pos, str)
assert pos in ("N", "V", "A")
```
Also update any `assert len(result[0]) == 3` to `assert len(result[0]) == 4`.
- [ ] Step 5: Run both tests
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_new_loaders.py::test_load_simlex_with_pos packages/data/tests/test_datasets.py::test_load_simlex -v
Expected: PASS
- [ ] Step 6: Commit
```shell
git add packages/data/src/phonolex_data/loaders/associations.py packages/data/tests/test_new_loaders.py packages/data/tests/test_datasets.py
git commit -m "feat: update load_simlex() to return POS column"
```
## Chunk 3: Pipeline Schema + Words Module
### Task 10: Pipeline `schema.py` — Data Contract
Files:
- Create: packages/data/src/phonolex_data/pipeline/__init__.py (empty initially)
- Create: packages/data/src/phonolex_data/pipeline/schema.py
- Test: packages/data/tests/test_pipeline.py
Context: Defines the 4 dataclasses from the spec: WordRecord, EdgeRecord, DerivedData, LexicalDatabase. These are the data contract consumed by all pipeline stages and downstream consumers.
- [ ] Step 1: Write the test
```python
# packages/data/tests/test_pipeline.py
"""Tests for the integrated lexical database pipeline."""
from __future__ import annotations


def test_word_record_creation():
    from phonolex_data.pipeline.schema import WordRecord

    # CMU word with full phonological data
    wr = WordRecord(
        word="cat",
        has_phonology=True,
        ipa="kæt",
        phonemes=["k", "æ", "t"],
        phoneme_count=3,
        syllables=[{"onset": ["k"], "nucleus": "æ", "coda": ["t"], "stress": 1}],
        syllable_count=1,
        initial_phoneme="k",
        final_phoneme="t",
        wcm_score=3,
    )
    assert wr.word == "cat"
    assert wr.has_phonology is True
    assert wr.frequency is None  # norms default to None
    # Norm-only word
    wr2 = WordRecord(word="café", has_phonology=False)
    assert wr2.ipa is None
    assert wr2.phonemes == []
    assert wr2.phoneme_count is None


def test_edge_record_creation():
    from phonolex_data.pipeline.schema import EdgeRecord

    er = EdgeRecord(
        source="cat",
        target="dog",
        edge_sources=["SWOW", "USF"],
        swow_strength=0.15,
        usf_forward=0.08,
    )
    assert er.source == "cat"
    assert er.edge_sources == ["SWOW", "USF"]
    assert er.men_relatedness is None  # defaults to None


def test_lexical_database_creation():
    from phonolex_data.pipeline.schema import LexicalDatabase, WordRecord, DerivedData

    db = LexicalDatabase(
        words={"cat": WordRecord(word="cat", has_phonology=True)},
        edges=[],
        derived=DerivedData(),
        phoible_vectors={},
    )
    assert "cat" in db.words
    assert db.edges == []
```
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_word_record_creation -v
Expected: FAIL (module not found)
- [ ] Step 3: Implement `schema.py`
Create `packages/data/src/phonolex_data/pipeline/__init__.py`:

```python
"""Integrated lexical database pipeline."""
```
Create packages/data/src/phonolex_data/pipeline/schema.py:
"""Data contract for the integrated lexical database pipeline.
Shared types consumed by all pipeline stages and downstream consumers.
"""
from __future__ import annotations
from dataclasses import dataclass, field
@dataclass
class WordRecord:
"""A single word with all phonological and psycholinguistic data."""
word: str
has_phonology: bool = False
# Phonological fields — populated for CMU dict words, None/empty for norm-only
ipa: str | None = None
phonemes: list[str] = field(default_factory=list)
phoneme_count: int | None = None
syllables: list[dict] = field(default_factory=list)
syllable_count: int | None = None
initial_phoneme: str | None = None
final_phoneme: str | None = None
wcm_score: int | None = None
# Norms — all optional, None means no data
frequency: float | None = None
log_frequency: float | None = None
contextual_diversity: float | None = None
prevalence: float | None = None
aoa: float | None = None
aoa_kuperman: float | None = None
imageability: float | None = None
familiarity: float | None = None
concreteness: float | None = None
size: float | None = None
valence: float | None = None
arousal: float | None = None
dominance: float | None = None
iconicity: float | None = None
boi: float | None = None
socialness: float | None = None
auditory: float | None = None
visual: float | None = None
haptic: float | None = None
gustatory: float | None = None
olfactory: float | None = None
interoceptive: float | None = None
hand_arm: float | None = None
foot_leg: float | None = None
head: float | None = None
mouth: float | None = None
torso: float | None = None
elp_lexical_decision_rt: float | None = None
semantic_diversity: float | None = None
# Morphology (MorphoLex)
morpheme_count: int | None = None
is_monomorphemic: bool | None = None
n_prefixes: int | None = None
n_suffixes: int | None = None
morphological_segmentation: str | None = None
# Phonotactic probability (IPhOD)
neighborhood_density: int | None = None
phono_prob_avg: float | None = None
positional_prob_avg: float | None = None
str_phono_prob_avg: float | None = None
str_positional_prob_avg: float | None = None
str_neighborhood_density: int | None = None
# Child frequency (CYP-LEX)
freq_cyplex_7_9: float | None = None
freq_cyplex_10_12: float | None = None
freq_cyplex_13: float | None = None
# Vocab memberships
vocab_memberships: set[str] = field(default_factory=set)
@dataclass
class EdgeRecord:
"""A relationship between two words from one or more association datasets."""
source: str = ""
target: str = ""
edge_sources: list[str] = field(default_factory=list)
swow_strength: float | None = None
usf_forward: float | None = None
usf_backward: float | None = None
men_relatedness: float | None = None
simlex_similarity: float | None = None
simlex_pos: str | None = None
wordsim_relatedness: float | None = None
# SPP
spp_first_priming: float | None = None
spp_other_priming: float | None = None
spp_fas: float | None = None
spp_lsa: float | None = None
# ECCC
eccc_consistency: float | None = None
eccc_n_instances: int | None = None
eccc_phoneme_distance: float | None = None
@dataclass
class DerivedData:
"""Computed data derived from word records and PHOIBLE vectors."""
percentiles: dict[str, dict[str, float | None]] = field(default_factory=dict)
minimal_pairs: list[tuple] = field(default_factory=list)
phoneme_data: dict[str, dict] = field(default_factory=dict)
phoneme_norms: dict[str, float] = field(default_factory=dict)
phoneme_dots: list[tuple] = field(default_factory=list)
components: list[dict] = field(default_factory=list)
word_syllable_data: dict = field(default_factory=dict)
component_key_to_id: dict = field(default_factory=dict)
property_ranges: dict[str, tuple[float, float]] = field(default_factory=dict)
@dataclass
class LexicalDatabase:
"""The complete integrated lexical database."""
words: dict[str, WordRecord] = field(default_factory=dict)
edges: list[EdgeRecord] = field(default_factory=list)
derived: DerivedData = field(default_factory=DerivedData)
phoible_vectors: dict = field(default_factory=dict)
- [ ] Step 4: Run tests
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py -v
Expected: PASS
- [ ] Step 5: Commit
git add packages/data/src/phonolex_data/pipeline/__init__.py packages/data/src/phonolex_data/pipeline/schema.py packages/data/tests/test_pipeline.py
git commit -m "feat: add pipeline schema — WordRecord, EdgeRecord, DerivedData, LexicalDatabase dataclasses"
Task 11: Pipeline words.py — Build Word Records¶
Files:
- Create: packages/data/src/phonolex_data/pipeline/words.py
- Test: packages/data/tests/test_pipeline.py
Context: build_words() loads CMU dict (phonological backbone), runs syllabification + WCM + normalization, then loads and merges all 15 norm datasets. Words not in CMU dict but present in norm datasets get has_phonology=False records. Returns dict[str, WordRecord].
Important: This function loads real datasets and takes a while. Tests should verify structure, not run the full pipeline. Use a focused integration test that checks a few known words.
Prerequisite fix: cmudict_to_phono() currently returns only {"phonemes": [...], "ipa": "..."} — it does NOT return stress information. The syllabifier requires PhonemeWithStress objects with stress markers. Before implementing build_words(), extend cmudict_to_phono() in packages/data/src/phonolex_data/loaders/cmudict.py to also return a "stress_pattern" list parallel to "phonemes". For each ARPAbet token, if it ends in 0/1/2 (vowels have stress digits), capture that digit as an int; otherwise None:
def cmudict_to_phono(
cmu: dict[str, list[str]] | None = None,
arpa_map: dict[str, str] | None = None,
) -> dict[str, dict[str, Any]]:
"""Convert raw CMUdict to PhonoFeatures-compatible format.
Returns:
{word: {"phonemes": [ipa, ...], "ipa": "...", "stress_pattern": [int|None, ...]}}
"""
if cmu is None:
cmu = load_cmudict()
if arpa_map is None:
arpa_map = load_arpa_to_ipa()
result: dict[str, dict[str, Any]] = {}
for word, arpa_phones in cmu.items():
ipa_phones = []
stress_pattern = []
for p in arpa_phones:
ipa = arpa_map.get(p) or arpa_map.get(p.rstrip("012"), p)
ipa_phones.append(ipa)
# Extract stress digit from ARPAbet vowel tokens (e.g., AE1 → 1)
if p[-1:] in ("0", "1", "2"):
stress_pattern.append(int(p[-1]))
else:
stress_pattern.append(None)
result[word] = {
"phonemes": ipa_phones,
"ipa": "".join(ipa_phones),
"stress_pattern": stress_pattern,
}
return result
Also add packages/data/src/phonolex_data/loaders/cmudict.py to the commit in Step 5.
- [ ] Step 1: Write the integration test
# Add to packages/data/tests/test_pipeline.py
import pytest
@pytest.mark.slow
def test_build_words_structure():
"""Integration test — loads real data, checks structure of result."""
from phonolex_data.pipeline.words import build_words
words = build_words()
assert isinstance(words, dict)
assert len(words) > 100000 # union of all datasets
# A word that's in CMU dict should have phonology
assert "cat" in words
cat = words["cat"]
assert cat.has_phonology is True
assert cat.ipa is not None
assert len(cat.phonemes) > 0
assert cat.phoneme_count is not None
assert cat.syllable_count is not None
assert cat.wcm_score is not None
# Check that norms merged correctly — "cat" should have frequency
assert cat.frequency is not None
# Check that at least some norm-only words exist (words in SUBTLEX but not CMU)
norm_only = [w for w, r in words.items() if not r.has_phonology]
assert len(norm_only) > 0, "Expected some norm-only words from SUBTLEX/prevalence"
# Verify norm-only words have null phonological fields
if norm_only:
sample = words[norm_only[0]]
assert sample.ipa is None
assert sample.phonemes == []
assert sample.phoneme_count is None
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_words_structure -v -m slow
Expected: FAIL
- [ ] Step 3: Implement
build_words()
Create packages/data/src/phonolex_data/pipeline/words.py:
"""Assemble word records from CMU dict + all norm datasets."""
from __future__ import annotations
from phonolex_data.loaders import (
cmudict_to_phono,
load_warriner,
load_kuperman,
load_glasgow,
load_concreteness,
load_sensorimotor,
load_semantic_diversity,
load_socialness,
load_boi,
load_subtlex,
load_elp,
load_iconicity,
load_prevalence,
load_iphod,
load_morpholex,
load_cyplex,
load_all_vocab,
)
from phonolex_data.phonology.syllabification import syllabify, PhonemeWithStress
from phonolex_data.phonology.wcm import compute_wcm
from phonolex_data.phonology.normalize import normalize_phoneme
from phonolex_data.pipeline.schema import WordRecord
# Map from norm loader key names → WordRecord field names
# Most are 1:1, listed here for clarity and to handle exceptions
_NORM_FIELD_MAP: dict[str, str] = {
# Glasgow
"aoa": "aoa",
"imageability": "imageability",
"familiarity": "familiarity",
"size": "size",
# Warriner
"valence": "valence",
"arousal": "arousal",
"dominance": "dominance",
# Kuperman
"aoa_kuperman": "aoa_kuperman",
# Concreteness
"concreteness": "concreteness",
# SUBTLEX
"frequency": "frequency",
"log_frequency": "log_frequency",
"contextual_diversity": "contextual_diversity",
# Sensorimotor
"auditory": "auditory",
"visual": "visual",
"haptic": "haptic",
"gustatory": "gustatory",
"olfactory": "olfactory",
"interoceptive": "interoceptive",
"hand_arm": "hand_arm",
"foot_leg": "foot_leg",
"head": "head",
"mouth": "mouth",
"torso": "torso",
# Others
"semantic_diversity": "semantic_diversity",
"socialness": "socialness",
"boi": "boi",
"lexical_decision_rt": "elp_lexical_decision_rt", # load_elp() returns "lexical_decision_rt"
"iconicity": "iconicity",
"prevalence": "prevalence",
# IPhOD
"neighborhood_density": "neighborhood_density",
"phono_prob_avg": "phono_prob_avg",
"positional_prob_avg": "positional_prob_avg",
"str_neighborhood_density": "str_neighborhood_density",
"str_phono_prob_avg": "str_phono_prob_avg",
"str_positional_prob_avg": "str_positional_prob_avg",
# MorphoLex
"morpheme_count": "morpheme_count",
"n_prefixes": "n_prefixes",
"n_suffixes": "n_suffixes",
"is_monomorphemic": "is_monomorphemic",
"morphological_segmentation": "morphological_segmentation",
# CYP-LEX
"freq_cyplex_7_9": "freq_cyplex_7_9",
"freq_cyplex_10_12": "freq_cyplex_10_12",
"freq_cyplex_13": "freq_cyplex_13",
}
def _build_phonological_record(phono_data: dict) -> WordRecord:
"""Create a WordRecord from cmudict_to_phono() output with syllabification + WCM."""
ipa = phono_data.get("ipa", "")
phonemes_raw = phono_data.get("phonemes", [])
stress_pattern = phono_data.get("stress_pattern", [])
# Build PhonemeWithStress list for syllabifier
phonemes_with_stress = []
for i, p in enumerate(phonemes_raw):
stress = stress_pattern[i] if i < len(stress_pattern) else None
phonemes_with_stress.append(PhonemeWithStress(phoneme=p, stress=stress))
syllables_obj = syllabify(phonemes_with_stress)
syllables = [
{
"onset": [str(p) for p in s.onset],
"nucleus": str(s.nucleus),
"coda": [str(p) for p in s.coda],
"stress": s.stress,
}
for s in syllables_obj
]
phonemes = [normalize_phoneme(p) for p in phonemes_raw]
wcm = compute_wcm(phonemes, syllables)
return WordRecord(
word=phono_data.get("word", ""),
has_phonology=True,
ipa=ipa,
phonemes=phonemes,
phoneme_count=len(phonemes),
syllables=syllables,
syllable_count=len(syllables),
initial_phoneme=phonemes[0] if phonemes else None,
final_phoneme=phonemes[-1] if phonemes else None,
wcm_score=wcm,
)
def _merge_norms(words: dict[str, WordRecord], norm_data: dict[str, dict]) -> None:
"""Merge a norm dataset into word records. Creates norm-only records for new words."""
for word, props in norm_data.items():
if word not in words:
words[word] = WordRecord(word=word, has_phonology=False)
record = words[word]
for src_key, value in props.items():
dest_field = _NORM_FIELD_MAP.get(src_key)
if dest_field and hasattr(record, dest_field):
setattr(record, dest_field, value)
def build_words() -> dict[str, WordRecord]:
"""Build all word records from CMU dict + norm datasets.
Returns dict[str, WordRecord] — union of all source datasets.
"""
print("Loading CMU dict ...")
cmu_phono = cmudict_to_phono()
# Build phonological records from CMU dict
print(f" Syllabifying {len(cmu_phono):,} CMU entries ...")
words: dict[str, WordRecord] = {}
for word, phono_data in cmu_phono.items():
phono_data["word"] = word
try:
words[word] = _build_phonological_record(phono_data)
except Exception:
# Skip words that fail syllabification
continue
print(f" {len(words):,} words with phonological data")
# Load and merge all norm datasets
print("Loading norm datasets ...")
norm_loaders = [
("Warriner", load_warriner),
("Kuperman", load_kuperman),
("Glasgow", load_glasgow),
("Concreteness", load_concreteness),
("Sensorimotor", load_sensorimotor),
("Semantic Diversity", load_semantic_diversity),
("Socialness", load_socialness),
("BOI", load_boi),
("SUBTLEX", load_subtlex),
("ELP", load_elp),
("Iconicity", load_iconicity),
("Prevalence", load_prevalence),
("IPhOD", load_iphod),
("MorphoLex", load_morpholex),
("CYP-LEX", load_cyplex),
]
for name, loader in norm_loaders:
try:
data = loader()
_merge_norms(words, data)
print(f" {name}: {len(data):,} entries")
except Exception as e:
print(f" WARNING: {name} failed: {e}")
# Load vocab list memberships
print("Loading vocab lists ...")
try:
vocab_data = load_all_vocab()
for word, memberships in vocab_data.items():
if word in words:
words[word].vocab_memberships = memberships
except Exception as e:
print(f" WARNING: vocab lists failed: {e}")
norm_only = sum(1 for r in words.values() if not r.has_phonology)
print(f"Total: {len(words):,} words ({len(words) - norm_only:,} with phonology, {norm_only:,} norm-only)")
return words
- [ ] Step 4: Run test
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_words_structure -v -m slow
Expected: PASS (may take 30-60 seconds to load all datasets)
- [ ] Step 5: Commit
git add packages/data/src/phonolex_data/pipeline/words.py packages/data/tests/test_pipeline.py
git commit -m "feat: add pipeline words module — build_words() assembles all word records"
Task 12: Pipeline edges.py — Build Edge Records¶
Files:
- Create: packages/data/src/phonolex_data/pipeline/edges.py
- Test: packages/data/tests/test_pipeline.py
Context: build_edges(words) loads 7 association datasets, builds an edge index keyed by sorted word pairs, and merges multiple sources per pair. Only includes edges where both words exist in the words dict.
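The sorted-pair keying means two directional observations (cue→response and response→cue) collapse onto one undirected edge, with direction recovered by comparing the cue against the key's first element. A minimal sketch with toy strengths:

```python
def sorted_pair(w1: str, w2: str) -> tuple[str, str]:
    return (w1, w2) if w1 <= w2 else (w2, w1)

index: dict[tuple[str, str], dict] = {}
for cue, resp, strength in [("dog", "cat", 0.15), ("cat", "dog", 0.08)]:
    key = sorted_pair(cue, resp)
    edge = index.setdefault(key, {})
    # cue == key[0] means this observation runs in the "forward" direction
    direction = "forward" if cue == key[0] else "backward"
    edge[direction] = strength

print(index)  # {('cat', 'dog'): {'backward': 0.15, 'forward': 0.08}}
```

Both observations land on the single `('cat', 'dog')` record, which is the same merging that `_get_or_create()` performs for `EdgeRecord`s in Step 3.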
- [ ] Step 1: Write the test
# Add to packages/data/tests/test_pipeline.py
@pytest.mark.slow
def test_build_edges_structure():
"""Integration test — builds edges from real association data."""
from phonolex_data.pipeline.schema import WordRecord
from phonolex_data.pipeline.edges import build_edges
# Create a minimal word dict with known words
words = {
"cat": WordRecord(word="cat", has_phonology=True),
"dog": WordRecord(word="dog", has_phonology=True),
"happy": WordRecord(word="happy", has_phonology=True),
"sad": WordRecord(word="sad", has_phonology=True),
"old": WordRecord(word="old", has_phonology=True),
"new": WordRecord(word="new", has_phonology=True),
}
edges = build_edges(words)
assert isinstance(edges, list)
    # cat-dog, happy-sad, and old-new are very common association pairs,
    # so SWOW/USF should yield at least one edge among these 6 words
    assert len(edges) > 0
# Check EdgeRecord structure
if edges:
edge = edges[0]
assert hasattr(edge, "source")
assert hasattr(edge, "target")
assert hasattr(edge, "edge_sources")
assert isinstance(edge.edge_sources, list)
assert len(edge.edge_sources) > 0
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_edges_structure -v -m slow
Expected: FAIL
- [ ] Step 3: Implement
build_edges()
Create packages/data/src/phonolex_data/pipeline/edges.py:
"""Assemble edge records from association datasets."""
from __future__ import annotations
from phonolex_data.loaders import (
load_swow,
load_free_association,
load_simlex,
load_men,
load_wordsim,
load_spp,
load_eccc,
)
from phonolex_data.pipeline.schema import EdgeRecord, WordRecord
def _sorted_pair(w1: str, w2: str) -> tuple[str, str]:
return (w1, w2) if w1 <= w2 else (w2, w1)
def _get_or_create(
index: dict[tuple[str, str], EdgeRecord],
w1: str, w2: str,
) -> EdgeRecord:
key = _sorted_pair(w1, w2)
if key not in index:
index[key] = EdgeRecord(source=key[0], target=key[1])
return index[key]
def build_edges(words: dict[str, WordRecord]) -> list[EdgeRecord]:
"""Build all edge records from 7 association datasets.
Only includes edges where both words exist in the words dict.
"""
index: dict[tuple[str, str], EdgeRecord] = {}
def _in_vocab(w: str) -> bool:
return w in words
# 1. SWOW
print("Loading SWOW ...")
try:
swow = load_swow()
for cue, responses in swow.items():
if not _in_vocab(cue):
continue
for response, strength in responses.items():
if _in_vocab(response) and cue != response:
edge = _get_or_create(index, cue, response)
if "SWOW" not in edge.edge_sources:
edge.edge_sources.append("SWOW")
edge.swow_strength = strength
print(f" SWOW: {sum(1 for e in index.values() if 'SWOW' in e.edge_sources):,} edges")
except Exception as e:
print(f" WARNING: SWOW failed: {e}")
# 2. USF (Free Association)
print("Loading USF ...")
try:
usf = load_free_association()
for cue, targets in usf.items():
if not _in_vocab(cue):
continue
for target, strength in targets.items():
if _in_vocab(target) and cue != target:
edge = _get_or_create(index, cue, target)
if "USF" not in edge.edge_sources:
edge.edge_sources.append("USF")
# Store directional: if cue < target, it's forward
key = _sorted_pair(cue, target)
if cue == key[0]:
edge.usf_forward = strength
else:
edge.usf_backward = strength
print(f" USF: {sum(1 for e in index.values() if 'USF' in e.edge_sources):,} edges")
except Exception as e:
print(f" WARNING: USF failed: {e}")
# 3. SimLex
print("Loading SimLex ...")
try:
simlex = load_simlex()
for w1, w2, score, pos in simlex:
if _in_vocab(w1) and _in_vocab(w2):
edge = _get_or_create(index, w1, w2)
if "SimLex" not in edge.edge_sources:
edge.edge_sources.append("SimLex")
edge.simlex_similarity = score
edge.simlex_pos = pos
print(f" SimLex: {sum(1 for e in index.values() if 'SimLex' in e.edge_sources):,} edges")
except Exception as e:
print(f" WARNING: SimLex failed: {e}")
# 4. MEN
print("Loading MEN ...")
try:
men = load_men()
for w1, w2, score in men:
if _in_vocab(w1) and _in_vocab(w2):
edge = _get_or_create(index, w1, w2)
if "MEN" not in edge.edge_sources:
edge.edge_sources.append("MEN")
edge.men_relatedness = score
print(f" MEN: {sum(1 for e in index.values() if 'MEN' in e.edge_sources):,} edges")
except Exception as e:
print(f" WARNING: MEN failed: {e}")
# 5. WordSim-353
print("Loading WordSim ...")
try:
wordsim = load_wordsim()
for w1, w2, score in wordsim:
if _in_vocab(w1) and _in_vocab(w2):
edge = _get_or_create(index, w1, w2)
if "WordSim" not in edge.edge_sources:
edge.edge_sources.append("WordSim")
edge.wordsim_relatedness = score
print(f" WordSim: {sum(1 for e in index.values() if 'WordSim' in e.edge_sources):,} edges")
except Exception as e:
print(f" WARNING: WordSim failed: {e}")
# 6. SPP
print("Loading SPP ...")
try:
spp = load_spp()
for target, prime, first_priming, other_priming, fas, lsa in spp:
if _in_vocab(target) and _in_vocab(prime):
edge = _get_or_create(index, target, prime)
if "SPP" not in edge.edge_sources:
edge.edge_sources.append("SPP")
edge.spp_first_priming = first_priming
edge.spp_other_priming = other_priming
edge.spp_fas = fas
edge.spp_lsa = lsa
print(f" SPP: {sum(1 for e in index.values() if 'SPP' in e.edge_sources):,} edges")
except Exception as e:
print(f" WARNING: SPP failed: {e}")
# 7. ECCC
print("Loading ECCC ...")
try:
eccc = load_eccc()
for target, confusion, consistency, n_instances, phon_dist in eccc:
if _in_vocab(target) and _in_vocab(confusion):
edge = _get_or_create(index, target, confusion)
if "ECCC" not in edge.edge_sources:
edge.edge_sources.append("ECCC")
edge.eccc_consistency = consistency
edge.eccc_n_instances = n_instances
edge.eccc_phoneme_distance = phon_dist
print(f" ECCC: {sum(1 for e in index.values() if 'ECCC' in e.edge_sources):,} edges")
except Exception as e:
print(f" WARNING: ECCC failed: {e}")
result = list(index.values())
print(f"Total: {len(result):,} unique edges")
return result
- [ ] Step 4: Run test
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_edges_structure -v -m slow
Expected: PASS
- [ ] Step 5: Commit
git add packages/data/src/phonolex_data/pipeline/edges.py packages/data/tests/test_pipeline.py
git commit -m "feat: add pipeline edges module — build_edges() merges 7 association datasets"
Chunk 4: Pipeline Derived + Orchestrator + Consumer Rewrite¶
Task 13: Pipeline derived.py — Compute Derived Data¶
Files:
- Create: packages/data/src/phonolex_data/pipeline/derived.py
- Test: packages/data/tests/test_pipeline.py
Context: Replicates sections 3-8 of export-to-d1.py: percentiles, phoneme index, syllable components, phoneme dot products, minimal pairs, property ranges. Operates only on has_phonology=True words for phonological computations. Percentiles computed across all words that have each property.
Reference: Read packages/web/workers/scripts/export-to-d1.py lines 176-376 for the exact logic to replicate.
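The percentile scheme being replicated — rank a value against the sorted observations with `bisect_right` (ties rank high), scaled to 0-100 — can be checked on toy data before wiring it to real norms:

```python
import bisect

# Sorted observed values for one property (toy data)
values = sorted([10.0, 50.0, 100.0])

def percentile(val: float) -> float:
    """Fraction of observations <= val, as a 0-100 percentile rounded to 0.1."""
    upper = bisect.bisect_right(values, val)
    return round(upper / len(values) * 100, 1)

print(percentile(10.0))   # 33.3
print(percentile(50.0))   # 66.7
print(percentile(100.0))  # 100.0
```

Because the arrays are sorted once per property and each lookup is O(log n), this stays cheap even across 40+ properties and 100K+ words.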
- [ ] Step 1: Write the test
# Add to packages/data/tests/test_pipeline.py
def test_build_derived_with_minimal_data():
"""Unit test with synthetic data — no file I/O."""
from phonolex_data.pipeline.schema import WordRecord, DerivedData
from phonolex_data.pipeline.derived import build_derived
words = {
"cat": WordRecord(
word="cat", has_phonology=True, ipa="kæt",
phonemes=["k", "æ", "t"], phoneme_count=3,
syllables=[{"onset": ["k"], "nucleus": "æ", "coda": ["t"], "stress": 1}],
syllable_count=1, frequency=100.0,
),
"bat": WordRecord(
word="bat", has_phonology=True, ipa="bæt",
phonemes=["b", "æ", "t"], phoneme_count=3,
syllables=[{"onset": ["b"], "nucleus": "æ", "coda": ["t"], "stress": 1}],
syllable_count=1, frequency=50.0,
),
"rare": WordRecord(word="rare", has_phonology=False, frequency=10.0),
}
# Minimal PHOIBLE vectors for testing
phoible = {
"76d": {
"k": [1.0] * 76,
"b": [0.5] * 76,
"æ": [0.0] * 76,
"t": [-0.5] * 76,
},
"feature_names": [f"f{i}" for i in range(38)],
}
derived = build_derived(words, phoible)
assert isinstance(derived, DerivedData)
# Percentiles should be computed for all words with frequency
assert "cat" in derived.percentiles
assert "rare" in derived.percentiles # norm-only words get percentiles too
assert "frequency_percentile" in derived.percentiles["cat"]
# Minimal pairs: cat-bat differ in position 0 (k vs b)
assert len(derived.minimal_pairs) >= 1
mp = derived.minimal_pairs[0]
assert mp[0] in ("bat", "cat") # sorted
assert mp[1] in ("bat", "cat")
# Components extracted from phonology words only
assert len(derived.components) > 0
assert "cat" in derived.word_syllable_data
assert "rare" not in derived.word_syllable_data # norm-only excluded
# Property ranges
assert "frequency" in derived.property_ranges
- [ ] Step 2: Run test to verify it fails
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_derived_with_minimal_data -v
Expected: FAIL
- [ ] Step 3: Implement
build_derived()
Create packages/data/src/phonolex_data/pipeline/derived.py:
"""Compute derived data: percentiles, minimal pairs, phoneme dots, syllable components."""
from __future__ import annotations
import bisect
import numpy as np
from phonolex_data.pipeline.schema import DerivedData, WordRecord
# Properties eligible for percentile computation.
# IMPORTANT: This list MUST match FILTERABLE_PROPERTIES in
# packages/web/workers/scripts/config.py — keep them in sync.
# These are defined here because config.py lives in a different package
# (workers/scripts) and cannot be cleanly imported from packages/data.
PERCENTILE_PROPERTIES = (
"syllable_count", "phoneme_count", "wcm_score",
"phono_prob_avg", "positional_prob_avg",
"neighborhood_density", "str_phono_prob_avg", "str_positional_prob_avg",
"str_neighborhood_density",
"frequency", "log_frequency", "contextual_diversity", "prevalence",
"aoa", "aoa_kuperman",
"elp_lexical_decision_rt",
"imageability", "familiarity", "concreteness", "size",
"valence", "arousal", "dominance",
"iconicity", "boi", "socialness", "semantic_diversity",
"auditory", "visual", "haptic", "gustatory", "olfactory", "interoceptive",
"hand_arm", "foot_leg", "head", "mouth", "torso",
"morpheme_count", "n_prefixes", "n_suffixes",
"freq_cyplex_7_9", "freq_cyplex_10_12", "freq_cyplex_13",
)
VOWEL_IPA = {
"i", "ɪ", "e", "ɛ", "æ", "ɑ", "ɔ", "o", "ʊ", "u",
"ʌ", "ə", "ɚ", "ɝ", "eɪ", "aɪ", "ɔɪ", "aʊ", "oʊ",
}
def _compute_percentiles(
words: dict[str, WordRecord],
) -> dict[str, dict[str, float | None]]:
"""Compute percentiles for all percentile-eligible properties."""
# Build sorted arrays
sorted_arrays: dict[str, list[float]] = {}
for prop_id in PERCENTILE_PROPERTIES:
values = []
for record in words.values():
val = getattr(record, prop_id, None)
if val is not None:
values.append(float(val))
values.sort()
sorted_arrays[prop_id] = values
# Compute per-word percentiles
result: dict[str, dict[str, float | None]] = {}
for word, record in words.items():
pcts: dict[str, float | None] = {}
for prop_id in PERCENTILE_PROPERTIES:
val = getattr(record, prop_id, None)
sorted_vals = sorted_arrays.get(prop_id, [])
if val is not None and sorted_vals:
upper = bisect.bisect_right(sorted_vals, float(val))
pcts[f"{prop_id}_percentile"] = round((upper / len(sorted_vals)) * 100, 1)
else:
pcts[f"{prop_id}_percentile"] = None
result[word] = pcts
return result
def _compute_phoneme_data(
phoible_vectors: dict,
) -> tuple[dict[str, dict], dict[str, float], list[tuple[str, str, float]]]:
"""Extract phoneme data, norms, and pairwise dot products from PHOIBLE vectors."""
phoible_76d = phoible_vectors.get("76d", {})
feature_names = phoible_vectors.get("feature_names", [])
# Phoneme index
phonemes_data: dict[str, dict] = {}
for ipa, vec76 in phoible_76d.items():
ptype = "vowel" if ipa in VOWEL_IPA else "consonant"
features = {}
for i, fname in enumerate(feature_names):
idx = i * 2
if idx + 1 < len(vec76):
val = vec76[idx]
if val > 0.5:
features[fname] = "+"
elif val < -0.5:
features[fname] = "-"
else:
features[fname] = "0"
phonemes_data[ipa] = {"type": ptype, "features": features}
# Phoneme norms (norm_sq)
phoneme_norms: dict[str, float] = {}
for ipa, vec76 in phoible_76d.items():
v = np.array(vec76[:76], dtype=np.float32)
phoneme_norms[ipa] = float(np.dot(v, v))
# Pairwise dot products
phoneme_ipa_list = sorted(phoible_76d.keys())
phoneme_dots: list[tuple[str, str, float]] = []
for i, ipa1 in enumerate(phoneme_ipa_list):
v1 = np.array(phoible_76d[ipa1][:76], dtype=np.float32)
for j in range(i + 1, len(phoneme_ipa_list)):
ipa2 = phoneme_ipa_list[j]
v2 = np.array(phoible_76d[ipa2][:76], dtype=np.float32)
dot = float(np.dot(v1, v2))
if dot != 0.0:
phoneme_dots.append((ipa1, ipa2, dot))
return phonemes_data, phoneme_norms, phoneme_dots
def _extract_syllable_components(
words: dict[str, WordRecord],
) -> tuple[list[dict], dict, dict]:
"""Extract unique syllable components and word-syllable mappings.
Only operates on words with has_phonology=True.
"""
component_keys: set[tuple[str, tuple[str, ...]]] = set()
word_syllable_data: dict[str, list[dict]] = {}
for word, record in words.items():
if not record.has_phonology or not record.syllables:
continue
word_syls = []
for syl in record.syllables:
onset_key = ("onset", tuple(syl.get("onset", [])))
component_keys.add(onset_key)
nuc_ipa = syl.get("nucleus", "")
nuc_key = ("nucleus", (nuc_ipa,) if nuc_ipa else ())
component_keys.add(nuc_key)
coda_key = ("coda", tuple(syl.get("coda", [])))
component_keys.add(coda_key)
word_syls.append({
"onset_key": onset_key,
"nucleus_key": nuc_key,
"coda_key": coda_key,
})
word_syllable_data[word] = word_syls
# Assign IDs
component_key_to_id: dict[tuple, int] = {}
component_list: list[dict] = []
for i, key in enumerate(sorted(component_keys)):
ctype, phons = key
component_key_to_id[key] = i
component_list.append({"id": i, "type": ctype, "phonemes": list(phons)})
return component_list, word_syllable_data, component_key_to_id
def _compute_minimal_pairs(
words: dict[str, WordRecord],
) -> list[tuple[str, str, str, str, int, str]]:
"""Precompute minimal pairs — only for words with has_phonology=True."""
by_length: dict[int, list[str]] = {}
for word, record in words.items():
if not record.has_phonology:
continue
length = record.phoneme_count or 0
if length >= 2:
by_length.setdefault(length, []).append(word)
minimal_pairs: list[tuple[str, str, str, str, int, str]] = []
for length, word_list in by_length.items():
word_list.sort()
for i in range(len(word_list)):
w1 = word_list[i]
p1 = words[w1].phonemes
for j in range(i + 1, len(word_list)):
w2 = word_list[j]
p2 = words[w2].phonemes
diff_count = 0
diff_pos = -1
diff_p1 = ""
diff_p2 = ""
for k in range(length):
if p1[k] != p2[k]:
diff_count += 1
diff_pos = k
diff_p1 = p1[k]
diff_p2 = p2[k]
if diff_count > 1:
break
if diff_count != 1:
continue
if diff_pos == 0:
pos_type = "initial"
elif diff_pos == length - 1:
pos_type = "final"
else:
pos_type = "medial"
minimal_pairs.append((w1, w2, diff_p1, diff_p2, diff_pos, pos_type))
return minimal_pairs
def _compute_property_ranges(
words: dict[str, WordRecord],
) -> dict[str, tuple[float, float]]:
"""Compute min/max ranges for all filterable properties."""
ranges: dict[str, tuple[float, float]] = {}
for prop_id in PERCENTILE_PROPERTIES:
values = [
getattr(r, prop_id) for r in words.values()
if getattr(r, prop_id, None) is not None
]
if values:
ranges[prop_id] = (min(values), max(values))
else:
ranges[prop_id] = (0, 0)
return ranges
def build_derived(
words: dict[str, WordRecord],
phoible_vectors: dict,
) -> DerivedData:
"""Compute all derived data from word records and PHOIBLE vectors."""
print("Computing percentiles ...")
percentiles = _compute_percentiles(words)
print("Building phoneme index + dot products ...")
phoneme_data, phoneme_norms, phoneme_dots = _compute_phoneme_data(phoible_vectors)
print("Extracting syllable components ...")
components, word_syllable_data, component_key_to_id = _extract_syllable_components(words)
print("Precomputing minimal pairs ...")
minimal_pairs = _compute_minimal_pairs(words)
print("Computing property ranges ...")
property_ranges = _compute_property_ranges(words)
print(f" {len(phoneme_data)} phonemes, {len(phoneme_dots):,} dot products")
print(f" {len(components)} components, {len(minimal_pairs):,} minimal pairs")
return DerivedData(
percentiles=percentiles,
minimal_pairs=minimal_pairs,
phoneme_data=phoneme_data,
phoneme_norms=phoneme_norms,
phoneme_dots=phoneme_dots,
components=components,
word_syllable_data=word_syllable_data,
component_key_to_id=component_key_to_id,
property_ranges=property_ranges,
)
- [ ] Step 4: Run test
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_derived_with_minimal_data -v
Expected: PASS
- [ ] Step 5: Commit
git add packages/data/src/phonolex_data/pipeline/derived.py packages/data/tests/test_pipeline.py
git commit -m "feat: add pipeline derived module — percentiles, minimal pairs, phoneme dots, syllable components"
Task 14: Pipeline Orchestrator + Integration Test¶
Files:
- Modify: packages/data/src/phonolex_data/pipeline/__init__.py
- Test: packages/data/tests/test_pipeline.py
- [ ] Step 1: Write the integration test
# Add to packages/data/tests/test_pipeline.py
@pytest.mark.slow
def test_build_lexical_database_integration():
"""Full integration test — builds entire database from raw datasets."""
from phonolex_data.pipeline import build_lexical_database
db = build_lexical_database()
# Word count should be >100K (union of all datasets)
assert len(db.words) > 100000
# Should have both phonology and norm-only words
phono_count = sum(1 for r in db.words.values() if r.has_phonology)
norm_only_count = sum(1 for r in db.words.values() if not r.has_phonology)
assert phono_count > 100000 # CMU dict has ~134K
assert norm_only_count > 0
# Edges
assert len(db.edges) > 0
# Derived data
assert len(db.derived.percentiles) == len(db.words)
assert len(db.derived.minimal_pairs) > 0
assert len(db.derived.phoneme_data) > 0
assert len(db.derived.phoneme_dots) > 0
assert len(db.derived.components) > 0
assert len(db.derived.word_syllable_data) > 0
# PHOIBLE vectors
assert "76d" in db.phoible_vectors
- [ ] Step 2: Implement orchestrator
Update packages/data/src/phonolex_data/pipeline/__init__.py:
"""Integrated lexical database pipeline.

Usage:
    from phonolex_data.pipeline import build_lexical_database

    db = build_lexical_database()
"""
from __future__ import annotations

from phonolex_data.loaders import load_phoible
from phonolex_data.pipeline.edges import build_edges
from phonolex_data.pipeline.derived import build_derived
from phonolex_data.pipeline.schema import LexicalDatabase
from phonolex_data.pipeline.words import build_words


def build_lexical_database() -> LexicalDatabase:
    """Build the complete integrated lexical database from raw datasets."""
    phoible = load_phoible()
    words = build_words()
    edges = build_edges(words)
    derived = build_derived(words, phoible)
    return LexicalDatabase(
        words=words,
        edges=edges,
        derived=derived,
        phoible_vectors=phoible,
    )
- [ ] Step 3: Run integration test
Run: cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/test_pipeline.py::test_build_lexical_database_integration -v -m slow
Expected: PASS (may take several minutes)
- [ ] Step 4: Commit
git add packages/data/src/phonolex_data/pipeline/__init__.py packages/data/tests/test_pipeline.py
git commit -m "feat: add pipeline orchestrator — build_lexical_database() assembles everything"
Task 15: Update config.py — New Properties + Edge Types¶
Files:
- Modify: packages/web/workers/scripts/config.py
Context: Add PropertyDefs for new properties (neighborhood_density, stressed IPhOD variants, CYP-LEX frequencies, morphological_segmentation, semantic_diversity, log_frequency). Update ECCC and SPP edge type strength_keys to match new EdgeRecord field names. Remove passes_vocabulary_filter() — no longer needed.
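For orientation, the PropertyDef snippets in this task assume roughly the following field set. This is a reconstruction inferred from the literals below — verify against the real dataclass in config.py before adding entries:

```python
from dataclasses import dataclass

# Hypothetical sketch of the PropertyDef shape assumed by this task's
# snippets; the actual dataclass in config.py is authoritative.
@dataclass(frozen=True)
class PropertyDef:
    id: str
    label: str
    short_label: str
    source: str
    description: str
    scale: str
    interpretation: str
    display_format: str
    slider_step: float
    is_integer: bool = False

nd = PropertyDef(
    id="neighborhood_density",
    label="Neighborhood Density",
    short_label="ND",
    source="IPhOD2 (Vaden et al., 2009)",
    description="Number of phonological neighbors",
    scale="0-50+",
    interpretation="Higher = more similar-sounding words",
    display_format=".0f",
    slider_step=1,
    is_integer=True,
)
```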
- [ ] Step 1: Add new property definitions
Add to config.py in the appropriate categories:
In PHONOTACTIC_PROBABILITY, add after positional_prob_avg:
PropertyDef(
    id="neighborhood_density",
    label="Neighborhood Density",
    short_label="ND",
    source="IPhOD2 (Vaden et al., 2009)",
    description="Number of phonological neighbors",
    scale="0-50+",
    interpretation="Higher = more similar-sounding words",
    display_format=".0f",
    slider_step=1,
    is_integer=True,
),
PropertyDef(
    id="str_phono_prob_avg",
    label="Biphone Probability — Stressed (Avg)",
    short_label="sBPP",
    source="IPhOD2 (Vaden et al., 2009)",
    description="Mean biphone probability with stress marking",
    scale="0-1",
    interpretation="Higher = more typical stressed sound sequences",
    display_format=".4f",
    slider_step=0.001,
),
PropertyDef(
    id="str_positional_prob_avg",
    label="Positional Segment Probability — Stressed (Avg)",
    short_label="sPSP",
    source="IPhOD2 (Vaden et al., 2009)",
    description="Mean positional segment probability with stress marking",
    scale="0-1",
    interpretation="Higher = more common stressed phonemes in those positions",
    display_format=".4f",
    slider_step=0.001,
),
PropertyDef(
    id="str_neighborhood_density",
    label="Neighborhood Density — Stressed",
    short_label="sND",
    source="IPhOD2 (Vaden et al., 2009)",
    description="Phonological neighbors accounting for stress",
    scale="0-50+",
    interpretation="Higher = more stress-matched neighbors",
    display_format=".0f",
    slider_step=1,
    is_integer=True,
),
Update source citations for phono_prob_avg and positional_prob_avg from "Vitevitch & Luce (2004)" to "IPhOD2 (Vaden et al., 2009)".
Add a new CHILD_FREQUENCY category after MORPHOLOGICAL_PROPERTIES:
CHILD_FREQUENCY = PropertyCategory(
    id="child_frequency",
    label="Child Frequency (CYP-LEX)",
    properties=(
        PropertyDef(
            id="freq_cyplex_7_9",
            label="Child Frequency (Age 7-9)",
            short_label="CF7",
            source="CYP-LEX (Sheridan & Jakobson, 2019)",
            description="Zipf-scale word frequency in children's media, ages 7-9",
            scale="0-7",
            interpretation="Higher = more frequent in age-band media",
            display_format=".2f",
            slider_step=0.1,
        ),
        PropertyDef(
            id="freq_cyplex_10_12",
            label="Child Frequency (Age 10-12)",
            short_label="CF10",
            source="CYP-LEX (Sheridan & Jakobson, 2019)",
            description="Zipf-scale word frequency in children's media, ages 10-12",
            scale="0-7",
            interpretation="Higher = more frequent in age-band media",
            display_format=".2f",
            slider_step=0.1,
        ),
        PropertyDef(
            id="freq_cyplex_13",
            label="Child Frequency (Age 13+)",
            short_label="CF13",
            source="CYP-LEX (Sheridan & Jakobson, 2019)",
            description="Zipf-scale word frequency in children's media, ages 13+",
            scale="0-7",
            interpretation="Higher = more frequent in age-band media",
            display_format=".2f",
            slider_step=0.1,
        ),
    ),
)
Add CHILD_FREQUENCY to PROPERTY_CATEGORIES tuple.
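For reference, the 0-7 Zipf scale named in these descriptions is commonly defined (van Heuven et al., 2014) as log10 of occurrences per million tokens, plus 3. Whether load_cyplex() computes this itself or reads precomputed values from the dataset is a loader decision; the conversion, as an illustrative sketch:

```python
import math

def zipf_scale(count: int, corpus_tokens: int) -> float:
    """Zipf value = log10(occurrences per million tokens) + 3.

    Roughly: ~1 = very rare, ~7 = extremely frequent.
    """
    per_million = count / corpus_tokens * 1_000_000
    return math.log10(per_million) + 3

# A word occurring 1,000 times in a 1M-token corpus sits at Zipf 6.0
```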
Also add missing PropertyDefs for existing-but-unfilterable properties that are now in PROPERTY_COLUMNS. If semantic_diversity and log_frequency are already defined in config.py, verify they're correct. If not, add them to the appropriate categories:
- log_frequency → FREQUENCY category (alongside frequency and contextual_diversity)
- semantic_diversity → PSYCHOLINGUISTIC or a new category
Note: morphological_segmentation is a TEXT column, not a numeric filter — it does NOT need a PropertyDef or slider. It's just a display column.
- [ ] Step 2: Update edge type definitions
Update SPP and ECCC in EDGE_TYPES:
"ECCC": {
    "label": "Perceptual Confusability (ECCC)",
    "description": "Words confused by listeners in noise",
    "strength_key": "eccc_consistency",
},
"SPP": {
    "label": "Semantic Priming (SPP)",
    "description": "Priming effects in lexical decision and naming",
    "strength_key": "spp_first_priming",
},
- [ ] Step 3: Remove passes_vocabulary_filter()
Delete the entire passes_vocabulary_filter() function — it is no longer used.
- [ ] Step 4: Commit
git add packages/web/workers/scripts/config.py
git commit -m "feat: add new property definitions for IPhOD, CYP-LEX; update edge type keys"
Task 16: Rewrite export-to-d1.py¶
Files:
- Modify: packages/web/workers/scripts/export-to-d1.py
Context: Replace the entire pickle-based pipeline with a thin SQL writer that calls build_lexical_database(). Keep the SQL generation format identical (same table structure, batch INSERT). Key changes: nullable phonological columns, has_phonology flag, new property columns, updated edge columns (SPP/ECCC renamed), SQL NULLs for norm-only words.
- [ ] Step 1: Rewrite export-to-d1.py
Replace the entire file. The new version:
1. Imports build_lexical_database from the pipeline
2. Calls it to get the LexicalDatabase
3. Writes SQL using the same batch INSERT pattern
4. No pickle, no WCM computation, no norm loading — all in the pipeline
Key structural changes from the old version:
- GRAPH_PATH → deleted
- compute_wcm() → deleted (in pipeline)
- passes_vocabulary_filter() → deleted (no filtering)
- PROPERTY_COLUMNS → updated to include new columns. Note: has_phonology goes in the base columns list (alongside word, ipa, etc.), NOT in PROPERTY_COLUMNS
- words table schema: phonological columns become nullable, add has_phonology INTEGER NOT NULL DEFAULT 1
- edges table schema: spp_priming_z/short/long → spp_first_priming/other_priming, eccc_confusability → eccc_consistency
- INSERT logic: use sql_val(None) (→ NULL) for norm-only word phonological fields
The implementer MUST read the existing export-to-d1.py thoroughly before rewriting. Preserve:
- sql_val() and sql_json() helpers
- Batch size of 20
- Same INSERT format
- Same table ordering (words, edges, minimal_pairs, phonemes, phoneme_dots, components, word_syllables, metadata)
- Same index creation
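The preserved sql_val() helper is the natural home for the new NULL handling. A representative shape, purely illustrative — the existing helper in export-to-d1.py is authoritative and should only be extended, not replaced:

```python
def sql_val(v):
    """Render a Python value as a SQL literal; None maps to NULL.

    Illustrative stand-in for the existing export-to-d1.py helper.
    """
    if v is None:
        return "NULL"  # norm-only words get NULL phonological fields
    if isinstance(v, bool):  # check bool before int: bool subclasses int
        return "1" if v else "0"
    if isinstance(v, (int, float)):
        return str(v)
    # Escape single quotes for SQL string literals
    return "'" + str(v).replace("'", "''") + "'"
```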
Additional guidance:
- The word_syllables INSERT must iterate db.derived.word_syllable_data (which already excludes norm-only words) — do NOT loop over all words for syllable rows
- The old is_clean_edge() filter (rejects SWOW edges with tabs/newlines/length>100) is no longer needed in the exporter — verify that load_swow() in packages/data/src/phonolex_data/loaders/associations.py already handles this, or add filtering there
- For word INSERTs, access WordRecord fields via getattr(record, field_name) instead of dict .get()
The new PROPERTY_COLUMNS list (the existing columns plus the new additions):
PROPERTY_COLUMNS = [
    "wcm_score", "frequency", "log_frequency",
    "contextual_diversity", "prevalence", "aoa",
    "aoa_kuperman", "elp_lexical_decision_rt",
    "phono_prob_avg", "positional_prob_avg",
    "neighborhood_density",
    "str_phono_prob_avg", "str_positional_prob_avg", "str_neighborhood_density",
    "imageability", "familiarity", "concreteness", "size",
    "valence", "arousal", "dominance",
    "iconicity", "boi", "socialness",
    "semantic_diversity",
    "auditory", "visual", "haptic",
    "gustatory", "olfactory", "interoceptive",
    "hand_arm", "foot_leg", "head", "mouth", "torso",
    "morpheme_count", "is_monomorphemic",
    "n_prefixes", "n_suffixes",
    "morphological_segmentation",
    "freq_cyplex_7_9", "freq_cyplex_10_12", "freq_cyplex_13",
]
The new edge columns:
EDGE_COLUMNS = [
    "source", "target", "edge_sources",
    "swow_strength", "usf_forward", "usf_backward",
    "men_relatedness",
    "eccc_consistency", "eccc_n_instances", "eccc_phoneme_distance",
    "spp_first_priming", "spp_other_priming",
    "spp_fas", "spp_lsa",
    "simlex_similarity", "simlex_pos",
    "wordsim_relatedness",
]
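Put together, the per-table write loop can be sketched as follows. The helper names here are hypothetical — the existing exporter's batch pattern is what must actually be preserved:

```python
BATCH = 20  # same batch size as the existing exporter

def sql_val(v):
    """Minimal stand-in for the existing helper (None -> NULL)."""
    if v is None:
        return "NULL"
    if isinstance(v, (int, float)):
        return str(v)
    return "'" + str(v).replace("'", "''") + "'"

def word_inserts(records, columns):
    """Yield batched multi-row INSERTs, reading fields via getattr."""
    col_list = ", ".join(columns)
    batch = []
    for rec in records:
        # Missing attributes fall back to None, which sql_val renders as NULL
        vals = ", ".join(sql_val(getattr(rec, c, None)) for c in columns)
        batch.append(f"({vals})")
        if len(batch) == BATCH:
            yield f"INSERT INTO words ({col_list}) VALUES {', '.join(batch)};"
            batch = []
    if batch:
        yield f"INSERT INTO words ({col_list}) VALUES {', '.join(batch)};"
```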
- [ ] Step 2: Run end-to-end
Run: cd /Users/jneumann/Repos/PhonoLex && uv run python packages/web/workers/scripts/export-to-d1.py
Expected: Produces packages/web/workers/scripts/d1-seed.sql with >100K words
- [ ] Step 3: Verify SQL structure
head -100 /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql
grep -c "INSERT INTO words" /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql
grep "has_phonology" /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql | head -5
- [ ] Step 4: Commit
git add packages/web/workers/scripts/export-to-d1.py
git commit -m "feat: rewrite export-to-d1.py to use pipeline — eliminates pickle dependency"
Task 17: Update TypeScript Types + Properties¶
Files:
- Modify: packages/web/workers/src/types.ts
- Modify: packages/web/workers/src/config/properties.ts
Context: Make phonological fields nullable in WordRow and WordResponse. Add has_phonology field. Add new property columns. Add new edge columns. Mirror the config.py property changes in properties.ts.
- [ ] Step 1: Update types.ts
In WordRow, change:
ipa: string; → ipa: string | null;
phonemes: string; → phonemes: string | null;
phonemes_str: string; → phonemes_str: string | null;
syllables: string; → syllables: string | null;
phoneme_count: number; → phoneme_count: number | null;
syllable_count: number; → syllable_count: number | null;
Add: has_phonology: number; (0 or 1)
In WordResponse, change:
ipa: string; → ipa: string | null;
phonemes: string[]; → phonemes: string[] | null;
syllables: ...; → syllables: ... | null;
phoneme_count: number; → phoneme_count: number | null;
syllable_count: number; → syllable_count: number | null;
Add: has_phonology: boolean;
In EdgeRow, update SPP and ECCC columns to match new names:
- eccc_confusability → eccc_consistency
- spp_priming_z → spp_first_priming
- spp_priming_short → spp_other_priming
- Remove spp_priming_long
- Add: spp_fas, spp_lsa, eccc_n_instances, eccc_phoneme_distance
Also update EdgeResponse — it has the same old field names and must be updated identically.
After updating types, grep for all usages of the old field names across the entire packages/web/workers/src/ directory:
grep -rn "eccc_confusability\|spp_priming_z\|spp_priming_short\|spp_priming_long" packages/web/workers/src/
- [ ] Step 2: Update properties.ts
Add the same new PropertyDefs as in config.py Task 15: neighborhood_density, stressed IPhOD variants, CYP-LEX frequencies. Add the new child_frequency category. Update phono_prob/positional_prob source to IPhOD2.
- [ ] Step 3: Verify TypeScript compiles
Run: cd /Users/jneumann/Repos/PhonoLex/packages/web/workers && npx tsc --noEmit
Expected: No errors (or fix any errors that arise from nullable changes)
- [ ] Step 4: Commit
git add packages/web/workers/src/types.ts packages/web/workers/src/config/properties.ts
git commit -m "feat: update TypeScript types for nullable phonology + new properties"
Task 18: End-to-End Verification¶
Files: None created — verification only.
- [ ] Step 1: Run full pipeline
cd /Users/jneumann/Repos/PhonoLex && uv run python packages/web/workers/scripts/export-to-d1.py
Expected: Produces d1-seed.sql with stats printed (word count, edge count, etc.)
- [ ] Step 2: Verify word counts
# Count INSERT INTO words statements (each has up to 20 rows)
grep -c "INSERT INTO words" /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql
# Spot-check has_phonology values (grep -c piped to head would only ever print one line)
grep "has_phonology" /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql | head -5
# Check file size (should be significantly larger than before)
ls -lh /Users/jneumann/Repos/PhonoLex/packages/web/workers/scripts/d1-seed.sql
- [ ] Step 3: Seed local D1 and verify
cd /Users/jneumann/Repos/PhonoLex/packages/web/workers
npx wrangler d1 execute phonolex --local --file scripts/d1-seed.sql
npx wrangler dev
Test in browser:
- Custom Word Lists: filter by frequency, verify results include both phonology and norm-only words
- Lookup: search for a word with phonology (e.g., "cat") — should show full data
- Lookup: search for a norm-only word — should show available norms, no phonological data
- Similarity: verify still works (only phonology words)
- Text Analysis: paste text, verify percentile stats
- Contrastive Sets: verify minimal pairs load
- [ ] Step 4: Run all existing tests
cd /Users/jneumann/Repos/PhonoLex/packages/web/workers && npm test
cd /Users/jneumann/Repos/PhonoLex && uv run pytest packages/data/tests/ -v
- [ ] Step 5: Verify no pickle references remain
grep -r "pickle" /Users/jneumann/Repos/PhonoLex/packages/ --include="*.py" --include="*.ts" -l
grep -r "cognitive_graph" /Users/jneumann/Repos/PhonoLex/packages/ --include="*.py" --include="*.ts" -l
- [ ] Step 6: Commit any fixes
git add -A
git commit -m "fix: address issues found in end-to-end verification"