Corpus DEP Reannotation + Selectional Preference Population: Design (PHON-94)
Date: 2026-05-06
Status: Spec — pending user review
Branch: off release/v5.2.0 (working branch named at writing-plans handoff)
Tickets: PHON-94 (https://neumannsworkshop.atlassian.net/browse/PHON-94)
Predecessors: PHON-93 (runtime word-data layer, merged PR #85), PHON-92 (selectional-preference research-spike memo at packages/generation/research/2026-05-05-phon-92-selectional-preference/memo.md on research/phon-92-selectional-preference-spike), PHON-72 (FineWeb-Edu freq+POS), PHON-87/88 (CHILDES + grade-banded frequency)
Sibling tickets: PHON-95 (MLM iterative editor + argstruc CFG enumerator — was original PHON-93 scope before rescope; consumes selectional.parquet at enumeration time)
Problem
PHON-93 shipped the runtime word-data layer with three canonical Parquet artifacts. Two are populated; one is empty:
- data/runtime/words.parquet: populated (125K phonology-covered words × 155 cols)
- data/runtime/edges.parquet: populated (~1.6M Qwensim/ECCC/WordSim edges)
- data/runtime/selectional.parquet: schema only, no rows
The schema is a single table (per phonolex_data.runtime.schema.selectional_schema()):
{
    "verb": pl.Utf8,
    "role": pl.Utf8,
    "filler": pl.Utf8,
    "count_v_r_f": pl.UInt32,
    "count_v_r_star": pl.UInt32,
    "ppmi": pl.Float32,
}
This was deliberate: PHON-93 needed to ship the runtime layer without blocking on a multi-hour corpus parse. PHON-94 is that parse.
But PHON-94 cannot just populate the locked schema as-is — three architectural realities surfaced during scope discussion:
- Parity with frequency demands age-banding. The PhonoLex frequency surface is age-banded (PHON-72 adult general; PHON-86/87 child input by age band; PHON-88 adult by grade band). A generator targeting toddler-vocabulary stimuli composes toddler-frequency with toddler-selectional priors; mixing adult selectional with toddler frequency creates a coherence hole no consumer can paper over. Selectional must be banded the same way frequency is.
- PHON-72's spaCy methodology was POS-only. PHON-72's build_frequency_corpus.py:84-86 disabled the parser, lemmatizer, and NER pipes. There are no cached DEP parses; PHON-94 must reparse. But the same FineWeb-Edu parse produces both selectional triples and frequency aggregates at zero extra compute cost. A canonical spaCy methodology run once over each source corpus replaces the inconsistent per-ticket configs and gives all derived stats statistical consistency.
- The PHON-92 memo's subcat_profile and role_fillability artifacts are aggregates of the per-(verb, role, filler) data, not separate sources. Materializing them as additional Parquets or columns invites stale-derivation bugs. They belong as derived views computed at consumer-load by Polars groupby+aggregate, with selectional.parquet as the single source of truth.
Goal: populate selectional.parquet with banded per-(verb, role, filler) PPMI; regenerate FineWeb-Edu-derived frequency+POS columns from the same canonical parse pass; canonicalize the spaCy methodology so any future corpus-derived stat uses the same configuration.
Scope
In:
- Canonical spaCy methodology: phonolex_data.pipeline.canonical_spacy — one entry point, locks model + pipes + filters.
- data/runtime/selectional.parquet populated with banded (verb, role, filler, band, count_v_r_f, count_v_r_star, ppmi) rows. Schema extends selectional_schema() with one new band: pl.Utf8 column.
- FineWeb-Edu frequency+POS columns on words.parquet regenerated from the canonical parse pass:
- Existing columns (frequency, log_frequency, contextual_diversity, pos, pos_alt, all_pos, all_freqs, PHON-88 grade-banded freq columns) — values updated with parser-informed POS resolution.
- New columns: lemma (str), lemma_frequency, lemma_log_frequency, plus per-F-K-bin equivalents lemma_frequency_b1..b5. Per-lemma aggregates replicated across all surface forms of the lemma.
- Corpus passes (3 separate runs of canonical config):
- FineWeb-Edu — full corpus (1.06M docs / 800M tokens), 4× H100 SXM, sharded; selectional + freq+POS.
- CHILDES Eng-NA + Eng-UK — selectional only (frequency already shipped per PHON-87, methodology unchanged because that pass was MOR-tier, not spaCy).
- PhonBank — smoke-gated; include only if per-band triple density supports top-2K verbs at min_count=5.
- New loader module: packages/data/src/phonolex_data/loaders/selectional.py (NOT in norms.py per feedback_pos_not_norm.md — DEP labels are analyst-assigned, not psycholinguistic norms).
- WordStore derived-view methods: subcat_profile(verb, band), role_fillability(filler, band) — Polars groupby+aggregate over selectional.parquet, computed lazily, cached.
- Phase-0 probe (research/2026-05-06-phon-94-canonical-spacy-probe/) before authorizing the production parse.
- Tests at packages/data/tests/runtime/test_selectional_parquet.py and packages/data/tests/pipeline/test_canonical_spacy.py.
Out (sibling tickets / future work):
- PHON-95 — MLM iterative editor + argstruc CFG enumerator that consumes selectional.parquet at enumeration time. Independent of PHON-94 at the implementation level; v1 of PHON-95 can use boolean filtering and doesn't require selectional data to start.
- Cold-storage policy ticket — durable home for raw corpus parses, intermediate shard parquets, and other large derived data. PHON-94 lands shard intermediates on the local ExternalData1 external drive as an interim policy; the broader policy gets its own ticket.
- LM scorer / decoder PPL + MMR diversification (memo §8) — PHON-95's residual layer.
- Multi-rater coherence validation (memo §9) — PHON-69 (blocked, deferred).
- Verb-with-particle disambiguation ("give up" vs "give") — flagged in probe, accepted as v1 conflation.
Methodology principles
Canonical spaCy methodology run-once-per-corpus. A single phonolex_data.pipeline.canonical_spacy module locks the spaCy configuration; every corpus-derived stat (current and future) runs it. PHON-72's per-ticket POS-only config retired. Statistical consistency across all FineWeb-Edu-derived columns becomes a property of the architecture, not a coordination problem.
Parity with frequency. Selectional preference data is age-banded the same way frequency data is. A toddler-stimulus generator gets toddler-distribution selectional priors that compose coherently with toddler-distribution frequency priors. The materialized aggregate band (e.g., fineweb_adult) is parity-matched to the existing un-banded frequency column.
Single source of truth. selectional.parquet holds raw counts + PPMI per (verb, role, filler, band). All higher-level views — verb subcategorization profiles, per-noun role-fillability marginals — are derived at consumer-load by Polars groupby+aggregate. No materialized derivations on words.parquet, no sibling Parquets for subcat or fillability. This eliminates the stale-derivation failure mode.
Probe-gated production. A 1,000-doc local probe (~10-15 min CPU run) verifies all spaCy-output presumptions (DEP label inventory, lemmatizer behavior, pronoun handling, passive voice prevalence, PP-attachment behavior, particle-verb prevalence, throughput) before the production parse (4× H100, ~3-4 h wallclock) is authorized. If any presumption breaks, the canonical config is adjusted before commit.
Lemma-keyed selectional, surface-keyed lexicon. selectional.parquet keys by lemma (verb + filler). words.parquet remains surface-keyed (CMU-dict-aligned). The mismatch is bridged at consumer time: PHON-95 lemmatizes the candidate at substitution. Both surface-keyed and lemma-keyed frequency columns live on words.parquet so consumers pick the right unit for the right query.
Cold storage for intermediates. Per-shard parquets and any other large intermediate corpus artifacts land on the local ExternalData1 external drive at /Volumes/ExternalData1/phonolex/raw_corpus_parses/{fineweb_edu,childes,phonbank}/. Only the final aggregated selectional.parquet (~1-2 GB post-min_count filter) goes in the repo via LFS. The broader cold-storage policy (durable home for raw datasets, intermediate parses, other large derived data) is a separate ticket filed at PHON-94 close.
Schema decisions
selectional.parquet (extends PHON-93's locked schema)
import polars as pl

def selectional_schema():
    return {
        "verb": pl.Utf8,            # lemma, lowercased
        "role": pl.Utf8,            # one of the 9 DEP roles below
        "filler": pl.Utf8,          # lemma, lowercased; NOUN/PROPN for nominal-arg roles, VERB for clausal
        "band": pl.Utf8,            # NEW: one of the bands enumerated below
        "count_v_r_f": pl.UInt32,
        "count_v_r_star": pl.UInt32,
        "ppmi": pl.Float32,
    }
The same triple appears as multiple rows, one per band it belongs to. PPMI is computed per band against that band's marginals, so ppmi(fineweb_adult) is not the sum of ppmi(fineweb_b1..b5); they are different statistics over different distributions.
Role inventory (9 DEP labels)
| Role | spaCy DEP source | Filler POS | Notes |
|---|---|---|---|
| nsubj | nsubj (nsubjpass is excluded here; it remaps to dobj, see below) | NOUN, PROPN | Drop PRON. Memo §1's role inventory matches. |
| dobj | dobj (+ remapped nsubjpass, see below) | NOUN, PROPN | Drop PRON. |
| iobj | iobj (or dative in some spaCy versions; probe confirms) | NOUN, PROPN | Drop PRON. |
| pobj_to | pobj whose parent ADP lemma == "to" and grandparent is VERB | NOUN, PROPN | V→prep→pobj only; NP-modifier PPs filtered out. |
| pobj_with | same pattern, "with" | NOUN, PROPN | |
| pobj_in | same pattern, "in" | NOUN, PROPN | |
| pobj_on | same pattern, "on" | NOUN, PROPN | |
| xcomp | xcomp | VERB | Filler is the embedded predicate's lemma. PHON-95 v1 grammars don't use clausal complements; data captured for future grammars. |
| ccomp | ccomp | VERB | Same as xcomp. |
Passive voice remap. nsubjpass instances ("the apple was eaten") map to dobj for selectional purposes — the patient role is what matters semantically. Standard practice in selectional-preference literature (Sayeed/Greenberg). Probe measures prevalence; remap is committed in the canonical extraction code.
PRON fillers dropped. Pronouns don't carry semantic selectional signal; "he"/"she"/"it" would dominate every nsubj/dobj row regardless of verb. Modern spaCy lemmatizers return surface pronoun forms (probe confirms — older versions returned -PRON- sentinel).
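The role rules above (passive remap, PRON drop, V→prep→pobj attachment) can be sketched as one small extraction function. This is a minimal illustration, not the canonical extraction code: `Tok` is a hypothetical stand-in exposing the spaCy token attributes the rules rely on (`lemma_`, `pos_`, `dep_`, `head`), and requiring a VERB head for the nominal roles is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

PREP_ROLES = {"to", "with", "in", "on"}   # the four prepositions in the role table
NOMINAL_POS = {"NOUN", "PROPN"}           # PRON is deliberately excluded


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token; only the fields used below."""
    lemma_: str
    pos_: str
    dep_: str
    head: Optional["Tok"] = None


def extract_triple(tok: Tok) -> Optional[Tuple[str, str, str]]:
    """Return (verb, role, filler) for one token, or None if it is filtered out."""
    head = tok.head
    if head is None:
        return None
    dep = tok.dep_
    if dep in {"nsubj", "nsubjpass", "dobj", "iobj"}:
        # Assumption: nominal roles are only counted under a VERB head.
        if head.pos_ != "VERB" or tok.pos_ not in NOMINAL_POS:
            return None
        role = "dobj" if dep == "nsubjpass" else dep  # passive remap
        return (head.lemma_.lower(), role, tok.lemma_.lower())
    if dep == "pobj":
        verb = head.head
        prep = head.lemma_.lower()
        if (head.pos_ == "ADP" and prep in PREP_ROLES
                and verb is not None and verb.pos_ == "VERB"
                and tok.pos_ in NOMINAL_POS):
            return (verb.lemma_.lower(), f"pobj_{prep}", tok.lemma_.lower())
        return None  # NP-modifier PPs and other prepositions are skipped
    if dep in {"xcomp", "ccomp"} and head.pos_ == "VERB" and tok.pos_ == "VERB":
        return (head.lemma_.lower(), dep, tok.lemma_.lower())
    return None
```

For "the apple was eaten", a token with lemma `apple` and dep `nsubjpass` under `eat` yields `("eat", "dobj", "apple")`, while a PRON subject yields `None`. Real spaCy root tokens point to themselves rather than `None`; the stand-in simplifies that.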
Bands
Per-corpus partition. FineWeb-Edu uses per-sentence F-K binning (revised from the initial fineweb_grade_K_8/9_12/13_16 design after four research probes — see research/2026-05-06-phon-94-{aoa-banding,readability,nb,chunked-fk}-probe/). CHILDES and PhonBank use participant-age tagging directly from the source data.
| Band | Source | Parity with |
|---|---|---|
| fineweb_adult | Full FineWeb-Edu, all sentences | PHON-72 frequency |
| fineweb_b1 | Sentences with F-K < 7.6 | (chunked F-K analog of PHON-88's freq_b1) |
| fineweb_b2 | F-K 7.6 – 10.7 | (chunked F-K analog of freq_b2) |
| fineweb_b3 | F-K 10.7 – 13.4 | (chunked F-K analog of freq_b3) |
| fineweb_b4 | F-K 13.4 – 16.8 | (chunked F-K analog of freq_b4) |
| fineweb_b5 | F-K ≥ 16.8 (clipped at 30) | (chunked F-K analog of freq_b5) |
| childes_general | Full CHILDES across all participant-age filters | PHON-87 general aggregate |
| childes_age_0_1y ... childes_age_12_18y | CHILDES utterances by participant age band | PHON-87 freq_childes_input_* columns |
| phonbank_general | Full PhonBank dataset.jsonl (English-language utterances) | PHON-86 general aggregate |
| phonbank_age_0_1y ... phonbank_age_5_plus | PhonBank utterances by participant age band | PHON-86 freq_pb_* columns |
Banding methodology:
- FineWeb-Edu: per-sentence F-K = 0.39·(W/S) + 11.8·(syl/W) − 15.59. Words come from spaCy alphabetic tokens; sentences from doc.sents; syllables from words.parquet[token].syllable_count with vowel-cluster heuristic for OOV. F-K values clipped at 30. Bin boundaries (7.6 / 10.7 / 13.4 / 16.8) are quantile cuts at p20/p40/p60/p80 of the empirical F-K distribution measured on a 73K-chunk FineWeb-Edu sample. Each parsed sentence gets exactly one F-K bin assignment; the triples extracted from that sentence increment counters in fineweb_adult (always) and the matching fineweb_b{i} bin. Sentences with W < 5 are skipped (F-K is unstable on tiny chunks).
- CHILDES + PhonBank: bands come from the participant age tag in the source data, not from F-K. Each utterance increments its corpus's *_general aggregate plus the matching age band. The 8-band CHILDES + 6-band PhonBank inventories match the existing freq_childes_input_* and freq_pb_* columns on words.parquet.
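The FineWeb-Edu banding rule can be sketched as follows, using the formula, clip, cut points, and W < 5 skip stated above. The function names, the regex used for the vowel-cluster fallback, and the plain `dict` standing in for the words.parquet syllable lookup are illustrative assumptions.

```python
import bisect
import re
from typing import Dict, List, Optional

FK_CUTS = [7.6, 10.7, 13.4, 16.8]  # p20/p40/p60/p80 cuts from the spec
FK_CLIP = 30.0
MIN_WORDS = 5


def heuristic_syllables(word: str) -> int:
    """Assumed vowel-cluster fallback for OOV words: one syllable per vowel run."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def sentence_fk(words: List[str], syllable_counts: Dict[str, int]) -> Optional[float]:
    """Flesch-Kincaid grade for a single sentence (S = 1); None if W < 5."""
    n = len(words)
    if n < MIN_WORDS:
        return None
    syl = sum(
        syllable_counts.get(w.lower(), heuristic_syllables(w)) for w in words
    )
    fk = 0.39 * n + 11.8 * (syl / n) - 15.59
    return min(fk, FK_CLIP)


def bands_for_sentence(fk: float) -> List[str]:
    """Every binned sentence counts toward fineweb_adult plus exactly one bin."""
    idx = bisect.bisect_right(FK_CUTS, fk)  # 0..4; a value on a cut goes to the upper bin
    return ["fineweb_adult", f"fineweb_b{idx + 1}"]
```

`bisect_right` places a sentence with F-K exactly 7.6 in `fineweb_b2`, matching the "F-K < 7.6" definition of b1. Whether sentences skipped for W < 5 still contribute to `fineweb_adult` is left to the caller here, since the sketch simply returns `None` for them.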
PhonBank smoke-gate retired — the empirical inspection of dataset.jsonl (828K utterances, 22.9K vocab, age range 0-12y) confirms sufficient density for direct parsing.
Naming clarification: fineweb_b1..b5 are not identical to PHON-88's freq_b1..b5 columns on words.parquet. PHON-88 uses a composite (F-K + tier-1 + off-list) at chunk level, aggregated to per-word frequencies. PHON-94 uses pure F-K at sentence level, attributed to triples extracted from the sentence. The naming is parallel for clarity but the bands are computed differently.
words.parquet additions
| Column | Type | Source |
|---|---|---|
| lemma | str | spaCy token.lemma_.lower() from canonical pass |
| lemma_frequency | Float32 | Per-lemma aggregate of FineWeb-Edu surface counts |
| lemma_log_frequency | Float32 | log10(lemma_frequency + 1) |
| lemma_contextual_diversity | Float32 | Per-lemma CD (docs containing any surface form of the lemma) |
| lemma_frequency_b1 | Float32 | Per-lemma aggregate over sentences in F-K bin b1 (F-K < 7.6) |
| lemma_frequency_b2 | Float32 | bin b2 (F-K 7.6–10.7) |
| lemma_frequency_b3 | Float32 | bin b3 (F-K 10.7–13.4) |
| lemma_frequency_b4 | Float32 | bin b4 (F-K 13.4–16.8) |
| lemma_frequency_b5 | Float32 | bin b5 (F-K ≥ 16.8) |
Existing surface-keyed FineWeb-Edu columns (frequency, log_frequency, contextual_diversity, pos, pos_alt, all_pos, all_freqs, PHON-88 grade-banded freq) are regenerated from the canonical pass with parser-informed POS resolution; values may shift slightly from PHON-72/PHON-88 baselines.
CHILDES-derived columns (PHON-86/87) are unaffected — those use MOR-tier transcripts, not spaCy.
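As a rough illustration of how the per-lemma columns relate to the surface-keyed rows, the sketch below sums surface frequencies per lemma and replicates the aggregate onto every surface form. The function name and the plain-dict inputs are hypothetical; the production merge is expected to do this in Polars, and the unit of `frequency` (raw count vs. normalized) is not assumed here.

```python
import math
from collections import defaultdict
from typing import Dict


def lemma_columns(
    surface_freq: Dict[str, float], surface_to_lemma: Dict[str, str]
) -> Dict[str, Dict[str, object]]:
    """Aggregate surface frequencies per lemma, then copy onto each surface row."""
    totals: Dict[str, float] = defaultdict(float)
    for surface, freq in surface_freq.items():
        totals[surface_to_lemma[surface]] += freq

    out = {}
    for surface in surface_freq:
        lemma = surface_to_lemma[surface]
        out[surface] = {
            "lemma": lemma,
            "lemma_frequency": totals[lemma],
            # Matches the table: log10(lemma_frequency + 1)
            "lemma_log_frequency": math.log10(totals[lemma] + 1),
        }
    return out
```

With surface frequencies `run: 10, runs: 5, ran: 3` all mapped to lemma `run`, each of the three rows carries `lemma_frequency = 18`. The same pattern would apply per F-K bin for `lemma_frequency_b1..b5`.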
PMI computation
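Per-band, with add-α smoothing (α = 0.01) and a min_count=5 floor. Here c(·) is a triple count in band b, * marks a summed-out slot, and |F_r,b| is the size of the filler vocabulary for role r in band b: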
P̂(f | v, r, b) = (c(v, r, f, b) + α) / (c(v, r, *, b) + α · |F_r,b|)
P̂(f | r, b) = (c(*, r, f, b) + α) / (c(*, r, *, b) + α · |F_r,b|)
PPMI(v, r, f, b) = max(0, log₂( P̂(f | v, r, b) / P̂(f | r, b) ))
Write-time filter: only min_count=5. A triple with c(v, r, f, b) < 5 is dropped at write time (treat as no evidence — single-occurrence triples are mostly parsing noise per Jurafsky/Martin SLP3 §J.3). Rows with ppmi == 0 (below-chance) are kept. This preserves the consumer-side signal: a (verb, role, band) with no positive-PMI fillers but with attested low-PMI fillers is distinguishable from one with no data at all.
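A dependency-free sketch of the per-band computation and the write-time floor follows. It assumes the marginals and |F_r,b| are accumulated over all observed triples before the min_count floor is applied, and that |F_r,b| counts distinct observed fillers; the spec does not pin either choice down, and the function name is illustrative.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

ALPHA = 0.01
MIN_COUNT = 5

Key = Tuple[str, str, str, str]  # (verb, role, filler, band)


def ppmi_rows(counts: Dict[Key, int]) -> List[dict]:
    """Turn raw c(v, r, f, b) counts into selectional.parquet-shaped rows."""
    c_vr = Counter()             # c(v, r, *, b)
    c_rf = Counter()             # c(*, r, f, b)
    c_r = Counter()              # c(*, r, *, b)
    fillers = defaultdict(set)   # F_{r,b}: distinct fillers per (role, band)
    for (v, r, f, b), c in counts.items():
        c_vr[(v, r, b)] += c
        c_rf[(r, f, b)] += c
        c_r[(r, b)] += c
        fillers[(r, b)].add(f)

    rows = []
    for (v, r, f, b), c in counts.items():
        if c < MIN_COUNT:
            continue  # write-time floor: treated as no evidence
        k = ALPHA * len(fillers[(r, b)])
        p_f_given_vr = (c + ALPHA) / (c_vr[(v, r, b)] + k)
        p_f_given_r = (c_rf[(r, f, b)] + ALPHA) / (c_r[(r, b)] + k)
        ppmi = max(0.0, math.log2(p_f_given_vr / p_f_given_r))
        rows.append({
            "verb": v, "role": r, "filler": f, "band": b,
            "count_v_r_f": c,
            "count_v_r_star": c_vr[(v, r, b)],
            "ppmi": ppmi,  # rows with ppmi == 0 are kept on purpose
        })
    return rows
```

A below-chance triple that clears the floor comes out with `ppmi == 0.0` and stays in the output, while a triple under the floor produces no row at all, which is the distinction the write-time rule is meant to preserve.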
Consumer-side filtering. Consumers query for ppmi > τ themselves (default τ=0). The coverage gate c(v, r, *, b) ≥ 50 is also consumer-side: the data layer exposes count_v_r_star on every row; PHON-95 (or any consumer) decides whether to trust a verb's zero-PPMI rejection signal in a given band, falling open when coverage is insufficient.
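One way a consumer might combine the τ threshold with the coverage gate is sketched below. `admits` is a hypothetical helper, not a WordStore method; it takes the rows already selected for one (verb, role, band) and falls open when `count_v_r_star` is under the gate.

```python
from typing import Iterable


def admits(
    rows_for_verb_role_band: Iterable[dict],
    filler: str,
    tau: float = 0.0,
    min_coverage: int = 50,
) -> bool:
    """Decide whether `filler` is acceptable for one (verb, role, band).

    Falls open (returns True) when coverage is too thin to trust a rejection.
    """
    rows = list(rows_for_verb_role_band)
    # count_v_r_star is identical on every row of a given (verb, role, band).
    coverage = rows[0]["count_v_r_star"] if rows else 0
    if coverage < min_coverage:
        return True
    return any(r["filler"] == filler and r["ppmi"] > tau for r in rows)
```

With enough coverage, an attested filler at `ppmi == 0` and an unattested filler are both rejected at the default τ; the two cases remain distinguishable in the data layer for consumers that want to treat them differently.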
Storage estimate. Top-2K verbs × 9 roles × ~200 unique fillers × 9 bands ≈ 32M rows pre-floor. After min_count=5 floor: ~8-15M rows. Parquet compressed: ~1-2 GB. LFS-trackable.
Architecture
Module layout
packages/data/src/phonolex_data/
├── pipeline/
│ ├── canonical_spacy.py # NEW — spaCy config + load_canonical_pipeline()
│ └── ...
├── loaders/
│ ├── selectional.py # NEW — load selectional.parquet into WordStore
│ └── ...
└── runtime/
├── schema.py # MODIFIED — selectional_schema() gains band column
├── store.py # MODIFIED — WordStore.subcat_profile() / .role_fillability()
└── ...
research/2026-05-06-phon-94-canonical-spacy-probe/
├── probe.py # Phase-0 sanity check
├── notebook.md # Findings, decisions, surprises
└── README.md
research/2026-05-06-phon-94-corpus-parse/
├── build_selectional.py # Sharded parse + extract; writes shard-N.parquet
├── merge_shards.py # Polars groupby-sum across shards → final selectional.parquet + freq+POS deltas
├── launch_shards.sh # 4× RunPod H100 SXM
├── poll_progress.sh
└── notebook.md
Data flow
FineWeb-Edu corpus (HuggingFace streaming)
↓
4× H100 shards: canonical_spacy.parse() → extract triples + freq counts → write shard parquet
↓
ExternalData1 cold storage (raw shard parquets)
↓
merge_shards.py (local, Polars stream-merge)
├── selectional aggregation: groupby (verb, role, filler, band) → sum → compute PPMI → apply min_count=5 floor (ppmi == 0 rows kept) → write
↓ ↓
│ data/runtime/selectional.parquet (LFS)
│
└── freq+POS aggregation: per-(surface, pos, band) and per-(lemma, pos, band) counts
↓
deltas to words.parquet via
export-to-d1.py pipeline regen
CHILDES corpus (MOR-tier participant utterances)
↓
1× H100: canonical_spacy.parse() → extract triples (selectional only; freq columns unchanged)
↓
merge → selectional.parquet (childes bands)
PhonBank corpus (smoke-gated)
↓
local CPU or 1× H100 if smoke passes
↓
merge → selectional.parquet (phonbank bands, conditional)
Consumer surface (WordStore derived views)
# selectional_edges loaded as a Polars DataFrame at WordStore startup
class WordStore:
    def selectional(
        self, verb: str, role: str, filler: str, band: str
    ) -> SelectionalEdge | None: ...

    def subcat_profile(self, verb: str, band: str) -> SubcatProfile:
        """Derived view: groupby role, classify transitivity from dominant pattern."""
        ...

    def role_fillability(self, filler: str, band: str) -> dict[str, float]:
        """Derived view: per-(filler, band) marginal P(role | filler)."""
        ...
subcat_profile and role_fillability are lazy + cached; first call triggers a Polars groupby, subsequent calls return the cached result.
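The lazy-plus-cached behavior can be illustrated without Polars. In the sketch below, `DerivedViews` is a hypothetical stand-in for the WordStore layer, rows are plain dicts with the selectional.parquet columns, and shares are computed from `count_v_r_f` over the post-floor rows. The transitivity classifier is left out because its rule is not specified here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class DerivedViews:
    """Hypothetical stand-in for WordStore's derived-view methods."""

    def __init__(self, rows: List[dict]):
        self._rows = rows
        self._cache: Dict[Tuple[str, str, str], Dict[str, float]] = {}

    def _role_shares(self, key_col: str, key: str, band: str) -> Dict[str, float]:
        cache_key = (key_col, key, band)
        if cache_key not in self._cache:  # first call: group by role and sum
            totals: Dict[str, int] = defaultdict(int)
            for row in self._rows:
                if row[key_col] == key and row["band"] == band:
                    totals[row["role"]] += row["count_v_r_f"]
            grand = sum(totals.values())
            self._cache[cache_key] = (
                {role: c / grand for role, c in totals.items()} if grand else {}
            )
        return self._cache[cache_key]  # later calls: cached result

    def role_fillability(self, filler: str, band: str) -> Dict[str, float]:
        """P(role | filler) within one band."""
        return self._role_shares("filler", filler, band)

    def subcat_role_shares(self, verb: str, band: str) -> Dict[str, float]:
        """Share of a verb's attested triples per role; input to a subcat profile."""
        return self._role_shares("verb", verb, band)
```

Because nothing is materialized, regenerating selectional.parquet cannot leave a stale derived table behind; the only state to invalidate is the in-process cache.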
Phase 0: probe (sanity-check gate)
Local 1,000-doc FineWeb-Edu sample (~3M tokens post-filter), canonical config, ~10-15 min CPU run. Outputs JSON stats + a markdown lab notebook.
Presumptions checked:
- DEP label histogram. Confirm nsubj, dobj, iobj, pobj, xcomp, ccomp are in the label inventory; verify iobj doesn't surface as dative in this spaCy version; flag any high-frequency label we hadn't accounted for.
- Top 30 verb lemmas + sample triples. Spot-check lemmatization: running/runs/ran collapse to run; common verb extraction looks clean; particle verbs don't crash.
- Pronoun lemma form. Confirm he/she/it, not -PRON-. Affects whether the PRON-drop filter even fires.
- Passive voice prevalence. Count nsubjpass instances. >5% of nsubj-like edges → committed remap to dobj.
- PP attachment: V-rooted vs N-rooted. Count both; verify pobj_with extraction filters out NP-modifier PPs.
- Coordination prevalence. Count conj chains under nsubj/dobj; this measures evidence loss from single-head extraction.
- Particle-verb prevalence. V+prt patterns; flag conflation magnitude (acceptable for v1).
- Throughput. Tokens/sec local CPU _trf + parser + lemmatizer. Calibrates H100×4 wallclock estimate.
Gate criteria: all eight presumptions either confirmed-as-expected or addressed in the canonical config before authorizing the production parse. Any surprise becomes a notebook entry + a commit.
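Two of the probe checks reduce to simple counting over the DEP labels the parser emits. The sketch below is illustrative only: it assumes "nsubj-like edges" means nsubj plus nsubjpass, and the helper names are hypothetical rather than taken from probe.py.

```python
from collections import Counter
from typing import Iterable, Set

EXPECTED_LABELS = {"nsubj", "dobj", "iobj", "pobj", "xcomp", "ccomp"}
PASSIVE_THRESHOLD = 0.05  # the >5% trigger from the presumption list


def missing_labels(dep_labels: Iterable[str]) -> Set[str]:
    """Expected labels that never appear in the probe sample."""
    return EXPECTED_LABELS - set(dep_labels)


def passive_share(dep_labels: Iterable[str]) -> float:
    """nsubjpass as a fraction of nsubj-like edges (assumed: nsubj + nsubjpass)."""
    hist = Counter(dep_labels)
    subj_like = hist["nsubj"] + hist["nsubjpass"]
    return hist["nsubjpass"] / subj_like if subj_like else 0.0
```

If `iobj` shows up in `missing_labels` while `dative` is frequent in the histogram, that would point to the label-name difference the first presumption asks the probe to check.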
Phase 1: production parse
After probe clears:
FineWeb-Edu — 4× RunPod H100 SXM, sharded i/N like PHON-72. ~3-4h wallclock. Per-shard parquet output to ExternalData1. Local merge to selectional.parquet + freq+POS deltas.
CHILDES — 1× H100, ~30-60 min. Single-pod parse, no sharding (corpus is 30M tokens). Selectional only (frequency already shipped).
PhonBank — smoke test on PhonBank's per-band triple density. If top-2K verbs have ≥ min_count=5 triples in the smallest band, commit to a parse pass; otherwise drop PhonBank from the band inventory and document the decision in the notebook.
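For the sharded FineWeb-Edu pass, the shard split and the later merge can be pictured as below. The modulo assignment is an assumption about what "sharded i/N" means, and `Counter` stands in for the Polars groupby-sum that merge_shards.py is described as performing; both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

Key = Tuple[str, str, str, str]  # (verb, role, filler, band)


def in_shard(doc_index: int, shard: int, n_shards: int) -> bool:
    """Assumed i/N split: shard i takes every document whose index is i mod N."""
    return doc_index % n_shards == shard


def merge_shard_counts(shard_counts: Iterable[Counter]) -> Counter:
    """Sum per-shard triple counts; PPMI is computed only after this merge."""
    total: Counter = Counter()
    for counts in shard_counts:
        total.update(counts)  # Counter.update adds counts key by key
    return total
```

Summing raw counts before computing PPMI matters: PPMI depends on band-wide marginals, so per-shard PPMI values could not be combined afterwards.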
Tests
packages/data/tests/pipeline/test_canonical_spacy.py:
- test_canonical_spacy_fixture — 10 hand-written sentences; locks the canonical config's behavior. Verifies expected DEP labels, lemmas, POS for each.
packages/data/tests/runtime/test_selectional_parquet.py:
- test_schema_roundtrip — tiny DF → write → read → schema matches selectional_schema().
- test_known_verb_dobj_admits_plausible_filler — (cut, dobj, cake) has ppmi > 0 at fineweb_adult band.
- test_known_verb_dobj_rejects_implausible_filler — (cut, dobj, thunder) is absent or has ppmi == 0.
- test_known_verb_dobj_admits_paper_meat — (cut, dobj, paper) and (cut, dobj, meat) have ppmi > 0 (broader sanity).
- test_coverage_gate_consumer_logic — WordStore.selectional() exposes count_v_r_star; consumers honor the ≥ 50 gate.
- test_band_consistency — for any triple in any fineweb_b{i}, the same triple in fineweb_adult has c_vrf_adult ≥ c_vrf_b{i}.
- test_passive_remap — sentence "the apple was eaten by the boy" produces (eat, dobj, apple) not (eat, nsubj, apple).
- test_wordstore_subcat_profile — WordStore.subcat_profile(verb='give', band='fineweb_adult') returns transitivity='ditrans'.
- test_wordstore_role_fillability — WordStore.role_fillability(filler='cake', band='fineweb_adult') shows dobj as dominant role.
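The invariant behind test_band_consistency can be written as a small checker over rows. This is a sketch of the check's logic only, using plain dict rows rather than the real fixture; the helper name is hypothetical.

```python
from typing import List, Tuple


def band_consistency_violations(rows: List[dict]) -> List[Tuple[tuple, str]]:
    """Return (triple, band) pairs whose F-K bin count exceeds the fineweb_adult count."""
    adult = {
        (r["verb"], r["role"], r["filler"]): r["count_v_r_f"]
        for r in rows
        if r["band"] == "fineweb_adult"
    }
    bad = []
    for r in rows:
        if r["band"].startswith("fineweb_b"):
            key = (r["verb"], r["role"], r["filler"])
            # A missing adult row counts as 0, so it is reported as a violation too.
            if adult.get(key, 0) < r["count_v_r_f"]:
                bad.append((key, r["band"]))
    return bad
```

A test could then assert that the returned list is empty for the built selectional.parquet. Since every binned sentence also increments `fineweb_adult`, a bin row with no adult counterpart would indicate a merge or floor-ordering bug.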
Acceptance criteria
- data/runtime/selectional.parquet populated and round-trips through pl.read_parquet() with the schema-extended (band-column-included) signature.
- Sanity test on a known verb: cut admits cake, paper, meat as dobj; rejects thunder, idea. Confirmed at fineweb_adult band.
- All bands in the inventory have at least the top-100 verbs populated above min_count=5.
- words.parquet regenerated columns produce non-degenerate values (no all-zero columns; lemma column has reasonable English lemma forms).
- WordStore.subcat_profile() and .role_fillability() return non-empty results for any verb/filler with adequate corpus support.
- Probe notebook committed with all 8 presumption checks reported and any required canonical-config adjustments documented.
- All tests pass on CI.
- PR back to release/v5.2.0.
Open follow-ups
- Cold-storage policy ticket. PHON-94 uses ExternalData1 as an interim home for raw shard parquets; broader policy for raw datasets, intermediate parses, and other large derived data needs its own ticket. File at PHON-94 close.
- Verb-with-particle disambiguation. "give up" vs "give" conflate at the lemma level. Acceptable for v1; revisit if PHON-95 evaluation surfaces it as a coherence-gap.
- Coordination evidence loss. Single-head extraction misses Mary in "John and Mary ate". Acceptable for v1; coordination-aware extraction is a candidate enhancement if probe results show high prevalence in CHILDES.
- CHILDES license posture documentation. PHON-86/87/88 already ship CHILDES-derived freq aggregates under the same posture (derived statistical aggregates, not per-utterance redistribution); PHON-94 selectional aggregates inherit. Confirm data/SOURCES.md covers selectional under the existing CHILDES entry.