Corpus DEP Reannotation + Selectional Preference Population: Design (PHON-94)

Date: 2026-05-06
Status: Spec, pending user review
Branch: off release/v5.2.0 (working branch named at writing-plans handoff)
Tickets: PHON-94 (https://neumannsworkshop.atlassian.net/browse/PHON-94)
Predecessors: PHON-93 (runtime word-data layer, merged PR #85), PHON-92 (selectional-preference research-spike memo at packages/generation/research/2026-05-05-phon-92-selectional-preference/memo.md on research/phon-92-selectional-preference-spike), PHON-72 (FineWeb-Edu freq+POS), PHON-87/88 (CHILDES + grade-banded frequency)
Sibling tickets: PHON-95 (MLM iterative editor + argstruc CFG enumerator; was the original PHON-93 scope before rescope; consumes selectional.parquet at enumeration time)

Problem

PHON-93 shipped the runtime word-data layer with three canonical Parquet artifacts. Two are populated; one is empty:

data/runtime/words.parquet — populated (125K phonology-covered words × 155 cols)
data/runtime/edges.parquet — populated (~1.6M Qwensim/ECCC/WordSim edges)
data/runtime/selectional.parquet — schema only, no rows

The schema is a single table (per phonolex_data.runtime.schema.selectional_schema()):

{
    "verb": pl.Utf8,
    "role": pl.Utf8,
    "filler": pl.Utf8,
    "count_v_r_f": pl.UInt32,
    "count_v_r_star": pl.UInt32,
    "ppmi": pl.Float32,
}

This was deliberate: PHON-93 needed to ship the runtime layer without blocking on a multi-hour corpus parse. PHON-94 is that parse.

But PHON-94 cannot just populate the locked schema as-is — three architectural realities surfaced during scope discussion:

Parity with frequency demands age-banding. The PhonoLex frequency surface is age-banded (PHON-72 adult general; PHON-86/87 child input by ageband; PHON-88 adult by grade band). A generator targeting toddler-vocabulary stimuli composes toddler-frequency with toddler-selectional priors; mixing adult selectional with toddler frequency creates a coherence hole no consumer can paper over. Selectional must be banded the same way frequency is.
PHON-72's spaCy methodology was POS-only. PHON-72's build_frequency_corpus.py:84-86 disabled the parser, lemmatizer, and NER pipes. There are no cached DEP parses; PHON-94 must reparse. But the same FineWeb-Edu parse produces both selectional triples and frequency aggregates at zero extra compute cost. A canonical spaCy methodology run once over each source corpus replaces the inconsistent per-ticket configs and gives all derived stats statistical consistency.
The PHON-92 memo's subcat_profile and role_fillability artifacts are aggregates of the per-(verb, role, filler) data, not separate sources. Materializing them as additional Parquets or columns invites stale-derivation bugs. They belong as derived views computed at consumer-load by Polars groupby+aggregate — single source of truth = selectional.parquet.

Goal: populate selectional.parquet with banded per-(verb, role, filler) PPMI; regenerate FineWeb-Edu-derived frequency+POS columns from the same canonical parse pass; canonicalize the spaCy methodology so any future corpus-derived stat uses the same configuration.

Scope

In:
- Canonical spaCy methodology: phonolex_data.pipeline.canonical_spacy. One entry point that locks model + pipes + filters.
- data/runtime/selectional.parquet populated with banded (verb, role, filler, band, count_v_r_f, count_v_r_star, ppmi) rows. Schema extends selectional_schema() with one new band: pl.Utf8 column.
- FineWeb-Edu frequency+POS columns on words.parquet regenerated from the canonical parse pass:
  - Existing columns (frequency, log_frequency, contextual_diversity, pos, pos_alt, all_pos, all_freqs, PHON-88 grade-banded freq columns): values updated with parser-informed POS resolution.
  - New columns: lemma (str), lemma_frequency, lemma_log_frequency, lemma_contextual_diversity, plus per-F-K-bin equivalents lemma_frequency_b1..b5. Per-lemma aggregates are replicated across all surface forms of the lemma.
- Corpus passes (3 separate runs of the canonical config):
  - FineWeb-Edu: full corpus (1.06M docs / 800M tokens), 4× H100 SXM, sharded; selectional + freq+POS.
  - CHILDES Eng-NA + Eng-UK: selectional only (frequency already shipped per PHON-87; methodology unchanged because that pass was MOR-tier, not spaCy).
  - PhonBank: smoke-gated; include only if per-band triple density supports top-2K verbs at min_count=5.
- New loader module: packages/data/src/phonolex_data/loaders/selectional.py (NOT in norms.py, per feedback_pos_not_norm.md: DEP labels are analyst-assigned, not psycholinguistic norms).
- WordStore derived-view methods: subcat_profile(verb, band), role_fillability(filler, band). Polars groupby+aggregate over selectional.parquet, computed lazily, cached.
- Phase-0 probe (research/2026-05-06-phon-94-canonical-spacy-probe/) before authorizing the production parse.
- Tests at packages/data/tests/runtime/test_selectional_parquet.py and packages/data/tests/pipeline/test_canonical_spacy.py.

Out (sibling tickets / future work):
- PHON-95: MLM iterative editor + argstruc CFG enumerator that consumes selectional.parquet at enumeration time. Independent of PHON-94 at the implementation level; v1 of PHON-95 can use boolean filtering and doesn't require selectional data to start.
- Cold-storage policy ticket: durable home for raw corpus parses, intermediate shard parquets, and other large derived data. PHON-94 lands shard intermediates on the local ExternalData1 external drive as an interim policy; the broader policy gets its own ticket.
- LM scorer / decoder PPL + MMR diversification (memo §8): PHON-95's residual layer.
- Multi-rater coherence validation (memo §9): PHON-69 (blocked, deferred).
- Verb-with-particle disambiguation ("give up" vs "give"): flagged in probe, accepted as v1 conflation.

Methodology principles

Canonical spaCy methodology run-once-per-corpus. A single phonolex_data.pipeline.canonical_spacy module locks the spaCy configuration; every corpus-derived stat (current and future) runs it. PHON-72's per-ticket POS-only config retired. Statistical consistency across all FineWeb-Edu-derived columns becomes a property of the architecture, not a coordination problem.

Parity with frequency. Selectional preference data is age-banded the same way frequency data is. A toddler-stimulus generator gets toddler-distribution selectional priors that compose coherently with toddler-distribution frequency priors. The materialized aggregate band (e.g., fineweb_adult) is parity-matched to the existing un-banded frequency column.

Single source of truth. selectional.parquet holds raw counts + PPMI per (verb, role, filler, band). All higher-level views — verb subcategorization profiles, per-noun role-fillability marginals — are derived at consumer-load by Polars groupby+aggregate. No materialized derivations on words.parquet, no sibling Parquets for subcat or fillability. This eliminates the stale-derivation failure mode.

Probe-gated production. A 1,000-doc local probe (~10-15 min CPU run) verifies all spaCy-output presumptions (DEP label inventory, lemmatizer behavior, pronoun handling, passive voice prevalence, PP-attachment behavior, coordination prevalence, particle-verb prevalence, throughput) before the 4× H100 production parse (~3-4h wallclock) is authorized. If any presumption breaks, the canonical config is adjusted before commit.

Lemma-keyed selectional, surface-keyed lexicon. selectional.parquet keys by lemma (verb + filler). words.parquet remains surface-keyed (CMU-dict-aligned). The mismatch is bridged at consumer time: PHON-95 lemmatizes the candidate at substitution. Both surface-keyed and lemma-keyed frequency columns live on words.parquet so consumers pick the right unit for the right query.

Cold storage for intermediates. Per-shard parquets and any other large intermediate corpus artifacts land on the local ExternalData1 external drive at /Volumes/ExternalData1/phonolex/raw_corpus_parses/{fineweb_edu,childes,phonbank}/. Only the final aggregated selectional.parquet (~1-2 GB post-min_count filter) goes in the repo via LFS. The broader cold-storage policy (durable home for raw datasets, intermediate parses, other large derived data) is a separate ticket filed at PHON-94 close.

Schema decisions

`selectional.parquet` (extends PHON-93's locked schema)

import polars as pl


def selectional_schema():
    return {
        "verb": pl.Utf8,             # lemma, lowercased
        "role": pl.Utf8,             # one of the 9 DEP roles below
        "filler": pl.Utf8,           # lemma, lowercased; NOUN/PROPN for nominal-arg roles, VERB for clausal
        "band": pl.Utf8,             # NEW — one of the bands enumerated below
        "count_v_r_f": pl.UInt32,
        "count_v_r_star": pl.UInt32,
        "ppmi": pl.Float32,
    }

The same triple appears as multiple rows, one per band it belongs to. PPMI is computed per band against that band's marginals, so ppmi(fineweb_adult) is not the sum of the ppmi(fineweb_b1..b5) values; they are different statistics over different distributions.

Role inventory (9 DEP labels)

Role	spaCy DEP source	Filler POS	Notes
`nsubj`	`nsubj` (+ `nsubjpass` remap — see below)	NOUN, PROPN	Drop PRON. Memo §1's role inventory matches.
`dobj`	`dobj`	NOUN, PROPN	Drop PRON.
`iobj`	`iobj` (or `dative` in some spaCy versions; probe confirms)	NOUN, PROPN	Drop PRON.
`pobj_to`	`pobj` whose parent ADP lemma == "to" and grandparent is VERB	NOUN, PROPN	V→prep→pobj only; NP-modifier PPs filtered out.
`pobj_with`	same pattern, "with"	NOUN, PROPN
`pobj_in`	same pattern, "in"	NOUN, PROPN
`pobj_on`	same pattern, "on"	NOUN, PROPN
`xcomp`	`xcomp`	VERB	Filler is the embedded predicate's lemma. PHON-95 v1 grammars don't use clausal complements; data captured for future grammars.
`ccomp`	`ccomp`	VERB	Same as xcomp.

Passive voice remap. nsubjpass instances ("the apple was eaten") map to dobj for selectional purposes — the patient role is what matters semantically. Standard practice in selectional-preference literature (Sayeed/Greenberg). Probe measures prevalence; remap is committed in the canonical extraction code.

PRON fillers dropped. Pronouns don't carry semantic selectional signal; "he"/"she"/"it" would dominate every nsubj/dobj row regardless of verb. Modern spaCy lemmatizers return surface pronoun forms (probe confirms — older versions returned -PRON- sentinel).
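The role rules above (nominal-only fillers, the V→prep→pobj pattern, the passive remap) can be sketched as a single function. This is a minimal illustration, not the canonical extraction code: `Tok` is a hypothetical stand-in for a parsed token rather than spaCy's own class, and requiring a VERB head for the subject and object roles is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

NOMINAL_POS = {"NOUN", "PROPN"}
CLAUSAL_DEPS = {"xcomp", "ccomp"}
PREP_ROLES = {"to", "with", "in", "on"}


@dataclass
class Tok:
    """Hypothetical stand-in for a parsed token (not spaCy's Token class)."""
    lemma: str
    pos: str
    dep: str
    head: Optional["Tok"] = None


def extract_triple(tok: Tok) -> Optional[tuple[str, str, str]]:
    """Return (verb, role, filler) for one token, or None if it is not tracked."""
    head = tok.head
    if head is None:
        return None
    filler = tok.lemma.lower()
    # Clausal complements: the filler is the embedded predicate's lemma.
    if tok.dep in CLAUSAL_DEPS:
        if tok.pos == "VERB" and head.pos == "VERB":
            return (head.lemma.lower(), tok.dep, filler)
        return None
    # Nominal roles: PRON and every other non-nominal POS are dropped.
    if tok.pos not in NOMINAL_POS:
        return None
    if tok.dep == "pobj":
        # V -> prep -> pobj only; PPs that modify a noun are filtered out.
        prep, verb = head.lemma.lower(), head.head
        if head.pos == "ADP" and prep in PREP_ROLES and verb is not None and verb.pos == "VERB":
            return (verb.lemma.lower(), f"pobj_{prep}", filler)
        return None
    if head.pos != "VERB":  # assumption: only VERB-headed arguments count
        return None
    if tok.dep == "nsubjpass":
        return (head.lemma.lower(), "dobj", filler)  # passive remap
    if tok.dep in {"nsubj", "dobj"}:
        return (head.lemma.lower(), tok.dep, filler)
    if tok.dep in {"iobj", "dative"}:  # either label maps to the iobj role
        return (head.lemma.lower(), "iobj", filler)
    return None
```

For "the apple was eaten", a token with lemma apple, POS NOUN, and dep nsubjpass under the verb eat yields (eat, dobj, apple); a PRON subject yields None.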

Bands

Per-corpus partition. FineWeb-Edu uses per-sentence F-K binning (revised from the initial fineweb_grade_K_8/9_12/13_16 design after four research probes — see research/2026-05-06-phon-94-{aoa-banding,readability,nb,chunked-fk}-probe/). CHILDES and PhonBank use participant-age tagging directly from the source data.

Band	Source	Parity with
`fineweb_adult`	Full FineWeb-Edu, all sentences	PHON-72 `frequency`
`fineweb_b1`	Sentences with F-K < 7.6	(chunked F-K analog of PHON-88's `freq_b1`)
`fineweb_b2`	F-K 7.6 – 10.7	(chunked F-K analog of `freq_b2`)
`fineweb_b3`	F-K 10.7 – 13.4	(chunked F-K analog of `freq_b3`)
`fineweb_b4`	F-K 13.4 – 16.8	(chunked F-K analog of `freq_b4`)
`fineweb_b5`	F-K ≥ 16.8 (clipped at 30)	(chunked F-K analog of `freq_b5`)
`childes_general`	Full CHILDES across all participant-age filters	PHON-87 general aggregate
`childes_age_0_1y` ... `childes_age_12_18y`	CHILDES utterances by participant age band	PHON-87 `freq_childes_input_*` columns
`phonbank_general`	Full PhonBank `dataset.jsonl` (English-language utterances)	PHON-86 general aggregate
`phonbank_age_0_1y` ... `phonbank_age_5_plus`	PhonBank utterances by participant age band	PHON-86 `freq_pb_*` columns

Banding methodology:

FineWeb-Edu: per-sentence F-K = 0.39·(W/S) + 11.8·(syl/W) − 15.59. Words come from spaCy alphabetic tokens; sentences from doc.sents; syllables from words.parquet[token].syllable_count with vowel-cluster heuristic for OOV. F-K values clipped at 30. Bin boundaries (7.6 / 10.7 / 13.4 / 16.8) are quantile cuts at p20/p40/p60/p80 of the empirical F-K distribution measured on a 73K-chunk FineWeb-Edu sample. Each parsed sentence gets exactly one F-K bin assignment; the triples extracted from that sentence increment counters in fineweb_adult (always) and the matching fineweb_b{i} bin. Sentences with W < 5 are skipped (F-K is unstable on tiny chunks).
CHILDES + PhonBank: bands come from the participant age tag in the source data, not from F-K. Each utterance increments its corpus's *_general aggregate plus the matching age band. The 8-band CHILDES + 6-band PhonBank inventories match the existing freq_childes_input_* and freq_pb_* columns on words.parquet.
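The FineWeb-Edu banding rule can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the `syllable_counts` dict stands in for the words.parquet syllable lookup, the exact vowel-cluster regex is an assumption, and treating bin edges as half-open intervals (lower edge inclusive) is inferred from the "< 7.6" and "≥ 16.8" endpoints in the band table.

```python
import re
from typing import Optional

FK_EDGES = (7.6, 10.7, 13.4, 16.8)  # p20/p40/p60/p80 cuts quoted in the text
FK_CLIP = 30.0
MIN_WORDS = 5


def heuristic_syllables(word: str) -> int:
    """Vowel-cluster fallback for OOV words: count vowel runs, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def sentence_fk(words: list[str], syllable_counts: dict[str, int]) -> Optional[float]:
    """Flesch-Kincaid grade for one sentence (S = 1), or None when W < 5."""
    n_words = len(words)
    if n_words < MIN_WORDS:
        return None  # F-K is unstable on tiny chunks, so the sentence is skipped
    n_syllables = sum(
        syllable_counts.get(w.lower(), heuristic_syllables(w)) for w in words
    )
    fk = 0.39 * n_words + 11.8 * (n_syllables / n_words) - 15.59
    return min(fk, FK_CLIP)


def bands_for_sentence(fk: float) -> list[str]:
    """Every scored sentence counts toward fineweb_adult plus exactly one bin."""
    bin_index = 1 + sum(fk >= edge for edge in FK_EDGES)
    return ["fineweb_adult", f"fineweb_b{bin_index}"]
```

A six-word sentence of monosyllables scores 0.39·6 + 11.8·1 − 15.59 = −1.45 and lands in fineweb_b1.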

PhonBank smoke-gate retired — the empirical inspection of dataset.jsonl (828K utterances, 22.9K vocab, age range 0-12y) confirms sufficient density for direct parsing.

Naming clarification: fineweb_b1..b5 are not identical to PHON-88's freq_b1..b5 columns on words.parquet. PHON-88 uses a composite (F-K + tier-1 + off-list) at chunk level, aggregated to per-word frequencies. PHON-94 uses pure F-K at sentence level, attributed to triples extracted from the sentence. The naming is parallel for clarity but the bands are computed differently.

`words.parquet` additions

Column	Type	Source
`lemma`	str	spaCy `token.lemma_.lower()` from canonical pass
`lemma_frequency`	Float32	Per-lemma aggregate of FineWeb-Edu surface counts
`lemma_log_frequency`	Float32	log10(`lemma_frequency` + 1)
`lemma_contextual_diversity`	Float32	Per-lemma CD (docs containing any surface form of the lemma)
`lemma_frequency_b1`	Float32	Per-lemma aggregate over sentences in F-K bin b1 (F-K < 7.6)
`lemma_frequency_b2`	Float32	bin b2 (F-K 7.6–10.7)
`lemma_frequency_b3`	Float32	bin b3 (F-K 10.7–13.4)
`lemma_frequency_b4`	Float32	bin b4 (F-K 13.4–16.8)
`lemma_frequency_b5`	Float32	bin b5 (F-K ≥ 16.8)

Existing surface-keyed FineWeb-Edu columns (frequency, log_frequency, contextual_diversity, pos, pos_alt, all_pos, all_freqs, PHON-88 grade-banded freq) are regenerated from the canonical pass with parser-informed POS resolution; values may shift slightly from PHON-72/PHON-88 baselines.

CHILDES-derived columns (PHON-86/87) are unaffected — those use MOR-tier transcripts, not spaCy.
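The per-lemma columns in the table above amount to a group-sum over surface forms followed by replication back onto each surface row. A dependency-free sketch, assuming each surface form carries exactly one lemma (as the single `lemma` column implies); the function name and input shapes are hypothetical:

```python
import math
from collections import defaultdict


def lemma_columns(
    surface_counts: dict[str, float],
    surface_to_lemma: dict[str, str],
) -> dict[str, dict[str, float]]:
    """Aggregate surface counts per lemma, then replicate onto every surface form."""
    totals: dict[str, float] = defaultdict(float)
    for surface, lemma in surface_to_lemma.items():
        totals[lemma] += surface_counts.get(surface, 0.0)
    return {
        surface: {
            "lemma_frequency": totals[lemma],
            "lemma_log_frequency": math.log10(totals[lemma] + 1),
        }
        for surface, lemma in surface_to_lemma.items()
    }


counts = {"run": 10.0, "runs": 5.0, "ran": 4.0, "cake": 9.0}
lemmas = {"run": "run", "runs": "run", "ran": "run", "cake": "cake"}
columns = lemma_columns(counts, lemmas)
# "run", "runs", and "ran" all carry the same lemma_frequency (19.0).
```

The same aggregation, restricted to sentences in one F-K bin, would give the lemma_frequency_b1..b5 columns.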

PMI computation

Per band, with add-α smoothing (α = 0.01; Lidstone smoothing, the generalization of Laplace add-one) and a min_count=5 floor:

P̂(f | v, r, b) = (c(v, r, f, b) + α) / (c(v, r, *, b) + α · |F_r,b|)
P̂(f | r, b)    = (c(*, r, f, b) + α) / (c(*, r, *, b) + α · |F_r,b|)
PPMI(v, r, f, b) = max(0, log₂( P̂(f | v, r, b) / P̂(f | r, b) ))

Write-time filter: only min_count=5. A triple with c(v, r, f, b) < 5 is dropped at write time (treat as no evidence — single-occurrence triples are mostly parsing noise per Jurafsky/Martin SLP3 §J.3). Rows with ppmi == 0 (below-chance) are kept. This preserves the consumer-side signal: a (verb, role, band) with no positive-PMI fillers but with attested low-PMI fillers is distinguishable from one with no data at all.
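The formulas and the write-time floor can be sketched in plain Python. This is a minimal sketch rather than the merge_shards.py implementation: it assumes the marginals c(v, r, *, b) and c(*, r, f, b) are accumulated over all triples before the min_count floor is applied, and that |F_r,b| is the number of distinct fillers seen for the role in the band. The function name and the input dict shape are hypothetical.

```python
import math
from collections import Counter

ALPHA = 0.01
MIN_COUNT = 5


def ppmi_rows(counts: dict[tuple[str, str, str, str], int]) -> list[dict]:
    """counts maps (verb, role, filler, band) -> c(v, r, f, b)."""
    c_vr = Counter()  # c(v, r, *, b)
    c_rf = Counter()  # c(*, r, f, b)
    c_r = Counter()   # c(*, r, *, b)
    fillers: dict[tuple[str, str], set] = {}  # (r, b) -> distinct fillers
    for (v, r, f, b), n in counts.items():
        c_vr[(v, r, b)] += n
        c_rf[(r, f, b)] += n
        c_r[(r, b)] += n
        fillers.setdefault((r, b), set()).add(f)

    rows = []
    for (v, r, f, b), n in counts.items():
        if n < MIN_COUNT:
            continue  # write-time floor: treated as no evidence
        size = len(fillers[(r, b)])
        p_f_given_vr = (n + ALPHA) / (c_vr[(v, r, b)] + ALPHA * size)
        p_f_given_r = (c_rf[(r, f, b)] + ALPHA) / (c_r[(r, b)] + ALPHA * size)
        ppmi = max(0.0, math.log2(p_f_given_vr / p_f_given_r))
        rows.append({
            "verb": v, "role": r, "filler": f, "band": b,
            "count_v_r_f": n, "count_v_r_star": c_vr[(v, r, b)], "ppmi": ppmi,
        })
    return rows  # rows with ppmi == 0.0 are kept on purpose
```

On toy counts where "cut" takes "cake" more often than the role-wide average, the (cut, dobj, cake) row gets a positive PPMI, while an attested but below-chance filler is written with ppmi == 0.0 rather than dropped.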

Consumer-side filtering. Consumers query for ppmi > τ themselves (default τ=0). The coverage gate c(v, r, *, b) ≥ 50 is also consumer-side: the data layer exposes count_v_r_star on every row; PHON-95 (or any consumer) decides whether to trust a verb's zero-PPMI rejection signal in a given band, falling open when coverage is insufficient.
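One way a consumer might combine the τ threshold with the coverage gate is sketched below. The function name and argument shapes are hypothetical; the only behavior taken from the text is that ppmi > τ admits, an absent or zero-PPMI triple rejects, and the consumer falls open when c(v, r, *, b) is below 50.

```python
from typing import Optional

TAU = 0.0
MIN_COVERAGE = 50


def admits_filler(
    row: Optional[dict],
    coverage: int,
    tau: float = TAU,
    min_coverage: int = MIN_COVERAGE,
) -> bool:
    """Decide whether a (verb, role, filler, band) candidate passes.

    row      -- the selectional row for the triple, or None if absent
    coverage -- count_v_r_star for the (verb, role, band), 0 if unseen
    """
    if coverage < min_coverage:
        return True  # fall open: too little data to trust a rejection
    if row is None:
        return False  # well-covered verb-role, triple not attested above the floor
    return row["ppmi"] > tau
```

With these defaults, a zero-PPMI row under a verb-role with coverage 400 is rejected, while the same row under coverage 12 is admitted.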

Storage estimate. Top-2K verbs × 9 roles × ~200 unique fillers × 9 bands ≈ 32M rows pre-floor. After min_count=5 floor: ~8-15M rows. Parquet compressed: ~1-2 GB. LFS-trackable.

Architecture

Module layout

packages/data/src/phonolex_data/
├── pipeline/
│   ├── canonical_spacy.py         # NEW — spaCy config + load_canonical_pipeline()
│   └── ...
├── loaders/
│   ├── selectional.py             # NEW — load selectional.parquet into WordStore
│   └── ...
└── runtime/
    ├── schema.py                  # MODIFIED — selectional_schema() gains band column
    ├── store.py                   # MODIFIED — WordStore.subcat_profile() / .role_fillability()
    └── ...

research/2026-05-06-phon-94-canonical-spacy-probe/
├── probe.py                       # Phase-0 sanity check
├── notebook.md                    # Findings, decisions, surprises
└── README.md

research/2026-05-06-phon-94-corpus-parse/
├── build_selectional.py           # Sharded parse + extract; writes shard-N.parquet
├── merge_shards.py                # Polars groupby-sum across shards → final selectional.parquet + freq+POS deltas
├── launch_shards.sh               # 4× RunPod H100 SXM
├── poll_progress.sh
└── notebook.md

Data flow

FineWeb-Edu corpus (HuggingFace streaming)
    ↓
4× H100 shards: canonical_spacy.parse() → extract triples + freq counts → write shard parquet
    ↓
ExternalData1 cold storage (raw shard parquets)
    ↓
merge_shards.py (local, Polars stream-merge)
    ├── selectional aggregation: groupby (verb, role, filler, band) → sum → compute PPMI → keep c ≥ 5 → write
    ↓                                                                        ↓
    │                                                                  data/runtime/selectional.parquet (LFS)
    │
    └── freq+POS aggregation: per-(surface, pos, band) and per-(lemma, pos, band) counts
                                                                              ↓
                                                                        deltas to words.parquet via
                                                                        export-to-d1.py pipeline regen

CHILDES corpus (MOR-tier participant utterances)
    ↓
1× H100: canonical_spacy.parse() → extract triples (selectional only; freq columns unchanged)
    ↓
merge → selectional.parquet (childes bands)

PhonBank corpus (smoke-gated)
    ↓
local CPU or 1× H100 if smoke passes
    ↓
merge → selectional.parquet (phonbank bands, conditional)

Consumer surface (WordStore derived views)

# selectional_edges loaded as a Polars DataFrame at WordStore startup
class WordStore:
    def selectional(
        self, verb: str, role: str, filler: str, band: str
    ) -> SelectionalEdge | None: ...

    def subcat_profile(self, verb: str, band: str) -> SubcatProfile:
        """Derived view: groupby role, classify transitivity from dominant pattern."""
        ...

    def role_fillability(self, filler: str, band: str) -> dict[str, float]:
        """Derived view: per-(filler, band) marginal P(role | filler)."""
        ...

subcat_profile and role_fillability are lazy + cached; first call triggers a Polars groupby, subsequent calls return the cached result.
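The lazy-plus-cached behavior can be illustrated without Polars. The sketch below is a dependency-free stand-in, not the WordStore implementation: `SelectionalViews` is a hypothetical helper, rows are plain dicts, and estimating P(role | filler) from summed count_v_r_f values is an assumption about how the marginal is defined.

```python
from collections import defaultdict


class SelectionalViews:
    """Hypothetical helper; in the design these methods live on WordStore."""

    def __init__(self, rows: list[dict]):
        self._rows = rows
        self._fillability_cache: dict[tuple[str, str], dict[str, float]] = {}
        self.aggregations_run = 0  # exposed only to make the caching observable

    def role_fillability(self, filler: str, band: str) -> dict[str, float]:
        key = (filler, band)
        if key not in self._fillability_cache:
            # First call for this key: aggregate counts by role.
            self.aggregations_run += 1
            by_role: dict[str, int] = defaultdict(int)
            for row in self._rows:
                if row["filler"] == filler and row["band"] == band:
                    by_role[row["role"]] += row["count_v_r_f"]
            total = sum(by_role.values())
            self._fillability_cache[key] = (
                {role: n / total for role, n in by_role.items()} if total else {}
            )
        return self._fillability_cache[key]


rows = [
    {"verb": "eat", "role": "dobj", "filler": "cake", "band": "fineweb_adult", "count_v_r_f": 30},
    {"verb": "bake", "role": "dobj", "filler": "cake", "band": "fineweb_adult", "count_v_r_f": 10},
    {"verb": "fall", "role": "nsubj", "filler": "cake", "band": "fineweb_adult", "count_v_r_f": 10},
    {"verb": "eat", "role": "dobj", "filler": "cake", "band": "fineweb_b1", "count_v_r_f": 7},
]
views = SelectionalViews(rows)
first = views.role_fillability("cake", "fineweb_adult")
second = views.role_fillability("cake", "fineweb_adult")  # served from the cache
```

Because the view is computed from selectional rows on demand, there is no second artifact that can drift out of date.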

Phase 0: probe (sanity-check gate)

Local 1,000-doc FineWeb-Edu sample (~3M tokens post-filter), canonical config, ~10-15 min CPU run. Outputs JSON stats + a markdown lab notebook.

Presumptions checked:

DEP label histogram. Confirm nsubj, dobj, iobj, pobj, xcomp, ccomp are in the label inventory; verify iobj doesn't surface as dative in this spaCy version; flag any high-frequency label we hadn't accounted for.
Top 30 verb lemmas + sample triples. Spot-check lemmatization: running/runs/ran collapse to run; common verb extraction looks clean; particle verbs don't crash.
Pronoun lemma form. Confirm he/she/it not -PRON-. Affects whether the PRON-drop filter even fires.
Passive voice prevalence. Count nsubjpass instances. >5% of nsubj-like edges → committed remap to dobj.
PP attachment: V-rooted vs N-rooted. Count both; verify pobj_with extraction filters out NP-modifier PPs.
Coordination prevalence. Count conj chains under nsubj/dobj — measures evidence loss from single-head extraction.
Particle-verb prevalence. V+prt patterns; flag conflation magnitude (acceptable for v1).
Throughput. Tokens/sec local CPU _trf + parser + lemmatizer. Calibrates H100×4 wallclock estimate.

Gate criteria: all eight presumptions either confirmed-as-expected or addressed in the canonical config before authorizing the production parse. Any surprise becomes a notebook entry + a commit.
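Two of the checks above (the DEP label inventory and passive-voice prevalence) reduce to a histogram over dependency labels. The sketch below is illustrative, not probe.py: the function name and report keys are hypothetical, and "nsubj-like edges" is taken to mean nsubj plus nsubjpass.

```python
from collections import Counter
from typing import Iterable

EXPECTED_LABELS = ("nsubj", "dobj", "iobj", "pobj", "xcomp", "ccomp")
PASSIVE_THRESHOLD = 0.05


def probe_dep_labels(dep_labels: Iterable[str]) -> dict:
    """Summarize a stream of dependency labels for the probe notebook."""
    hist = Counter(dep_labels)
    subj_like = hist["nsubj"] + hist["nsubjpass"]  # assumption: these two labels only
    passive_share = hist["nsubjpass"] / subj_like if subj_like else 0.0
    return {
        "missing": [label for label in EXPECTED_LABELS if hist[label] == 0],
        "iobj_surfaces_as_dative": hist["iobj"] == 0 and hist["dative"] > 0,
        "passive_share": passive_share,
        "passive_remap_needed": passive_share > PASSIVE_THRESHOLD,
    }


sample = (
    ["nsubj"] * 90 + ["nsubjpass"] * 10 + ["dobj"] * 50
    + ["dative"] * 5 + ["pobj"] * 40 + ["xcomp"] * 3 + ["ccomp"] * 2
)
report = probe_dep_labels(sample)
```

On this toy sample the report flags iobj as missing, notes that dative is present instead, and measures a 10% passive share, which exceeds the 5% remap threshold.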

Phase 1: production parse

After probe clears:

FineWeb-Edu — 4× RunPod H100 SXM, sharded i/N like PHON-72. ~3-4h wallclock. Per-shard parquet output to ExternalData1. Local merge to selectional.parquet + freq+POS deltas.

CHILDES — 1× H100, ~30-60 min. Single-pod parse, no sharding (corpus is 30M tokens). Selectional only (frequency already shipped).

PhonBank — smoke test on PhonBank's per-band triple density. If top-2K verbs have ≥ min_count=5 triples in the smallest band, commit to a parse pass; otherwise drop PhonBank from the band inventory and document the decision in the notebook.
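For the sharded FineWeb-Edu pass, the key property is that per-shard counts merge by simple addition, which is why the merge step can be a groupby-sum. The sketch below assumes round-robin assignment of documents to shard i of N; the actual PHON-72 sharding scheme is not specified here, and both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator


def shard_docs(docs: Iterable[str], shard_index: int, num_shards: int) -> Iterator[str]:
    """Yield the documents owned by shard_index (assumed round-robin by position)."""
    for position, doc in enumerate(docs):
        if position % num_shards == shard_index:
            yield doc


def merge_shard_counts(shard_counts: Iterable[Counter]) -> Counter:
    """Sum per-shard (verb, role, filler, band) counts into corpus-level counts."""
    total: Counter = Counter()
    for counts in shard_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total


docs = [f"doc{i}" for i in range(10)]
shards = [list(shard_docs(docs, i, 4)) for i in range(4)]

key = ("eat", "dobj", "cake", "fineweb_adult")
merged = merge_shard_counts([Counter({key: 3}), Counter({key: 2}), Counter(), Counter()])
```

PPMI has to be computed after this merge, since it depends on corpus-level marginals that no single shard sees.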

Tests

packages/data/tests/pipeline/test_canonical_spacy.py:
- test_canonical_spacy_fixture: 10 hand-written sentences; locks the canonical config's behavior. Verifies expected DEP labels, lemmas, POS for each.

packages/data/tests/runtime/test_selectional_parquet.py:
- test_schema_roundtrip: tiny DF → write → read → schema matches selectional_schema().
- test_known_verb_dobj_admits_plausible_filler: (cut, dobj, cake) has ppmi > 0 at fineweb_adult band.
- test_known_verb_dobj_rejects_implausible_filler: (cut, dobj, thunder) is absent or has ppmi == 0.
- test_known_verb_dobj_admits_paper_meat: (cut, dobj, paper) and (cut, dobj, meat) have ppmi > 0 (broader sanity).
- test_coverage_gate_consumer_logic: WordStore.selectional() exposes count_v_r_star; consumers honor the ≥ 50 gate.
- test_band_consistency: for any triple in any fineweb_b{i}, the same triple in fineweb_adult has c_vrf_adult ≥ c_vrf_b{i}.
- test_passive_remap: sentence "the apple was eaten by the boy" produces (eat, dobj, apple) not (eat, nsubj, apple).
- test_wordstore_subcat_profile: WordStore.subcat_profile(verb='give', band='fineweb_adult') returns transitivity='ditrans'.
- test_wordstore_role_fillability: WordStore.role_fillability(filler='cake', band='fineweb_adult') shows dobj as dominant role.

Acceptance criteria

data/runtime/selectional.parquet populated and round-trips through pl.read_parquet() with the schema-extended (band-column-included) signature.
Sanity test on a known verb: cut admits cake, paper, meat as dobj; rejects thunder, idea. Confirmed at fineweb_adult band.
All bands in the inventory have at least the top-100 verbs populated above min_count=5.
words.parquet regenerated columns produce non-degenerate values (no all-zero columns; lemma column has reasonable English lemma forms).
WordStore.subcat_profile() and .role_fillability() return non-empty results for any verb/filler with adequate corpus support.
Probe notebook committed with all 8 presumption checks reported and any required canonical-config adjustments documented.
All tests pass on CI.
PR back to release/v5.2.0.

Open follow-ups

Cold-storage policy ticket. PHON-94 uses ExternalData1 as an interim home for raw shard parquets; broader policy for raw datasets, intermediate parses, and other large derived data needs its own ticket. File at PHON-94 close.
Verb-with-particle disambiguation. "give up" vs "give" conflate at the lemma level. Acceptable for v1; revisit if PHON-95 evaluation surfaces it as a coherence-gap.
Coordination evidence loss. Single-head extraction misses Mary in "John and Mary ate". Acceptable for v1; coordination-aware extraction is a candidate enhancement if probe results show high prevalence in CHILDES.
CHILDES license posture documentation. PHON-86/87/88 already ship CHILDES-derived freq aggregates under the same posture (derived statistical aggregates, not per-utterance redistribution); PHON-94 selectional aggregates inherit. Confirm data/SOURCES.md covers selectional under the existing CHILDES entry.