PHON-117: Determiner selectional frequency in selectional.parquet

Status: SPEC
Owner: TBD
Filed: 2026-05-11
Driven by: PHON-110 / v5.2 CSP realizer determiner-diversity gap

Why this exists

The CSP realizer currently synthesizes determiners with a tiny rule:
- Sentence-initial nominal slot → "the" (definite default)
- Non-initial nominal slot → "a" / "an" by next noun's leading sound

This is acceptable v5.2 output, but two failure modes are visible:
1. No diversity in non-initial position. Every non-subject noun gets a/an. The real-world English distribution at object positions is mixed (the ~55%, a/an ~30%, this/that/some/my/... ~15% in adult corpora).
2. Ungrounded choice. The realizer doesn't know whether the corpus actually uses "the X" or "a X" for a given (verb, role, filler) tuple. Mixing in "the" via hash (the PHON-110 attempt) produced "The cat eats the cat"-style ambiguity because the realizer can't tell whether a second-position "the X" should corefer with the subject.

We have the data to fix this in a principled way: the corpus DEP parse already attaches det(noun, the/a/this/...) relations; we just don't store them in selectional.parquet. This spec adds a det_counts column (or sibling table) that records actual determiner frequencies per (verb, role, filler) in each band, so the realizer can sample from corpus-attested distributions.

Scope

Extend selectional.parquet (and the corpus parse pipeline that produces it) with per-(verb, role, filler, band) determiner-frequency data.

Approach (sketch; implementer to confirm)

Data shape

Two options, pick one:

Option A — wide column on existing rows. Add a det_counts: Map<str, u32> column to selectional.parquet. Each row has {the: N1, a: N2, an: N3, this: N4, ...} for that exact (verb, role, filler, band) tuple. NULL for rows with no observed determiner (typically PRON-headed slots).

Pros: single-table join in the realizer. Cons: sparse map storage; per-row width balloons if we have a long determiner vocabulary.

Option B — sibling parquet selectional_det.parquet. Schema: (verb, role, filler, band, det, count). One row per (verb, role, filler, band, det) observation.

Pros: cleaner schema; easy to aggregate at runtime. Cons: extra IO + join on the hot path.

Recommend A for runtime simplicity unless map storage proves prohibitive.
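To make the two shapes concrete, here is a minimal pure-Python sketch of one tuple under each option, with a lossless conversion in both directions. The sample counts, the role label "dobj", the band label "b1", and the helper names wide_to_long / long_to_wide are all hypothetical; real storage would be Parquet rather than Python dicts.

```python
from collections import defaultdict

# Option A: one row per (verb, role, filler, band), determiner counts in a map.
# Every value below is made-up illustration data, not corpus output.
wide_row = {
    "verb": "eat", "role": "dobj", "filler": "apple", "band": "b1",
    "det_counts": {"the": 12, "an": 7, "this": 1},
}

def wide_to_long(row):
    """Explode an Option A row into Option B rows (one per determiner)."""
    key = (row["verb"], row["role"], row["filler"], row["band"])
    det_counts = row["det_counts"] or {}  # None (NULL) means no determiner observed
    return [key + (det, count) for det, count in sorted(det_counts.items())]

def long_to_wide(rows):
    """Group Option B rows back into Option A rows."""
    grouped = defaultdict(dict)
    for verb, role, filler, band, det, count in rows:
        grouped[(verb, role, filler, band)][det] = count
    return [
        {"verb": v, "role": r, "filler": f, "band": b, "det_counts": counts}
        for (v, r, f, b), counts in grouped.items()
    ]

long_rows = wide_to_long(wide_row)
```

Because the conversion round-trips, the choice between A and B can be revisited after the memory footprint is measured without re-running the parse.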

Determiner vocabulary

Allowed determiner set (filtered at parse time to drop noise):

the, a, an, this, that, these, those,
some, any, no, every, each, all, both, neither, either,
my, your, his, her, its, our, their,
what, which, another

Anything outside this set → bucket as <other> or drop (debate at implementation time).
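A parse-time filter for this vocabulary could look like the sketch below. The function name normalize_det and the bucket_other switch are hypothetical; the switch only exists so that either outcome of the bucket-or-drop decision can be tried.

```python
# Allowed determiner set, copied from the list above.
ALLOWED_DETS = frozenset({
    "the", "a", "an", "this", "that", "these", "those",
    "some", "any", "no", "every", "each", "all", "both", "neither", "either",
    "my", "your", "his", "her", "its", "our", "their",
    "what", "which", "another",
})
OTHER_BUCKET = "<other>"

def normalize_det(lemma, bucket_other=True):
    """Return the determiner key to count, or None if the token should be dropped."""
    det = lemma.strip().lower()
    if det in ALLOWED_DETS:
        return det
    # Out-of-vocabulary determiner: bucket it or drop it, depending on the flag.
    return OTHER_BUCKET if bucket_other else None
```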

Pipeline change

Modify the corpus parse in research/2026-05-06-phon-94-corpus-parse/ (or wherever selectional.parquet is built) to also emit per-(verb, role, filler, band) determiner counts. The DEP tree already has det arcs: for each filler noun, collect the lemma of its det child token and increment the matching count.
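The counting step can be sketched as follows, assuming a simplified token record. The Token class, the collect_det_counts name, the role labels, and the band value "b1" are hypothetical stand-ins for whatever the real parse pipeline exposes.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass
class Token:
    lemma: str
    dep: str   # dependency label of the arc from this token to its head
    head: int  # index of the head token within the sentence

def collect_det_counts(sentence, band, roles=("nsubj", "dobj")):
    """Count det children of filler nouns, keyed by (verb, role, filler, band)."""
    counts = defaultdict(Counter)
    for tok in sentence:
        if tok.dep != "det":
            continue
        noun = sentence[tok.head]
        if noun.dep not in roles:
            continue  # article attached to a non-filler noun: skip it
        verb = sentence[noun.head]
        counts[(verb.lemma, noun.dep, noun.lemma, band)][tok.lemma.lower()] += 1
    return counts

# "The cat eats an apple." with hypothetical dependency labels.
sentence = [
    Token("the", "det", 1),
    Token("cat", "nsubj", 2),
    Token("eat", "ROOT", 2),
    Token("an", "det", 4),
    Token("apple", "dobj", 2),
]
det_counts = collect_det_counts(sentence, band="b1")
```

Restricting the walk to det arcs whose head carries a filler role is what keeps articles on prepositional objects and other non-filler nouns out of the counts.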

Realizer integration

In _render_function_pos for the DET position (or in the realize loop directly):
1. Look up det_counts for the filler about to be placed at the next slot.
2. If non-empty, sample a determiner weighted by counts (deterministic seed per candidate or simple top-1).
3. Else fall back to current the/a/an synthesis.
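The three steps above can be sketched as a standalone function. This is not the realizer's actual code: choose_det, fallback_det, stable_seed, and the candidate_key tuple are hypothetical names, and the vowel-letter test is only a rough approximation of a leading-sound check. The seed is derived with hashlib because Python's built-in hash() of strings varies between processes by default.

```python
import hashlib
import random

def fallback_det(noun, sentence_initial):
    """Current rule: "the" sentence-initially, otherwise a/an.
    Checking the first letter is a crude stand-in for a leading-sound test."""
    if sentence_initial:
        return "the"
    return "an" if noun[:1].lower() in ("a", "e", "i", "o", "u") else "a"

def stable_seed(*parts):
    """Derive a seed that is identical across runs and processes."""
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def choose_det(det_counts, noun, sentence_initial, candidate_key, top1=False):
    # Steps 1 and 3: missing or empty counts fall back to the/a/an synthesis.
    if not det_counts:
        return fallback_det(noun, sentence_initial)
    dets = sorted(det_counts)  # fixed order keeps sampling reproducible
    if top1:
        return max(dets, key=lambda d: det_counts[d])
    # Step 2: count-weighted sample with a per-candidate deterministic seed.
    rng = random.Random(stable_seed(*candidate_key))
    return rng.choices(dets, weights=[det_counts[d] for d in dets], k=1)[0]
```

Calling choose_det twice with the same counts and candidate_key returns the same determiner, which is the property the fixed-seed unit test in the completion criteria relies on.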

Same change applies to _realize_legacy._det_for.

Sentence-initial special case

Sentence-initial nsubj should still tend to "the" (the definite default for a new subject). Could either:
- Force "the" for nsubj position regardless of det_counts (simple, matches current behavior).
- Sample from det_counts but bias toward definite forms.
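The second option could be implemented as a reweighting step applied before sampling. The sketch below is one possible form: the DEFINITE_DETS membership and the boost factor of 3.0 are assumptions chosen for illustration, not values taken from the corpus or the realizer.

```python
# Assumed set of "definite" determiners; whether possessives belong here is open.
DEFINITE_DETS = frozenset({"the", "this", "that", "these", "those"})

def bias_toward_definite(det_counts, boost=3.0):
    """Return sampling weights with definite determiners scaled up by `boost`."""
    return {
        det: count * (boost if det in DEFINITE_DETS else 1.0)
        for det, count in det_counts.items()
    }

def definite_share(weights):
    """Fraction of total weight that falls on definite determiners."""
    total = sum(weights.values())
    return sum(w for d, w in weights.items() if d in DEFINITE_DETS) / total

# Worked example: an even the/a split moves from 0.5 to 30 / (30 + 10) = 0.75.
biased = bias_toward_definite({"the": 10, "a": 10})
```

A boost of 1.0 reduces to plain corpus sampling, and a very large boost approaches the force-"the" option, so the two choices above are the ends of one dial.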

Completion criteria

[ ] selectional.parquet includes det_counts (or sibling table). Verified by pl.read_parquet(...).columns.
[ ] Counts cover ≥80% of (verb, role, filler, band) tuples that have at least one corpus instance with a det arc on the filler noun.
[ ] Realizer uses det_counts when available; falls back to the/a/an synthesis when missing.
[ ] Triage on /api/generate-sentences shows determiner mix matching corpus distribution (~55% the, ~30% a/an, ~15% other in non-initial position).
[ ] Unit test: same (verb, filler) input produces deterministic output (same det every time) given a fixed seed.
[ ] Tests for the parse pipeline confirm determiners get collected only from det DEP arcs whose head is the filler noun (not from articles attached to non-filler nouns).
[ ] Memory footprint of det_counts column documented; if map-column approach blows past target, switch to sibling-parquet.
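The coverage and determiner-mix criteria can be checked with two small helpers, sketched here under assumptions. The n_det_instances field (how many corpus instances of a tuple had a det arc) is hypothetical, as are the function names; the real check would read these values from the rebuilt parquet and from triage output.

```python
from collections import Counter

def det_coverage(rows):
    """Share of tuples with at least one det-arc instance that have det_counts filled."""
    eligible = [r for r in rows if r["n_det_instances"] > 0]
    if not eligible:
        return 1.0  # nothing to cover
    covered = sum(1 for r in eligible if r["det_counts"])
    return covered / len(eligible)

def det_mix(determiners):
    """Bucket realized determiners into the / a-an / other shares."""
    counts = Counter(
        "the" if d == "the" else "a/an" if d in ("a", "an") else "other"
        for d in determiners
    )
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty input
    return {bucket: counts[bucket] / total for bucket in ("the", "a/an", "other")}
```

For example, 20 generated non-initial determiners made up of 11 "the", 6 "a"/"an", and 3 others give shares of 0.55, 0.30, and 0.15, which matches the target mix.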

What this ticket explicitly does NOT do

- Determiner agreement (e.g., this requires singular noun; these requires plural). This is v5.3 work; the noise from skipping it is bounded since count-weighted sampling biases away from disagreement-heavy patterns naturally.
- Possessive determiner antecedent resolution (my, your, his need a referent). Sample-and-hope for v1; refine if reranker scores are bad.

Anti-patterns to avoid (carryover from PHON-115)

- No "PROCEED to integration in a follow-up ticket." If parse pipeline + integration both fit, do both. If only one fits the budget, land the parse data first and note clearly that integration is pending, but file no follow-up ticket: complete the integration as part of this one, extending the deadline if needed.
- No new columns without consumption. The det_counts column has to be wired into the realizer in this ticket, not landed inert.

Estimated scope

~2 days: parse-pipeline modification + parquet rebuild (~$5-10 RunPod for FineWeb-Edu reparse if needed; cheap for CHILDES/PhonBank), realizer integration, tests. Total ~$15 + 2 dev days.