PHON-117: Determiner selectional frequency in selectional.parquet
Status: SPEC | Owner: TBD | Filed: 2026-05-11 | Driven by: PHON-110 / v5.2 CSP realizer determiner-diversity gap
Why this exists
The CSP realizer currently synthesizes determiners with a tiny rule:
- Sentence-initial nominal slot → "the" (definite default)
- Non-initial nominal slot → "a" / "an" by next noun's leading sound
This is acceptable v5.2 output but two failure modes are visible:
1. No diversity in non-initial position. Every non-subject noun gets a/an, whereas the real-world English distribution at object positions is mixed (the ~55%, a/an ~30%, this/that/some/my/... ~15% in adult corpora).
2. Ungrounded choice. The realizer doesn't know whether the corpus actually uses the X vs a X for a given (verb, role, filler) tuple. Mixing in "the" via hash (PHON-110 attempt) produced "The cat eats the cat"-style ambiguity because the realizer can't tell whether second-position the X should corefer with the subject.
We have the data to fix this in a principled way: the corpus DEP parse already attaches det(noun, the/a/this/...) relations; we just don't store them in selectional.parquet. This spec adds a det_counts column (or sibling table) that records actual determiner frequencies per (verb, role, filler) in each band, so the realizer can sample from corpus-attested distributions.
Scope
Extend selectional.parquet (and the corpus parse pipeline that produces it) with per-(verb, role, filler, band) determiner-frequency data.
Approach (sketch; implementer to confirm)
Data shape
Two options, pick one:
Option A — wide column on existing rows.
Add a det_counts: Map<str, u32> column to selectional.parquet. Each row has {the: N1, a: N2, an: N3, this: N4, ...} for that exact (verb, role, filler, band) tuple. NULL for rows with no observed determiner (typically PRON-headed slots).
Pros: single-table join in the realizer. Cons: sparse map storage; per-row width balloons if we have a long determiner vocabulary.
Option B — sibling parquet selectional_det.parquet.
Schema: (verb, role, filler, band, det, count). One row per (verb, role, filler, band, det) observation.
Pros: cleaner schema; easy to aggregate at runtime. Cons: extra IO + join on the hot path.
Recommend A for runtime simplicity unless map storage proves prohibitive.
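The two shapes carry the same information, so switching later is a mechanical fold. A minimal stdlib sketch of that fold, using plain tuples and dicts in place of parquet rows; the role label "dobj", the band value "band1", and the helper name to_wide are illustrative assumptions, not names from the pipeline:

```python
from collections import defaultdict

# Hypothetical Option B rows: (verb, role, filler, band, det, count).
long_rows = [
    ("eat", "dobj", "apple", "band1", "an", 7),
    ("eat", "dobj", "apple", "band1", "the", 3),
    ("see", "dobj", "dog", "band1", "the", 5),
]


def to_wide(rows):
    """Fold Option B rows into Option A's per-tuple det_counts maps."""
    wide = defaultdict(dict)
    for verb, role, filler, band, det, count in rows:
        counts = wide[(verb, role, filler, band)]
        # Sum duplicates in case the same (tuple, det) pair appears twice.
        counts[det] = counts.get(det, 0) + count
    return dict(wide)


wide = to_wide(long_rows)
```

A tuple that never appears in the long rows is simply absent from the wide dict, which corresponds to the NULL case in Option A.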
Determiner vocabulary
Allowed determiner set (filtered at parse time to drop noise):
the, a, an, this, that, these, those,
some, any, no, every, each, all, both, neither, either,
my, your, his, her, its, our, their,
what, which, another
Anything outside this set → bucket as <other> or drop (debate at implementation time).
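The parse-time filter could look roughly like this. The set is copied from the list above; the function name normalize_det and the keep_other switch are assumptions that leave the bucket-or-drop question open:

```python
ALLOWED_DETS = frozenset({
    "the", "a", "an", "this", "that", "these", "those",
    "some", "any", "no", "every", "each", "all", "both", "neither", "either",
    "my", "your", "his", "her", "its", "our", "their",
    "what", "which", "another",
})
OTHER = "<other>"


def normalize_det(lemma, keep_other=True):
    """Map a determiner lemma to the allowed set.

    Out-of-vocabulary lemmas become "<other>" when keep_other is True,
    or None (meaning: drop the observation) when it is False.
    """
    det = lemma.lower()
    if det in ALLOWED_DETS:
        return det
    return OTHER if keep_other else None
```

Callers would skip the increment when the result is None.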
Pipeline change
Modify the corpus parse in research/2026-05-06-phon-94-corpus-parse/ (or wherever selectional.parquet is built) to also emit per-(verb, role, filler, band) determiner counts. The DEP tree already has det arcs — collect the child token's lemma at each filler-noun and increment.
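A rough sketch of the collection step, under stated assumptions: the Token shape, the FILLER_ROLES labels, and the "band1" value are hypothetical stand-ins for whatever the real parse emits. It only counts det arcs whose head noun fills a verb role, which is also the behavior the pipeline tests should pin down:

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Assumed role labels for filler nouns; the real parse may use different names.
FILLER_ROLES = frozenset({"nsubj", "dobj", "iobj"})


@dataclass
class Token:
    lemma: str
    pos: str     # coarse POS tag, e.g. "NOUN", "VERB", "DET"
    deprel: str  # dependency label of the arc to the head
    head: int    # index of the head token in the sentence, -1 for the root


def collect_det_counts(sentence, band, counts):
    """Increment counts[(verb, role, filler, band)][det] for each det arc on a filler noun."""
    for tok in sentence:
        if tok.deprel != "det" or tok.head < 0:
            continue
        noun = sentence[tok.head]
        if noun.deprel not in FILLER_ROLES or noun.head < 0:
            continue  # determiner sits on a non-filler noun: ignore
        verb = sentence[noun.head]
        if verb.pos != "VERB":
            continue
        counts[(verb.lemma, noun.deprel, noun.lemma, band)][tok.lemma.lower()] += 1


# "The cat eats an apple on the table": the last "the" attaches to a non-filler noun.
sentence = [
    Token("the", "DET", "det", 1),
    Token("cat", "NOUN", "nsubj", 2),
    Token("eat", "VERB", "root", -1),
    Token("an", "DET", "det", 4),
    Token("apple", "NOUN", "dobj", 2),
    Token("on", "ADP", "prep", 2),
    Token("the", "DET", "det", 7),
    Token("table", "NOUN", "pobj", 5),
]
counts = defaultdict(Counter)
collect_det_counts(sentence, "band1", counts)
```

Vocabulary filtering is left out here; it would slot in just before the increment.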
Realizer integration
In _render_function_pos for DET position (or in the realize loop directly):
1. Look up det_counts for the filler about to be placed at the next slot.
2. If non-empty, sample a determiner weighted by counts (deterministic seed per candidate or simple top-1).
3. Else fall back to current the/a/an synthesis.
Same change applies to _realize_legacy._det_for.
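One way the lookup, sample, and fallback steps could fit together. This is a sketch, not the realizer's code: choose_det, fallback_det, the key layout, and the spelling-based a/an check are all assumptions. The seed is derived with hashlib rather than the built-in hash(), since hash() of strings varies between processes and would break the fixed-seed determinism requirement:

```python
import hashlib
import random

VOWELS = frozenset("aeiou")


def fallback_det(noun, sentence_initial):
    """Current rule: "the" sentence-initially, otherwise a/an by leading sound."""
    if sentence_initial:
        return "the"
    # Spelling-based approximation of the leading sound (an assumption).
    return "an" if noun[:1].lower() in VOWELS else "a"


def choose_det(det_counts, key, noun, sentence_initial=False, seed=0):
    """Sample a determiner weighted by corpus counts, else use the fallback rule."""
    counts = det_counts.get(key)
    if not counts:
        return fallback_det(noun, sentence_initial)
    dets = sorted(counts)  # fixed order so the draw is reproducible
    weights = [counts[d] for d in dets]
    digest = hashlib.sha256(repr((seed, key)).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.choices(dets, weights=weights, k=1)[0]


det_counts = {("eat", "dobj", "apple", "band1"): {"the": 3, "an": 7}}
key = ("eat", "dobj", "apple", "band1")
first = choose_det(det_counts, key, "apple", seed=42)
second = choose_det(det_counts, key, "apple", seed=42)
```

With the same seed and key, first and second are always equal. The simple top-1 variant would replace the draw with max(dets, key=counts.get).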
Sentence-initial special case
Sentence-initial nsubj should still tend to "the" (definite given new subject). Could either:
- Force "the" for nsubj position regardless of det_counts (simple, matches current behavior).
- Sample from det_counts but bias toward definite forms.
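Both options fit behind one switch. A hedged sketch: the DEFINITE set, the boost factor of 3.0, and the function names are illustrative assumptions to be tuned or replaced at implementation time:

```python
import random

# Assumed set of "definite" forms; whether possessives belong here is open.
DEFINITE = frozenset({"the", "this", "that", "these", "those"})


def bias_definite(counts, boost=3.0):
    """Scale the weight of definite determiners by boost, leave others as-is."""
    return {det: n * (boost if det in DEFINITE else 1.0) for det, n in counts.items()}


def subject_det(counts, force_the=True, boost=3.0, rng=None):
    """Pick a determiner for a sentence-initial nsubj slot."""
    if force_the or not counts:
        return "the"  # first option: matches current behavior
    weighted = bias_definite(counts, boost)
    dets = sorted(weighted)
    rng = rng or random.Random(0)
    return rng.choices(dets, weights=[weighted[d] for d in dets], k=1)[0]
```

Keeping force_the=True as the default preserves current output until the biased variant has been triaged.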
Completion criteria
- [ ] selectional.parquet includes det_counts (or sibling table). Verified by pl.read_parquet(...).columns.
- [ ] Counts cover ≥80% of (verb, role, filler, band) tuples that have any DET-headed corpus instance.
- [ ] Realizer uses det_counts when available; falls back to the/a/an synthesis when missing.
- [ ] Triage on /api/generate-sentences shows determiner mix matching corpus distribution (~55% the, ~30% a/an, ~15% other in non-initial position).
- [ ] Unit test: same (verb, filler) input produces deterministic output (same det every time) given a fixed seed.
- [ ] Tests for the parse pipeline confirm determiners get collected from det DEP arcs only (not from articles attached to non-filler nouns).
- [ ] Memory footprint of det_counts column documented; if map-column approach blows past target, switch to sibling-parquet.
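The triage criterion can be made checkable with a small bucket-and-compare helper. A sketch only: the target shares come from the criterion above, while the 0.05 tolerance and the function names are assumptions, and extracting non-initial determiners from generated sentences is left to the caller:

```python
from collections import Counter

# Target shares for non-initial position, from the completion criteria.
TARGET = {"the": 0.55, "a/an": 0.30, "other": 0.15}


def det_mix(dets):
    """Return the share of each bucket among a list of emitted determiners."""
    buckets = Counter()
    for det in dets:
        if det == "the":
            buckets["the"] += 1
        elif det in ("a", "an"):
            buckets["a/an"] += 1
        else:
            buckets["other"] += 1
    total = sum(buckets.values())
    return {name: (buckets[name] / total if total else 0.0) for name in TARGET}


def within_tolerance(mix, tol=0.05):
    """True when every bucket is within tol of its target share."""
    return all(abs(mix[name] - share) <= tol for name, share in TARGET.items())


sample = ["the"] * 11 + ["a"] * 4 + ["an"] * 2 + ["my"] * 3
mix = det_mix(sample)
```

Current v5.2 output, which is all a/an in non-initial position, fails this check by construction.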
What this ticket explicitly does NOT do
- Determiner agreement (e.g., this requires singular noun; these requires plural). v5.3 work; the noise from skipping this is bounded since count-weighted sampling biases away from disagreement-heavy patterns naturally.
- Possessive determiner antecedent resolution (my, your, his need a referent). Sample-and-hope for v1; refine if reranker scores are bad.
Anti-patterns to avoid (carryover from PHON-115)
- No "PROCEED to integration in a follow-up ticket." If parse pipeline + integration both fit, do both. If only one fits the budget, ship just the parse data and note clearly that integration is pending — but file no follow-up ticket; complete the integration as part of this one or extend the deadline.
- No new columns without consumption. The det_counts column has to be wired into the realizer in this ticket, not landed inert.
Estimated scope
~2 days: parse-pipeline modification + parquet rebuild (~$5-10 RunPod for FineWeb-Edu reparse if needed; cheap for CHILDES/PhonBank), realizer integration, tests. Total ~$15 + 2 dev days.