PHON-115: AoA replacement via PHON-73 LLM-rating pattern
Status: SPEC (design ratified by pilot — 2026-05-11)
Owner: TBD (parallel to v5.2 ship)
Filed: 2026-05-11
Replaces: the draft 2026-05-11-phon-115-aoa-replacement-completion.md (multitask-LightGBM premise invalidated by pilot)
Closes: the unfiled "follow-up ticket" from the PHON-71 spike (fcf865c)
Why this exists
PHON-71's audit flagged aoa_kuperman as 🟡 license-encumbered. The PHON-71 spike (commit fcf865c) trained a Glasgow-derived LightGBM regression (5-fold CV R² = 0.745 ± 0.017), wrote data/norms/phonolex_aoa.tsv, and concluded "PROCEED to integration in a follow-up ticket." The ticket was never filed. Consequences:
- `aoa_kuperman` still shipped in v5.2 (`words.parquet`, `word_percentiles`, D1 schema, frontend slider).
- Commit `8b34daf` (April 7) re-pointed the frontend slider TO Kuperman with the rationale "stop word exemption makes Kuperman safe", a fair-use claim never reconciled with the audit.
- PHON-85/86/87/88 (May 4) shipped age-banded CHILDES + PhonBank frequency (~98 columns) correlating Spearman -0.751 with Glasgow AoA, but unconsumed by any UI, constraint, or model.
This ticket finishes the work. No follow-up tickets allowed for "integration."
Pilot evidence: design decision is empirical, not architectural
A pilot at research/2026-05-11-phon-115-aoa-pilot/ validated the PHON-73 pattern for AoA before this spec was written. Two runs:
Glasgow regression (N=5,551, full Glasgow AoA-labeled vocabulary):
- Spearman 0.868, Pearson 0.861, MAE 0.714 on the 1-7 scale
- 0 failures, mean coverage 1.000 (full prob mass on valid 1-7 tokens)
- 23m runtime at concurrency=6, cost ~$1.40

Kuperman cross-construct sanity (N=500 Glasgow-unseen Kuperman rows):
- Spearman 0.832, Pearson 0.816
- 0 failures, coverage 1.000
Compared against the original draft spec's gates and the PHON-71 spike baseline:
| Gate | Target | Pilot result | PHON-71 spike | Notes |
|---|---|---|---|---|
| Held-out Glasgow R² | ≥ 0.74 | 0.741 (Pearson²) | 0.745 (5-fold CV) | Match — and the PHON-73 path is zero-shot, no Glasgow in training |
| External Kuperman Pearson on Glasgow-unseen | ≥ 0.50 | 0.816 | 0.526 | Crushes by 0.29 absolute; reaches 91% of the published Kuperman inter-rater ceiling (0.893) |
| Coverage | ≥ 47K | (production) | (production) | Deterministic via the PHON-73 vocabulary scope |
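
The derived figures in the table follow directly from the pilot correlations; the short check below reproduces them. All inputs are copied from the text above, and the variable names are illustrative only. Note that a squared Pearson correlation equals R² only for a fitted linear map between the two variables, so the 0.741 figure is best read as comparable to, rather than identical with, the spike's cross-validated R².

```python
# Figures copied from the pilot runs and the gate table.
glasgow_pearson = 0.861
kuperman_pearson = 0.816
spike_kuperman_pearson = 0.526
kuperman_ceiling = 0.893  # published inter-rater ceiling quoted in the table

r_squared = glasgow_pearson ** 2
margin = kuperman_pearson - spike_kuperman_pearson
ceiling_fraction = kuperman_pearson / kuperman_ceiling

print(f"Glasgow Pearson^2: {r_squared:.3f}")                 # 0.741
print(f"Kuperman margin over spike: {margin:.2f}")           # 0.29
print(f"Fraction of Kuperman ceiling: {ceiling_fraction:.0%}")  # 91%

# Gate checks against the targets in the table.
print("Glasgow gate passed:", r_squared >= 0.74)
print("Kuperman gate passed:", kuperman_pearson >= 0.50)
```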
Scope expansion: Glasgow becomes purely eval-only
Originally PHON-71 left Glasgow as a hybrid (pipeline input AND eval oracle) because PHON-73 only replaced 4 of Glasgow's 9 fields (arousal/valence/concreteness/familiarity), and imageability + size had no replacement. PHON-115 takes the opportunity to retire imageability and size columns entirely — they fail the "do we need every norm, or just the big ones?" filter, and removing them cleanly purges Glasgow from the pipeline so the relocation to data/norms/_oracles/ is unambiguous. Glasgow ends PHON-115 as a pure validation oracle, matching how Brysbaert/Warriner are staged.
Net column count change in words.parquet: -3 (aoa_kuperman, imageability, and size are dropped; the AoA role passes to the aoa column, which the schema checklist below treats as already present, so no column is added). Same delta in the D1 schema and frontend.
Approach: PHON-73 pattern (existing codebase pattern, six datasets deep)
The PHON-73 family (concreteness, valence, arousal, familiarity, BOI, iconicity) replaced six license-encumbered behavioral norm datasets via gpt-4.1-mini cloze-prompt LLM ratings using a logprob-based expected-value estimator. AoA fits the same shape exactly. Six in-the-box components, two new components, zero new architecture:
- Vocabulary scope: CMU dict ∩ FineWeb-Edu frequency table, filtered to non-PROPN dominant POS. ~48K content words. (Reuses `load_cmu_words` + `load_content_freq_words` from `build_concreteness.py`.)
- Prompt: Glasgow-style 1-7 cloze with age-band anchors (1=0-2y, 7=13+y) and high/low example words drawn from Glasgow distribution extremes. Already validated in the pilot; see `research/2026-05-11-phon-115-aoa-pilot/run_pilot.py::AOA_PROMPT`.
- Estimator: single API call per word with `top_logprobs=20`, find the rating-token position, sum probability mass over the integers 1-7, compute expected value E[r] = Σ r · p(r). Produces a continuous value in [1, 7]. This is the `rate_one` function in `build_concreteness.py`, identical for AoA.
- Model: gpt-4.1-mini, parity with the rest of the PHON-73 family. Model name is a CLI flag (`--model`), not hardcoded; see `feedback_model_provider_flexibility.md` for the ethics-lens future-swap consideration.
- Validation oracles: Glasgow Norms (Scott et al. 2019, CC BY 4.0) as the primary oracle; Kuperman 2012 as cross-construct sanity. Both used eval-only, no training. Move `data/norms/kuperman_aoa.xlsx` to `data/norms/_oracles/kuperman_aoa.xlsx` (parity with how Brysbaert/Warriner are kept).
- Output: `data/norms/phonolex_aoa.tsv`, columns: word, aoa, cov_aoa. (Replaces the PHON-71 spike's TSV in place.)
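
The estimator step can be sketched as follows. This is an illustration under stated assumptions, not the `rate_one` implementation: the function name `expected_rating` is hypothetical, the input is assumed to be a token-to-logprob mapping already extracted at the rating-token position, tokens are assumed to need whitespace stripping, and the expected value is assumed to be renormalized by the covered mass so that it stays in [1, 7] even when coverage is below 1.

```python
import math

VALID_RATINGS = ("1", "2", "3", "4", "5", "6", "7")


def expected_rating(top_logprobs):
    """Return (expected value, coverage) from a token -> logprob mapping.

    Coverage is the probability mass that falls on the valid rating tokens;
    it is the kind of quantity a cov_aoa column would carry.
    """
    probs = {}
    for token, logprob in top_logprobs.items():
        key = token.strip()  # assumption: " 4" and "4" are the same rating
        if key in VALID_RATINGS:
            rating = int(key)
            probs[rating] = probs.get(rating, 0.0) + math.exp(logprob)

    coverage = sum(probs.values())
    if coverage == 0.0:
        return None, 0.0  # no valid rating token among the candidates

    # Assumption: renormalize by coverage so the result stays within [1, 7].
    expected = sum(r * p for r, p in probs.items()) / coverage
    return expected, coverage


if __name__ == "__main__":
    candidates = {
        "3": math.log(0.5),
        "4": math.log(0.25),
        " 4": math.log(0.15),
        "the": math.log(0.1),
    }
    value, cov = expected_rating(candidates)
    print(round(value, 3), round(cov, 3))
```

Returning coverage alongside the value makes it possible to flag words where the model put little mass on a valid rating, rather than silently trusting the estimate.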
License posture: identical to the rest of PHON-73. Glasgow stays as eval-only (CC BY 4.0 attribution preserved in NOTICE); the shipped column is PhonoLex-owned LLM output.
Completion criteria: ALL must be true to mark Done
Intentionally exhaustive because the prior cycle's "PROCEED to integration in a follow-up" pattern is what produced the current mess. The modeling side is already validated; this section is integration discipline.
- [ ] Build script: `research/2026-04-30-llm-word-features/build_aoa.py` written (lifted from `build_concreteness.py`); `harness.py::FEATURES["aoa"]` entry added with the validated prompt. Run on the full ~48K non-PROPN content vocabulary. Cost ~$5, runtime ~1.6h at concurrency=6 (PHON-73 family precedent).
- [ ] Validation report: `research/2026-04-30-llm-word-features/validate_aoa.py` written; full-build Spearman vs Glasgow on the in-vocab overlap reported and committed to `research/2026-05-11-phon-115-aoa-pilot/report.md` (or sibling). Must include: full-vocab Glasgow Spearman, Kuperman-Glasgow-unseen Pearson, coverage. Report must NOT end with "PROCEED to integration in a follow-up ticket"; integration happens in this same ticket.
- [ ] Artifact: `data/norms/phonolex_aoa.tsv` written, ≥47K rows.
- [ ] Loader rewired: `packages/data/src/phonolex_data/loaders/norms.py::load_kuperman` removed; new `load_phonolex_aoa` reads `phonolex_aoa.tsv`. `loaders/__init__.py` exports updated.
- [ ] Pipeline: `packages/data/src/phonolex_data/pipeline/words.py` drops the `("Kuperman", load_kuperman)` entry and the `"aoa_kuperman": "aoa_kuperman"` source-map line. New `load_phonolex_aoa` entry added, mapping to the `aoa` column.
- [ ] Schema: `packages/data/src/phonolex_data/pipeline/schema.py::WordRecord` fields `aoa_kuperman`, `imageability`, and `size` all removed. `pipeline/derived.py` percentile target list drops `aoa_kuperman`, `imageability`, and `size` (the `aoa` field remains).
- [ ] Pipeline source map: `packages/data/src/phonolex_data/pipeline/words.py` drops the `"imageability": "imageability"` and `"size": "size"` source-map lines (in addition to the Kuperman entry removal).
- [ ] Glasgow loader retirement: `load_glasgow` is no longer wired into any pipeline source map. Remove it from the `packages/data/src/phonolex_data/pipeline/words.py` import + tuple; consider also removing it from the `loaders/__init__.py` exports (implementer call: pilot scripts in `research/2026-05-11-phon-115-aoa-pilot/` have their own inline loader and don't depend on it).
- [ ] Property config: `packages/web/workers/scripts/config.py` PropertyDefs with `id="aoa_kuperman"`, `id="imageability"`, and `id="size"` all removed. The `id="aoa"` PropertyDef updated: source reads "PhonoLex-derived from Glasgow AoA validation oracle via gpt-4.1-mini cloze-prompt rating (PHON-115)". `packages/web/workers/src/config/properties.ts` updated in lockstep.
- [ ] Parquet regenerated: `uv run python packages/data/scripts/build_runtime_parquet.py` rerun; `data/runtime/words.parquet` has no `aoa_kuperman`, `imageability`, or `size` columns (and no corresponding `_percentile` columns). Verified via `pl.read_parquet(...).columns`.
- [ ] D1 seed regenerated: `packages/web/workers/scripts/d1-seed.sql` rebuilt; `aoa_kuperman`, `imageability`, `size`, and their `_percentile` siblings all absent. Verified via `grep -cE 'aoa_kuperman|imageability|\bsize\b' packages/web/workers/scripts/d1-seed.sql` returning 0 for the norm columns. The SQL may still legitimately reference `size` in unrelated contexts like file-size comments, so if the count is nonzero, rerun without `-c` and eyeball the matching lines.
- [ ] Frontend slider: `packages/web/frontend/src/components/tools/GovernedGenerationTool/PsycholinguisticsSection.tsx`: (a) BOUNDS list uses `norm: 'aoa'` (not `aoa_kuperman`); slider range/scale/label updated to match the 1-7 scale of the derived column. (b) Imageability slider entry removed entirely. (Size has no slider in this file, only schema/types, but check anyway.)
- [ ] Frontend types: `packages/web/frontend/src/types/phonology.ts` removes `aoa_kuperman`, `min_aoa_kuperman`, `max_aoa_kuperman`, `imageability`, `min_imageability`, `max_imageability`, `size`, `min_size`, `max_size`. Audit + update any other reference files (`WordProfileContext`, `WordListTable`, `ContrastiveGroupsTable`, `ExportMenu`).
- [ ] Source data files: both `data/norms/kuperman_aoa.xlsx` AND `data/norms/GlasgowNorms.xlsx` moved to `data/norms/_oracles/` (mirrors PHON-73 staging of Brysbaert/Warriner). Now unambiguous because Glasgow is no longer wired into any pipeline source map after `imageability` + `size` removal. If `load_glasgow` is retained for future eval scripts, update its default path to `_oracles/GlasgowNorms.xlsx`; if removed, no path update needed.
- [ ] Audit checklist: `docs/data-license-remediation-checklist.md` Kuperman row updated from 🟡 to 🟢 with the deletion commit hash.
- [ ] NOTICE / attribution: any `NOTICE` / `THIRD_PARTY_LICENSES` references to Kuperman et al. (2012) updated to "validation oracle only" framing. Glasgow Norms (Scott et al. 2019) already in NOTICE as the primary oracle; add a note that it also serves as the AoA validation oracle and the source of the prompt's example words for PHON-115, with no use in training (parity with how Brysbaert is documented for concreteness).
- [ ] Tests: every test referencing `aoa_kuperman`, `imageability`, or `size` (as norm columns, not as MUI props) updated. New regression test confirming `WordStore.from_parquet().df` does not have `aoa_kuperman`, `imageability`, or `size` columns.
- [ ] Browser smoke: open `phonolex.com` (or local dev), go to the Governed Generation tool, set a bound on the AoA slider, and confirm sentences come back filtered correctly.
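
The column checks in the parquet and regression-test items above could share one small helper. The sketch below is illustrative: `retired_columns_present` is a hypothetical name, and the `<name>_percentile` naming for the sibling columns is an assumption based on the `_percentile` suffix mentioned in the checklist.

```python
RETIRED = ("aoa_kuperman", "imageability", "size")


def retired_columns_present(columns):
    """Return retired norm columns (and their *_percentile siblings) in `columns`.

    Matching is on exact column names, so unrelated names that merely
    contain "size" are not flagged.
    """
    banned = set(RETIRED) | {f"{name}_percentile" for name in RETIRED}
    return sorted(c for c in columns if c in banned)


if __name__ == "__main__":
    # Stand-in for the column list of the regenerated parquet.
    example = ["word", "aoa", "aoa_percentile", "size", "size_percentile"]
    print(retired_columns_present(example))
```

Against the real artifact the same helper would take `pl.read_parquet("data/runtime/words.parquet").columns` as input, and the regression test would assert that the returned list is empty.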
What this ticket explicitly does NOT do
- Multitask LightGBM joining Glasgow + CHILDES/PhonBank age-band signals. Pilot showed zero-shot LLM rating matches PHON-71's trained-regression CV R² (0.741 vs 0.745) at lower cost and lower license complexity. The age-band columns from PHON-85/86/87 remain unconsumed for now; consuming them is a separate scope (file as PHON-XXX if it ever becomes load-bearing).
- Small-LM joint training. Same reasoning — pilot showed it's unnecessary.
- Wiring age-banded frequency directly into UI as separate sliders. Separate scope.
- Adding derived AoA to the reranker. Reranker is Qwen-cosine; doesn't consume tabular features.
- Touching any other 🟡 row in the audit checklist.
Anti-pattern guardrails (lessons from PHON-71)
- No "PROCEED to integration in a follow-up" clauses. If validation passes, integrate before closing the ticket.
- No fair-use re-justifications. If a license-encumbered dataset gets re-introduced, that requires explicit audit-checklist update and recorded legal review, not a commit message.
- No new columns without consumption. This ticket's `aoa` column directly replaces the surfaced `aoa_kuperman` slider; consumption is one-to-one. This ticket also REMOVES `imageability` and `size`; both fail the "do we need every norm, or just the big ones?" test and have been shipping inert (no clinical/research use case driving them).
- Done means deleted. `aoa_kuperman`, `imageability`, and `size` (as norm columns) cannot be present in any committed artifact at close-out. Verify: `grep -rE 'aoa_kuperman|imageability' packages/ data/runtime/` returns empty. `size` is left out of that pattern because it also matches MUI prop usage; grep for it separately, eyeball the output, then run a narrower check.
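
One way to express the "narrower check" for `size` is a line scanner that combines the broad alternation from the grep above with stricter patterns for `size`. This is a sketch only: the function name is hypothetical, and the `size` patterns (quoted key, `min_`/`max_` bounds, `_percentile` sibling) are guesses at how the norm column appears in source, to be adjusted against the real codebase.

```python
import re

# Unambiguous identifiers: same alternation as the grep in the guardrail.
BROAD = re.compile(r"aoa_kuperman|imageability")

# Hypothetical narrower patterns for `size` as a norm column.
NARROW_SIZE = re.compile(
    r"""["']size["']|\b(?:min|max)_size\b|\bsize_percentile\b"""
)


def norm_references(text):
    """Return (line_number, line) pairs that still mention a retired norm column."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if BROAD.search(line) or NARROW_SIZE.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = "\n".join([
        '<Button size="small" />',     # MUI prop: should be ignored
        'const b = { norm: "size" }',  # quoted norm key: flagged
        "max_size: number",            # bound field: flagged
        "imageability_percentile REAL",
        "font-size: 12px",             # CSS: should be ignored
    ])
    for number, line in norm_references(sample):
        print(number, line)
```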
Estimated scope
PHON-73 pattern, three sub-tasks:
- Build + validate (~2h): write build_aoa.py + validate_aoa.py, run on the full ~48K vocabulary, commit report. ~$5 OpenAI spend (PHON-73 family precedent).
- Pipeline + schema rewire (~2.5h): loader swap, pipeline source map (Kuperman + Glasgow entries removed), schema field rename/drops, percentile target list updates, property config (3 PropertyDefs removed).
- Parquet + D1 + frontend (~2.5h): regenerate artifacts, audit frontend references (Kuperman + imageability + size — three columns through types/sliders/profile views), run smoke.
Total: ~7h end-to-end, ~$5 in API spend. Slightly larger than a typical PHON-73 ticket due to the column-removal cleanup riding along — but the column removals are exactly the shape of the aoa_kuperman cleanup, so no new judgment calls.
Pilot artifacts (for reference)
- `research/2026-05-11-phon-115-aoa-pilot/run_pilot.py`: N=5,551 Glasgow regression
- `research/2026-05-11-phon-115-aoa-pilot/run_kuperman_sanity.py`: N=500 Glasgow-unseen Kuperman sanity check
- `research/2026-05-11-phon-115-aoa-pilot/pilot_results.tsv`: per-word LLM rating vs Glasgow
- `research/2026-05-11-phon-115-aoa-pilot/kuperman_sanity_results.tsv`: per-word LLM rating vs Kuperman
Caveats and open questions worth flagging
- Memorization concern (informational, not blocking). gpt-4.1-mini was almost certainly trained on web text that included both Glasgow and Kuperman norm tables. This is the same epistemic situation the rest of PHON-73 shipped under (concreteness validated vs Brysbaert, valence vs Warriner — both also likely in training data). The downstream gate is whether the column produces useful clinical/research behavior, not whether the validation oracle was unseen. We label honestly in NOTICE and ship.
- Model choice flexibility. If a future audit decides gpt-4.1-mini isn't appropriate (ethics, vendor lock-in, cost trajectory), the logprob-expected-value pattern works on any model exposing `top_logprobs` (Qwen3, DeepSeek, etc.). Keep `--model` as a CLI flag in the build script. See `memory/feedback_model_provider_flexibility.md`.