PHON-115 — AoA replacement via PHON-73 LLM-rating pattern

Status: SPEC (design ratified by pilot, 2026-05-11)
Owner: TBD (parallel to v5.2 ship)
Filed: 2026-05-11
Replaces: the draft 2026-05-11-phon-115-aoa-replacement-completion.md (multitask-LightGBM premise invalidated by pilot)
Closes: the unfiled "follow-up ticket" from the PHON-71 spike (fcf865c)

Why this exists

PHON-71's audit flagged aoa_kuperman as 🟡 license-encumbered. The PHON-71 spike (commit fcf865c) trained a Glasgow-derived LightGBM regression (5-fold CV R² = 0.745 ± 0.017), wrote data/norms/phonolex_aoa.tsv, and concluded "PROCEED to integration in a follow-up ticket." The ticket was never filed. Consequences:

  • aoa_kuperman still shipped in v5.2 (words.parquet, word_percentiles, D1 schema, frontend slider).
  • Commit 8b34daf (April 7) re-pointed the frontend slider TO Kuperman with the rationale "stop word exemption makes Kuperman safe" — a fair-use claim never reconciled with the audit.
  • PHON-85/86/87/88 (May 4) shipped age-banded CHILDES + PhonBank frequency (~98 columns) correlating Spearman -0.751 with Glasgow AoA, but unconsumed by any UI, constraint, or model.

This ticket finishes the work. No follow-up tickets allowed for "integration."

Pilot evidence — design decision is empirical, not architectural

A pilot at research/2026-05-11-phon-115-aoa-pilot/ validated the PHON-73 pattern for AoA before this spec was written. Two runs:

Glasgow regression (N=5,551, full Glasgow AoA-labeled vocabulary):

  • Spearman 0.868, Pearson 0.861, MAE 0.714 on the 1-7 scale
  • 0 failures, mean coverage 1.000 (full probability mass on valid 1-7 tokens)
  • 23m runtime at concurrency=6, cost ~$1.40

Kuperman cross-construct sanity (N=500 Glasgow-unseen Kuperman rows):

  • Spearman 0.832, Pearson 0.816
  • 0 failures, coverage 1.000
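
For readers who want to recompute these figures from a per-word results file, the three metrics can be sketched in plain Python. This is a minimal illustration, not the pilot's actual scoring code; the function names are hypothetical, and the pilot may well use a statistics library instead.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson computed on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def mae(pred, gold):
    """Mean absolute error between predictions and oracle values."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)
```

Average ranks matter here because oracle ratings on a 1-7 scale can tie, and ties handled naively would distort the rank correlation.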

Compared against the original draft spec's gates and the PHON-71 spike baseline:

| Gate | Target | Pilot result | PHON-71 spike | Notes |
|---|---|---|---|---|
| Held-out Glasgow R² | ≥ 0.74 | 0.741 (Pearson²) | 0.745 (5-fold CV) | Match, and the PHON-73 path is zero-shot, no Glasgow in training |
| External Kuperman Pearson on Glasgow-unseen | ≥ 0.50 | 0.816 | 0.526 | Crushes by 0.29 absolute; reaches 91% of the published Kuperman inter-rater ceiling (0.893) |
| Coverage | ≥ 47K | (production) | (production) | Deterministic via the PHON-73 vocabulary scope |
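
The derived figures in the table follow from the reported correlations by simple arithmetic, sketched below. The variable names are illustrative only. One caution: squared Pearson equals R² only for the best linear fit, so setting it beside a cross-validated R² is an approximate comparison rather than an exact one.

```python
# Values copied from the pilot results and the table above.
pilot_glasgow_pearson = 0.861
pilot_kuperman_pearson = 0.816
spike_kuperman_pearson = 0.526
kuperman_interrater_ceiling = 0.893

# Squared Pearson, used as the R² stand-in for the first gate.
glasgow_r2 = pilot_glasgow_pearson ** 2  # 0.741 to three decimals

# Absolute improvement over the PHON-71 spike on Glasgow-unseen Kuperman rows.
margin = pilot_kuperman_pearson - spike_kuperman_pearson  # 0.29 to two decimals

# Fraction of the published inter-rater ceiling reached by the pilot.
ceiling_fraction = pilot_kuperman_pearson / kuperman_interrater_ceiling  # about 0.91

passes_glasgow_gate = glasgow_r2 >= 0.74
passes_kuperman_gate = pilot_kuperman_pearson >= 0.50
```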

Scope expansion: Glasgow becomes purely eval-only

Originally PHON-71 left Glasgow as a hybrid (pipeline input AND eval oracle) because PHON-73 only replaced 4 of Glasgow's 9 fields (arousal/valence/concreteness/familiarity), and imageability + size had no replacement. PHON-115 takes the opportunity to retire imageability and size columns entirely — they fail the "do we need every norm, or just the big ones?" filter, and removing them cleanly purges Glasgow from the pipeline so the relocation to data/norms/_oracles/ is unambiguous. Glasgow ends PHON-115 as a pure validation oracle, matching how Brysbaert/Warriner are staged.

Net column count change in words.parquet: -3 (aoa_kuperman replaced by aoa so no net change there; imageability dropped; size dropped). Same delta in the D1 schema and frontend.

Approach — PHON-73 pattern (existing codebase pattern, six datasets deep)

The PHON-73 family (concreteness, valence, arousal, familiarity, BOI, iconicity) replaced six license-encumbered behavioral norm datasets via gpt-4.1-mini cloze-prompt LLM ratings using a logprob-based expected-value estimator. AoA fits the same shape exactly. Six in-the-box components, two new components, zero new architecture:

  1. Vocabulary scope — CMU dict ∩ FineWeb-Edu frequency table, filtered to non-PROPN dominant POS. ~48K content words. (Reuses load_cmu_words + load_content_freq_words from build_concreteness.py.)
  2. Prompt — Glasgow-style 1-7 cloze with age-band anchors (1=0-2y, 7=13+y) and high/low example words drawn from Glasgow distribution extremes. Already validated in the pilot — see research/2026-05-11-phon-115-aoa-pilot/run_pilot.py::AOA_PROMPT.
  3. Estimator — single API call per word with top_logprobs=20, find the rating-token position, sum probability mass over the integers 1-7, compute expected value E[r] = Σ r · p(r). Produces a continuous value in [1, 7]. This is the rate_one function in build_concreteness.py, identical for AoA.
  4. Model — gpt-4.1-mini, parity with the rest of the PHON-73 family. Model name is a CLI flag (--model), not hardcoded — see feedback_model_provider_flexibility.md for the ethics-lens future-swap consideration.
  5. Validation oracles — Glasgow Norms (Scott et al. 2019, CC BY 4.0) as the primary oracle; Kuperman 2012 as cross-construct sanity. Both used eval-only — no training. Move data/norms/kuperman_aoa.xlsx to data/norms/_oracles/kuperman_aoa.xlsx (parity with how Brysbaert/Warriner are kept).
  6. Output: data/norms/phonolex_aoa.tsv with columns word, aoa, cov_aoa. (Replaces the PHON-71 spike's TSV in place.)
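
The estimator in step 3 can be sketched as a pure function over the token-to-logprob map returned at the rating position. This is a minimal illustration, not the body of rate_one: the function name, the input shape, and the choice to renormalize by the covered mass (so the result stays in [1, 7] even when some probability lands on other tokens) are all assumptions.

```python
import math

VALID_RATINGS = ("1", "2", "3", "4", "5", "6", "7")


def expected_rating(top_logprobs):
    """Return (expected value, coverage) for one word.

    top_logprobs: dict mapping candidate token -> log probability at the
    rating-token position (hypothetical input shape).
    coverage: total probability mass on the tokens "1".."7".
    """
    mass = {}
    for token, logprob in top_logprobs.items():
        tok = token.strip()  # tolerate a leading space on the token
        if tok in VALID_RATINGS:
            rating = int(tok)
            mass[rating] = mass.get(rating, 0.0) + math.exp(logprob)

    coverage = sum(mass.values())
    if coverage == 0.0:
        # No valid rating token among the candidates: treat as a failure.
        return None, 0.0

    # E[r] = sum(r * p(r)), renormalized over the covered mass (assumption).
    expected = sum(r * p for r, p in mass.items()) / coverage
    return expected, coverage
```

Under this sketch, the two returned values would map onto the aoa and cov_aoa columns of the output TSV.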

License posture: identical to the rest of PHON-73. Glasgow stays as eval-only (CC BY 4.0 attribution preserved in NOTICE); the shipped column is PhonoLex-owned LLM output.

Completion criteria — ALL must be true to mark Done

Intentionally exhaustive because the prior cycle's "PROCEED to integration in a follow-up" pattern is what produced the current mess. The modeling side is already validated; this section is integration discipline.

  • [ ] Build script: research/2026-04-30-llm-word-features/build_aoa.py written (lifted from build_concreteness.py); harness.py::FEATURES["aoa"] entry added with the validated prompt. Run on the full ~48K non-PROPN content vocabulary. Cost ~$5, runtime ~1.6h at concurrency=6 (PHON-73 family precedent).
  • [ ] Validation report: research/2026-04-30-llm-word-features/validate_aoa.py written; full-build Spearman vs Glasgow on the in-vocab overlap reported and committed to research/2026-05-11-phon-115-aoa-pilot/report.md (or sibling). Must include: full-vocab Glasgow Spearman, Kuperman-Glasgow-unseen Pearson, coverage. Report must NOT end with "PROCEED to integration in a follow-up ticket" — integration happens in this same ticket.
  • [ ] Artifact: data/norms/phonolex_aoa.tsv written, ≥47K rows.
  • [ ] Loader rewired: packages/data/src/phonolex_data/loaders/norms.py::load_kuperman removed; new load_phonolex_aoa reads phonolex_aoa.tsv. loaders/__init__.py exports updated.
  • [ ] Pipeline: packages/data/src/phonolex_data/pipeline/words.py drops ("Kuperman", load_kuperman) entry and the "aoa_kuperman": "aoa_kuperman" source-map line. New load_phonolex_aoa entry added, mapping to the aoa column.
  • [ ] Schema: packages/data/src/phonolex_data/pipeline/schema.py::WordRecord fields aoa_kuperman, imageability, and size all removed. pipeline/derived.py percentile target list drops aoa_kuperman, imageability, and size (the aoa field remains).
  • [ ] Pipeline source map: packages/data/src/phonolex_data/pipeline/words.py drops "imageability": "imageability" and "size": "size" source-map lines (in addition to the Kuperman entry removal).
  • [ ] Glasgow loader retirement: load_glasgow is no longer wired into any pipeline source map. Remove it from packages/data/src/phonolex_data/pipeline/words.py import + tuple; consider also removing from loaders/__init__.py exports (implementer call — pilot scripts in research/2026-05-11-phon-115-aoa-pilot/ have their own inline loader and don't depend on it).
  • [ ] Property config: packages/web/workers/scripts/config.py PropertyDefs with id="aoa_kuperman", id="imageability", and id="size" all removed. The id="aoa" PropertyDef updated: source reads "PhonoLex-derived via gpt-4.1-mini cloze-prompt rating, validated against the Glasgow AoA oracle (PHON-115)". packages/web/workers/src/config/properties.ts updated in lockstep.
  • [ ] Parquet regenerated: uv run python packages/data/scripts/build_runtime_parquet.py rerun; data/runtime/words.parquet has no aoa_kuperman, imageability, or size columns (and no corresponding _percentile columns). Verified via pl.read_parquet(...).columns.
  • [ ] D1 seed regenerated: packages/web/workers/scripts/d1-seed.sql rebuilt; aoa_kuperman, imageability, size, and their _percentile siblings all absent. Verified via grep -cE 'aoa_kuperman|imageability|\bsize\b' packages/web/workers/scripts/d1-seed.sql returning 0 for the norm columns (the SQL may still legitimately reference size in unrelated contexts like file-size comments — eyeball the grep output).
  • [ ] Frontend slider: packages/web/frontend/src/components/tools/GovernedGenerationTool/PsycholinguisticsSection.tsx — (a) BOUNDS list uses norm: 'aoa' (not aoa_kuperman); slider range/scale/label updated to match the 1-7 scale of the derived column. (b) Imageability slider entry removed entirely. (Size has no slider in this file — only schema/types — but check anyway.)
  • [ ] Frontend types: packages/web/frontend/src/types/phonology.ts removes aoa_kuperman, min_aoa_kuperman, max_aoa_kuperman, imageability, min_imageability, max_imageability, size, min_size, max_size. Audit + update any other reference files (WordProfileContext, WordListTable, ContrastiveGroupsTable, ExportMenu).
  • [ ] Source data files: both data/norms/kuperman_aoa.xlsx AND data/norms/GlasgowNorms.xlsx moved to data/norms/_oracles/ (mirrors PHON-73 staging of Brysbaert/Warriner). Now unambiguous because Glasgow is no longer wired into any pipeline source map after imageability + size removal. If load_glasgow is retained for future eval scripts, update its default path to _oracles/GlasgowNorms.xlsx; if removed, no path update needed.
  • [ ] Audit checklist: docs/data-license-remediation-checklist.md Kuperman row updated from 🟡 to 🟢 with the deletion commit hash.
  • [ ] NOTICE / attribution: any NOTICE / THIRD_PARTY_LICENSES references to Kuperman et al. (2012) updated to "validation oracle only" framing. Glasgow Norms (Scott et al. 2019) already in NOTICE as the primary oracle; add a note that it also serves as the AoA validation oracle for PHON-115 and as the source of the prompt's high/low anchor examples, with no training on it (parity with how Brysbaert is documented for concreteness).
  • [ ] Tests: every test referencing aoa_kuperman, imageability, or size (as norm columns, not as MUI props) updated. New regression test confirming WordStore.from_parquet().df does not have aoa_kuperman, imageability, or size columns.
  • [ ] Browser smoke: open phonolex.com (or local dev) Governed Generation tool; set a bound on the AoA slider; confirm the returned sentences are filtered correctly.
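
The parquet and regression-test criteria above reduce to one column-name check, sketched here. The helper name is hypothetical, and the `<name>_percentile` suffix is an assumption about how the percentile siblings are named.

```python
REMOVED_NORMS = ("aoa_kuperman", "imageability", "size")


def leftover_norm_columns(columns):
    """Return removed norm columns (and *_percentile siblings) still present."""
    banned = set(REMOVED_NORMS) | {f"{name}_percentile" for name in REMOVED_NORMS}
    return sorted(c for c in columns if c in banned)


# Intended use against the regenerated artifact (requires polars, not run here):
#   import polars as pl
#   cols = pl.read_parquet("data/runtime/words.parquet").columns
#   assert leftover_norm_columns(cols) == []
```

Matching exact column names rather than substrings keeps the retained aoa and aoa_percentile columns from tripping the check.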

What this ticket explicitly does NOT do

  • Multitask LightGBM joining Glasgow + CHILDES/PhonBank age-band signals. Pilot showed zero-shot LLM rating matches PHON-71's trained-regression CV R² (0.741 vs 0.745) at lower cost and lower license complexity. The age-band columns from PHON-85/86/87 remain unconsumed for now; consuming them is a separate scope (file as PHON-XXX if it ever becomes load-bearing).
  • Small-LM joint training. Same reasoning — pilot showed it's unnecessary.
  • Wiring age-banded frequency directly into UI as separate sliders. Separate scope.
  • Adding derived AoA to the reranker. Reranker is Qwen-cosine; doesn't consume tabular features.
  • Touching any other 🟡 row in the audit checklist.

Anti-pattern guardrails (lessons from PHON-71)

  1. No "PROCEED to integration in a follow-up" clauses. If validation passes, integrate before closing the ticket.
  2. No fair-use re-justifications. If a license-encumbered dataset gets re-introduced, that requires explicit audit-checklist update and recorded legal review, not a commit message.
  3. No new columns without consumption. This ticket's aoa column directly replaces the surfaced aoa_kuperman slider; consumption is one-to-one. This ticket also REMOVES imageability and size — both fail the "do we need every norm, or just the big ones?" test and have been shipping inert (no clinical/research use case driving them).
  4. Done means deleted. aoa_kuperman, imageability, and size (as norm columns) cannot be present in any committed artifact at close-out. Verify: grep -rE 'aoa_kuperman|imageability' packages/ data/runtime/ returns empty (size will surface in MUI prop usage — eyeball the grep output, then run a narrower check).
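
One possible shape for the "narrower check" in guardrail 4 is a line scanner that flags `size` only when it looks like a norm column. The patterns below are assumptions about how norm references appear (a quoted key, a min_/max_ bound, or a percentile sibling). They deliberately skip MUI props such as size="small", and they will also miss unquoted keys, so this is a complement to eyeballing the grep output rather than a replacement for it.

```python
import re

# Hypothetical patterns for norm-column references.
NORM_REF = re.compile(
    r"aoa_kuperman|imageability"
    r"|\b(?:min|max)_size\b|\bsize_percentile\b"
    r"|[\"']size[\"']\s*:"
)


def norm_references(text):
    """Return (line_number, stripped_line) pairs mentioning a removed norm column."""
    return [
        (n, line.strip())
        for n, line in enumerate(text.splitlines(), start=1)
        if NORM_REF.search(line)
    ]
```

Running it over each file under packages/ and requiring an empty result would give a scriptable version of the close-out check.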

Estimated scope

PHON-73 pattern, three sub-tasks:

  • Build + validate (~2h): write build_aoa.py + validate_aoa.py, run on 47K, commit report. ~$5 OpenAI spend (PHON-73 family precedent).
  • Pipeline + schema rewire (~2.5h): loader swap, pipeline source map (Kuperman + Glasgow entries removed), schema field rename/drops, percentile target list updates, property config (3 PropertyDefs removed).
  • Parquet + D1 + frontend (~2.5h): regenerate artifacts, audit frontend references (Kuperman + imageability + size: three columns through types/sliders/profile views), run smoke.

Total: ~7h end-to-end, ~$5 in API spend. Slightly larger than a typical PHON-73 ticket due to the column-removal cleanup riding along — but the column removals are exactly the shape of the aoa_kuperman cleanup, so no new judgment calls.

Pilot artifacts (for reference)

  • research/2026-05-11-phon-115-aoa-pilot/run_pilot.py — N=5,551 Glasgow regression
  • research/2026-05-11-phon-115-aoa-pilot/run_kuperman_sanity.py — N=500 Glasgow-unseen Kuperman sanity check
  • research/2026-05-11-phon-115-aoa-pilot/pilot_results.tsv — per-word LLM rating vs Glasgow
  • research/2026-05-11-phon-115-aoa-pilot/kuperman_sanity_results.tsv — per-word LLM rating vs Kuperman

Caveats and open questions worth flagging

  • Memorization concern (informational, not blocking). gpt-4.1-mini was almost certainly trained on web text that included both Glasgow and Kuperman norm tables. This is the same epistemic situation the rest of PHON-73 shipped under (concreteness validated vs Brysbaert, valence vs Warriner — both also likely in training data). The downstream gate is whether the column produces useful clinical/research behavior, not whether the validation oracle was unseen. We label honestly in NOTICE and ship.
  • Model choice flexibility. If a future audit decides gpt-4.1-mini isn't appropriate (ethics, vendor lock-in, cost trajectory), the logprob-expected-value pattern works on any model exposing top_logprobs — Qwen3, DeepSeek, etc. Keep --model as a CLI flag in the build script. See memory/feedback_model_provider_flexibility.md.
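
Keeping the model swappable could look like the following argparse sketch. Only the --model flag comes from this spec; --concurrency and --top-logprobs are hypothetical flags added for illustration, with defaults taken from the values quoted above.

```python
import argparse


def build_parser():
    """CLI sketch for a build script with a swappable rating model."""
    parser = argparse.ArgumentParser(
        description="Rate age of acquisition for a vocabulary via LLM logprobs."
    )
    parser.add_argument(
        "--model",
        default="gpt-4.1-mini",
        help="Any model whose API exposes top_logprobs.",
    )
    # Hypothetical flags, not named in the spec.
    parser.add_argument("--concurrency", type=int, default=6)
    parser.add_argument("--top-logprobs", type=int, default=20)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"model={args.model} concurrency={args.concurrency}")
```

Because the estimator needs only a token-to-logprob map, swapping providers should touch just the API call, not the rating logic.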