PHON-83 Iconicity LLM-Rating Replacement Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Replace the unlicensed Winter et al. (2024) iconicity ratings with a PhonoLex-original LLM-rating column derived via cloze-prompted gpt-4.1-mini, anchored on form (IPA) + meaning (top-3 WordNet sense glosses).

Architecture: Mirrors the PHON-82 BOI replacement track exactly. Extend the FEATURES dict in research/2026-04-30-llm-word-features/harness.py with an iconicity FeatureSpec whose prompt_template uses three context slots ({word}, {ipa}, {glosses}), and add small top_n_glosses(word) and word_ipa(word) helpers in the same research dir. Ship a validation script targeting Spearman ≥ 0.60 on a held-out Winter 2024 sample, then a production build over the same CMU ∩ FineWeb-Edu non-PROPN content vocabulary used for BOI. Finally, re-introduce the iconicity field into the data pipeline, Workers config, and frontend properties (the field was stripped during PHON-71 closeout).

Tech Stack: Python 3.11+, OpenAI Python SDK (gpt-4.1-mini), NLTK WordNet (research-time only — not a runtime data-package dep), SciPy (Spearman/Pearson in the validation steps), existing phonolex_data IPA helpers, csv stdlib for TSV I/O. Frontend/backend integration touches packages/data/, packages/web/workers/scripts/, and packages/web/workers/src/config/.

Design decisions baked in

Prompt context — extend FEATURES["iconicity"].prompt_template to use {word} + {ipa} + {glosses} slots; backward compatibility preserved because existing single-slot prompts are unaffected by format(**ctx) with extra keys.
WordNet sense selection — top-3 senses across all POS, ranked by total lemma.count() summed within each synset, joined as " | "-separated short glosses. Frequency tie-breaking: synset offset ascending.
Out-of-WordNet fallback — when no WordNet entry exists, drop the entire meaning clause from the prompt ({glosses} renders as an empty string; form-only fallback). Track via the wn_coverage column (0/1) in the validation CSV; in the production TSV record cov_glosses ∈ {0.0, 1.0} so downstream can audit.
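IPA source — convert ARPAbet from cmudict_to_phono() to IPA via phonolex_data.phonology.normalize.to_ipa, then concatenate without separators (matches the IPA the user already sees on word pages). No stress digits (0, 1, or 2) appear in the output; a digit is consulted only to pick a stress-specific vowel mapping where the table has one.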
Vocabulary scope — same as PHON-82 BOI: load_cmu_words() ∩ load_content_freq_words() (non-PROPN content from data/norms/phonolex_frequency.tsv), ~48K words.
Validation target — Spearman ≥ 0.60 vs Winter 2024 ratings on a held-out N=500 sample (per ticket).
No distillation — gpt-4.1-mini one-shot at ~$5 total spend; distillation overhead not justified at this scale.
Oracle already moved — data/norms/_oracles/iconicity_winter2024.csv exists (14,776 rows; columns word, n_ratings, n, prop_known, rating, rating_sd). iconicity_ratings.csv was stripped during PHON-71 closeout.
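The backward-compatibility claim in the prompt-context decision rests on a documented property of `str.format`: unused keyword arguments are ignored, while a missing one raises `KeyError`. A minimal sketch with made-up templates (neither is the real harness prompt):

```python
# Hypothetical templates for illustration; the real ones live in harness.py.
single_slot = "Rate the concreteness of: {word}"
three_slot = 'The word is: "{word}", pronounced /{ipa}/.{glosses}'

ctx = {"ipa": "bʌz", "glosses": " Its primary meaning(s): sound of rapid vibration."}

# Extra keys are silently ignored, so older single-slot prompts still render.
legacy_prompt = single_slot.format(word="buzz", **ctx)

# The three-slot prompt consumes all of them.
iconicity_prompt = three_slot.format(word="buzz", **ctx)

# The reverse does not hold: omitting a slot the template needs raises KeyError.
try:
    three_slot.format(word="buzz")
    missing_slot_error = None
except KeyError as exc:
    missing_slot_error = exc.args[0]

print(legacy_prompt)
print(iconicity_prompt)
print(missing_slot_error)
```

One consequence for callers: `word` must not also appear inside the context dict, because `format(word=word, **context)` would then raise `TypeError` for a duplicate keyword.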

File map

Create:
- research/2026-04-30-llm-word-features/wordnet_glosses.py — top-N WordNet gloss helper + nltk.download('wordnet') bootstrap
- research/2026-04-30-llm-word-features/word_ipa.py — CMU→IPA helper (single-pronunciation primary lookup)
- research/2026-04-30-llm-word-features/validate_iconicity.py — N=500 held-out Spearman/Pearson against Winter 2024
- research/2026-04-30-llm-word-features/build_iconicity.py — production async build over CMU ∩ non-PROPN-content vocab
- packages/data/src/phonolex_data/loaders/phonolex_iconicity.py — TSV loader, filter_words support, mirrors phonolex_boi.py
- data/norms/phonolex_iconicity.tsv — produced by build script (gitignored if seed pattern matches)

Modify:
- research/2026-04-30-llm-word-features/harness.py — add FEATURES["iconicity"] entry; widen WordFeatureRater.rate() to accept **context for prompt formatting
- research/2026-04-30-llm-word-features/harness_openai.py — same widening for WordFeatureRaterOpenAI.rate()
- packages/data/src/phonolex_data/loaders/__init__.py — export load_phonolex_iconicity
- packages/data/src/phonolex_data/pipeline/words.py — re-add "iconicity": "iconicity" to _NORM_FIELD_MAP; add ("PhonoLex Iconicity", lambda s=cmu_word_set: load_phonolex_iconicity(filter_words=s)) to norm_loaders
- packages/data/src/phonolex_data/pipeline/schema.py — re-add iconicity: float | None = None to WordRecord
- packages/web/workers/scripts/config.py — re-add PropertyDef(id="iconicity", ...) under COGNITIVE_PROPERTIES
- packages/web/workers/src/config/properties.ts — re-add iconicity property dict
- packages/data/tests/test_new_loaders.py — add test_load_phonolex_iconicity
- NOTICE — add iconicity to "Replaced 2026-05-02" entries
- docs/data-license-remediation-checklist.md — toggle Iconicity 🟡→🟢

Task 1: Add WordNet gloss + IPA helpers

Files:
- Create: research/2026-04-30-llm-word-features/wordnet_glosses.py
- Create: research/2026-04-30-llm-word-features/word_ipa.py

[ ] Step 1: Install NLTK + download WordNet corpus

uv pip install nltk
python -c "import nltk; nltk.download('wordnet')"

Expected: WordNet downloaded to ~/nltk_data/corpora/wordnet/. Idempotent — skip if cached.

[ ] Step 2: Write wordnet_glosses.py

"""Top-N WordNet sense gloss helper for the PHON-83 iconicity prompt.

Returns the most-frequent senses across all POS as a single joined gloss
string suitable for slotting into the LLM rating prompt. Frequency comes
from summed lemma.count() within each synset (the Brown-corpus tagging
counts WordNet ships with). Ties broken by synset offset ascending.

Out-of-WordNet words return None — the build script falls back to a
form-only prompt and records cov_glosses=0.
"""
from __future__ import annotations

from functools import lru_cache

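import nltk
from nltk.corpus import wordnet as wn

# Bootstrap: fetch the WordNet corpus once if it is not already cached locally.
try:
    wn.ensure_loaded()
except LookupError:
    nltk.download("wordnet", quiet=True)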


def _synset_count(syn) -> int:
    return sum(lemma.count() for lemma in syn.lemmas())


@lru_cache(maxsize=200_000)
def top_n_glosses(word: str, n: int = 3, sep: str = " | ") -> str | None:
    """Return up to N WordNet sense glosses for `word`, frequency-ordered.

    Returns None when WordNet has no synsets for `word`.
    """
    syns = wn.synsets(word.lower())
    if not syns:
        return None
    syns_sorted = sorted(
        syns,
        key=lambda s: (-_synset_count(s), s.offset()),
    )
    glosses = [s.definition() for s in syns_sorted[:n]]
    return sep.join(glosses)


if __name__ == "__main__":
    import sys
    for w in sys.argv[1:] or ["bank", "buzz", "freedom", "zigzag", "qwerty"]:
        print(f"{w}\t{top_n_glosses(w)}")

[ ] Step 3: Write word_ipa.py

"""ARPAbet → IPA helper for the PHON-83 iconicity prompt.

Pulls the primary CMU pronunciation per word and converts each ARPAbet
phoneme to IPA via phonolex_data utilities, looking up the stress-marked
symbol first and the stress-stripped symbol second. Returns the
concatenated IPA string with no stress digits and no separators (matches
what appears on PhonoLex word pages).
"""
from __future__ import annotations

from functools import lru_cache

from phonolex_data.loaders.cmudict import load_cmudict
from phonolex_data.mappings import load_arpa_to_ipa


@lru_cache(maxsize=1)
def _cmu_index() -> dict[str, list[str]]:
    """word -> first ARPAbet pronunciation (with stress digits).

    NB: load_cmudict() return shape is dict[str, list[list[str]]] — values are
    lists of pronunciations, each a list of ARPAbet symbols. Take prons[0].
    """
    raw = load_cmudict()
    out: dict[str, list[str]] = {}
    for word, prons in raw.items():
        if prons:
            out[word] = prons[0]
    return out


@lru_cache(maxsize=1)
def _arpa_map() -> dict[str, str]:
    return load_arpa_to_ipa()


def word_ipa(word: str) -> str | None:
    """Return the primary IPA pronunciation of `word` or None if not in CMU.

    Mirrors phonolex_data.loaders.cmudict._convert_arpa_variant: try the
    stressed key first (e.g. "AH1" → "ʌ"), fall back to the unstressed key
    (e.g. "AH" → "ə"). Preserves the stress distinctions visible on
    PhonoLex word pages.
    """
    arpa = _cmu_index().get(word.lower())
    if not arpa:
        return None
    m = _arpa_map()
    return "".join(m.get(p) or m.get(p.rstrip("012"), p) for p in arpa)


if __name__ == "__main__":
    import sys
    for w in sys.argv[1:] or ["bank", "buzz", "phone", "zigzag", "qwerty"]:
        print(f"{w}\t{word_ipa(w)}")

[ ] Step 4: Smoke-test both helpers

cd research/2026-04-30-llm-word-features
python wordnet_glosses.py bank buzz freedom zigzag qwerty
python word_ipa.py bank buzz phone zigzag qwerty

Expected:
- wordnet_glosses.py: bank\tsloping land... | ...financial institution... | ... (3 glosses joined by " | ")
- wordnet_glosses.py: qwerty\tNone (not in WordNet)
- word_ipa.py: phone\tfoʊn (or similar; primary CMU pronunciation)
- word_ipa.py: qwerty\tNone (not in CMU)
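The two helpers lean on two small ordering rules: senses are ranked by descending summed lemma count with ascending offset as the tie-breaker, and each ARPAbet symbol is looked up with its stress digit first and without it second. The sketch below reproduces both rules on hand-made data so they can be checked without NLTK or the CMU dictionary. The `FakeSynset` type, the counts, offsets, glosses, and the four-entry symbol map are all invented for illustration and are not real WordNet or PhonoLex values.

```python
from __future__ import annotations

from typing import NamedTuple


class FakeSynset(NamedTuple):
    """Stand-in for an NLTK synset: only the fields the ranking needs."""
    offset: int
    count: int  # summed lemma.count() over the synset's lemmas
    gloss: str


def rank_glosses(synsets: list[FakeSynset], n: int = 3, sep: str = " | ") -> str | None:
    """Order by count descending, then offset ascending, and join the top n."""
    if not synsets:
        return None
    ordered = sorted(synsets, key=lambda s: (-s.count, s.offset))
    return sep.join(s.gloss for s in ordered[:n])


def arpa_to_ipa(phones: list[str], mapping: dict[str, str]) -> str:
    """Try the stressed key first, then the stress-stripped key, else pass through."""
    return "".join(mapping.get(p) or mapping.get(p.rstrip("012"), p) for p in phones)


# Invented data: two senses tie on count, so the lower offset wins the tie.
senses = [
    FakeSynset(offset=300, count=5, gloss="sense C"),
    FakeSynset(offset=100, count=5, gloss="sense A"),
    FakeSynset(offset=200, count=9, gloss="sense B"),
    FakeSynset(offset=400, count=0, gloss="sense D"),
]
toy_map = {"AH1": "ʌ", "AH": "ə", "B": "b", "Z": "z"}

ranked = rank_glosses(senses)
stressed = arpa_to_ipa(["B", "AH1", "Z"], toy_map)
unstressed = arpa_to_ipa(["B", "AH0", "Z"], toy_map)
print(ranked, stressed, unstressed)
```

Because many WordNet synsets carry a count of zero, the offset tie-breaker is what decides the order for rarer words, which is worth keeping in mind when reading the glosses in a rendered prompt.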

[ ] Step 5: Commit

git add research/2026-04-30-llm-word-features/wordnet_glosses.py research/2026-04-30-llm-word-features/word_ipa.py
git commit -m "PHON-83: WordNet gloss + ARPAbet→IPA helpers for iconicity prompt"

Task 2: Extend harness with iconicity FeatureSpec

Files:
- Modify: research/2026-04-30-llm-word-features/harness.py:103-117 (add iconicity entry to FEATURES)
- Modify: research/2026-04-30-llm-word-features/harness.py:226-259 (WordFeatureRater.rate accepts **context)
- Modify: research/2026-04-30-llm-word-features/harness_openai.py:63-124 (WordFeatureRaterOpenAI.rate accepts **context)

[ ] Step 1: Add iconicity FeatureSpec

Append after the boi block in the FEATURES dict:

    # Iconicity — Winter, Lupyan, Perry, Dingemanse & Perlman (2024)
    # 14K-word ratings on a 1-7 scale. 1 = arbitrary form-meaning relationship;
    # 7 = form (sound) strongly resembles meaning. PHON-83 replacement of the
    # unlicensed Winter 2024 supplementary file.
    #
    # Prompt slots:
    #   {word}    — orthographic form
    #   {ipa}     — IPA pronunciation (no stress, no separators); empty string
    #               for words missing from CMU
    #   {glosses} — leading "Its primary meaning(s): <gloss1 | gloss2 | gloss3>."
    #               clause, OR empty string when WordNet has no entry (form-
    #               only fallback). Build script renders this slot.
    #
    # Anchor design: high-iconicity words mix sound-mimics ("buzz", "crash")
    # and motion-mimics ("zigzag"); low-iconicity words mix concrete arbitrary
    # ("table") and abstract arbitrary ("concept"). Direct paraphrase of the
    # Winter 2024 instructions.
    "iconicity": FeatureSpec(
        name="iconicity",
        scale_min=1, scale_max=7,
        prompt_template=(
            "Could you rate the iconicity of the following word on a scale "
            "from 1 to 7, where 1 means the relationship between the word's "
            "form (its sound) and its meaning is completely arbitrary, and 7 "
            "means the word's sound strongly resembles its meaning. Examples "
            "of words that would receive a rating of 7 are buzz, crash and "
            "zigzag (the sound or articulation mimics the meaning). Examples "
            "of words that would receive a rating of 1 are table, concept "
            "and reason (no resemblance between sound and meaning). Consider "
            "how the sounds in the word relate to its meaning. The word is: "
            "\"{word}\", pronounced /{ipa}/.{glosses} Reply with only a "
            "number from 1 to 7. Limit your response to numbers."
        ),
    ),

[ ] Step 2: Widen WordFeatureRater.rate() to accept context

Change the signature and body in harness.py:

    @torch.no_grad()
    def rate(self, word: str, feature: str, **context) -> RatingResult:
        spec = FEATURES[feature]
        prompt = spec.prompt_template.format(word=word, **context)
        input_ids = self._build_input_ids(prompt)
        # ... rest unchanged

[ ] Step 3: Widen WordFeatureRaterOpenAI.rate() to accept context

Change the signature and body in harness_openai.py:

    def rate(self, word: str, feature: str, **context) -> RatingResult:
        spec = FEATURES[feature]
        prompt = spec.prompt_template.format(word=word, **context)
        # ... rest unchanged

[ ] Step 4: Smoke-test the new prompt assembles correctly

cd research/2026-04-30-llm-word-features
python -c "
from harness import FEATURES
from wordnet_glosses import top_n_glosses
from word_ipa import word_ipa
spec = FEATURES['iconicity']
for w in ['buzz', 'table', 'phone', 'qwerty']:
    g = top_n_glosses(w)
    glosses = f' Its primary meaning(s): {g}.' if g else ''
    ipa = word_ipa(w) or ''
    print('---', w)
    print(spec.prompt_template.format(word=w, ipa=ipa, glosses=glosses))
"

Expected: four well-formed prompts, each containing the IPA where available. The qwerty prompt has no meaning clause and, because qwerty is not in CMU, an empty IPA slot (rendered as //).
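A few mechanical checks catch the most likely rendering defects before any API spend. The sketch below is illustrative only: `prompt_problems` is a hypothetical helper, and the sample prompts are shortened stand-ins for the real template.

```python
from __future__ import annotations

import re


def prompt_problems(prompt: str) -> list[str]:
    """Flag common rendering defects in an assembled rating prompt."""
    problems: list[str] = []
    # A surviving {slot} means format() was never applied or a key was misspelled.
    if re.search(r"\{[a-z_]+\}", prompt):
        problems.append("unfilled slot")
    # An empty IPA slot renders as two adjacent slashes.
    if "//" in prompt:
        problems.append("empty IPA between slashes")
    # A None gloss interpolated into the clause instead of dropping the clause.
    if "meaning(s): None" in prompt or "meaning(s): ." in prompt:
        problems.append("empty or None gloss clause")
    return problems


# Shortened stand-ins for rendered prompts.
ok = 'The word is: "buzz", pronounced /bʌz/. Its primary meaning(s): sound of rapid vibration.'
no_ipa = 'The word is: "qwerty", pronounced //. Reply with only a number from 1 to 7.'
unfilled = 'The word is: "{word}", pronounced /{ipa}/.{glosses}'

for sample in (ok, no_ipa, unfilled):
    print(prompt_problems(sample))
```

The empty-IPA case should not arise in the production build, since its vocabulary is intersected with CMU, but the check is cheap insurance for ad hoc word lists.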

[ ] Step 5: Commit

git add research/2026-04-30-llm-word-features/harness.py research/2026-04-30-llm-word-features/harness_openai.py
git commit -m "PHON-83: add iconicity FeatureSpec; harness rate() accepts prompt context"

Task 3: Validation script

Files:
- Create: research/2026-04-30-llm-word-features/validate_iconicity.py

[ ] Step 1: Write validate_iconicity.py — adapted from validate_boi.py with the new prompt-context wiring

"""Validate the iconicity prompt against Winter et al. 2024 ratings on a
held-out random subset.

Oracle file at data/norms/_oracles/iconicity_winter2024.csv (gitignored).
Quality target: Spearman >= 0.60 against Winter 2024 ratings on N=500
held-out words. If hit, proceed to production build_iconicity.py.

Usage:
    python validate_iconicity.py --n 500 --out validation_iconicity.csv
"""
from __future__ import annotations

import argparse
import asyncio
import csv
import math
import os
import random
import sys
import time
from pathlib import Path

from openai import AsyncOpenAI

sys.path.insert(0, str(Path(__file__).parent))
from harness import FEATURES
from harness_openai import _load_dotenv
from wordnet_glosses import top_n_glosses
from word_ipa import word_ipa
_load_dotenv()


REPO = Path(__file__).resolve().parents[2]
ORACLE_PATH = REPO / "data" / "norms" / "_oracles" / "iconicity_winter2024.csv"


def load_winter_oracle() -> dict[str, float]:
    out: dict[str, float] = {}
    with open(ORACLE_PATH, encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                out[row["word"].strip().lower()] = float(row["rating"])
            except (ValueError, KeyError):
                continue
    return out


def render_prompt(word: str) -> tuple[str, bool]:
    """Return (prompt, wn_coverage). wn_coverage=False means form-only."""
    spec = FEATURES["iconicity"]
    g = top_n_glosses(word)
    ipa = word_ipa(word) or ""
    if g:
        glosses = f" Its primary meaning(s): {g}."
        wn_cov = True
    else:
        glosses = ""
        wn_cov = False
    return spec.prompt_template.format(word=word, ipa=ipa, glosses=glosses), wn_cov


async def rate_one(client: AsyncOpenAI, model: str, word: str,
                   sem: asyncio.Semaphore, retries: int = 3) -> tuple[float, bool]:
    spec = FEATURES["iconicity"]
    prompt, wn_cov = render_prompt(word)
    delay = 1.0
    for _attempt in range(retries):
        async with sem:
            try:
                resp = await client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0,
                    max_completion_tokens=4,
                    logprobs=True,
                    top_logprobs=20,
                )
            except Exception:
                await asyncio.sleep(delay)
                delay *= 2
                continue
        lp = resp.choices[0].logprobs
        if not lp or not lp.content:
            return float("nan"), wn_cov
        target_pos = None
        for i, pos in enumerate(lp.content):
            try:
                r = int(pos.token.strip())
                if spec.scale_min <= r <= spec.scale_max:
                    target_pos = i
                    break
            except (ValueError, TypeError):
                continue
        if target_pos is None:
            target_pos = 0
        pos = lp.content[target_pos]
        rating_p: dict[int, float] = {r: 0.0 for r in range(spec.scale_min, spec.scale_max + 1)}
        for entry in pos.top_logprobs:
            try:
                r = int(entry.token.strip())
            except (ValueError, TypeError):
                continue
            if spec.scale_min <= r <= spec.scale_max:
                rating_p[r] += math.exp(entry.logprob)
        total = sum(rating_p.values())
        if total <= 0:
            return float("nan"), wn_cov
        return sum(r * (p / total) for r, p in rating_p.items()), wn_cov
    return float("nan"), wn_cov


async def amain(args) -> int:
    winter = load_winter_oracle()
    # Restrict oracle to CMU words so IPA is always available
    from phonolex_data.loaders.cmudict import load_cmudict
    cmu = set(load_cmudict().keys())
    overlap = sorted(set(winter.keys()) & cmu)
    print(f"[oracle] Winter words: {len(winter):,}; CMU∩Winter: {len(overlap):,}")

    random.seed(args.seed)
    sample = random.sample(overlap, min(args.n, len(overlap)))
    print(f"[sample] N={len(sample)} (seed={args.seed})")

    client = AsyncOpenAI()
    sem = asyncio.Semaphore(args.concurrency)
    rows: list[dict] = []
    t0 = time.time()
    chunk = 50
    for i in range(0, len(sample), chunk):
        batch = sample[i : i + chunk]
        results = await asyncio.gather(
            *(rate_one(client, args.model, w, sem) for w in batch)
        )
        for w, (ev, wn_cov) in zip(batch, results):
            rows.append({
                "word": w, "llm_iconicity": ev,
                "winter_iconicity": winter[w], "wn_coverage": int(wn_cov),
            })
        elapsed = time.time() - t0
        rate = (i + len(batch)) / elapsed
        print(f"  {i+len(batch):>4d}/{len(sample)}  rate={rate:.2f} w/s", flush=True)

    out_path = Path(args.out)
    fieldnames = ["word", "llm_iconicity", "winter_iconicity", "wn_coverage"]
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        wr = csv.DictWriter(f, fieldnames=fieldnames)
        wr.writeheader()
        for r in rows:
            wr.writerow(r)
    print(f"[write] {out_path}")

    valid = [r for r in rows if not (isinstance(r["llm_iconicity"], float) and math.isnan(r["llm_iconicity"]))]
    if not valid:
        print("[result] no valid ratings — bailing")
        return 1
    from scipy import stats
    x = [r["llm_iconicity"] for r in valid]
    y = [r["winter_iconicity"] for r in valid]
    rs, _ = stats.spearmanr(x, y)
    rp, _ = stats.pearsonr(x, y)
    n_wn = sum(r["wn_coverage"] for r in valid)
    print(f"\n[result] N_valid={len(valid)}/{len(rows)}  WN-covered={n_wn} ({100*n_wn/len(valid):.1f}%)")
    print(f"  Spearman vs Winter: {rs:+.4f}")
    print(f"  Pearson  vs Winter: {rp:+.4f}")
    print(f"  llm range:    {min(x):.2f} - {max(x):.2f}")
    print(f"  winter range: {min(y):.2f} - {max(y):.2f}")
    target = 0.60
    print(f"\n  Target (Spearman ≥ {target}): {'✓ PASS' if rs >= target else '✗ BELOW TARGET'}")

    # Stratified breakout: WN-covered vs form-only
    if n_wn > 0 and n_wn < len(valid):
        wn_x = [r["llm_iconicity"] for r in valid if r["wn_coverage"]]
        wn_y = [r["winter_iconicity"] for r in valid if r["wn_coverage"]]
        nw_x = [r["llm_iconicity"] for r in valid if not r["wn_coverage"]]
        nw_y = [r["winter_iconicity"] for r in valid if not r["wn_coverage"]]
        rs_wn, _ = stats.spearmanr(wn_x, wn_y)
        rs_nw, _ = stats.spearmanr(nw_x, nw_y) if len(nw_x) > 2 else (float("nan"), None)
        print(f"\n  Stratified Spearman:")
        print(f"    WN-covered  (N={n_wn}):  {rs_wn:+.4f}")
        print(f"    form-only   (N={len(valid)-n_wn}): {rs_nw:+.4f}")
    return 0


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--model", default="gpt-4.1-mini")
    p.add_argument("--n", type=int, default=500)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--concurrency", type=int, default=6)
    p.add_argument("--out", default="validation_iconicity.csv")
    args = p.parse_args()
    if not os.environ.get("OPENAI_API_KEY"):
        print("ERROR: OPENAI_API_KEY not set", file=sys.stderr)
        return 1
    return asyncio.run(amain(args))


if __name__ == "__main__":
    raise SystemExit(main())

[ ] Step 2: Run validation against Winter 2024 N=500

cd research/2026-04-30-llm-word-features
python validate_iconicity.py --n 500 --out validation_iconicity.csv

Expected (~3-5 minutes runtime, ~$0.05 spend):
- Spearman ≥ 0.60 against Winter 2024 → ✓ PASS
- WN-covered slice should match or exceed form-only slice (sanity check that the gloss helps)

If the Spearman lands below 0.60, stop and revisit the prompt design (anchor wording, gloss formatting). Do NOT proceed to production until the target is hit.
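The rating returned by `rate_one` is an expected value over the digit tokens in `top_logprobs`, renormalised to the probability mass that falls on valid ratings. Isolating that arithmetic as a pure function makes it testable without an API call. The function name and the example log-probabilities below are illustrative assumptions, not part of the harness.

```python
from __future__ import annotations

import math


def expected_rating(
    top_logprobs: list[tuple[str, float]],
    scale_min: int = 1,
    scale_max: int = 7,
) -> tuple[float, float]:
    """Return (expected value, probability mass on valid ratings).

    `top_logprobs` holds (token, logprob) pairs for one output position.
    Tokens that are not integers inside the scale are ignored.
    """
    mass = {r: 0.0 for r in range(scale_min, scale_max + 1)}
    for token, logprob in top_logprobs:
        try:
            r = int(token.strip())
        except ValueError:
            continue
        if scale_min <= r <= scale_max:
            mass[r] += math.exp(logprob)
    total = sum(mass.values())
    if total <= 0:
        return float("nan"), 0.0
    ev = sum(r * p / total for r, p in mass.items())
    return ev, total


# Made-up distribution: 60% on "6", 30% on "7", 5% on "5", 5% on a non-digit token.
example = [
    ("6", math.log(0.60)),
    ("7", math.log(0.30)),
    ("5", math.log(0.05)),
    ("The", math.log(0.05)),
]
ev, coverage = expected_rating(example)
print(f"ev={ev:.3f} coverage={coverage:.2f}")
```

The second return value is the share of probability the model placed on in-scale digits. A low value signals that the model wanted to answer with something other than a rating, so the expected value for that word deserves less trust.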

[ ] Step 3: Commit validation script + result

git add research/2026-04-30-llm-word-features/validate_iconicity.py research/2026-04-30-llm-word-features/validation_iconicity.csv
git commit -m "PHON-83: validate iconicity prompt — Spearman <X.XX> vs Winter 2024 (N=500)"

(Replace <X.XX> with the actual measured Spearman.)

Task 4: Production build script

Files:
- Create: research/2026-04-30-llm-word-features/build_iconicity.py

[ ] Step 1: Write build_iconicity.py — adapted from build_boi.py with prompt-context wiring

"""Production build: AI-estimated iconicity ratings over the non-PROPN
PhonoLex content vocabulary via gpt-4.1-mini.

Replaces Winter et al. 2024 iconicity ratings (no posted license; OSF
supplementary qvw6u with null license, Springer article TDM-only).

Vocabulary scope: CMU dict ∩ FineWeb-Edu frequency table, FILTERED to
words whose PHON-72 dominant POS is NOT 'PROPN'. About 48K content words.

Validation (validate_iconicity.py, N=500 vs Winter 2024): see commit
referenced for measured Spearman.

Output: data/norms/phonolex_iconicity.tsv with columns:
  word, iconicity, cov_iconicity, cov_glosses

Resumable via append-mode TSV write.

Usage:
    python build_iconicity.py [--model gpt-4.1-mini] [--concurrency 6] [--resume]
"""
from __future__ import annotations

import argparse
import asyncio
import csv
import math
import os
import sys
import time
from pathlib import Path

from openai import AsyncOpenAI

sys.path.insert(0, str(Path(__file__).parent))
from harness import FEATURES
from harness_openai import _load_dotenv
from wordnet_glosses import top_n_glosses
from word_ipa import word_ipa
_load_dotenv()


REPO = Path(__file__).resolve().parents[2]
CMU_PATH = REPO / "data" / "cmu" / "cmudict-0.7b"
FREQ_PATH = REPO / "data" / "norms" / "phonolex_frequency.tsv"
DEFAULT_OUT = REPO / "data" / "norms" / "phonolex_iconicity.tsv"


def load_cmu_words() -> set[str]:
    out: set[str] = set()
    with open(CMU_PATH, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;"):
                continue
            tok = line.split(maxsplit=1)
            if not tok:
                continue
            w = tok[0].strip().lower()
            if w.endswith(")") and "(" in w:
                w = w[: w.rindex("(")]
            if w and w.isalpha():
                out.add(w)
    return out


def load_content_freq_words() -> set[str]:
    """Frequency-table words whose dominant POS is not PROPN (the caller intersects with CMU)."""
    out: set[str] = set()
    with open(FREQ_PATH, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            w = row["Word"].strip().lower()
            if not w or row.get("Dom_PoS") == "PROPN":
                continue
            out.add(w)
    return out


def render_prompt(word: str) -> tuple[str, bool]:
    spec = FEATURES["iconicity"]
    g = top_n_glosses(word)
    ipa = word_ipa(word) or ""
    if g:
        glosses = f" Its primary meaning(s): {g}."
        wn_cov = True
    else:
        glosses = ""
        wn_cov = False
    return spec.prompt_template.format(word=word, ipa=ipa, glosses=glosses), wn_cov


async def rate_one(client: AsyncOpenAI, model: str, word: str,
                   sem: asyncio.Semaphore, retries: int = 3) -> tuple[float, float, float]:
    """Returns (ev, cov_iconicity, cov_glosses)."""
    spec = FEATURES["iconicity"]
    prompt, wn_cov = render_prompt(word)
    cov_glosses = 1.0 if wn_cov else 0.0
    delay = 1.0
    last_err: Exception | None = None
    for _attempt in range(retries):
        async with sem:
            try:
                resp = await client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0,
                    max_completion_tokens=4,
                    logprobs=True,
                    top_logprobs=20,
                )
            except Exception as e:
                last_err = e
                await asyncio.sleep(delay)
                delay *= 2
                continue
        lp = resp.choices[0].logprobs
        if not lp or not lp.content:
            return float("nan"), 0.0, cov_glosses
        target_pos = None
        for i, pos in enumerate(lp.content):
            try:
                r = int(pos.token.strip())
                if spec.scale_min <= r <= spec.scale_max:
                    target_pos = i
                    break
            except (ValueError, TypeError):
                continue
        if target_pos is None:
            target_pos = 0
        pos = lp.content[target_pos]
        rating_p: dict[int, float] = {r: 0.0 for r in range(spec.scale_min, spec.scale_max + 1)}
        for entry in pos.top_logprobs:
            try:
                r = int(entry.token.strip())
            except (ValueError, TypeError):
                continue
            if spec.scale_min <= r <= spec.scale_max:
                rating_p[r] += math.exp(entry.logprob)
        total = sum(rating_p.values())
        if total <= 0:
            return float("nan"), 0.0, cov_glosses
        ev = sum(r * (p / total) for r, p in rating_p.items())
        return ev, total, cov_glosses
    print(f"  [fail] {word}: {last_err}", file=sys.stderr)
    return float("nan"), 0.0, cov_glosses


async def amain(args: argparse.Namespace) -> int:
    print("[load] CMU dict ∩ FineWeb-Edu frequency words (non-PROPN content)")
    cmu = load_cmu_words()
    freq_content = load_content_freq_words()
    common = sorted(cmu & freq_content)
    print(f"[scope] CMU({len(cmu):,}) ∩ non-PROPN freq({len(freq_content):,}) = {len(common):,} words")

    if args.limit:
        common = common[: args.limit]
        print(f"[limit] truncated to {len(common):,}")

    out_path = Path(args.out)
    done: set[str] = set()
    write_header = True
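    if args.resume and out_path.exists():
        with open(out_path, encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter="\t")
            for row in reader:
                # Rows whose rating is NaN are failed requests: leave them out of
                # `done` so --resume retries them. The retried row is appended and
                # the stale NaN row stays in the file.
                try:
                    failed = math.isnan(float(row["iconicity"]))
                except (KeyError, TypeError, ValueError):
                    failed = True
                if row.get("word") and not failed:
                    done.add(row["word"])
        write_header = False
        print(f"[resume] {len(done):,} words already rated in {out_path}; skipping")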

    todo = [w for w in common if w not in done]
    print(f"[todo] {len(todo):,} words to process")
    if not todo:
        print("Nothing to do.")
        return 0

    fieldnames = ["word", "iconicity", "cov_iconicity", "cov_glosses"]
    out_path.parent.mkdir(parents=True, exist_ok=True)
    f_out = open(out_path, "a" if not write_header else "w", encoding="utf-8", newline="")
    writer = csv.DictWriter(f_out, fieldnames=fieldnames, delimiter="\t")
    if write_header:
        writer.writeheader()

    client = AsyncOpenAI()
    sem = asyncio.Semaphore(args.concurrency)
    print(f"[run] model={args.model}  concurrency={args.concurrency}")

    t0 = time.time()
    n_done = 0
    n_fail = 0
    n_no_wn = 0
    chunk = 50
    for i in range(0, len(todo), chunk):
        batch = todo[i : i + chunk]
        results = await asyncio.gather(
            *(rate_one(client, args.model, w, sem) for w in batch)
        )
        for w, (ev, cov, cov_g) in zip(batch, results):
            writer.writerow({
                "word": w, "iconicity": ev,
                "cov_iconicity": cov, "cov_glosses": cov_g,
            })
            n_done += 1
            if math.isnan(ev):
                n_fail += 1
            if cov_g == 0.0:
                n_no_wn += 1
        f_out.flush()

        elapsed = time.time() - t0
        rate = n_done / elapsed
        eta_sec = (len(todo) - n_done) / max(rate, 1e-6)
        print(f"  {n_done:>6d}/{len(todo)} ({100*n_done/len(todo):.1f}%)  "
              f"rate={rate:.2f} w/s  fails={n_fail}  no-WN={n_no_wn}  "
              f"eta={eta_sec/3600:.2f}h", flush=True)

    f_out.close()
    print(f"[done] {n_done:,} words; {n_fail} failed; {n_no_wn} no-WN; total {time.time()-t0:.0f}s")
    return 0


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--model", default="gpt-4.1-mini")
    p.add_argument("--concurrency", type=int, default=6)
    p.add_argument("--out", default=str(DEFAULT_OUT))
    p.add_argument("--resume", action="store_true",
                   help="Skip words already in the output TSV")
    p.add_argument("--limit", type=int, default=0,
                   help="Truncate vocabulary for testing (0 = no limit)")
    args = p.parse_args()
    if not os.environ.get("OPENAI_API_KEY"):
        print("ERROR: OPENAI_API_KEY not set", file=sys.stderr)
        return 1
    return asyncio.run(amain(args))


if __name__ == "__main__":
    raise SystemExit(main())

[ ] Step 2: Smoke run with --limit 100

cd research/2026-04-30-llm-word-features
python build_iconicity.py --limit 100 --out /tmp/iconicity_smoke.tsv
head -5 /tmp/iconicity_smoke.tsv
wc -l /tmp/iconicity_smoke.tsv

Expected: 101 lines (header + 100 rows), all four columns populated, ratings within [1.0, 7.0], mostly cov_iconicity > 0.99, mixed cov_glosses ∈ {0, 1}.
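These expectations can be turned into a small check rather than eyeballed. The sketch below reads TSV text with the four columns named in the build script and reports violations. The function name, the in-memory sample rows, and the choice to treat NaN ratings as violations are assumptions for illustration.

```python
from __future__ import annotations

import csv
import io
import math

EXPECTED_COLUMNS = ["word", "iconicity", "cov_iconicity", "cov_glosses"]


def check_smoke_tsv(text: str, expected_rows: int) -> list[str]:
    """Return human-readable problems; an empty list means the file looks sane."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    problems: list[str] = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for row in rows:
        rating = float(row["iconicity"])
        if math.isnan(rating) or not 1.0 <= rating <= 7.0:
            problems.append(f"{row['word']}: rating {row['iconicity']} outside [1, 7]")
        if float(row["cov_glosses"]) not in (0.0, 1.0):
            problems.append(f"{row['word']}: cov_glosses must be 0 or 1")
    return problems


# Invented rows, not real model output.
good = (
    "word\ticonicity\tcov_iconicity\tcov_glosses\n"
    "buzz\t6.1\t0.999\t1.0\n"
    "table\t1.4\t0.998\t0.0\n"
)
bad = good + "oops\tnan\t0.0\t1.0\n"

print(check_smoke_tsv(good, expected_rows=2))
print(check_smoke_tsv(bad, expected_rows=3))
```

For the real smoke file, the text would come from reading /tmp/iconicity_smoke.tsv as UTF-8, with `expected_rows=100`.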

[ ] Step 3: Full production build

cd research/2026-04-30-llm-word-features
python build_iconicity.py --concurrency 6

Expected (~1.5-2h runtime, ~$5 OpenAI spend, mirrors PHON-82 BOI):
- ~48K rows in data/norms/phonolex_iconicity.tsv
- 0 failures (or very few; rerun with --resume to retry)
- Spot-check probe words (pre-distribution check):
  - buzz, crash, peep → ≥ 5.5
  - zigzag, wiggle → ≥ 4.5
  - concept, freedom, algorithm → ≤ 2.5
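The probe thresholds listed above lend themselves to a table-driven check. A sketch under the assumption that ratings have already been loaded into a plain `{word: rating}` dict; the function name and the sample ratings are made up.

```python
from __future__ import annotations

import math

# (words, comparison, threshold) triples copied from the spot-check list.
PROBES: list[tuple[tuple[str, ...], str, float]] = [
    (("buzz", "crash", "peep"), ">=", 5.5),
    (("zigzag", "wiggle"), ">=", 4.5),
    (("concept", "freedom", "algorithm"), "<=", 2.5),
]


def failed_probes(ratings: dict[str, float]) -> list[str]:
    """Return probe words that are missing, NaN, or on the wrong side of their threshold."""
    failures: list[str] = []
    for words, op, threshold in PROBES:
        for word in words:
            value = ratings.get(word)
            if value is None or math.isnan(value):
                failures.append(f"{word}: missing or NaN")
            elif op == ">=" and value < threshold:
                failures.append(f"{word}: {value:.2f} < {threshold}")
            elif op == "<=" and value > threshold:
                failures.append(f"{word}: {value:.2f} > {threshold}")
    return failures


# Invented ratings for illustration only.
sample = {
    "buzz": 6.4, "crash": 6.0, "peep": 5.1,
    "zigzag": 5.2, "wiggle": 4.9,
    "concept": 1.3, "freedom": 1.6,
}
print(failed_probes(sample))
```

Keeping the thresholds in one table means a later prompt revision only has to rerun the check, and any probe that drifts shows up by name.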

[ ] Step 4: Full-vocab validation against complete Winter 2024

cd research/2026-04-30-llm-word-features
python -c "
import csv, math
from scipy import stats
from pathlib import Path
oracle = {}
with open(Path('../../data/norms/_oracles/iconicity_winter2024.csv')) as f:
    for r in csv.DictReader(f):
        try: oracle[r['word'].lower()] = float(r['rating'])
        except: pass
ours = {}
with open(Path('../../data/norms/phonolex_iconicity.tsv')) as f:
    for r in csv.DictReader(f, delimiter='\t'):
        try:
            v = float(r['iconicity'])
            if not math.isnan(v): ours[r['word']] = v
        except: pass
common = set(oracle) & set(ours)
print(f'overlap: {len(common):,}')
x = [ours[w] for w in common]; y = [oracle[w] for w in common]
print(f'Spearman: {stats.spearmanr(x, y)[0]:+.4f}')
print(f'Pearson:  {stats.pearsonr(x, y)[0]:+.4f}')
"

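Expected: Spearman on the full Winter overlap close to the held-out value from Task 3. The 500-word sample is drawn from largely the same word population, so the two should agree within sampling error; a larger N makes the estimate more precise, not systematically higher. The value should still clear the 0.60 target.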
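When comparing the N=500 estimate with the full-overlap value, it helps to know how wide the sampling noise on the smaller sample is. One common approximation (an assumption here, not something the ticket prescribes) applies the Fisher z-transform to Spearman's coefficient with variance 1.06/(n-3), following Fieller, Hartley and Pearson (1957). A sketch:

```python
import math


def spearman_ci(rho: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% CI for a Spearman coefficient via the Fisher z-transform.

    Uses the Fieller-Hartley-Pearson variance approximation 1.06 / (n - 3).
    """
    if n <= 3 or not -1.0 < rho < 1.0:
        raise ValueError("need n > 3 and -1 < rho < 1")
    z = math.atanh(rho)
    se = math.sqrt(1.06 / (n - 3))
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Example: an observed rho of exactly the 0.60 target on the N=500 sample.
lo, hi = spearman_ci(0.60, 500)
print(f"rho=0.60, n=500 -> [{lo:.3f}, {hi:.3f}]")
```

At N=500 the interval spans roughly 0.06 on either side of the estimate, so a difference of a few hundredths between the held-out and full-overlap values carries no information about the prompt.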

[ ] Step 5: Commit

git add research/2026-04-30-llm-word-features/build_iconicity.py data/norms/phonolex_iconicity.tsv
git commit -m "PHON-83: production iconicity build (~48K words, gpt-4.1-mini, Spearman <X.XX>)"

Task 5: Loader + pipeline integration

Files:
- Create: packages/data/src/phonolex_data/loaders/phonolex_iconicity.py
- Modify: packages/data/src/phonolex_data/loaders/__init__.py
- Modify: packages/data/src/phonolex_data/pipeline/words.py:14,70-73,210-214
- Modify: packages/data/src/phonolex_data/pipeline/schema.py (re-add iconicity field)
- Modify: packages/data/tests/test_new_loaders.py

[ ] Step 1: Write the loader

"""Loader for the PhonoLex in-house iconicity ratings.

Replaces Winter et al. 2024 iconicity ratings (no posted license; OSF
supplementary `osf.io/qvw6u` has null license, Springer article TDM-only).

Source: cloze-prompt LLM rating extraction via gpt-4.1-mini, same
methodology as PHON-73's 5-feature build and PHON-82's BOI replacement.
Prompt anchors high-iconicity sound/motion-mimics (buzz, crash, zigzag)
and low-iconicity arbitrary words (table, concept, reason), 1-7 scale.

Vocabulary scope: CMU∩freq filtered to non-PROPN content words.
Same scope as PHON-82 PhonoLex BOI.

Validation against held-out Winter 2024 oracle (kept locally at
data/norms/_oracles/iconicity_winter2024.csv, not redistributed).

Output field: `iconicity` (alias retained for backward compat with the
previous Winter-shipped column).
"""
from __future__ import annotations

import csv
from pathlib import Path
from typing import Iterable

from phonolex_data.loaders._helpers import get_data_dir


def load_phonolex_iconicity(
    path: str | Path | None = None,
    filter_words: Iterable[str] | None = None,
) -> dict[str, dict[str, float]]:
    """Load PhonoLex's in-house AI-derived iconicity ratings.

    Args:
        path: Path to the TSV. Defaults to ``data/norms/phonolex_iconicity.tsv``.
        filter_words: Optional iterable of word strings (lowercase). When
            provided, only entries with ``word`` in this set are returned.

    Returns:
        {word: {"iconicity": float}}. Values are LLM expected-value ratings
        on a 1-7 scale. Words with NaN (rare retry failures) are skipped.
    """
    path = Path(path) if path else get_data_dir() / "norms" / "phonolex_iconicity.tsv"
    allowed: set[str] | None = (
        {w.lower() for w in filter_words} if filter_words is not None else None
    )
    out: dict[str, dict[str, float]] = {}
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            w = (row.get("word") or "").strip().lower()
            if not w:
                continue
            if allowed is not None and w not in allowed:
                continue
            try:
                v = float(row["iconicity"])
            except (KeyError, TypeError, ValueError):
                # TypeError: short rows, where DictReader fills the cell with None
                continue
            if v != v:  # NaN check
                continue
            out[w] = {"iconicity": v}
    return out
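The loader's skip rules can be exercised on an in-memory TSV, without the real data file. The sketch below is a stand-in for the loader's inner loop, not the loader itself: `parse_iconicity_rows` is a hypothetical name, and the sample rows are made up to hit each skip path.

```python
import csv
import io
import math


def parse_iconicity_rows(tsv_file):
    """Hypothetical mirror of the loader loop, reading from any text file object."""
    out = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        word = (row.get("word") or "").strip().lower()
        if not word:
            continue
        try:
            value = float(row["iconicity"])
        except (KeyError, TypeError, ValueError):
            # KeyError: column absent; TypeError: short row (cell is None);
            # ValueError: text that is not a number.
            continue
        if math.isnan(value):
            continue
        out[word] = {"iconicity": value}
    return out


sample = io.StringIO(
    "word\ticonicity\n"
    "Buzz\t6.10\n"   # kept, key lowercased
    "crash\tnan\n"   # skipped: NaN
    "peep\n"         # skipped: short row
    "\t3.0\n"        # skipped: empty word
    "table\tn/a\n"   # skipped: not a number
)
print(parse_iconicity_rows(sample))  # {'buzz': {'iconicity': 6.1}}
```

Only the first data row survives, which matches the docstring's promise that NaN and malformed entries are dropped rather than raised.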

[ ] Step 2: Add the loader test

Append to packages/data/tests/test_new_loaders.py:

def test_load_phonolex_iconicity():
    from phonolex_data.loaders import load_phonolex_iconicity

    result = load_phonolex_iconicity()
    assert isinstance(result, dict)
    assert len(result) > 40_000  # ~48K non-PROPN content words

    # Probe high-iconicity (sound-mimic) — should land high
    for w in ("buzz", "crash"):
        assert w in result, f"{w} missing"
        v = result[w]["iconicity"]
        assert v > 4.5, f"{w} iconicity={v} below expected high"

    # Probe low-iconicity (arbitrary) — should land low
    for w in ("table", "concept"):
        assert w in result, f"{w} missing"
        v = result[w]["iconicity"]
        assert v < 4.0, f"{w} iconicity={v} above expected low"

    # filter_words restriction
    subset = load_phonolex_iconicity(filter_words={"buzz", "crash"})
    assert set(subset.keys()) == {"buzz", "crash"}

[ ] Step 3: Run the loader test

cd packages/data
uv run python -m pytest tests/test_new_loaders.py::test_load_phonolex_iconicity -v

Expected: PASS.

[ ] Step 4: Wire into loaders/__init__.py

Add an import line and an __all__ entry, mirroring phonolex_boi:

from phonolex_data.loaders.phonolex_iconicity import load_phonolex_iconicity

Add "load_phonolex_iconicity", to __all__.

[ ] Step 5: Wire into pipeline/words.py

Three small edits:

(a) Add to the loader import block at top of file:

    load_phonolex_iconicity,

(b) Replace the comment "# iconicity removed 2026-05-02 (PHON-71 closeout); PHON-83 will re-add" with:

    "iconicity": "iconicity",

(c) Replace the iconicity-removed comment in norm_loaders with:

        ("PhonoLex Iconicity",
         lambda s=cmu_word_set: load_phonolex_iconicity(filter_words=s)),
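The `s=cmu_word_set` default in edit (c) binds the word set when the lambda is created rather than when it is called. Whether words.py strictly needs this depends on how `cmu_word_set` is scoped there, which this plan does not show. The sketch below only illustrates the general difference, using hypothetical function names.

```python
def make_loaders_late(word_sets):
    # Each lambda looks up `s` only when called, so all of them see the last set.
    return [lambda: sorted(s) for s in word_sets]


def make_loaders_bound(word_sets):
    # A default argument is evaluated at definition time: one value per lambda.
    return [lambda s=s: sorted(s) for s in word_sets]


sets = [{"b", "a"}, {"d", "c"}]
print([f() for f in make_loaders_late(sets)])   # [['c', 'd'], ['c', 'd']]
print([f() for f in make_loaders_bound(sets)])  # [['a', 'b'], ['c', 'd']]
```

Following the existing pattern keeps the new entry consistent with the other norm_loaders entries either way.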

[ ] Step 6: Re-add iconicity field to WordRecord

In packages/data/src/phonolex_data/pipeline/schema.py, find the boi: float | None = None line; add directly above:

    iconicity: float | None = None

[ ] Step 7: Re-add property metadata

In packages/web/workers/scripts/config.py, replace the iconicity-stripped comment block (lines ~346-349) with:

        PropertyDef(
            id="iconicity",
            label="Iconicity",
            short_label="Icon",
            source="PhonoLex Iconicity (gpt-4.1-mini, PHON-83)",
            description="Perceived resemblance between a word's form (sound) and meaning. AI-derived (cloze-prompt with IPA + WordNet glosses) replacement for Winter 2024 (no posted license). Spearman <X.XX> vs Winter oracle.",
            scale="1-7",
            interpretation="Higher = stronger sound-meaning resemblance",
            display_format=".2f",
            slider_step=0.1,
        ),

In packages/web/workers/src/config/properties.ts, replace the iconicity-stripped comment (lines ~248-249) with:

      {
        id: 'iconicity', label: 'Iconicity', short_label: 'Icon',
        source: 'PhonoLex Iconicity (gpt-4.1-mini, PHON-83)',
        description: "Perceived resemblance between a word's form (sound) and meaning. AI-derived (cloze-prompt with IPA + WordNet glosses) replacement for Winter 2024 (no posted license). Spearman <X.XX> vs Winter oracle.",
        scale: '1-7', interpretation: 'Higher = stronger sound-meaning resemblance',
        display_format: '.2f', filterable: true, slider_step: 0.1,
        use_log_scale: false, is_integer: false,
      },

(Replace <X.XX> in both with the measured Spearman from Task 4.)
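A leftover `<X.XX>` is easy to miss in long description strings. As a hedged sketch, a small scan over the two config files can confirm the placeholder is gone before committing; `find_unfilled` is a hypothetical helper, not an existing script in the repo.

```python
from pathlib import Path

PLACEHOLDER = "<X.XX>"


def find_unfilled(paths, token=PLACEHOLDER):
    """Return (path, line_number) pairs for every line still containing the token."""
    hits = []
    for p in paths:
        text = Path(p).read_text(encoding="utf-8")
        for n, line in enumerate(text.splitlines(), start=1):
            if token in line:
                hits.append((str(p), n))
    return hits
```

Calling it with packages/web/workers/scripts/config.py and packages/web/workers/src/config/properties.ts should return an empty list once both descriptions carry the measured value.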

[ ] Step 8: Run full data pipeline test suite

cd packages/data
uv run python -m pytest tests/ -v

Expected: all tests pass, including the new test_load_phonolex_iconicity.

[ ] Step 9: Reseed local D1

cd packages/web/workers
uv run python scripts/export-to-d1.py
npx wrangler d1 execute phonolex --local --file scripts/d1-seed.sql

Expected: seed regenerates with iconicity column populated; D1 reseed completes without error.

[ ] Step 10: Spot-check D1 has the column

cd packages/web/workers
npx wrangler d1 execute phonolex --local --command "SELECT word, iconicity FROM words WHERE word IN ('buzz','crash','table','concept') ORDER BY word;"

Expected: buzz + crash high (≥ 4.5), concept + table low (≤ 4.0).

[ ] Step 11: Update NOTICE + remediation checklist

In NOTICE, add iconicity to "Replaced 2026-05-02" entries (mirror the BOI line that landed in PHON-82).

In docs/data-license-remediation-checklist.md, toggle the Iconicity status from 🟡 (or RED) to 🟢.

[ ] Step 12: Final commit

git add packages/data/src/phonolex_data/loaders/phonolex_iconicity.py \
        packages/data/src/phonolex_data/loaders/__init__.py \
        packages/data/src/phonolex_data/pipeline/words.py \
        packages/data/src/phonolex_data/pipeline/schema.py \
        packages/data/tests/test_new_loaders.py \
        packages/web/workers/scripts/config.py \
        packages/web/workers/src/config/properties.ts \
        packages/web/workers/scripts/d1-seed.sql \
        NOTICE \
        docs/data-license-remediation-checklist.md
git commit -m "PHON-83: integrate PhonoLex iconicity into pipeline + D1 (Spearman <X.XX> vs Winter)"

Verification checklist (run before declaring PHON-83 done)¶

[ ] data/norms/phonolex_iconicity.tsv exists with ~48K rows, all 4 columns populated
[ ] Held-out Spearman vs Winter 2024 ≥ 0.60 (Task 3)
[ ] Full-vocab Spearman vs Winter 2024 measured + recorded in commit (Task 4)
[ ] Probe words land sensibly (buzz/crash high, table/concept low)
[ ] uv run python -m pytest packages/data/tests/ -v — all green
[ ] D1 has populated iconicity column with sensible values for probe words
[ ] PHON-71 remediation checklist marks Iconicity 🟢
[ ] NOTICE lists iconicity under "Replaced 2026-05-02"
[ ] PHON-83 ticket: paste held-out + full-vocab Spearman into a comment, transition to Done
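The first checklist item (row count, all columns populated) can be scripted. The sketch below is illustrative only: `summarize_tsv` is a hypothetical name, and because the four column names are not listed in this section, it reads them from the header instead of hard-coding them. A cell containing the text "nan" counts as populated here, so NaN ratings need a separate check.

```python
import csv
import io


def summarize_tsv(tsv_file):
    """Return (columns, row_count, rows_with_an_empty_cell) for a TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    columns = list(reader.fieldnames or [])
    n_rows = 0
    incomplete = 0
    for row in reader:
        n_rows += 1
        # Short rows yield None for the missing cells; treat those as empty too.
        if any((row.get(c) or "").strip() == "" for c in columns):
            incomplete += 1
    return columns, n_rows, incomplete
```

On the production file, the expectation from the checklist is four columns, a row count near 48K, and zero incomplete rows.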