Confluence Knowledge Base Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Stand up the PhonoLex Confluence space with a skeleton (10 top-level section pages) plus a starter pack of populated pages, written fresh from repo/code review, passing the 100-year-historian test.

Architecture: Confluence space PHONOLEX at neumannsworkshop.atlassian.net. Space is created manually via the Confluence UI (MCP cannot create spaces). All pages are created via the Atlassian MCP createConfluencePage tool. Ten section stubs are created first as children of the space home page, then starter pages are created as children of their respective sections, then the home page is updated with links.

Tech Stack: Atlassian Confluence Cloud, Atlassian MCP plugin (mcp__plugin_atlassian_atlassian__* tools), no repo code changes.

Reference spec: docs/superpowers/specs/2026-04-18-confluence-knowledge-base-design.md

cloudId: 7a1f4095-96d2-458f-9a92-11413f425d84 (neumannsworkshop.atlassian.net)


Content authoring convention

Each starter page in this plan specifies:

  1. Title — exact page title
  2. Parent — which section page is its parent
  3. Required headings — the structure the page must include
  4. Repo sources to review — files and paths the author must read before writing
  5. Historian test — the specific historical angle this page must satisfy

Page content is composed fresh at execution time by reading the listed repo sources. Do not copy-paste repo content verbatim. Write prose appropriate for an internal KB read by future collaborators, contractors, and auditors.

Content format

The Atlassian MCP createConfluencePage tool accepts page content in a specific format (Atlassian Document Format / ADF, storage XHTML, or markdown depending on the tool version). Before writing the first page, inspect the tool schema via ToolSearch to confirm the exact format parameter. Use that format consistently across all tasks.
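As orientation, here is a sketch of what the same one-sentence body looks like in the two most likely formats; the node shapes and field names are assumptions to confirm against the actual tool schema before use:

```python
# Hypothetical examples of the two common Confluence body formats.
# Verify the real parameter name and accepted format via ToolSearch first.

# ADF (Atlassian Document Format): a JSON document tree.
adf_body = {
    "type": "doc",
    "version": 1,
    "content": [
        {
            "type": "paragraph",
            "content": [
                {"type": "text", "text": "Why PhonoLex exists and who it serves."}
            ],
        }
    ],
}

# Storage format: Confluence's XHTML dialect, passed as a string.
storage_body = "<p>Why PhonoLex exists and who it serves.</p>"
```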

Verification

After each page is created, fetch it back with getConfluencePage and confirm:

- Title matches
- Parent ID matches the intended parent
- All required headings are present in the body
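The three checks can be sketched as a small helper; the response field names (title, parentId, body) are assumptions to verify against the real getConfluencePage output shape:

```python
def verify_page(page: dict, expected_title: str, expected_parent_id: str,
                required_headings: list[str]) -> list[str]:
    """Return a list of verification failures (empty list means the page passes).

    `page` is assumed to be the dict returned by getConfluencePage;
    the field names used below may differ in the real response.
    """
    failures = []
    if page.get("title") != expected_title:
        failures.append(f"title mismatch: {page.get('title')!r}")
    if page.get("parentId") != expected_parent_id:
        failures.append(f"parent mismatch: {page.get('parentId')!r}")
    body = page.get("body", "")
    for heading in required_headings:
        if heading not in body:
            failures.append(f"missing heading: {heading!r}")
    return failures
```

Run it after each create and treat a non-empty list as a signal to fix and re-verify.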


Task 0: Create the PhonoLex space (manual, by user)

Why manual: The MCP surface does not include a createSpace tool. Space creation must happen in the Confluence UI.

  • [ ] Step 1: User creates the space

In the Confluence UI at https://neumannsworkshop.atlassian.net/wiki:

1. Click "Create space"
2. Choose a blank space template
3. Name: PhonoLex
4. Key: PHONOLEX
5. Permissions: internal (default)
6. Click Create

  • [ ] Step 2: Record the space ID and home page ID

Run (via MCP):

mcp__plugin_atlassian_atlassian__getConfluenceSpaces(cloudId="7a1f4095-96d2-458f-9a92-11413f425d84")

Record:

- Space ID: {PHONOLEX_SPACE_ID} (from the id field of the PhonoLex entry)
- Home page ID: {PHONOLEX_HOME_ID} (from the homepageId field)

Substitute these placeholders throughout the rest of the plan.
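A minimal sketch of the extraction step, assuming the response is a list of space objects with key, id, and homepageId fields as described above (field names to verify against the live schema):

```python
def find_space_ids(spaces_response: list[dict], key: str = "PHONOLEX"):
    """Pick out (space ID, home page ID) for a space key from the
    getConfluenceSpaces result. Field names are assumptions."""
    for space in spaces_response:
        if space.get("key") == key:
            return space["id"], space["homepageId"]
    raise LookupError(f"space {key!r} not found; was it created in the UI?")
```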

  • [ ] Step 3: Verify

Run:

mcp__plugin_atlassian_atlassian__getConfluencePage(cloudId="...", pageId="{PHONOLEX_HOME_ID}")

Expected: returns a page with space key PHONOLEX. If not, re-check the space was created correctly.


Task 1: Create the 10 top-level section stub pages

Files: None (Confluence pages only)

Each section stub is a short page with a single paragraph stating the section's purpose. All ten are created as direct children of the space home page {PHONOLEX_HOME_ID}.

  • [ ] Step 1: Confirm page content format

Run:

ToolSearch(query="select:mcp__plugin_atlassian_atlassian__createConfluencePage")

Read the schema. Note the content format parameter name (e.g., body, content) and the expected format (ADF JSON vs storage XHTML vs markdown). Use that format consistently in all subsequent page creation.

  • [ ] Step 2: Create all 10 section pages

For each row below, call createConfluencePage with:

- cloudId="7a1f4095-96d2-458f-9a92-11413f425d84"
- spaceId="{PHONOLEX_SPACE_ID}"
- parentId="{PHONOLEX_HOME_ID}"
- title=<Title>
- body=<Purpose paragraph> (in the confirmed format)

| Title | Purpose paragraph |
| --- | --- |
| Mission & Origins | Why PhonoLex exists, the problem it addresses, and who it serves. The opening chapter of the project's history. |
| Architecture Overview | A stable, high-level view of the three faces of PhonoLex (Web, Generation, Catalog) and the data flow from raw datasets to the deployed system. Detailed technical documentation lives in the repo; this page is the "whiteboard view" for onboarding and cross-cutting discussion. |
| Data & Datasets | Provenance, license, version, and update cadence for every source dataset used by PhonoLex. The audit and attribution record for all incoming data. |
| Decisions (ADRs) | Significant technical and product decisions, one page per decision. Each ADR captures context, options considered, choice, and consequences. ADRs are immutable once accepted; a superseding decision gets a new ADR that references the old one. |
| Operations | How the system runs in practice: environments, deploy topology, secrets inventory, monitoring, cost tracking. |
| Runbooks | Step-by-step procedures for repeatable operational tasks (deploys, releases, rotations, rebuilds, onboarding, incident response). Populated on demand as procedures stabilize. |
| Roadmap & Milestones | Release history and planned work. The project's historical timeline. |
| Incidents & Lessons | Short post-mortems for failures and outages: what broke, why, what changed. Institutional memory. Populated on demand. |
| Legal & Licensing | License scope, dataset attributions, trademarks, domains, Terms of Service, and privacy policy source of truth. |
| People & Relationships | Contributors (current and past) and external relationships (vendors, partners, service providers). |

Record the returned page ID for each section. These are the {SECTION_*_ID} placeholders used in later tasks:

  • {MISSION_ID}, {ARCH_ID}, {DATA_ID}, {ADR_ID}, {OPS_ID}, {RUNBOOKS_ID}, {ROADMAP_ID}, {INCIDENTS_ID}, {LEGAL_ID}, {PEOPLE_ID}
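The create-and-record loop above can be sketched as follows; create_page is a stand-in for the MCP createConfluencePage call, not a real function, and the returned id field is an assumption to verify:

```python
# Section titles keyed by the placeholder name used later in the plan.
SECTIONS = {
    "MISSION_ID": "Mission & Origins",
    "ARCH_ID": "Architecture Overview",
    "DATA_ID": "Data & Datasets",
    "ADR_ID": "Decisions (ADRs)",
    "OPS_ID": "Operations",
    "RUNBOOKS_ID": "Runbooks",
    "ROADMAP_ID": "Roadmap & Milestones",
    "INCIDENTS_ID": "Incidents & Lessons",
    "LEGAL_ID": "Legal & Licensing",
    "PEOPLE_ID": "People & Relationships",
}

def create_sections(create_page, space_id: str, home_id: str,
                    purposes: dict[str, str]) -> dict[str, str]:
    """Create the ten section stubs and return {placeholder: page_id}."""
    ids = {}
    for placeholder, title in SECTIONS.items():
        page = create_page(spaceId=space_id, parentId=home_id,
                           title=title, body=purposes[title])
        ids[placeholder] = page["id"]
    return ids
```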

  • [ ] Step 3: Verify all ten pages exist under the home

Run:

mcp__plugin_atlassian_atlassian__getConfluencePageDescendants(cloudId="...", pageId="{PHONOLEX_HOME_ID}")

Expected: ten direct children with the ten titles above. If any are missing, re-run createConfluencePage for those. If any have wrong parents, use updateConfluencePage to re-parent.


Task 2: Starter page — "What PhonoLex is and why it exists"

Parent: {MISSION_ID} (Mission & Origins)
Title: What PhonoLex is and why it exists

Repo sources to review before writing:

- CLAUDE.md — top section "What This Is"
- docs/product-plan.md — product vision and roadmap
- packages/web/frontend/src/ — skim the tool list to understand the user-facing surface
- README.md (if present at repo root)

Historian test: A reader 100 years from now should understand what problem PhonoLex was built to solve, who needed it, and why it took the form it did. Not how it's implemented — why it exists.

Required headings (H2):

1. What it is
2. The problem it addresses
3. Who it serves (SLPs, phonological researchers, educators, learners)
4. The three faces (Web app, Governed Generation, Content Catalog)
5. Origins (when, by whom, initial motivation)

  • [ ] Step 1: Read the listed repo sources

  • [ ] Step 2: Compose the page content

Fresh prose, 400–800 words total. Not a copy of the repo docs — a narrative appropriate for an internal KB.

  • [ ] Step 3: Create the page
mcp__plugin_atlassian_atlassian__createConfluencePage(
    cloudId="7a1f4095-96d2-458f-9a92-11413f425d84",
    spaceId="{PHONOLEX_SPACE_ID}",
    parentId="{MISSION_ID}",
    title="What PhonoLex is and why it exists",
    body=<composed content>
)

Record returned page ID.

  • [ ] Step 4: Verify

Fetch the page with getConfluencePage and confirm: title matches, parent is {MISSION_ID}, all five H2 headings present in body.


Task 3: Starter page — "High-level architecture"

Parent: {ARCH_ID} (Architecture Overview)
Title: High-level architecture

Repo sources to review before writing: - CLAUDE.md — "Architecture" section with the ASCII diagram - packages/data/README.md and packages/data/pyproject.toml - packages/governors/README.md and source overview - packages/generation/README.md and Dockerfile - packages/web/workers/wrangler.toml and src/index.ts (top-level routes) - packages/web/frontend/package.json and top-level component structure

Historian test: A reader 100 years from now should understand the shape of the system as a whole — which pieces existed, how they fit together, and where each concern lived. Not file-by-file detail — the shape.

Required headings (H2):

1. The three faces
2. Data flow (raw datasets → pipeline → D1 seed → Workers API → frontend)
3. Generation subsystem (T5Gemma on RunPod Serverless + governor engine)
4. Repository layout (the monorepo packages and what each owns)
5. Where to find more detail (pointer to repo docs and code — explicit Confluence-vs-repo boundary)

  • [ ] Step 1: Read the listed repo sources

  • [ ] Step 2: Compose the page content

600–1000 words. Include a simple text diagram of the data flow. Make the Confluence-vs-repo boundary explicit in the final section.

  • [ ] Step 3: Create the page

createConfluencePage with the above parameters.

  • [ ] Step 4: Verify

getConfluencePage — confirm title, parent, all five H2s.


Task 4: Starter page — "Dataset inventory"

Parent: {DATA_ID} (Data & Datasets)
Title: Dataset inventory

Repo sources to review before writing:

- data/ directory tree — enumerate all subdirectories
- data/cmu/ — CMU Pronouncing Dictionary: version, license
- data/norms/ — list every norm dataset, find LICENSE or README for each
- data/mappings/ — IPA/ARPAbet mapping sources
- data/vocab/ — curated vocab lists and their sources
- packages/data/src/phonolex_data/loaders/ — each loader file names its dataset; cross-check
- Any LICENSE, README, or NOTICE files in data subdirectories

Historian test: A reader 100 years from now should know where each dataset came from, under what terms it was used, and what obligations attached to that use. This is the audit record.

Required structure:

A table with columns: | Dataset | Source | Version / Year | License | Attribution required | Redistribution allowed | Update cadence | Notes |

One row per dataset. Every norm dataset in data/norms/ must appear. CMU must appear. IPA/ARPAbet mappings must appear if they have license implications.

Under the table, a short "How to add a new dataset" subsection pointing to the data-loading code in packages/data/src/phonolex_data/loaders/.

  • [ ] Step 1: Enumerate all datasets

Run ls data/ and ls data/norms/. Record every subdirectory.

  • [ ] Step 2: For each dataset, read its license/attribution source

Look in that directory for a LICENSE, README, or NOTICE file. If none exists, check the original paper or website referenced in the corresponding loader file.

  • [ ] Step 3: Compose the table

One row per dataset, fully populated. If any field is genuinely unknown (e.g., license couldn't be confirmed), mark it as Unknown — needs research rather than guessing.
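One way to keep the Unknown convention mechanical is to render each row from a small inventory record, substituting the agreed marker for any missing field instead of guessing; the column keys below are shorthand for the table headers above, not names from the repo:

```python
def inventory_row(entry: dict) -> str:
    """Render one dataset inventory entry as a markdown table row.

    Any field that is missing or empty is emitted as the plan's
    'Unknown — needs research' marker rather than a guessed value.
    """
    columns = ["dataset", "source", "version", "license",
               "attribution", "redistribution", "cadence", "notes"]
    cells = [str(entry.get(c) or "Unknown — needs research") for c in columns]
    return "| " + " | ".join(cells) + " |"
```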

  • [ ] Step 4: Create the page

createConfluencePage with the table and closing subsection.

  • [ ] Step 5: Verify

getConfluencePage — confirm all dataset rows present, no placeholder text (TBD, TODO) in the final body.


Task 5: Seed ADR 001 — "Cloudflare Workers + D1 for the API"

Parent: {ADR_ID} (Decisions)
Title: ADR 001: Cloudflare Workers + D1 for the API

Repo sources to review before writing:

- packages/web/workers/wrangler.toml
- packages/web/workers/src/index.ts
- packages/web/workers/src/types.ts
- .github/workflows/deploy.yml and deploy-staging.yml
- Git log for packages/web/workers/ — when it was introduced, what it replaced
- CLAUDE.md — API section

Historian test: A reader 100 years from now should understand why the API runs on Cloudflare Workers with D1 instead of a traditional server/database, what alternatives were considered, and what the trade-offs locked in.

Required structure (MADR-style):

# ADR 001: Cloudflare Workers + D1 for the API

**Status:** Accepted
**Date:** YYYY-MM-DD (date of original decision — check git log)

## Context

[What forces drove this decision — cost, deployment model, scale profile, solo-operator constraints]

## Decision

Use Cloudflare Workers with D1 (SQLite at the edge) for the PhonoLex API.

## Options considered

- Cloudflare Workers + D1 — pros/cons
- Traditional VPS + Postgres (e.g., Fly.io, Railway) — pros/cons
- Serverless functions + managed SQL (e.g., AWS Lambda + RDS, Vercel + Neon) — pros/cons

## Consequences

[What this locks in — D1 100-bind-param limit, git-lfs for seed SQL, cold-start profile, edge-latency characteristics; what it enables — free tier at current scale, simple deploys, no infra to operate]

  • [ ] Step 1: Read repo sources and git history

Run:

git log --oneline --all -- packages/web/workers/wrangler.toml | tail -20

This shows when the Workers package was introduced.

  • [ ] Step 2: Compose the ADR

Fill every section. Status = Accepted. Date = the date the decision was first implemented (approximate month is fine if exact day isn't clear).

  • [ ] Step 3: Create the page

createConfluencePage with parent {ADR_ID}.

  • [ ] Step 4: Verify

getConfluencePage — confirm all MADR sections present (Context, Decision, Options considered, Consequences), no placeholder text.


Task 6: Seed ADR 002 — "T5Gemma 9B-2B as the generation model"

Parent: {ADR_ID} (Decisions)
Title: ADR 002: T5Gemma 9B-2B as the generation model

Repo sources to review before writing:

- packages/generation/server/model.py — model loading, context
- packages/generation/pyproject.toml — dependencies
- packages/generation/Dockerfile — image size, runtime requirements
- CLAUDE.md — Governed Generation section
- Memory file project_unified_generation_v6.md (read via filesystem if needed)
- Git log for packages/generation/server/model.py

Historian test: A reader 100 years from now should understand why T5Gemma 9B-2B was selected over other options (other instruction-tuned LMs, smaller models, API-hosted models), and what constraints that choice imposed (GPU VRAM, cost profile, vocabulary size, encoder-decoder architecture).

Required structure: MADR template (Context / Decision / Options considered / Consequences). Same format as ADR 001.

Options that must be discussed under "Options considered": - T5Gemma 9B-2B (encoder-decoder, bf16, 256K vocab) - Similarly-sized decoder-only open models (e.g., Llama-family, Qwen, Mistral) - API-hosted models (e.g., OpenAI, Anthropic) - Smaller open models (<3B)

  • [ ] Step 1: Read the listed sources

  • [ ] Step 2: Compose the ADR

Fill every section. Set Status to Accepted and Date to the date model.py first landed with T5Gemma (check git log).

  • [ ] Step 3: Create the page

  • [ ] Step 4: Verify

As in Task 5.


Task 7: Seed ADR 003 — "Unified trie-based constrained generation architecture"

Parent: {ADR_ID} (Decisions)
Title: ADR 003: Unified trie-based constrained generation architecture

Repo sources to review before writing:

- docs/superpowers/specs/2026-04-15-unified-constrained-generation-design.md (spec for the rewrite)
- packages/governors/src/phonolex_governors/generation/reranker.py
- packages/governors/src/phonolex_governors/generation/trie.py
- packages/generation/server/governor.py
- packages/generation/server/model.py — custom decode loop
- Memory file project_unified_generation_v6.md

Historian test: A reader 100 years from now should understand why the governed-generation architecture was rewritten (from token-level governors to a unified trie-based word-list model), what it replaced, and what the design locked in.

Required structure: MADR template.

Options that must be discussed under "Options considered": - Unified trie-based BAN/BOOST with penalty-only steering (the choice) - Token-level hard-gate + CDD + coverage mechanism (the prior architecture) - Post-hoc filtering + retry only (no steering at decode time)

Key consequences that must be captured:

- Single architectural primitive (word lists + tag trie) replaces five mechanisms
- Custom decode loop (no HF generate())
- GPT-2 PPL + spaCy scorer replaces hand-tuned heuristics
- Enables contrastive-pair constraints and bipolar steering
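For orientation only, here is a toy version of a word-list trie; the real implementation lives in packages/governors/src/phonolex_governors/generation/trie.py, and this sketch only illustrates the prefix-membership question a decode loop would ask when deciding whether a partial word can still reach a listed word:

```python
# Illustrative sketch, not the project's implementation: a character-level
# trie over a word list, answering "can this prefix still complete to a
# listed word?" -- the core query behind BAN/BOOST steering.
class WordTrie:
    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True  # end-of-word marker

    def has_prefix(self, prefix: str) -> bool:
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True
```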

  • [ ] Step 1: Read the listed sources

  • [ ] Step 2: Compose the ADR

Fill every section. Status = Accepted. Date = 2026-04-15 (spec date) or the merge date of the rewrite PR, whichever is more accurate.

  • [ ] Step 3: Create the page

  • [ ] Step 4: Verify

As in Task 5.


Task 8: Starter page — "Environments and deploy topology"

Parent: {OPS_ID} (Operations)
Title: Environments and deploy topology

Repo sources to review before writing:

- .github/workflows/deploy.yml, deploy-staging.yml, ci.yml
- packages/web/workers/wrangler.toml (bindings, routes, secrets names)
- packages/web/frontend/ — Cloudflare Pages build config (if present)
- packages/generation/Dockerfile
- packages/generation/rp_handler.py (RunPod entry point)
- packages/generation/server/main.py (local dev entry point)
- Any .dev.vars.example or secret template files

Historian test: A reader 100 years from now should know where each piece of the system ran, how it was deployed, and what infrastructure it depended on. Enough to reconstruct the topology without archeology.

Required headings (H2):

1. Environments (local / staging / production — URLs, who has access, what differs)
2. API deploy topology (Cloudflare Workers + D1, wrangler config, GitHub Actions workflow)
3. Frontend deploy topology (Cloudflare Pages)
4. Generation deploy topology (RunPod Serverless, Docker image, scale-to-zero behavior, cold-start profile)
5. Secrets inventory (names + locations only — never values)
6. Monitoring and observability (where logs go, alerts, dashboards)
7. Cost model (which services cost money, rough per-month expectations)

  • [ ] Step 1: Read the listed sources

  • [ ] Step 2: Compose the page

700–1200 words. The secrets inventory should be a table: Secret name | Location (GH Actions / Wrangler / RunPod env) | Rotates? | Last rotated. Never list secret values or hint at them.

  • [ ] Step 3: Create the page

  • [ ] Step 4: Verify

getConfluencePage — confirm all seven H2s, secrets table present, no secret values accidentally included.


Task 9: Starter page — "License scope and dataset attributions"

Parent: {LEGAL_ID} (Legal & Licensing)
Title: License scope and dataset attributions

Repo sources to review before writing:

- pyproject.toml at repo root
- packages/*/pyproject.toml for every package (license field)
- Any LICENSE file at repo root
- Output from Task 4 — the dataset inventory already has attribution info

Historian test: A reader 100 years from now should know what license the PhonoLex codebase was released under, what rights were retained by Neumann's Workshop, and what obligations attached to the datasets used.

Required headings (H2):

1. Code license (Proprietary — all packages; scope and rights retained)
2. Trademark and domain (PhonoLex name, phonolex.com domain — ownership and registration date if known)
3. Dataset attributions (summary; full detail lives on the Dataset inventory page, link to it)
4. Third-party dependencies (high-level note on OSS dependencies and their license compatibility — point to pyproject.toml and package.json as the authoritative list)
5. Terms of service and privacy policy (sources of truth — where these documents live and who maintains them)

  • [ ] Step 1: Read the listed sources

  • [ ] Step 2: Compose the page

400–700 words. Link to the Dataset inventory page (from Task 4) rather than duplicating its content.

  • [ ] Step 3: Create the page

  • [ ] Step 4: Verify

As in prior tasks.


Task 10: Starter page — "Contributors and vendors"

Parent: {PEOPLE_ID} (People & Relationships)
Title: Contributors and vendors

Repo sources to review before writing:

- Git log: git log --format="%an <%ae>" | sort -u — full contributor list
- pyproject.toml (authors field)
- packages/*/pyproject.toml (authors fields)
- Recollection of external service accounts in use (Cloudflare, RunPod, HuggingFace, GitHub)

Historian test: A reader 100 years from now should know who worked on this and what external parties the project depended on for services, data, or tools.

Required headings (H2):

1. Contributors (current and past — name, role, period of involvement)
2. Vendors and service providers (Cloudflare, RunPod, HuggingFace, GitHub — what each is used for, account/org name)
3. Data providers (link to Dataset inventory)
4. External collaborations (none yet — placeholder subsection for future partnerships)

  • [ ] Step 1: Enumerate contributors from git history

Run:

git log --format="%an <%ae>" | sort -u

  • [ ] Step 2: Compose the page

Short and factual. Contributors table: Name | Role | Active period. Vendors table: Vendor | Purpose | Account / org. 200–400 words total.

  • [ ] Step 3: Create the page

  • [ ] Step 4: Verify

As in prior tasks.


Task 11: Update the space home page

Parent: None (this IS the home page — {PHONOLEX_HOME_ID})
Title: PhonoLex (unchanged)

Historian test: A reader arriving at the space for the first time should understand what PhonoLex is and how to navigate the KB in under 60 seconds.

Required content:

1. One-paragraph summary of what PhonoLex is
2. Live URLs table (Surface | URL):
   - Web app (phonolex.com)
   - Staging web (develop.phonolex.pages.dev)
   - Staging API (staging-api.phonolex.com)
   - GitHub repo
   - RunPod console (link to org console, not a specific endpoint)
3. Section directory: for each of the 10 sections, a link to its page with a one-line purpose:
   - Mission & Origins → Why PhonoLex exists
   - Architecture Overview → The system at a whiteboard level
   - Data & Datasets → Provenance, license, cadence for every dataset
   - Decisions (ADRs) → Significant technical and product choices
   - Operations → How the system runs
   - Runbooks → Step-by-step procedures
   - Roadmap & Milestones → Release history and planned work
   - Incidents & Lessons → Post-mortems
   - Legal & Licensing → License scope and attributions
   - People & Relationships → Contributors and vendors

  • [ ] Step 1: Compose the home page content

Use real page IDs from Task 1 for the section links.

  • [ ] Step 2: Update the page
mcp__plugin_atlassian_atlassian__updateConfluencePage(
    cloudId="7a1f4095-96d2-458f-9a92-11413f425d84",
    pageId="{PHONOLEX_HOME_ID}",
    title="PhonoLex",
    body=<composed content>
)

Note: updateConfluencePage may require a version number. If the tool returns a version-conflict error, fetch the page first to get the current version, then retry with version: current+1.
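The fetch-then-retry fallback can be sketched as follows; get_page, update_page, and the VersionConflict error are stand-ins for the MCP calls and for whatever error shape the tool actually returns, and the version field name is an assumption to verify:

```python
class VersionConflict(Exception):
    """Stand-in for the tool's version-conflict error."""

def update_with_version_retry(get_page, update_page, page_id: str,
                              title: str, body: str) -> dict:
    """Try a plain update; on a version conflict, fetch the current
    version and retry once with version = current + 1."""
    try:
        return update_page(pageId=page_id, title=title, body=body)
    except VersionConflict:
        current = get_page(pageId=page_id)["version"]
        return update_page(pageId=page_id, title=title, body=body,
                           version=current + 1)
```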

  • [ ] Step 3: Verify

getConfluencePage(pageId="{PHONOLEX_HOME_ID}") — confirm the page contains the URLs table and all 10 section links resolve to the correct page IDs.


Task 12: Final verification and handoff

  • [ ] Step 1: Fetch the full page tree

Run:

mcp__plugin_atlassian_atlassian__getConfluencePageDescendants(
    cloudId="7a1f4095-96d2-458f-9a92-11413f425d84",
    pageId="{PHONOLEX_HOME_ID}"
)

Expected tree (partial):

PhonoLex (home)
├── Mission & Origins
│   └── What PhonoLex is and why it exists
├── Architecture Overview
│   └── High-level architecture
├── Data & Datasets
│   └── Dataset inventory
├── Decisions (ADRs)
│   ├── ADR 001: Cloudflare Workers + D1 for the API
│   ├── ADR 002: T5Gemma 9B-2B as the generation model
│   └── ADR 003: Unified trie-based constrained generation architecture
├── Operations
│   └── Environments and deploy topology
├── Runbooks
├── Roadmap & Milestones
├── Incidents & Lessons
├── Legal & Licensing
│   └── License scope and dataset attributions
└── People & Relationships
    └── Contributors and vendors

  • [ ] Step 2: Historian-test spot check

For three randomly selected populated pages, re-read the page and ask: does this pass the 100-year-historian test? If any page reads as implementation-focused or ephemeral, flag it in the task report. Do not auto-fix — report for user review.

  • [ ] Step 3: Record a memory

Save a reference memory at /Users/jneumann/.claude/projects/-Users-jneumann-Repos-PhonoLex/memory/reference_confluence_kb.md with:

- Space key: PHONOLEX
- Space ID: {PHONOLEX_SPACE_ID}
- Home page ID: {PHONOLEX_HOME_ID}
- 10 section page IDs mapped by name
- Note: populated pages exist only under Mission, Architecture, Data, Decisions (×3), Ops, Legal, People

Update MEMORY.md to reference this new file under the References section.

  • [ ] Step 4: Open a PR

The spec and plan commits are already on docs/confluence-kb-design. Open a PR into develop when the KB is live and verified. PR body should link to the spec, the plan, and the Confluence home URL.


Self-review notes

  • Spec coverage: Every section of the spec maps to at least one task. The 10-section backbone = Task 1. The starter pack = Tasks 2–10. Space home = Task 11. Growth model & ADR template conventions are documented (not implemented — they're rules, not artifacts).
  • Types/signatures: All placeholder IDs ({PHONOLEX_SPACE_ID}, {PHONOLEX_HOME_ID}, {MISSION_ID} etc.) are defined in Task 0 and Task 1 and referenced consistently in later tasks.
  • Scope: Single coherent plan — all tasks serve the one goal of standing up the PhonoLex space per the design spec.
  • No placeholder content: Page content is deliberately composed at execution time from repo review, but every task specifies exactly which sources to read, what headings are required, and what the historian test looks like for that page. This is the minimum prescription for a page written freshly each time; pre-composing the prose here would duplicate the research effort the spec explicitly calls for.