Continuous Articulatory Feature Learning¶
Bayesian inference of continuous articulatory feature vectors from perceptual confusion and morphophonological alternation data.
Author: Neumann's Workshop, LLC
Date: 2026-03-13
Status: DRAFT
1. Goal¶
Produce a set of learned continuous feature vectors for all 40 General American English phonemes (24 consonants, 16 vowels including r-colored variants), plus composite vectors for 5 diphthongs. These vectors replace the current PHOIBLE-derived 76-dimensional feature vectors used throughout PhonoLex.
Single deliverable: learned vectors, uncertainty estimates, composite representations, and convergence diagnostics as versioned artifacts.
2. Motivation¶
The current PHOIBLE vectors have three problems:
- Licensing: PHOIBLE is CC-BY-SA 3.0, so its share-alike terms constrain any derived vectors. The learned vectors will be an original data product with full provenance transparency.
- Dead dimensions: PHOIBLE's 38 features include dimensions for click consonants, tones, and other phenomena absent from GenAm English. These are structurally zero across the entire inventory — dead dimensions that add noise to similarity computation.
- Lost generation: The 76d vectors exist only in the pickle (cognitive_graph_v1.1_empirical.pkl). The script that generated them from the PHOIBLE CSV is gone.
The Hayes (2009) 26-feature system is tuned for English: every feature does discriminatory work for the GenAm inventory.
3. Prior: Hayes Feature Matrix¶
3.1 Source¶
Hayes, Bruce. 2009. Introductory Phonology. Wiley-Blackwell. Chapter 4, Tables 4.7–4.10.
The initialization matrix is an original encoding by Neumann's Workshop LLC. The generating code (build_features_ipa.py) and output (phonolex_features_ipa.csv) are version-controlled. 40 segments × 26 features with page-level citations and structural validation.
3.2 Feature Set (26 features)¶
- Major class (3): syllabic, consonantal, sonorant
- Manner (6): continuant, delayed_release, approximant, tap, trill, nasal
- Laryngeal (3): voice, spread_gl, constr_gl
- Labial place (3): labial, round, labiodental
- Coronal place (5): coronal, anterior, distributed, strident, lateral
- Dorsal place (6): dorsal, high, low, front, back, tense
3.3 Beta Encoding¶
Discrete feature values map to Beta distribution parameters:
| Hayes value | Meaning | Beta params | Mean | Interpretation |
|---|---|---|---|---|
| + | Feature present | Beta(19, 1) | 0.95 | Tightly concentrated near 1.0 |
| − | Feature absent | Beta(1, 19) | 0.05 | Tightly concentrated near 0.0 |
| 0 | Structurally inapplicable | Beta(1, 1) | 0.50 | Uniform — data decides |
The concentration (the two Beta parameters summing to 20 for +/−; distinct from the salience weights α and β of Section 4) is a tunable hyperparameter controlling how much evidence is needed to move a feature from its initialization.
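The mapping above can be sketched as a small helper; the `CONCENTRATION` default of 20 is the tunable hyperparameter from this section, and the function name is illustrative, not the project's actual API:

```python
# Sketch of the Hayes-value -> Beta-parameter mapping from Section 3.3.
CONCENTRATION = 20  # tunable; reduce (e.g. to 10) if NUTS diverges (Section 6.6)

def beta_params(hayes_value: str, concentration: float = CONCENTRATION) -> tuple[float, float]:
    """Map a discrete Hayes feature value to Beta(alpha, beta) prior parameters."""
    if hayes_value == "+":   # feature present: mass concentrated near 1.0
        return (concentration * 0.95, concentration * 0.05)
    if hayes_value == "-":   # feature absent: mass concentrated near 0.0
        return (concentration * 0.05, concentration * 0.95)
    if hayes_value == "0":   # structurally inapplicable: flat prior, data decides
        return (1.0, 1.0)
    raise ValueError(f"unknown Hayes value: {hayes_value!r}")

# Prior means alpha / (alpha + beta) recover the table above: 0.95, 0.05, 0.50
for v in ("+", "-", "0"):
    a, b = beta_params(v)
    print(v, a, b, a / (a + b))
```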
3.4 Structural Inapplicability¶
Features marked 0 (e.g., [anterior] for labials, [high] for non-dorsals) are initialized with a flat prior. The data determines their posterior value. If a feature stays near 0.5, it is uninformative for that segment. If it drifts, the model has discovered unexpected structure. This is Option A (learned N/A) from the original plan — simpler, no additional parameters, and the posterior reveals whether structural inapplicability holds empirically.
4. Segment Representation: Composite Vectors¶
4.1 The Problem¶
Diphthongs are single phonemic units with continuous articulatory motion. They are not two phonemes in sequence, and they are not discretized step encodings. The representation must handle monophthongs and diphthongs in a unified framework with a single comparison metric.
4.2 The Representation¶
Every segment is represented as a composite vector:
c[s] = α · v_onset + β · v_offset
where v_onset and v_offset are learned 26-dimensional monophthong vectors, and α, β are learned scalar weights.
- Monophthongs: onset = offset. A monophthong /a/ decomposes to α·v_a + β·v_a = (α + β)·v_a. Direction preserved, magnitude consistent with the composite framework.
- Diphthongs: onset ≠ offset. /aɪ/ = α·v_a + β·v_ɪ. The composite vector lands between its components in feature space, with magnitude encoding articulatory spread.
4.3 Magnitude as Articulatory Spread¶
The magnitude of the composite vector carries real phonetic information:
- Monophthong: magnitude ≈ α + β (two identical unit vectors, fully aligned)
- Narrow diphthong (/eɪ/): magnitude slightly below α + β (components close, small angle between them)
- Wide diphthong (/aɪ/): magnitude smaller still (components far apart, large angle between them)
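The magnitude ordering above follows from the law of cosines and can be checked numerically. The vectors and salience weights below are toy stand-ins, not learned values:

```python
import numpy as np

# Toy demonstration of Section 4.3: composite magnitude shrinks as the
# onset/offset angle widens. alpha, beta are hypothetical salience weights.
alpha, beta = 0.6, 0.5

def composite(v_onset, v_offset):
    return alpha * v_onset + beta * v_offset

v_a = np.array([1.0, 0.0])                  # stand-in unit vector
v_e = np.array([0.9, np.sqrt(1 - 0.81)])    # close to v_a (narrow diphthong)
v_i = np.array([0.0, 1.0])                  # orthogonal to v_a (wide diphthong)

mono   = np.linalg.norm(composite(v_a, v_a))   # exactly alpha + beta
narrow = np.linalg.norm(composite(v_a, v_e))
wide   = np.linalg.norm(composite(v_a, v_i))

assert mono > narrow > wide   # the ordering claimed in Section 4.3
```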
4.4 Learned Onset/Offset Salience¶
α and β are learned from the data. The ECCC confusion data provides evidence about which component matters more perceptually. The priors are symmetric:
α ~ HalfNormal(1)
β ~ HalfNormal(1)
No theoretical bias toward onset or offset. Data decides.
4.5 Unified Comparison¶
Standard vector operations (dot product, Euclidean distance, cosine similarity) work on composite vectors directly. No branching logic, no special cases for mono↔mono vs diph↔diph vs mono↔diph. The decomposition structure ensures that a monophthong near a diphthong's onset scores high naturally.
4.6 English Diphthong Inventory¶
Five diphthongs, mapped to their onset and offset monophthongs:
| Diphthong | Onset | Offset |
|---|---|---|
| /eɪ/ | /e/ | /ɪ/ |
| /oʊ/ | /o/ | /ʊ/ |
| /aɪ/ | /a/ | /ɪ/ |
| /aʊ/ | /a/ | /ʊ/ |
| /ɔɪ/ | /ɔ/ | /ɪ/ |
5. Evidence Sources¶
5.1 Phase 1: ECCC Perceptual Confusion Data¶
Source: English Consistent Confusion Corpus v1.2 (Sheffield). CC-BY 4.0.
Data: 3,000+ word-level confusion pairs elicited under noise. Each entry includes target word, confused word, both in ARPAbet and IPA, listener counts, consistency scores.
Dialect mapping: ECCC uses British English transcription. The extraction pipeline must map BrE phonemes to GenAm equivalents before computing confusion probabilities. Key mappings include BrE /ɒ/ → GenAm /ɑ/, BrE /ɜ/ → GenAm /ɝ/ (non-rhotic NURSE → rhotic), BrE centering diphthongs (/ɪə/, /ɛə/, /ʊə/) decomposed or mapped to GenAm rhotic equivalents, and non-rhotic transcriptions adjusted for GenAm rhoticity. Segments with no clear GenAm equivalent are excluded. Segments that receive no ECCC evidence after mapping (e.g., /ɝ/, /ɚ/) rely entirely on the Hayes prior — this is acceptable because these are already well-characterized articulatorily, and Phase 2 alternation data may provide additional evidence.
Extraction: Word-level confusions are decomposed to phoneme-level evidence via edit-distance alignment of target and confusion IPA transcriptions. For each aligned word pair:
- Compute the minimum-edit alignment between target and confusion phoneme sequences.
- Each substitution pair (s₁ → s₂) at an aligned position contributes one observation of phoneme confusion.
- Identical aligned pairs contribute evidence that the phonemes are not confused (negative evidence).
- Multi-site confusions (words differing at >1 position) are included but down-weighted by 1/n_differences, reflecting decreased attribution confidence. Insertions and deletions are excluded — only substitutions provide pairwise phoneme evidence.
- Aggregate across all word pairs: for each phoneme pair, the confusion probability is the weighted count of confusion observations divided by the total weighted observation count.
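The decomposition steps above can be sketched as follows. This is a plain Levenshtein DP with backtrace, keeping substitutions only and down-weighting multi-site confusions by 1/n_differences; the negative-evidence bookkeeping for identical aligned pairs is omitted here, and the real pipeline may align differently:

```python
from collections import defaultdict

def align(a, b):
    """Minimum-edit alignment of two phoneme sequences.
    Returns aligned (x, y) pairs; None marks an insertion/deletion gap."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): d[i][0] = i
    for j in range(m + 1): d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (a[i-1] != b[j-1]):
            pairs.append((a[i-1], b[j-1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            pairs.append((a[i-1], None)); i -= 1       # deletion: excluded later
        else:
            pairs.append((None, b[j-1])); j -= 1       # insertion: excluded later
    return pairs[::-1]

def confusion_evidence(word_pairs):
    """Aggregate weighted substitution counts over (target, confusion) pairs."""
    counts = defaultdict(float)
    for target, confused in word_pairs:
        subs = [(x, y) for x, y in align(target, confused)
                if x is not None and y is not None and x != y]
        if not subs:
            continue
        w = 1.0 / len(subs)          # down-weight multi-site confusions
        for s1, s2 in subs:
            counts[(s1, s2)] += w
    return dict(counts)

# "bat" heard as "pat": one /b/ -> /p/ substitution with weight 1.0
print(confusion_evidence([(["b", "æ", "t"], ["p", "æ", "t"])]))
```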
Signal: Direct perceptual distance — which phonemes listeners confuse under degraded conditions. Independent from production-side data.
5.2 Phase 2: MorphoLex + CMU Alternation Pairs¶
Sources: MorphoLex-en (Sanchez-Gutierrez et al., 2018) for morphological segmentation of ~70K words. CMU Pronouncing Dictionary for pronunciations.
Extraction: Identify words sharing a root morpheme via MorphoLex segmentation. Align their CMU pronunciations at morpheme boundaries. Extract pairs where the same morpheme surfaces with different phonological forms. Example: if MorphoLex segments both "electric" and "electricity" with root "electr-", and CMU shows /k/ in one and /s/ in the other, that's an alternation pair.
Frequency weighting: Productive, regular alternations weighted higher than fossilized or rare ones. MorphoLex provides morpheme frequency data.
Signal: Functional phonological distance — segments that alternate occupy nearby regions in feature space.
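The extraction logic can be sketched on the "electric"/"electricity" example from above. The `lexicon` dict is an illustrative stand-in for MorphoLex segmentation plus CMU lookup, and the prefix `zip` is a crude proxy for the morpheme-boundary alignment the real pipeline would perform:

```python
from itertools import combinations

# Hypothetical merged MorphoLex + CMU records (illustrative, not real data).
lexicon = {
    "electric":    {"root": "electr", "pron": ["ɪ", "l", "ɛ", "k", "t", "ɹ", "ɪ", "k"]},
    "electricity": {"root": "electr", "pron": ["ɪ", "l", "ɛ", "k", "t", "ɹ", "ɪ", "s", "ɪ", "t", "i"]},
}

def alternation_pairs(lexicon):
    """Yield (s1, s2) segment pairs where a shared root surfaces differently."""
    pairs = []
    for (w1, e1), (w2, e2) in combinations(lexicon.items(), 2):
        if e1["root"] != e2["root"]:
            continue                       # only same-root word pairs qualify
        for s1, s2 in zip(e1["pron"], e2["pron"]):
            if s1 != s2:
                pairs.append((s1, s2))     # the morpheme surfaces differently here
    return pairs

print(alternation_pairs(lexicon))   # the /k/ ~ /s/ alternation surfaces here
```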
Staging rationale: This pipeline is novel and unvalidated. By building Phase 1 first (ECCC only), we establish a clean baseline. When Phase 2 is added, its impact is measurable in isolation. If the alternation pipeline is noisy, we see it immediately by diffing against the Phase 1 posterior.
6. Model Specification¶
6.1 Parameters¶
- φ: feature matrix, 40 segments × 26 features, each entry φ[s, f] ∈ [0, 1]. Total: 1,040.
- α, β: onset/offset salience weights. Total: 2.
- Grand total: 1,042 parameters. Well within NUTS comfort zone.
6.2 Prior¶
φ[s, f] ~ Beta(α_sf, β_sf) # from Hayes matrix per Section 3.3
α ~ HalfNormal(1)
β ~ HalfNormal(1)
6.3 Likelihood: Perceptual Confusion (Phase 1)¶
For each phoneme pair (s₁, s₂) with observed confusion probability p_conf:
log(p_conf / (1 - p_conf)) ~ Normal(μ = a - b · d(c[s₁], c[s₂]), σ)
where:
- c[s] is the composite vector for segment s
- d is Euclidean distance on composite vectors
- a (intercept) and b (slope) are fixed hyperparameters controlling the logistic mapping from distance to confusion probability
- σ is observation noise
The logit of p_conf is modeled as a linear function of feature distance: small distance → high logit → high confusion probability. The parameters a, b, and σ are all fixed hyperparameters set in the config, not learned, to avoid non-identifiability with α and β (see Section 6.7). σ controls how tightly the model fits individual confusion observations — larger σ allows more deviation from the distance-based prediction. Initial values determined by grid search over a held-out fold of ECCC data.
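The likelihood for a single observation can be written out numerically. The values of a, b, and σ below are placeholders, not the grid-searched defaults:

```python
import numpy as np

a, b, sigma = 2.0, 1.5, 0.5   # fixed link/noise hyperparameters (placeholders)

def logit(p):
    return np.log(p / (1.0 - p))

def log_lik(p_conf, c1, c2):
    """Normal log-density of logit(p_conf) around the distance-based mean (Section 6.3)."""
    d = np.linalg.norm(c1 - c2)   # Euclidean distance on composite vectors
    mu = a - b * d                # linear link: small distance -> high logit
    z = (logit(p_conf) - mu) / sigma
    return -0.5 * z**2 - np.log(sigma * np.sqrt(2.0 * np.pi))

c_near = (np.zeros(26), np.full(26, 0.05))   # similar segments
c_far  = (np.zeros(26), np.full(26, 0.5))    # dissimilar segments

# The same high observed confusion probability is more likely for the near pair:
assert log_lik(0.6, *c_near) > log_lik(0.6, *c_far)
```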
6.4 Likelihood: Morphophonological Alternation (Phase 2)¶
For each alternation pair (s₁, s₂) with frequency weight w:
P(alternation | φ, α, β) ∝ exp(−w · d(c[s₁], c[s₂]))
Exponentiated negative distance: higher likelihood when alternating segments are close together. Distance is computed on composite vectors c[s], which depend on α and β.
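Up to the normalizing constant hidden by the proportionality, each pair's log-likelihood contribution is just −w · d(c[s₁], c[s₂]), as this sketch shows (vectors are made-up stand-ins):

```python
import numpy as np

def alternation_log_lik(w, c1, c2):
    """Phase 2 contribution (Section 6.4): frequency-weighted negative distance."""
    return -w * np.linalg.norm(c1 - c2)

c_k = np.full(26, 0.4)    # stand-in composites, not learned values
c_s = np.full(26, 0.45)
c_m = np.full(26, 0.9)

# Alternating segments that sit close together score higher:
assert alternation_log_lik(2.0, c_k, c_s) > alternation_log_lik(2.0, c_k, c_m)
```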
6.5 Posterior¶
P(φ, α, β | data) ∝ P(φ) · P(α) · P(β) · P(confusions | φ, α, β) [· P(alternations | φ, α, β)]
Phase 2 term added when the alternation pipeline is validated independently.
6.6 Inference¶
NUTS sampler via PyMC. The Beta(19,1) and Beta(1,19) priors concentrate mass near the boundaries of [0,1], which can cause NUTS divergences. The feature parameters are sampled in logit space (reparameterized) to avoid boundary issues. Convergence diagnostics: R-hat < 1.01, effective sample size (ESS) sufficient, trace plots inspected. If divergences persist after reparameterization, reduce concentration (e.g., α + β = 10 instead of 20). Full ArviZ InferenceData saved for post-hoc analysis.
6.7 Identifiability Note¶
The global α and β weights scale all composite vectors uniformly. Since Euclidean distance is sensitive to scale, and the logistic link function (Section 6.3) also has scale parameters (a, b), learning all four simultaneously would create non-identifiability. Resolution: a and b are fixed hyperparameters (set by grid search or domain knowledge on a reasonable distance-to-confusion mapping), while α and β are learned. This separates the concerns: α/β control relative onset-offset salience, a/b control the distance-to-probability mapping.
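The non-identifiability is easy to verify numerically: scaling (α, β) by any k while dividing b by k leaves every predicted logit unchanged, so the data cannot distinguish the two settings. A minimal check, with random stand-in onset/offset vectors:

```python
import numpy as np

a, b = 2.0, 1.5               # link hyperparameters (placeholders)
alpha, beta = 0.7, 0.4        # salience weights (placeholders)
v_on, v_off = np.random.default_rng(0).random((2, 2, 26))  # two segments

def predicted_logit(a, b, alpha, beta):
    c = alpha * v_on + beta * v_off        # composite vectors, shape (2, 26)
    d = np.linalg.norm(c[0] - c[1])        # distance scales linearly with alpha, beta
    return a - b * d

k = 3.0
lhs = predicted_logit(a, b, alpha, beta)
rhs = predicted_logit(a, b / k, k * alpha, k * beta)
assert np.isclose(lhs, rhs)   # identical predictions: b trades off against scale
```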
7. Validation¶
7.1 Internal Coherence¶
Sanity checks on the learned vectors:
- Voicing pairs (p/b, t/d, k/ɡ, f/v, s/z, ʃ/ʒ) are nearest neighbors
- Natural classes cluster: stops, fricatives, nasals, vowels form distinct regions
- Vowels distribute by height × backness × rounding
- Diphthong composites land between their component monophthongs
- Structural N/A features (0-initialized) either stay near 0.5 or drift with justification
7.2 Regression Against PHOIBLE¶
Compare against the vectors being replaced:
- Pairwise similarity rankings over all segments using both vector sets
- Spearman rank correlation between PHOIBLE-based and learned similarity orderings
- Flag pairs where rank changes drastically — investigate whether the change is an improvement or regression
- Run full similarity search on a sample of the 44K word inventory, compare top-N results
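The rank-correlation check can be sketched as below. The similarity values are invented for illustration, and the simple rank computation assumes no ties:

```python
import numpy as np

def ranks(x):
    """Ranks of distinct values (no tie handling)."""
    return np.argsort(np.argsort(x))

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation of the ranks."""
    rx, ry = ranks(x).astype(float), ranks(y).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical pairwise similarities for the same five phoneme pairs:
phoible_sim = np.array([0.9, 0.7, 0.4, 0.2, 0.1])
learned_sim = np.array([0.8, 0.75, 0.5, 0.15, 0.1])

rho = spearman(phoible_sim, learned_sim)   # 1.0 here: same ordering, different values

# Flag pairs whose rank shifts drastically for manual review:
shift = np.abs(ranks(phoible_sim) - ranks(learned_sim))
flagged = shift > 1
print(rho, flagged.any())
```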
7.3 Held-Out Prediction¶
- k-fold cross-validation on ECCC confusion data: do learned vectors predict held-out confusions better than raw Hayes discrete vectors?
- Same holdout strategy for alternation data (Phase 2)
- Report log-likelihood improvement over the prior (quantify how much the data moved the vectors)
7.4 Clinical Face Validity¶
For common phonological processes in SLP:
| Process | Target | Error | Expected |
|---|---|---|---|
| Stopping | /s/ | /t/ | /t/ near /s/ |
| Stopping | /f/ | /p/ | /p/ near /f/ |
| Fronting | /k/ | /t/ | /t/ near /k/ |
| Fronting | /ɡ/ | /d/ | /d/ near /ɡ/ |
| Gliding | /l/ | /w/ | /w/ near /l/ |
| Gliding | /ɹ/ | /w/ | /w/ near /ɹ/ |
The error phoneme should be among the top-k nearest neighbors of the target phoneme in the learned space.
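The top-k check itself is a nearest-neighbor query over the learned space. In this sketch the vectors are random stand-ins with /t/ planted near /s/ so the stopping row passes; the real check runs over the learned composites:

```python
import numpy as np

def top_k_neighbors(target, vectors, k=3):
    """Phonemes closest to the target by Euclidean distance in feature space."""
    others = [p for p in vectors if p != target]
    others.sort(key=lambda p: np.linalg.norm(vectors[p] - vectors[target]))
    return others[:k]

rng = np.random.default_rng(0)
vectors = {p: rng.random(26) for p in ["s", "t", "f", "p", "k", "w", "l"]}
vectors["t"] = vectors["s"] + 0.01   # plant /t/ near /s/ for the demo

processes = [("s", "t")]             # stopping: target /s/, error /t/
for target, error in processes:
    assert error in top_k_neighbors(target, vectors, k=3)
```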
8. Reproducibility¶
8.1 Determinism¶
Fixed random seed for the NUTS sampler, documented in the config. Same data + same seed (on the same platform and library versions) = identical posteriors.
8.2 Version-Pinned Inputs¶
| Input | Version | License |
|---|---|---|
| phonolex_features_ipa.csv | v1.0 | Original work (Neumann's Workshop) |
| confusionCorpus_v1.2.csv | ECCC v1.2 | CC-BY 4.0 |
| MorphoLex-en | Sanchez-Gutierrez et al. 2018 | Academic open |
| CMU Pronouncing Dict | 0.7b | BSD |
All committed and hash-checked.
8.3 Hyperparameter Config¶
Single TOML file versioned with the code:
- Beta concentration (default 20 for +/−)
- α, β prior parameters (HalfNormal scale)
- NUTS tuning: target_accept, draws, tune, chains
- Likelihood parameters: logistic link intercept (a), slope (b), observation noise (σ), distance function
- Frequency weighting scheme (Phase 2)
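A hypothetical shape for that file, with illustrative key names (not the project's actual schema) and placeholder values:

```toml
# default.toml sketch; key names and values are illustrative assumptions.
[prior]
beta_concentration = 20          # sum of Beta params for +/- Hayes values
salience_halfnormal_scale = 1.0  # scale of the HalfNormal priors on alpha, beta

[nuts]
seed = 12345
target_accept = 0.95
draws = 2000
tune = 1000
chains = 4

[likelihood]
link_intercept_a = 2.0           # placeholder; set by grid search (Section 6.3)
link_slope_b = 1.5
obs_noise_sigma = 0.5
distance = "euclidean"

[phase2]
frequency_weighting = "log"      # placeholder scheme name
```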
8.4 Output Artifacts¶
| Artifact | Format | Description |
|---|---|---|
| vectors.csv | CSV | Posterior means, 40 × 26 |
| uncertainty.csv | CSV | Posterior SDs, 40 × 26 |
| composites.csv | CSV | Composite vectors for all 45 segments (40 mono + 5 diph) |
| alpha_beta.json | JSON | Learned α, β with credible intervals |
| diagnostics/ | PNG | Trace plots, pair plots, forest plots |
| inference_data.nc | NetCDF | Full ArviZ InferenceData |
8.5 Provenance Chain¶
Every output value traceable to:
- Its Hayes initialization (which feature, which segment, which Beta params)
- The evidence that moved it (which confusion pairs, which alternations)
- The posterior uncertainty around it (credible interval)
9. Project Structure¶
packages/features/
├── src/phonolex_features/
│ ├── prior.py # Hayes matrix → Beta parameters
│ ├── evidence/
│ │ ├── eccc.py # ECCC → phoneme-level confusion probabilities
│ │ └── alternations.py # MorphoLex + CMU → alternation pairs (Phase 2)
│ ├── model.py # PyMC model specification
│ ├── composite.py # Onset/offset decomposition, composite vector logic
│ ├── validate.py # All four validation stages
│ └── config.py # Hyperparameter config loader
├── configs/
│ └── default.toml # Hyperparameters, seeds, NUTS settings
├── outputs/ # Gitignored — generated artifacts
│ ├── vectors.csv
│ ├── uncertainty.csv
│ ├── composites.csv
│ ├── alpha_beta.json
│ ├── diagnostics/
│ └── inference_data.nc
├── tests/
├── pyproject.toml
└── README.md
Dependencies: PyMC, ArviZ, numpy, pandas, python-Levenshtein (for edit-distance alignment in ECCC extraction).
Data references: data/norms/eccc/, data/norms/morpholex/, data/cmu/. The Hayes matrix source (build_features_ipa.py) and output (phonolex_features_ipa.csv) are placed in packages/features/data/ as package-owned assets.
10. Licensing and Provenance¶
The initialization matrix is an original encoding sourced from empirical articulatory phonetics as described in Hayes (2009). The SPE-derived feature system constitutes scientific fact about articulatory phonetics, not copyrightable expression.
The learned continuous values are derived from empirical data (ECCC, MorphoLex, CMU) through a documented statistical procedure. The result is an original data product of Neumann's Workshop LLC with full provenance transparency.
| Component | License | Notes |
|---|---|---|
| Hayes matrix | Original work | Scientific fact, page-level citations |
| ECCC | CC-BY 4.0 | Attribution required |
| MorphoLex | Academic open | Citation required |
| CMU Dict | BSD | Open |
| Learned vectors | Original work | Derived from above via documented procedure |
11. References¶
- Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX Lexical Database (Release 2). Linguistic Data Consortium.
- Chomsky, N. & Halle, M. (1968). The Sound Pattern of English. Harper & Row.
- Cutler, A., Weber, A., Smits, R., & Cooper, N. (2004). Patterns of English phoneme confusions by native and non-native listeners. JASA, 116(6), 3668–3678.
- Hayes, B. (2009). Introductory Phonology. Wiley-Blackwell.
- Miller, G. A. & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. JASA, 27(2), 338–352.
- Moran, S. & McCloy, D. (eds.) (2019). PHOIBLE 2.0. Max Planck Institute for the Science of Human History.
- Sanchez-Gutierrez, C. H., Mailhot, H., Deacon, S. H., & Wilson, M. A. (2018). MorphoLex: A derivational morphological database for 70,000 English words. Behavior Research Methods, 50(4), 1568–1580.