# The Categorical Imperative: Say What? — Dr. Katya Steiner on Linguistics, Logic, and the Research Apocalypse
If you’ve ever watched an academic mailing list implode because someone asked whether Irish English counts as “the real English” (it does), you already know the genre: part grievance column, part help desk, and part frantic firewall maintenance. The observation in that cheeky “Say What?” field guide is right on the nose: linguistics is being tugged between trivia-seeking netizens, a fragile research backbone, and an increasingly seductive techno‑optimism. I’ll argue — with a grin and a bit of math — that the solution isn’t more hype or more hand‑waving but better categories, clearer inference, and an institutional backbone that doesn’t collapse when someone trips over a server.
## Why the Q&A mess is a classification problem (and not the fun kind)
Good questions are structured. Bad ones are noise. In machine-learning-speak: you want a high signal-to-noise ratio, clear labels, and features people can actually measure. Linguistics Q&A communities fail precisely when they become unlabeled training data for human trolls.
Think of a well‑formed question as a type signature in programming or a morphism in category theory: it tells you the input, the expected transformation, and the output type. “Is this sentence grammatical?” is ambiguous. “Given this sentence and your native dialect X, is the auxiliary ‘have’ optional in rapid speech?” is a typed, testable operation. Category theory’s obsession with structure-preserving maps is the perfect metaphor: ask for mappings that preserve the properties you care about.
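To make the type-signature metaphor concrete, here is a minimal Python sketch; the field names, the enum, and the "I seen it" example are my own illustration, not anyone's standard schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Judgment(Enum):
    """Output type: a closed set of answers, not free-form prose."""
    ACCEPTABLE = "acceptable"
    MARGINAL = "marginal"
    UNACCEPTABLE = "unacceptable"

@dataclass(frozen=True)
class GrammaticalityQuery:
    """Input type: everything a responder needs to answer testably."""
    sentence: str   # the datum under discussion
    dialect: str    # judgments are dialect-relative
    register: str   # e.g. rapid casual speech

# A well-formed question is a morphism: GrammaticalityQuery -> Judgment.
AnswerFn = Callable[[GrammaticalityQuery], Judgment]

q = GrammaticalityQuery(
    sentence="I seen it yesterday.",
    dialect="Irish English",
    register="rapid casual speech",
)

def irish_english_judge(query: GrammaticalityQuery) -> Judgment:
    """One (opinionated) morphism of that type; real logic would
    consult native speakers, not return a constant."""
    return Judgment.ACCEPTABLE

print(irish_english_judge(q).value)  # acceptable
```

The point isn't the Python; it's that an unanswerable question fails to type-check before anyone wastes a mailing-list thread on it.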
## Old debates through a new formal lens
Remember when syntax people argued like two warring logicians over infinitives? Those fights matter because they define the primitives we use in downstream work. Formal theory — be it Chomskyan grammar, constraint-based frameworks, or Boolean classifiers — defines the objects our models manipulate.
Here math helps: logical systems (predicate logic, modal logic, type theory) give us crisp semantics for claims. Proof theory forces us to make inference steps explicit (not “it feels right”). And categorical logic lets us reason about equivalences between different formalizations: two seemingly different models can be isomorphic in the right category, meaning they make the same empirical predictions even if their internal machinery looks different. That’s a beautiful way to settle a fight without needing to be nasty about it.
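For the record, here is the standard definition underwriting that claim, as a sketch; the "predictions functor" P and its target category Obs are my labels for illustration, not standard kit:

```latex
% Two formalizations M and N are isomorphic in a category C when
% structure-preserving maps run both ways and compose to identities:
f : M \to N, \qquad g : N \to M, \qquad
g \circ f = \mathrm{id}_M, \qquad f \circ g = \mathrm{id}_N.
% Functors preserve isomorphisms, so for any functor
% P : C \to \mathrm{Obs} sending a model to its empirical predictions,
% P(M) \cong P(N): the observations cannot tell the two theories apart.
```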
## The research infrastructure apocalypse: a network and complexity problem
When people warn that the field’s infrastructure is at risk, this is not melodrama — it’s network science. Labs, datasets, and repositories are nodes in a fragile graph. If highly connected hubs fail (because of budget cuts, security rules, or bureaucratic lockouts), the network fragments. Fragmentation raises the sample complexity of replication, since fewer shared datasets means each lab must collect more of its own, and it raises the variance in who gets to contribute at all.
Cryptography and distributed systems theory have tricks here: redundancy, provenance tracking, and federated data models. But these require resources. It’s like saying: sure, blockchains can track provenance — but you still need electricity and sensible governance. The math is plain: with k independent replicas that each fail with probability p, the chance of losing everything falls to p^k; politics and budgets determine whether k ever rises above one.
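A minimal sketch of the fragility claim, on an invented hub-and-spoke toy network (the node names and topology are mine): knock out the hubs and connectivity collapses; add one redundant mirror and the same double failure is survivable.

```python
from collections import defaultdict

def components(nodes, edges):
    """Count connected components with a simple depth-first search."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n] - seen)
    return count

# Toy research network: two hub repositories, each serving five labs.
labs = [f"lab{i}" for i in range(10)]
edges = [("hub0", "hub1")]
edges += [("hub0", lab) for lab in labs[:5]]
edges += [("hub1", lab) for lab in labs[5:]]
print(components(["hub0", "hub1"] + labs, edges))  # 1: everyone reachable

# Budget cuts kill both hubs: every lab becomes an isolate.
cut = [(u, v) for u, v in edges if "hub" not in u and "hub" not in v]
print(components(labs, cut))                       # 10: total fragmentation

# Redundancy: mirror everything on a third hub and the same double
# failure no longer fragments the network.
mirrored = cut + [("hub2", lab) for lab in labs]
print(components(labs + ["hub2"], mirrored))       # 1: connectivity restored
```

Swap in a real collaboration graph and the exercise stops being a toy; the arithmetic of hub failure does not change.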
## Speech, reading, and the algorithms that might (or might not) help
Dr. Tiffany Hogan’s speech‑to‑literacy pipeline is a wonderful example of how empirical regularities and mathematical models can combine. From a statistical perspective, early speech features are predictors; the question is how predictive, under what noise model, and with what cost of false positives.
This is where Bayesian reasoning earns its keep. A screen that flags preschoolers for intervention should balance prior prevalence, likelihood ratios for early markers, and the societal cost of action/inaction. Machine learning can help scale screening, but complexity theory and robust statistics warn us about worst‑case behavior, adversarial examples, and distribution shift. In plain language: your pretty app will fail in communities that weren’t in your training set unless you design for that.
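A back-of-the-envelope sketch of that balancing act, with invented numbers; the prevalence, sensitivity, specificity, and costs below are illustrative, not Dr. Hogan’s figures:

```python
def posterior_given_flag(prevalence: float, sensitivity: float,
                         specificity: float) -> float:
    """P(disorder | screen flags the child), by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def should_intervene(posterior: float, cost_missed: float,
                     cost_unneeded: float) -> bool:
    """Act when the expected cost of inaction exceeds that of action."""
    return posterior * cost_missed > (1 - posterior) * cost_unneeded

# Invented numbers: 8% prevalence, 85% sensitivity, 90% specificity.
post = posterior_given_flag(0.08, 0.85, 0.90)
print(round(post, 3))  # 0.425: most flags are still false alarms
print(should_intervene(post, cost_missed=10.0, cost_unneeded=1.0))  # True
```

Even a respectable screen leaves the posterior below a coin flip at low prevalence, which is exactly why the cost asymmetry, not the raw accuracy, should drive the decision.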
## L2 pronunciation: a game of constraints, exposure, and identity
Pronunciation isn’t mystical. It’s the solution to an optimization problem under constraints. The constraints: your L1 phonology (what’s cheap to produce), perceptual distance (what you can hear), and sociolinguistic payoff (what changes your social standing). The objective: communicate effectively with minimal effort and social cost.
From an information‑theory lens, accent features that reduce confusion in communicative contexts will be retained more easily. But the optimization we care about isn’t purely Shannon; it’s multi‑objective: fidelity, identity, and social signaling. So teaching that emphasizes high‑gain features (contrasts that reduce communicative ambiguity) and gives targeted feedback wins over endless exposure.
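Here is a toy way to measure which contrasts are high-gain, as a crude proxy for functional load; the mini-lexicon, its token frequencies, and the vowel merger are all invented:

```python
from collections import defaultdict

def ambiguity_after_merger(lexicon: dict[str, int],
                           merge: dict[str, str]) -> float:
    """Fraction of word tokens that become homophonous with another
    word once the contrast collapsed by `merge` is lost."""
    groups = defaultdict(list)
    for form, freq in lexicon.items():
        key = "".join(merge.get(seg, seg) for seg in form)
        groups[key].append(freq)
    total = sum(lexicon.values())
    clashes = sum(sum(fs) for fs in groups.values() if len(fs) > 1)
    return clashes / total

# Invented mini-lexicon; "I" stands in for the lax vowel of "ship",
# "i" for the tense vowel of "sheep".
lexicon = {"ʃip": 40, "ʃIp": 60, "bit": 30, "bIt": 25, "kat": 100}
print(round(ambiguity_after_merger(lexicon, {"I": "i"}), 2))  # 0.61
```

A contrast whose loss merges 61% of the tokens is worth drilling; one that collapses almost nothing can be left to exposure and identity.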
## Logic, math, and the ugly truth about models
Here’s the punchline: formal systems make failure modes visible. If you specify your assumptions (axioms), logic tells you what follows. Category theory tells you how different formalisms relate. Statistical learning theory gives you bounds on generalization. Complexity theory tells you when an algorithm is infeasible. Information theory tells you what data can, in principle, reveal.
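To make one of those claims concrete, here is the textbook generalization bound for a finite hypothesis class, the standard Hoeffding-plus-union-bound result, stated as a sketch:

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size n,
% every hypothesis h in a finite class H satisfies
R(h) \;\le\; \widehat{R}(h) + \sqrt{\frac{\ln\lvert H\rvert + \ln(1/\delta)}{2n}}
% where R(h) is true risk and \widehat{R}(h) is empirical risk:
% more data (larger n) or a smaller theory space (smaller |H|)
% buys a tighter guarantee.
```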
But there’s a human wrinkle: models are built by people with histories, incentives, and blind spots. That’s why we need both math and institutional hygiene. Math exposes the gaps; institutions decide whether to mind them.
## A jovial, slightly bitchy conclusion
So what do we do? Ask better questions (type your inputs), fund boring infrastructure (backups, provenance, federated access), and use math and logic not as intimidating ornaments but as practical tools: priors for screening, categories for mapping theories, and complexity for realistic promises about AI. Tech can help, but only if the people building it remember that language is messy and context matters — not a neat optimization problem with a single global minimum.
I’ll leave you with a question that’s more fun than it sounds: if linguistics is a category whose objects are datasets, morphisms are analytical transforms, and limits represent consensus, what does it mean for that category to be “complete” — and who gets to add the missing objects?