LLMs perform mathematically precise Bayesian inference. That capability, while remarkable, is also the ceiling: they cannot update weights after training and cannot move from correlation to causation. Closing that gap is what AGI requires.
1. The Origin: RAG Before RAG Was Named
In October 2020, Vishal Misra (professor and vice dean of computing and AI at Columbia University) used GPT-3 to translate natural language cricket queries into a custom domain-specific language GPT-3 had never seen, deploying it in production at ESPN in September 2021. The architecture was in-context few-shot learning: semantic search over 1,500 labeled examples, top matches prepended as context, new query appended. GPT-3 completed the DSL output with no access to weights or internals. Misra could not explain why it worked, which launched his modeling effort.
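That retrieval-plus-few-shot loop can be sketched in a few lines. The examples, the bag-of-words similarity, and the prompt layout below are illustrative stand-ins, not Misra's actual ESPN pipeline (which ran semantic search over 1,500 labeled examples):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; the real system used semantic embeddings."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical labeled pairs: (natural-language query, DSL translation).
EXAMPLES = [
    ("runs scored by Tendulkar in 1998", "STATS(player='Tendulkar', metric='runs', year=1998)"),
    ("wickets taken by Warne at the MCG", "STATS(player='Warne', metric='wickets', venue='MCG')"),
    ("highest score by Lara in tests", "STATS(player='Lara', metric='high_score', format='test')"),
]

def build_prompt(query, k=2):
    """Prepend the k most similar labeled examples, then append the new query."""
    q = embed(query)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    shots = "\n".join(f"Q: {nl}\nA: {dsl}" for nl, dsl in ranked[:k])
    return f"{shots}\nQ: {query}\nA:"  # the frozen LLM completes the DSL after "A:"

print(build_prompt("runs scored by Lara in 1999"))
```

No gradient step occurs anywhere; the model's completion is steered entirely by what lands in the context window.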
2. The Matrix Abstraction
Misra's first paper frames every LLM as an enormous sparse matrix. Each row is a prompt (a unique token sequence); each column corresponds to a token in the vocabulary (roughly 50,000 tokens for GPT-class models), so each row holds a probability distribution over the next token. Given "protein," the model returns near-zero probabilities on almost everything and non-trivial probability on "shake" and "synthesis." Choosing one collapses the distribution entirely: "protein shake" pulls toward gym content, "protein synthesis" toward biology. The full matrix has combinatorially more rows than there are electrons in the observable universe, so what LLMs actually learn is a compressed approximation of it. In-context learning, under this view, is real-time Bayesian updating: each new example shifts the posterior toward the correct output token.
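A toy model makes that updating concrete. Here two hypothetical latent "contexts" stand in for regions of the learned manifold, each with an invented next-word distribution; every observed word is one Bayes step on the posterior over contexts, which in turn shifts the predictive distribution:

```python
# Toy illustration of in-context learning as Bayesian updating (not Misra's
# actual model): two invented latent contexts with made-up word distributions.
CONTEXTS = {
    "gym":     {"protein": 0.3, "shake": 0.3, "workout": 0.3, "ribosome": 0.1},
    "biology": {"protein": 0.3, "synthesis": 0.3, "ribosome": 0.3, "shake": 0.1},
}

def update(posterior, word):
    """One Bayes step: P(context | word) is proportional to P(word | context) * P(context)."""
    unnorm = {c: posterior[c] * dist.get(word, 1e-9) for c, dist in CONTEXTS.items()}
    z = sum(unnorm.values())
    return {c: p / z for c, p in unnorm.items()}

def predictive(posterior, word):
    """Posterior-weighted probability that `word` comes next."""
    return sum(posterior[c] * CONTEXTS[c].get(word, 0.0) for c in CONTEXTS)

post = {"gym": 0.5, "biology": 0.5}   # uniform prior
post = update(post, "protein")        # both contexts allow it: posterior barely moves
post = update(post, "shake")          # "shake" collapses the posterior toward "gym"
print(post)
print(f"P(next='workout') = {predictive(post, 'workout'):.3f}")
```

"protein" is uninformative (both contexts assign it 0.3), so the posterior stays at 50/50; "shake" then pushes it to 75/25 in favor of the gym context, and the predictive probability of gym vocabulary rises with it.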
3. The Bayesian Wind Tunnel: Formal Proof
Empirical results drew the objection that "anything can be called Bayesian." To answer it, Misra and Columbia colleagues Naman Agarwal and Siddharth Dalal (now at DeepMind) built what they call a Bayesian wind tunnel: take a blank architecture and assign it tasks where memorization is combinatorially impossible given the parameter count, but where the correct Bayesian posterior can be computed analytically. Across architectures, the results split cleanly: transformers match the theoretically correct posterior to within 10^-3 bits, essentially perfect; Mamba performs well; LSTMs handle only a subset of tasks; MLPs fail entirely. The a16z-funded Token Probe tool (tokenprobe.cs.columbia.edu) made the probability distributions visible and was used to run these experiments. A follow-up paper identified the gradient dynamics that produce the geometry enabling Bayesian updating, and a third confirmed that the same structural signature persists in large open-weight frontier models, though noisier due to their diverse training data.
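The measurement itself is easy to sketch. Below, a Beta-Bernoulli coin task serves as the analytically solvable problem (my choice of example, not necessarily one of the paper's tasks), and KL divergence in bits plays the role of the wind tunnel's accuracy metric; the "model" distribution is a hard-coded stand-in for a network's output:

```python
import math

def kl_bits(p, q):
    """KL divergence D(p || q) in bits: how far a model's output is from the exact posterior."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def beta_bernoulli_posterior(observations, alpha=1.0, beta=1.0):
    """Analytic Bayesian predictive for a coin under a Beta(alpha, beta) prior."""
    heads = sum(observations)
    n = len(observations)
    p_heads = (alpha + heads) / (alpha + beta + n)
    return (p_heads, 1.0 - p_heads)

exact = beta_bernoulli_posterior([1, 1, 0, 1])   # predictive after 3 heads, 1 tail
model = (0.666, 0.334)                           # stand-in for a trained network's output
print(f"deviation: {kl_bits(exact, model):.6f} bits")
```

The task is solvable in closed form but not memorizable, so any model that lands within 10^-3 bits of the analytic answer must be computing the posterior rather than recalling it.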
4. What LLMs Cannot Do: The Two Missing Capabilities
Misra draws a sharp line between Bayesian inference (which transformers do precisely) and the two things required for AGI:
Plasticity / continual learning: Human synapses remain plastic for a lifetime, driven by the objective of survival and reproduction. LLM weights freeze after training. In-context Bayesian updating is ephemeral: the next conversation starts from zero. Continual learning is hard because updating weights risks catastrophic forgetting of prior knowledge.
Causation over correlation: All deep learning operates in what Misra calls the Shannon entropy regime: modeling statistical associations. Human cognition operates in the Kolmogorov complexity regime: finding the shortest program that describes a phenomenon, building causal models that allow simulation and intervention. When a pen is thrown, you duck not by computing Bayesian probabilities of impact, but by running a causal simulation. Judea Pearl's causal hierarchy (association, intervention, counterfactual) maps the gap. Deep learning covers only the first level.
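The gap between the two regimes can be made concrete. An alternating bit string looks maximally random to an order-0 Shannon model yet is generated by a one-line program; below, zlib's compressed size serves as a crude, computable upper-bound proxy for Kolmogorov complexity (the true quantity is uncomputable):

```python
import math
import random
import zlib
from collections import Counter

def shannon_bits_per_symbol(s):
    """Empirical Shannon entropy under an order-0 (iid symbol) model."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

alternating = "01" * 512                  # a one-line program generates this
random.seed(0)
noise = "".join(random.choice("01") for _ in range(1024))

for name, s in [("alternating", alternating), ("random", noise)]:
    print(name,
          f"order-0 entropy: {shannon_bits_per_symbol(s):.3f} bits/symbol,",
          f"compressed size (crude Kolmogorov proxy): {len(zlib.compress(s.encode()))} bytes")
```

Both strings score 1 bit per symbol under the order-0 Shannon model, but the alternating string compresses to a fraction of the noise string's size: the structure a correlation model misses is exactly what a short causal program captures.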
5. The Einstein Test and Data Gravity
Misra's proposed AGI test: train an LLM on pre-1911 physics and see whether it derives the theory of relativity. It cannot, for the same reason that Shannon entropy cannot produce Kolmogorov complexity. The pre-relativistic data overwhelmingly supported Newtonian mechanics. Anomalies (Mercury's orbit, the Michelson-Morley experiment, gravitational lensing) existed but were sparse. A correlation model would treat them as noise. Einstein rejected the existing axioms and produced a new, compact representation (a single tensor equation) from which all observed phenomena follow. LLMs are bound to the manifold encoded in their training data. They cannot generate a new manifold. Martin Casado frames it as "data gravity": the weight of the majority view always wins.
6. Knuth as a Case Study
Donald Knuth recently used LLMs to hunt for Hamiltonian cycles in odd graphs. The LLMs iterated through many attempts, updating their context window with learned steps (a hacked-together form of plasticity). They eventually stalled. Knuth synthesized what the LLMs had found and produced the proof himself. Misra's read: the LLMs performed the Shannon part efficiently, exhausting the correlation space. Knuth provided the Kolmogorov step, the new causal representation that unified the findings.
Key Takeaways
- Transformers are not approximating Bayesian inference loosely; controlled experiments show they match the mathematically correct posterior to 10^-3 bits, making this a proven architectural property, not a metaphor.
- LLMs learn the statistical manifold embedded in their training data and can navigate it with precision, but they cannot build a new manifold, which is exactly what every genuine scientific breakthrough requires.
- Two specific capabilities separate current LLMs from AGI: weights that can update after training without forgetting prior knowledge (continual learning), and the ability to build causal models that allow simulation and intervention rather than just pattern matching.
- Scaling compute and data does not address either gap; both require architectural innovation, not more of the same mechanism.
- Judea Pearl's causal hierarchy (association, intervention, counterfactual) and Kolmogorov complexity provide the existing theoretical scaffolding for the next research direction, though no practical algorithm for finding minimal causal programs yet exists.