What Transformers Could Learn from the Human Brain
Why prediction alone isn't enough, and how neuroscience could shape the next generation of AI.
Hi there - and welcome to Rethinking Intelligence.
If you are new, this is a space where neuroscience meets machine learning. We explore how brain-inspired computation might unlock the next leap in AI - from predictive coding to planning agents and beyond.
This post kicks off our flagship series.
The Illusion of Intelligence
Large Language Models (LLMs) such as GPT-4, Claude, and Gemini have captured global fascination. They write code, pass exams, and hold conversations that feel intelligent. Yet beneath the surface, they are statistical engines trained to predict the next word in a sequence.
This raises the question:
Are these systems truly intelligent - or simply convincing mimics?
Biological brains have been solving prediction problems far longer - and with far greater efficiency - than any neural network. What might artificial intelligence learn from nature?
This essay explores how predictive coding, a central theory in neuroscience, compares with transformer-based AI models. More importantly, it asks: What cognitive ingredients are still missing from our most advanced AI systems?
1. The Brain as a Prediction Machine
Predictive coding, formalised by Rao & Ballard (1999), proposes that the brain constantly generates internal models of the world, forecasting sensory input before it arrives. When predictions match reality, the brain stays quiet. When they don’t, it updates its model based on the difference — the prediction error.
Friston’s Free Energy Principle (2010) builds on this: intelligence is about minimising surprise (or technically, variational free energy) over time. The brain is not just reactive: it is proactive, efficient, and deeply hierarchical.
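To make the prediction-error loop concrete, here is a minimal toy sketch in Python (NumPy). It is not any published model - the sizes, learning rates, and input are chosen purely for illustration: a small linear generative model predicts its input, beliefs are nudged quickly to reduce the prediction error, and the model's weights adapt more slowly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy predictive coding loop (illustrative only): a linear generative model
# predicts its input and updates in proportion to the prediction error.
n_latent, n_obs = 4, 16
W = rng.normal(scale=0.1, size=(n_obs, n_latent))   # generative (top-down) weights
mu = np.zeros(n_latent)                              # current belief (latent state)
lr_mu, lr_W = 0.1, 0.01

for step in range(200):
    # A noisy, repeating sensory input the model should come to expect
    x = np.sin(np.linspace(0, 2 * np.pi, n_obs)) + 0.05 * rng.normal(size=n_obs)
    for _ in range(20):                    # fast inference: settle the belief
        pred = W @ mu                      # top-down prediction of the input
        err = x - pred                     # prediction error (the "surprise")
        mu = mu + lr_mu * (W.T @ err)      # adjust beliefs to reduce the error
    W = W + lr_W * np.outer(err, mu)       # slow learning: adjust the model itself

print("final mean squared prediction error:", float(np.mean(err ** 2)))
```

The two timescales - fast belief updates, slow weight updates - mirror the division predictive coding draws between moment-to-moment inference and longer-term learning.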
Recent work has shown that predictive coding networks can be scaled to depths exceeding 100 layers, narrowing the gap between biologically inspired models and deep learning architectures (Innocenti et al., 2025).
2. How Transformers Predict
Transformers, by contrast, are feedforward architectures; as used in LLMs, they are trained to predict the next token in a sequence. Introduced in Attention Is All You Need (Vaswani et al., 2017), they replace recurrence with self-attention, which lets the model weigh all previous tokens in its context when predicting the next one.
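To ground the mechanism, here is a toy, single-head version of causal self-attention in NumPy. It sketches only the core operation - not the multi-head, multi-layer architecture of Vaswani et al. - and all dimensions and weights are arbitrary placeholders.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with a causal mask (toy sketch)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                  # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])           # pairwise relevance scores
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                            # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the allowed positions
    return weights @ v                                # weighted mix of earlier tokens

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))               # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
print(causal_self_attention(x, Wq, Wk, Wv).shape)     # (5, 8): one context-aware vector per token
```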
LLMs scale this architecture massively and model complex linguistic patterns. Yet they fundamentally lack several key ingredients of biological cognition, such as:
Intrinsic goals
Grounding in real-world causality
Sensory input
While more modern LLMs can process sensory-like data (images, audio) as tokens, this is different from experiencing the world through continuous, integrated sensory modalities as biological systems do.
As a consequence, they excel at mimicry, but struggle with:
Common-sense reasoning
Long-term planning
Goal-directed behaviour
Adaptive memory
So, while LLMs can predict, do they truly understand? Some researchers have noted that attention mechanisms in deep learning models do not map cleanly to human attention — conceptually or functionally (Lindsay, 2020).
3. What the Brain Has That Transformers Don’t
Here’s how some core cognitive traits compare between the brain and current Transformer-based models:
Prediction
Brain: Hierarchical, context-aware
Transformer: Autoregressive, bounded by a fixed context window
Feedback Loops
Brain: Extensive reentrant processing
Transformer: Mostly absent
Memory
Brain: Working, episodic, long-term
Transformer: Limited context window, patched with retrieval mechanisms
Energy Efficiency
Brain: ~20W power usage
Transformer: Energy-intensive, requiring large-scale GPU computation
Sensory Grounding
Brain: Embodied in multimodal sensorimotor experience
Transformer: Often symbolic or text-based; some recent models incorporate images and audio, but lack true embodied grounding
While many Transformer models remain symbolic and primarily text-based, others - such as DeepMind’s Gato (Reed et al., 2022) - incorporate multimodal input and interaction with simulated environments to address the gap in embodied grounding.
Goals
Brain: Intrinsic and adaptive
Transformer: Externally prompted and task-constrained
Recent models like RWKV (Peng et al., 2023) and Hyena (Poli et al., 2023) reintroduce architectural features such as recurrence and prioritise efficient processing to overcome Transformer limitations. Mamba (Gu & Dao, 2023) pushes further with selective state space models, which combine recurrence, content-based selection over the input, and markedly better computational efficiency.
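As a rough illustration of what recurrence with a fixed-size state buys you, here is a toy state-space-style scan in NumPy. It is only a schematic nod to the idea behind selective state space models - Mamba's actual parameterisation, discretisation, and hardware-aware scan are not reproduced, and the gating here is a stand-in invented for the sketch.

```python
import numpy as np

def toy_selective_scan(x, a, B, C, W_gate):
    """Toy recurrent state-space scan: constant memory per step, input-dependent gating."""
    h = np.zeros(a.shape[0])                          # hidden state carried across time
    ys = []
    for t in range(x.shape[0]):
        gate = 1 / (1 + np.exp(-x[t] @ W_gate))       # input-dependent "selection" signal
        h = a * gate * h + B @ x[t]                   # recurrent update, gated by the input
        ys.append(C @ h)                              # read out from the compressed state
    return np.stack(ys)

rng = np.random.default_rng(0)
seq_len, d_in, d_state = 10, 4, 8
x = rng.normal(size=(seq_len, d_in))
a = rng.uniform(0.5, 0.99, size=d_state)              # decay rates: how long memories persist
B = rng.normal(scale=0.1, size=(d_state, d_in))
C = rng.normal(scale=0.1, size=(d_in, d_state))
W_gate = rng.normal(scale=0.1, size=(d_in, d_state))
print(toy_selective_scan(x, a, B, C, W_gate).shape)   # (10, 4)
```

Unlike attention, the cost per new token here does not grow with sequence length: everything the model remembers has to fit in the carried state.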
TransformerFAM, a 2024 model that uses feedback attention to construct internal working memory, demonstrates that adding feedback mechanisms to Transformer architectures can significantly enhance their ability to handle long contexts and improve memory capacity, which the authors argue is a key prerequisite for reasoning (Hwang et al., 2024).
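In the same spirit, here is an assumption-laden sketch of feedback-style working memory: a small, fixed set of memory vectors is fed back as extra context for each new block of tokens and then rewritten from the result. This illustrates the general idea only, not TransformerFAM's actual mechanism.

```python
import numpy as np

def attend(q, kv):
    """Simple dot-product attention of query rows over key/value rows (toy)."""
    scores = q @ kv.T / np.sqrt(kv.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ kv

def feedback_memory_step(block, memory):
    """One toy step: the block attends over [memory + block], then the memory
    is rewritten by attending from the old memory to the new activations."""
    context = np.concatenate([memory, block], axis=0)     # memory fed back as extra context
    block_out = attend(block, context)                    # current tokens can read the memory
    new_memory = attend(memory, np.concatenate([memory, block_out], axis=0))
    return block_out, new_memory

rng = np.random.default_rng(0)
d_model, mem_slots, block_len = 8, 4, 16
memory = np.zeros((mem_slots, d_model))
for start in range(0, 64, block_len):                     # stream a long input block by block
    block = rng.normal(size=(block_len, d_model))
    block_out, memory = feedback_memory_step(block, memory)
print(memory.shape)  # (4, 8): a fixed-size memory summarising everything seen so far
```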
This growing body of research suggests a shift: modern AI may need to rediscover the very mechanisms evolution refined in biological systems.
4. What AI Can Learn from the Brain
Several architectural insights from neuroscience may enhance next-generation AI:
Prediction Error Signalling: Enable models to pass local prediction errors forward through hierarchical layers, mimicking how brains adjust internal beliefs based on surprise.
Recurrence and Memory: Move beyond context windows; enable true internal state.
Recent studies have explored augmenting Transformers with recurrent mechanisms. One approach, depth-wise recurrence with dynamic halting (Chowdhury & Caragea, 2024), enables adaptive computational depth. Another, chunk-wise recurrence via temporal latent bottlenecks (Didolkar et al., 2022), consolidates information by combining fast and slow processing streams (a toy sketch of this fast/slow idea follows this list). These innovations address core limitations of standard Transformers - including high computational cost, rigid processing depth, and difficulty generalising to long or unfamiliar sequences.
Sparse and Modular Computation: Emulate the brain’s efficiency.
Multi-modal Grounding: Integrate sensory modalities for richer representation.
Goal-Directed Planning: Design agents with intrinsic reward systems or self-updating goals.
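To illustrate the recurrence-and-memory point above, here is a toy chunk-wise recurrence: a fast pass summarises each chunk, and a slow carried state integrates those summaries across chunks. Every detail - shapes, updates, nonlinearities - is an assumption made for the sketch, not a reconstruction of the cited architectures.

```python
import numpy as np

def chunkwise_recurrence(tokens, chunk_size, W_fast, W_slow):
    """Toy chunk-wise recurrence: fast within-chunk processing, slow cross-chunk state."""
    slow_state = np.zeros(tokens.shape[1])                    # persists beyond any single chunk
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        fast_summary = np.tanh(chunk @ W_fast).mean(axis=0)   # fast, local processing
        slow_state = np.tanh(W_slow @ np.concatenate([slow_state, fast_summary]))
    return slow_state                                         # compressed memory of the whole sequence

rng = np.random.default_rng(0)
d_model = 16
tokens = rng.normal(size=(128, d_model))                      # a "long" sequence of embeddings
W_fast = rng.normal(scale=0.1, size=(d_model, d_model))
W_slow = rng.normal(scale=0.1, size=(d_model, 2 * d_model))
print(chunkwise_recurrence(tokens, 32, W_fast, W_slow).shape) # (16,)
```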
While projects such as DeepMind's Gato (Reed et al., 2022) are built on the hypothesis that scaling data, compute, and model parameters can lead to generalist AI, other research argues that scaling alone - especially without addressing architectural inefficiencies or fundamental gaps in understanding - may not be enough to solve intelligence (Bender et al., 2021).
5. Intelligence Is More Than Prediction
Predictive coding teaches us that intelligence involves more than reaction. It requires anticipation, adaptation, and purpose. The human brain is generative, model-driven, and constantly adjusting to incoming sensory feedback.
Today's LLMs are powerful tools - and with frameworks like AutoGPT and Open Interpreter, and features like ChatGPT's persistent memory, we are seeing early forms of memory and agent-like behaviour emerge. Agent frameworks such as ReAct, Voyager, and LangGraph also show promise in enabling goal-conditioned tool use and iterative planning.
But these systems still fall short of true intelligence:
Memory is externally scaffolded or narrowly scoped
Embodiment is largely symbolic or simulated
Goals are prompt-engineered (not intrinsically generated)
LLMs simulate agency - but do not yet possess it.
As Bender et al. (2021) caution, surface-level fluency can be misleading: what appears intelligent may simply be statistical mimicry without real comprehension or intent.
To build genuinely intelligent machines, we may need to return to the first intelligent system we ever knew: the brain.
References
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big?🦜. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (pp. 610–623).
Chowdhury, J. R., & Caragea, C. (2024). Investigating Recurrent Transformers with Dynamic Halt. arXiv preprint arXiv:2402.00976.
Didolkar, A., Gupta, K., Goyal, A., Gundavarapu, N. B., Lamb, A. M., Ke, N. R., & Bengio, Y. (2022). Temporal latent bottleneck: Synthesis of fast and slow processing mechanisms in sequence learning. Advances in Neural Information Processing Systems, 35, 10505–10520.
Friston, K. (2010). The free-energy principle: a unified brain theory?. Nature reviews neuroscience, 11(2), 127–138.
Gu, A., & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
Hwang, D., Wang, W., Huo, Z., Sim, K. C., & Mengibar, P. M. (2024). TransformerFAM: Feedback attention is working memory. arXiv preprint arXiv:2404.09173.
Innocenti, F., Achour, E. M., & Buckley, C. L. (2025). μPC: Scaling Predictive Coding to 100+ Layer Networks. arXiv preprint arXiv:2505.13124.
Lindsay, G. W. (2020). Attention in psychology, neuroscience, and machine learning. Frontiers in computational neuroscience, 14, 29.
Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., & GV, K.K. (2023). RWKV: Reinventing RNNs for the Transformer era. arXiv preprint arXiv:2305.13048.
Poli, M., Massaroli, S., Nguyen, E., Fu, D.Y., Dao, T., Baccus, S., Bengio, Y., Ermon, S., & Ré, C. (2023, July). Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning (pp. 28043–28078). PMLR.
Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1), 79–87.
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S.G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J.T., & Eccles, T. (2022). A generalist agent. arXiv preprint arXiv:2205.06175.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Stay in the Loop
This post is part of Rethinking Intelligence - a series exploring how neuroscience can shape the future of AI, from predictive coding to intelligent agents.
→ For early access, subscriber-only posts, and the full roadmap:
🔗 Subscribe on Substack
→ Prefer reading on Medium? Follow this publication for future posts:
🔗 Follow on Medium
🧠 Next up: How Brains Learn from Almost Nothing: Why AI needs mountains of data, and the brain doesn’t.