CHASING THE GHOST OF HUMAN INTELLIGENCE

The Riddle That Changes Everything
"If four birds are sitting on a tree and a hunter shoots one bird, how many birds remain on the tree?"
When asked this question, do we jump straight to the arithmetic, or do we first consider the environment in which the question is set?
We think it through.
When the hunter shoots a bird, the loud sound reaches other birds. Birds have a negative reaction to loud sounds. Scared birds fly away.
How Large Language Models (LLMs) Process the Question
Compared to how humans understand this, LLMs approach this problem very differently.
When the model sees the given text, it splits it into tokens; a token can be part of a word or a whole word. The model then generates the next token based on the preceding sequence. Older models tended to treat the question at surface level and answer 4 - 1 = 3, because arithmetic problems of that form recur throughout their training data, without taking the context into account.
Newer models like GPT-4, Claude or Gemini handle these questions more reliably, not only because the riddle may appear in their training data, but because techniques like chain-of-thought fine-tuning and RLHF (Reinforcement Learning from Human Feedback) reshape the model's token-prediction patterns so that it slows down, examines the premise and looks for implied context rather than falling back on surface-level arithmetic. Data familiarity may play a role, but it is the reasoning behaviour that is reinforced.
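The frequency effect described above can be illustrated with a toy next-token predictor. This is only a sketch under loud assumptions: the corpus is invented, the tokenizer splits on whitespace (real models use subword tokenizers), and the "model" is a lookup table of counts rather than a neural network.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: three plain subtraction lines, one riddle-style answer.
corpus = [
    "4 - 1 = 3",
    "4 - 1 = 3",
    "4 - 1 = 3",
    "4 - 1 = 0",
]

def tokenize(text):
    """Whitespace tokenizer, a stand-in for a real subword tokenizer."""
    return text.split()

# For every prefix seen in the corpus, count which token came next.
counts = defaultdict(Counter)
for line in corpus:
    tokens = tokenize(line)
    for i in range(1, len(tokens)):
        counts[tuple(tokens[:i])][tokens[i]] += 1

def predict_next(prompt):
    """Return the most frequent continuation of the prompt, or None if unseen."""
    followers = counts.get(tuple(tokenize(prompt)))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_next("4 - 1 ="))  # prints 3, the most frequent continuation
```

The predictor answers 3 simply because that continuation is the most common one in its data; nothing in it represents birds, hunters or sound.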
Causal Reasoning: What Humans Have Naturally
Instead of reading the sentence literally, humans interpret it using causal reasoning. Humans do not follow a fixed set of rules to make day-to-day decisions; we rely on gut feeling and past experience.
Judea Pearl (a renowned computer scientist and philosopher) describes causal reasoning in three layers:
Association — we observe patterns like the bird falling after getting shot.
Intervention — this is an active experimentation, where we put ourselves in the scenario we observed. We want to understand what would happen if we shot it or if we left the bird alone.
Counterfactuals — this is the layer where true reasoning occurs. What would have happened if the hunter had not shot? Could I have done something about this?
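In Pearl's notation, the three layers are commonly written as three different kinds of query. The mapping to the riddle is an illustrative assumption: x stands for the shot, y for the bird's outcome, and x', y' for what was actually observed.

```latex
\begin{align*}
\text{Association:}    \quad & P(y \mid x)               && \text{seeing: how likely is } y \text{ given that we observe } x\text{?} \\
\text{Intervention:}   \quad & P(y \mid \mathrm{do}(x))  && \text{doing: how likely is } y \text{ if we make } x \text{ happen?} \\
\text{Counterfactual:} \quad & P(y_{x} \mid x', y')      && \text{imagining: given that we saw } x' \text{ and } y'\text{, what would } y \text{ have been under } x\text{?}
\end{align*}
```

Each layer asks for strictly more than the one before it: observational data alone cannot, in general, answer an interventional query, and interventional data alone cannot, in general, answer a counterfactual one.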
The CLadder paper (Jin et al., 2023) creates a benchmark specifically designed to test whether LLMs can reason across Judea Pearl's three causal layers: association, intervention and counterfactuals. LLMs performed reasonably well at the association level, but performance dropped significantly at the intervention layer and dropped even more at the counterfactual level, because the last two layers require actual causal understanding rather than pattern matching.
The Generation-Verification Gap
Plausibility is not equivalent to truth; this introduces us to the problem of the generation-verification gap.
It is the significant asymmetry between how easily AI models can generate content and how difficult it is to verify whether these outputs are trustworthy.
What this means is that the model does not understand the meaning of words; it just sees the statistical patterns over language where the output depends on which token gives the highest probability. What is showcased as understanding is actually pattern recognition. The model does capture contextual relations, but whether it truly understands them is debatable.
This shows a major limitation of LLMs. Suppose the model is asked to verify its own answer: it can confidently confirm that the answer is correct because 4 - 1 = 3. The model simply retraces its own flawed reasoning and calls it verified. The exception is when the question has been asked before and the model has been corrected, so that it now gives the right answer; but in that case the reasoning was supplied by us, not understood by the model.
Models confidently generate answers for questions asked, but there is no mechanism set in place to verify the correctness of these answers; they can produce responses that might sound convincing but might be factually wrong.
One way to fix this: A self-verification system
Generation: for a given input, the model produces multiple answers, instead of only one; this creates a distribution of possible outputs; improvement only occurs if there is variation in the output.
Verification: The model evaluates each generated answer, and it provides a score for each; this is called a proxy utility function.
Update: The best answers are kept using top-k filtering (only the k highest-scoring answers are kept, the rest are discarded) or threshold filtering (only answers whose score crosses a specific threshold are kept), and then the model learns from the filtered answers.
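The three steps above can be sketched as a small loop. Everything here is a stand-in: `generate` and `verify` are hypothetical functions with hard-coded behaviour, the scores are invented, and the "update" step only collects the examples a real system would train on.

```python
import random

def generate(prompt, n, rng):
    """Hypothetical generator: samples n candidate answers (mostly the naive 3)."""
    return [rng.choice([0, 3, 3, 3, 4]) for _ in range(n)]

def verify(prompt, answer):
    """Hypothetical proxy utility function: an invented score for one candidate."""
    return {0: 0.9, 3: 0.4, 4: 0.1}[answer]

def top_k(scored, k):
    """Keep the k highest-scoring (answer, score) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def above_threshold(scored, threshold):
    """Keep every (answer, score) pair whose score reaches the threshold."""
    return [pair for pair in scored if pair[1] >= threshold]

rng = random.Random(0)
prompt = "How many birds remain on the tree?"

candidates = generate(prompt, 8, rng)                     # Generation
scored = [(a, verify(prompt, a)) for a in candidates]     # Verification
kept_top = top_k(scored, 3)                               # Update, option 1
kept_thr = above_threshold(scored, 0.5)                   # Update, option 2

# A real system would fine-tune on these pairs; here we only collect them.
training_examples = [(prompt, answer) for answer, _ in kept_thr]
```

Note the difference between the two filters: top-k always returns k answers, even if all of them score poorly, whereas threshold filtering can return nothing at all.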
A generator needs just one plausible answer to provide an output, whereas a verifier must check all possibilities; this is the core reason for the existence of the gap.
But for self-improvement to work, the verifier must be stronger than the generator, which is not always guaranteed. Hence the need for a better approach: LRMs, which separate generation and verification.
Even if we add verification layers, the base problem remains: the system still has the structure of an LLM, so hallucinations are inevitable. We can narrow the gap by using separate models for generation and verification, which lowers the risk of overlapping errors.
Overlapping errors of this kind are called correlated failure: when models share training data, they share similar biases, so when one fails, the other tends to follow.
It does not matter how many verifiers we use if the base structure has flaws.
Large Reasoning Models (LRMs)
LRMs separate the generation and verification processes, either by using separate models or through RL-trained internal reasoning.
LRMs generate multiple outputs, so that the verification model has options to evaluate.
After verification, filtered outputs are picked and used in the next training cycle.
The paper "S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning" reports accuracy on the MATH500 test set rising from 51.0% to 81.6% after RL-trained self-verification. This suggests that the gain comes from self-verification and self-correction behaviours reinforced via RL, rather than from simply scaling the model on mathematical tasks.
Weak verifier problem: Occurs when the verifier is not meaningfully better than the generator, because they share similarly biased data and similar reasoning patterns.
Generation problem: Occurs when the generated reasoning paths, however many there are, are all variations of the same flawed reasoning, so there is no improvement.
We can reduce these issues by using multiple verifiers that check answers in different ways, such as LM judges and reward models.
Reward Models: PRMs and ORMs
PRMs (Process Reward Models, which evaluate how the reasoning is done step by step) and ORMs (Outcome Reward Models, which evaluate the correctness of the final result) are examples of such reward models.
This has been implemented with fruitful results, for example in Cobbe et al. (2021), "Training Verifiers to Solve Math Word Problems", the paper that introduced GSM8K.
The results on GSM8K show approximately 33% accuracy without the verifier, increasing to around 55% with the verifier. That verifier is trained to judge whether a completed solution is correct, which makes it closer to an ORM than a PRM.
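The difference between scoring the outcome and scoring the process can be shown with a toy example. This is a sketch under assumptions: real reward models are learned networks, whereas the two scoring functions below are hand-written arithmetic checks, and the step format `(a, op, b, claimed)` is invented for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def step_is_valid(step):
    """Check one reasoning step of the form (a, op, b, claimed_result)."""
    a, op, b, claimed = step
    return OPS[op](a, b) == claimed

def outcome_score(steps, expected_final):
    """ORM-style stand-in: looks only at the final claimed value."""
    return 1.0 if steps[-1][3] == expected_final else 0.0

def process_score(steps):
    """PRM-style stand-in: fraction of individual steps that check out."""
    return sum(step_is_valid(s) for s in steps) / len(steps)

# Task: compute 12 * 3 + 5 (correct answer 41).
# Two arithmetic slips cancel out: right answer, wrong reasoning.
solution = [
    (12, "*", 3, 35),  # wrong: 12 * 3 is 36
    (35, "+", 5, 41),  # wrong: 35 + 5 is 40, yet it lands on the expected 41
]

print(outcome_score(solution, expected_final=41))  # 1.0
print(process_score(solution))                     # 0.0
```

The outcome check is satisfied by a solution whose every step is wrong, which is exactly the kind of case a process-level check is meant to catch.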
Even when models are different, they are trained on similar internet data and can learn the same misconceptions.
What matters is methodological diversity and not just architectural diversity in verifiers.
Methods of Reasoning
Chain-of-thought: It makes the model show its intermediate steps, which makes errors easier to detect. Wei et al. (2022) report that chain-of-thought prompting improved accuracy on GSM8K from 17.9% to 58.1%, which shows that reasoning matters as much as model size.
Self-Consistency: Instead of using greedy decoding, the model samples several different reasoning paths for the same problem, each of which may reach a different answer; the answer with the majority vote is chosen. It achieves +17.9% on GSM8K over greedy decoding.
Self-Reflection: The model asks "Is the output correct?" and "Can anything go wrong?" This is explicit verification (also called self-verification), and it occurs after generation. The main issue is that the same model that made the error is also asked to detect it, so without strong external verifiers, self-reflection can confidently verify wrong answers. This is why we need the next component: tools.
Tool Use: The model uses external systems to verify its output, for example a math library or calculator for numerical computation and a search engine for facts. Another important example is RAG (Retrieval-Augmented Generation), where the model retrieves documents from an external source rather than basing its responses only on parametric memory (knowledge stored in the model's parameters during training).
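Of the methods above, self-consistency is the simplest to sketch: sample several reasoning paths, keep only their final answers, and take the majority. The sampled answers below are hypothetical; in practice each would come from a separate sampled chain of thought.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers from five sampled reasoning paths.
sampled_answers = ["0", "3", "0", "0", "3"]

answer, share = majority_vote(sampled_answers)
print(answer, share)  # 0 0.6
```

The vote share doubles as a rough confidence signal, but it inherits the limitation discussed here: if most sampled paths share the same flawed reasoning, the majority is confidently wrong.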
Most verification systems are themselves models, so they can still hallucinate. Different reasoning paths can share the same underlying bias, and when the verifier is weak, models lean towards pattern recognition instead of true logical reasoning.
Core Limitations: LLMs and LRMs Fall Short
LLMs and LRMs are trained for specific domains: GPT models for text generation, Whisper for speech recognition. This is called artificial narrow intelligence (ANI). When faced with inputs unlike the data they were trained on, they produce unreliable outputs, as we saw in the bird-hunter problem.
When we think about AGI, it must be able to perform a wide range of tasks with minimal data.
But a major problem with ANI is that it struggles to generalise, so to do even the smallest tasks, it would need new data and retraining, which increases the computational power required.
This makes scaling expensive in terms of computation and infrastructure.
So we must try to scale using logic and understanding instead, and for that, let's introduce a new idea - Symbolic AI.
Symbolic AI: Logic Over Probability
Symbolic AI converts a problem (or statement) into explicit symbols and structured facts; together these facts form a knowledge base. Rules are then encoded manually, and these form the domain knowledge. An inference engine (the brain) applies the rules to the facts to generate new conclusions.
Two methods that help apply this (inference):
Forward Chaining:
It goes from data to conclusion. So the model starts with known facts and applies rules applicable to those facts, which leads to the creation of new facts.
Example: a hunter shot a bird → Loud sound → scares birds → birds fly
Backward Chaining:
It goes from the conclusion to the facts. We start with a hypothesis, find rules that could establish it and check whether the conditions of those rules are true. If they are not known, we treat the conditions as subgoals and repeat until we reach true or false.
Example: we start with birds_left = 2? We find a rule like IF loud noise → all birds fly. Hence, birds_left = 0. So the provided conclusion is wrong.
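Both inference methods can be sketched over the same small rule set. This is a minimal illustration, not a production inference engine: the fact names and the `(premises, conclusion)` rule format are assumptions made for the bird example.

```python
# Each rule is (set of premises, conclusion). Names are illustrative.
RULES = [
    ({"hunter_shoots"}, "loud_sound"),
    ({"loud_sound"}, "birds_scared"),
    ({"birds_scared"}, "birds_fly_away"),
]

def forward_chain(facts, rules):
    """Data to conclusion: keep firing rules until no new fact appears."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

def backward_chain(goal, facts, rules, seen=frozenset()):
    """Conclusion to data: prove the goal by proving a rule's premises as subgoals."""
    if goal in facts:
        return True
    if goal in seen:  # guard against circular rules
        return False
    for premises, conclusion in rules:
        if conclusion == goal and all(
            backward_chain(p, facts, rules, seen | {goal}) for p in premises
        ):
            return True
    return False

facts = {"hunter_shoots"}
print(sorted(forward_chain(facts, RULES)))
print(backward_chain("birds_fly_away", facts, RULES))  # True
print(backward_chain("birds_fly_away", set(), RULES))  # False: no shot, no proof
```

Forward chaining derives everything that follows from the facts, whether or not anyone asked; backward chaining only explores the rules relevant to the question being asked.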
Symbolic AI works with explicit rules. Everything depends on how the rules are structured. It can produce wrong output if the knowledge base is incomplete or if real-world assumptions are missing.
Unlike LLMs, whose errors are buried in opaque weights, symbolic AI errors are traceable.
The core issue is writing rules; it's not efficient to write all rules manually, because it cannot scale and learn the way ML systems do. It also struggles to handle natural language, edge cases and noisy data, so every update would require manual correction, which defeats the purpose.
Neuro-Symbolic AI: Combining the Best of Both
We see pattern recognition without structure and structure without learning - neither works alone, but their failure modes complement each other.
Gary Marcus outlines how a neuro-symbolic AI system could be structured in his paper "The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence".
Let's apply it using the bird riddle:
Perception: The sentence "If 4 birds are sitting on a tree and a hunter shoots one bird, how many birds remain on the tree?" is converted to structured objects with designated properties. Example: Bird: living, flies, scared of loud sounds. Hunter: gun, shoots. Here, the question is defined as a proper structure instead of just tokens.
Structured Encoding: This is where relationships are formed between objects and their properties; a knowledge graph shows how the different objects are connected. For example, we link guns to hunting, guns to producing loud sounds, and birds to being scared of loud sounds, so the structure of the whole scenario starts to emerge. This is the main thing that separates it from an LLM: unlike attention, which works on probabilistic relations, we use meaningful relational links.
Knowledge Base: It stores rules acquired from prior human knowledge. Importantly, these are not derived from the input but are already known facts about the world.
Cognitive Model: This is the most interesting part of the system; think of it as a simulation running inside the model. We define objects, each with features (properties) and relationships to other objects, and we use time to track state changes. This is also called a world model. Its most important feature is that it is not static: it updates continuously as events occur, so the system can track reality. This takes us closer to how the human brain works.
Example:
Time = 0: 4 birds on the tree.
Time = 1: hunter arrives, spots the birds.
Time = 2: hunter aims gun, hunter shoots.
Time = 3: loud sound is created, 1 bird dies.
Time = 4: remaining birds register sound, get scared and fly away.
Time = 5: zero birds left on the tree, 1 dead on the ground.
Symbolic Inference: This is the part where logic and rules are applied. Linking objects to properties based on the scenario they are present in, we know that as all birds fly away, no birds are remaining on the tree. So we conclude that the number of birds left is zero. Example: IF loud sound near bird → bird flies; IF bird shot → bird dies.
Update Mechanism: Keeps track of time and maintains consistency throughout the world model. If anything changes mid-reasoning, it adds the change to the world model, making the whole structure dynamic. Example: suppose we add another bird, but this bird is deaf, so it does not hear the gunshot. This is updated in the world model mid-reasoning, and the final answer changes to 1.
Output Layer: This is what answers questions or makes decisions in a pipeline. The system outputs exactly what the question asks for, and the beauty of this method is that it is a completely open box: every answer can be traced back to the rules that produced it. It is worth thinking about the implications if this pipeline can be scaled.
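The cognitive model, symbolic inference and update steps above can be sketched as a tiny state simulation. This is an illustration under assumptions, not Marcus's architecture: the `Bird` class, its fields and the two hand-written rules (a shot bird dies and falls; a living bird that hears the shot flies away) are invented for the riddle.

```python
from dataclasses import dataclass

@dataclass
class Bird:
    name: str
    can_hear: bool = True
    alive: bool = True
    on_tree: bool = True

def simulate_gunshot(birds, target_name):
    """Apply the two rules to the world state and return a readable trace."""
    trace = []
    # Rule 1: IF bird shot -> bird dies (and falls from the tree).
    for bird in birds:
        if bird.name == target_name:
            bird.alive = False
            bird.on_tree = False
            trace.append(f"{bird.name}: shot, falls to the ground")
    # Rule 2: IF loud sound near a living bird that can hear -> bird flies.
    for bird in birds:
        if bird.alive and bird.on_tree and bird.can_hear:
            bird.on_tree = False
            trace.append(f"{bird.name}: hears the shot, flies away")
    return trace

def birds_on_tree(birds):
    """Read the answer off the world state instead of computing 4 - 1."""
    return sum(b.alive and b.on_tree for b in birds)

# Original riddle: four hearing birds, one is shot.
flock = [Bird(f"bird{i}") for i in range(1, 5)]
trace = simulate_gunshot(flock, "bird1")
print(birds_on_tree(flock))  # 0

# Updated scenario: a fifth, deaf bird is added before the shot.
flock = [Bird(f"bird{i}") for i in range(1, 5)] + [Bird("deaf_bird", can_hear=False)]
simulate_gunshot(flock, "bird1")
print(birds_on_tree(flock))  # 1
```

The returned trace is what makes the box "open": each state change is tied to the rule that caused it, so a wrong answer can be traced to a wrong or missing rule.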
Metacognition: The Key to Controlled Intelligence
Going back to the earlier problem with self-correcting models: even when verifiers were used and every candidate output was wrong, the system would still return the best of the wrong answers.
This is because all outputs are scored probabilistically and the highest-scoring one is selected; there is no absolute standard.
Humans handle this naturally. Before an exam for which we know we are not prepared, for instance, we recognise the gaps and plan around what we do know.
This is where we introduce metacognition:
This system would be another layer over our current neuro-symbolic setup. Instead of choosing the best option, it checks whether the given answers are acceptable or not.
It introduces something called failure awareness, where a system can reject its own thinking process.
If outputs fail to meet given criteria, it forces the model to try again, so this enables a retry loop, reducing error rate over iterations, which reduces the generation-verification gap.
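The accept-or-retry behaviour described above can be sketched as a small control loop. The generator and the acceptance check are hypothetical stand-ins; in particular, the check here is hard-coded for the riddle, where a real system would need an independent criterion.

```python
def solve_with_metacognition(generate, check, max_attempts=3):
    """Retry until an answer passes `check`; abstain (None) if none does."""
    rejected = []
    for attempt in range(1, max_attempts + 1):
        answer = generate(attempt, rejected)
        if check(answer):
            return answer, rejected
        rejected.append(answer)  # failure awareness: remember what was refused
    return None, rejected

def toy_generate(attempt, rejected):
    # Hypothetical generator: surface arithmetic first, then a reconsidered answer.
    return 3 if attempt == 1 else 0

def toy_check(answer):
    # Hypothetical criterion: a gunshot scares off every surviving bird,
    # so any answer that leaves birds on the tree is rejected.
    return answer == 0

print(solve_with_metacognition(toy_generate, toy_check))      # (0, [3])
print(solve_with_metacognition(lambda a, r: 3, toy_check))    # (None, [3, 3, 3])
```

The second call is the important one: when every attempt fails the criterion, the loop returns no answer at all rather than the best of the wrong ones.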
This matters a lot for AGI because it is not only neural networks with symbolic reasoning, but also control over cognitive intelligence.
AGI is not just intelligence; it is controlled intelligence.
Even though metacognition tells the system when to doubt itself, a system needs something concrete to check its answers against.
World Models: What We Are Missing
A world model is a learned model that represents state, causality and time, and predicts what can happen next without being told.
It uses the fundamental principles it learned before in order to reason about things it has not seen by simulating.
It reduces hallucinations by checking answers against simulated reality. The generation-verification gap problem is reduced when verification is done using ground truth simulation. The addition of a world model does not eliminate data bias, but it makes it more detectable as the bias now lives as rules encoded in the knowledge base and as properties given to objects as described in the neural structure. These are explicit and inspectable, unlike LLMs, whose biases are mixed into weights through training data.
Causality is structurally encoded, not inferred statistically. Objects are explicitly represented along with their directional relationships. The model does not guess from patterns.
For a world model, a scenario such as the bird-hunter problem is trivial because it can easily dissect the objects of interest, take the scene forward and get to the conclusion through simulation, not subtraction. As shown above, it would be a scenario - gunshot, loud sound, fear, response and then zero birds remaining. The answer is not retrieved from data but observed, which is closer to what human reasoning is.
Say you change the scenario to include a deaf bird: a token-predicting model has no pattern to fall back on. The world model would simply update the chain of causal reasoning: a deaf bird cannot hear the shot, so it stays on the tree, and the answer changes to one.
How do we build a world model from data?
In his work on joint-embedding predictive architectures (JEPA), the approach behind V-JEPA, Yann LeCun argues that token prediction is not true intelligence; what the model should predict is an abstract representation of the scenario.
Because it predicts in an abstract representation space, it can also produce uncertainty estimates. If the scenario is outside what the world model understands, the predictor is uncertain.
V-JEPA 2 is trained on internet-scale video and learns a predictive model in latent space. This is important because vision is how we observe state changes. The system learns the dynamics of the world by watching it without labels or language. A full world model would require language, too, to link information across different domains, but V-JEPA 2 shows that the vision half is now tractable.
With a world model, metacognition becomes powerful because instead of checking its answers against other answers it would produce, the system checks them against a simulation of reality.
MuZero and DreamerV3 are AI systems developed by Google DeepMind that learn to play games, such as chess and shogi (MuZero) and Atari games (DreamerV3), without being told the rules. Earlier systems, like AlphaGo, knew the rules beforehand. MuZero learns a model of the environment and uses Monte Carlo Tree Search over it to simulate different futures and choose the best branch, whereas Dreamer relies heavily on latent imagination. Dreamer has outperformed model-free RL agents. MuZero is evidence that a system can build an internal causal model without explicit programming.
Scaling these systems to handle physics, human emotions and social dynamics is not just an engineering problem but also an unsolved mathematical problem.
But V-JEPA demonstrates tractable world modelling from vision alone, which suggests that the mathematical part is approachable and that what remains is a question of architecture and scaling; we are not starting from zero.
True intelligence or AGI might not require a larger language model, but a system that can simulate, reason and recognise its own boundary of understanding. Every component that we have discussed so far - neural perception, symbolic reasoning and metacognitive control is a piece of that system.
The world model is what holds that together.
Conclusion: Towards AGI
As we step back and analyse everything we learned, we can see that, at least theoretically, AGI would be a combination of neural networks, self-correcting models, symbolic AI and metacognition.
Every failure mode described here (hallucinations, poor generalisation, the generation-verification gap) shares one root problem: the system has no internal understanding of what the world actually is.
What we have built here is not just a list of fixes but a blueprint. Neural perception to convert reality into a structured representation. Symbolic reasoning to apply logic without falling back on probability. Metacognition to recognise when reasoning has failed. And a world model to hold it all together.
The bird riddle was never about arithmetic; it was about whether a system can understand what a world is.
This understanding is exactly what would get us closer to AGI, where a system can simulate a world it was not trained on and recognise boundaries of that simulation.
The problem is that each component exists - we have a world model, the symbolic layer and a metacognitive check. What does not exist yet is the link that holds each component together. We don't have a name for it yet — but we used to call it understanding.





