ChatGPT Can Recite Physics. It Has No Idea How Physics Works.
The fundamental difference between language models and world models — and why it matters more than any benchmark.

We live in an age of technological abundance. Not a month goes by without a new version of ChatGPT, Claude, or Gemini producing a moment of genuine astonishment. These systems can now pass the bar exam, diagnose rare conditions that experienced physicians have missed, and write a passable Baudelaire poem in under three seconds. For many people, this is it. We have built intelligence.
But are we mistaking the map for the territory?
The question has a name in AI research, the Moravec paradox, and it's the problem that keeps Yann LeCun up at night. The paradox is this: the cognitive tasks we consider intellectually demanding (solving differential equations, writing legal briefs, synthesizing research across fields) turn out to be relatively easy to replicate in a machine. But the things we consider trivially easy, such as a 24-month-old child understanding gravity after dropping a spoon three times, remain stubbornly out of reach for systems that have ingested the entire internet. A toddler grasps cause and effect from a handful of observations; a model needs oceans of data before it can hold even a simple conversation about physics.
LeCun’s answer is that we are building the wrong thing. And he has a $1.03 billion bet to back it up.
What Language Models Actually Do
To understand why LeCun is dissatisfied, you need to understand what modern AI is actually doing under the hood. Whether you are using ChatGPT, Claude, or Gemini, the underlying logic is the same. These are autoregressive models: systems trained to predict the next token — roughly, the next word fragment — based on statistical patterns in everything they have ever read. The techniques have grown enormously sophisticated over the years, and there is real emergent reasoning happening at scale, but the fundamental operation remains prediction over text.
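Stripped of the neural network, that training objective can be caricatured in a dozen lines. This toy bigram counter is obviously not a real LLM, but the essential operation is the same: given the context, emit the statistically likeliest continuation.

```python
from collections import Counter, defaultdict

# Toy caricature of autoregressive training: count which token follows
# which, then always emit the likeliest continuation. Real models
# condition on long contexts with billions of parameters, but the
# underlying objective is still next-token prediction.
corpus = "the apple fell from the tree . the apple hit the ground .".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the continuation seen most often after `token` in training."""
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # "apple": it follows "the" most often here
```

Note what the table contains: co-occurrence statistics, nothing else. There is no slot anywhere for the apple's mass or trajectory, which is exactly the limitation the next section describes.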
The problem, as LeCun has been arguing for years, is that this approach never touches the structure of the world. For a large language model, the concept of “apple” is a mathematical vector surrounded by nearby vectors: red, fruit, Newton, tree. The model has no concept of the apple’s mass, its texture against your teeth, or the trajectory it follows when knocked off a table. It cannot simulate falling objects — it can only recite the description of falling objects.
This is precisely why AI models hallucinate. Ask any of them whether a kilogram of feathers weighs more than a kilogram of lead, and a system that has processed billions of documents still occasionally generates a confidently wrong answer. The model has no internal physical foundation against which to check its claims. It is, by design, a generator. And generators, by definition, generate — they cannot verify. As LeCun put it, these systems recite the script of a falling object without ever calculating the fall.
Silicon Valley’s response to this problem has been consistent: more. More data, more servers, more parameters. And honestly, it cannot be said to be wrong — the scaling laws are real, the curves keep improving, and the results are often remarkable. But LeCun argues this approach eventually hits a structural wall. You cannot learn to drive by reading the driving manual indefinitely. At some point, you have to feel the clutch, understand the car’s inertia, and anticipate the pedestrian stepping off the curb. That information doesn’t exist in text. It lives in physics.
The Radical Alternative: AI That Simulates
LeCun’s contrarian thesis, one he was advancing even during his 12 years as chief AI scientist at Meta, is that intelligence is not the mastery of language. It is the mastery of causality. Predicting the next word in a sentence does not equate to understanding the world. If I push this, if I turn that, if I let go — what are the consequences?
The technical framework for this vision is called JEPA: Joint Embedding Predictive Architecture. LeCun first proposed it in a 2022 position paper, and the core idea is to stop predicting in surface space — pixels or tokens — and instead predict in what researchers call latent space, a compressed mathematical representation of the world's deep structure. The system doesn't reconstruct the falling ball frame by frame; it models the trajectory as an abstract concept. It works with the rule, not the rendering.
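The contrast with pixel-level generation can be sketched in a few lines of toy numpy. Everything here is invented for illustration (a fixed random projection stands in for a learned encoder, a linear map for the predictor); the point is only where the loss lives: in the compressed representation, never in the raw observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: "frames" are 64-dim observations, the encoder
# compresses them to 4-dim latents, and the predictor forecasts the NEXT
# latent from the current one. No pixel reconstruction anywhere: the
# error is measured entirely in latent space, which is the core JEPA idea.
W_enc = rng.normal(size=(64, 4))   # stand-in encoder (fixed projection)
W_pred = rng.normal(size=(4, 4))   # stand-in predictor (would be trained)

def encode(x):
    return x @ W_enc

def latent_loss(frame_t, frame_t1):
    """Prediction error computed in latent space, not pixel space."""
    z_t, z_t1 = encode(frame_t), encode(frame_t1)
    z_hat = z_t @ W_pred           # predicted next latent
    return float(np.mean((z_hat - z_t1) ** 2))

frame_t = rng.normal(size=64)
frame_t1 = frame_t + 0.1 * rng.normal(size=64)  # slightly evolved scene
print(latent_loss(frame_t, frame_t1))
```

The design choice this illustrates: a 64-dimensional frame is judged through a 4-dimensional summary, so details that don't survive the encoder simply can't contribute to the loss.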
In a busy street scene, a standard generative model exhausts itself computing the movement of every leaf in every tree, every light reflection on every puddle — what researchers call noise, information irrelevant to decision-making. A JEPA-based system learns to filter that noise away and keep only what matters: the car pulling out from the left, the pedestrian hesitating at the curb, the traffic light changing. That’s exactly what your brain does. When you cross the street, you don’t calculate the position of every air molecule. You simulate the critical trajectories.
This is the vision LeCun calls objective-driven AI: a system with a genuine internal model of the world, capable of simulating hypotheticals before committing to an action. If I grip this glass too tightly, it will shatter. If I turn the wheel too fast, the car will skid.
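A minimal sketch of that loop, with invented one-dimensional dynamics standing in for a learned world model: the agent rolls each candidate action forward internally and commits only to the one whose simulated outcome best meets the objective.

```python
# Hedged sketch of objective-driven control. The "world model" below is a
# hand-written 1-D ball-under-gravity simulator, invented purely for
# illustration; in LeCun's vision it would be learned from observation.
GRAVITY = -9.8

def simulate(height, velocity, thrust, dt=0.1, steps=20):
    """Roll the internal model forward; return the predicted final height."""
    for _ in range(steps):
        velocity += (GRAVITY + thrust) * dt
        height = max(0.0, height + velocity * dt)
    return height

def plan(height, velocity, target=5.0):
    """Simulate every candidate action, commit to the best outcome."""
    candidates = [0.0, 5.0, 9.8, 15.0]
    return min(candidates,
               key=lambda a: abs(simulate(height, velocity, a) - target))

best = plan(height=5.0, velocity=0.0)
print(best)  # 9.8: exactly cancelling gravity holds the target height
```

Nothing is executed in the world until every hypothetical has been evaluated internally. That is the difference between reacting and planning.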
Not imitation — simulation.
The Proof of Concept
For years, this vision remained elegant but difficult to operationalize. JEPA-based systems kept hitting a specific failure mode: representation collapse. When training a model to predict future abstract states, the model discovers a shortcut. It can minimize its prediction error simply by mapping every input — a bouncing ball, a passing car, a stationary block — to the same internal representation. Prediction error goes to zero. The model learns nothing.
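The failure mode is easy to reproduce in a toy setting: a constant encoder drives the prediction error to exactly zero while encoding nothing at all.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy demonstration of representation collapse: if the encoder maps every
# input to the same vector, predicting the next latent becomes trivial,
# yet the representation carries no information about the input.
inputs = rng.normal(size=(100, 16))   # 100 distinct "scenes"

def collapsed_encoder(x):
    return np.zeros(4)                # every input -> the same latent

latents = np.stack([collapsed_encoder(x) for x in inputs])
prediction_error = float(np.mean((latents[1:] - latents[:-1]) ** 2))
variance = float(latents.var())

print(prediction_error)  # 0.0 -> the objective is "solved"
print(variance)          # 0.0 -> nothing was learned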
Previous attempts to prevent collapse required engineering scaffolding: complex multi-term loss functions, frozen pre-trained encoders, stop-gradient operations, exponential moving averages. With six or seven tunable components, training was fragile, expensive, and hard to reproduce.
In March 2026, LeCun’s team — working with researchers from Mila, NYU, Samsung SAIL, and Brown University — published LeWorldModel (LeWM), a paper that solves this problem with a single, clean mechanism. They called it SIGReg: the Sketched Isotropic Gaussian Regularizer. The idea is to impose a constraint on the model’s internal representations, forcing them to follow a Gaussian distribution. If the model tries to collapse all information to a single point, SIGReg imposes a penalty. The model is therefore forced to maintain diverse, meaningful distinctions between its representations, and to draw those distinctions it has to actually understand what it is looking at. In other words, it has to learn.
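The paper specifies the exact mechanism; as a loose, hypothetical sketch of the underlying idea only, a moment-matching penalty that pushes batch statistics toward a standard Gaussian already punishes collapse, because a collapsed batch has zero variance.

```python
import numpy as np

# Hedged stand-in for the *idea* behind an isotropic-Gaussian regularizer
# (not the actual SIGReg algorithm): penalize latents whose mean drifts
# from zero or whose covariance drifts from the identity. A collapsed
# batch has zero covariance and is punished hard.
def gaussian_penalty(latents: np.ndarray) -> float:
    mean = latents.mean(axis=0)
    cov = np.cov(latents, rowvar=False)
    eye = np.eye(latents.shape[1])
    return float((mean ** 2).sum() + ((cov - eye) ** 2).sum())

rng = np.random.default_rng(2)
healthy = rng.normal(size=(500, 4))   # roughly isotropic Gaussian latents
collapsed = np.zeros((500, 4))        # every latent is the same point

print(gaussian_penalty(healthy))      # small: stats already match the target
print(gaussian_penalty(collapsed))    # 4.0: zero covariance vs. identity
```

Added to a prediction loss, a term like this makes the "map everything to one point" shortcut the most expensive option rather than the cheapest one.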
These results, despite what their scope might suggest, don’t come from a billion-dollar cluster. LeWorldModel has 15 million parameters, trains on a single GPU in a few hours, and plans actions up to 48 times faster than foundation-model-based world models. The model uses roughly 200 times fewer tokens than competing approaches. And importantly, it isn’t taught physics — it discovers physics from raw video. It learns that objects can’t pass through walls, that a ball must bounce, that gravity is constant. Nobody wrote those rules. The model extracts them from observations.
Think of the difference between a tennis champion and a beginner. The beginner chases the ball reactively, running to where it already is. The champion has already simulated the trajectory before the opponent’s racket makes contact. Standard language models are perpetual beginners — they react to the last token to guess the next one, stuck in a permanent present tense. A world model looks ahead. It projects scenarios. It acts on outcomes it has already evaluated internally.
What This Changes
The applications that depend on this shift are not small ones.
In robotics, today’s systems require exhaustive pre-programming because any change in object position — an unexpected plate orientation in the dishwasher, a box placed differently on a shelf — can break the pipeline. A system with a physical world model could generalize. It would understand fragility, weight, balance, and the relationship between grip force and the likelihood of breakage.
While Tesla’s autonomous vehicles and similar systems already merge perception, prediction, and planning, their ability to anticipate remains shallow. LeCun’s model aims at something deeper: if a ball rolls into the road, the system should infer that a child may follow without waiting to see the child. That’s not pattern matching. That’s causal reasoning about the physical world.
The Business Reality
LeCun’s Paris-based startup AMI Labs raised $1.03 billion in March 2026 at a $3.5 billion pre-money valuation, the largest seed round in European startup history. The investors include Jeff Bezos’ Bezos Expeditions, Nvidia, Cathay Innovation, and Greycroft. That list says something. Nvidia backing a bet against the LLM paradigm it has largely profited from is not an accidental signal.
The efficiency story deserves a precise reading. A 15-million-parameter proof of concept running on a single GPU is remarkable evidence for the validity of the architectural approach. It is not yet a system capable of navigating a complex real-world environment. AMI Labs is under no illusions about this: the company’s own roadmap anticipates years of fundamental research before commercial products arrive. Nor does the efficiency gain negate the necessity of scale. The cost per unit of capability decreases, but the investments in data acquisition and infrastructure remain considerable at production scale.
LeCun’s strategy, as described publicly, is to release the code openly — the SIGReg implementation is already available — intending to establish a new standard. If the world adopts JEPA as the foundation for physical intelligence, AMI Labs should remain at the frontier of its development and capture industrial licensing revenue. Whether that strategy holds in a race where geopolitical actors are watching closely is a genuinely open question. Copying an open-source paradigm is not the same as downloading a model, but it is not the same as replicating a gigawatt data center either.
The deeper point, the one that matters regardless of who wins commercially, is this: LeCun is beginning to demonstrate that intelligence doesn’t require gigantism. Scale is necessary, but a qualitative leap may come from connecting AI to physical reality rather than just making it bigger. If that thesis holds, the future of AI will not be written in text interfaces. It will be written in a way that machines finally learn to engage with the world they have merely been describing.
What’s your read on this? Curious where you land in the comments.

