New Delhi: For most of the past decade, the formula for making artificial intelligence smarter has been straightforward: give it more text to read.
The models that power ChatGPT, Gemini, and their competitors were trained on billions of words scraped from books, websites, Wikipedia articles and online forums. The more text they consumed, the more capable they became at writing, reasoning and answering questions.
But researchers at Meta’s AI lab and New York University now say that era is drawing to a close.
In a paper published on 3 March, a team of researchers—led by Shengbang Tong, David Fan and John Nguyen, with advice from Meta’s Chief AI Scientist Yann LeCun, professor Luke Zettlemoyer, and NYU’s Saining Xie—said that the supply of high-quality text on the internet is finite and “approaching exhaustion”.
They argued that the next significant leap in AI capability will have to come from somewhere else.
Their candidate is video: the vast, largely untapped archive of moving images that people upload to video and streaming platforms every day.
The claim is not merely that video is a convenient substitute for text. According to the researchers, AI systems trained on large amounts of video develop capabilities that text alone cannot produce. In particular, they pick up on how the physical world works—how objects move, how spaces connect, and how one action leads to another.
To test this idea, the researchers trained AI models from scratch rather than adapting existing language models.
This allowed them to examine what multimodal training—the combination of text, images, and video—actually contributes to an AI system’s capabilities.
The methodological choice matters because most existing studies build on pretrained language models, making it difficult to distinguish what the model learned from language and what it learned from vision.
The starting point for understanding the study’s significance is a problem the researchers call the “modality tax”.
For years, researchers had observed that training AI on images tended to degrade its language abilities, and vice versa—as if the two kinds of learning were competing for the same mental space.
The assumption was that vision and language fight over the same computational resources, forcing an uncomfortable tradeoff in any system that tries to handle both.
The Meta-NYU team found that this may not be true.
When they added raw video footage—without any text captions or labels—to their training mixture, language performance did not worsen. In some tests, it improved slightly.
The researchers found that the interference observed in earlier multimodal systems may have come not from vision itself but from image captions.
The text used to describe pictures differs markedly from ordinary pretraining text, and that mismatch appears to have caused the degradation. Once the captions were removed, the visual data proved compatible with language learning.
Beyond words
The most striking finding came when researchers tested whether their models could predict how scenes change—given a sequence of video frames and a navigation instruction, could the model generate what the scene would look like after that action was taken?
This capability could be valuable for the kind of spatial reasoning that robots, autonomous vehicles, and AI assistants would need to operate in physical environments.
They found that models trained on general video data acquired most of this capability on their own, without being explicitly taught.
When the researchers tested how much specialist navigation data the model actually needed, they found that performance plateaued once navigation-specific sources made up just one percent of the training data.
In other words, the model had acquired the bulk of its world-modelling ability simply from watching ordinary, unlabelled video.
The models could also respond to natural language navigation commands they had never seen during training—instructions like “get out of the shadow” or “go to the road”—generating plausible visual sequences in response.
According to the researchers, this capability emerged as a consequence of general multimodal training rather than any deliberate design.
The study also challenges structural assumptions about the design of several prominent multimodal AI systems.
Currently, many systems use two separate visual encoders within a single model: one to understand images and another to generate them.
The reasoning was that understanding and generation require fundamentally different kinds of visual representation—abstract, semantic features, on the one hand, and fine-grained pixel-level information on the other.
But the Meta-NYU team found this separation to be unnecessary.
A single encoder—one that learns visual representations through language supervision rather than through pixel compression—performed better than dual-encoder alternatives on both tasks simultaneously.
Examining the internal workings of their models, they found that the same specialised sub-networks were being activated for image understanding and image generation alike, suggesting the model had converged on a truly unified visual representation without being prompted to do so.
The mathematics of scaling
Perhaps the most consequential finding in the paper concerns how much data AI models actually need as they grow larger.
Language, the researchers found, follows a relatively predictable curve—bigger models need proportionally more text, and the relationship is fairly balanced.
Vision is a different story. The amount of visual data a model needs grows far faster than its language data requirements as model size increases.
To put this concretely: at 100 billion parameters, a model needs 14 times more visual data relative to text than it did at one billion parameters. At one trillion parameters, that figure rises to 51 times.
In effect, the larger the model, the more vision dominates its data diet—and the harder it becomes to keep both modalities well-fed within the same training budget.
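The two figures the paper reports are roughly consistent with a simple power law. A minimal sketch of that back-of-the-envelope check (the power-law form and exponent-fitting here are an illustration, not a formula taken from the paper):

```python
import math

# Illustrative only: fit a power law ratio(N) = (N / 1e9) ** a to the
# reported 14x visual-to-text data ratio at 100 billion parameters
# (relative to a 1-billion-parameter baseline), then extrapolate to
# 1 trillion parameters and compare with the reported 51x.

def fitted_exponent(params, ratio, base=1e9):
    """Solve ratio = (params / base) ** a for the exponent a."""
    return math.log(ratio) / math.log(params / base)

a = fitted_exponent(100e9, 14)        # exponent implied by the 100B-parameter point
predicted_1t = (1e12 / 1e9) ** a      # extrapolation to 1 trillion parameters

print(f"exponent: {a:.2f}")
print(f"predicted ratio at 1T parameters: {predicted_1t:.0f}x")
```

The exponent implied by the 100-billion-parameter figure, extrapolated out to a trillion parameters, lands close to the 51x the paper reports, which is why the growth reads as faster-than-proportional.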
This creates what the researchers describe as a genuine dilemma at scale. Conventional architectures cannot satisfy the data requirements of both modalities simultaneously—a model optimised for language will underserve vision, and vice versa.
Their proposed solution is an architecture called Mixture-of-Experts, in which a learned routing system directs each piece of input to specialised sub-networks rather than passing everything through the same parameters.
The researchers found that, beyond its efficiency benefits, this approach also has a structural effect: it enables language to behave more like vision in terms of data demands.
This narrows the gap between the two modalities and makes it possible, for the first time, to train both close to their respective optimal levels within a single model.
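In a Mixture-of-Experts layer, a small learned "gate" scores each input and forwards it to one of several expert sub-networks, so only a fraction of the model's parameters fire on any given input. A bare-bones sketch of top-1 routing (the sizes, random weights, and top-1 rule are illustrative assumptions, not details from the paper):

```python
import numpy as np

# Minimal Mixture-of-Experts routing sketch. In a trained model the
# gate weights and expert weights are learned; here they are random,
# purely to show the mechanics of routing.

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 5

gate_w = rng.normal(size=(d_model, n_experts))                 # routing weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

tokens = rng.normal(size=(n_tokens, d_model))                  # one row per input token
scores = tokens @ gate_w                                       # one score per (token, expert)
choice = scores.argmax(axis=1)                                 # top-1 routing: one expert per token

# Each token passes through only its chosen expert sub-network.
outputs = np.stack([tokens[i] @ experts[choice[i]] for i in range(n_tokens)])
print(choice)          # which expert handled each token
print(outputs.shape)   # output has the same shape as the input
```

The point of the design is that capacity can grow (more experts) without every input paying the compute cost of every parameter, which is what lets one model serve two data-hungry modalities at once.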
The researchers invoke Plato’s allegory of the cave to frame the ambition: current language models, they write, have “mastered the description of shadows on the wall without ever seeing the objects casting them”. The goal, as they describe it, is to build models that have finally stepped outside.
(Edited by Sugita Katyal)

