India is racing to decode its ancient manuscripts & medical texts. AI is helping

Image: Shruti Naithani/ThePrint; Source: bharatgen.com — Image: Shruti Naithani/ThePrint

New Delhi: On the outer wall of the Airavatesvara Temple in Tamil Nadu, a 12th-century stone carving shows a woman giving birth while standing upright, supported by attendants on both sides.

A replica of the stone sits at the National Institute of Indian Medical Heritage (NIIMH) in Hyderabad—a visual record, researchers there say, of a birthing practice that predates modern obstetrics by centuries, one that the supine position, introduced in 17th-century Europe, eventually replaced.

Whether the carving documents medical knowledge or merely custom remains an open question. But it points to the larger problem animating one of India’s most ambitious archival projects.

What can the historical record actually tell us about traditional medicine, and how much of that record can we even read?

Millions of manuscripts, mostly unread

India’s manuscript wealth is staggering in its scale and in the problem it presents. Millions of texts exist across temples, monasteries, private homes, and institutional archives and a significant portion of them deal with medicine. Many have never been systematically studied. Some have never been touched by a trained scholar.

The Centre’s Gyan Bharatam Mission, with an outlay of Rs 482.85 crore for 2024–2031, has made digitising this corpus a priority. Under the mission, over 7.5 lakh manuscripts have been digitised.

Participants at the International Conference on Gyan Bharatam in New Delhi last September | ANI photo

AI translation platforms, like BHASHINI, and various decipherment prototypes under Gyan-Setu initiative are being developed to help process the material.

The stakes go beyond academic completeness. NIIMH’s collection alone preserves over 800 medico-historical artefacts—some as old as a thousand years—which are showcased on its Showcase of Ayurvedic Historical Imprints (SAHI) portal, a digital archive tracing the history of Ayurveda from prehistoric times to the present.

Among its digitised records is the Talamanchi Plates of Vikramaditya I, a copper plate inscription found in Talamanchi village, Nellore district, Andhra Pradesh. It records the grant of a village called Elasatti to Srimeghacharya, the guru of Chalukya king Vikramaditya I, who ruled in the 7th century. Notably, the inscription was written by Srivaccavartmana of the Vaidyan family, indicating the prominent role that physician families played in royal administration of the time.

Another significant record is the Jivaka Amravana of Rajgir, an archaeological site located in the Griddhrakuta area of the Rajgir Hills, Nalanda district, Bihar. It is associated with Jivaka, the most celebrated physician of his time, who served as court physician to King Bimbisara and King Ajatasatru of the Haryanka dynasty—reflecting the deep roots of organised medical practice in ancient India.

Also documented is Ashoka’s Second Rock Edict, an inscription found at Girnar, Junagadh, Gujarat. It records the public healthcare measures undertaken by King Ashoka, making it one of the earliest known references to organised state-sponsored healthcare in Indian history.

“The history of medicine, studied properly, offers a map of how civilisations have understood health, disease, and the limits of human knowledge and how those understandings changed,” said Dr Saketh Ram Thrigulla, Research Officer (Ayurveda) at NIIMH.

Also Read: Modi govt is collecting rare manuscripts of Ramayan. Panel formed to evaluate text

Help from AI, with limits

Before any AI system can read a manuscript, someone has to find it.

Dr Thrigulla explained that researchers travel to temples, old family homes, and regional libraries, requesting access to documents that families have sometimes held for generations.

When custodians refuse to part with originals, researchers bring portable scanners, photograph every folio on-site, and leave without taking anything. This is when a digital record enters the national archive.

“Each digitised text is then catalogued. Script, language, subject matter, opening and closing lines. That metadata is what makes the archive practically useful,” he said.

In a digitised version, a scholar can locate every text on a given medical subject without having to read the entire corpus. Tools built under the Gyan Bharatam initiative take this further.

Scanned pages are processed through an optical character recognition (OCR) system—a technology that converts text from scanned images into a machine-readable format, across multiple Indian scripts. It can also decode the formatting such as words, line, and paragraphs, font size and emphasis, as well as tables and figures.

Scanned manuscript pages are processed through an optical character recognition (OCR) system | Source: @BharatGenOfficial

Once processed, a separate interface lets users ask questions directly—in text or voice, in their preferred language—and receive answers drawn from the manuscripts, with citations. One may ask how the mind can be calmed, and the system might point to the yogic practice of Anulom Vilom, tracing the answer back to its source text.

But digitisation is not the same as understanding. And that distinction is where the real difficulty begins.

Professor Chetan Arora of IIT Delhi’s Department of Computer Science and Engineering, along with his team, developed Lipikar, a manuscript digitisation tool recognised at the Gyan-Setu National AI Innovation Challenge organised by the Ministry of Culture in September 2025.

Lipikar uses OCR. It was built specifically to address the near-complete absence of reliable, open-source OCR tools for Indic manuscripts.

The problem, as Arora’s team encountered it, has three distinct layers. Reading, which involves recognising individual characters from images, is a difficult but achievable task for a computer. “Searching a digitised corpus for specific words or passages is now largely resolved,” he said.

However, he said that interpretation, which involves reasoning about what a text actually means in its historical and cultural context, remains an open research problem.

The distinction matters enormously in practice. “Let’s say I digitise the entire (Bhagavad) Gita for you,” Dr Arora said. “Can you now answer what does karma mean? The answer may not be written directly. This is where the research efforts are directed today.”

The meaning may not be written directly anywhere in the text, it has to be reasoned out from the context. Current AI systems can not do that very well, he explained. What researchers are doing instead is more modest and more honest about the technology’s limits. They are using AI to assist human scholars rather than replace them.

A 200-page manuscript that takes a trained expert two to two-and-a-half months to transcribe might be processed faster with AI, while the scholar does the actual interpretative work.

Also Read: Arthashastra to Ganita Kaumudi—rare manuscripts on display at Gyan Bharatam conference

Problem with ancient manuscripts

Even at the level of character recognition, Indian manuscripts present challenges that AI finds challenging. In the modern world, scripts map reliably to languages. Devanagari for Hindi, the Bengali script for Bangla.

Dr Arora explained that in the manuscript world, that relationship breaks down entirely. The same script was used for multiple languages, the same language appeared in multiple scripts.

For instance, he explained, a single character in Kaithi—a Brahmic script used widely across Bihar, Uttar Pradesh, and Jharkhand from the 16th to early 20th centuries for legal, administrative, and private records—might appear in four distinct forms depending on who wrote it and where.

Ancient Indian manuscripts present challenges not just for humans but also AI | Source: @BharatGenOfficial

For AI to recognise such variation reliably, it needs lakhs of labelled training examples. For several Indian scripts, those examples number only in the hundreds.

Professor Arjun Ghosh of IIT Delhi’s Department of Humanities and Social Sciences, who worked alongside Arora on Lipikar, frames the variability problem starkly.

“The manuscript world is far, far more diverse than the printed world, and unlike printed texts, it is not yet clear what researchers will encounter as the archive grows,” he said.

For medical texts specifically, he added, the stakes of misreading exceed mere scholarly inconvenience.

For instance, a mistranscribed character in a passage describing compound preparations or dosages is a different category of error than a misread word in a poem.

What future holds

The government’s current strategy favours building large, sovereign language models capable of handling manuscript interpretation as one function among many, AI-researchers ThePrint spoke to said.

BharatGen, a national initiative under the Department of Science and Technology’s National Mission on Interdisciplinary Cyber-Physical Systems (NM-ICPS), is developing foundational AI models tailored to Indian languages across text, speech, and vision modalities. Currently supporting 15 Indian languages, it will eventually cover all 22 scheduled languages. The initiative has already released domain-specific models for Ayurveda (Ayur Param), agriculture (Agri Param), and legal applications (Legal Param).

Professor Arora and Ghosh are betting on a different architecture—smaller, specialised models that run on “edge” devices, which are local computing systems such as small servers, laptops, or even mobile phones, that process data on-site instead of sending it to the cloud. These devices, costing less than Rs 50,000, can operate within a manuscript owner’s premises, ensuring no data leaves the room.

The privacy concern in this area is similar to the issues seen in other areas that deal with sensitive information. Arora explained that a high-resolution scan of a centuries-old document carries information beyond its text—material composition, ink age, and traces invisible to the naked eye.

“Many collectors want their text digitised, not their manuscripts exposed to the external servers,” he said.

Ghosh raises a different concern about what any of these models are trained on. The risk, he argued, lies in feeding AI systems a definition of “Indian” medical knowledge that treats it as self-contained—ignoring centuries of documented exchange with Central Asian, Middle Eastern, and Southeast Asian medical traditions.

“What AI produces depends entirely on what AI is taught and the assumptions built into the archive at this stage will be extremely difficult to dislodge later,” he said.

(Edited by Nida Fatima Siddiqui)

Also Read: Ancient Indian medical system had an image crisis. A new name fixed it

India is racing to decode its ancient manuscripts & medical texts. AI is helping—but only so far

From OCR data extraction to language models, technology is unlocking access, with Gyan Bharatam Mission prioritising digitisation—but interpretation remains a deeply human challenge.

Millions of manuscripts, mostly unread

Help from AI, with limits

Problem with ancient manuscripts

What future holds

LEAVE A REPLY Cancel

Most Popular

India’s running the wrong semiconductor race, time to shift gears. NITI Aayog offers stark advice

NIA’s Pahalgam chargesheet pieces together prelude to terror attack. It relied on human intel, DNA match

Punjab municipal polls: AAP retains its base as Congress, Akalis & BJP fail to set the tone for 2027

COMMENTS

Cancel

Follow us