The Use of Computational Text Analysis to Trace the Spread of Literacy

From Inks to Algorithms: How Computational Text Analysis Illuminates the Spread of Literacy

The story of literacy is not simply a tale of more people learning to read and write. It is a complex, uneven process shaped by economics, religion, politics, and technology. For centuries, historians relied on proxies—school enrollment records, book ownership inventories, and anecdotal accounts—to estimate when and where literacy took hold. These methods yielded valuable but fragmented views. Today, a new set of tools is transforming the field. Computational text analysis allows researchers to mine millions of pages of historical writing, detecting patterns invisible to the human eye. By measuring changes in vocabulary, syntax, and genre, scholars can trace the diffusion of literacy with unprecedented precision.

This article explains how computational text analysis works, why it matters for understanding literacy’s history, and what its findings reveal about the movement of reading and writing skills across time and space. It also explores the method’s limitations and the critical questions it raises for future research.

What Is Computational Text Analysis?

Computational text analysis (CTA) refers to a suite of techniques that use algorithms to process, quantify, and interpret large collections of written texts. Unlike close reading, which focuses on a single document or a small corpus, CTA operates at scale, identifying statistical patterns across thousands or millions of works. Core methods include:

Frequency analysis – counting how often words or phrases appear over time to track thematic shifts.
Topic modeling – a machine learning technique that groups words into clusters (topics) based on co-occurrence patterns, revealing latent themes in a corpus.
Sentiment analysis – measuring the emotional tone of texts to gauge public mood or authority stance.
Stylometric analysis – comparing readability scores, sentence length, and vocabulary richness as proxies for audience sophistication.
Geoparsing and named-entity recognition – extracting locations, people, and organizations to map spatial patterns of discourse.
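Vocabulary richness, one of the stylometric measures listed above, is often approximated by the type-token ratio: the number of distinct words divided by the total number of words. The following is a minimal sketch under stated assumptions, not a method taken from any particular study. The tokenizer is deliberately naive, and both sample sentences are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def type_token_ratio(text):
    """Distinct words (types) divided by total words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Invented sample sentences, not drawn from any historical corpus.
plain = "the cat sat on the mat and the cat sat on the hat"
varied = "a weary traveller crossed seven distant rivers before nightfall"

print(round(type_token_ratio(plain), 3))   # 7 types over 13 tokens
print(round(type_token_ratio(varied), 3))  # every token is distinct
```

The ratio falls as texts get longer, so comparisons are usually made on samples of equal length.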

These methods depend on digitized texts. Over the past two decades, massive digital libraries—such as Google Books, the HathiTrust Digital Library, and the British Newspaper Archive—have made hundreds of billions of words available to researchers. When paired with computing power, these corpora become laboratories for studying cultural evolution.

Why Literacy Leaves a Digital Trace

Literacy is not merely a skill; it is a social practice embedded in the production of texts. As more people become literate, the volume of writing increases, its language changes, and its audiences diversify. Computational analysis captures these transformations in at least three ways:

First, vocabulary change. Newly literate populations often adopt simplified grammatical structures and more concrete vocabulary before moving toward abstraction. By tracking the frequency of function words (articles, prepositions) relative to content words (nouns, verbs), researchers can infer changes in genre and audience.
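The function-word measure can be sketched in a few lines. This is an illustrative example only: the word list FUNCTION_WORDS is a small hand-picked assumption, whereas a real study would use a fuller list or a part-of-speech tagger, and the sample sentence is invented.

```python
import re

# A small, hand-picked set of English function words (an assumption made
# for this sketch; it is far from complete).
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and",
    "but", "or", "for", "with", "by", "at", "from",
}

def function_word_share(text):
    """Fraction of tokens in the text that are function words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in FUNCTION_WORDS)
    return hits / len(tokens)

# Invented sample sentence.
sample = "The farmer went to the market with a cart of apples"
print(round(function_word_share(sample), 3))  # 6 function words in 11 tokens
```

Computing this share for each decade or region of a corpus gives a time series that can be compared against schooling records.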

Second, genre proliferation. As literacy spreads, new text types emerge: almanacs, pamphlets, personal letters, diaries, and eventually newspapers and novels. Computational methods can classify documents by genre automatically, allowing historians to see when and where certain forms became common.

Third, geographic diffusion. Using metadata about place of publication or authorship, researchers can map how lexical innovations—such as new words or spelling conventions—travel along trade routes, railway lines, or postal networks. These patterns often correlate with the expansion of schooling and book distribution.
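One simple way to approximate such a diffusion map is to record the earliest year in which a term appears at each place of publication. The sketch below assumes metadata in the form of (place, year, text) tuples. The records are invented for illustration and do not come from any real catalogue.

```python
def first_attestations(records, term):
    """Map each place to the earliest year in which `term` appears there."""
    earliest = {}
    for place, year, text in records:
        if term in text.lower().split():
            if place not in earliest or year < earliest[place]:
                earliest[place] = year
    return earliest

# Invented records: (place of publication, year, text).
records = [
    ("London", 1838, "news sent by telegraph today"),
    ("York", 1846, "the telegraph office opened"),
    ("London", 1850, "telegraph rates reduced"),
    ("Leeds", 1843, "a fair was held"),
    ("Leeds", 1849, "Telegraph line reaches the town"),
]

# Sort places by first appearance to see the order of adoption.
order = sorted(first_attestations(records, "telegraph").items(),
               key=lambda item: item[1])
print(order)
```

Plotting these first-appearance dates on a map alongside railway or postal routes is one way to test whether a word travelled along them.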

Early Applications: The Rise of Vernacular Languages

From Latin to the People’s Tongue

One of the earliest applications of CTA to literacy history involved tracing the shift from Latin to vernacular languages in Europe. In the Middle Ages, literary production was dominated by Latin, accessible only to clergymen and a thin elite. By the sixteenth century, print enabled the mass production of books in German, French, English, and Italian. Computational analyses of the Early English Books Online corpus show that the proportion of English-language publications rose from under 10% in 1500 to over 80% by 1700. This shift was not uniform; it occurred earlier in Protestant regions where vernacular Bibles were encouraged, and later in Catholic areas where Latin retained liturgical authority.

Measuring Readability as a Literacy Proxy

Researchers have also used readability formulas—developed for modern education—on historical texts. The Flesch Reading Ease score, when applied to eighteenth-century British pamphlets, reveals a steady decline in complexity as printers targeted lower-skilled readers. Texts aimed at “common readers” used shorter sentences, fewer syllables per word, and more concrete references. This pattern correlates with periods of rapid school expansion, such as the growth of Sunday schools in England after 1780.
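The Flesch Reading Ease score combines average sentence length with average syllables per word; higher scores indicate easier text. The sketch below uses the standard published constants, but the syllable counter is a crude vowel-group heuristic assumed for illustration, and the two sample passages are invented.

```python
import re

def count_syllables(word):
    """Crude estimate: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Invented sample passages.
simple = "The cat sat. The dog ran."
dense = ("Ecclesiastical authorities deliberated extensively "
         "regarding doctrinal uniformity.")

print(round(flesch_reading_ease(simple), 2))
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))
```

Historical spelling and punctuation can distort both the sentence splitter and the syllable estimate, so scores for early texts are best read as relative rather than absolute.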

Case Studies in Computational Literacy History

1. Nineteenth-Century Europe: The Urban Literacy Boom

Using digitized newspaper archives such as the Chronicling America database (Library of Congress), historians have examined the explosion of local newspapers across the United States and Europe. In Prussia, for example, newspapers per capita doubled between 1820 and 1860. Topic modeling of these newspapers reveals a shift from religious and agricultural content to political news and advertisements, a change that presupposes a reading public with basic literacy and civic interest. Geoparsing shows that the spread of newspaper readership closely followed railway openings, which also carried schoolteachers and textbooks to rural areas.

Computational analysis of letters to the editor in French provincial newspapers from 1830 to 1870 indicates a correlation between literacy rates and the frequency of complex political arguments. In regions with higher school enrollment, letters used the subjunctive mood and abstract nouns more often, suggesting that literacy enabled readers to engage with abstract concepts like democracy and rights.

2. Early Modern England: The Print Revolution

The first mass literacy campaign in the Anglophone world occurred in England during the sixteenth and seventeenth centuries, driven by the Protestant Reformation’s emphasis on reading the Bible. Computational analysis of the English Short Title Catalogue (ESTC) shows that between 1550 and 1640, the number of titles published per decade increased by a factor of ten. A key shift was the rise of “cheap print” – broadside ballads, almanacs, and chapbooks – whose simple syntax and repetitive structures indicate that publishers expected a semi-literate audience.

One study used topic modeling on 20,000 English pamphlets from 1600–1700. The model revealed a distinct “instructional” topic cluster containing words like “read,” “spell,” “catechism,” and “child.” The proportion of texts in this topic peaked during the 1640s and 1650s, a period of revolutionary upheaval when Parliament promoted literacy among soldiers and commoners. The geographic spread of these pamphlets, traced through imprint locations, shows that literacy education materials moved outward from London to market towns, then to villages—a pattern repeated a century later in colonial America.

3. Colonial and Post-Colonial Contexts

Computational text analysis is now being applied to the spread of literacy in colonized societies. Researchers at the University of Helsinki used n-gram analysis on nineteenth-century Indian newspapers in English and regional languages. They found that the use of English words in vernacular newspapers increased rapidly after 1857, indicating that a bilingual literate class was emerging. Sentiment analysis of editorials in Bengali newspapers from 1860–1900 shows a shift from deferential language to assertive nationalist rhetoric, which would have required a readership comfortable with complex political argument and so points to deepening literacy among the Indian middle class.

In sub-Saharan Africa, missionary-produced texts provide the earliest written material in many languages. Computational stylometry (comparing authorial styles) suggests that early translations of the Bible into Yoruba and Xhosa simplified native grammatical structures to match the reading level of newly literate converts. This had lasting effects on the development of these written languages.

Methodological Innovations: How Researchers Extract Signals

N-Gram Analysis

Perhaps the simplest computational tool is the n-gram, a contiguous sequence of n words. Google Books’ Ngram Viewer popularized the ability to plot the frequency of phrases over centuries. For literacy studies, n-grams reveal the adoption of new words—such as “telegraph” or “newspaper”—that indicate expanding information networks. One study of British English n-grams from 1700–1900 found that the phrase “to read” increased in frequency by 400% between 1750 and 1850, while the phrase “to write” increased even faster after 1830, suggesting that writing skills spread later than reading skills.
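Extracting n-grams and comparing a phrase's relative frequency across periods takes only a few lines. The sketch below is illustrative: it splits on whitespace rather than using a proper tokenizer, and the two mini-corpora are invented, so the numbers say nothing about real historical usage.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-word sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency(texts, phrase):
    """Share of all n-grams in `texts` that match `phrase`."""
    target = tuple(phrase.lower().split())
    n = len(target)
    counts = Counter()
    total = 0
    for text in texts:
        grams = ngrams(text.lower().split(), n)
        counts.update(grams)
        total += len(grams)
    return counts[target] / total if total else 0.0

# Invented mini-corpora keyed by period.
corpus = {
    "1750s": ["he learned to plough and to sow", "she went to market"],
    "1850s": ["she learned to read at school", "he loved to read and to write"],
}

for period, texts in corpus.items():
    print(period, round(relative_frequency(texts, "to read"), 3))
```

Dividing by the total number of n-grams matters because later periods usually contain far more text; raw counts alone would rise even if usage stayed constant.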

Topic Modeling Across Time

Topic modeling, using algorithms like Latent Dirichlet Allocation (LDA), clusters vocabulary into topics that humans can interpret. Applied to the Eighteenth Century Collections Online (ECCO) corpus, topic modeling identified a clear transition from volumes dominated by religious and classical topics to those dominated by science, commerce, and fiction after 1760. This transition aligns with the rise of the “reading public,” a demographic shift made possible by expanded literacy. When researchers compare topic models for texts published in London versus those from provincial towns, they find that London topics changed faster—again evidence that literacy (and its associated cultural capital) moved from urban centers to the periphery.
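To make the mechanics of LDA less opaque, here is a toy collapsed Gibbs sampler written in plain Python. It is a teaching sketch, not the implementation used in any study mentioned here. The function name lda_gibbs, the hyperparameter values, and the four tiny documents are all assumptions chosen for illustration, and research projects would normally rely on an established library instead.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            topic_word[z][word] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # Resample its topic in proportion to how well each topic
                # fits both the document and the word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                topic_word[z][word] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word, doc_topic

# Invented documents: two on religious themes, two on commerce.
docs = [
    ["prayer", "sermon", "church", "prayer", "psalm"],
    ["sermon", "church", "psalm", "prayer"],
    ["trade", "market", "price", "trade", "goods"],
    ["market", "goods", "price", "trade"],
]
topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top_words = sorted(counts, key=counts.get, reverse=True)[:3]
    print(k, top_words)
```

The output is a set of word clusters that still require human interpretation; the algorithm does not label a topic as "religion" or "commerce" on its own.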

Stylometric Readability Metrics

Beyond content, computational tools measure the “readability” of texts. The Coleman-Liau index, originally designed for modern English, estimates the school grade level needed to understand a text and can be adapted to historical texts by adjusting for spelling changes. Applied to a corpus of 10,000 American novels from 1770–1920, index scores dropped significantly after 1840, when the common school movement began; in other words, the novels became easier to read. This suggests that authors increasingly targeted readers with limited formal schooling. The same pattern appears in British non-fiction: works of popular science from the 1850s used shorter sentences than comparable works from the 1790s.
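The Coleman-Liau index is computed from letters and sentences per 100 words, so it needs no syllable counting. The sketch below uses the standard formula with a naive sentence splitter; it does not attempt the spelling adjustments for historical texts, and both sample passages are invented.

```python
import re

def coleman_liau_index(text):
    """0.0588 * L - 0.296 * S - 15.8, where L is letters per 100 words
    and S is sentences per 100 words. Lower values mean easier text."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one sentence and one word")
    letters = sum(len(word) for word in words)
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

# Invented sample passages.
easy_text = "The cat sat. The dog ran."
hard_text = ("Ecclesiastical authorities deliberated extensively "
             "regarding doctrinal uniformity.")

print(round(coleman_liau_index(easy_text), 2))
print(coleman_liau_index(easy_text) < coleman_liau_index(hard_text))
```

Because the index approximates a grade level, a falling score over time corresponds to texts becoming easier, the opposite direction from the Flesch Reading Ease score.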

Benefits of Computational Analysis for Literacy History

Scale: CTA can process millions of texts in hours, a task impossible for a single scholar.
Objectivity: Quantitative measures reduce the risk of cherry-picking evidence to support a pre-existing narrative.
Comparability: The same methods can be applied to different languages, regions, and periods, enabling cross-cultural analysis.
Visualization: Graphs and maps make trends visible instantly, aiding both research and public communication.
Hypothesis generation: Unexpected patterns in the data can lead researchers to ask new questions about causality.

For example, an analysis of German-language newspapers from 1848–1850 unexpectedly revealed a sharp drop in vocabulary diversity in conservative publications, while liberal publications increased theirs. That hinted at censorship suppressing expression among conservative writers, which seemed paradoxical because conservatives were in power. Further investigation showed that conservative newspapers were closing down, reducing the variety of voices. This insight led to a revised understanding of literacy’s political role during revolutionary periods.

Challenges and Limitations

Digitization Bias

Not all historical texts survive, and those that do are not digitized evenly. European archives have prioritized government records and canonical literature over ephemera like trade cards or personal letters. As a result, computational analyses may overrepresent elite male perspectives. Libraries in the global South are often under-digitized, skewing our picture of literacy’s global spread.

OCR Quality

Optical character recognition (OCR) software often struggles with historical fonts, faded ink, and irregular layouts. Error rates can be as high as 30% for early modern texts. If researchers do not clean the data, frequency counts become unreliable. Tools like OCR-D are improving accuracy but require expertise.
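OCR quality is commonly summarized as a character error rate: the edit distance between the OCR output and a hand-checked transcription, divided by the length of that transcription. The sketch below implements this with a standard Levenshtein distance. The two strings are invented examples of typical misreadings, not output from any real OCR engine.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def character_error_rate(ocr_text, reference):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)

# Invented example: three misread characters in a 20-character line.
reference = "the reading of books"
ocr_output = "tbe readinq of bocks"
print(character_error_rate(ocr_output, reference))
```

Measuring the rate on a small hand-corrected sample gives a sense of how far frequency counts from the full corpus can be trusted.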

Language Complexity

Languages with complex morphology (Finnish, Arabic, Quechua) present challenges for tokenization. Also, historical spelling was not standardized; “read” might appear as “rede,” “reade,” or “reed.” Researchers must either modernize spelling or use character-level models that are more computationally expensive.
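The simplest form of spelling modernization is a lookup table that maps known variants to a modern form before counting. The table VARIANTS below is a hypothetical three-entry example built from the spellings mentioned above; real projects use much larger, period-specific resources, and the sample sentence is invented.

```python
import re
from collections import Counter

# Hypothetical variant table for illustration only. Note that "reed" is
# also a modern word in its own right, so a blind lookup like this one
# can introduce errors that context-aware methods would avoid.
VARIANTS = {"rede": "read", "reade": "read", "reed": "read"}

def normalize(tokens, variants=VARIANTS):
    """Replace each known variant spelling with its modern form."""
    return [variants.get(token, token) for token in tokens]

# Invented sample sentence.
text = "He can rede and she can reade but few reed well"
tokens = re.findall(r"[a-z]+", text.lower())

raw_counts = Counter(tokens)
normalized_counts = Counter(normalize(tokens))

print(raw_counts["read"])         # the modern spelling never occurs
print(normalized_counts["read"])  # all three variants now count together
```

Without this step, a frequency count for “read” would miss every variant spelling and understate how often the word was actually used.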

Interpretive Caution

Correlation is not causation. A rise in the word “freedom” in 1790s American newspapers may reflect literacy growth, or it may simply reflect the French Revolution as a topic. Computational findings must be grounded in historical context and triangulated with archival sources. Computational analysis is not a replacement for traditional scholarship but a complement to it.

Skill Barriers

Computational text analysis requires programming skills (Python, R) and familiarity with statistics. Many history departments still lack training in these areas, though digital humanities centers are growing. Open-source platforms like Voyant Tools lower the barrier for beginners.

Ethical and Epistemological Questions

As with any digital method, CTA raises questions about what counts as evidence. Do word frequencies really measure literacy, or merely the interests of publishers? Does a topic model reveal author intention or just the constraints of genre? The field is still debating these issues. One emerging best practice is to combine computational analysis with qualitative close reading of representative texts, an approach sometimes described as using “distant reading” and “close reading” in tandem.

Another concern is algorithmic bias. Topic models are trained on whatever texts are available; if a corpus is 90% male-authored, the topics will reflect male concerns. Researchers must actively seek to include women’s writing, colonial literature, and other marginalized voices. Projects like The Women’s Print History Project are building more inclusive corpora for exactly this purpose.

Future Directions

Multilingual and Cross-Script Analysis

Most CTA work has focused on English. But the spread of literacy was often multilingual: many people read in two or more languages. New models like multilingual BERT allow researchers to analyze texts in multiple languages simultaneously, revealing how ideas and words moved across linguistic borders. For example, preliminary work on the Mediterranean world shows that literacy in Arabic, Spanish, and Catalan coexisted in early modern Valencia, with computational analysis showing lexical borrowing patterns that reflect religious and commercial contact.

Small Data and Infrastructure

Not all research requires millions of texts. “Small data” computational methods—applied to a few hundred carefully curated documents—can yield insights about specific communities. The rise of digital humanities centers in Africa, Asia, and Latin America means more local actors can tell their own literacy histories using computational tools.

Machine Learning and Handwritten Text Recognition

Until recently, computational text analysis was limited to printed texts. But handwritten text recognition (HTR) is improving rapidly, allowing scholars to analyze diaries, personal letters, and administrative records—the very documents that were often the first evidence of literacy for common people. Projects like Transkribus already handle historical handwriting with over 95% accuracy for some scripts. This will open up vast new archives for literacy research.

Conclusion

Computational text analysis is not a magic wand, but it is a powerful lens. By turning billions of words into quantifiable patterns, it allows historians to see the slow, uneven spread of literacy as it actually happened: not as a smooth curve but as a series of surges, plateaus, and reversals shaped by war, religion, trade, and policy. The method has shown that vernacularization preceded literacy in many contexts, that textual complexity declined as literacy expanded, and that print networks were critical to the diffusion of reading skills.

Yet the most important lessons may be about the limits of the evidence. What computational text analysis reveals is largely the literacy of those who could afford to print and store texts. The truly marginal—those who left no written trace—remain hidden. But by combining these new tools with traditional archival work, researchers can listen more carefully to the voices that survived, and ask better questions about the literacy of the past.

As more texts are digitized and algorithms become more sophisticated, our understanding of how reading and writing spread will only deepen. For now, the field stands at a promising threshold, where the old and the new—the ink and the algorithm—work together to illuminate one of the most consequential developments in human history.