Using Computational Linguistics to Study 19th-century Political Texts
Introduction: Mining the Past with Modern Methods
Historians of the nineteenth century have long relied on painstaking manual analysis of newspapers, speeches, pamphlets, and personal correspondence to understand the forces that shaped the modern world. Yet the sheer volume of surviving material from this era—millions of pages of text produced during the Industrial Revolution, the rise of mass politics, and transformative conflicts like the American Civil War—can overwhelm traditional qualitative methods. Computational linguistics offers a complementary approach: by applying algorithms and statistical models to large text corpora, researchers can detect patterns, track shifts in language, and test hypotheses at a scale impossible for a single scholar. This synthesis of computer science and linguistics does not replace close reading but extends it, enabling historians to ask new questions about political rhetoric, ideology, and social change. The digitization of archives such as the Chronicling America newspaper database and the Hansard parliamentary records has made these millions of pages accessible, but raw access is only the beginning. The real power lies in the analytical methods that can distill meaning from noise, and that is where computational linguistics enters the historian's toolkit.
What Is Computational Linguistics?
Computational linguistics (CL) sits at the intersection of linguistics, computer science, and artificial intelligence. Its core goal is to model natural language so that machines can process, analyze, and even generate human speech and text. In historical research, CL tools typically fall into several categories, each suited to different kinds of historical questions.
Sentiment Analysis
Sentiment analysis assigns numerical or categorical scores to text based on emotional valence—positive, negative, or neutral. For 19th-century political texts, this technique can reveal how public mood shifted during events such as the 1848 Revolutions, the debate over the Corn Laws, or the secession crisis in the United States. By charting sentiment over time, historians can identify periods of rising nationalism, reformist optimism, or war‑weary despair. For instance, a study of American newspapers during the 1860 election cycle might show how Republican-leaning papers shifted from hopeful rhetoric about Lincoln to anxious calls for union as South Carolina debated secession. Sentiment analysis is also used on diplomatic dispatches to gauge how foreign governments perceived threats—a kind of quantitative historical psychology.
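Charting sentiment over time can be sketched with a simple lexicon-based scorer. The example below is a minimal illustration, not a validated method: the word list, the sample sentences, and the function names are all assumptions made for demonstration, and a real study would need a lexicon suited to 19th-century usage.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon; a real study would use a validated,
# period-appropriate sentiment lexicon.
LEXICON = {"hope": 1, "glorious": 1, "prosperity": 1,
           "peril": -1, "ruin": -1, "treason": -1}


def score(text):
    """Mean valence of the lexicon words found in text (0.0 if none)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def sentiment_by_period(docs):
    """Average document score per period, given (period, text) pairs."""
    buckets = defaultdict(list)
    for period, text in docs:
        buckets[period].append(score(text))
    return {p: sum(s) / len(s) for p, s in sorted(buckets.items())}


# Invented sample sentences, for illustration only.
docs = [
    ("1860-05", "We look with hope to a glorious prosperity."),
    ("1860-12", "Treason stalks the land and ruin follows."),
    ("1860-12", "There is yet hope, though peril surrounds us."),
]
trend = sentiment_by_period(docs)
print(trend)  # {'1860-05': 1.0, '1860-12': -0.5}
```

Averaging per document before averaging per period keeps one long editorial from dominating a month's score, which is one of several defensible design choices.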
Topic Modeling
Topic modeling algorithms (e.g., Latent Dirichlet Allocation) automatically uncover latent themes in a collection of documents. Applied to a corpus of 19th-century British parliamentary speeches, topic models might group words like "trade," "tariff," "manufacture," and "empire" into a theme of economic policy, while "suffrage," "representation," and "property" cluster around electoral reform. Researchers can then trace how the prevalence of each theme waxed and waned over decades. Topic modeling has been used to show, for example, that references to religion in British parliamentary debates declined steadily after 1850, while discussions of infrastructure and public health rose. This technique is especially valuable for uncovering the hidden structure of large archives, revealing what topics were actually debated rather than relying on the canonical subjects highlighted by traditional historiographies.
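To make the mechanics of Latent Dirichlet Allocation more concrete, here is a compact collapsed Gibbs sampler written with the standard library only. It is a teaching sketch under stated assumptions: the tiny tokenized documents are invented, the hyperparameters (alpha, beta, iteration count) are arbitrary, and production work would normally rely on an established library and a far larger corpus.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + v * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """The n most frequent words assigned to each topic."""
    return [sorted(tw, key=tw.get, reverse=True)[:n] for tw in topic_word]


# Invented toy documents built from the vocabulary mentioned above.
docs = [
    ["trade", "tariff", "manufacture", "trade", "empire"],
    ["suffrage", "representation", "property", "suffrage"],
    ["tariff", "trade", "empire", "manufacture"],
    ["representation", "suffrage", "property", "representation"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(k, words)
```

The sampler is stochastic, so topic numbering and exact word rankings depend on the seed. On real corpora the number of topics is itself a modeling decision that should be checked against close reading.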
Named Entity Recognition (NER)
NER extracts proper nouns—people, places, organizations, dates—from raw text. For 19th-century texts, NER can help map networks of political actors, track the frequency of references to specific nations or movements (e.g., "Chartism," "Abolition," "Zollverein"), and even reconstruct the geographic spread of news coverage. When applied to thousands of editorials from the 1850s, NER can show that the name "John Brown" appeared far more frequently in Northern newspapers than in Southern ones in the months after the Harpers Ferry raid, but that both regions referenced him in connection with different emotions and arguments. Modern NER models trained on historical data can also identify obsolete place names (e.g., "Constantinople" or "Prussia") and variant spellings of famous figures.
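A trained NER model is beyond the scope of a short example, but the counting step described above can be illustrated with a gazetteer lookup, which is a much simpler stand-in. The gazetteer entries, the sample sentences, and the regional labels below are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical gazetteer mapping surface variants to (type, canonical name).
GAZETTEER = {
    "John Brown": ("PERSON", "John Brown"),
    "Old Brown": ("PERSON", "John Brown"),
    "Harpers Ferry": ("PLACE", "Harpers Ferry"),
    "Harper's Ferry": ("PLACE", "Harpers Ferry"),
    "Constantinople": ("PLACE", "Constantinople"),
}

# Longest variants first so that longer names win over shorter overlaps.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(v) for v in sorted(GAZETTEER, key=len, reverse=True))
    + r")\b"
)


def count_entities(text):
    """Count canonical entities mentioned in the text."""
    return Counter(GAZETTEER[m.group(0)] for m in _PATTERN.finditer(text))


def mentions_by_group(docs):
    """Aggregate entity counts per group, given (group, text) pairs."""
    totals = {}
    for group, text in docs:
        totals.setdefault(group, Counter()).update(count_entities(text))
    return totals


# Invented sample sentences, for illustration only.
docs = [
    ("North", "John Brown is hanged. Old Brown met his fate at Harper's Ferry."),
    ("South", "The outrage at Harpers Ferry alarms every county."),
]
totals = mentions_by_group(docs)
for group, counts in totals.items():
    print(group, dict(counts))
```

Mapping variants such as "Harper's Ferry" and "Harpers Ferry" to one canonical form is what makes counts comparable across newspapers. A statistical NER model would additionally find names that are not in any list.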
Collocation and Keyword Analysis
Collocation analysis identifies words that appear together more often than chance would predict. Studying collocates of "democracy" in American newspapers before and after the Civil War reveals a shift from abstract ideals (e.g., "equality," "liberty") to concrete institutional language (e.g., "party," "ballot," "constitution"). Keyword analysis compares a target corpus to a reference corpus to find words that are unusually frequent in the target, which often signals what contemporaries considered urgent or distinctive. For instance, comparing Liberal speeches from the 1880s with those of their Whig and Radical predecessors in the 1830s might highlight the rising prominence of "old age pensions" and "municipal socialism" even before those terms became official policy.
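Both ideas can be expressed in a few lines. The sketch below uses pointwise mutual information (PMI) as the association measure for collocates and Dunning's log-likelihood (G2) as the keyness statistic. These are common choices but not the only ones, and the token sequence and counts are invented. The PMI estimate is deliberately simplified: it divides window co-occurrence counts by corpus length.

```python
import math
from collections import Counter


def pmi_collocates(tokens, node, window=2, min_count=2):
    """PMI of words occurring within +/- window tokens of node."""
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            co.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    n = len(tokens)
    return {
        w: math.log2((c / n) / ((freq[node] / n) * (freq[w] / n)))
        for w, c in co.items()
        if c >= min_count
    }


def log_likelihood(a, b, c, d):
    """Dunning's G2 for a word seen a times in a target corpus of c tokens
    and b times in a reference corpus of d tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    g2 = 0.0
    if a:
        g2 += a * math.log(a / expected_target)
    if b:
        g2 += b * math.log(b / expected_ref)
    return 2 * g2


# Invented token sequence, for illustration only.
tokens = ("democracy and liberty for all democracy and liberty "
          "the party and the ballot").split()
pmi = pmi_collocates(tokens, "democracy")
print(sorted(pmi, key=pmi.get, reverse=True))  # ['liberty', 'and']

# A word seen 20 times per 100 tokens in the target, 5 per 100 in the reference.
print(round(log_likelihood(20, 5, 100, 100), 2))
```

PMI favors rare words, which is why a minimum co-occurrence count is applied. A G2 value above roughly 3.84 corresponds to p < 0.05 with one degree of freedom, though with large corpora almost every difference becomes significant, so effect sizes deserve attention too.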
Stylometry
Stylometry uses statistical measures of writing style (sentence length, word choice, punctuation habits) to attribute authorship. This technique has been employed to resolve disputed texts, such as anonymous newspaper editorials or the authorship of political pamphlets, giving historians firmer ground for attribution debates. In one application, stylometric evidence suggested that a series of anti-abolitionist pamphlets from the 1830s, published under a pseudonym, were likely written by a single New York lawyer. Such results are probabilistic: they narrow the field of plausible authors rather than prove authorship outright. Stylometry can also detect the evolution of an individual author's style over a career, offering insights into how a politician's rhetoric matured or changed under pressure.
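One widely used stylometric measure is Burrows's Delta, which compares z-scored frequencies of common function words. The sketch below assumes that measure for illustration. The candidate "authors", the disputed text, and the four-word feature list are invented, and with only two candidates the z-scores are degenerate, so real studies use many authors and hundreds of features.

```python
import statistics
from collections import Counter


def rel_freqs(tokens, features):
    """Relative frequency of each feature word in a token list."""
    counts = Counter(tokens)
    return [counts[f] / len(tokens) for f in features]


def burrows_delta(candidates, disputed, features):
    """Burrows's Delta: mean absolute difference of z-scored feature
    frequencies between the disputed text and each candidate author."""
    profiles = {a: rel_freqs(t, features) for a, t in candidates.items()}
    target = rel_freqs(disputed, features)
    columns = list(zip(*profiles.values()))
    means = [statistics.mean(col) for col in columns]
    # Guard against zero spread so the division below is always defined.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]

    def z(vec):
        return [(x - m) / s for x, m, s in zip(vec, means, stdevs)]

    zt = z(target)
    return {
        author: statistics.mean([abs(x - y) for x, y in zip(z(profile), zt)])
        for author, profile in profiles.items()
    }


# Invented texts, for illustration only.
features = ["the", "of", "upon", "whilst"]
candidates = {
    "A": "upon the whole the question of union rests upon the law".split(),
    "B": "whilst the house sits the fate of the bill hangs whilst of late".split(),
}
disputed = "the matter rests upon the will of the people and upon law".split()

delta = burrows_delta(candidates, disputed, features)
print(min(delta, key=delta.get))  # A
```

A lower Delta means a closer stylistic match. The result is evidence to weigh alongside archival context, not a verdict.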
Applying Computational Linguistics to 19th‑Century Political Texts
The 19th century presents both opportunities and obstacles for CL. The rise of mass literacy, the expansion of the newspaper press, and the spread of shorthand reporting produced an explosion of political discourse. At the same time, spelling was often irregular, fonts varied, and the language includes archaisms that challenge modern models. Nevertheless, carefully designed studies have yielded valuable insights that would be nearly impossible to obtain through traditional methods alone.
Case Study: The Language of the American Civil War
One of the richest applications is the study of Civil War rhetoric. By analyzing the Congressional Globe (the forerunner of the modern Congressional Record) alongside thousands of newspaper editorials, researchers have tracked how the term "Union" shifted from a legalistic concept to an almost sacred ideal during the war. Topic modeling of speeches from the Confederate Congress shows that references to "states’ rights" declined sharply after 1863, while language about "independence" and "sacrifice" increased—a pattern consistent with the tightening grip of military necessity. Sentiment analysis of Union soldier letters from the Valley of the Shadow project reveals that morale dipped after major battles like Fredericksburg but rebounded after Gettysburg and Vicksburg, providing a quantitative complement to the diary-based narratives that dominate the literature.
Case Study: British Parliamentary Debates
The digitization of Hansard (the official record of UK parliamentary debates) has opened a vast corpus for CL. Studies have examined how the vocabulary of reform—words like "duty," "philanthropy," "improvement"—evolved across the 19th century. Sentiment analysis of speeches on the Factory Acts reveals that Whig and Liberal MPs used increasingly positive emotional language as the century progressed, while Conservative opposition shifted from moral outrage to practical objections. Such findings help historians move beyond anecdotal impressions of party positions. Another study used topic modeling to compare the language of Irish Nationalist MPs with that of their British counterparts. It uncovered that Irish members consistently framed issues in terms of "land," "religion," and "oppression," while British members focused on "economy" and "administration," highlighting the fundamentally different conceptual worlds that coexisted within the same parliament.
Case Study: Abolitionist and Anti‑Abolitionist Literature
Computational methods have illuminated transatlantic debates over slavery. Topic models applied to British and American abolitionist pamphlets (1830–1865) highlight the emergence of a distinct "moral economy" vocabulary—"sin," "repentance," "blood‑guiltiness"—that differed sharply from the utilitarian language ("productivity," "free labor") used by anti‑abolitionists. Collocation analysis of "emancipation" in both British and American sources shows that British texts framed it as an imperial policy, while American texts connected it overwhelmingly to racial fears and elections. More granular work on African American newspapers like The North Star reveals that Black abolitionists used a different set of collocates for "freedom"—emphasizing "justice," "manhood," and "citizenship"—compared to white abolitionists who more often paired it with "Christianity" and "benevolence."
Tracking the Evolution of Political Vocabulary
A perennial question for political historians is how key concepts change meaning over time. Using frequency analysis and word‑embedding models, researchers have traced shifts in words like "liberal," "conservative," "socialism," and "democracy." For instance, "liberal" in the 1820s British press referred mainly to openness of trade; by the 1880s it was a party label loaded with connotations of state intervention. These semantic shifts reveal the underlying ideological realignments of the century. Word embeddings trained on the Chronicling America corpus (1836–1922) can show that the word "worker" moved from being a simple occupational label to a politically charged term in the latter half of the century, as labor movements grew and the language of class became more pronounced.
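The logic of measuring semantic shift can be shown without training a neural model. The sketch below builds simple co-occurrence count vectors for a word in two periods and compares them with cosine similarity. This is a count-based stand-in for word embeddings, which additionally require large corpora and an alignment step between periods. The sentences are invented.

```python
import math
from collections import Counter


def context_vector(sentences, target, window=3):
    """Count the words appearing within window tokens of target."""
    vec = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for j, t in enumerate(sent[lo:hi], lo) if j != i)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Invented tokenized sentences, for illustration only.
early = [["a", "liberal", "trade", "policy"],
         ["liberal", "commerce", "and", "trade"]]
late = [["the", "liberal", "party", "and", "its", "ministry"],
        ["liberal", "party", "reform"]]

v_early = context_vector(early, "liberal")
v_late = context_vector(late, "liberal")
print(round(cosine(v_early, v_late), 3))  # 0.125
print(v_early.most_common(1), v_late.most_common(1))
```

A low similarity between periods signals that the word keeps different company, which is a prompt for closer reading rather than proof of a change in meaning.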
Case Study: The French Revolution of 1848 and the Rise of Socialism
Beyond the Anglophone world, CL has been used to study the French political press during the Second Republic. Topic models applied to newspapers from 1848 to 1851 reveal the rapid emergence of a distinct socialist vocabulary centered on "atelier" (workshop), "association," and "droit au travail" (right to work). Sentiment analysis shows that the language of the conservative Le Journal des Débats became increasingly alarmist after the June Days uprising, using phrases like "danger rouge" (red danger) with rising frequency. This computational approach helps historians quantify the speed at which the political landscape fractured after the revolution, a process that traditional narrative accounts describe but cannot easily measure.
Benefits for Historians and Educators
Scale and Objectivity
CL allows historians to test claims about broad patterns—for example, that nationalism intensified after 1870—against entire archives rather than a curated sample. The quantitative output provides a check on confirmation bias and forces scholars to account for counter‑evidence. For educators, interactive visualizations of sentiment trends or topic prevalence can make abstract historical forces tangible in the classroom. A tool like Voyant Tools (an open-source text analysis environment) allows students to upload a set of speeches and immediately see word clouds, frequency plots, and collocation networks, turning history classes into mini research labs.
Discovery of Silent Patterns
Algorithms can find patterns that human readers overlook. A stylometric analysis of Harper’s Weekly editorials might reveal that a shift in editorial voice occurred months before a formal declaration of party allegiance—hinting at behind‑the‑scenes factional maneuvering. Topic models applied to petitions to the British Parliament have uncovered clusters of demand for women’s rights that were largely ignored in contemporary debates. In one notable study, topic modeling of petitions to the U.S. Congress during the 1840s revealed that women’s suffrage arguments were often bundled with temperance and anti-slavery language, a pattern that traditional historians had noted anecdotally but never systematically validated.
Integration with Digital Archives
As libraries and archives continue digitizing 19th‑century holdings, CL tools become gateways to these collections. Projects such as the European Parliamentary Debate Corpus and the Chronicling America newspaper database already provide ready‑made corpora. Researchers can combine CL outputs with geographic information systems (GIS) to map the spread of political language across regions. For example, analyzing the collocates of "tariff" in county-level newspapers from the 1820s and 1830s can show which parts of the country used the word in connection with protective tariffs versus revenue tariffs, tracking regional economic interests.
Challenges and Limitations
Archaic Language and Spelling Variants
19th‑century English often differs markedly from contemporary norms. Words such as "chuse" (choose), "shew" (show), and inconsistent capitalization can confuse modern part‑of‑speech taggers and sentiment lexicons. Researchers must often train custom models on historical datasets or use spelling normalization tools. Even then, meaning can be elusive: "gay" in the 1800s connoted merriment, not sexuality, and "awful" could mean awe‑inspiring rather than terrible. The Historical Thesaurus of English and specialized historical lexicons are essential resources for interpreting results.
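A minimal form of spelling normalization is a lookup table applied before any tagging or scoring. In the sketch below the variant table is hypothetical: "chuse" and "shew" come from the discussion above, while the other entries are added purely for illustration. Real projects derive such tables from historical lexicons or trained normalization models.

```python
import re

# Hypothetical variant table mapping historical to modern spellings.
VARIANTS = {
    "chuse": "choose",
    "shew": "show",
    "shewn": "shown",
    "connexion": "connection",
}

_WORD = re.compile(r"[A-Za-z]+")


def normalize(text):
    """Replace known historical spellings, keeping initial capitalization."""
    def fix(match):
        word = match.group(0)
        modern = VARIANTS.get(word.lower())
        if modern is None:
            return word
        return modern.capitalize() if word[0].isupper() else modern
    return _WORD.sub(fix, text)


print(normalize("Shew me whom the people chuse."))
# Show me whom the people choose.
```

Normalization handles spelling but not meaning: words such as "gay" or "awful" pass through unchanged and still need period-aware interpretation.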
OCR Errors in Digitized Texts
Optical character recognition (OCR) of 19th‑century newspapers and books suffers from high error rates due to broken type, ornate fonts, and deterioration. A common error—turning "long s" into "f"—can distort frequency counts. Post‑processing correction pipelines and manual validation are essential for reliable results, adding time and cost to projects. For large projects, researchers often use tools like OCRopus or Tesseract with models fine-tuned on historical fonts. Crowdsourcing platforms like Zooniverse have also helped to correct OCR errors in major collections like the British Library 19th Century Newspapers.
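The long-s problem lends itself to a simple dictionary-guided repair: when a word is unknown, try reading each "f" as "s" and accept the change only if exactly one candidate is a known word. The sketch below illustrates that idea with a hypothetical eight-word lexicon. It is not how OCRopus or Tesseract work internally, and a real pipeline would use a large period-appropriate word list.

```python
import re

# Hypothetical mini word list, for illustration only.
LEXICON = {"congress", "house", "passed", "session", "first", "of", "the", "fast"}


def fix_long_s(word, lexicon=LEXICON):
    """If word is unknown, try reading each 'f' as 's'; return the result
    only when exactly one candidate is a known word."""
    lower = word.lower()
    if lower in lexicon or "f" not in lower:
        return word
    positions = [i for i, ch in enumerate(lower) if ch == "f"]
    candidates = set()
    # Enumerate every non-empty subset of 'f' positions to replace.
    for mask in range(1, 2 ** len(positions)):
        chars = list(lower)
        for bit, pos in enumerate(positions):
            if (mask >> bit) & 1:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate in lexicon:
            candidates.add(candidate)
    if len(candidates) != 1:
        return word  # ambiguous or no match: leave for manual review
    fixed = candidates.pop()
    return fixed.capitalize() if word[0].isupper() else fixed


def correct_text(text):
    """Apply the long-s repair to every alphabetic word in text."""
    return re.sub(r"[A-Za-z]+", lambda m: fix_long_s(m.group(0)), text)


print(correct_text("Congrefs paffed the firft feffion of the Houfe"))
# Congress passed the first session of the House
```

Words that are already valid, such as "of" or "fast", are left alone, and ambiguous cases are not changed, so the repair errs on the side of leaving errors for manual validation rather than introducing new ones.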
Context and Irony
Computational models struggle with sarcasm, irony, and indirect quotation, all of which are common in political rhetoric. A satirical article that mocks expansionist language may be misclassified as genuinely expansionist by a sentiment algorithm. Similarly, standard topic models treat each document as an unordered bag of words and ignore who is speaking, whereas a contemporary reader would know that "the glorious cause" meant different things in a Union address than in a Confederate one. Close reading remains necessary to contextualize computational output. One way to mitigate this is to combine CL with qualitative content analysis, using the algorithm to suggest patterns and then following up with a human check on a random sample of texts.
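The human check on a random sample can be made reproducible with a seeded draw and a simple agreement rate. The sketch below assumes hypothetical document identifiers and labels. The function names are illustrative, and the raw agreement rate is only the simplest of several possible measures.

```python
import random


def draw_validation_sample(doc_ids, k, seed=42):
    """Draw a reproducible random sample of document ids for hand coding."""
    rng = random.Random(seed)
    ordered = sorted(doc_ids)
    return rng.sample(ordered, min(k, len(ordered)))


def agreement_rate(model_labels, human_labels):
    """Share of documents where the model and the human coder agree."""
    shared = model_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no documents were labeled by both sources")
    return sum(model_labels[d] == human_labels[d] for d in shared) / len(shared)


# Hypothetical ids and labels, for illustration only.
ids = [f"doc{i:03d}" for i in range(50)]
sample = draw_validation_sample(ids, 5)
print(sample)

model = {"d1": "pos", "d2": "neg", "d3": "pos", "d4": "neg"}
human = {"d1": "pos", "d2": "neg", "d3": "neg", "d4": "neg"}
print(agreement_rate(model, human))  # 0.75
```

Fixing the seed lets collaborators re-draw the same sample, and sorting the ids first makes the draw independent of the order in which files were loaded.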
Data Biases and Canon Formation
Digitized collections are not neutral. Major archives often reflect the biases of the institutions that created them: large‑circulation metropolitan newspapers are overrepresented, while small‑town radical papers are scarce. CL studies that rely solely on Hansard or the New York Times risk reinforcing an elite‑centered view of history. Researchers must actively seek out marginalized voices—working‑class, immigrant, indigenous, and women’s publications—and incorporate them into their corpora. The Indigenous Newspapers in North America project and the Women’s Suffrage Newspapers collection are examples of efforts to balance the digital landscape. Failing to do so can lead to claims about "public opinion" that actually only represent a small, privileged slice of society.
Future Directions
Historical Language Models
Recent breakthroughs in deep learning, such as transformer‑based models (BERT, GPT), are being adapted for historical texts. Fine‑tuned on 19th‑century corpora, these models can handle irregular spelling, capture long‑range dependencies, and even generate synthetic texts that mimic historical styles. They promise more accurate NER and sentiment analysis, though they require substantial computational resources and careful validation against known facts. The MacBERTh model, pre-trained on historical English texts, and its Dutch counterpart GysBERT show how these tools can be tailored to specific languages and time periods. However, scholars must be cautious about overfitting and ensure that any generated text is used for analytical purposes only, not for misrepresenting historical documents.
Multilingual and Cross‑National Analysis
Political discourse in the 19th century was rarely confined to one language. European revolutions, transatlantic reform movements, and imperial rule produced a multilingual landscape. Future CL tools that can handle code‑switching, translation, and cross‑lingual comparisons will enable historians to study, for example, how the concept of "freedom" traveled between French, German, English, and Spanish texts. Projects like Linguistic Linked Open Data are beginning to create cross‑lingual corpora that allow for such comparisons. A study of revolutionary pamphlets from 1848 across Germany, France, and Italy could reveal whether similar metaphors of "storm" and "awakening" appeared in all three languages, indicating a shared ideological repertoire.
Interdisciplinary Collaboration
The most fruitful work emerges when computational linguists, historians, and librarians collaborate from project design to publication. Combined expertise ensures that research questions are historically grounded, algorithms are appropriate, and digitization priorities are informed by scholarly need. Initiatives like the Digital Humanities Centers at major universities are fostering this kind of teamwork, creating reusable pipelines for future research. For example, the Stanford Literary Lab regularly publishes studies that combine computational methods with deep historical knowledge, setting a standard for the field. Grant agencies like the National Endowment for the Humanities now explicitly fund interdisciplinary teams, recognizing that the best results come from specialists who learn to speak each other’s technical and theoretical languages.
Public History and Citizen Science
CL tools can also engage the public. Online platforms allow volunteers to correct OCR errors, tag named entities, or classify sentiments in historical documents—contributing to research while learning about the past. Such crowdsourcing projects have already improved the quality of corpora used in peer‑reviewed historical studies. The Transcribe Bentham project, which invites volunteers to transcribe the manuscripts of philosopher Jeremy Bentham, has produced a large, high-quality dataset that is now being analyzed with CL. Similar projects for 19th-century political petitions or letters to editors could turn a massive archive into a community resource.
Multimodal Analysis
Political communication in the 19th century was not only textual. Newspapers included woodcut illustrations, maps, and later photographs. Future CL work will integrate optical analysis of images with textual analysis, recognizing that political messages were carried visually as well as verbally. For instance, a study might compare the frequency of images of "liberty" symbols (caps, statues) in radical versus conservative newspapers and correlate that with the sentiment of accompanying text. This multimodal approach will require close collaboration between computer vision specialists and historians.
Conclusion
Computational linguistics offers historians of the 19th century a powerful set of tools to supplement traditional methods of analysis. By measuring sentiment, discovering latent themes, and tracing the evolution of political vocabulary across vast archives, researchers can answer both classic questions and entirely new ones. The approach is not without pitfalls—archaic language, OCR errors, and contextual nuance require constant vigilance—but its potential to reveal patterns invisible to the naked eye makes it an essential component of the modern historian’s toolkit. As digital archives grow and algorithms improve, the synergy between computational linguistics and political history will only deepen, enriching our understanding of a century that laid the foundations of our own. For any historian embarking on this path, the key is to treat computational tools not as a black box that produces definitive answers, but as a lens that sharpens certain questions while reminding us that others still require the nuance of human interpretation.
For further reading, see Stanford University’s Natural Language Processing Group for technical overviews, the JSTOR article “Topic Modeling the British Eighteenth‐Century” for methodological discussion, the British Library’s 19th‐Century Newspapers collection for primary source corpora, and the Chronicling America website for a rich, freely accessible newspaper database. For an introduction to hands-on textual analysis with historical data, the open-source environment Voyant Tools provides a quick and educational entry point.
- Analyzing political speeches for rhetorical strategies using collocation and sentiment.
- Tracking the evolution of political vocabulary through keyword frequency over decades.
- Understanding societal attitudes via topic models of petitions, pamphlets, and editorials.
- Attributing anonymous texts with stylometry to resolve authorship disputes.
- Mapping the geographic diffusion of political language by combining NER with GIS.