Understanding Regression Analysis

Regression analysis stands as one of the most powerful statistical tools in the economist’s toolkit. At its simplest, the method models the relationship between a dependent variable (the outcome of interest) and one or more independent variables (potential predictors or causes). For economic historians, this means moving beyond narrative descriptions of past events to test hypotheses with measurable evidence. For instance, a researcher might ask whether higher literacy rates in 19th-century England were associated with faster industrial output growth. Regression allows them to quantify that association while controlling for other factors like capital investment or urbanization.

The core idea is to fit a line (or curve) through historical data points that best describes the systematic relationship between variables. The most common form is ordinary least squares (OLS) regression, which minimizes the sum of squared differences between observed values and model predictions. With several predictors, the same approach is called multiple regression and lets many influences be accounted for simultaneously. Economic history also frequently employs logistic regression when outcomes are binary (e.g., whether a firm survived a crisis). Understanding the assumptions behind linear models (linearity, independence of errors, homoscedasticity, and absence of perfect multicollinearity) is critical for drawing valid inferences from historical datasets that are often messy or incomplete.
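
As a minimal sketch of what OLS does, the following Python snippet fits a regression to simulated data. The district-level literacy, capital, and growth figures are invented for illustration, and every variable name is an assumption rather than something drawn from a real study.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients for y = X @ beta + error (X includes a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Hypothetical, simulated data for 200 districts (not real historical figures).
rng = np.random.default_rng(0)
n = 200
literacy = rng.uniform(20, 90, n)   # literacy rate in percent
capital = rng.normal(0, 1, n)       # standardized capital investment (control)
growth = 1.0 + 0.03 * literacy + 0.5 * capital + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), literacy, capital])
beta_hat = ols(X, growth)
residuals = growth - X @ beta_hat   # OLS makes these orthogonal to every column of X

print("intercept, literacy, capital:", np.round(beta_hat, 3))
```

The coefficient on literacy is the estimated association with growth holding capital fixed, which is the sense in which regression "controls for" other factors.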

Applications in Economic History

Labor Markets and Human Capital

Economic historians have used regression to explore how education, skills, and health shaped wages across centuries. A classic study might regress individual earnings on years of schooling, age, and gender, using census data from 19th-century America. The estimated coefficients reveal not only the returns to education at the time but also how those returns varied by region or industry. More sophisticated models incorporate fixed effects for cities or birth years to control for unobserved heterogeneity, offering a clearer window into historical labor market dynamics.

Trade and Globalization

Regression analysis has clarified the economic consequences of trade policies, from the British Corn Laws to modern tariffs. Researchers often use gravity models, a form of regression that predicts bilateral trade flows based on country sizes, distances, and trade agreements. By analyzing historical trade data from the 19th and 20th centuries, historians can estimate how protectionism or free trade agreements affected national income, industrial specialization, and wage inequality. These models highlight that trade openness does not always benefit all segments of society equally, a nuance that simple historical narratives might miss.
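
One common log-linear way of writing a gravity model is shown below as an illustration. The symbols are assumptions introduced here: T_ij is trade between countries i and j, Y_i and Y_j are their economic sizes, D_ij is the distance between them, and A_ij is a dummy equal to one when a trade agreement is in force.

```latex
\ln T_{ij} = \beta_0 + \beta_1 \ln Y_i + \beta_2 \ln Y_j - \beta_3 \ln D_{ij} + \beta_4 A_{ij} + \varepsilon_{ij}
```

Because the variables enter in logs, the coefficients on size and distance can be read as elasticities, and the equation can be estimated with the same least squares machinery used elsewhere in this article.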

Technological Change and Productivity

Understanding how steam power, electricity, and computing influenced economic growth is a central puzzle in economic history. Regression models can decompose productivity growth into contributions from capital deepening, labor quality improvements, and total factor productivity (TFP), a residual that often captures technological progress. For example, a time-series regression of aggregate output on capital and labor inputs yields the input elasticities needed for the growth accounting framework first developed by Robert Solow. When applied to 19th-century U.S. data, such models show that TFP growth surged after 1860, coinciding with the spread of the railroad and telegraph networks.
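
The growth accounting arithmetic can be sketched in a few lines. The growth rates and the capital share of 0.3 below are hypothetical numbers chosen for illustration, and the function name is an assumption.

```python
def solow_residual(g_output, g_capital, g_labor, capital_share):
    """TFP growth: output growth not explained by share-weighted input growth."""
    return g_output - capital_share * g_capital - (1 - capital_share) * g_labor

# Hypothetical annual growth rates (as decimals) and an assumed capital share.
tfp_growth = solow_residual(g_output=0.04, g_capital=0.05, g_labor=0.02, capital_share=0.3)
print(f"TFP growth: {tfp_growth:.3f}")  # prints "TFP growth: 0.011"
```

In this example, inputs account for 2.9 percentage points of the 4 percent output growth, leaving 1.1 points as the residual attributed to TFP.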

Demographic and Health Transitions

Demographic history benefits from regression analyses that link infant mortality, fertility, and life expectancy to economic conditions, sanitation investments, and public health interventions. A study might regress mortality rates in early 20th-century cities on measures of water quality, housing density, and income to determine which factors most reduced deaths. These models can also incorporate spatial or temporal lags, recognizing that health improvements often follow infrastructure investments with a delay. The results inform debates about whether rising incomes or public policies were the primary drivers of the modern mortality decline.

Case Study: The Industrial Revolution Revisited

Regression-based work on the Industrial Revolution merits a fuller treatment. Historians have long debated what caused the breakthrough in British growth after 1760. Traditional accounts emphasize coal, steam, and mechanical inventions. But regression models allow scholars to test competing explanations quantitatively. A seminal 1990s study used cross-county data from England and Wales to regress patent counts (a proxy for innovation) on coal deposits, proximity to rivers, and population density. The models found that coal availability had a statistically significant positive effect on innovation, but that the effect was smaller than previously thought after controlling for urbanization and transport costs.

More recent panel regressions of British industrial output from 1700 to 1850 include variables such as cotton prices, iron production, and agricultural yields. By including time fixed effects, researchers can absorb common shocks like wars or harvest failures, isolating the role of specific technological changes. The results suggest that while the steam engine was transformative, its impact on aggregate productivity was modest until the 1830s, decades after its invention. This contradicts heroic narratives that credit a single invention for the Industrial Revolution. Instead, the regressions point to a cluster of complementary innovations in textiles, iron, and engineering, each interacting with falling energy costs and expanding markets.

The case study also illustrates the importance of model specification. Early regressions that included only capital and labor as predictors often produced “residuals” (the TFP part) that were implausibly large, implying that technology explained nearly all growth. Modern approaches add human capital, institutions, and energy prices, reducing the residual to a more defensible share. This iterative refinement—testing one model, adjusting for omitted variables, re-estimating—is the hallmark of rigorous historical regression work.

Challenges and Considerations

Data Fragmentation and Measurement Error

Historical datasets are rarely as clean as contemporary surveys. Economic historians often work with tax records, parish registers, trade ledgers, and census tables that contain gaps, transcription errors, and inconsistent definitions. Classical measurement error in an independent variable (random noise unrelated to the true value) biases regression coefficients toward zero, a phenomenon known as attenuation bias. Researchers respond by using instrumental variables, running robustness checks, or carefully documenting data construction. For example, a study of 19th-century British wage rates might cross-check data from multiple sources (Board of Trade reports, company payrolls, and local newspapers) and test results under different measurement assumptions.
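
Attenuation bias is easy to see in a simulation. The sketch below uses invented data in which the true slope is 2.0 and the recorded regressor carries random noise with the same variance as the true variable, so the estimated slope should shrink by roughly half.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_x = rng.normal(0, 1, n)              # correctly measured regressor
y = 2.0 * true_x + rng.normal(0, 1, n)    # true slope is 2.0
noisy_x = true_x + rng.normal(0, 1, n)    # same regressor recorded with random error

def slope(x, y):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_true = slope(true_x, y)
slope_noisy = slope(noisy_x, y)

# Expected shrinkage factor: var(x) / (var(x) + var(error)) = 1 / (1 + 1) = 0.5
print(f"clean regressor: {slope_true:.2f}, noisy regressor: {slope_noisy:.2f}")
```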

Omitted Variable Bias

Correlation does not imply causation, and regression alone cannot close that gap. A regression that finds a strong positive link between railroad expansion and economic growth may be missing the fact that both were driven by a third variable, such as rising demand for exports. If that omitted variable correlates with both the independent variable (railroads) and the dependent variable (growth), the estimated coefficient will be biased. Economic historians address this by including theoretically motivated controls, using region or time fixed effects, and sometimes employing natural experiments. The difference-in-differences method, for example, compares a treated group (a region that got a railroad) to a control group before and after treatment, differencing out time-invariant unobserved factors as well as shocks common to both groups.
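
The railroad example can be mimicked with simulated data. In this sketch, which uses invented numbers and hypothetical variable names, export demand raises both railroad expansion and growth, and the true effect of railroads on growth is 0.5.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via least squares; X must include a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(2)
n = 5_000
export_demand = rng.normal(size=n)                     # the confounder
railroads = 0.8 * export_demand + rng.normal(size=n)   # driven partly by demand
growth = 0.5 * railroads + 1.0 * export_demand + rng.normal(size=n)

const = np.ones(n)
short = ols(np.column_stack([const, railroads]), growth)                # omits demand
long = ols(np.column_stack([const, railroads, export_demand]), growth)  # controls for it

print(f"railroad coefficient without control: {short[1]:.2f}")
print(f"railroad coefficient with control:    {long[1]:.2f}")
```

The short regression attributes part of the effect of export demand to railroads, so its coefficient lands well above 0.5; adding the control brings the estimate back toward the true value.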

Spurious Regression and Non-Stationarity

Over long time series (e.g., GDP from 1800 to 2000), many economic variables trend upward. Regressing one trending series on another can produce high R-squared values and impressive t-statistics even if the two are unrelated, a classic case of spurious regression. Economic historians must test for unit roots and cointegration, and often use differenced data or error-correction models to avoid this pitfall. Cointegration techniques, developed in the 1980s, allow researchers to model long-run equilibrium relationships among non-stationary variables while still permitting short-run dynamics. This has been especially useful for studying historical relationships between money supply, prices, and output.
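
A small Monte Carlo exercise illustrates the problem. The sketch below generates pairs of independent random walks (so there is no true relationship), regresses one on the other, and counts how often the slope looks significant at roughly the 5 percent level, first in levels and then in first differences. The series length, number of replications, and critical value of 1.96 are assumptions chosen for illustration.

```python
import numpy as np

def slope_t_stat(x, y):
    """t-statistic on the slope from a bivariate OLS regression with a constant."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(3)
reps, T = 500, 100
reject_levels = reject_diffs = 0
for _ in range(reps):
    a = np.cumsum(rng.normal(size=T))   # independent random walk
    b = np.cumsum(rng.normal(size=T))   # another, unrelated random walk
    reject_levels += abs(slope_t_stat(a, b)) > 1.96
    reject_diffs += abs(slope_t_stat(np.diff(a), np.diff(b))) > 1.96

rate_levels = reject_levels / reps
rate_diffs = reject_diffs / reps
print(f"'significant' in levels: {rate_levels:.0%}, in differences: {rate_diffs:.0%}")
```

In levels, the test rejects far more often than the nominal 5 percent, while differencing restores a rejection rate close to the nominal one.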

Multicollinearity and Sample Size

Historical predictors like education, income, and urbanization are often highly correlated across regions, making it difficult to isolate each variable’s unique contribution. Multicollinearity inflates standard errors, leading to imprecise coefficient estimates. Researchers can address this by using dimension-reduction techniques (e.g., principal components), by pooling data, or by limiting the number of variables. Small sample sizes, another common problem in historical work, reduce statistical power and increase the risk of false negatives. Bayesian methods or bootstrap inference can help, but ultimately researchers must acknowledge the limits of their evidence.
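
One widely used diagnostic is the variance inflation factor (VIF), which measures how much of each predictor is explained by the others. The sketch below computes it by hand on simulated data; the regional variables are invented, and the helper name `vif` is an assumption.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no constant column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Regress predictor j on a constant and all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out[j] = 1 / (1 - r2)
    return out

# Simulated regions: income tracks education closely, urbanization is unrelated.
rng = np.random.default_rng(4)
n = 500
education = rng.normal(size=n)
income = 0.95 * education + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
urbanization = rng.normal(size=n)

vifs = vif(np.column_stack([education, income, urbanization]))
print("VIF (education, income, urbanization):", np.round(vifs, 1))
```

A VIF near 1 indicates little overlap with the other predictors, while large values flag coefficients whose standard errors are inflated by collinearity.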

Advanced Regression Techniques in Economic History

Panel Data and Fixed Effects

Many historical datasets combine cross-sectional units (countries, regions, firms) observed over several time periods. Panel regression exploits both dimensions, allowing researchers to control for unit-specific unobserved factors that are constant over time (e.g., geography, culture) and time factors common to all units (e.g., global recessions). This removes omitted variable bias from those two sources, although confounders that vary within units over time remain a concern. For instance, a study of the economic impact of the Black Death across European cities could use city fixed effects to absorb time-invariant differences in climate or soil quality, while year fixed effects capture pandemic-wide mortality shocks. The coefficient on a variable like “trade routes” then reflects within-city changes in trade over time, net of shocks common to all cities in a given year.
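
For a balanced panel, two-way fixed effects can be implemented by subtracting unit means and period means from each variable before running OLS. The sketch below does this on a simulated city-by-year panel in which an unobserved city trait raises both the regressor and the outcome. The data, the true slope of 1.5, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n_cities, n_years = 50, 20
city_effect = rng.normal(size=(n_cities, 1))   # unobserved, constant over time
year_effect = rng.normal(size=(1, n_years))    # shock common to all cities

# The regressor is correlated with the city effect, so pooled OLS is biased.
x = city_effect + rng.normal(size=(n_cities, n_years))
y = 1.5 * x + 2.0 * city_effect + year_effect + rng.normal(size=(n_cities, n_years))

def two_way_demean(a):
    """Remove unit means and period means (valid for a balanced panel)."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

def slope(x, y):
    """Bivariate OLS slope on flattened, mean-centered data."""
    xc, yc = x.ravel() - x.mean(), y.ravel() - y.mean()
    return (xc @ yc) / (xc @ xc)

beta_pooled = slope(x, y)
beta_fe = slope(two_way_demean(x), two_way_demean(y))
print(f"pooled OLS: {beta_pooled:.2f}, two-way fixed effects: {beta_fe:.2f}")
```

The pooled estimate absorbs the city trait into the slope, whereas the within estimate relies only on variation inside each city over time and recovers a value close to 1.5.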

Instrumental Variables

When endogeneity is suspected, meaning the independent variable is correlated with the error term, economists turn to instruments: variables that are correlated with the endogenous regressor (relevance) but affect the outcome only through that regressor (the exclusion restriction). In historical contexts, instruments often come from natural experiments or historical accidents. For example, to estimate the causal effect of property rights on agricultural investment in colonial India, researchers might use the timing of land revenue settlements as an instrument, arguing that settlement assignment was unrelated to local soil quality. A two-stage least squares regression then isolates the exogenous component of property rights enforcement. The use of instrumental variables has grown rapidly in economic history, especially for studying long-run development outcomes.
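
The two stages can be written out by hand. In the simulated example below, an unobserved confounder `u` biases OLS upward, while the instrument `z` shifts the regressor without entering the outcome equation directly. All data and names are invented, and the true effect is set to 1.0.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via least squares; X must include a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(6)
n = 10_000
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)         # endogenous regressor
y = 1.0 * x + 2.0 * u + rng.normal(size=n)   # outcome; true effect of x is 1.0

const = np.ones(n)
beta_ols = ols(np.column_stack([const, x]), y)[1]

# Stage 1: predict x from the instrument. Stage 2: regress y on the prediction.
Z = np.column_stack([const, z])
x_hat = Z @ ols(Z, x)
beta_iv = ols(np.column_stack([const, x_hat]), y)[1]

print(f"OLS: {beta_ols:.2f}, 2SLS: {beta_iv:.2f}")
```

This manual version gives the correct point estimate, but standard errors taken from the second-stage regression would be wrong, so applied work typically relies on a dedicated routine for inference.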

Difference-in-Differences and Synthetic Controls

Difference-in-differences (DiD) compares the evolution of an outcome in a group that experienced a policy change relative to a group that did not, before and after the change. Economic historians have applied DiD to evaluate the introduction of compulsory schooling laws, banking regulations, and suffrage extensions. The key identifying assumption is that the trends in the two groups would have been parallel in the absence of the policy. This counterfactual cannot be tested directly, but pre-treatment data can show whether the groups were moving in parallel before the change, which makes the assumption more or less plausible. Synthetic control methods extend DiD by constructing a weighted combination of comparison units that closely matches the treated unit’s pre-treatment path, providing a more transparent, data-driven counterfactual. This has been used, for example, to estimate the economic impact of the division of Korea after World War II or the consequences of German reunification.
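
In the simplest two-group, two-period case, the DiD estimate is the coefficient on the interaction between a treatment dummy and a post-period dummy, and it equals the difference of the two before-and-after changes in group means. The sketch below checks this on simulated data with a true policy effect of 2.0; the numbers and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
treated = rng.integers(0, 2, n)   # 1 if the unit belongs to the treated group
post = rng.integers(0, 2, n)      # 1 if observed after the policy change
y = 1.0 + 0.5 * treated + 0.3 * post + 2.0 * treated * post + rng.normal(0, 1, n)

# Regression form: the interaction coefficient is the DiD estimate.
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_regression = beta[3]

def cell_mean(t, p):
    """Mean outcome for a given group (t) and period (p)."""
    return y[(treated == t) & (post == p)].mean()

# Means form: (treated after - treated before) - (control after - control before).
did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(f"regression: {did_regression:.2f}, difference of means: {did_means:.2f}")
```

The regression form is usually preferred in practice because it extends naturally to additional controls and fixed effects.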

Time‐Series Methods for Structural Breaks

Historical time series often exhibit structural breaks—sudden changes in the mean or trend due to wars, financial crises, or policy regime shifts. Traditional regression techniques may fail to detect or properly model these breaks. Modern approaches like endogenous breakpoint tests (e.g., Bai–Perron tests) allow the data to identify break dates, which can then be incorporated into regression models with dummy variables. A study of U.S. industrial production from 1880 to 1940 might uncover breaks in 1914 (World War I) and 1929 (Great Depression). By modeling these breaks, the researcher can avoid bias from assuming a single constant relationship across fundamentally different periods.
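
The logic of letting the data choose a break date can be illustrated with a deliberately simplified search for a single shift in the mean. This is not the Bai-Perron procedure, which handles multiple breaks and provides formal tests; it is only a sketch of the underlying least squares idea. The series below is simulated with a level shift in 1914, and the trimming of five observations at each end is an assumption.

```python
import numpy as np

years = np.arange(1880, 1941)
rng = np.random.default_rng(8)
# Simulated index: the mean shifts from 0 to 3 starting in 1914.
series = np.where(years >= 1914, 3.0, 0.0) + rng.normal(0, 0.5, years.size)

def ssr_with_break(y, k):
    """Sum of squared residuals when y[:k] and y[k:] each get their own mean."""
    left, right = y[:k], y[k:]
    return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

trim = 5  # keep at least five observations in each regime
candidates = range(trim, years.size - trim)
best_k = min(candidates, key=lambda k: ssr_with_break(series, k))
break_year = int(years[best_k])   # first year of the new regime

print("estimated break year:", break_year)
```

Once a break date is identified, a dummy variable equal to one from that year onward can be added to the regression so that the two regimes are not forced to share a single intercept.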

Conclusion and Future Directions

Regression analysis has become indispensable to economic history, transforming the discipline from a purely narrative field into one that tests hypotheses with statistical rigor. By quantifying relationships, controlling for confounders, and identifying causal effects, regression models have deepened our understanding of past economic transformations—from the Industrial Revolution to the Great Depression. The best work in the field openly acknowledges limitations, uses robust identification strategies, and cross-validates findings across different data sources and methods.

Looking ahead, economic historians are increasingly adopting machine learning techniques, such as random forests and lasso regression, to handle high-dimensional historical data and to uncover non-linear relationships that traditional OLS might miss. Text‐as‐data approaches (e.g., counting words in historical newspapers) are also being combined with regression to measure cultural attitudes, political discourse, and institutional quality. These innovations promise to open new frontiers, but the fundamental principles of regression analysis—careful specification, awareness of bias, and cautious interpretation—remain as relevant as ever. For those interested in diving deeper, resources like the NBER’s guide to regression in economic history and the Cambridge Economic History of the Modern World offer excellent starting points. The Economic History Association website and VoxEU columns provide accessible summaries of recent research. As data availability improves and computational tools expand, regression analysis will continue to illuminate the economic forces that have shaped our past—and offer lessons for the future.