ECPR

Identifying Social Class and Networks from US Legislative Journals using LLMs

Elites
USA
Methods
Ivan Fomichev
Universität Bern


Abstract

Historical documents, such as nineteenth-century American legislative journals, contain vast amounts of structured information, yet they remain largely inaccessible for quantitative analysis due to non-machine-readable formats and optical character recognition (OCR) noise. This paper presents a scalable, Large Language Model (LLM)-based framework designed to extract structured roll-call data from large corpora of noisy, image-derived text. To overcome LLM context window constraints, the proposed pipeline employs a four-stage "hierarchical narrowing" strategy—indexing, classifying, assembling, and extracting—coupled with internal cross-referencing (matching extracted names to aggregate vote counts) for automated validation. Using the South Carolina House of Representatives (1858–1882) as a demonstration case, the pipeline achieves an extraction accuracy exceeding 95%. The resulting dataset facilitates the construction of legislative co-voting networks, allowing for the analysis of latent factional structures across the dramatic political ruptures of the antebellum, Civil War, and Reconstruction eras. Ultimately, this approach offers a highly generalizable template for recovering sparse, structured data from historical text at scale.
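The internal cross-referencing step and the co-voting network can be illustrated with a minimal sketch. It is not the paper's implementation: the record layout (`yeas`, `nays`, `reported_yeas`, `reported_nays`), the function names, and the sample legislators are all hypothetical, and the sketch assumes the extraction stage has already produced one record per roll call. A roll call passes validation only if the number of extracted names on each side equals the aggregate count printed in the journal; validated roll calls then contribute edge weights between legislators who voted on the same side.

```python
from collections import Counter
from itertools import combinations


def validate_roll_call(record):
    """Return True if extracted name counts match the reported tallies.

    Hypothetical record layout: lists of names under "yeas" and "nays",
    plus the aggregate counts the journal reports for each side.
    """
    return (
        len(record["yeas"]) == record["reported_yeas"]
        and len(record["nays"]) == record["reported_nays"]
    )


def co_voting_edges(records):
    """Count how often each pair of legislators voted on the same side.

    Roll calls that fail validation are skipped, so extraction errors
    do not leak into the network.
    """
    edges = Counter()
    for record in records:
        if not validate_roll_call(record):
            continue
        for side in ("yeas", "nays"):
            # Sort names so each pair has one canonical (a, b) key.
            for a, b in combinations(sorted(set(record[side])), 2):
                edges[(a, b)] += 1
    return edges


# Made-up sample data, for illustration only.
roll_calls = [
    {"yeas": ["Adams", "Brown", "Clark"], "nays": ["Davis"],
     "reported_yeas": 3, "reported_nays": 1},
    {"yeas": ["Adams", "Brown"], "nays": ["Clark", "Davis"],
     "reported_yeas": 2, "reported_nays": 2},
    # One name is missing here, so this roll call fails the check.
    {"yeas": ["Adams"], "nays": ["Brown"],
     "reported_yeas": 2, "reported_nays": 1},
]

edges = co_voting_edges(roll_calls)
```

In this sample, the third roll call is excluded because only one "yea" name was extracted against a reported tally of two. A count match is a necessary check rather than a sufficient one: it catches dropped or duplicated names but not a name misread by OCR.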