Dataframes within text haystacks: Evaluating Automation Framework for Structured Information Synthesis using LLMs

Comparative Politics
Elites
Political Leadership
Political Methodology
Big Data
Yifei Zhu
University of Hong Kong
Songpo Yang
Peking University

Abstract

Large-scale structured data extraction from unstructured text remains a central methodological bottleneck in political science. While existing elite datasets provide valuable static attributes such as birthplace or education, constructing detailed career trajectories requires synthesizing fragmented information scattered across heterogeneous sources—a task that has long resisted automation. This paper introduces a generalizable framework to address this challenge, formalizing the problem as Large Open Context Knowledge Synthesis (LOCKS). We evaluate two large language model (LLM) architectures for tackling LOCKS: an agentic system and a naive Long-Context (LC) approach. Our results demonstrate a clear hierarchy of performance. The agentic system excels in retrieval and synthesis, significantly outperforming baselines by leveraging iterative, targeted information gathering. In contrast, naive LC models exhibit pronounced performance degradation when processing extended texts, exposing critical limitations in current context-window capabilities. However, a crucial finding emerges: even with these limitations, LC models still surpass trained human coders when evaluated on identical source materials. This challenges the long-standing assumption of human annotation as the definitive "gold standard" for complex extraction tasks. Furthermore, we find that our evaluation metrics remain robust whether benchmarked against machine-generated or human-validated Consolidated Ground Truth (CGT), reinforcing the reliability of our assessment pipeline. To demonstrate external validity, we apply the framework to two additional domains—U.S. political elites and OECD cabinet ministers. In both cases, the system generates novel, high-quality datasets at a fraction of the cost of manual coding. Together, these findings illustrate the promise of LLM-based extraction while offering a necessary corrective: although LLMs can outperform human annotators, reliable deployment at scale requires moving beyond "off-the-shelf" models to engineered, agentic workflows that mitigate context-length degradation.
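
As a concrete illustration of the contrast the abstract draws, the Python sketch below places a naive long-context pass next to a minimal agentic loop. It is a schematic reading of the framework rather than the paper's implementation: the llm and search callables, the career schema, and the prompt wording are placeholders assumed for exposition.

```python
# Illustrative sketch only: `llm` and `search` are hypothetical callables
# standing in for a chat-completion client and a document-retrieval tool.
import json
from typing import Callable

# A chat-style model: takes a prompt string, returns the model's text reply.
LLM = Callable[[str], str]
# A retrieval tool: takes a query, returns a list of relevant text snippets.
Search = Callable[[str], list[str]]

# Placeholder target schema for a career trajectory (not the paper's schema).
CAREER_SCHEMA = {
    "name": "string",
    "positions": [{"title": "string", "organisation": "string",
                   "start_year": "int or null", "end_year": "int or null"}],
}


def naive_long_context(llm: LLM, documents: list[str], person: str) -> dict:
    """One-shot extraction: concatenate every source into a single prompt.
    This is the approach whose accuracy degrades as the context grows."""
    corpus = "\n\n---\n\n".join(documents)
    prompt = (
        f"Extract the career trajectory of {person} from the sources below.\n"
        f"Return JSON matching this schema: {json.dumps(CAREER_SCHEMA)}\n\n{corpus}"
    )
    return json.loads(llm(prompt))


def agentic_extraction(llm: LLM, search: Search, person: str,
                       max_steps: int = 5) -> dict:
    """Iterative extraction: the model requests targeted searches, merges the
    new evidence into a running record, and stops when it judges it complete."""
    record: dict = {"name": person, "positions": []}
    for _ in range(max_steps):
        decision = llm(
            f"Current record: {json.dumps(record)}\n"
            "If information is missing, reply 'QUERY: <search terms>'. "
            "If the record is complete, reply 'DONE'."
        ).strip()
        if decision.startswith("DONE"):
            break
        evidence = search(decision.removeprefix("QUERY:").strip())
        record = json.loads(llm(
            f"Update this record with the new evidence and return JSON "
            f"matching {json.dumps(CAREER_SCHEMA)}.\n"
            f"Record: {json.dumps(record)}\n"
            f"Evidence: {' | '.join(evidence)}"
        ))
    return record
```

The design point the sketch is meant to surface is that the agentic loop keeps each individual prompt short and targeted, which is the mechanism the abstract credits for avoiding the context-length degradation observed in the one-shot long-context pass.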