ECPR

How much to seed a (partially) seeded topic model?

Keywords: Media, Methods, Quantitative, Agenda-Setting, Communication, Big Data
Salsabil M. Abdalbaki
University College Dublin
Johan A. Dornschneider-Elkink
University College Dublin
Derek Greene
University College Dublin

Abstract

Topic models are an increasingly common method in political science because of the high prevalence of textual data. Topic models are fully data-driven: the algorithm identifies topics based on a bag-of-words representation and the co-occurrence of words. Latent Dirichlet Allocation (LDA) is often used for topic discovery, while the task of assigning labels to the identified topics ultimately falls to the researcher. One example of an application is the identification of media frames in a text corpus. The challenge with these unsupervised topic models when applied in social science research is that the topics do not necessarily include those of theoretical interest to the researcher. Semi-supervised topic models such as seeded LDA and partially seeded LDA overcome this challenge, as they depend on a pre-defined dictionary of seeded topics of interest and seed words representing these topics. This increase in interpretability and in the alignment of the topic model output with the researcher's interests comes at the cost of reduced model fit. When the seeding is too extensive, the researcher's expectations start to drive the output, not the data. Unfortunately, there is very little guidance on how many seed words to add for a seeded topic, or on how many topics to seed.

This paper contributes to the literature on computational frame analysis by investigating the impact of the level of supervision on the coherence and fit of the resulting topic model. The Guardian API (750,000 articles) and the Media Frame Corpus 4.0 (MFC) (20,037 articles) are used to run 100 replications across a range of model specifications on a computer cluster. The baseline model in our simulations is an LDA with 50 topics, against which the partially seeded LDA is compared. The computational simulations follow two steps. First, we assume that every media frame in the MFC represents a seeded topic. We use KeyBERT to extract the relevant seed words for every seeded topic and retain the top 100 words for the simulations. The resulting dictionary of seeded topics and corresponding seed words, constructed from the MFC, is then used to implement the partially seeded LDA on the collected Guardian articles, avoiding bias. Second, using the Hungarian algorithm, the Kullback–Leibler divergence of each seeded topic from its unseeded equivalent is calculated to evaluate the model's performance. This is combined with standard measures of topic coherence.

Based on the sampled news articles, our experimental results confirm that the divergence of the partially seeded LDA from the unseeded LDA increases, and the topic coherence of the partially seeded LDA decreases, as more topics are seeded with more seed words per topic. We provide guidance on the maximum reasonable amount of seeding.
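To illustrate the second evaluation step, the sketch below matches seeded topics to unseeded equivalents by minimising total Kullback–Leibler divergence between topic-word distributions. The distributions are invented toy values, and a brute-force search over permutations stands in for the Hungarian algorithm (which, e.g. via SciPy's `linear_sum_assignment`, would be needed at realistic topic counts); this is not the authors' implementation.

```python
import itertools
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def match_topics(seeded, unseeded):
    """One-to-one matching of seeded to unseeded topics minimising total KL
    divergence. Brute force over permutations replaces the Hungarian
    algorithm at this toy size; both find the optimal assignment."""
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(len(seeded))):
        cost = sum(kl_divergence(seeded[i], unseeded[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Toy topic-word distributions over a 4-word vocabulary (hypothetical values)
seeded = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
unseeded = [[0.1, 0.1, 0.6, 0.2], [0.6, 0.2, 0.1, 0.1]]
perm, cost = match_topics(seeded, unseeded)
# perm pairs seeded topic 0 with unseeded topic 1, and vice versa
```

The matched divergences can then be averaged across replications to track how far the seeded model drifts from its unseeded baseline as seeding increases.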
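The abstract also draws on standard measures of topic coherence. One widely used option is UMass coherence, sketched here over a tiny invented corpus; the word lists and documents are hypothetical, and production work would typically use a library implementation such as Gensim's.

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence for one topic: sum over ordered word pairs of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents containing
    the word(s). Values closer to zero indicate a more coherent topic."""
    doc_sets = [set(d) for d in docs]
    def d(*words):
        return sum(all(w in s for w in words) for s in doc_sets)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log((d(top_words[i], top_words[j]) + 1)
                              / d(top_words[j]))
    return score

# Hypothetical mini-corpus of tokenised documents
docs = [["election", "vote", "ballot"],
        ["election", "vote"],
        ["vote", "ballot"],
        ["weather", "rain"]]
coherent = umass_coherence(["vote", "election", "ballot"], docs)
incoherent = umass_coherence(["vote", "weather", "ballot"], docs)
# The co-occurring word set scores higher than the mixed one
```

Averaging such per-topic scores over all topics gives the model-level coherence that is compared across seeding levels.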