ECPR

The Effect of Parameters on the Quality of topic modeling at the semantic space

Analytic
Big Data
Empirical
Paul Drecker
University of Münster

Abstract

Growing access to large-scale textual data such as newspapers, speeches, and social media posts opens up new perspectives and possibilities for answering political science questions. Recent studies show that newly developed models based on large pre-trained language models map topics more precisely than previously used approaches such as LDA. At the same time, these models have two disadvantages. First, because of the increasingly large data sets needed, the newer models require substantial computing resources and calculation time, which only a limited number of researchers can afford. Moreover, increased computing capacity implies increased electricity consumption, which raises both costs and carbon dioxide emissions. Second, the newer models have more parameters that influence the quality of the topics. Various measures of topic quality are known in topic modeling. However, the established approach is to use these metrics as a guide and then set the final parameters based on the researchers’ experience. As the number of parameters grows, so do the number of possible models and the complexity of choosing among them. At the same time, it is unknown how much a change in the parameters affects the assignment of documents to topics. If a change in a parameter leads to a significant shift in the assignment, that parameter should be considered in the final selection of the model. Hence, this article focuses on two related research questions: a) which parameters affect the quality of the topics, and b) does modifying the parameters lead to a significant shift in the assignment of documents to topics? Answering these questions will provide evidence on whether researchers can focus on a specific subset of parameters when running these newer models, sparing calculation resources and time. The results also provide an understanding of the robustness of the models to parameter changes.
Accordingly, when selecting the final model, researchers can mainly control for the parameter changes that strongly affect the document-to-topic assignment. To provide these answers, this analysis observes the effect of different parameters by computing a large number of models with varying parameter combinations on datasets with documents of varying lengths. To evaluate the quality of the models, I employ a set of quality metrics. Additionally, I observe the changes in document-to-topic assignments across all parameters and analyze the distribution of these assignments. This allows me to gain insight into the effects of the different parameters and to make informed comparisons between them.
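
One way to quantify a shift in document-to-topic assignments between two parameter settings is sketched below. The abstract does not name a specific measure, so the adjusted Rand index used here is an assumption chosen for illustration. It has the convenient property that it ignores how topics happen to be numbered in each run. The parameter names (`n_neighbors`, `min_topic_size`) and the toy labels are hypothetical and are not results from the study.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two document-to-topic assignments.

    1.0 means the two runs group the documents identically (even if the
    topic ids differ); values near 0 mean chance-level agreement.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same documents")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Count document pairs that share a topic in both runs, in run A, in run B.
    same_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_in_a * same_in_b / total_pairs
    maximum = (same_in_a + same_in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both runs use a single topic
    return (same_in_both - expected) / (maximum - expected)


def pairwise_shifts(assignments):
    """Compare every pair of parameter settings.

    `assignments` maps a parameter tuple to one topic label per document.
    """
    return {
        (p, q): adjusted_rand_index(assignments[p], assignments[q])
        for p, q in combinations(sorted(assignments), 2)
    }


# Toy data (made up): six documents, keys are (n_neighbors, min_topic_size).
toy_assignments = {
    (10, 5): [0, 0, 1, 1, 2, 2],
    (10, 10): [1, 1, 0, 0, 2, 2],  # same grouping, topics renumbered
    (20, 5): [0, 0, 0, 1, 1, 1],   # documents regrouped into two topics
}

for (p, q), score in pairwise_shifts(toy_assignments).items():
    print(f"{p} vs {q}: ARI = {score:.3f}")
```

In this sketch, a parameter whose change leaves the index close to 1 has little influence on the assignment, while a parameter whose change pushes the index toward 0 would be a candidate for closer attention during model selection.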