Supervised machine learning to study political communication: know the pitfalls to maximise advantages

Party Manifestos
Political Methodology
Populism
Methods
Communication
Big Data
Jessica Di Cocco
European University Institute

Abstract

The increasing use of text-as-data in the social sciences has broadened the range of methodologies available for answering both new and long-standing research questions. This kind of data undoubtedly offers exciting opportunities, not least because scholars can retrieve it from a wide variety of text types. The sheer quantity of text-as-data has also made it necessary to adopt methods capable of analysing vast amounts of material with limited human and economic resources. Machine learning, both unsupervised and supervised, has responded to this need: it increasingly features in scientific work that relies on text analysis, from political communication to sentiment analysis, opinion mining, and information retrieval more generally. In the social sciences, supervised machine learning remains the most widely used approach, first because it allows a study to be tailored to precise research questions, and second because it partly offsets the 'black box' risk of unsupervised machine learning, in which researchers know the inputs and outputs but not what happens in between, and thus how and why the algorithm reached a given result. Although supervised models can seem extremely appealing, their use is not free from bias. One key challenge concerns training the models. The training phase is critical because it forms the basis from which the algorithm learns to replicate human judgements; consequently, the better the training set, the more accurately the algorithm will perform the task. When conducting automated textual analyses, researchers often train the algorithm on manually coded datasets, using the labels that coders assigned to texts following a shared codebook. However, when labelling complex phenomena, or issues on which there is no settled view, the coding process may embed researcher biases that undermine inter-subjectivity, to the point that results can vary considerably depending on who the coders are. To examine this under-explored problem, we focus on populism, a phenomenon much debated in the literature, and situate it within a broader reflection on the pitfalls and advantages of using supervised machine learning to study textual sources in political communication. Besides weighing the pros and cons of manually coded training sets, we also discuss possible alternatives based on the principle of 'less is more'. For illustration purposes, we analyse the data through the lens of populism and its subcomponents, applying our approach to comparative case studies to show how training can affect the final results.
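For concreteness, here is a minimal sketch (not from the paper) of the mechanism the abstract describes: the same supervised classifier trained on two hypothetical coders' labels for the same sentences can yield different predictions on new text. The sentences, labels, and scikit-learn pipeline below are all illustrative assumptions, not the authors' data or method.

```python
# Minimal sketch: how coder disagreement on a contested concept such as
# populism can propagate through a supervised text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical manifesto sentences, labelled by two coders who applied the
# same codebook but disagree on a borderline case (1 = populist, 0 = not).
sentences = [
    "The corrupt elite has betrayed the honest, hard-working people.",
    "We will invest in renewable energy and public transport.",
    "Only we speak for the true will of the people.",
    "The budget deficit must be reduced through spending reviews.",
]
labels_coder_a = [1, 0, 1, 0]
labels_coder_b = [1, 0, 0, 0]  # coder B reads the third sentence as non-populist

new_texts = ["Politicians ignore ordinary citizens while serving themselves."]

# Train the same model on each coder's labels and compare the outputs.
for name, labels in [("coder A", labels_coder_a), ("coder B", labels_coder_b)]:
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(sentences, labels)
    print(name, "->", model.predict(new_texts))
```

With identical texts and an identical pipeline, the only varying input is who coded the training set; any divergence in the printed predictions is therefore attributable to coder-dependent labelling, which is the inter-subjectivity problem the abstract raises.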