Getting to see the trees and the forest: tree-based detection of intentionally deceptive news-like content
Democracy
Media
Methods
Big Data
Empirical
Abstract
Following the 2016 US elections, scholars declared the 'post-truth' era (Benkler, Faris, & Roberts, 2018; Madrigal, 2017), referring to the increased spread of false information and, with it, 'the rise of the misinformation society': a world in which truth is no longer the rule. The current pandemic illustrates this development dramatically: with the spread of the novel Coronavirus, the world was flooded with enormous amounts of misinformation, potentially as infectious as the virus itself (Depoux et al., 2020). Consequently, the WHO declared an ongoing 'infodemic', referring to the uncontrolled spread of false information and conspiracy theories. With informed citizens being a fundamental cornerstone of democratic societies, the currently observed spread of disinformation may have a tremendous impact on those societies (Carpini & Keeter, 1996).
In fact, while false information has been an issue since the invention of the press (Ortoleva, 2019), modern media environments are assumed to push it even further (Humprecht et al., 2020). Specifically, owing to the availability and accessibility of online platforms, disseminating false information has become easier than ever, as anyone is now able to create and share it (Hakak et al., 2021). Furthermore, evidence suggests that traditional and non-digital media also play a role in the spread of false information, either by covering false information themselves or through correction attempts that backfire (Tsfati et al., 2020). What is needed, however, is better empirical evidence about the actual supply of deliberate disinformation in both traditional and digital media.
A vastly increasing number of studies has attempted to computationally classify false information in social media (Asubiaro & Rubin, 2018; Horne & Adali, 2017; Van der Zee et al., 2018). These usually rely on rather specific characteristics of such information. Alongside social scientists, researchers in computational linguistics and informatics have invested effort in understanding the particularities of disinformation (Damstra et al., 2021). Although little is known about the structural features of intentionally deceptive news content, it is reasonable to assume that the characteristics of deceptive information differ from those of entirely correct information (e.g., Allcott & Gentzkow, 2017; Asubiaro & Rubin, 2018; Horne & Adali, 2017). The aim of this paper is to provide a thorough, comprehensive, and systematic approach to identifying disinformation in media content and understanding the particularities of such information.
Hence, the present study aims at (a) contributing to an eventually powerful tool for the supervised detection of deception attempts and (b) a better understanding of the indicative characteristics of deceptive content. For this purpose, based on an extensive literature review and a diverse set of training materials, a feature-based approach for the automated detection of falsehoods is developed. By comparing different tree-based classifiers (decision tree, random forest, extra trees), this approach shall not only allow the automated detection of deception attempts but also shed light on characteristic content features of deceptive news-like articles.
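The comparison of tree-based classifiers described above can be sketched as follows. This is an illustrative outline only, not the authors' actual pipeline: the data are synthetic stand-ins for extracted content features, and all parameter choices (number of trees, cross-validation folds) are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for content features extracted from news-like
# articles (e.g., lexical, stylistic, or structural counts), with a
# binary label (deceptive vs. correct).
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=42)

# The three tree-based classifiers named in the text.
classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "extra trees": ExtraTreesClassifier(n_estimators=200, random_state=42),
}

# 5-fold cross-validated accuracy for each classifier; the ensembles
# typically outperform a single decision tree.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")

# Tree ensembles also expose feature importances, which speak to the
# second aim: identifying which content features are most indicative.
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:3]
print("most indicative feature indices:", top.tolist())
```

The feature-importance step is what makes tree-based methods attractive for the paper's dual goal: unlike opaque classifiers, they yield both a prediction and an interpretable ranking of the features driving it.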