ECPR

Using deep learning for visual structural topic modeling of political images

Political Methodology
Quantitative
Social Media
Communication
Big Data
Matias Piqueras
Uppsala Universitet

Abstract

Political communication is increasingly visual. This is especially true on social media, where citizens are regularly exposed to political messages in the form of images and videos. As a consequence, the ability to study the visual content being produced, shared and consumed is of importance to political scientists seeking to understand contemporary online political discourse. The vast amount of available data and the unstructured nature of visuals naturally call for computational methods. For textual data, probabilistic topic models have been widely adopted as an unsupervised method for classifying large text corpora into latent topics. Social scientists have extended these models to make them more relevant for their research purposes by allowing the analyst to induce problem-specific structure into the model in the form of metadata (document-level variables such as author or date) that is assumed to influence the generation of documents. Apart from guiding the inference, incorporating prior knowledge also permits the model to be used as a measurement tool for the quantities that are often of ultimate interest, such as the relation between a covariate and the proportion of a topic. Recent work has proposed extending this line of methodological research to the study of images (Torres, 2023). Specifically, it introduces a framework based on the bag of visual words (BoVW) model, a classic approach for feature extraction that allows for the creation of a visual analogue to the document-term matrix, and thus the direct application of many existing text-as-data methods to images. While an important contribution, current leading methods in computer vision typically use Deep Neural Networks (DNNs) for unsupervised learning tasks.
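The BoVW idea can be illustrated with a minimal numpy sketch. Everything here is hypothetical: the patch descriptors are random stand-ins for real local features (e.g. SIFT), and the codebook is sampled at random rather than learned by k-means, as it would be in practice. This is not the implementation used in the cited framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each "image" is a set of local patch descriptors
# (in practice these would come from a feature extractor such as SIFT).
n_images, patches_per_image, dim = 5, 40, 8
images = [rng.normal(size=(patches_per_image, dim)) for _ in range(n_images)]

# A codebook of K "visual words" (normally learned by clustering all patches;
# sampled at random here purely for illustration).
K = 16
codebook = rng.normal(size=(K, dim))

def bovw_histogram(patches, codebook):
    """Assign each patch to its nearest visual word and count occurrences."""
    # Pairwise squared distances between patches and codebook vectors.
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # index of nearest visual word per patch
    return np.bincount(words, minlength=len(codebook))

# Rows are images, columns are visual words: the analogue of a
# document-term matrix, to which text-as-data methods can be applied.
dtm = np.stack([bovw_histogram(im, codebook) for im in images])
print(dtm.shape)  # (5, 16); each row sums to the number of patches (40)
```

Once the counts are in this matrix form, any method that consumes a document-term matrix (topic models included) can be applied unchanged.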
Moreover, early empirical work studying images computationally in the social sciences has mostly used DNN-based methods, and a recent paper comparing different image representations (BoVW among them) finds that pretrained DNNs yield the most coherent results when applied to datasets of relevance to social scientists (Zhang and Peng, 2022). This work addresses the need for a method that, on the one hand, takes advantage of DNNs and, on the other, draws on the rich literature on topic modeling with structural information. More specifically, the paper introduces a model within the Variational Autoencoder (VAE) framework for automatically grouping images into "visual topics". The VAE elegantly connects Bayesian inference and DNNs to learn a latent representation of the data that is encouraged to follow a Gaussian distribution. Instead of a single Gaussian, the model assumes that images are generated from a mixture of Gaussians, reflecting the distinct distribution of visual content within each topic. Moreover, the probability of an image belonging to a topic is modelled as a function of image-level metadata in a logistic regression. Depending on the analyst's needs, the model accepts raw images or fine-tunes representations obtained from a pretrained network. Finally, the VAE framework is flexible, which makes the model easy to extend and modify, and thus a point of departure for future methodological and empirical research on visual communication.
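The generative story the abstract describes can be sketched in a few lines of numpy: image-level metadata sets topic probabilities through a multinomial logistic regression, a topic is drawn per image, and the latent representation is drawn from that topic's Gaussian. All dimensions, coefficients and the shared isotropic variance below are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: T topics, M metadata covariates, D-dim latent representations.
T, M, D = 3, 2, 4
n_images = 6

# Image-level metadata (e.g. dummy-coded author or date variables).
X = rng.normal(size=(n_images, M))

# Multinomial logistic regression: metadata determines topic probabilities.
W = rng.normal(size=(M, T))  # regression coefficients (assumed known here)
logits = X @ W
theta = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Each topic is a Gaussian over the latent image representation.
mu = rng.normal(size=(T, D))  # topic means
sigma = 0.1                   # shared isotropic std, for simplicity

topics = np.array([rng.choice(T, p=p) for p in theta])   # sample a topic per image
z = mu[topics] + sigma * rng.normal(size=(n_images, D))  # latent representation
# In the full model a decoder network would map z to pixel space, and a
# variational encoder would invert this process during inference.
print(z.shape)  # (6, 4)
```

Inference in the actual model amortizes the inversion of this process with an encoder network, so the sketch only covers the generative direction.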