Digital Data 2019, Week 06

Week 06: “Public” Sociology and Twitter Sampling

Social scientists in general, and sociologists in particular, have begun to develop methodologies for using Twitter data in research. They have also sought to better reach and educate a more general audience, a practice called public sociology. Schneider and Simonetto (2017) recently used the former to investigate whether sociologists use Twitter for the latter. They began by using the Twitter website’s advanced search to identify users who included the words “sociology” and “professor” in their profile, excluding any accounts that had not publicly published a tweet or that lacked the identifying information necessary to confirm the account belonged to a current (i.e., not retired or emeritus) professor of sociology. The researchers then collected up to the last 3200 tweets from each account, the maximum allowed by Twitter at the time, although they do not identify the exact method of doing so. It’s likely this was accomplished using the Twitter API, but given the changing nature of these proprietary access points, and the authors’ note that the data were organized into a searchable PDF file (oof), there may have been a more manual process involved. Twitter’s current search API, for example, only allows you to extract tweets up to seven days old. Using the advanced search function on the website, a researcher can find older tweets, but to create a data set must collect the information manually by taking screenshots or copying and pasting the relevant tweets into a spreadsheet or other document.

Some of the biggest challenges in working with these data were related to coding. The researchers needed to determine whether each person was a sociologist or a non-sociologist member of the public; what each sociologist’s area of expertise was; whether a tweet from a sociologist was based in that person’s area of expertise; and whether a tweet from a sociologist counted as public engagement. These and other qualitative judgements needed to be made individually by a human, not programmatically by an algorithm. Ultimately, the study found that the sociologists sampled generally use Twitter as a megaphone, perhaps distributing scholarly information more widely but rarely engaging with members of the public. The study isn’t particularly generalizable, given that the researchers did not take steps during sampling to ensure the demographics of the study population approximate those of sociologists more broadly, but it does provide a first glance at how sociologists are using Twitter, and identifies some initial methodologies that we can build upon in future studies.
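
To make the account-selection step concrete, here is a minimal sketch of the kind of filter described above. It is not the authors’ procedure (they worked through the website’s advanced search and do not publish code); the profile records, the field names `bio` and `public_tweet_count`, and the helper functions are all hypothetical. It also covers only the keyword and has-tweeted criteria. Confirming that an account belongs to a current professor still takes a human.

```python
# Hypothetical profile records; field names are assumptions, not Twitter's schema.
REQUIRED_TERMS = ("sociology", "professor")
MAX_TWEETS_PER_ACCOUNT = 3200  # per-account cap reported in the study


def is_candidate(profile):
    """Keep accounts whose bio has every required term and at least one public tweet."""
    bio = profile["bio"].lower()
    has_terms = all(term in bio for term in REQUIRED_TERMS)
    return has_terms and profile["public_tweet_count"] > 0


def tweets_to_collect(profile):
    """Number of tweets that could be collected under the per-account cap."""
    return min(profile["public_tweet_count"], MAX_TWEETS_PER_ACCOUNT)


profiles = [
    {"handle": "a", "bio": "Professor of Sociology, State U", "public_tweet_count": 5400},
    {"handle": "b", "bio": "Sociology PhD student", "public_tweet_count": 120},
    {"handle": "c", "bio": "Associate professor, sociology", "public_tweet_count": 0},
]

candidates = [p for p in profiles if is_candidate(p)]
for p in candidates:
    print(p["handle"], tweets_to_collect(p))  # prints: a 3200
```

Only account `a` survives: `b` never mentions “professor,” and `c` has no public tweets. Everything after this point (expertise, public engagement) is the hand-coding the paragraph above describes.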

Figure 2. Histogram of group-level percentages of coverage for tweets about mobilization in the Occupy Wall Street movement. Note. Estimates are based on 1,547 groups. (Rafail 2018:208)

Speaking of sampling, another recent article (Rafail 2018) evaluates the efficacy of different sampling methods for use with Twitter data, arguing that the scale of big data alone is not enough to ensure results will be representative and therefore generalizable. To begin, the author defines four distinct population types, then sets aside the unbounded type (for which random sampling is suggested) and examines the remaining three. Data were collected from the Twitter API via a customized Python application. The application was run once a week for several months in order to work around the same 3200-tweet limit that restricted the researchers in the previous study. For the first population, the initial accounts targeted for collection were identified using external databases that listed Twitter accounts engaging on the topic of interest. Those accounts were then used to find other accounts engaging on the topic, the digital equivalent of snowball sampling. For the second population, tweets were collected based on the inclusion of relevant hashtags and other keywords, then cleaned to exclude tweets that were not actually on-topic. For the third and last population, a machine-learning algorithm was employed to identify and collect tweets for analysis, but only after the researchers tested its effectiveness against a data set they had manually compiled and classified. Among other results, the study finds that data collection using hashtags produces a biased sample: the majority of tweets on a given topic do not include a hashtag, so their content and tendencies are underrepresented in the sample, and the sample gives an incorrect sense of scale regarding the reach of conversation on a given topic. This research provides a framework for understanding the importance of, and basic strategies for, sampling Twitter data in order to achieve the generalizability we pride ourselves on in other kinds of research.
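
Two of the ideas above are easy to sketch on toy data: snowball sampling outward from seed accounts, and measuring how much of an on-topic conversation a hashtag actually captures. This is an illustration only, not Rafail’s application. The `mentions` graph, the `on_topic` labels, the example tweets, and both function names are hypothetical, and in real work the on-topic judgement would come from manual coding or a validated classifier rather than a ready-made flag.

```python
from collections import deque

# Hypothetical toy data: who mentions whom, and which accounts are on-topic.
mentions = {
    "seed1": ["acct_a", "acct_b"],
    "acct_a": ["acct_c"],
    "acct_b": ["offtopic"],
    "acct_c": [],
    "offtopic": ["acct_d"],
    "acct_d": [],
}
on_topic = {"seed1", "acct_a", "acct_b", "acct_c", "acct_d"}


def snowball(seeds, max_waves):
    """Expand from seed accounts, wave by wave, adding only on-topic accounts."""
    sampled = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        account, wave = queue.popleft()
        if wave == max_waves:
            continue  # do not expand beyond the final wave
        for neighbor in mentions.get(account, []):
            if neighbor in on_topic and neighbor not in sampled:
                sampled.add(neighbor)
                queue.append((neighbor, wave + 1))
    return sampled


# Hypothetical hand-coded tweets: "on_topic" stands in for a human judgement.
tweets = [
    {"text": "Marching downtown tonight #ows", "on_topic": True},
    {"text": "General assembly at 7pm in the park", "on_topic": True},
    {"text": "Bring blankets, it is cold at the camp", "on_topic": True},
    {"text": "Lunch was great", "on_topic": False},
]


def hashtag_coverage(tweets, hashtag):
    """Share of on-topic tweets that a hashtag-only collection would capture."""
    relevant = [t for t in tweets if t["on_topic"]]
    tagged = [t for t in relevant if hashtag in t["text"].lower()]
    return len(tagged) / len(relevant)


print(sorted(snowball(["seed1"], max_waves=2)))  # ['acct_a', 'acct_b', 'acct_c', 'seed1']
print(round(hashtag_coverage(tweets, "#ows"), 2))  # 0.33
```

Note that `acct_d` is on-topic but never sampled, because the only path to it runs through an off-topic account; snowball samples inherit whatever the seeds can reach. The coverage figure makes the hashtag problem visible in miniature: a collection keyed on `#ows` would see one of the three on-topic tweets.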

Rafail, Patrick. 2018. “Nonprobability Sampling and Twitter.” Social Science Computer Review 36(2):195–211.

Schneider, Christopher J. and Deana Simonetto. 2017. “Public Sociology on Twitter: A Space for Public Pedagogy?” The American Sociologist 48(2):233–45.