Digital Data 2019, Week 10

Week 10: Scraping Data for Research

The internet provides myriad conveniences, among them the ability to purchase almost anything from the comfort and privacy of our homes, including illegal drugs through digital black markets called cryptomarkets. These sites use advanced technological methods to obscure the identity of buyers and sellers, but they are not without risk, much like the products purchased therein. In order to understand how market participants managed the potential risks stemming from their drug use, Bancroft (2017) scraped two years of data from the forums of a major cryptomarket, using an application called NVivo to perform a combination of manual and automatic coding. He did ask the cryptomarket forum administrator for permission to collect data for his research and, receiving no response, decided to proceed. In other cases I would judge this a clear violation of informed consent guidelines, as I’ve done consistently for other research performed using data from publicly accessible forums and sites. However, this article provides an opportunity for a more nuanced discussion around consent. As the author points out, the users in this forum were knowingly conducting illegal activity, and so used safeguards both technical (anonymizing software like Tor) and social (refraining from posting personal information) to protect their true identities. Although the forum users did not know they were being watched for research purposes, they did presume they were being actively watched by authorities, and adjusted their behavior accordingly. Although the users may not have consented to their inclusion in this research, they do seem to have at least been informed, which seems to create a less unethical situation.
It’s also worth noting that this research project received approval from Bancroft’s IRB-equivalent process; while I’m not familiar with how exactly the University of Edinburgh makes these decisions, privacy standards around digital technology are notably higher in Europe than in the United States.

Satellite and aerial imagery used to identify urban farms; while the satellite imagery (left) did not provide a clear view, the aerial imagery (right) was prohibitively expensive. (Young et al. 2018:331)

While other research articles I’ve reviewed over the last couple of weeks used APIs or web scraping as the sole source of data for analysis, for Young et al. (2018) scraping was one of two methods used to identify relevant data for the research project. Seeking to better understand the nature of urban agriculture in Baltimore, Young et al. compared the efficacy of a method typically used to make quantitative statements about agriculture (satellite imagery) with that of scraping information about urban farms from the web. Urban farms represent a fraction of the overall agriculture industry, and tend to be “smaller, more diverse, more transient, and more widely dispersed” (p. 324) than their rural counterparts. While it was technically possible to identify these more varied urban farms using higher-quality aerial imagery instead of lower-quality satellite imagery, the better technology was prohibitively expensive for the project. In the end, web scraping was identified as the more efficient and effective method for gathering geospatial data, even if it did require a greater level of manual error correction and verification by research assistants. The population being studied was not farmers but their farms, so no personally identifying information was put at risk. The text-scraping techniques also limited researchers’ ability to gather irrelevant information just because they could (compare how the cars that capture imagery for Google Maps often capture people going about their daily lives). Finally, research assistants could use their local and cultural knowledge to better understand the potential farm locations they surveyed to verify their status. This seems like an ideal use for web-scraping techniques.
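To make the "only collect what you need" point a little more concrete, here is a minimal Python sketch of what a narrowly targeted text scrape might look like. It is not the method Young et al. describe. The keywords, the sample HTML, and the names `ListingParser` and `find_candidates` are all my own assumptions for illustration. The idea is simply that the scraper keeps only listing text that mentions a farm-related term, and marks every hit as unverified so a research assistant still has to confirm it.

```python
from html.parser import HTMLParser

# Hypothetical keywords; the study's actual search terms are not given here.
KEYWORDS = ("urban farm", "community garden")


class ListingParser(HTMLParser):
    """Collect the text of each <li> element and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # None means we are currently outside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._buffer is not None:
            # Collapse runs of whitespace so each listing is one clean line.
            self.items.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def find_candidates(html, keywords=KEYWORDS):
    """Return listings that mention a keyword, each flagged as unverified."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return [
        {"text": item, "verified": False}  # a person must confirm each hit
        for item in parser.items
        if any(keyword in item.lower() for keyword in keywords)
    ]


# Made-up sample page; none of these names or addresses are real.
SAMPLE = """
<ul>
  <li>Example Street Urban Farm, 100 Example St</li>
  <li>Corner Cafe, 200 Sample Ave</li>
  <li>Sample Avenue Community Garden, 300 Sample Ave</li>
</ul>
"""

candidates = find_candidates(SAMPLE)
for candidate in candidates:
    print(candidate["text"])
```

In this sketch the cafe listing is dropped before it is ever stored, which is the small-scale version of not gathering information just because you can, and the `verified` flag stands in for the manual checking step the research assistants performed.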

Bancroft, Angus. 2017. “Responsible Use to Responsible Harm: Illicit Drug Use and Peer Harm Reduction in a Darknet Cryptomarket.” Health, Risk & Society; Abingdon 19(7–8):336–50.

Young, Linda J., Michael Hyman, and Barbara R. Rater. 2018. “Exploring a Big Data Approach to Building a List Frame for Urban Agriculture: A Pilot Study in the City of Baltimore.” Journal of Official Statistics; Stockholm 34(2):323–40.