The missing link? Linking surveys with geospatial and digital behavioral data

Survey data are the most widely used type of data in the quantitative social sciences. They have been used to investigate a huge range of societally relevant topics. However, they also have specific limitations. An effective way to address these limitations is to link survey data with other types of data, such as geospatial or digital behavioral data.
This blog post discusses the benefits of linking survey data with other types of data as well as what researchers need to consider if they want to do this. To address the latter, we introduce a general framework for organizing the data linking workflow.

DOI: 10.34879/gesisblog.2021.50

Data Linking – What is it?

Data linking sounds like an abstract concept, but let’s try to be specific: For us, data linking means the enrichment of one focal dataset with information from another auxiliary dataset through specific identifiers that can be used to create the link between the data sources. In our case, the focal dataset is survey data, and its basic structure remains unchanged in the linking process. In the end, it is just extended by additional attributes (variables) from another data source.
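To make this definition concrete, here is a minimal sketch in plain Python. All names and values (`respondent_id`, `trust_in_media`, `noise_db`, the `link` helper) are made up for illustration and are not taken from any real study. The sketch keeps every row of the focal dataset and only adds attributes from the auxiliary dataset where the identifier matches.

```python
# Hypothetical focal dataset: one row per survey respondent.
survey = [
    {"respondent_id": 1, "trust_in_media": 4},
    {"respondent_id": 2, "trust_in_media": 2},
    {"respondent_id": 3, "trust_in_media": 5},
]

# Hypothetical auxiliary dataset, keyed by the same identifier.
auxiliary = {
    1: {"noise_db": 55.0},
    3: {"noise_db": 62.5},
}


def link(focal, aux, key, aux_vars):
    """Keep every focal row and add auxiliary variables where a match exists."""
    linked = []
    for row in focal:
        match = aux.get(row[key], {})
        enriched = dict(row)  # copy, so the focal dataset itself stays unchanged
        for var in aux_vars:
            enriched[var] = match.get(var)  # None marks "no auxiliary record"
        linked.append(enriched)
    return linked


linked = link(survey, auxiliary, key="respondent_id", aux_vars=["noise_db"])
```

The result has exactly as many rows as the survey. Respondent 2, who has no auxiliary record, is kept with a missing value instead of being dropped, which mirrors the idea that the basic structure of the focal dataset remains unchanged.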

Admittedly, the definition we use here is somewhat narrow. There are plenty of other data linking (often also called linkage) approaches out there, such as probabilistic linkage, which does not require unique/unambiguous identifiers. Yet, we focus on a common approach for social scientists working with survey data.

Data Linking – What is it good for?

Survey data can provide a wealth of interesting insights. Nevertheless, they also have some clear limitations in what they can measure reliably. This is where data linking comes in. If we link survey data with other types of data, we can increase their analytic potential and address some of their limitations at the same time.¹

Survey data are typically based on self-reports. This can be both a key strength and a weakness. Though self-reports can be used to assess a huge variety of attitudes, opinions, and behaviors, they can also be unreliable. One issue in this context is social desirability. For example, people tend to report consuming substantially more news than they actually do, as it is socially desirable to consume a lot of news.²

Another issue is that respondents may simply not be able to remember things they are asked about. Imagine that you are asked how many times you checked your Twitter feed yesterday. In this case, you are probably able to provide a good answer. However, imagine you are asked how often you did that in the last week or month. In that case, your answer is quite likely to be a guesstimate.³ ⁴

Other things may simply not be known to respondents. For example, it is quite unlikely that people are able to accurately report the noise level in decibels in their neighborhood or the exact unemployment rate in their hometown.

Our auxiliary data sources: geospatial and digital behavioral data

Survey data can potentially be linked with all sorts of other data. To address the kinds of questions raised in the previous section, researchers can make use of the two types of data that the authors of this blog post use in their daily work: geospatial data (Stefan) and digital behavioral data (Johannes). Geospatial data are data that contain georeferences (i.e., geocoordinates). Digital behavioral data (DBD) comprise different kinds of data produced by the use of digital technology and services (such as websites, social media platforms, smartphones, and apps) or wearable devices (such as fitness trackers).

While there are obvious differences between geospatial data and DBD, they have some similarities and intersections. For example, DBD can also contain georeferences. In addition, both can be sensitive and disclosive. One key aspect that makes them different is the actual identifier needed to link them with survey data: geocoordinates in geospatial data and user names or IDs in DBD. Nevertheless, the basic framework for managing the linking process can be the same, as we will show in the following.

Handle with care: A privacy-sensitive workflow for data linking

Regardless of whether we want to link geospatial data or DBD with survey data, working with these data can get tricky due to data privacy issues. As we deal with sensitive information (i.e., survey respondents’ location, social media, or other online activity), we must be careful when processing the data.

Hence, an important condition for the framework we propose is that there should be no single dataset that contains all the data from the different sources. For geospatial data, this means that we use respondents’ geocoordinates as a tool for gathering relevant information from external data sources but get rid of them immediately after the linking.⁵ For DBD, this means that variables that identify users of digital platforms, such as user names or IDs, can be used to collect these data, but these data should not be stored in the same file as the survey data. Accordingly, we suggest a strict separation of the different data sources: (A) survey data, (B) the linking information (identifiers), (C) the auxiliary data, and (D) a correspondence table between identifiers of the survey data (e.g., anonymous respondent IDs) and the linking information.
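For the geospatial case, the idea of using coordinates only as a tool can be sketched as follows. This is a simplified illustration under strong assumptions: the auxiliary data are a made-up noise grid with 1 km cells, the coordinates are already in a projected system measured in metres, and all names (`noise_grid`, `cell_of`, `spatial_attributes`) are hypothetical. Real spatial linking would typically rely on a GIS library and proper coordinate reference systems.

```python
import math

# (C) Hypothetical auxiliary data: noise level in dB per 1 km grid cell,
# with cells indexed by (column, row).
noise_grid = {
    (0, 0): 48.0,
    (0, 1): 52.5,
    (1, 0): 61.0,
    (1, 1): 57.5,
}

# (B) + (D) Linking information: respondent coordinates in metres, keyed by the
# anonymous respondent ID and stored separately from the survey answers.
coordinates = {
    "r01": (250.0, 1400.0),
    "r02": (1800.0, 300.0),
}


def cell_of(x_m, y_m, cell_size_m=1000.0):
    """Return the index of the grid cell that contains a point."""
    return (math.floor(x_m / cell_size_m), math.floor(y_m / cell_size_m))


def spatial_attributes(coords, grid):
    """Look up the grid value for each respondent; coordinates are not kept."""
    return {
        rid: {"noise_db": grid.get(cell_of(x, y))}
        for rid, (x, y) in coords.items()
    }


attributes = spatial_attributes(coordinates, noise_grid)
```

The resulting `attributes` contain only the respondent ID and the derived noise value. The coordinates themselves never enter the dataset that is later merged with the survey data.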

In a nutshell, our framework has three steps: First, the linking information is used to extract information from the auxiliary data. Second, these extracted data are processed, and disclosive information is removed. At the same time, the correspondence table is used to add the matching survey IDs to these data. Third, these processed (and reduced) data are merged with the survey data to create an enriched version of the original survey data. Notably, even after the final step, the linked data can remain sensitive and potentially disclosive. Hence, when it comes to sharing such data, it may be that they can only be offered through dedicated secure facilities, such as the GESIS Secure Data Center.
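The three steps can be sketched in a few lines of plain Python, here for a DBD example. This is only an illustration of the logic, not the implementation we use: the four components (A) to (D) are tiny invented datasets, and the function names `extract`, `process`, and `merge` as well as all variable names are hypothetical.

```python
# (A) Survey data, keyed by an anonymous respondent ID (invented values).
survey = {
    "r01": {"news_use_self_report": 5},
    "r02": {"news_use_self_report": 3},
}

# (B) Linking information: platform identifiers, kept apart from the survey data.
linking_info = ["@user_a", "@user_b"]

# (C) Auxiliary data: an invented log of visited URLs per user.
auxiliary = [
    {"user": "@user_a", "url": "news.example/a"},
    {"user": "@user_a", "url": "news.example/b"},
    {"user": "@user_b", "url": "news.example/c"},
    {"user": "@someone_else", "url": "news.example/d"},
]

# (D) Correspondence table between linking information and respondent IDs.
correspondence = {"@user_a": "r01", "@user_b": "r02"}


def extract(identifiers, aux):
    """Step 1: use the linking information to pull records from the auxiliary data."""
    wanted = set(identifiers)
    return [record for record in aux if record["user"] in wanted]


def process(records, table):
    """Step 2: aggregate, swap identifiers for respondent IDs, drop disclosive fields."""
    counts = {}
    for record in records:
        respondent_id = table[record["user"]]
        counts[respondent_id] = counts.get(respondent_id, 0) + 1
    return {rid: {"n_news_visits": n} for rid, n in counts.items()}


def merge(survey_data, processed):
    """Step 3: add the processed attributes to the survey data."""
    return {
        rid: {**row, **processed.get(rid, {})}
        for rid, row in survey_data.items()
    }


enriched = merge(survey, process(extract(linking_info, auxiliary), correspondence))
```

Only the aggregated count reaches the enriched survey data. User names and individual URLs stay in their own components and are never stored alongside the survey answers.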

Link responsibly!

As our proposed workflow illustrates, data linking is a double-edged sword. Linking survey data with other data, such as geospatial data or DBD, can enhance the analytic potential of the data and allow researchers to overcome some of the limitations of self-reports. But linking also increases privacy and disclosure risks.

Hence, it is important that researchers follow a privacy-sensitive workflow and exercise care when linking survey data with other types of data. In that case, they can exploit the full potential of the linked data while also minimizing privacy and disclosure risks. As a bonus, the workflow we have laid out should also facilitate finding solutions for publishing and sharing linked data.

While we used geospatial data and DBD as examples, our framework should also be applicable to other cases of survey data linking. Of course, this blog post can only provide a glimpse into the large topic of data linking. If you want to learn more, the GESIS Survey Guideline on data linking may be a good starting point. We believe that linking survey data with other types of data, such as geospatial data or DBD, can offer exciting opportunities. Given the sensitivity of such data, however, we as researchers must pay close attention to privacy and disclosure risks and link responsibly.

References

¹ Stier, S., Breuer, J., Siegers, P., & Thorson, K. (2020). Integrating Survey Data and Digital Trace Data: Key Issues in Developing an Emerging Field. Social Science Computer Review, 38(5), 503–516. https://doi.org/10.1177/0894439319843669
² Prior, M. (2009). The Immensely Inflated News Audience: Assessing Bias in Self-Reported News Exposure. Public Opinion Quarterly, 73(1), 130–143. https://doi.org/10.1093/poq/nfp002
³ Scharkow, M. (2016). The Accuracy of Self-Reported Internet Use—A Validation Study Using Client Log Data. Communication Methods and Measures, 10(1), 13–27. https://doi.org/10.1080/19312458.2015.1118446
⁴ Araujo, T., Wonneberger, A., Neijens, P., & de Vreese, C. (2017). How Much Time Do You Spend Online? Understanding and Improving the Accuracy of Self-Reported Measures of Internet Use. Communication Methods and Measures, 11(3), 173–190. https://doi.org/10.1080/19312458.2017.1317337
⁵ Jünger, S. (2019). Using Georeferenced Data in Social Science Survey Research. The Method of Spatial Linking and Its Application with the German General Social Survey and the GESIS Panel. GESIS – Leibniz-Institut für Sozialwissenschaften. https://doi.org/10.21241/ssoar.63688