Adventures in ex-post harmonization: Frankenstein’s Creature

A blog post series on harmonizing data from different surveys.

In this series of blog posts, we will explore ex-post harmonization: the art of combining data from different surveys. In this introductory post, I explain the motivation behind the series and outline the upcoming topics. The series is aimed at both seasoned ex-post harmonization practitioners and researchers who are curious about ex-post harmonization and how it can enrich their research projects.

DOI: 10.34879/gesisblog.2020.21

“With how many things are we on the brink of becoming acquainted, if cowardice or carelessness did not restrain our inquiries.”
Victor Frankenstein
(from Mary Shelley’s “Frankenstein; or, The Modern Prometheus”)

National and international survey research programs explore a vast range of topics and steadily add to a treasure trove of data on individuals and societies. Secondary use of that data treasure is not new, but Open Data are now being promoted with unprecedented vigor by researchers, governmental agencies, and funding agencies ¹. Secondary use may mean reusing data from one survey program, but there is a growing trend to combine data from different survey programs or data sources for joint analysis ² ³.

If data from different survey programs, which were not originally intended to be combined, are merged into a single dataset ready for analysis, we speak of ex-post harmonization ⁴. This differs from the logic of international survey programs. There, data from different countries are harmonized as well, but that harmonization has been planned from the outset. That is, the national questionnaires were designed to facilitate this eventual harmonization ⁵.

Ex-post harmonization is on the rise

Ex-post harmonization, in contrast, is often a more daring endeavor. Usually with a research question at its heart, ex-post harmonization projects must make do with the way the data were collected. In essence, ex-post harmonization researchers piece together their own Creature; thankfully with happier results than Victor Frankenstein. The central benefit of ex-post harmonization is that it enables research groups to knit together datasets tailored to their specific interests (and those of subsequent secondary users of their dataset). However, in contrast with reusing data from a single survey program, ex-post harmonization requires a far more active approach to data selection and harmonization. And here, the devil is certainly in the detail.

Nonetheless, researchers take up that challenge, for example, in the ambitious Survey Data Recycling project, which combines data from 23 international survey projects ⁶. There are also several ex-post harmonization projects in the works at GESIS, such as the HaSpaD project (harmonizing and synthesizing partnership histories from different research data infrastructures) and the ONBound project (old and new boundaries: national identities and religion).

Personally, I am currently working on a database of recoding scripts that make measurements of selected constructs comparable across several of the survey programs we are involved in at GESIS (ALLBUS, ESS Germany, EVS Germany, GESIS Panel, GLES, and ISSP Germany). I also consult on how to harmonize substantive instruments, especially variables representing latent constructs such as attitudes, values, emotions, personality traits, or cognitive and non-cognitive skills.

A series of blog posts

In this series of blog posts, I want to share some insights derived from my work and my research. The series will deal with the broad questions of “Why should we ex-post harmonize surveys?”, “Why is it so challenging to combine data measured with different instruments?”, and “How can we make data comparable in practical terms?”. The intention is to offer bite-sized, hands-on posts that may help ex-post harmonization practitioners with their projects and inspire researchers to try ex-post harmonization for their own research interests.

Below is a preview of upcoming posts in the series. Links to the other posts will be added here as soon as they are published. I aim to release one post per month. The list is not comprehensive; further posts will be added as our research on the topic progresses.

Upcoming posts in the series:

The sum and its parts: The benefits of combining data from different surveys
(October 2020)
Before delving into the “how” of ex-post harmonization, we look into the “why” by exploring the various benefits of and use-cases for ex-post harmonization.

Apples and Oranges: How to find out if two questions measure the same concept?
(November 2020)
Here we delve into an important but difficult matter in harmonization in general and ex-post harmonization in particular. How can we determine if two instruments measure the same concept? This is especially hard if the concepts are latent (i.e., not directly observable).

Ceci n’est pas une pipe: Disentangling measurement and reality in ex-post harmonization
(December 2020)
The scores in our dataset are not reality itself; they are glimpses at reality through the lens of the respective measurement instrument. In research practice, that distinction sometimes takes a backseat. However, if we want to combine scores of different instruments, then the relationship between measurement and reality becomes crucial.

(Not) by any stretch of the imagination: A cautionary tale about linear stretching
(January 2021)
Linear stretching is a frequently used approach to combine data from response scales with different numbers of response options. In linear stretching, the minimum scores of the two scales are set equal to each other, as are the maximum scores, and all values in between are spaced at equal distances within this range. However, while temptingly easy to use, linear stretching runs the risk of seriously biasing analyses in the harmonized dataset.

The new normal: Linear equating of different instruments
(February 2021)
As an alternative to linear stretching, we look into (observed score) equating approaches in this post and the next. Linear equating is a powerful approach that corrects for differences in difficulty and variance between different scales.

Cats are liquids: Equipercentile equating of different instruments
(March 2021)
Equipercentile equating expands on the idea of linear equating by matching the shape of the whole response distribution. It corrects not only for differences in scale mean and standard deviation but also for differences in higher moments of the distribution, such as skewness and kurtosis. This helps harmonize instruments where, for example, respondents mostly choose high (or low) response options.

Swiss cheese and MICE: Harmonizing instruments with multiple imputation
(April 2021)
This time, we apply multiple imputation to harmonize data for the same construct measured with different instruments. We will treat data as Swiss cheese and then unleash mice; sorry, MICE. The approach poses some hurdles regarding the required data and the complexity of the analysis. However, if those hurdles are cleared, it can be a flexible and powerful tool for ex-post harmonization.
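
As a small taste of the two linear approaches previewed above, here is a minimal Python sketch. It is not taken from the upcoming posts; the function names and the toy data are made up for illustration. The linear equating function uses the textbook mean-and-standard-deviation form and assumes that both samples represent the same population, which is an assumption you would have to justify in a real project.

```python
from statistics import mean, pstdev


def linear_stretch(score, src_min, src_max, tgt_min, tgt_max):
    """Linear stretching: align scale minimum and maximum, then space
    all values in between at equal distances within the target range."""
    return tgt_min + (score - src_min) * (tgt_max - tgt_min) / (src_max - src_min)


def linear_equate(score, src_scores, tgt_scores):
    """Linear equating: transform a source score so that the source
    sample's mean and standard deviation match those of the target sample.
    Assumes both samples were drawn from the same population."""
    src_mean, src_sd = mean(src_scores), pstdev(src_scores)
    tgt_mean, tgt_sd = mean(tgt_scores), pstdev(tgt_scores)
    return tgt_mean + (score - src_mean) * tgt_sd / src_sd


# Hypothetical toy data: responses on a 5-point scale (1-5) and on an
# 11-point scale (0-10). The values are invented for illustration only.
five_point = [3, 4, 4, 5, 5, 5, 4, 3, 5, 4]
eleven_point = [4, 5, 6, 6, 7, 5, 6, 7, 8, 6]

# Stretching looks only at the scale endpoints ...
stretched = [linear_stretch(x, 1, 5, 0, 10) for x in five_point]

# ... whereas equating looks at how respondents actually used each scale.
equated = [linear_equate(x, five_point, eleven_point) for x in five_point]

print("stretched mean:", mean(stretched))
print("equated mean:  ", mean(equated))
print("target mean:   ", mean(eleven_point))
```

Stretching depends only on the formal endpoints of the response scales, while equating depends on the observed score distributions. That difference is why the two approaches can lead to quite different harmonized scores for the same raw response.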

Interested?

You can follow the series in numerous ways:

This current introduction post will be updated with links to all other posts.
There is a blog tag for this series, with which you can easily select all posts: Ex-Post Harmonization Adventures
All posts will be announced via GESIS-News as well as on Twitter (@GESIS_org).
Or follow me on Twitter (@_R_K_Singh) for previews and interesting tidbits from my harmonization research and practice.

What next?

You can read the first substantive post in the series, which is already online: “The sum and its parts: The benefits of combining data from different surveys.” I am also open to discussing, collaborating on, or consulting on any topic on the list of posts (or related topics).

References

Link, G., Lumbard, K., Germonprez, M., Conboy, K., & Feller, J. (2017). Contemporary Issues of Open Data in Information Systems Research: Considerations and Recommendations. Communications of the Association for Information Systems, 41(1), 587–610. https://doi.org/10.17705/1CAIS.04125
Dubrow, J. K., & Tomescu-Dubrow, I. (2016). The rise of cross-national survey data harmonization in the social sciences: emergence of an interdisciplinary methodological field. Quality and Quantity, 50(4), 1449–1467. https://doi.org/10.1007/s11135-015-0215-z
Hussong, A. M., Curran, P. J., & Bauer, D. J. (2013). Integrative Data Analysis in Clinical Psychology Research. Annual Review of Clinical Psychology, 9(1), 61–89. https://doi.org/10.1146/annurev-clinpsy-050212-185522
Granda, P., Wolf, C., & Hadorn, R. (2010). Harmonizing Survey Data. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. E. Lyberg, P. P. Mohler, … T. W. Smith (Eds.), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 315–332). https://doi.org/10.1002/9780470609927.ch17.
Granda, P., Wolf, C., & Hadorn, R. (2010). Harmonizing Survey Data. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. E. Lyberg, P. P. Mohler, … T. W. Smith (Eds.), Survey Methods in Multinational, Multiregional, and Multicultural Contexts (pp. 315–332). https://doi.org/10.1002/9780470609927.ch17.
Slomczynski, K. M., & Tomescu-Dubrow, I. (2018). Basic Principles of Survey Data Recycling. In T. P. Johnson, B.-E. Pennell, I. A. L. Stoop, & B. Dorer (Eds.), Advances in Comparative Survey Methodology: Multinational, Multiregional and Multicultural Contexts (3MC) (pp. 937–962). Hoboken, NJ: Wiley.

3 comments

Pingback: The sum and its parts: The benefits of combining data from different surveys – GESIS Blog
Kazimierz Maciek Slomczynski says:

02/11/2020 at 15:13

After the sentence: “Nonetheless, researchers take up that challenge, for example, in the ambitious Survey Data Recycling project, which combines data from 23 international survey projects.” the reference to the most relevant published work is in order: Slomczynski, Kazmierz M. and Irina Tomescu-Dubrow. 2018. Basic Principles of Survey Data Recycling. Pp. 937-962 in Advances in Comparative Survey Methodology: Multinational, Multiregional and Multicultural Contexts (3MC), edited by Timothy P. Johnson, Beth-Ellen Pennell, Ineke A. L. Stoop and Brita Dorer. Hoboken, NJ: Wiley.


1. gesispr says:
  
  05/11/2020 at 13:07
  
  Thanks for your comment, we included the reference!
  