The sum and its parts: The benefits of combining data from different surveys
Post 1 in the series “Adventures in ex-post harmonization”
Ex-post harmonization – combining data from different surveys – requires quite some effort. In this post, we explore the many reasons why it is still worth it. All benefits discussed here apply equally to large-scale and smaller ex-post harmonization efforts.
DOI: 10.34879/gesisblog.2020.22

This is the first substantive post in the series “Adventures in ex-post harmonization.” In future posts, we will delve into the detailed inner workings and complexities of ex-post harmonization. But before addressing the “how,” let’s talk about the “why.” Ex-post harmonization, that is, combining data from different surveys, requires quite some effort. Even so, I want to convince you in this post that data harmonization might well be worth it: when we combine data from different surveys, we can answer completely new questions with old data. The sum is, quite proverbially, more than its parts.
The following is true for large-scale ex-post harmonization, such as data enhancement in archives or larger projects harmonizing many different surveys. However, it is equally true for small ex-post harmonization efforts, such as combining data from two surveys or from a few waves.
In particular, the benefits fall into three broad categories: (1) filling gaps in the data, (2) increasing sample size, and (3) improving the robustness and reproducibility of results.
1. Filling gaps in the data
Many researchers are familiar with the feeling when their research ideas collide with their chosen survey program’s conceptual, spatial or temporal limits. If only a certain construct, a certain country, or a certain year had been included, then surely all our questions could be answered! However, even the most extensive, best-funded survey program cannot be everything to everyone. Ex-post harmonization may help compensate for some of these limitations. Data gaps in individual surveys may well be closed by adding data from other surveys.
Filling gaps in regional data

Filling gaps in regional coverage may mean just adding a single city, state, or country. But with the help of international survey programs, it can be possible to add whole swaths of the globe at once [1]. The Survey Data Recycling (SDR) project is an elaborate example of this. International (cross-cultural) comparability is undoubtedly an issue here. However, if the combined international survey programs overlap in some of the covered countries, measurements obtained in different countries can be made more comparable. Promising approaches here are linear and equipercentile equating, methods which help mitigate instrument differences and which will be covered later in this series.
Filling gaps in sample districts

However, adding regions is not the only way in which ex-post harmonization increases spatial coverage. Large face-to-face surveys often rely on multi-stage random sampling to reduce costs [2]. This might mean that they randomly draw several districts in the first stage of the sampling process, then households within those districts, and finally, a random person in each household. The ADM-Design [3] is an example of this. Such an approach is usually unproblematic for data users. However, with the increasing popularity of geo-referencing, this can limit the spatial resolution considerably. Imagine showing the distribution of a concept, such as a political attitude, on the map of a country. The resolution and reliability of that map crucially hinge on the number of chosen districts (i.e., districts in which data has been collected). However, if the desired concept is included in several surveys, we can combine the data to increase the number and spread of sampled districts.
Healing time series

Apart from gaps in spatial coverage, temporal coverage is often a limiting factor in research. A survey program may have started too late or ended too early to cover the relevant years. The relevant concept may have been included in some years but not others, or the instrument may have been replaced, breaking a seamless time series. If adequate surveys are available, then all these vexing problems can be solved with ex-post harmonization. It is easiest to heal a time series if the instrument change was accompanied by a split-half experiment. Transforming scores from the old instrument format to the new format then becomes trivial with linear or equipercentile equating. However, even a sequence of many different instruments for the same construct within one survey (e.g., the GESIS Panel) can be healed by using the time series of another probability-based survey (e.g., the ALLBUS).
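To make the split-half case a little more concrete, here is a minimal sketch of linear equating in Python. It is an illustration under simplifying assumptions, not the procedure of any particular project: both halves of the sample are taken to represent the same population, so that differences in mean and spread can be attributed to the instrument. The function name `linear_equate` and the example responses are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(scores_old, ref_old, ref_new):
    """Map old-instrument scores onto the scale of the new instrument.

    ref_old and ref_new are responses from two random halves of the same
    sample (assumption: split-half experiment), one half per instrument.
    A score keeps its distance from the mean, measured in standard deviations.
    """
    mu_old, sd_old = mean(ref_old), pstdev(ref_old)
    mu_new, sd_new = mean(ref_new), pstdev(ref_new)
    if sd_old == 0:
        raise ValueError("old-instrument reference scores have no variance")
    return [mu_new + sd_new * (x - mu_old) / sd_old for x in scores_old]


# Hypothetical split-half data: a 5-point item (old) and an 11-point item (new).
half_old = [1, 2, 3, 3, 4, 5]
half_new = [0, 3, 5, 5, 7, 10]

# Convert three old-format scores into the new format.
converted = linear_equate([1, 3, 5], half_old, half_new)
print([round(v, 2) for v in converted])
```

After the transformation, the old-format scores have the same mean and standard deviation as the new-format scores in the reference data. Equipercentile equating follows the same logic but matches whole distributions instead of only mean and spread.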
Filling gaps in substantive concepts

Lastly, gaps can occur in the concepts of interest. With ex-post harmonization, researchers are less dependent on a single survey covering all concepts of interest. Instead, conceptual gaps can be compensated even if only some of the surveys contain a concept. This is one of the advantages of large ex-post harmonization projects combining many survey programs across several waves. However, ex-post harmonization is already helpful on a far smaller scale. Even if researchers use different surveys in separate analyses, ex-post harmonization of the survey instruments can help make the coefficients more comparable. For example, imagine testing the effect of a concept of interest (e.g., social trust) on two different dependent variables. Even if the two dependent variables are from different surveys and analyzed in two different models, it is worthwhile to harmonize the two social trust measures in the two surveys. That way, the results can be compared directly because now below-average (-1 SD), average (mean), and above-average (+1 SD) participants have the same social trust score, respectively.
2. Increasing the sample size

Apart from filling gaps in the data, ex-post harmonization has the undeniable benefit of increasing the available sample size. This has many different advantages, of course:
- The power to detect small effects increases.
- Population parameters and effect sizes are estimated with greater precision.
- More complex models can be estimated with adequate robustness, such as models with complex categorical variables or models with higher-order interactions.
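The gain in precision can be illustrated with a back-of-the-envelope calculation. The sketch below assumes simple random sampling, a common population, and an already harmonized measure; it ignores design effects and survey weights, which matter in real data. The survey names, sample sizes, and the standard deviation are made up for illustration.

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of a mean under simple random sampling."""
    return sd / sqrt(n)


def ci_half_width(sd, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a mean."""
    return z * standard_error(sd, n)


# Hypothetical surveys measuring the same harmonized variable (SD = 2.0).
sd = 2.0
sizes = {"survey A": 1000, "survey B": 1500, "survey C": 2500}

for name, n in sizes.items():
    print(f"{name}: n={n}, CI half-width={ci_half_width(sd, n):.3f}")

# Pooling the three surveys increases n and narrows the interval.
pooled_n = sum(sizes.values())
print(f"pooled: n={pooled_n}, CI half-width={ci_half_width(sd, pooled_n):.3f}")
```

Because the standard error shrinks with the square root of the sample size, quadrupling the number of cases halves the width of the confidence interval.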
Better resolution for subpopulations

There is another advantage of an increased sample size that is not immediately obvious but represents a vast potential for new research avenues. Suppose you are interested in a specific subpopulation, such as people with a certain education level, professional standing, or ethnicity. General social surveys often have too few cases in such groups for reliable, precise estimates. However, by combining different surveys, it may well be possible to reach adequate sample sizes even for relatively infrequent social categories (or combinations of them). Of course, this only works if the respective surveys include a socio-structural measure that is finely grained enough.
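A small sketch shows how pooling changes the picture for rare categories. All numbers are invented, the category labels stand in for an already harmonized education measure, and the threshold of 100 cases is an arbitrary assumption rather than a recommendation.

```python
from collections import Counter

# Hypothetical case counts per harmonized education category in three surveys.
survey_counts = {
    "survey_a": Counter({"secondary": 2400, "tertiary": 570, "doctorate": 30}),
    "survey_b": Counter({"secondary": 2800, "tertiary": 655, "doctorate": 45}),
    "survey_c": Counter({"secondary": 2500, "tertiary": 460, "doctorate": 40}),
}

MIN_CASES = 100  # assumed minimum number of cases for a subgroup analysis


def adequate(counts, minimum=MIN_CASES):
    """Return the categories that reach the minimum number of cases."""
    return {category for category, n in counts.items() if n >= minimum}


# Pool the surveys by adding up the counts per category.
pooled = Counter()
for counts in survey_counts.values():
    pooled.update(counts)

for name, counts in survey_counts.items():
    print(name, sorted(adequate(counts)))
print("pooled", sorted(adequate(pooled)))
```

In this made-up example, no single survey has enough respondents with a doctorate, but the pooled data does.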
3. Improving the robustness and reproducibility of results
Lastly, we look into how ex-post harmonization can make findings more robust and replicable. The replication crisis and the push towards open science have brought such issues to the center of attention. Ex-post harmonization can play an important role here. The increased sample size already contributes to that. However, ex-post harmonization also helps in two specific ways.
IPD Meta-Analysis

Spearheaded in clinical research, individual patient data (IPD) meta-analyses (or Integrative Data Analysis) represent an important innovation towards more robust scientific insights [4]. Meta-analyses are attempts to make findings more robust by testing them based on ideally all existing studies on the relevant effect at once. Traditionally, meta-analyses did this by aggregating the effect sizes of different studies (often experiments). Additional characteristics of the studies are included via moderator variables [5]. However, this approach glosses over the rich details of the individual studies and the individual participants in each study. IPD meta-analyses improve upon that by pooling not study averages but individual raw data. They are, in other words, ex-post harmonization projects. IPD meta-analysis is especially promising for social science research, where standardized, experimental research, which is the basis for traditional meta-analysis, is not the norm. For a social science project that has been conceived with a meta-analytical approach in mind, see HaSpaD.
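The difference between the two approaches can be sketched in code. The first function shows one common way of aggregating study-level effect sizes, a fixed-effect inverse-variance average; this is my choice for illustration, and other weighting schemes exist. The second function stacks individual records and keeps the study as a variable, which is the starting point of an IPD analysis. All names and numbers are hypothetical, and the `trust` variable is assumed to be already harmonized to a common scale.

```python
from statistics import mean


def fixed_effect_pool(effects, variances):
    """Traditional approach: inverse-variance weighted average of effect sizes.

    Returns the pooled effect and its variance.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    return pooled, 1.0 / total


def stack_ipd(studies):
    """IPD approach: stack individual records and record their source study."""
    return [dict(row, study=name) for name, rows in studies.items() for row in rows]


# Traditional meta-analysis only sees one summary per study.
pooled_effect, pooled_var = fixed_effect_pool([0.20, 0.40], [0.01, 0.01])

# IPD meta-analysis works with the (harmonized) individual records.
studies = {
    "study_1": [{"age": 24, "trust": 3}, {"age": 51, "trust": 4}],
    "study_2": [{"age": 29, "trust": 2}, {"age": 63, "trust": 5}, {"age": 35, "trust": 4}],
}
ipd = stack_ipd(studies)

# With individual data, questions about subgroups across studies become possible.
young = [row["trust"] for row in ipd if row["age"] < 30]
print(len(ipd), mean(young))
```

The stacked records retain everything the study summaries discard, which is why individual-level questions, such as the subgroup mean at the end, can only be answered in the second setup.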
Comparative Methodology

The benefits of ex-post harmonization for robustness are not limited to new forms of meta-analysis, however. By their nature, ex-post harmonization projects require practitioners, the users of the combined data, and the eventual readers to engage actively with the methods used. That entails the methods of harmonization as well as the methods of the source surveys. And just like comparative research between cultures reveals more than studies based on only one culture, projects based on several surveys reveal much about the methods and the degrees of freedom used in collecting and cleaning the data: a comparative methodology, if you like. This may also benefit data producers because insights gained during ex-post harmonization can provide valuable feedback for them. Meanwhile, researchers can (and of course already do) combine data from many different surveys to answer pressing methodological questions (e.g., [6]). For an extensive discussion of comparative methodology, see a chapter by Słomczyński and Tomescu-Dubrow [7] as well as a book by Słomczyński and colleagues [8]. The latter is also available for download.
In sum (and in parts)
Ex-post harmonization is hard work. There is no denying that. Nonetheless, I hope to have shown some ways in which it may benefit your research questions. In the sense that ex-post harmonization takes existing data and creates value far beyond the original separate surveys, it is not just mere data reuse; it is data upcycling.
Next month’s post in this series: “Apples and Oranges: How to find out if two questions measure the same concept?”
With that in mind, I hope you are already looking forward to the next month’s post in this series: “Apples and Oranges: How to find out if two questions measure the same concept?” There we will delve into one of the first and most crucial questions ex-post harmonization practitioners must ask themselves before combining data: Do two variables from different surveys measure “the same thing,” and what does that even mean?
References
[1] Dubrow, J. K., & Tomescu-Dubrow, I. (2016). The rise of cross-national survey data harmonization in the social sciences: Emergence of an interdisciplinary methodological field. Quality and Quantity, 50(4), 1449–1467. https://doi.org/10.1007/s11135-015-0215-z
[2] Shimizu, I. (2014). Multistage Sampling. In N. Balakrishnan, T. Colton, B. Everitt, W. Piegorsch, F. Ruggeri, & J. L. Teugels (Eds.), Wiley StatsRef: Statistics Reference Online. John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118445112.stat05705
[3] Häder, S. (2014). Stichproben in der Praxis. SDM Survey Guidelines. https://doi.org/10.15465/SDM-SG_014
[4] Hussong, A. M., Curran, P. J., & Bauer, D. J. (2013). Integrative Data Analysis in Clinical Psychology Research. Annual Review of Clinical Psychology, 9(1), 61–89. https://doi.org/10.1146/annurev-clinpsy-050212-185522
[5] Hussong, A. M., Curran, P. J., & Bauer, D. J. (2013). Integrative Data Analysis in Clinical Psychology Research. Annual Review of Clinical Psychology, 9(1), 61–89. https://doi.org/10.1146/annurev-clinpsy-050212-185522
[6] Ortmanns, V. (2020). Explaining Inconsistencies in the Education Distributions of Ten Cross-National Surveys – the Role of Methodological Survey Characteristics. Journal of Official Statistics, 36(2), 379–409. https://doi.org/10.2478/jos-2020-0020
[7] Słomczyński, K. M., & Tomescu-Dubrow, I. (2018). Basic Principles of Survey Data Recycling. In T. P. Johnson, B.-E. Pennell, I. A. L. Stoop, & B. Dorer (Eds.), Advances in Comparative Survey Methodology: Multinational, Multiregional and Multicultural Contexts (3MC) (Ch. 43, pp. 937–962). Hoboken, New Jersey: Wiley.
[8] Słomczyński, K. M., Tomescu-Dubrow, I., & Jenkins, J. C., with Kołczyńska, M., Powałko, P., Wysmułek, I., Oleksiyenko, O., Zieliński, M., & Dubrow, J. K. (2016). Democratic Values and Protest Behavior: Harmonization of Data from International Survey Projects. Warsaw: IFiS Publishers.