Apples and Oranges: How to find out if two questions measure the same concept?

This month we delve into an important but difficult matter in harmonization in general and ex-post harmonization in particular: How can we determine if two instruments measure the same concept? This is especially hard if the concepts are latent (i.e., not directly observable).


DOI: 10.34879/gesisblog.2020.24


Many of the complexities of ex-post harmonization stem from combining variables, instruments, and concepts that are somewhat but not fully alike. By doing so, we not only try to compare apples and oranges, but want to blend them into a tasty data smoothie. And in larger projects, we might even try to blend grapes, melons, and cherries as well. However, unlike with smoothies, blending very different concepts can easily go awry in ex-post harmonization. In general, we want to combine like with like: two or more variables which all measure the same concept. Alternatively, we want to at least make careful and informed decisions about what to blend. As such, this post will explore some of the complexities we face when we want to assess whether two instruments (or data measured with these instruments) reflect the same concept.

Concept Mismatch

If we combine variables that reflect very different concepts, we may introduce systematic bias or even create variables that cannot be sensibly interpreted at all. The problem with concept mismatches is that even subtle instrument differences can introduce bias. And this bias may remain dormant in some analyses but distort others substantially. Consider political efficacy, which can be differentiated into internal and external political efficacy. Internal efficacy reflects whether respondents personally feel competent to participate in politics. External efficacy reflects whether respondents feel that their legal and social environment is responsive to individual political participation 1. Both internal and external political efficacy are, thus, facets of one’s self-efficacy regarding political participation. However, a shift in the social or legal environment in a country may impact external political efficacy more than internal political efficacy. If we have combined data reflecting internal political efficacy with data reflecting external political efficacy, we have created a variable where some cases react to changes in the external environment, and others do not. Such heterogeneity can lead to spurious results, especially if the concept mismatch is not random but systematic with regard to source surveys or populations.

Comparability = Validity × 2

So conceptual comparability seems to be pretty important in ex-post harmonization. However, what is conceptual comparability? In some cases, a match or mismatch is self-evident, but for more complex cases, we need some formal idea. To my mind, conceptual comparability is nothing other than validity times two. The validity of a measurement instrument is how closely its measurement results reflect the intended concept in reality (and not other concepts). So validity captures whether an instrument measures what it should. Conceptual comparability, in essence, does the same, just twice: Does one instrument capture what it should, and does the other instrument also capture what it should (i.e., the same concept)? So instead of comparing the instruments directly, comparability might be best established if we examine the relationship between measurement and reality for each instrument. This might seem needlessly complicated. However, as we will see next, latent constructs such as attitudes, values, emotions, or intentions are fickle and not easily captured.

Latent constructs

Latent constructs can be thought of as the unobservable causes of observable phenomena 2. And because latent constructs cannot be directly observed, they have to be inferred indirectly. That seems like a horribly abstract way of thinking about everyday “things” such as attitudes. However, have you ever seen an attitude in the wild? Have you touched one? You might have experienced one of your own attitudes if you paid careful attention. However, if we want to know the attitudes of others, we cannot observe them, only infer them. This nature of latent constructs also explains the term “construct”: Such concepts are called constructs because we have to construct them from externally observable phenomena such as behavior, facial expressions, or utterances. And what we do every day in our lives, science does in a formalized manner: building bridges between what we can observe (e.g., a survey response) and what we are actually interested in (e.g., a respondent’s attitude).

Meanwhile, to assess the conceptual comparability of two instruments, we do exactly that, just times two: Once for each of the instruments. Hence, we will look into different strategies to validate instruments measuring latent constructs in a moment. First, however, a general point about how latent constructs are usually accessed in the social sciences. This will help us reconcile different measurement approaches and allows us to use powerful psychometric tools to assess comparability.

Introspection access

A widespread approach in the social sciences is to ask respondents directly about a latent construct. If we want to know how interested a respondent is in politics, we might just ask: “Generally speaking, how interested are you in politics?” Nothing could be simpler. Or is it that simple? In fact, the response process here is nothing short of magical. We ask respondents to access completely abstract aspects of themselves and then fit those into a few response options 3. The key assumption here is that respondents possess the power of introspection 4. Respondents are asked to look within themselves, recall relevant instances of behavior and experiences, aggregate this information, and form a coherent response about the abstract commonalities of all these disparate experiences. This approach allows us to ask about complex constructs with only a single question.

However, introspection has its limits. Some constructs are hard for respondents to access, for example, because the concept is too abstract or not consciously accessible. Other constructs are hard for respondents to judge correctly. Think about social competency: Correctly judging your own social competency is itself a crucial component of social competency. People who struggle with that may feel that they are socially adept but are not. There are also concepts which respondents choose to hide from researchers (i.e., impression management), or that they even hide from themselves (i.e., self-deception) 5. A classic example is racism. Some respondents may harbor prejudice against whole ethnic groups but have learned to hide that sentiment in most social settings. Other respondents may hold racist beliefs but have found rationalizations to maintain a self-image as a tolerant, open-minded person.

Multi-item access

The limits of introspection have led researchers to seek ways to access latent constructs even if respondents are unable (or unwilling) to provide such insights. In survey research, the most common solution is multi-item instruments (e.g., questionnaires that employ several questions with the same response options for a single construct). Multi-item instruments solve the limitations of introspection in an elegant way. Latent constructs are, if you will recall, the unobserved causes of observable phenomena. So, suppose I keep track of many (often behavioral) phenomena influenced by the concept I am interested in. In that case, I can use this to infer the concept. For example, I cannot observe your political interest directly. Still, I can, in principle, observe your media consumption and then infer if you are interested in politics. However, I cannot reliably infer the construct from just one specific behavior. If you click on articles about a new environmental protection law, you might be interested in politics or in environmental protection. However, if you also click on articles about other legislative discussions, about elections, or about political parties, it becomes more and more likely that your behavior is governed by political interest.

This basic principle is used by multi-item instruments. Each item separately touches upon the construct we aim for but is also influenced by other factors. However, across all items, the construct we aim for is the only thing they all have in common. And this commonality can then be extracted, often with factor analytical methods 6. This means that respondents can answer less abstract questions, while the instrument does the heavy lifting of conceptual integration. However, what does that have to do with comparability? The logic of multiple measurements reflecting one concept is nothing else than a test of comparability built into factor analysis. And as we will see in a bit, we can also make use of this logic to assess single-item (introspective) measures.
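
To make the “common core” idea concrete, here is a minimal sketch of how the covariation among items can be summarized. Cronbach’s alpha is used as a simple internal-consistency summary; the three items and all responses below are invented for illustration:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of responses per item (same respondents, same order)."""
    k = len(items)
    item_vars = sum(pvariance(item) for item in items)
    # Sum each respondent's answers across items to get total scores
    total_scores = [sum(vals) for vals in zip(*items)]
    return k / (k - 1) * (1 - item_vars / pvariance(total_scores))

# Invented responses to three hypothetical political-interest items,
# five respondents each
items = [
    [1, 2, 4, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 4, 4, 4],
]
print(round(cronbach_alpha(items), 2))
```

A high alpha indicates that the items share substantial common variance, which is consistent with (though not proof of) a single underlying construct.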

Assessing comparability via validity

In what follows, we will take a stroll through the vast landscape of validation approaches. If you will recall, the way we want to establish comparability is to assess if both instruments are valid measures of the same construct. Due to the intangible nature of latent constructs, validation and, thus, comparability assessment is a complex undertaking. Therefore, the approaches below are intended more as inspiration for how to make a case for comparability than as fixed rules. And despite the focus on latent constructs, you might also find that many of the ideas can be applied to manifest concepts (such as socio-structural variables) as well.

Construct validity

Construct validity is the closest to what we discussed so far: Testing if our instruments actually measure what we want to measure. However, that is easier said than done. A starting point might be to gather expert opinions; both domain experts familiar with the concept and methodological experts familiar with measurement instrument design and survey response processes 7.

Ideally, however, we do not stop here. Even very well-versed experts may not be able to fully predict how an instrument is interpreted by respondents. Hence, survey instruments are often assessed and improved with pretests 8. This encompasses both qualitative interviews and more standardized forms that can be scaled to many test respondents (such as web probing). In ex-post harmonization, we may not be able to conduct our own pretests. Yet, we can profit from pretests conducted by the data producers. Published pretests are thus a good resource, such as those in the GESIS Pretest Database, which features pretest reports on English and German instruments. Even if the exact instruments are not included, reading pretests about similar instruments may already be enlightening. For researchers immersed in a topic, it is often hard to predict how thematically naïve respondents will interpret (and often misinterpret) even seemingly obvious questions.

Construct validity is often also assessed with correlative measures. Validity can, after all, be seen as a correlation between the measurement results and the construct we intend to measure. Unfortunately, the construct itself is not directly accessible. However, we can correlate an instrument we are interested in with other, well-established instruments (or other indicators) for the target construct. This is the basis for convergent validity, which is established if different instruments that are supposed to measure the same construct correlate highly. The other side of the coin is discriminant validity, which is established if our instruments correlate only weakly with instruments measuring distinct other constructs 9. With a look towards comparability, this means that the instruments we want to combine should (if administered together) be highly correlated. It also means that they should correlate strongly with other variables governed by the same construct. Different measures of political interest should correlate well with measures of political media usage frequency, for example. At the same time, they should correlate less with media usage frequency of non-political content.
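
The convergent/discriminant logic can be sketched in a few lines. The measures and responses below are invented; the point is only the pattern of high versus low correlations:

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation for two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Invented responses: two political-interest wordings and one
# unrelated measure (sports media use), same six respondents
interest_a = [1, 2, 3, 4, 5, 5]
interest_b = [1, 1, 3, 4, 4, 5]
sports_use = [5, 3, 4, 1, 5, 2]

print(round(pearson(interest_a, interest_b), 2))  # convergent: high
print(round(pearson(interest_a, sports_use), 2))  # discriminant: low
```

The first correlation supports convergent validity (two wordings, one construct); the second, near-zero one supports discriminant validity against a distinct construct.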

A more elaborate form of construct validity testing is nomological networks. Here, different constructs and their expected relationships are formalized in a network of correlations: relevant indicators and constructs, as well as their theoretical relationships with each other, are (often graphically) laid out. A key idea is that the theoretical layer (the constructs and their relationships) is explicitly separated from the observation layer (the measurements we obtain), and the correspondence between measurement and theory is explicitly modeled 10. The practical implementation is then often a structural equation model (SEM). Instead of testing convergent and discriminant instruments dyadically, we can include all indicators (measurements) and derived constructs in a single model. We can also include influence factors that impact the target construct as well as the consequences of our target construct. As such, nomological networks touch upon criterion validity, which will be discussed shortly.

In the case of multi-item instruments, construct validity is often also assessed with a confirmatory factor analysis (CFA). CFAs have many uses, but for our purposes, one is already tremendously helpful: A CFA can assess if all items measure a single factor and thus a single construct 6. (There are also instruments capturing several factors at once, but that goes beyond the scope of this post.) If we want to ex-post harmonize two multi-item instruments, it is prudent to test if each actually reflects only one construct. That does not prove that it is the same construct, but it is a necessary precondition for comparability. However, CFAs are also advantageous in ex-post harmonization because we can use them as a formalized test of whether different instruments measure the same construct, provided we have data where the same participants have answered both (or all) instruments. Crucially, this also works if one instrument is a single-item instrument, or if we combine three or more single-item instruments in one questionnaire. A practical application would work like this: We could combine the wordings of different single-item measures into one questionnaire. Each wording is then asked with the same response scale in an item battery. We can then use a CFA to test if all the different wordings reflect the same construct or several different constructs. In an experiment, Cornelia Neuert and I used that approach to compare different question wordings for social trust. Interestingly, positive wordings (e.g., other people can be trusted) and negative wordings (e.g., others will exploit you) form separate factors. Trust and distrust are separate, if correlated, constructs (manuscript in preparation 11). And that is just one example of how we can use the logic of multi-item access to latent constructs to better understand single-item (often introspective) instruments.
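
A full CFA requires specialized software, but a rough unidimensionality heuristic in the same spirit can be sketched in pure Python: if one construct drives all items, the first eigenvalue of their inter-item correlation matrix should dominate. The correlation matrix below is invented; power iteration approximates the largest eigenvalue:

```python
def largest_eigenvalue(matrix, steps=200):
    """Approximate the dominant eigenvalue via power iteration."""
    v = [1.0] * len(matrix)
    eig = 0.0
    for _ in range(steps):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
        eig = max(abs(x) for x in w)       # rescale to avoid overflow
        v = [x / eig for x in w]
    return eig

corr = [  # invented inter-item correlations for four items
    [1.0, 0.6, 0.5, 0.6],
    [0.6, 1.0, 0.55, 0.5],
    [0.5, 0.55, 1.0, 0.6],
    [0.6, 0.5, 0.6, 1.0],
]
# The trace of a correlation matrix equals the number of items,
# so this ratio is the share of total variance on the first factor
share = largest_eigenvalue(corr) / len(corr)
print(f"first factor explains roughly {share:.0%} of total variance")
```

A clearly dominant first eigenvalue is consistent with a single construct; a second eigenvalue of similar size would warn that the items may reflect several constructs.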

Content validity

Many constructs in the social sciences are not monolithic and homogeneous. Instead, they have many different facets and expressions. Content validity approaches do this diversity justice by assessing if all relevant facets and expressions are captured by a measure 12. This perspective is informative for ex-post harmonization because we often have to combine instruments with different scopes. Kindly recall the example of internal and external political efficacy. Now imagine we want to harmonize data measured with three instruments: an omnibus instrument capturing general political efficacy, an instrument capturing only internal efficacy, and one capturing only external efficacy. Comparability, or the lack thereof, is now obviously an issue of content validity.

Criterion validity

Lastly, we can also validate instruments via expected outcomes. The approach is widely used in clinical, educational, and professional settings 13. Instruments used to assess professional aptitude are routinely validated by correlating their scores with the subsequent work performance of employees, for example. A focus on outcomes is, of course, not exclusive to individual diagnostics. Many researchers in the social sciences, especially in applied fields, care a lot about their constructs’ explanatory power for a crucial outcome. As such, it is beneficial to assess comparability with a look towards predictive power. The instruments we want to harmonize should ideally explain the relevant outcome(s) equally well. Even if our main research intention is not predictive, criterion validity can still help us test if both instruments behave similarly when applied to different outcomes. Keep in mind, however, that the outcomes are usually also measured with different instruments in different surveys. Differences in criterion validity can thus also be due to how the criterion (outcome) has been measured. Differences in reliability can be an issue, for example 14. Finally, as I alluded to earlier, nomological networks are a formal way to combine construct and criterion validity in a single model.
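
The reliability caveat can be made concrete with Spearman’s classic correction for attenuation: an observed criterion correlation shrinks as the reliabilities of the two measures fall. A minimal sketch with invented numbers:

```python
from math import sqrt

def disattenuated(r_observed, rel_x, rel_y):
    """Correlation corrected for unreliability in both measures
    (Spearman's correction: r_true = r_obs / sqrt(rel_x * rel_y))."""
    return r_observed / sqrt(rel_x * rel_y)

# Invented example: two instruments with the same true criterion
# relationship, measured with different reliabilities
print(round(disattenuated(0.45, 0.9, 0.9), 2))  # reliable instruments
print(round(disattenuated(0.32, 0.8, 0.8), 2))  # noisier instruments
```

If two instruments differ in criterion validity, comparing their disattenuated correlations can help separate genuine conceptual differences from mere reliability differences.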


That was a lot to take in, I am sure. Let us enjoy a (sadly virtual) smoothie together and recap the main points of this post. It is hard to establish if an instrument measures the concept we are interested in. It is harder still to ensure that several instruments we want to combine all measure the same concept. Fortunately, we can apply all the tools used to validate a single instrument to also assess if two instruments are comparable. Of course, the long list above is not a mandatory checklist. It is also far from complete. Still, the examples hopefully illustrated that we can do a lot to ensure that variables we want to combine will lead to harmonized target variables that can be sensibly interpreted. If you are unsure which approaches to use to establish comparability, I would suggest orienting yourself towards how your field validates single instruments. Lastly, I want to emphasize that all of this is not just a methodological chore that we have to suffer through to appease reviewers. Instead, such an active engagement with the instruments we harmonize can be enormously inspiring. In my experience, methodological explorations of the source material often lead to novel and worthwhile insights into the subject matter as well.


  1. Karp, J. A., & Banducci, S. A. (2008). Political Efficacy and Participation in Twenty-Seven Democracies: How Electoral Systems Shape Political Behaviour. British Journal of Political Science, 38(2), 311–334.
  2. Bollen, K. A. (2002). Latent Variables in Psychology and the Social Sciences. Annual Review of Psychology, 53(1), 605–634.
  3. Tourangeau, R., Rips, L. J., & Rasinski, K. A. (2000). The psychology of survey response. Cambridge University Press.
  4. Tourangeau, R., Rips, L. J., & Rasinski, K. A. (2000). The psychology of survey response. Cambridge University Press.
  5. Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of Personality and Social Psychology, 46(3), 598–609.
  6. Raykov, T., & Marcoulides, G. A. (2011). Introduction to Psychometric Theory. Routledge.
  7. Moosbrugger, H., & Kelava, A. (2012). Testtheorie und Fragebogenkonstruktion (2nd ed.). Springer.
  8. Lenzner, T., & Neuert, C. E. (2017). Pretesting Survey Questions Via Web Probing – Does it Produce Similar Results to Face-to-Face Cognitive Interviewing? Survey Practice, 10(4), 1–11.
  9. Raykov, T., & Marcoulides, G. A. (2011). Introduction to Psychometric Theory. Routledge.
  10. Moosbrugger, H., & Kelava, A. (2012). Testtheorie und Fragebogenkonstruktion (2nd ed.). Springer.
  11. Singh, R. K., & Neuert, C. (2020). Do different measures of social trust capture the same construct? An exploration of factor analytical and qualitative methods. Manuscript in Preparation.
  12. Moosbrugger, H., & Kelava, A. (2012). Testtheorie und Fragebogenkonstruktion (2nd ed.). Springer.
  13. Moosbrugger, H., & Kelava, A. (2012). Testtheorie und Fragebogenkonstruktion (2nd ed.). Springer.
  14. Raykov, T., & Marcoulides, G. A. (2011). Introduction to Psychometric Theory. Routledge.
