(Not) by any stretch of the imagination: A cautionary tale about linear stretching

Linear stretching is a frequently used approach for combining data from response scales with different numbers of response options. In linear stretching, the scales’ minimum scores are equated, as are their maximum scores, and all values in between are spaced at equal distances within this range. However, while temptingly easy to use, linear stretching runs the risk of seriously biasing analyses of the harmonized dataset.


DOI: 10.34879/gesisblog.2021.30


Introduction

Last month’s blog post explored the relationship between reality and the numerical scores we have in our datasets. This month, we apply those basic ideas to better understand the limitations of a popular harmonization technique: linear stretching. You might know the approach under another name, but chances are you have already come across it. The approach tries to solve one of the most obvious challenges in harmonization: what to do if two instruments differ in the number of response categories. One survey might offer four response options, another five, and yet another seven to capture the same construct. In this post, we will look into how linear stretching works, but also why it falls short of fully establishing comparability.

What is linear stretching?

Linear stretching comes in many forms. To understand the basic idea, we look at a widespread variant: stretching shorter scales to the range of a longer scale. Consider two instruments with the same question wording, but one instrument offers seven response options (i.e., a seven-point scale) and the other only five (i.e., a five-point scale). There is an obvious need to transform the data measured with one (or both) instrument(s) before combining them for our analyses. Linear stretching does this by transforming the scores measured with the shorter scale (five-point) into the format of the longer scale (seven-point). The algorithm is the following: Set the lowest values of the scales as equal (here 1 = 1) and set the highest values of the scales as equal (here 5 = 7). Then distribute the values in between at equal distances between the two endpoints [1]. A five-point scale (1, 2, 3, 4, 5), translated into a seven-point scale, thus becomes a scale with the values 1, 2.5, 4, 5.5, 7. The small animation below illustrates why this is called linear “stretching.”
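Expressed as a formula, a score x on an n-point scale maps to 1 + (x − 1) · (m − 1)/(n − 1) on an m-point scale. The following Python sketch (the function name and the assumption of 1-based integer scales are ours, not part of any particular package) reproduces the five-to-seven-point example:

```python
def stretch(value, n_source, n_target):
    """Linearly map a score from an n_source-point scale onto an n_target-point range."""
    return 1 + (value - 1) * (n_target - 1) / (n_source - 1)

# A five-point scale mapped onto a seven-point range:
print([stretch(v, 5, 7) for v in range(1, 6)])  # [1.0, 2.5, 4.0, 5.5, 7.0]
```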

If we understand this basic idea, we can also understand different variants of linear stretching. Instead of scaling up the five-point scale to the seven-point scale, we can also compress the seven-point scale to the five-point scale. If we want a more standardized scale format for different scale lengths, we can also stretch all scales to a format between 0 and 100 [2]. Again, the lowest scale point is set to 0, the highest to 100, and all other values are equally spaced in between. For a five-point scale, this would be 0, 25, 50, 75, and 100. Note, however, that all these variants work the same way mathematically and thus share the same limitations.
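All of these variants are instances of the same general linear map. Here is a sketch of the 0-to-100 variant (Cohen et al. call such scores “POMP” scores [2]; the function name below is our own):

```python
def stretch_to_range(value, src_min, src_max, dst_min=0.0, dst_max=100.0):
    """Linearly map a score from [src_min, src_max] onto [dst_min, dst_max]."""
    return dst_min + (value - src_min) * (dst_max - dst_min) / (src_max - src_min)

print([stretch_to_range(v, 1, 5) for v in range(1, 6)])
# [0.0, 25.0, 50.0, 75.0, 100.0]
print([round(stretch_to_range(v, 1, 7), 1) for v in range(1, 8)])
# [0.0, 16.7, 33.3, 50.0, 66.7, 83.3, 100.0]
```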

Pitfalls

Linear stretching seems quite sensible at first. However, a small thought experiment reveals that it is probably not a perfect solution. Remember, what we want in ex-post harmonization is that the same number (approximately) represents the same construct intensity across data measured with different instruments. A “3” should always point towards a similar level of interest in sports, for example. However, linear stretching only accounts for differences in scale points and nothing else. Now imagine two instruments for a construct that happen to have exactly the same number of response options but differ in another aspect, such as the question wording or their response category labels. It becomes apparent that linear stretching cannot address such differences at all [3]. The fallacious implication here would be that the numerical scores mean exactly the same across the two instruments just because the number of response options is the same. And that is obviously not true. If asked how “passionate” you are about sports, you will probably choose a lower level of agreement than if merely asked how “interested” you are in sports. To better grasp what is happening here (and what linear stretching does not account for), let us apply the intuition we developed last month.

Response “Thresholds”

The basic idea of last month’s post was that most latent constructs we measure are continuous, but we measure them with discrete numeric scores. A measurement instrument cuts the continuum of possible construct expressions (or intensities) into segments. Respondents who fall into a segment are most likely to choose the corresponding response option. Those segment boundaries can be seen as thresholds beyond which the next more (or less) intense response option is chosen.

It is crucial to recognize that different measurement instruments can result in very different thresholds, meaning they cut the range of construct expressions into different segments. The problem with linear stretching is that it implicitly relies on two unrealistic assumptions about those thresholds: (1) It assumes that the endpoints of different measurement instruments (their highest and lowest scores) capture the same construct intensities. (2) It assumes that all segments are “equidistant,” or in other words, that each numerical score covers a segment of equal size: The difference in construct intensity between respondents who chose a “1” and respondents who chose a “2” is the same as between respondents who chose a “2” and a “3,” and so on. Unfortunately, response behavior is neither that simple nor predictable. Depending on the question wording and the response options (and many other factors), the construct intensity thresholds between response options move around quite a bit between different measurement instruments. Let us look into some intuitive examples of why that happens and what it signifies.
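A small simulation makes the threshold idea tangible. In the sketch below, all thresholds are invented for illustration: the same latent scores are cut by two different threshold sets. Because both hypothetical instruments have five response options, linear stretching would leave the scores untouched, yet the resulting distributions clearly differ:

```python
import numpy as np

rng = np.random.default_rng(42)
latent = rng.normal(0, 1, 10_000)  # one population, one latent construct

# Invented thresholds for two instruments measuring the same construct;
# instrument B's thresholds sit further right (harder to agree with).
thresholds_a = [-1.5, -0.5, 0.5, 1.5]
thresholds_b = [-0.5, 0.3, 1.0, 1.8]

scores_a = np.digitize(latent, thresholds_a) + 1  # scores 1..5
scores_b = np.digitize(latent, thresholds_b) + 1  # scores 1..5

# Shares per response option and means differ, although both scales are
# five-point and linear stretching would change nothing:
print(np.bincount(scores_a, minlength=6)[1:] / latent.size)
print(np.bincount(scores_b, minlength=6)[1:] / latent.size)
print(scores_a.mean(), scores_b.mean())
```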

Questions can be easier or harder to agree to

Please recall the earlier example of asking whether people are “passionate” about sports or “interested” in sports. The same respondent would choose a higher level of agreement with the weaker statement (“interested”) than with the stronger statement (“passionate”). (In the lingo of psychometrics, the two instruments have different “item difficulties” [4].) This means that the thresholds move left or right along the continuum. If the statement is easier to agree to, the agreement segment becomes larger and the disagreement segment smaller. Thus, the response distribution shifts towards the “agreement” end of the scale. The following animation illustrates this visually. Below, we see the latent score cut into segments by thresholds. The thresholds shift left and right depending on how easy or hard the question wording is to agree to. Above, we see the resulting response distribution in our dataset.

Linear stretching cannot account for such shifts to the left or right because it assumes that each scale’s endpoints are in the same position. In our example, this means assuming that “totally agree” means the same whether we ask about being “interested” in something or about being outright “passionate” about it. As a consequence, linear stretching does not correct the shift of the response distribution to the left or right. This means that responses are systematically biased depending on which survey they came from, even though our dataset is supposed to represent the same population. This introduces error variability into our models and thus reduces their power. The issue may also lead to spurious or biased correlations.

Respondents may use the full range of the scale or only part of it

There are many concepts in the social sciences where respondents do not use the full range of a conventional scale (e.g., a Likert scale from “strongly agree” to “strongly disagree”). A classic example is life-satisfaction scales, where the vast majority of respondents tend towards expressions of higher satisfaction or happiness. Expressions of very low satisfaction are seldom chosen. The problem is so pronounced that some surveys have resorted to asymmetric response scales, such as “extraordinarily satisfied,” “very satisfied,” “satisfied,” “fairly satisfied,” and “not very satisfied” [5]. This matters because how much of the range of response options is used affects the variance of responses: here, the asymmetric scale would yield a higher variance because respondents use its full range, while the symmetric scale would yield a lower one. Something similar happens if measurement instruments invite a midpoint response style (respondents favor the middle response options) or an extreme response style (respondents favor the outermost options). In any case, the result is the same: Depending on the measurement instrument, responses are spread further apart or squished together.

And again, linear stretching cannot correct such differences. After all, it assumes that the endpoints of both scales are equally far apart in terms of the underlying construct. The consequence is that the variance of data measured with different instruments is not comparable, as the small sketch below illustrates.
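Here is a short simulation of the life-satisfaction case (the latent distribution and all thresholds are invented for illustration): the same skewed population answers a symmetric and an asymmetric hypothetical five-point instrument, and even after stretching both to 0-100, the variances remain far apart:

```python
import numpy as np

rng = np.random.default_rng(7)
# Life satisfaction tends to be high: a skewed latent distribution (made up).
latent = np.clip(rng.normal(0.7, 0.2, 10_000), 0, 1)

# Invented thresholds on the latent [0, 1] continuum:
symmetric = [0.2, 0.4, 0.6, 0.8]    # most answers pile up in the top categories
asymmetric = [0.45, 0.6, 0.7, 0.8]  # categories concentrated where respondents are

sym = np.digitize(latent, symmetric) + 1    # scores 1..5
asym = np.digitize(latent, asymmetric) + 1  # scores 1..5

# Stretch both five-point scales to 0-100; the variances stay incomparable:
to_pomp = lambda s: (s - 1) * 100 / 4
print(to_pomp(sym).var(), to_pomp(asym).var())
```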

Response options may be more (or less) attractive

Lastly, each response option may have a unique meaning (which can also depend on the question wording or the construct). Think of the left-right continuum. Suppose we measure respondents’ left-right orientation with a scale featuring a midpoint (e.g., the often-used 11-point scale). In that case, the midpoint may be interpreted as the center of the political spectrum. Consequently, respondents may find the midpoint very attractive as a way to avoid “choosing sides.” If we look at data from the European Social Survey (ESS) for Germany in 2016, we see that more than a third of all respondents chose the midpoint response option even though they had eleven response options to choose from [6]. The result is that the response options are no longer equidistant: the thresholds surrounding such a response option shift because the option is very attractive (or unattractive). In the animation below, the response option “4” becomes more or less attractive or inclusive in the sense that more or fewer respondents feel represented by it.

Again, this is an issue that linear stretching does not solve, since it assumes response categories to be equidistant. This again introduces bias into our harmonized variable. In extreme cases, such shifting thresholds may mean that the values in a linearly stretched variable are not even ordinal anymore: a higher value from one instrument may signify a lower construct expression than a lower value from another instrument, as the constructed example below shows.
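The following sketch constructs such a case (every threshold is invented). Category “midpoints” stand in for the typical latent intensity of respondents choosing that category; instrument B’s attractive scale midpoint widens its middle segment, and a higher stretched score from instrument A ends up representing a lower typical intensity than a lower score from instrument B:

```python
def midpoints(thresholds, lo=0.0, hi=1.0):
    """Typical latent intensity per category: the middle of each segment."""
    cuts = [lo, *thresholds, hi]
    return [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]

# Instrument A: 5 points, equidistant thresholds on the latent [0, 1] range.
mid_a = midpoints([0.2, 0.4, 0.6, 0.8])            # 0.1, 0.3, 0.5, 0.7, 0.9
# Instrument B: 7 points, with a very attractive midpoint covering 0.3-0.7.
mid_b = midpoints([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])  # ..., 0.25, 0.5, 0.75, ...

# Stretch A's scores onto the seven-point range: 1, 2.5, 4, 5.5, 7.
stretched_a = [1 + (v - 1) * 6 / 4 for v in range(1, 6)]

# A's "4" stretches to 5.5 but represents a typical intensity of 0.7,
# while B's "5" keeps the lower value 5 yet represents roughly 0.75:
print(stretched_a[3], mid_a[3])  # 5.5 0.7
print(5, mid_b[4])               # 5   0.75
```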

Conclusion

Now, of course, linear stretching is not completely indefensible. It does mitigate the problem of measurement instruments with different numbers of response options to some extent, and it does not always introduce all of the biases above to a substantial degree. However, it can introduce such biases into our harmonized data, and they are not always easy to identify. If our harmonization project combines data drawn from different populations, then real population differences and biases introduced by linear stretching intermingle and cannot easily be isolated. In the end, analyses using linearly stretched data may be less reliable and less valid. While we should not outright discard conclusions based on linearly stretched data, we should be mindful of the approach’s limitations. And wherever possible, we should look towards harmonization techniques that transform instruments more faithfully. We will focus precisely on this and discuss linear and equipercentile equating as promising alternatives in the next two blog posts in this series, coming up in February and March.

References

  1. Jonge, T. de, Veenhoven, R., & Kalmijn, W. (2017). Diversity in Survey Questions on the Same Topic: Techniques for Improving Comparability. https://doi.org/10.1007/978-3-319-53261-5_1
  2. Cohen, P., Cohen, J., Aiken, L. S., & West, S. G. (1999). The Problem of Units and the Circumstance for POMP. Multivariate Behavioral Research, 34(3), 315–346. https://doi.org/10.1207/S15327906MBR3403_2
  3. Jonge, T. de, Veenhoven, R., & Kalmijn, W. (2017). Diversity in Survey Questions on the Same Topic: Techniques for Improving Comparability. https://doi.org/10.1007/978-3-319-53261-5_1
  4. Moosbrugger, H., & Kelava, A. (2012). Testtheorie und Fragebogenkonstruktion (2nd Edition). Springer.
  5. Jonge, T. de, Veenhoven, R., & Kalmijn, W. (2017). Diversity in Survey Questions on the Same Topic: Techniques for Improving Comparability. https://doi.org/10.1007/978-3-319-53261-5_1
  6. European Social Survey ERIC (ESS ERIC). (2015). European Social Survey (ESS), Round 7—2014. https://doi.org/10.21338/NSD-ESS7-2014
