Let me get my bias out of the way early—I’m not a big believer in multidimensional constructs, and I really don’t think there is any reason, given the accessibility of structural equation modeling, to construct mean-scaled scores for psychometrically-measured latent constructs. Nonetheless, it is still quite common to see researchers collapse multiple indicators into a single observed variable (for example, by taking the mean of the set of indicators) to include in regression models, interaction models, and so forth.

Typically the justification for doing so is a reported Cronbach’s alpha coefficient greater than 0.7. I think that, along with reporting alpha, researchers should report the average inter-item correlation, which places alpha in context. Let me explain…

#### What does a high Cronbach’s alpha buy you?^{1}

Cronbach’s alpha measures internal consistency reliability, which is the extent to which a set of indicators intended to measure the same latent construct correlate with each other. The logic comes from classical test theory, where using multiple indicators for a latent construct helps to address random measurement error. The argument is that random error should “zero out” across the indicators, such that the *Measured* score, which is what you actually collected, equals the *True* but unobservable score (\(S_m=S_t\)).

The challenge is that we can never truly eliminate measurement error. Further, because a latent construct—say happiness—is fundamentally unobservable, any indicators we use to measure the construct will not quite capture the entirety of the construct’s conceptual domain. Because measurement validity is not something directly measurable, we can only infer validity, often by using reliability as a proxy.

Why does this matter? Well, our most common metric for calculating reliability and hence inferring validity has a fundamental weakness—Cronbach’s alpha is highly susceptible to inflation **solely by increasing the number of indicators**.

#### What’s the formula for alpha?

\[\alpha = \frac{N\bar{c}}{\bar{v} + (N-1)\bar{c}}\]

In the formula, \(N\) is the number of indicators in the scale for the latent construct, \(\bar{v}\) is the average variance of the indicators, and \(\bar{c}\) is the average covariance between the indicators.

Recall that \(\bar{c}\) is really what we are interested in: assuming that the indicators tap into the same latent construct with equal validity, the indicators should perfectly covary (or in correlation terms, \(\bar{r}=1.0\)). Random, systematic, and conceptual error means that in practice the indicators will not perfectly covary, and so we account for the average covariance among the indicators.
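As a worked sketch, assume alpha takes the form used in the R code that follows. If the indicators are standardized so that \(\bar{v}=1\), the average covariance \(\bar{c}\) becomes the average inter-item correlation \(\bar{r}\), and the expression simplifies:

\[
\alpha = \frac{N\bar{c}}{\bar{v} + (N-1)\bar{c}}
\quad\Longrightarrow\quad
\alpha = \frac{N\bar{r}}{1 + (N-1)\bar{r}} \quad \text{when } \bar{v}=1
\]

Solving the standardized form for \(\bar{r}\) gives the average inter-item correlation implied by a given alpha and number of indicators:

\[
\bar{r} = \frac{\alpha}{N - (N-1)\alpha}
\]

With \(\bar{r}=1.0\), alpha equals 1 for any \(N\), which is the ideal case described above. For any \(\bar{r}\) below 1, the value of alpha depends on \(N\) as well as on \(\bar{r}\).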

#### Why alpha is easily biased

Here’s the kicker though—it’s easy to see that if we hold constant the covariance of the indicators, alpha will go up purely as a function of increasing the *number* of indicators:

```
# Let's set two values for N, one small (three indicators) and one large (nine indicators).
# Let's set the mean variance (v_bar) of the indicators at 1.0, which standardizes the
# indicators so that c_bar is effectively the average inter-item correlation, and then set
# c_bar equal to .4, which would mean that on average the indicators share only about
# 16% of their variance (= .4^2).
n_small <- 3
n_large <- 9
v_bar <- 1.0
c_bar <- .4
n_small.alpha <- (n_small * c_bar)/(v_bar + ((n_small - 1) * c_bar))
n_large.alpha <- (n_large * c_bar)/(v_bar + ((n_large - 1) * c_bar))
```

Let’s take a look at the two values…

`n_small.alpha`

`## [1] 0.6666667`

`n_large.alpha`

`## [1] 0.8571429`

In the small \(N\) case, and under the conventional threshold of .7 to infer sufficient reliability, we would rightly doubt that our indicators exhibit adequate internal consistency reliability.

But in the large \(N\) case, we’re well above the .7 threshold, and would—wrongly—conclude that our scale is highly reliable and hence appropriate to collapse into a mean-scaled score.
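To see the pattern beyond these two cases, here is a minimal sketch that sweeps the number of indicators while holding the average variance and covariance fixed at the same values used above. The helper names `alpha_from_means` and `r_bar_needed` are my own, introduced only for illustration, and `r_bar_needed` assumes standardized indicators (average variance of 1.0).

```r
# Alpha from the number of indicators (n), the average variance (v_bar),
# and the average covariance (c_bar). Same expression as the snippet above.
alpha_from_means <- function(n, v_bar, c_bar) {
  (n * c_bar) / (v_bar + (n - 1) * c_bar)
}

# Average inter-item correlation implied by a given alpha and n,
# assuming standardized indicators (v_bar = 1). Hypothetical helper.
r_bar_needed <- function(alpha, n) {
  alpha / (n - (n - 1) * alpha)
}

# Hold v_bar and c_bar constant and vary only the number of indicators.
n_items <- 2:12
alphas <- alpha_from_means(n_items, v_bar = 1.0, c_bar = 0.4)
print(data.frame(n_items = n_items, alpha = round(alphas, 3)))

# Smallest number of indicators at which alpha reaches the .7 threshold.
print(min(n_items[alphas >= 0.7]))

# Average inter-item correlation needed to reach alpha = .7 at each n.
print(data.frame(n_items = n_items,
                 r_bar_for_.7 = round(r_bar_needed(0.7, n_items), 3)))
```

With an average covariance of .4, adding a single fourth indicator is already enough to cross .7. Going the other way, with nine indicators an average inter-item correlation of only about .21 is sufficient to reach the threshold.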

#### Why is this important?

In both the small \(N\) and large \(N\) scenarios, the amount of variance **not** shared by the indicators is measurement error. Regardless of its source (random, systematic, or conceptual), the greater the measurement error, the greater the chance that an estimated model will produce biased or inconsistent coefficient estimates. Measurement error effectively manifests as endogeneity, and the greater the error, the greater the chance for endogeneity. A high alpha may mask a fundamental weakness in the model, and give a false sense of security that what you measured is what you intended to measure.

#### So what to do about it?

I think there are three things to consider…

1. Remember that the ideal measurement model is a single, perfectly valid and hence perfectly reliable indicator. We never get that in practice, so expanding the list of indicators is a reasonable solution to improve the probability that you are measuring what you hoped to measure. But remember there is a clear law of diminishing returns here: a large number of indicators doesn’t necessarily buy you higher reliability!

2. Ensure the software you are using to calculate alpha provides you with the average inter-item covariance or correlation. Take a look at this value for what it is telling you about the shared variance between the indicators. Remember, the ideal would be an average inter-item correlation of 1.0.

3. Report the average inter-item correlation along with alpha in your research. Along with taking appropriate steps to deal with the potential endogeneity, allow readers to make up their own mind about the reliability of your psychometrically-measured constructs.
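If your software does not report the average inter-item correlation, it is straightforward to compute alongside alpha. The following is a minimal base R sketch, not a definitive implementation. The data are simulated purely for illustration (nine hypothetical indicators built to correlate at about .4 on average), and `items` stands in for whatever matrix or data frame holds your own indicators.

```r
# Simulated data for illustration only: nine indicators that each load on a
# single latent variable so that the population inter-item correlation is .4.
set.seed(42)
n_obs <- 500
n_items <- 9
latent <- rnorm(n_obs)
items <- sapply(seq_len(n_items), function(j) {
  sqrt(0.4) * latent + sqrt(0.6) * rnorm(n_obs)
})
colnames(items) <- paste0("item", seq_len(n_items))

# Covariance and correlation matrices of the indicators.
cov_mat <- cov(items)
cor_mat <- cor(items)

# Average variance, average covariance, and average inter-item correlation.
# lower.tri() picks each off-diagonal pair exactly once.
v_bar <- mean(diag(cov_mat))
c_bar <- mean(cov_mat[lower.tri(cov_mat)])
r_bar <- mean(cor_mat[lower.tri(cor_mat)])

# Alpha from the averages, using the same expression as earlier in the post.
alpha <- (n_items * c_bar) / (v_bar + (n_items - 1) * c_bar)

# Report both values side by side.
print(round(c(alpha = alpha, average_r = r_bar), 3))
```

Reporting the two numbers together lets a reader see how much of a high alpha comes from shared variance and how much comes from the sheer number of indicators.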

^{1} See Spector (1992) for a classic discussion: Spector, P. E. 1992. *Summated Rating Scale Construction*. Newbury Park, CA: Sage.