I recently received a review on a paper of mine where the reviewer made this comment:

We make simplifying assumptions all the time in our models that do not match reality…Most of the time, we arrive at correct conclusions.

The reviewer was pushing back on an argument my coauthors and I made in the paper: simplified measurement models facilitate theoretical precision, improve reproducibility, reduce noise, and are easier to model. When comparing two measures of effectively the same construct, assuming both demonstrably tap into the same conceptual domain, the simpler measure and measurement model is the better choice.

The reviewer expanded to say that, effectively, even if the existing measure is really noisy and messier to model, we shouldn’t replace it unless the new measure makes a ‘theoretical contribution’. Grrr.

I couldn’t disagree more with his or her perspective, for three reasons.

- False confidence in null hypothesis testing

What bugged me most is the notion of a ‘correct conclusion.’ That’s simply not the world we operate in: in the social sciences, truth is fundamentally unknowable, and the best that we can hope for is to derive a reasonable prediction in a real-world context with substantial variance and uncertainty. We just don’t have ‘correct conclusions’; we have best guesses.

In the context of null hypothesis testing, this is where the fallacy of treating rejection of the null hypothesis as evidence of the ‘truth’ of the alternative hypothesis is so dangerous. Rejecting the null does *not* prove the alternative. A statistically significant result is simply a statistically derived, sample-specific metric **giving** the probability of observing an effect of that size or larger by random chance, **assuming that the null hypothesis of no relationship is true**. The p-value only has its stated meaning *if* there is no underlying relationship between \(x\) and \(y\) in the real world. ‘Correct conclusions,’ in the sense that we know, unequivocally, the nature and size of the effect of \(x\) on \(y\), are simply not something that we can *ever* establish.
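To make this concrete, here is a minimal simulation sketch (the sample size, seed, and number of simulations are illustrative assumptions, not from the paper). When the null is true by construction, roughly 5% of regressions still come out “significant” at p < .05 by chance alone:

```python
# Hypothetical sketch: x and y are independent noise, so the null of "no
# relationship" is true by construction. About 5% of regressions will
# nonetheless produce a "significant" slope.
import numpy as np

rng = np.random.default_rng(42)
n, n_sims = 100, 5000
t_crit = 1.984  # two-sided 5% critical value of t with df = n - 2 = 98
hits = 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    y = rng.normal(size=n)  # pure noise: no true relationship with x
    # OLS slope, residual error, and standard error computed by hand
    b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    resid = y - (np.mean(y) - b * np.mean(x)) - b * x
    s_e = np.sqrt(np.sum(resid**2) / (n - 2))
    se_b = s_e / np.sqrt(np.var(x, ddof=1) * (n - 1))
    hits += abs(b / se_b) > t_crit

fp_rate = hits / n_sims
print(f"'Significant' results under a true null: {fp_rate:.3f}")  # ≈ 0.05
```

Every one of those “hits” would look like a publishable finding, and every one is noise.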

- False confidence that statistically significant results in small samples mean that the effect is “robust”

Others have written on this much better than I can, but the critical issue is this: while we often think about power as being important to identifying small effects, the fallacy is in thinking the inverse is also true—that identifying an effect in a small (and hence noisy) sample is evidence that the effect is really strong or really robust. That’s simply not the case. In small, noisy samples, it is often *more* likely that the observed effect is inflated by noise than that it reflects a true underlying effect. Because most of our research involves relatively modest sample sizes, we should be particularly skeptical that we’ve identified “correct conclusions” about the relationship between \(x\) and \(y\).

- Misunderstanding the relationship between standard error and sample size

What’s not often appreciated is that, for most models, there is a simple mathematical benefit to lowering the standard error of a measure…

\(se_{\beta}=\frac{s_e}{\sqrt{s_x^2\,(n-1)}}\)

The standard error of \(\beta\) is a function of the residual error \(s_e\), the variance of the predictor \(s_x^2\), and the sample size \(n\). Increase the sample while holding the error terms constant, and the standard error goes down.

But lower the error in the measure and the payoff is outsized: the standard error shrinks in direct proportion to \(s_e\), while it shrinks only with the *square root* of \(n\). Halving \(s_e\) halves the standard error, a gain equivalent to quadrupling the sample size.
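The tradeoff can be checked directly from the formula above (the specific values of \(s_e\), \(s_x\), and \(n\) here are just illustrative):

```python
# Sketch of the standard-error formula: halving the residual error s_e cuts
# se_beta in half, the same gain as quadrupling the sample size n.
import math

def se_beta(s_e, s_x, n):
    """se_beta = s_e / sqrt(s_x^2 * (n - 1))"""
    return s_e / math.sqrt(s_x**2 * (n - 1))

base       = se_beta(s_e=1.0, s_x=1.0, n=101)  # 0.100
half_error = se_beta(s_e=0.5, s_x=1.0, n=101)  # 0.050: same precision...
quad_n     = se_beta(s_e=1.0, s_x=1.0, n=401)  # 0.050: ...as 4x the sample
```

Cleaning up a measure is usually far cheaper than quadrupling a sample, which is exactly why less noisy measures are worth pursuing in their own right.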

Given that an abundance of large samples isn’t a problem we usually suffer from, any effort to reduce measurement error in our commonly used measures **should absolutely be** a priority for our field. The benefits of better (less noisy) measures are just too valuable to sacrifice to the desire to make a “theoretical contribution.”

I’ve written before about my desire to relax the expectation to make a theoretical contribution in every paper. I think this is especially true when evaluating new measures and measurement models. Now, I do think it’s fair game to ask that a new measure for an existing construct should demonstrate that it does something that a previous measure couldn’t do, in the sense that the new measure (among other possibilities)…

- Has a lower standard error in repeated sampling; or
- Is easier to instrument in instrumental variable/2SLS models; or
- Allows researchers to better identify antecedent relationships; or
- Allows researchers to ask questions that existing measures struggle to address

Now, are these things a theoretical contribution? Well, in the sense that a theoretical contribution is largely in the eye of the beholder, then sure, they could be considered as such. To me, though, the value of less noisy measures is that we are less likely, all other things being equal, to observe spurious and/or inflated relationships. It’s simply too easy to observe statistically significant results with noisy measures in the noisy data that entrepreneurship and management scholars generally use. This is what is so irritating about the reviewer’s comment—the idea that despite the noise, we’re generally going to find a “correct conclusion.” That “correct conclusion” could just as easily be spurious, a function of researcher degrees of freedom, as it could be right.

We need to change the mindset that *despite* the simplification and equifinality in data analyses we eventually come to a “correct conclusion.” That’s more faith in social science than I’m willing to extend, and in the end, this mindset can stifle the desire to push the field forward with more rigorous research designs and better measures. A little more humility in our models would do the field some good.