Here’s an interesting take from a few days ago on the American Statistical Association’s statement on the use (and misuse) of p-values that was published last year. I’m certainly in the camp that p-values are more often than not misunderstood and misapplied in published studies, but the challenge I’ve found has been to communicate the myriad assumptions made when employing the p < .05 standard, and how shaky research that deviates from those assumptions can be.
The p-value is the probability of observing an effect as large as, or larger than, the one actually observed, assuming that the null hypothesis is true. Generally, in the null hypothesis testing framework, the assumption made is that the effect is zero. Not merely statistically indistinguishable from zero, but actually zero. Therein lies one of the many problems with p-values: very rarely, if ever, would we expect an effect of exactly zero in social science research. Our constructs are too noisy, and our theoretical explanations too loose, to reasonably expect an effect of zero.
So given that the null itself isn’t likely to be true, how do we reconcile the p < .05 standard? Well, the best option is to be a Bayesian, but if you have to retain a frequentist perspective, here is one explanation I use with my doctoral students.
The irony of p-values is that the more likely the null hypothesis is to be true, the less a statistically significant rejection of it can be trusted. You can think of it conceptually like a classic conditional probability, Pr(y|x): what is the probability that a study reporting a statistically significant rejection of the null hypothesis is accurate, given the probability that the null hypothesis is true?
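To make that framing concrete, here’s a minimal sketch in Python of the Bayes’ rule arithmetic behind that question. The function name, the default threshold, and the assumed study power are my own illustrative choices, not anything taken from the ASA statement or the studies being evaluated.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.80):
    """Pr(null is true | the study reports a statistically significant result).

    prior_null -- your judgement of how probable the null hypothesis is
                  before seeing the study
    alpha      -- chance of a 'significant' result when the null is true
                  (the significance threshold)
    power      -- chance of a 'significant' result when the null is false
                  (assumed here for illustration)
    """
    # Total probability of a significant result, from either state of the world
    p_significant = alpha * prior_null + power * (1 - prior_null)
    # Bayes' rule: the share of significant results that come from a true null
    return alpha * prior_null / p_significant
```

Notice that nothing in the reported p-value supplies prior_null; that number has to come from your own judgement.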
Now, to be clear, the p-value doesn’t, and can’t, say anything about the probability that the null hypothesis is true. What I’m talking about is the researcher using his or her own judgement, based on prior work, theory, and deductive reasoning, about just how likely it is that the null hypothesis is true in real life. For example, in EO (entrepreneurial orientation) research, the null hypothesis would be that entrepreneurial firms enjoy no performance advantage over conservatively managed firms. We could put the probability of the null hypothesis being true at about 10%: it’s difficult to imagine a meaningful context in which it doesn’t pay to be entrepreneurial, but it’s possible.
So how does that inform evaluating EO research reporting a p < .05 standard? Let’s imagine a study reports an effect of EO on firm growth at p = .01. Under a conventional interpretation, we would say that, if the null hypothesis were true, there would be only a 1 in 100 chance of observing an effect this large or larger. But real life and our judgement say that the null hypothesis has little chance of being true (10% in our example). In this case, the p-value actually works pretty well, although it’s not very valuable. The 1 in 100 chance of seeing this result under the null points in the same direction as our prior judgement that there was only a 1 in 10 chance that being entrepreneurial doesn’t help the firm grow. It’s not valuable in the sense that it’s told us something that we already knew (or guessed) to be true in the real world.
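Running the EO example through the hypothetical helper sketched above, treating the reported p = .01 as the threshold and assuming 80% power (both of these are my assumptions, not figures from any actual study), the arithmetic backs up the intuition:

```python
# Null is improbable to begin with (10% prior): the significant result mostly
# confirms what we already believed.
prob_null_given_significant(prior_null=0.10, alpha=0.01, power=0.80)
# -> roughly 0.0014, i.e. almost certainly a real effect, but hardly news
```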
But what about the case where a study reports p = .01, so still a 1 in 100 probability under the null, but the likelihood of the null hypothesis itself being true is high, say 95%? In other words, the likelihood of the effect being real is very small. The best discussion of the exact probability breakdown is Sellke et al. (2001), and here’s a non-paywalled discussion of the same concept. In this case the p-value can be downright dangerous. Even when the null and the alternative start out equally probable, Sellke and colleagues show that the chance a rejection at p = .01 is a false alarm is over 10%, and at p = .05 it is almost 30%; when the null is as probable as 95%, the risk of incorrectly rejecting it is considerably higher still. In short, the more probable the null, the more likely it is that ‘statistically significant evidence’ in favor of its rejection is flawed.
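For readers who want the calibration itself, here’s a quick sketch of the Sellke et al. (2001) bound (my own coding of it, so treat the implementation as illustrative rather than definitive). With even prior odds it reproduces the figures quoted above, and with a 95% probable null the picture gets much worse:

```python
import math

def sellke_lower_bound(p, prior_null=0.5):
    """Lower bound on Pr(null is true | p-value), per Sellke, Bayarri & Berger (2001).

    Uses their bound on the Bayes factor in favor of the null,
    -e * p * ln(p), which holds for p < 1/e.
    """
    if not 0 < p < 1 / math.e:
        raise ValueError("bound applies only for 0 < p < 1/e")
    bf_null = -math.e * p * math.log(p)          # bound on the Bayes factor for the null
    prior_odds = prior_null / (1 - prior_null)   # prior odds in favor of the null
    posterior_odds = prior_odds * bf_null        # lower bound on the posterior odds
    return posterior_odds / (1 + posterior_odds)

# Even prior odds reproduce the figures in the text:
#   sellke_lower_bound(0.05)                   -> ~0.29  ("almost 30%")
#   sellke_lower_bound(0.01)                   -> ~0.11  ("over 10%")
# A highly probable null (95%) is far worse:
#   sellke_lower_bound(0.01, prior_null=0.95)  -> ~0.70
```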
The bottom line is that there is no substitute for using your own judgement when evaluating a study. Ask yourself just how likely the null hypothesis is to be true, particularly when evaluating research purporting to offer ‘surprising’, ‘novel’, and ‘counterintuitive’ findings. You might find that the author’s statistically significant novel finding is itself likely to be a random variation, or, as Andrew Gelman might say, that the difference between ‘significant’ and ‘not significant’ is not itself statistically significant.