Null hypothesis testing

· by Brian Anderson · Read in about 9 min · (1848 words) ·

In a funny bit of statistical/science humor, the way we employ null hypothesis testing in most modern research is not what was intended by its originators. The procedure today is really an uncomfortable mashup of two camps that bitterly debated each other.

So for the purposes of this post, I’m going to talk about null hypothesis testing the way that the majority of researchers understand it and use it on a regular basis. This is not necessarily the way it should be, nor the way it was intended to be, but it is the way we (mostly) employ the null hypothesis testing methodology.

Most of the time when we see a hypothesis, it is written something like this:

Hypothesis 1: There is a negative relationship between Drug X and cholesterol count in the blood.

There are lots of ways to write hypotheses, but with the one above, we’re basically saying that we think Drug X lowers cholesterol. Technically, this hypothesis is an alternate hypothesis, that is, what the researcher hopes (wants) to find in the data. We contrast the alternate hypothesis against the null hypothesis. The null hypothesis, again most commonly, is that there is no relationship between the variables; in this case, the null would be that Drug X doesn’t actually lower cholesterol. It doesn’t raise cholesterol either; it just doesn’t affect cholesterol in any way.

For purposes of contrasting the alternate against the null, we start from the assumption that the null hypothesis is the ‘true’ relationship, irrespective of whether it is ‘true’ in the real world (more on that in a bit). Usually, you’ll see the null hypothesis expressed with one of two equations:

\(H_0:\mu_1=\mu_2\)

\(H_0:\beta=0\)

In the first equation, the null hypothesis (H sub 0) is that there is no difference in the mean value of some outcome variable, say cholesterol count in the blood, between two groups (mu sub 1 and mu sub 2). This is the basic null for many types of experiments, where one group receives a treatment, say a new drug, and the other group receives no treatment (the control). The alternate hypothesis is that there is a difference between the groups on the outcome variable, and we would write it with the following equation:

\(H_1:\mu_1\neq\mu_2\)

For both the null and the alternate for the first equation, the mean value varies as a function of the presence of a predictor, or treatment; in this case, taking Drug X. For the null, we are saying that there is no difference in the mean value of cholesterol count in the blood for the group that took Drug X versus the one that didn’t take Drug X; taking Drug X didn’t affect cholesterol. For the alternate, we are saying there is a difference. Technically, the equation doesn’t say which way the difference works for the alternate, but usually our alternate hypothesis is directional: taking the drug lowers cholesterol count in the blood.
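To make the two-group contrast concrete, here is a minimal sketch of one way to test the first null, using a permutation test. This is my choice for illustration, not the only (or the usual) test for this setup. The cholesterol numbers are made up, and the function name `permutation_test` and its parameters are hypothetical. The idea is that if the null is true, the group labels are interchangeable, so we shuffle them many times and ask how often a shuffled difference in means is at least as large as the one we observed.

```python
import random
import statistics


def permutation_test(treated, control, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed difference (treated minus control) and the share of
    label shuffles whose absolute difference is at least as large.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null the labels carry no information, so reshuffle them.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_treated]) - statistics.fmean(pooled[n_treated:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up cholesterol counts (mg/dL), for illustration only.
drug_x = [182, 175, 190, 168, 177, 185, 171, 179, 174, 181]
placebo = [201, 195, 210, 188, 199, 205, 192, 207, 198, 203]

observed_diff, p_value = permutation_test(drug_x, placebo)
print(f"difference in means: {observed_diff:.1f}")
print(f"permutation p-value: {p_value:.4f}")
```

With real data you would more commonly see a t-test here; the permutation version just makes the ‘assume the null is true’ logic visible in the code.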

For the second null equation, consider the standard regression model:

\(y=\alpha+\beta{x}+\epsilon\)

We’re saying with the second null hypothesis equation that we don’t expect a change in the outcome (the Y variable) as X changes (the amount of change represented by the β parameter). In relation to our example, under this null, for every unit increase in Drug X, there is no change in the count of cholesterol in the blood. As such, under the null the β parameter equals zero, or equivalently, the slope of the line that best fits the relationship between X and Y is flat (i.e., zero). The alternate hypothesis, similar to what we wrote in our Hypothesis 1 above, is that the β parameter does not equal zero; a change in Drug X dosage does change cholesterol. We can write the alternate hypothesis with the following equation:

\(H_1:\beta\neq0\)
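As a rough sketch of the regression version, the code below estimates β by ordinary least squares and then checks it against the null that β equals zero, again with a permutation test. The dose and cholesterol values are invented, and the helper names `ols_slope` and `slope_permutation_p_value` are hypothetical. Shuffling the outcome breaks any link between X and Y, which is exactly the world the null describes.

```python
import random
import statistics


def ols_slope(x, y):
    """Least-squares estimate of beta in y = alpha + beta * x + epsilon."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def slope_permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the null that the slope is zero."""
    rng = random.Random(seed)
    observed = ols_slope(x, y)
    shuffled_y = list(y)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling y removes any relationship between x and y.
        rng.shuffle(shuffled_y)
        if abs(ols_slope(x, shuffled_y)) >= abs(observed):
            at_least_as_extreme += 1
    return observed, (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up Drug X dose (mg) and cholesterol count (mg/dL) pairs.
dose = [0, 10, 20, 30, 40, 50, 60, 70]
cholesterol = [205, 201, 196, 198, 190, 187, 185, 180]

beta_hat, p_value = slope_permutation_p_value(dose, cholesterol)
print(f"estimated slope: {beta_hat:.3f} mg/dL per mg of Drug X")
print(f"permutation p-value: {p_value:.4f}")
```

A negative estimated slope with a small p-value is the pattern Hypothesis 1 predicts. Standard regression software would report a t-test on β instead, but the null being tested is the same.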

The two null hypothesis equations basically say the same thing but approach the contrast differently. In each case though, the null is that there is no relationship between the outcome (the Y variable) and the predictor (the X variable). In contrast, in both alternate hypotheses, we’re saying that there is a relationship; taking the drug affects cholesterol count.

Making the Comparison

What we really need though is a way to compare the null hypothesis with the alternate hypothesis in order to determine which hypothesis to reject and which to retain.


—REALLY IMPORTANT POINT—

We do not prove the null hypothesis. We also do not prove the alternate hypothesis. We do not show that the null is ‘right’ and the alternate is ‘wrong’ or vice-versa. The null and the alternate hypotheses take on no meaning outside of the world of the comparison(s) we are making. We created the null hypothesis, not nature, so there is no objectively true null hypothesis for us to ‘prove’.

—END REALLY IMPORTANT POINT—


What we can do though is look for evidence to reject the null. We do that by establishing whether the difference we observed is statistically significant, that is, whether it is larger than we would expect to see if the null were true. This is done usually with a statistical test that generates a p-value, which is the probability of observing a difference at least as large as the one in our data if the null hypothesis were true and chance alone were at work. The most common threshold below which we say there is a statistically significant difference between two groups or in the slope of an effect (the β parameter) is .05, which means that, under the null, a difference this large or larger would turn up by chance in about 1 in 20 studies. A p-value of .5 would be a 1 in 2 probability; a p-value of .001 would be a 1 in 1,000 probability. The lower the p-value then, the less compatible our data are with the null hypothesis.
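For readers who like to see it written out, the two-sided p-value can be expressed as follows. The notation here is mine, not something defined elsewhere in this post: T is the test statistic treated as a random variable, and t sub obs is the value of that statistic computed from our data.

\(p=P(|T|\geq|t_{obs}|\mid{H_0})\)

The part after the vertical bar matters most: the probability is calculated assuming the null is true. It is a statement about the data given the null, not about the probability that the null itself is true.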

If we observe a statistically significant difference, we say that we reject the null hypothesis; in relation to the alternate, we say that we retain the alternate hypothesis. If we observe no statistically significant difference, we say that we failed to reject the null hypothesis; but we offer no statement for the alternate in this case. Again, we never prove either hypothesis. Because we started under the assumption that the null was the ‘true’ relationship, by rejecting the null we found statistically compelling evidence that our starting assumption was incorrect. This is NOT the same thing as saying that the null is false and the alternate is true though! Because we always start a new study on the same relationship with the same null hypothesis—regardless of what we learned about the relationship in the last study—we never ‘prove’ any relationship with null hypothesis testing. For the alternate, we didn’t ‘prove’ anything either; the alternate hypothesis lives to see another day, where we once again look for evidence to show that we can retain the alternate hypothesis.

Drawing the ‘Wrong’ Conclusion – Type I and Type II Error

As with all probabilities, there is the 1 in (whatever) chance that what we are seeing is an aberration/luck/spurious/random occurrence. We’ll explore more about how Type I and Type II errors relate to p-values in another post. Right now though, it’s important to get your mind around what both types of errors mean, and what they mean for null hypothesis testing.

Let’s assume that the null hypothesis is that there is no meaningful difference between the mean value of an observed outcome (cholesterol count in the blood does not vary as a function of taking Drug X). A Type I error with this null hypothesis is a false positive. In a false positive, we observed in our data a significant difference between the two groups as a function of whether a participant took Drug X. In reality though, there is no meaningful difference; the drug doesn’t actually work. Our data, or a mistake by the researcher, led us to reject the null, when in fact the null was ‘true’. In technical terms, we incorrectly rejected a true null hypothesis.

A Type II error, again with the null hypothesis that there is no meaningful difference between the mean value of an observed outcome, is a false negative. With the false negative, in our data we observed no significant difference in cholesterol between those that took Drug X and those that didn’t. But the reality is that there was a difference (Drug X actually affects cholesterol) and we didn’t detect it as statistically significant. Again in technical terms, we failed to reject a false null hypothesis.

Both error types are bad in their own way, because we really want to uncover as close to the ‘true’ relationship as we possibly can. But, all statistical tests and research designs have a probability of Type I and Type II error. We can never escape the possibility of drawing the ‘incorrect’ conclusion. Hopefully though, we’re minimizing that possibility.
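One way to get a feel for both error types is to simulate many studies where we know the ‘truth’ because we set it ourselves. The sketch below is only an illustration under assumptions I picked: a known standard deviation of 20 mg/dL, 30 participants per group, a .05 threshold, a simple two-sample z-test, and a true effect of either zero or a 10 mg/dL drop. The names `z_test_p_value` and `rejection_rate` are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

ALPHA = 0.05        # significance threshold
SIGMA = 20.0        # assumed known standard deviation of cholesterol, mg/dL
N_PER_GROUP = 30    # assumed participants per group


def z_test_p_value(treated, control, sigma=SIGMA):
    """Two-sided p-value for a difference in means with known sigma."""
    standard_error = sigma * math.sqrt(1 / len(treated) + 1 / len(control))
    z = (fmean(treated) - fmean(control)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_difference, n_studies=4000, seed=1):
    """Share of simulated studies that reject the null at ALPHA."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        control = [rng.gauss(200.0, SIGMA) for _ in range(N_PER_GROUP)]
        treated = [rng.gauss(200.0 + true_difference, SIGMA) for _ in range(N_PER_GROUP)]
        if z_test_p_value(treated, control) < ALPHA:
            rejections += 1
    return rejections / n_studies


# The null is true (the drug does nothing): every rejection is a Type I error.
type_1_rate = rejection_rate(true_difference=0.0)

# The null is false (the drug lowers cholesterol by 10 mg/dL):
# every failure to reject is a Type II error.
type_2_rate = 1 - rejection_rate(true_difference=-10.0)

print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

Under these assumptions the Type I rate should land near the .05 threshold, since that is what the threshold controls. The Type II rate depends on the effect size, the noise, and the sample size, so changing `N_PER_GROUP` or the true difference will move it.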

Criticism

Null hypothesis testing is not without its criticism—much of which is fair but with varying degrees of seriousness—but I’ll leave that for another time. There is one criticism though that I happen to agree with and that I think is a pretty big limitation: very rarely, if ever, does the null hypothesis actually occur in the real world.

Among other things, the problem with setting the null equal to no meaningful difference is that it happens in isolation, that is, without reference to what the effect of X on Y is in reality. Depending on your perspective, this makes it easier to reject the null and find support for your alternate hypothesis. For example, you might hypothesize that entrepreneurial intention is positively related to entrepreneurial action, under the null hypothesis that there is no meaningful relationship between intention and action (i.e., \(\beta_{Intention}=0\)). In the real world, the probability is very low that intention does not precede action, so you’re a priori more likely to reject the null and support your hypothesis.

A common solution to this problem is Bayesian inference, in which we make use of prior evidence along with new information to improve our estimate of the effect of X on Y iteratively, but that’s for a different post :)

Despite its strong assumptions and its shortcomings, I’m a supporter of null hypothesis testing and generally teach from this perspective for two reasons, one practical, and one pedagogical. The practical reason is that alternatives to null hypothesis testing are simply not commonly used nor commonly understood, at least right now. The pedagogical reason is that I think null hypothesis logic helps to simplify many applied statistics concepts, particularly for novice applied statisticians. It’s pretty easy to get your mind around testing the relationship between X and Y starting from the assumption that there is no meaningful relationship there, and looking for evidence that rejects this assumption.

Key Takeaway

Null hypothesis testing is all about setting up a contrast. We start from the perspective that there isn’t really a contrast there; the null is that there is no difference/no relationship between X and Y. When we make the contrast we’re trying to determine, based on a pre-determined threshold for the p-value (the significance level, usually .05), whether any difference that we did observe is statistically significant (a difference that large would have a low probability of arising from random chance if the null were true). Remember, we never prove or disprove the null!