There are a lot of ways to judge the usefulness of a paper. When I read an empirical paper (including as an editor and reviewer), I always start with the research design and methodology section. I start here because this is how I make an initial assessment of how much confidence to place in a set of model results. But it also gives me a sense of the usefulness of the paper’s conclusions. I judge usefulness by the likelihood that another researcher (maybe me) could use the paper’s findings to inform a future study. That casts a wide net, but it also means that the findings have to be meaningful.

Admittedly, it’s hard to define what makes a result meaningful. Size doesn’t matter here: assuming a high-quality design with reliable measures, an effect size of zero, small, medium, or large can all be meaningful. But I do think it’s easy to define results that are trivial, and unfortunately, we see a lot of these papers in strategy and entrepreneurship.

**Directional Hypotheses**

I’ve written a lot of these (though I don’t anymore). They are by far the most common way to specify a hypothesis in management, and they go something like this:

\(x\) positively relates to \(y\).

This is really an alternative hypothesis: a statement of how we expect \(x\) to relate to \(y\). Ironically, null hypothesis significance testing does not actually evaluate this hypothesis, despite being the most common approach in our literature.^{1} Implied in the statement is the notion of statistical significance, where we are looking for evidence that the effect of \(x\) on \(y\) is statistically different from zero. But that’s not what we actually said. All we said was that \(x\) positively relates to \(y\). That is a really, really, really low bar to meet.

Taking a step back, an effect size (\(\beta\)) of .001, .01, .1, .5, or any other positive non-zero value supports the hypothesis. If the best theoretical prediction we can make is that \(x\) relates positively to \(y\), then the theory itself is not all that useful. A more useful theory gives us insight into the magnitude of the expected effect. For example, we might say that experiencing a recession will increase a person’s interest in being an entrepreneur, but because there are other, more important factors impacting interest (capital availability, opportunity recognition, and so forth), the effect of the recession itself is small; perhaps on the order of 1/10th of a standard deviation change in intentions.
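To make this concrete, here is a minimal simulation in Python. The true slope of .01 is an assumed value chosen purely for illustration: a directional hypothesis is “supported” by this trivially small effect just as well as it would be by a large one, even though \(x\) explains essentially none of the variance in \(y\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1_000_000

# Simulate a trivially small true effect: beta = .01 (assumed for illustration)
x = rng.standard_normal(n)
y = 0.01 * x + rng.standard_normal(n)

result = stats.linregress(x, y)
print(f"slope   = {result.slope:.4f}")    # close to .01, and positive
print(f"p-value = {result.pvalue:.1e}")   # far below .05
print(f"R^2     = {result.rvalue**2:.6f}")  # essentially zero variance explained
```

The directional hypothesis “\(x\) positively relates to \(y\)” passes with flying colors here, which is exactly the problem: the test says nothing about whether the effect is big enough to matter.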

The key point, though, is that directional hypotheses lower the bar for finding evidential support, and the lower the bar, the less useful the study’s results are from the outset.

**Unlikely Null Hypothesis**

With the directional hypothesis specified, the researcher turns to evaluating whether the effect of \(x\) on \(y\) is *statistically* different from zero. To do that, he or she uses a p-value in a frequentist frame, typically with a threshold of p < .05. We interpret the p-value as a less than 1 in 20 chance that we would have observed the effect size we did, or a larger one, assuming the null hypothesis is true. We all too often forget that last criterion, and it has been the bane of science for decades. To correctly draw inference from the p-value, we have to assume that the null hypothesis is the “true” hypothesis:

The effect of \(x\) on \(y\) is zero.

The zero is important; not statistically different from zero, but **actually** zero. That’s a really, really, really high bar. There are some fields (drug testing, for example) with a plausible null hypothesis. In management research, the opposite is the case; the null hypothesis is highly unlikely to be true. Returning to our recession example, we would have to assume that, in reality, a recession does not change an individual’s attitude towards entrepreneurship **at all**. Is it possible that for some people (say, tenured professors) experiencing a recession has absolutely no effect on entrepreneurial intentions? Of course. Is it likely that a zero effect predominates in the population? No. But we have to treat the null as *true* for the p-value to be useful.

Where researchers run into problems is that the less likely the null hypothesis is to begin with, the more likely the researcher is to reject it with a p-value that crosses the p < .05 threshold. When you layer on p-hacking, multiple comparisons, and a 1 in 20 chance of crossing p < .05 by random luck, the odds are in the researcher’s favor of finding a “statistically significant” result.
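The multiple-comparisons point is easy to demonstrate. In this sketch (the setup is assumed for illustration: 2,000 hypothetical studies, each running 20 independent tests where the null is actually true), roughly 1 − .95²⁰ ≈ 64% of studies turn up at least one “significant” result by chance alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_tests, n_obs = 2_000, 20, 30

# Every test below has a TRUE null: all data are pure noise, mean zero
data = rng.standard_normal((n_studies, n_tests, n_obs))
pvals = stats.ttest_1samp(data, popmean=0, axis=2).pvalue  # shape: (studies, tests)

# Share of studies where at least one of the 20 tests crosses p < .05
rate = (pvals < 0.05).any(axis=1).mean()
print(f"Studies with >= 1 'significant' result: {rate:.2f}")  # close to .64
```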

So before we’ve even collected any data or run a model, by specifying only a directional hypothesis with a null hypothesis that is highly unlikely to be true in reality, we’ve set ourselves up to cross the p < .05 threshold but with trivial results.

**Small Effect Size + Large Enough Dataset**

These factors go together, and to see the impact, we need to appreciate the role of sample size in determining the p-value.

- As sample size goes UP, the standard error goes DOWN
- As standard error goes DOWN, the t-statistic goes UP (\(t = \beta\) / standard error)
- As the t-statistic goes UP, the p-value goes DOWN.

This means that with a large enough sample size, trivial (even minuscule) effect size estimates can yield p-values well below .05.

This reality, among others, is why the American Statistical Association issued its first policy guidance on a matter of statistical practice, warning researchers not to infer anything about the strength, size, utility, or meaningfulness of an effect from the p-value.

Fortunately, it is much easier today to get large (very large) datasets. This is a very good thing, because more data is always better. But failing to recognize the role that sample size plays in the calculation of the p-value is a recipe for trivial results. Even if the effect is statistically significant at the p < .001 level, what is the use if the change in \(y\) for a unit change in \(x\) is so small that it is more likely to be: a) an artifact of the null hypothesis being highly unlikely in the first place; b) a function of measurement error; c) a deviation from normality; d) endogeneity between \(x\) and \(y\); and/or e) sampling variation, than it is to represent a true population effect?

**A way out?**

I’ve converted fully to Bayesian inference, and I’m in the camp that holds that null hypothesis significance testing in management, strategy, and entrepreneurship research does more harm than good in producing useful insights that impact theory and practice. But while frequentist statistics remain more common than Bayesian methods, it’s helpful to remember the formula…

\[\text{Directional Hypothesis + Unlikely Null Hypothesis + Small Effect Size + Large Enough Dataset = Trivial Insights}\]

Fortunately, Bayesian inference does allow us to test this hypothesis directly, which is one of the many reasons why it is far more useful to management researchers.
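As a sketch of what that looks like in practice, the conjugate normal-normal model below asks two questions the p-value cannot: how probable is it that the effect is positive, and how probable is it that the effect is large enough to matter? Every number here is assumed purely for illustration: a hypothetical estimate of .02 with a standard error of .005, a weakly informative prior, and a smallest-meaningful-effect threshold of .1.

```python
import numpy as np
from scipy import stats

# Assumed inputs for illustration (not from any real study):
prior_mean, prior_sd = 0.0, 0.5   # weakly informative prior on beta
beta_hat, se = 0.02, 0.005        # hypothetical estimate and standard error
threshold = 0.10                  # assumed smallest "meaningful" effect size

# Conjugate normal-normal update: posterior precision = sum of precisions,
# posterior mean = precision-weighted average of prior mean and estimate
post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
post_mean = post_var * (prior_mean / prior_sd**2 + beta_hat / se**2)
post_sd = np.sqrt(post_var)

p_positive = 1.0 - stats.norm.cdf(0.0, loc=post_mean, scale=post_sd)
p_meaningful = 1.0 - stats.norm.cdf(threshold, loc=post_mean, scale=post_sd)

print(f"P(beta > 0)   = {p_positive:.3f}")    # near 1: almost surely positive
print(f"P(beta > .10) = {p_meaningful:.3f}")  # near 0: almost surely trivial
```

With these assumed numbers, the effect is almost certainly positive and almost certainly too small to be meaningful: exactly the distinction a directional hypothesis plus a p-value glosses over.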