P-Value: A Simple Explanation for Non-Statisticians

A p-value is a probability, a number between 0 and 1, calculated after running a statistical test on data. A small p-value (typically < 0.05) means that the observed results would be very unusual if they were due to chance alone.

It is a way of telling whether the results obtained should be taken seriously, based on running the experiment just once.

The goal of the p-value is to make us less vulnerable to being fooled by random variation, while saving us the cost of repeating the experiment an unlimited number of times to account for these random events.

In most health-related studies, a p-value < 0.05 (i.e. less than 5%) is considered low enough to conclude that the observed results would be too unusual to be explained by chance alone if there were truly no effect.
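
As a concrete illustration, here is a minimal sketch (in Python, with made-up numbers) of how a p-value comes out of a standard statistical test, in this case a two-sample t-test comparing a treatment group to a control group:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data: reduction in blood pressure (mmHg) for two groups of 50 patients.
control = rng.normal(loc=0.0, scale=10.0, size=50)     # no true effect in the control group
treatment = rng.normal(loc=4.0, scale=10.0, size=50)   # a true average reduction of 4 mmHg

# The t-test asks: if both groups really had the same mean, how unusual
# would a difference at least this large be?
t_stat, p_value = stats.ttest_ind(treatment, control)

print(f"t statistic: {t_stat:.2f}")
print(f"p-value:     {p_value:.4f}")  # a small value means the data are unusual under "chance only"
```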

Why use a threshold of 0.05?

This 0.05 cutoff is called the level of statistical significance. Keep in mind that there is nothing special about 0.05. In particle physics, for example, the threshold for declaring statistical significance is 0.0000003!
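
For the curious, that physics figure comes from the "5-sigma" convention: the probability of landing at least five standard deviations into the tail of a normal distribution. A quick sketch of the calculation:

```python
from scipy import stats

# One-sided tail probability beyond 5 standard deviations of a standard normal distribution.
five_sigma = stats.norm.sf(5)  # survival function, i.e. 1 - CDF
print(f"5-sigma threshold: {five_sigma:.7f}")  # ~0.0000003
```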

The level of significance must be chosen in the design phase of the study, that is, before looking at the data and running any statistical test.

Once chosen, we cannot change our minds!

If we select a threshold of 0.05, then after running our statistical analysis and according to the p-value obtained, we interpret the results as follows:

  • If p-value ≤ 0.05: the result is statistically significant (our data would be unusual if there were no true effect)
  • If p-value > 0.05: the result is not statistically significant (our data would not be unusual if there were no true effect)

Note that for our results to be meaningful, they must also be practically significant. A treatment that reduces the 10-year risk of heart attack from 2% to 1.99% will not have any practical advantage with such a small effect, although it might be statistically significant (i.e. associated with a p-value < 0.05).

BOTTOM LINE:

Not all statistically significant results will be practically significant as their effect might be too small to have any real-world consequences.
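
To make the heart-attack example above concrete, here is a hedged sketch (Python, entirely made-up counts): with a large enough sample, even a negligible difference such as 2% vs. 1.99% can come out statistically significant.

```python
import math
from scipy import stats

# Hypothetical trial: 10-year heart-attack risk of 2% (control) vs 1.99% (treatment),
# observed in two groups of 20 million people each (a deliberately enormous sample).
n_control, n_treatment = 20_000_000, 20_000_000
events_control = int(0.02 * n_control)        # 400,000 heart attacks
events_treatment = int(0.0199 * n_treatment)  # 398,000 heart attacks

p1 = events_control / n_control
p2 = events_treatment / n_treatment

# Standard two-proportion z-test.
p_pooled = (events_control + events_treatment) / (n_control + n_treatment)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treatment))
z = (p1 - p2) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value

print(f"risk difference: {p1 - p2:.4%}")  # 0.01 percentage points: practically negligible
print(f"p-value:         {p_value:.4f}")  # below 0.05: statistically "significant"
```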

Common misinterpretations of p-values

Misinterpretation #1: A p-value of 0.04 means that there is a 4% probability that chance alone can explain our results

A p-value is NOT the probability that a given hypothesis is true or false. Instead, it measures how consistent our data are with the hypothesis that there is no effect. If you want to calculate the probability that a theory is correct, you need Bayesian statistics, not p-values.
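
To see the contrast, here is a toy Bayesian update (all numbers are made up) showing how the probability that a hypothesis is true is obtained from Bayes' theorem, which is a different quantity from a p-value:

```python
# Toy Bayesian update with made-up numbers, only to contrast with a p-value.
prior_effect = 0.10          # prior probability that the treatment really works
p_data_if_effect = 0.80      # probability of data like ours if the treatment works
p_data_if_no_effect = 0.05   # probability of data like ours if it does not

# Bayes' theorem: P(effect | data) = P(data | effect) * P(effect) / P(data)
p_data = p_data_if_effect * prior_effect + p_data_if_no_effect * (1 - prior_effect)
posterior_effect = p_data_if_effect * prior_effect / p_data

print(f"posterior probability that the effect is real: {posterior_effect:.2f}")  # 0.64
```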

Misinterpretation #2: A large p-value (> 0.05) means that there is no effect; A small p-value (< 0.05) means that there is an effect

If we choose a statistical significance level of 0.05: A p-value < 0.05 tells us that if there were truly no effect, results at least as extreme as ours would occur in less than 5% of repeated experiments. In other words, a procedure that declares an effect whenever p < 0.05 will falsely claim an effect in at most 5% of the experiments where no effect actually exists. A p-value > 0.05 does not provide evidence of no effect; it simply means that randomness or chance cannot be ruled out as an explanation of our results.
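
A small simulation makes the second point concrete. Below is a sketch (Python, made-up parameters): even when a real effect exists, small underpowered studies frequently produce p-values above 0.05, which is why a large p-value is not evidence of no effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
n_per_group = 20      # a small, underpowered study
true_effect = 5.0     # a real effect does exist
noise_sd = 10.0

non_significant = 0
for _ in range(n_experiments):
    control = rng.normal(0.0, noise_sd, n_per_group)
    treatment = rng.normal(true_effect, noise_sd, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    if p > 0.05:
        non_significant += 1

# Despite a genuine effect, a large share of experiments (roughly two-thirds
# with these parameters) come out "not significant".
print(f"fraction of p-values > 0.05: {non_significant / n_experiments:.2f}")
```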

Misinterpretation #3: A p-value of 0.001 means that the effect is stronger than with a p-value of 0.03

A p-value says nothing about the size of the effect. Note also that p-values are sensitive to the size of the sample you are working with: all other things held constant, a larger study will yield smaller p-values. This, however, does not change the practical significance of the results.
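
Here is a minimal sketch of that sensitivity (Python, hypothetical numbers): the underlying effect is identical in every study, yet the p-value shrinks as the sample grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 1.0   # the same small effect in every study
noise_sd = 10.0

for n in (50, 500, 5_000, 50_000):
    control = rng.normal(0.0, noise_sd, n)
    treatment = rng.normal(true_effect, noise_sd, n)
    _, p = stats.ttest_ind(treatment, control)
    # The effect size never changes; only the sample size (and hence the p-value) does.
    print(f"n per group = {n:>6}, p-value = {p:.4f}")
```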

Misinterpretation #4: Blind confidence in statistically significant results (believing all results that have p-value < 0.05)

In fact, if we examine the relationship between any 2 unrelated random variables, we have a 5% chance of getting a p-value < 0.05. This means that if we run 20 statistical tests on unrelated variables, on average 1 of them will be statistically significant by chance alone.
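
A quick simulation (a sketch with purely random, unrelated data) illustrates this: testing 20 pairs of unrelated variables typically turns up about one "significant" correlation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

significant = 0
for _ in range(20):
    # Two completely unrelated random variables: any apparent relationship is pure chance.
    x = rng.normal(size=100)
    y = rng.normal(size=100)
    _, p = stats.pearsonr(x, y)
    if p < 0.05:
        significant += 1

print(f"'significant' correlations out of 20 unrelated pairs: {significant}")  # expect about 1 by chance
```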

So how can we be sure that the researchers who conducted the study didn’t run many tests in order to increase their chances of getting a statistically significant result?

The answer is that we cannot rule out multiple testing.

This, among many other reasons, such as publication bias and conflicts of interest, is why we must be very conservative when we interpret p-values.

Conclusion

The biggest problem with p-values is that they are often misinterpreted.

When reading a scientific article, many people save time by skipping the methods and results sections and going straight to the p-values to see which effects were significant. Just remember that no single number can summarize a study, its design, methodology, and biases.

Finally, there is no harm in using p-values, but we should certainly learn to interpret them correctly and be aware of their limitations.
