Scientific research often boils down to a p-value; p-values are replete within medical journals, conferences, and scientific discussions. Importantly, research evidence interpreted using the p guides policy and practice.
What is a p-value and what does it mean?
The p stands for the probability of observing a difference at least as large as the one found, assuming that chance alone is at work. This assumption is classically called ‘the null hypothesis’. Like any probability, p varies between 0 and 1. Values closer to 0 indicate that the observed effect would be unlikely if the null hypothesis were true, while values close to 1 indicate that the data are entirely consistent with the null hypothesis. A high p does not prove that no difference exists.
Significance testing was first introduced by R.A. Fisher in the 1920s [1]. His idea was predicated on a desire to understand whether an association found between two phenomena was due purely to chance. For example, suppose we measure the weight of 5 students in a rigorous physical education class and compare it with the weight of the rest of the school. Our results show a mean weight that is 10 kg lower in the rigorous class. Is this observation the result of a randomly sampled group that happens to have a lower weight but is in fact the same as the rest of the school? Or is this a distinct group that reflects the fruits of the teachers’ labour? Fisher saw p as a measure of the strength of evidence against the null hypothesis and introduced a threshold of 0.05, below which the evidence was deemed sufficient to reject the null hypothesis [2]. To continue with our example, let’s say that the 10 kg difference was accompanied by a p of 0.02. We would interpret this as a 2% probability of observing a difference of 10 kg or more purely due to chance, assuming that there is indeed no difference between the rigorous group and the rest of the school (i.e. the null is true). It is certainly possible to obtain a difference of 10 kg between our athletic group and the school mean due to chance alone, but it would be unlikely. If the difference is due to chance, then repeating the experiment with a different set of 5 students from the rigorous class would be expected to show a mean close to the school’s.
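To make the interpretation of p concrete, here is a minimal sketch of an exact permutation test for the weight example. The weights are hypothetical (the example above gives no raw data), and the permutation test is only one way to obtain a p; it is not necessarily the test Fisher would have applied. The function counts how many of all possible groups of 5 students would show a difference at least as low as the observed one if group membership were arbitrary, i.e. if the null were true.

```python
from itertools import combinations


def mean_difference(group_sum, total_sum, k, n):
    """Mean of the selected group minus the mean of everyone else."""
    return group_sum / k - (total_sum - group_sum) / (n - k)


def permutation_p_value(group, rest):
    """One-sided exact permutation p-value.

    Share of all possible groups of the same size whose mean difference
    is at least as low as the observed one, assuming labels are arbitrary.
    """
    pooled = list(group) + list(rest)
    n, k, total = len(pooled), len(group), sum(pooled)
    observed = mean_difference(sum(group), total, k, n)
    as_extreme = 0
    n_groups = 0
    for candidate in combinations(pooled, k):
        n_groups += 1
        # Small tolerance guards against floating-point ties.
        if mean_difference(sum(candidate), total, k, n) <= observed + 1e-12:
            as_extreme += 1
    return as_extreme / n_groups


# Hypothetical weights in kg, chosen so the class mean is 10 kg lower.
rigorous_class = [44, 52, 49, 58, 47]
rest_of_school = [55, 61, 48, 66, 59, 63, 57, 70, 52, 64, 60, 58, 67, 53, 67]

p = permutation_p_value(rigorous_class, rest_of_school)
print(f"one-sided p = {p:.4f}")
```

The returned value reads exactly like the definition above: the proportion of chance groupings that produce a difference at least as extreme as the one observed.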
Fisher’s arbitrary use of 0.05 to assess the degree of evidence against the null prompted his contemporaries, Neyman and Pearson, to introduce alternative hypotheses to the mix [2]. Technical and philosophical differences between these two approaches have been debated for close to 80 years and it is unlikely that they will be resolved here. Nonetheless, let’s briefly discuss what hypothesis testing adds. In Neyman and Pearson’s view, two hypotheses are put forward, a null and an alternate (Figure 1). The hypothesis test is used to reject one and accept the other using a predetermined critical value on the same scale as the p value, named alpha. With the introduction of an alternate hypothesis, we are now able to elucidate the two types of errors that can be made in interpreting results, type I and type II errors [2]. A type I error is rejecting the null hypothesis when it is in fact true, whereas a type II error is accepting the null hypothesis when it is in fact false. Traditionally, with a threshold alpha of 0.05, the type I error rate is 5%; with a power of 80%, the type II error rate is 20%.
Figure 1. Null and alternate hypotheses
Common misinterpretations of the p
p values are not the probability of error in rejecting the null hypothesis. From our example above, this misreading would say that there is a 2% chance that rejecting the null hypothesis is an error, i.e. a 2% chance that the null is true. In fact, what our results show is that the probability of obtaining at least the observed difference purely due to random error is 2%, assuming the null hypothesis is true. From a theoretical perspective, p-values are calculated under the assumption that the null hypothesis is 100% true, so they cannot tell us whether the null is true or not. Indeed, actual error rates are quite different and tend to be much higher. A study by Sellke and colleagues found error rates to be at least 23% for p values of 0.05 and at least 7% for p values of 0.02 [3]. The specifics of the methods used to derive this are discussed here.
p values do not speak to the truth of the observed effect. The above error rates also highlight why it is imperative to interpret p values correctly; low p values do not guarantee that the observed difference is true. p values cannot discern between whether 1) the null is true but we’ve by chance encountered an unusual sample or 2) the null is false. To determine scientific truth, replication studies are paramount. In fact, a recent study found that only 36% of replication studies are themselves statistically significant! The authors also found that the p value in the original study predicts whether the results will be replicated. Not surprisingly, lower p values in the index study were associated with a greater likelihood of replication. Specifically, the authors found that 63% of studies with p < 0.001 were replicated, compared with only 41% with an original p < 0.02, 26% with 0.02 < p < 0.04, and, more astonishing still, a mere 18% with p > 0.04 [4].
p values do not speak to the size or clinical importance of an effect. In fact, large sample sizes can lead to extremely small p values for observed differences that are minute. Clinically important differences should be established a priori for meaningful and interpretable results.
Thresholds for significance need to be modified with multiple testing. A threshold of 0.05 accepts a 5% probability of obtaining a positive finding by chance in a single test. The probability of obtaining at least 1 positive finding by chance across all tests, called the family-wise error rate, increases as more tests are conducted without adjusting for multiple comparisons.
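How quickly the family-wise error rate grows can be sketched with a short calculation. This assumes the tests are independent and every null hypothesis is true, in which case the rate is 1 - (1 - alpha)^m for m tests. The Bonferroni adjustment shown (alpha divided by the number of tests) is one common correction, included here only as an illustration; the function names are ours.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


for m in (1, 5, 10, 20):
    unadjusted = family_wise_error_rate(0.05, m)
    adjusted = family_wise_error_rate(bonferroni_threshold(0.05, m), m)
    print(f"{m:>2} tests: unadjusted {unadjusted:.3f}, Bonferroni {adjusted:.3f}")
```

With 10 unadjusted tests the chance of at least one false positive is already about 40%, and with 20 tests it is about 64%.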
Several alternatives and supplements have been proposed to aid in the interpretation of data. Understandably, their utility varies based on the research question. The list below is by no means exhaustive.
Effect sizes: Statistical significance is dependent on effect size, sample variability, and sample size. For this reason, very large studies can often obtain a p value below the critical threshold for small effects and thereby be considered ‘significant’. This is true unless there is truly no effect, i.e. the effect size is zero [5]. Effect sizes themselves can therefore often be a better gauge of a meaningful result than the p.
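The dependence of p on sample size can be illustrated with a rough sketch. It assumes a two-sample z-test with a known, common standard deviation, which is a simplification chosen for clarity, and the numbers (a 0.5 kg difference with a 10 kg standard deviation) are hypothetical. The effect size stays fixed while only the number of participants per group changes.

```python
import math


def two_sided_p_value(mean_diff, sd, n_per_group):
    """Two-sided p for a difference in means (z-test, known common SD)."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical effect: 0.5 kg difference, SD 10 kg (standardized effect 0.05).
for n in (50, 500, 5000, 50000):
    p = two_sided_p_value(0.5, 10, n)
    print(f"n per group = {n:>5}: p = {p:.4g}")
```

The same minute difference is far from significant with 50 participants per group and falls below 0.05 with 5000 per group, although its clinical relevance has not changed.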
Akaike information criterion (AIC): Let’s assume we want to identify a set of predictors that best describe our data. Given a set of candidate models, the most suitable is the one with the lowest AIC. Additional predictors are considered meaningful in explaining the data if they lower the AIC, although the method used to penalize added predictors is also somewhat arbitrary. Other measures of model fit are used similarly to test hypotheses.
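As a brief illustration, the AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood of the model. The sketch below compares three candidate models; the model names, log-likelihoods, and parameter counts are invented for illustration and do not come from any real dataset.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {
    "age": (-120.4, 2),
    "age + sex": (-115.1, 3),
    "age + sex + height": (-114.9, 4),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:<20} AIC = {score:.1f}")
print("lowest AIC:", best)
```

In this made-up case, adding sex lowers the AIC, while adding height improves the likelihood too little to offset the penalty of 2 per extra parameter.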
Cross validation error: The accuracy of a prediction model, with a set of identified predictors that best describe a set of data, can be tested by applying said model to a different dataset. Prediction error in the new data can then be used to make decisions about the predictors’ stability of association with the outcome of interest.
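One way to put this into practice, when no separate dataset is available, is k-fold cross validation: the data are split into k parts, the model is fitted on all but one part, and the prediction error is measured on the part left out. The sketch below assumes a single-predictor linear model and uses hypothetical data; the variable names are ours.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_error(xs, ys, k):
    """Mean squared prediction error on held-out data, averaged over k folds."""
    fold_errors = []
    pairs = list(zip(xs, ys))
    for fold in range(k):
        held_out = set(range(fold, len(pairs), k))  # every k-th observation
        train = [pair for i, pair in enumerate(pairs) if i not in held_out]
        test = [pair for i, pair in enumerate(pairs) if i in held_out]
        slope, intercept = fit_line(*zip(*train))
        squared = [(slope * x + intercept - y) ** 2 for x, y in test]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


# Hypothetical data: weekly exercise hours and weight loss in kg.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
weight_loss = [0.8, 2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9, 9.1, 9.8]

cv_error = k_fold_error(hours, weight_loss, k=5)
print(f"5-fold cross validation error (MSE) = {cv_error:.3f}")
```

A predictor whose association with the outcome is unstable would tend to show a much larger error on held-out data than on the data used to fit the model.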
Fragility index: As discussed above, even small, clinically meaningless effects can be significant at the 0.05 level. Walsh and colleagues investigated the ‘fragility’ of significant results in randomized controlled trials by changing the status of patients without an event to an event until the p-value exceeded 0.05 [6]. Small numbers, i.e. fewer patients with changed statuses before p became > 0.05, represented a more fragile result. In their sweep of 399 RCTs, they found that 25% had a Fragility Index of 3 or less. The results are alarming and suggest that, in a quarter of these trials, changing the status of 3 or fewer patients was enough for the study to lose its significance.
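The procedure can be sketched as follows. This is a minimal illustration rather than the authors’ own code: it assumes a two-sided Fisher exact test, switches patients in the arm with fewer events, and stops once p reaches 0.05 or more. The trial counts at the end are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    return sum(
        probability(x)
        for x in range(low, high + 1)
        if probability(x) <= observed * (1 + 1e-7)
    )


def fragility_index(events_a, total_a, events_b, total_b, alpha=0.05):
    """Number of non-events switched to events before p is no longer < alpha."""
    if events_a > events_b:  # always modify the arm with fewer events
        events_a, total_a, events_b, total_b = events_b, total_b, events_a, total_a
    switched = 0
    while events_a < total_a and fisher_exact_two_sided(
        events_a, total_a - events_a, events_b, total_b - events_b
    ) < alpha:
        events_a += 1
        switched += 1
    return switched


# Hypothetical trial: 10/100 events in one arm, 25/100 in the other.
print("Fragility Index:", fragility_index(10, 100, 25, 100))
```

A result of 0 means the comparison was not significant to begin with; small positive values indicate that the conclusion hinges on the outcomes of only a handful of patients.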
Thanks to Akram Alyass for comments.
1. Lehmann EL. The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? J Am Stat Assoc. 1993;88(424):1242. doi:10.2307/2291263.
2. Dahiru T. P-value, a true test of statistical significance? A cautionary note. Ann Ibadan Postgrad Med. 2008;6(1):21-26. http://www.ncbi.nlm.nih.gov/pubmed/25161440. Accessed January 30, 2018.
3. Sellke T, Bayarri MJ, Berger JO. Calibration of p Values for Testing Precise Null Hypotheses. Am Stat. 2001;55(1):62-71. doi:10.1198/000313001300339950.
4. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716. doi:10.1126/science.aac4716.
5. Sullivan GM, Feinn R. Using Effect Size-or Why the P Value Is Not Enough. J Grad Med Educ. 2012;4(3):279-282. doi:10.4300/JGME-D-12-00156.1.
6. Walsh M, Srinathan SK, McAuley DF, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. J Clin Epidemiol. 2014;67(6):622-628. doi:10.1016/j.jclinepi.2013.10.019.