Even outside of science and statistics you hear about the p-value. That mystical being that everyone wants to be lower than .05. Magical words like „statistical significance“ are tied to it. Sometimes, you will see strange headlines demanding the retirement of p-values. Is there something to it? And what does it mean?

– When we compare groups, we calculate statistical tests that often return p-values. If this value is below .05, we call that statistically significant.
– However, statistical significance is often misinterpreted – even by professionals.
– The p-value helps us to eyeball the fit of our data with the null hypothesis – e.g. the hypothesis that there is no difference in the ratings of two burger restaurants.
– The problem is: Even if p < .05, that does not mean that a difference between two groups is more likely than no difference.
– That is why some scientists argue for abolishing mindless significance testing – instead, we should report intervals and uncertainty more often.

Challenge p-value

Every time we want to know if there is a difference between two groups, we calculate a statistical test.¹ For example, we could ask if people who got a drug have a better chance of recovery than those who did not. With the most common tests, we get a so-called p-value – and if that is below .05, we report a „statistically significant difference“. However, it’s not that easy to get your head around what that means. Even people who are trained in statistics frequently interpret p-values incorrectly. These misconceptions are usually thought to be one of the causes of the replication crisis in science, i.e. the problem that some findings don’t show up again when you repeat an experiment. One of the main culprits is publication bias, which is directly tied to the p-value: Statistically significant findings (p < .05) have better chances of getting published. So unsurprisingly, some authors try to squeeze their p-values a little bit – which you can achieve e.g. by manipulating the exclusion criteria for your participants or with some other statistical tricks. This is called „p-hacking“ and it does not always happen consciously: It can be quite hard to tell which kind of manipulation of your data is necessary and which kind is wrong. Maybe that is why Coldplay sing in The Scientist: „Nobody said it was easy.“

So, the problem with p-values is a mixture of lack of knowledge and (conscious) manipulation. To understand the current criticism, we first need to understand what the p-value actually means. I don’t want to sugar-coat it: Interpreting the p-value can be quite mind-bending and I had to stare into the void for three hours straight until I felt mentally prepared to write this post. But fear not – I think I’ll manage a comprehensible explanation that doesn’t require a math or statistics course.

Why the p-value?

Why all the fuss? Because every time we compare groups, we can’t get hold of the whole population. Instead, we draw (comparatively small) samples that we use to infer statements about the population.² When doing so, we might randomly pick up some differences between groups that don’t have anything to do with the effect we wanted to investigate. It’s best if we illustrate that with an example: We compare two burger restaurants and want to know if one of them got better online ratings than the other.

We’ll simulate that in R, my favourite programming language. That way, we know all restaurant ratings (i.e. the whole population) and can toy around with different scenarios. I generate 10 million ratings on a scale from 1 – 10. Because I can. I use these 10 million ratings for burger restaurant 1 and the same set of ratings for burger restaurant 2. That means, there are 20 million ratings in total and both restaurants are identical. For both restaurants, the distribution of ratings looks like this:

Ratings are quite scattered: Some customers had a lucky day and gave the maximum score of 10, some people might be hard to impress or had to wait for a long time, so they gave only one point. Overall, we see solid, but not outstanding burger restaurants: Most ratings are 5 or 6, which is also roughly the average rating.
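If you want to build a similar playground yourself, a population like this can be sketched in a few lines. The simulation for this post was written in R; the Python below is only an illustration and not that code. The bell shape (mean 5.5, standard deviation 2.4, rounded and clipped to the 1 to 10 scale), the function name `make_population` and the reduced size of 100,000 ratings are all assumptions, chosen to roughly match the description above.

```python
import random
from collections import Counter

def make_population(size, seed=1):
    """Draw `size` ratings on a 1-10 scale from a rounded, clipped bell curve."""
    rng = random.Random(seed)
    ratings = []
    for _ in range(size):
        value = round(rng.gauss(5.5, 2.4))      # assumed mean and spread
        ratings.append(min(10, max(1, value)))  # keep ratings on the 1-10 scale
    return ratings

# 100,000 ratings instead of 10 million, to keep the sketch fast.
population = make_population(100_000)
counts = Counter(population)

# Text histogram: one '#' per percentage point of all ratings.
for score in range(1, 11):
    share = counts[score] / len(population)
    print(f"{score:2d} {'#' * round(share * 100)} {share:.1%}")
print("mean rating:", round(sum(population) / len(population), 2))
```

Using the same list for both restaurants gives you two populations that are identical by construction, which is the whole point of the exercise.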

Here’s the thing: In real life, we wouldn’t have access to all 20 million ratings, but would rather have a small set of ratings – say, 100 per burger restaurant – drawn from the internet.³ That’s where we have a problem: Even if our sample is drawn randomly, it might not be representative. With a bit of bad luck, we might only get the worst ratings for burger restaurant 1 and only the best for burger restaurant 2. For example, we can tell from the plot that out of 10 million ratings, 1 point was awarded about 500,000 times, so there are many ways we could happen to draw a sample for burger restaurant 1 that consists solely of 1-point ratings. It might then appear like burger restaurant 2 is the better one – even though the two restaurants are exactly the same if you consider all ratings.

This phenomenon – drawing a sample that does not represent the true population – is called sampling error. Even with large, random samples it cannot be fully prevented. Except when you have access to the whole population. But that’s highly unrealistic in most scenarios.
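To get a feeling for sampling error, here is a small Python sketch (again not the R code behind this post). It draws two samples from the very same rating distribution and measures how far apart their mean ratings end up. The rating distribution in `draw_rating` is an assumed stand-in for the real data, the helper names are made up for this illustration, and ratings are drawn fresh each time instead of from a fixed pool of 10 million.

```python
import random

def draw_rating(rng):
    """One rating on the 1-10 scale (assumed bell shape, stand-in for real data)."""
    return min(10, max(1, round(rng.gauss(5.5, 2.4))))

def mean_gap(group_size, rng):
    """Absolute difference in mean rating between two samples from the SAME distribution."""
    sample_1 = [draw_rating(rng) for _ in range(group_size)]
    sample_2 = [draw_rating(rng) for _ in range(group_size)]
    return abs(sum(sample_1) / group_size - sum(sample_2) / group_size)

rng = random.Random(42)
typical_gap = {}
for group_size in (10, 100, 1000):
    # Repeat the "experiment" 300 times per group size.
    gaps = [mean_gap(group_size, rng) for _ in range(300)]
    typical_gap[group_size] = sum(gaps) / len(gaps)
    print(f"{group_size:5d} ratings each: typical gap {typical_gap[group_size]:.2f}, "
          f"largest gap {max(gaps):.2f}")
```

Larger samples shrink the gap between two identical restaurants, but they never bring it down to exactly zero.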

What the p-value is trying to tell us

Statistical tests – and the p-values they return – try to estimate the scope of this very problem: How likely is it that I’ve accidentally drawn a sample where I find a difference when there is in fact none? Here, we see how powerful a simulation is: I can draw as many samples as I wish. From my 20 million burger ratings, I sample 10 ratings for each burger restaurant 10,000 times. And then, I sample 50 ratings for each restaurant 10,000 times. In fact, I draw 10,000 samples for each of the following group sizes: 10, 50, 100, 200, 500, 1000, 2000, 3000 and 4000. That means, I pretend to conduct my „burger experiment“ 90,000 times (10,000 x 9 different group sizes). Then, I look at the share of cases where it looks like there is a difference between restaurants. Each time, I compare the two burger restaurants with a so-called t-test which gives me a p-value. It tells me: In which percentage of cases would I find a difference that is at least as extreme as the one I have just observed – given that in fact, there is no difference? For now, we can ignore the part that goes „given that in fact, there is no difference“. Our simulations have the major advantage that we know there is no difference. In our case, we are allowed to directly translate the p-value as: „In which percentage of cases would I find a difference that is at least as extreme as the one I have just observed?“ That sounds a little complicated, but don’t worry: It will get clearer in a bit. First, we want to get a feeling for what our data look like when we observe different p-values. Let’s take a look at four examples from our 90,000 simulated burger experiments.
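A single simulated „burger experiment“ could look roughly like the following Python sketch. It is not the R code used for this post, and it takes one shortcut that should be stated loudly: instead of a proper t-distribution, `two_sided_p_value` uses a normal approximation for the Welch t statistic. That is reasonable for groups of a few hundred ratings and too optimistic for tiny groups such as 10. The rating distribution and all function names are assumptions made for this illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def draw_sample(group_size, rng):
    """Ratings on a 1-10 scale from an assumed bell-shaped distribution."""
    return [min(10, max(1, round(rng.gauss(5.5, 2.4)))) for _ in range(group_size)]

def two_sided_p_value(sample_1, sample_2):
    """Welch t statistic, with a normal approximation for its null distribution."""
    standard_error = sqrt(variance(sample_1) / len(sample_1)
                          + variance(sample_2) / len(sample_2))
    t_statistic = (mean(sample_1) - mean(sample_2)) / standard_error
    # Share of experiments with a difference at least this extreme in either
    # direction, if there is in fact no difference between the restaurants.
    return 2 * (1 - NormalDist().cdf(abs(t_statistic)))

rng = random.Random(7)
restaurant_1 = draw_sample(200, rng)
restaurant_2 = draw_sample(200, rng)
p_value = two_sided_p_value(restaurant_1, restaurant_2)
print(f"mean 1: {mean(restaurant_1):.2f}, mean 2: {mean(restaurant_2):.2f}, p = {p_value:.3f}")
```

Repeating these last few lines many times with different group sizes gives you the kind of collection of p-values that the examples below are picked from.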

Example 1: We got several cases of p = 1 in our simulations. That means: Pretty much any other sample of ratings would result in a more extreme difference between restaurants than the one we observe here. One such case is the run where both burger restaurants received a mean rating of 5.5 when I drew 4,000 ratings for each of them. It is easy to see how the difference in another sample could be larger, but never smaller. Since 1 is just another way to write 100 %, that means: In 100 % of our simulated burger experiments, we would find a difference between restaurants that is at least as large as this one.

Example 2: What does the magical threshold of p = .05 look like? We pretty much hit this value spot on with e.g. a mean rating of 5.56 for burger restaurant 1 and a mean rating of 5.41 for burger restaurant 2 when I drew 2,000 ratings for each restaurant. Again, since .05 is just a different way of saying 5 %: That means in only 5 % of our simulated samples we will find a difference larger than this. So, that’s what they call „statistically significant“.

Example 3: What is the smallest p-value I can find in my 90,000 simulated experiments? We get that with a mean rating of 5.38 for burger restaurant 1 and a mean rating of 5.61 for burger restaurant 2 with a group size of 4000 ratings for each of them. The p-value is p = 0.0000097 here – we would call that highly significant. Note that the difference between the two mean ratings is rather small – but the large sample size drives the p-value down.

Example 4: So, let’s look at the biggest difference between restaurants that we can find. Holy cow. We spot a mean rating of 7.3 for restaurant 1 and 3.1 for restaurant 2. That’s a difference of 4.2 points! The p-value here is 0.00033. Note that the group size for each restaurant was only 10 ratings. Still, I bet a lot of people would already strongly prefer eating at restaurant 1 if they found such a difference with 10 ratings for each restaurant on Google Maps. It’s quite remarkable, really. Still, that difference is rather small when you imagine that theoretically, we could have drawn 100 1-point ratings for burger restaurant 1 and 100 10-point ratings for burger restaurant 2. If we draw enough samples, this case will happen at some point. That also means: Some poor burger restaurant scientist will eventually find such a sample due to the crazy ways of chance.

With our simulations, we can check if our t-test did a good job. Following the logic of null hypothesis significance testing, we should expect p <= .05 in about 5 % of our simulated experiments. Around 2 % of our samples should result in p-values <= .02 and so on. And indeed: In 4.92 % of our samples, we find p < .05.

So, the p-value helps us to estimate the chance of a sampling error. There is just one problem: It can only give us a reliable estimate of the chance of a sampling error when there is no difference between the groups. Remember we could ignore that in our example because we knew for certain that the burger restaurants do not differ. In that case, we get a perfect estimate of how often to expect a sampling error. If the burger restaurants were different, we would not expect to get p < .05 in 5 % of cases. We would expect to get more.
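Both claims can be checked with a small simulation: roughly 5 % of p-values fall below .05 when the restaurants are identical, and clearly more when they differ. The Python sketch below is an illustration under assumptions and not the R code behind this post. The rating distribution, the half-point advantage for restaurant 2, the group size of 100 and the normal approximation used for the p-value are all choices made for this example.

```python
import random
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def draw_sample(group_size, rng, shift=0.0):
    """Ratings on a 1-10 scale; `shift` moves the assumed average rating."""
    return [min(10, max(1, round(rng.gauss(5.5 + shift, 2.4)))) for _ in range(group_size)]

def mean_and_variance(sample):
    m = sum(sample) / len(sample)
    return m, sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

def p_value(sample_1, sample_2):
    """Two-sided p-value for a Welch t statistic (normal approximation)."""
    m1, v1 = mean_and_variance(sample_1)
    m2, v2 = mean_and_variance(sample_2)
    t_statistic = (m1 - m2) / sqrt(v1 / len(sample_1) + v2 / len(sample_2))
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(t_statistic)))

def share_significant(shift, experiments, group_size, rng, alpha=0.05):
    """Share of simulated experiments that end with p < alpha."""
    hits = 0
    for _ in range(experiments):
        sample_1 = draw_sample(group_size, rng)
        sample_2 = draw_sample(group_size, rng, shift)
        if p_value(sample_1, sample_2) < alpha:
            hits += 1
    return hits / experiments

rng = random.Random(2019)
rate_identical = share_significant(0.0, 2000, 100, rng)  # no true difference
rate_different = share_significant(0.5, 2000, 100, rng)  # restaurant 2 is better
print(f"identical restaurants: {rate_identical:.1%} of experiments with p < .05")
print(f"different restaurants: {rate_different:.1%} of experiments with p < .05")
```

How much larger the second number gets depends on the size of the true difference and on the group size, which is exactly the information we do not have in real life.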

Sooo … ? Where’s the problem?

First, the problem seems obvious: In real life, we don’t know if burger restaurants differ or not. After all, this is what we’re trying to find out! That means we don’t know if our p-values reliably estimate the risk of a sampling error. Sounds like they’re pretty much useless. On second thought, though, we arrive at the following conclusion: „Wait a minute. We know that 5 % of p-values are < .05 when there is no difference. We also know that more p-values are < .05 when there is a difference. Doesn’t that mean: When the p-value is < .05, it is more likely that there is a difference?!“ That’s what a lot of people think: As soon as p < .05, they assume burger restaurant 1 to be worse (or better) than burger restaurant 2. But, ladies and gentlemen, that’s exactly where the misconception lies! Listen up as we get right to the core of the issue that can bowl over even the most hard-boiled statistician: We fail to recognise that we’re dealing with conditional probabilities. The Presbyterian minister Thomas Bayes made them famous in the 18th century and even today, we are still using Bayes‘ theorem.

No cancer even though the test was positive

A very famous and pretty mind-blowing example of Bayes‘ theorem is that of screening for breast cancer. To discuss that in detail, I’d need another blog post. It all boils down to this, though: Say, the chances of having breast cancer are 1 % in the population. To screen for breast cancer, we conduct a mammogram. Like any screening method, the test is sometimes wrong. If a woman has cancer, the test will be positive in 80 % of the cases. But even in healthy women, it will come out positive in 10 % of the cases. Now we find a woman whose test has been positive. How likely is it that she actually has breast cancer? Surprisingly, only 7.5 %. That means: Yes, it is now more likely than before that she has cancer (before: 1 % – now: 7.5 %). But: It is still far more likely that she doesn’t have cancer.
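The 7.5 % follow directly from Bayes‘ theorem, and the arithmetic is short enough to write down. Here is a minimal Python sketch using exactly the three numbers from the example; the function and parameter names are made up for this illustration.

```python
def posterior_probability(prior, true_positive_rate, false_positive_rate):
    """Bayes' theorem: P(cancer | positive test)."""
    true_positives = prior * true_positive_rate          # sick AND positive
    false_positives = (1 - prior) * false_positive_rate  # healthy AND positive
    return true_positives / (true_positives + false_positives)

p_cancer = posterior_probability(prior=0.01,
                                 true_positive_rate=0.80,
                                 false_positive_rate=0.10)
print(f"P(cancer | positive mammogram) = {p_cancer:.1%}")  # prints 7.5%
```

Out of 1,000 women, 10 have cancer and 8 of them test positive, while 99 of the 990 healthy women test positive as well. So only 8 of the 107 positive tests belong to women with cancer.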

This is exactly what could happen with our statistical test: When we find p < .05, it is more likely than before that there is a difference between burger restaurants. However, it could still be the case that „There is no difference“ is more likely than „There is a difference“. Far more likely. That also means: The p-value can’t tell you if there is a difference between two groups or not – or even which is more likely!

It is important to note that this does not need to be the case. It could be that after a positive mammogram, it is more likely to have cancer than to be healthy. It could be that after a statistically significant result, the hypothesis that there is a difference between groups is more likely than the hypothesis that there is none. The confusing effect we saw in the cancer example happens partly because it is very unlikely to have cancer to begin with. Unfortunately, we don’t know in advance how likely our hypotheses are, e.g. what the probability is that the burger restaurants differ.
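The same arithmetic as in the mammogram example shows how strongly the answer depends on that unknown starting probability. The Python sketch below is a simplified illustration: it treats „p < .05“ like a positive test result, with a false alarm rate of 5 % and an assumed (purely hypothetical) 80 % chance of getting p < .05 when the restaurants really differ. The three prior probabilities are made-up scenarios, not estimates.

```python
def probability_of_real_difference(prior, power=0.80, alpha=0.05):
    """P(there is a difference | p < alpha), via Bayes' theorem.

    prior: assumed probability of a real difference before seeing any data
    power: assumed chance of p < alpha if the difference is real (hypothetical)
    alpha: chance of p < alpha if there is no difference
    """
    true_alarms = prior * power
    false_alarms = (1 - prior) * alpha
    return true_alarms / (true_alarms + false_alarms)

for prior in (0.01, 0.10, 0.50):
    posterior = probability_of_real_difference(prior)
    print(f"prior {prior:.0%} -> after p < .05: {posterior:.0%}")
```

With a long-shot hypothesis, „no difference“ stays the more likely explanation even after a significant result. With a plausible hypothesis, the picture flips.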

Goodness gracious! Why do we even do this?!

That’s what some scientists are wondering right now. That’s why 800 of them signed a petition to retire statistical significance. That’s a lot of people, but still only a fraction of all the scientists in the world. It is important to say that those scientists‘ problem is not the p-value itself, but rather its misuse. We saw that p-values are pretty much useless as a criterion for deciding if there is a difference between groups or not. Yet, this is what many people are doing. P-values still provide information about how well our data fit the null hypothesis (that there is no difference). So, the p-value is actually a helpful little critter. That is, when you don’t divide the world into „significant or not“ (maybe using asterisks). Or, even worse, infer the likelihood of hypotheses from a p-value. However, if most people – even those with statistical training – can’t make sense of the term „statistically significant“, we should indeed come up with a better solution.

So … what should we do instead?

The initiators of the petition advocate using intervals. For example, instead of reporting only the difference between the means of the two burger restaurants, we would indicate within which range we would expect it to lie. This is something you can also determine mathematically. From „Burger restaurant 1 received 1.24 points more on average“ we go to: „Burger restaurant 1 received, on average, between .41 and 2.01 points more.“ That paints a different picture – and also comes with more uncertainty. So what if I’m a sponsor trying to decide in which restaurant to invest? In that case, I need clear distinctions – „difference or not“ instead of squishy intervals. The authors of the petition, however, argue that even now, decisions aren’t based solely on p-values. Our hypothetical sponsor will certainly think about a whole bunch of other things: Which restaurant is more likeable in his opinion? Where does he see the greater potential for the future? And if there is so much uncertainty involved, a yes-or-no decision is probably wrong anyway. No decision is better than the wrong one … isn’t it? In the end, it’s a philosophical question of how much risk we are willing to take and how many errors we allow in our decisions. And how desperately we need a clear-cut decision.
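Such an interval for the difference between two mean ratings can be computed in a few lines. The Python sketch below is one common way to do it, a 95 % interval based on the normal approximation, and not necessarily the method the petition’s authors have in mind. The ratings are simulated under assumed distributions (restaurant 1 is made one point better on purpose) and the function name is made up for this illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def difference_interval(sample_1, sample_2, confidence=0.95):
    """Return (low, difference, high) for mean(sample_1) - mean(sample_2)."""
    difference = mean(sample_1) - mean(sample_2)
    standard_error = sqrt(variance(sample_1) / len(sample_1)
                          + variance(sample_2) / len(sample_2))
    # Normal approximation; small samples would call for a t-distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return difference - z * standard_error, difference, difference + z * standard_error

# Simulated ratings, purely for illustration.
rng = random.Random(3)
restaurant_1 = [min(10, max(1, round(rng.gauss(6.5, 2.4)))) for _ in range(60)]
restaurant_2 = [min(10, max(1, round(rng.gauss(5.5, 2.4)))) for _ in range(60)]

low, difference, high = difference_interval(restaurant_1, restaurant_2)
print(f"Restaurant 1 received {difference:.2f} points more on average "
      f"(95 % interval: {low:.2f} to {high:.2f}).")
```

A wide interval is the honest way of saying that 60 ratings per restaurant do not pin the difference down very precisely.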

Another alternative is the use of new statistical procedures that only became possible through recent advances in computational power. A lot of these rely on simulations, for example, not unlike the ones we just did. Even Bayes‘ theorem, which I just mentioned, currently plays a prominent part in statistics. Using it, we are able to actually compare two hypotheses to each other. Of course, each procedure also has disadvantages and limits, especially when you get the interpretation wrong. This is not only a weakness of the p-value.

The punchline is: It’s not either black or white. Whenever we deal with data, we have to prepare for uncertainty and sometimes surprisingly squishy results. That does not mean that statistics is random or that we can’t draw any solid conclusions with it. It just means that the road until we get there is slightly longer and a bit more exhausting than we sometimes would like to think.

Data and escalation

After the German version of this post, some people asked me if I shouldn’t have used a Wilcoxon rank sum test instead of a t-test because my burger restaurant ratings were ordinal data. So I let the simulation run again to show that the t-test is robust. Then, things went slightly viral on Twitter and people were suggesting even more analyses. Check out the Wilcoxon Wars here. Also, find the R code for it in this GitHub repo. I also adapted the code for this blog post – if you want to toy around with the very data I described here, use the modified functions in the script modified_blog_function.R. You can also find the code for all the plots there.

  1. Statistical tests are not only used to find differences between groups, but that is the most common case – and probably the one that you are interested in.
  2. That doesn’t necessarily refer to „everyone in the world“. For a drug against lung cancer that might be „all people with lung cancer“.
  3. This is not important for our example, but in case you were wondering: I assume that each person rated only one of the burger restaurants, not both. Hence, the two groups are independent.

Links and sources in the order of appearance, 17.06.2019

[1] Gigerenzer, G. (2018). Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science 1(2), 198-218.
I don’t like the tone of that paper much, but see p. 207 for a table of how many percent of scientists and students show misconceptions about the p-value.
[2] Hopewell, S., Loudon, K., Clarke, M.J., Oxman A.D. & Dickersin, K. (2009). Publication bias in clinical trials due to statistical significance or direction of trial results. Cochrane Database of Systematic Reviews 1, Art. No.: MR000006. DOI: 10.1002/14651858.MR000006.pub3.
[3] Head, M.L., Holman, L., Lanfear, R., Kahn, A.T. & Jennions, M.D. (2015). The Extent and Consequences of P-Hacking in Science. PLOS Biology 13(3): e1002106. https://doi.org/10.1371/journal.pbio.1002106
[4] Cut The Knot – Bayes‘ Theorem. 1996-2018, Alexander Bogomolny.
[5] Amrhein, V., Greenland, S. & McShane, B. (2019). Scientists rise up against statistical significance. Nature 567, 305-307. doi: 10.1038/d41586-019-00857-9