P-values are everywhere in scientific papers, but what does a p-value even mean, and why
is it so special if it's less than 0.05? In this video we'll explain how p-values
help scientists make conclusions. For example, in this 2006 paper scientists
were studying weight maintenance in mice and the impact of the microbiome on weight gain.
The microbiome is the collection of microbes that live in or on an organism's body. The scientists
had shown that the gut microbiome of obese mice had a different composition of bacteria than the
microbiome of lean mice, but they wanted to know whether the difference in bacteria was the result
of the difference in weight or whether the microbiome caused the difference in weight. To test that, they
took mice that had no microbiome of their own and they did a fecal transplant. Yeah, you
heard that right: they took feces from lean or obese mice and they placed the fecal bacteria
inside mice that had no bacteria of their own. Two weeks later both sets of mice had gained
weight! The bacteria helped the mice get more energy from their food. So the question is: can
we conclude that the microbiome from obese mice makes the mice gain more weight?
Yes, we make that
conclusion! The bar is higher: they gained more weight. What's the alternative explanation? That
they're not actually different. What do you mean that they're not different? The bar is higher! Of
course they're different! Let me give you another example: I had this idea once that people's height
depended on their last name. Why? So, I thought maybe people whose last name starts in the second half
of the alphabet – maybe those people are on average a little bit taller… I know, I know, I know – but
it's a testable hypothesis! So I did an experiment: I asked people on Twitter to tell me their
height and the first letter of their last name. And look: according to the data, people whose last
name starts with a letter in the second half of the alphabet are on average taller! But
that's ridiculous! You only had two people in each category! Like, you just happened to get two
tall people in the N-Z category! Well, how do you know that, though? How do you know it's chance?
I mean, we know there's a difference in their last name! What did the data look like when you had more
people? OK, OK, so when I looked at more data the difference between the groups got smaller, but it
was still a difference! But, I mean, still, like, maybe you just had some tall people in the N-Z
category just by chance. Like, that's possible, right? Exactly! Exactly! That's what p-values help us do:
they help us tell the difference between random chance and a real difference between the groups.
So, going back to the mice: how do we know whether this difference is due to random chance or due
to the different fecal transplants? I mean, if you took 100 mice and weighed 10 of them, their
average weight would be different than if you weighed another 10 mice.
But that's just because
of random chance. Now, if I told you that these two groups received different fecal transplants, then
you might think that the difference in weights is due to the fecal transplants. But what if the
fecal transplants have no effect? I mean, we know that if you take two different groups and weigh
them you'll get slightly different averages, right? So how do we decide whether this difference
in weights is due to random chance or due to different fecal transplants? The p-value helps us
decide between these two. In statistics these two possibilities have names: the null hypothesis
is that the two groups are actually the same but you see a difference in your measurement
just because of random chance. The alternative hypothesis is that the difference in weight is
not due to random chance but rather due to some real difference in the characteristics of the two
groups, like which fecal transplant the mice got. In statistics we call these two possibilities the
null hypothesis and the alternative hypothesis, but they are different than the scientific
hypothesis! In this case the scientific hypothesis is that the microbiome of
obese mice contributes to mouse obesity. Just because they're both called a hypothesis
doesn't mean they're the same, so don't mix them up! So, to answer the scientific question, we
need to know whether this difference is real or just due to random chance.
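The earlier point about weighing 10 mice out of 100 can be tried out with a small simulation. This is only a sketch: the mouse weights are made-up numbers (not data from the paper), and the function name `mean_difference_between_random_groups` is a hypothetical helper introduced for illustration. Every pair of groups comes from the same population, so any difference in averages here is due to random chance alone.

```python
import random
import statistics


def mean_difference_between_random_groups(weights, group_size, rng):
    """Pick two non-overlapping random groups and return the difference in their mean weights."""
    sample = rng.sample(weights, 2 * group_size)
    group_a = sample[:group_size]
    group_b = sample[group_size:]
    return statistics.mean(group_a) - statistics.mean(group_b)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100 mouse weights in grams (made-up, NOT from the paper).
population = [rng.gauss(25.0, 2.0) for _ in range(100)]

# Repeat the "weigh 10, then weigh another 10" comparison a few times.
differences = [
    mean_difference_between_random_groups(population, 10, rng) for _ in range(5)
]
for d in differences:
    print(f"difference in group means: {d:+.2f} g")
```

Each run of the comparison prints a small, nonzero difference even though nothing distinguishes the groups. That is the kind of difference the null hypothesis describes.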
To distinguish
between these two possibilities we use a p-value. A p-value is a number that is
calculated using a statistical test. There are lots of different statistical tests
that can be used on different types of data but each statistical test has an output which
is the p-value. The p-value is a probability – that's what the p stands for – it's the probability
of seeing a difference at least as big as the observed one IF THE NULL HYPOTHESIS WERE TRUE. For example,
imagine that we have a lot of mice and we weigh two groups of them. Let's assume the two groups
are not actually different (remember, that's the null hypothesis). When you measure their weights,
the averages are not exactly the same, just because of random chance in which mice got measured.
But
that's normal! In fact, the probability of getting a tiny difference is actually quite high even if
there is no real difference between the two groups. That probability is the p-value. It's usually
expressed as a decimal fraction instead of a percentage. Now, if you measured two new groups, the
averages would be different and you'd have a different p-value. Now let's say you
measure two groups and the p-value is small. That means getting that result is pretty unlikely
IF THE NULL HYPOTHESIS WERE TRUE. So that's when you say: maybe in this case the null hypothesis is
not true, so these data are better described by the alternative hypothesis, that the groups really are
different. But the p-value can be anything, right? From low to high? So how do you know when it's
low enough to reject the null hypothesis? That's a great question! Scientists tend to reject the null
hypothesis when the p-value is less than 0.05. To explain why p-values less than 0.05 are so
special, let's play a card game. Have a seat.
Okay… Here's how it works: I'm going
to flip over one card at a time. Every time I flip over a black
card, you get a bar of chocolate. Every time I flip over a red card, you owe
me a dollar. Okay… I mean, I like chocolate… Alright, let's play! Okay… a dollar, huh? Alright, here's a dollar. Alright, another dollar. Come on, come on, come on, come on! Oh, come on! Uh uh, no! There's something wrong with your deck.
It's rigged! Your deck is rigged! You're right, it is! When you started the game, you assumed
that the deck I was using was a normal deck. That was your null hypothesis.
The first time you
got a red card, that didn't tip you off, because if the deck I'm using is the same as a normal
deck there's a 50% probability that you'll get a red card. That's a pretty high chance, so you
weren't suspicious yet. The second time you got a red card you still weren't suspicious, because
if the deck I'm using is the same as a normal deck the probability of getting two red cards
in a row is 25% – that's a one in four chance, which isn't that unusual. As I kept flipping over cards,
the probability of getting that many red cards in a row, if the deck I'm using is the same as a
normal deck, kept going down.
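The shrinking probability can be worked out directly. The sketch below assumes two things that are not spelled out in the video: that "normal deck" means a standard 52-card deck with 26 red cards, and that the quoted percentages treat each flip as an independent 50/50 event. The function names are hypothetical. The second function shows what changes if the cards are instead dealt from one deck without being put back.

```python
from fractions import Fraction


def prob_all_red_independent(n):
    """Chance of n red cards in a row if every flip is an independent 50/50 event."""
    return Fraction(1, 2) ** n


def prob_all_red_from_one_deck(n, red=26, total=52):
    """Chance of n red cards in a row when dealing without replacement from one deck."""
    p = Fraction(1)
    for i in range(n):
        # After i red cards are gone, (red - i) reds remain among (total - i) cards.
        p *= Fraction(red - i, total - i)
    return p


for n in range(1, 6):
    independent = float(prob_all_red_independent(n))
    one_deck = float(prob_all_red_from_one_deck(n))
    print(f"{n} red in a row: {independent:.4f} (50/50 flips), {one_deck:.4f} (one deck)")
```

Under the 50/50 simplification the chance halves with every card, reaching 1/32 (about 3%) at five red cards. Dealing from a single deck makes each run slightly less likely still, so the simplification does not change the conclusion.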
You started getting suspicious, but you weren't sure yet that it was a
rigged deck. Finally, when I flipped over the fifth red card, that's when you said – there's something
wrong with your deck! – because if the deck I'm using is the same as a normal deck, the probability of
getting five red cards in a row is about three percent. And because that's so unlikely, I thought: I don't
think the deck you're using is a normal deck? Yes, that's right! Getting that result if the
null hypothesis is true is so intuitively unlikely that that's when you rejected the null hypothesis.
So if the p-value is the probability of getting a result if the null hypothesis is true, then if the
p-value goes below five percent – for p-values we use fractions, so 0.05 – we reject the null
hypothesis and conclude the alternative? Yes, that's right! And let's be really clear: it's not like
there's something magical that happens when you pass the 0.05 threshold. 0.049 is really similar
to 0.051.
Because it's just a probability? Exactly! Remember the mice? The mice that received the
fecal transplants from obese mice gained more weight. But was the difference in weight just random, because
of slight differences in the mice they chose, or was it the result of the fecal transplants? The
scientists used a statistical test called a t-test to calculate the p-value, and in this case the
p-value was less than 0.05, so the scientists rejected the null hypothesis and concluded that
the difference between the two groups of mice was real and likely due to the fecal transplants.
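To make the step from data to p-value concrete, here is a minimal sketch that uses a permutation test rather than the t-test the scientists used. It was chosen because it follows the definition of a p-value almost literally and needs only the Python standard library. The weight gains are made-up numbers (NOT the data from the 2006 paper), and `permutation_p_value` is a hypothetical helper. The idea: if the null hypothesis is true, the group labels are meaningless, so shuffle them many times and count how often the shuffled difference is at least as big as the observed one.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # under the null hypothesis the labels are interchangeable
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


# Hypothetical weight gains in grams (made-up numbers, NOT from the paper).
lean_donor = [1.2, 0.9, 1.5, 1.1, 1.3, 1.0, 1.4, 1.2]
obese_donor = [1.9, 2.2, 1.7, 2.4, 2.0, 1.8, 2.3, 2.1]

p = permutation_p_value(lean_donor, obese_donor)
print(f"estimated p-value: {p:.4f}")
print("reject the null hypothesis" if p < 0.05 else "do not reject the null hypothesis")
```

With these invented numbers the two groups barely overlap, so very few shuffles reproduce a difference that large and the estimated p-value falls below 0.05. A t-test answers the same question using a formula instead of shuffling.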
I mean, isn't that cool? The bacteria inside a mouse can affect how much weight it gains, which
might be true for humans too.
But don't try a fecal transplant at home! Besides being able to transmit
diseases, poo is a lot harder to work with than p… values! No! No! Such a bad joke…