P-values are everywhere in scientific papers, but what does a p-value even mean, and why

is it so special if it's less than 0.05? In this video we'll explain how p-values

help scientists draw conclusions. For example, in this 2006 paper, scientists

were studying weight maintenance in mice and the impact of the microbiome on weight gain.

The microbiome is the collection of microbes that live in or on an organism's body. The scientists

had shown that the gut microbiome of obese mice had a different composition of bacteria than the

microbiome of lean mice, but they wanted to know if the bacteria in the microbiome were the result

of the difference in weight or if the microbiome caused the difference in weight. To test that, they

took mice that had no microbiome of their own and they did a fecal transplant. Yeah, you

heard that right: they took feces from lean or obese mice and they placed the fecal bacteria

inside mice that had no bacteria of their own. Two weeks later, both sets of mice had gained

weight! The bacteria helped the mice get more energy from their food. So the question is: can

we conclude that the microbiome from obese mice makes the mice gain more weight?

Yes, we can make that conclusion. The bar is higher: they gained more weight. What's the alternative explanation? That

they're not actually different. What do you mean that they're not different? The bar is higher! Of

course they're different! Let me give you another example: I had this idea once that people's height

depended on their last name. Why? So, I thought maybe people whose last name starts in the second half

of the alphabet, maybe those people are on average a little bit taller… I know, I know, I know, but

it's a testable hypothesis, so I did an experiment: I asked people on Twitter to tell me their

height and the first letter of their last name, and look: according to the data, people whose last

name starts with a letter in the second half of the alphabet are on average taller! But

that's ridiculous! You only had two people in each category! Like, you just happened to get two

tall people in the N-Z category! Well, how do you know that, though? How do you know it's chance?

I mean, we know there's a difference in their last names! What did the data look like when you had more

people? OK, OK, so when I looked at more data, the difference between the groups got smaller, but it

was still a difference! But, I mean, still, maybe you just had some tall people in the N-Z

category just by chance. That's possible, right? Exactly! Exactly! That's what p-values help us do:

they help us tell the difference between random chance and a real difference between the groups.

So, going back to the mice, how do we know whether this difference is due to random chance or due

to the different fecal transplants? I mean, if you took 100 mice and weighed 10 of them, their

average weight would be different than if you weighed another 10 mice.

But that's just because of random chance. Now, if I told you that these two groups received different fecal transplants, then

you might think that the difference in weights is due to the fecal transplants. But what if the

fecal transplants have no effect? I mean, we know that if you take two different groups and weigh

them, you'll get slightly different averages, right? So how do we decide whether this difference

in weights is due to random chance or due to different fecal transplants? The p-value helps us

decide between these two. In statistics these two possibilities have names: the null hypothesis

is that the two groups are actually the same but you see a difference in your measurement

just because of random chance. The alternative hypothesis is that the difference in weight is

not due to random chance but rather due to some real difference in the characteristics of the two

groups, like which fecal transplant the mice got. In statistics we call these two possibilities the

null hypothesis and the alternative hypothesis, but they are different from the scientific

hypothesis! In this case, the scientific hypothesis is that the microbiome of

obese mice contributes to mouse obesity. Just because they're both called a hypothesis

doesn't mean they're the same, so don't mix them up! So, to answer the scientific question, we

need to know whether this difference is real or just due to random chance.

To distinguish between these two possibilities, we use a p-value. A p-value is a number that is

calculated using a statistical test. There are lots of different statistical tests

that can be used on different types of data but each statistical test has an output which

is the p-value. The p-value is a probability (that's what the p stands for): it's the probability

of seeing a difference at least as large as the observed one IF THE NULL HYPOTHESIS WERE TRUE. For example,

imagine that we have a lot of mice and we weigh two groups of them. Let's assume the two groups

are not actually different. Remember, that's the null hypothesis. So when you measure their weights,

the averages are not exactly the same, just because of random chance in which mice got measured.
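
To make that concrete, here is a minimal sketch in Python of weighing two groups drawn from the same set of mice. Everything in it is an assumption for illustration: the 100 weights are randomly generated (they are not data from the paper), and the function name `mean_difference` and the 0.5 gram cutoff are made up for this example.

```python
import random
import statistics


def mean_difference(population, group_size, rng):
    """Draw two non-overlapping groups from ONE population and
    return the difference between their average weights."""
    sample = rng.sample(population, 2 * group_size)
    group_a = sample[:group_size]
    group_b = sample[group_size:]
    return statistics.mean(group_a) - statistics.mean(group_b)


rng = random.Random(0)  # fixed seed so reruns give the same numbers

# Hypothetical weights in grams for 100 mice (invented, not real data).
population = [rng.gauss(25.0, 2.0) for _ in range(100)]

# Repeat the "weigh two groups of 10" experiment many times.
# Both groups always come from the same population, so the null
# hypothesis is true by construction.
differences = [mean_difference(population, 10, rng) for _ in range(1000)]

share_large = sum(abs(d) > 0.5 for d in differences) / len(differences)
print("largest gap between group averages:", max(abs(d) for d in differences))
print("share of repeats with a gap above 0.5 g:", share_large)
```

Every repeat gives two slightly different averages even though nothing distinguishes the groups, and the share printed at the end is the kind of probability a p-value describes.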

But that's normal. In fact, the probability of getting a tiny difference is actually quite high even if

there is no real difference between the two groups. That probability is the p-value. It's usually

expressed as a decimal fraction instead of a percentage. Now, if you measured two new groups, the

averages would be different and you'd have a different p-value. Now let's say you

measure two groups and the p-value is small. That means getting that result is pretty unlikely

IF THE NULL HYPOTHESIS WERE TRUE. So that's when you say: maybe in this case the null hypothesis is

not true, so these data are better described by the alternative hypothesis: that the groups really are

different. But the p-value can be anything, right? From low to high, so how do you know when it's

low enough to reject the null hypothesis? That's a great question! Scientists tend to reject the null

hypothesis when the p-value is less than 0.05. To explain why p-values less than 0.05 are so

special, let's play a card game. Have a seat.

Okay… Here's how it works: I'm going to flip over one card at a time. Every time I flip over a black

card, you get a bar of chocolate. Every time I flip over a red card, you owe

me a dollar. Okay… I mean, I like chocolate… Alright, let's play! Okay… a dollar, huh? Alright, here's a dollar. Alright, another dollar. Come on, come on, come on, come on. Oh, come on! Uh uh, no! There's something wrong with your deck.

It's rigged! Your deck is rigged! You're right, it is! When you started the game, you assumed

that the deck I was using was a normal deck. That was your null hypothesis.

The first time you got a red card, that didn't tip you off, because if the deck I'm using is the same as a normal

deck, there's a 50 percent probability that you'll get a red card. That's a pretty high chance, so you

weren't suspicious yet. The second time you got a red card, you still weren't suspicious, because

if the deck I'm using is the same as a normal deck, the probability of getting two red cards

in a row is 25 percent. That's a one in four chance, which isn't that unusual. As I kept flipping over cards,

the probability of getting that many red cards in a row, if the deck I'm using is the same as a

normal deck, kept going down.

You started getting suspicious, but you weren't sure yet that it was a

rigged deck. Finally, when I flipped over the fifth red card, that's when you said, "There's something

wrong with your deck!", because if the deck I'm using is the same as a normal deck, the probability of

getting five red cards in a row is about three percent. And because that's so unlikely, I thought: I don't

think the deck you're using is a normal deck? Yes, that's right, because getting that result if the

null hypothesis is true is so intuitively unlikely that that's when you rejected the null hypothesis.
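
The numbers in the card game can be checked with a few lines of Python. This is a sketch under one stated assumption: each flip is treated as an independent 50/50 chance of red, which is what the 50 percent, 25 percent, and three percent figures imply. Dealing from a real 52-card deck without putting cards back would give slightly smaller values. The function name `streak_probability` is made up for this example.

```python
def streak_probability(n_cards, p_red=0.5):
    """Probability of seeing n_cards red cards in a row,
    ASSUMING each flip is an independent event with chance p_red."""
    return p_red ** n_cards


# Walk through the game one red card at a time.
for n in range(1, 6):
    p = streak_probability(n)
    verdict = "below 0.05" if p < 0.05 else "not below 0.05"
    print(f"{n} red in a row: {p} ({verdict})")
```

Four red cards in a row works out to 0.0625, which is still above 0.05. The fifth red card brings it to 0.03125, about three percent, and that is the first point where the streak drops below the 0.05 line.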

So if the p-value is the probability of getting a result if the null hypothesis is true, then if the

p-value goes below five percent (for p-values we use fractions, so 0.05), we reject the null

hypothesis and conclude the alternative? Yes, that's right! And let's be really clear: it's not like

there's something magical that happens when you pass the 0.05 threshold. 0.049 is really similar

to 0.051.

Because it's just a probability? Exactly! Remember the mice? The mice that received the

fecal transplants from obese mice gained more weight. But was the difference in weight just random, because

of slight differences in the mice they chose, or was it the result of the fecal transplants? The

scientists used a statistical test called a t-test to calculate the p-value, and in this case the

p-value was less than 0.05, so the scientists rejected the null hypothesis and concluded that

the difference between the two groups of mice was real and likely due to the fecal transplants.
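
Here is a rough sketch of how a p-value for two groups can be computed in Python. Two things are assumptions and not what the scientists did: the numbers are invented for illustration (they are NOT the measurements from the 2006 paper), and instead of a t-test this uses a permutation test, a different statistical test that needs only the standard library. The idea is the same: assume the group labels don't matter, shuffle them many times, and count how often chance alone gives a gap at least as big as the one observed.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns the fraction of random relabelings whose gap between
    group means is at least as large as the observed gap."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the labels are meaningless
        gap = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if gap >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles


# Hypothetical weight-gain values in arbitrary units (invented data).
lean_donor = [22, 25, 27, 24, 30, 26, 23, 28]
obese_donor = [41, 47, 38, 52, 44, 49, 36, 45]

p_value = permutation_p_value(lean_donor, obese_donor)
print("p-value:", p_value)
print("reject the null hypothesis at 0.05?", p_value < 0.05)
```

With groups this far apart, almost no shuffle reproduces a gap as large as the observed one, so the p-value comes out well below 0.05. Feeding the function two copies of the same group gives a p-value of 1.0, because every shuffle matches an observed gap of zero.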

I mean, isn't that cool? The bacteria inside a mouse can affect how much weight it gains, which

might be true for humans too.

But don't try a fecal transplant at home. Besides being able to transmit

diseases, poo is a lot harder to work with than p… values! No! No! Such a bad joke….