Consider this example: we don't change the treatment at all, we just update its name. We split users into two groups and run the same treatment on both, but under one of the two names at random. We get a p-value of 0.2 for the claim that the new one is better. Is it reasonable to say there's a >= 80% chance it really was better, knowing that it was literally the same treatment?
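A quick way to see the problem is to simulate that A/A test: under the null the p-value is (roughly) uniform, so p <= 0.2 shows up about 20% of the time even though nothing changed. A minimal sketch, assuming a two-sample t-test on some arbitrary metric (the specific test doesn't matter):

```python
# A/A test: the "new" treatment is literally the old one under a new name.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
pvals = []
for _ in range(10_000):
    a = rng.normal(0.0, 1.0, size=500)  # group A, same treatment
    b = rng.normal(0.0, 1.0, size=500)  # group B, same treatment
    pvals.append(ttest_ind(a, b).pvalue)

# Under the null the p-value is uniform, so ~20% of A/A runs hit p <= 0.2.
print(np.mean(np.array(pvals) <= 0.2))
```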
Parent: "5% chance of looking as good as it did, if it were truly no better than the alternative." This accepts the premise that the product quality is a fact, and only uses probability to describe the (noisy / probabilistic) measurements, i.e. "5% chance of looking as good".
Parent is right to pick up on this, if we're talking about a single product (or, in medicine, if we're talking about a single study evaluating a new treatment). But if we're talking about a workflow for evaluating many products, and we're prepared to consider a probability model that says some products are better than the alternative and others aren't, then the author's version is reasonable.
It’s not reasonable unless the real differences among those “many products” are large enough that they would rarely be missed. That’s quite a strong assumption.
You toss a coin five times and I predict the result correctly each time.
#1 You say that I have precognition powers, because the probability that I don’t is less than 5%
#2 You say that I have precognition powers, because if I didn’t the probability that I would have got the outcomes right is less than 5%
#2 is a bad logical conclusion but it’s based on the right interpretation (while #1 is completely wrong): it’s more likely that I was lucky because precognition is very implausible to start with.
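To put numbers on that: a back-of-the-envelope Bayes calculation, with a made-up prior of one in a million for precognition and the generous assumption that a true precog always calls the coin correctly:

```python
# Posterior probability of precognition after 5 correct calls in a row.
prior_precog = 1e-6            # made-up prior: precognition is very implausible
p_data_given_precog = 1.0      # assume a true precog always predicts correctly
p_data_given_luck = 0.5 ** 5   # 1/32 ~ 0.03, i.e. "less than 5%"

posterior = (p_data_given_precog * prior_precog) / (
    p_data_given_precog * prior_precog
    + p_data_given_luck * (1 - prior_precog)
)
print(posterior)  # ~3e-5: even after passing "p < 0.05", luck is the better bet
```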
The correct statement is P(saw these results | no real effect) < 5%.
Consider two extremes, for the same 5% threshold:
1) All of their ideas for experiments are idiotic. Every single experiment is for something that simply would never work in real life. 5% of those experiments pass the threshold and 0% of them are valid ideas.
2) All of their ideas are brilliant. Every single experiment is for something that is a perfect way to capture user needs and get them to pay more money. 100% of those experiments pass the threshold and 100% of them are valid ideas.
(p-values don't actually tell you how many VALID experiments will fail, so let's just say they all pass.)
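The share of shipped experiments that are genuinely good therefore depends entirely on the base rate of good ideas, which the p-value alone can't tell you. A sketch with illustrative numbers, keeping the parenthetical's assumption that every valid idea passes:

```python
# How many experiments that pass p < 0.05 are actually valid,
# as a function of the base rate of good ideas.
alpha = 0.05   # fraction of worthless ideas that pass anyway
power = 1.0    # assume, per the parenthetical, that all valid ideas pass

for base_rate in (0.0, 0.1, 0.5, 1.0):
    pass_rate = base_rate * power + (1 - base_rate) * alpha
    valid_share = (base_rate * power / pass_rate) if pass_rate else 0.0
    print(f"{base_rate:4.0%} good ideas -> {valid_share:4.0%} of passing experiments are valid")
```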
This is so incredibly common in forensics that it's called the "prosecutor's fallacy."
E.g. imagine your test has a 5% false positive rate for a disease only 1 in 1 million people has. If you test 1 million people you expect ~50,000 false positives and 1 true positive. So the chance that any given positive result is a false positive is 50,000/50,001, not 5/100.
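The same arithmetic in code, using the comment's own numbers (perfect sensitivity assumed, so the one sick person always tests positive):

```python
# Prosecutor's fallacy arithmetic: P(false positive | positive) != false positive rate.
population = 1_000_000
prevalence = 1 / 1_000_000   # 1 in a million actually has the disease
fpr = 0.05                   # the test's false positive rate

true_positives = population * prevalence                 # 1 person
false_positives = (population - true_positives) * fpr    # ~50,000 people
p_false_given_positive = false_positives / (false_positives + true_positives)
print(p_false_given_positive)  # ~0.99998, i.e. 50,000/50,001, nowhere near 5/100
```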
Using a p-value threshold of 0.05 is similar to saying: I'm going to use a test that will flag a truly negative result as positive 5% of the time.
The author said: the chance that a positive result is a false positive == the false positive rate.
Correct: given that the null hypothesis is true, what's the probability of getting this result, or a more extreme one, by chance?
From Bayes' theorem you know that P(A|B) and P(B|A) are two different things.
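Concretely, with the disease-test numbers above: the 5% is P(positive | healthy), while P(healthy | positive) comes out near certainty once Bayes' theorem folds in the base rate:

```python
# Bayes' theorem makes the two conditionals explicit.
p_pos_given_healthy = 0.05        # false positive rate: P(positive | healthy)
p_healthy = 1 - 1 / 1_000_000     # prior: almost everyone is healthy
p_pos = p_pos_given_healthy * p_healthy + 1.0 * (1 / 1_000_000)  # total P(positive)

p_healthy_given_pos = p_pos_given_healthy * p_healthy / p_pos
print(p_healthy_given_pos)  # ~0.99998: P(healthy | positive) != P(positive | healthy)
```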
Parent: "5% chance it could be the same."
edit: Will the downvoter please explain yourself? p-values are tail probabilities, and points have zero measure in continuous random variables.
As a climber I see ego depletion happen all the time. You find a crumbly hold, or get harassed by an insect, or whatever else it is, and you conclude that the next move is the crux. Then other people climb it and nobody agrees with you: that move was one of the easier ones. Anecdata, of course; I just wish we could learn from the bad science and then be washed clean, rather than be haunted by it.
No, it means "I'm willing to ship something that, if it were not better than the alternative, would have had only a 5% chance of looking as good as it did."