Consider this example: we don't change the treatment at all, we just update its name. We split users into two groups and run the same treatment on both, but under one of the two names at random. We get a p-value of 0.2 for the claim that the new one is better. Is it reasonable to say there's a >= 80% chance it really was better, knowing that it was literally the same treatment?
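A quick way to see the problem is to simulate that A/A test: under the null the p-value is (roughly) uniform, so p <= 0.2 shows up about 20% of the time even though nothing changed. A minimal sketch, assuming a two-sample t-test on some arbitrary metric (the specific test doesn't matter):

```python
# A/A test: the "new" treatment is literally the old one under a new name.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
pvals = []
for _ in range(10_000):
    a = rng.normal(0.0, 1.0, size=500)  # group A, same treatment
    b = rng.normal(0.0, 1.0, size=500)  # group B, same treatment
    pvals.append(ttest_ind(a, b).pvalue)

# Under the null the p-value is uniform, so ~20% of A/A runs hit p <= 0.2.
print(np.mean(np.array(pvals) <= 0.2))
```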
Parent: "5% chance of looking as good as it did, if it were truly no better than the alternative." This accepts the premise that the product quality is a fact, and only uses probability to describe the (noisy / probabilistic) measurements, i.e. "5% chance of looking as good".
Parent is right to pick up on this, if we're talking about a single product (or, in medicine, if we're talking about a single study evaluating a new treatment). But if we're talking about a workflow for evaluating many products, and we're prepared to consider a probability model that says some products are better than the alternative and others aren't, then the author's version is reasonable.
It’s not reasonable unless the real differences among those “many products” are large enough that they would rarely be missed. That’s quite a strong assumption.
You toss a coin five times and I predict the result correctly each time.
#1 You say that I have precognition powers, because the probability that I don’t is less than 5%
#2 You say that I have precognition powers, because if I didn’t the probability that I would have got the outcomes right is less than 5%
#2 is a bad logical conclusion but it’s based on the right interpretation (while #1 is completely wrong): it’s more likely that I was lucky because precognition is very implausible to start with.
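To put numbers on that: a back-of-the-envelope Bayes calculation, with a made-up prior of one in a million for precognition and the generous assumption that a true precog always calls the coin correctly:

```python
# Posterior probability of precognition after 5 correct calls in a row.
prior_precog = 1e-6            # made-up prior: precognition is very implausible
p_data_given_precog = 1.0      # assume a true precog always predicts correctly
p_data_given_luck = 0.5 ** 5   # 1/32 ~ 0.03, i.e. "less than 5%"

posterior = (p_data_given_precog * prior_precog) / (
    p_data_given_precog * prior_precog
    + p_data_given_luck * (1 - prior_precog)
)
print(posterior)  # ~3e-5: even after passing "p < 0.05", luck is the better bet
```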
The correct statement is P(saw these results | no real effect) < 5%.
Consider two extremes, for the same 5% threshold:
1) All of their ideas for experiments are idiotic. Every single experiment is for something that simply would never work in real life. 5% of those experiments pass the threshold and 0% of them are valid ideas.
2) All of their ideas are brilliant. Every single experiment is for something that is a perfect way to capture user needs and get them to pay more money. 100% of those experiments pass the threshold and 100% of them are valid ideas.
(p-values don't actually tell you how many VALID experiments will fail, so let's just say they all pass.)
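The share of shipped experiments that are genuinely good therefore depends entirely on the base rate of good ideas, which the p-value alone can't tell you. A sketch with illustrative numbers, keeping the parenthetical's assumption that every valid idea passes:

```python
# How many experiments that pass p < 0.05 are actually valid,
# as a function of the base rate of good ideas.
alpha = 0.05   # fraction of worthless ideas that pass anyway
power = 1.0    # assume, per the parenthetical, that all valid ideas pass

for base_rate in (0.0, 0.1, 0.5, 1.0):
    pass_rate = base_rate * power + (1 - base_rate) * alpha
    valid_share = (base_rate * power / pass_rate) if pass_rate else 0.0
    print(f"{base_rate:4.0%} good ideas -> {valid_share:4.0%} of passing experiments are valid")
```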
This is so incredibly common in forensics that it's called the "prosecutor's fallacy."
E.g. imagine your test has a 5% false positive rate for a disease only 1 in 1 million people has. If you test 1 million people you expect ~50,000 false positives and 1 true positive. So the chance that any given positive result is a false positive is 50,000/50,001, not 5/100.
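The same arithmetic in code, using the comment's own numbers (perfect sensitivity assumed, so the one sick person always tests positive):

```python
# Prosecutor's fallacy arithmetic: P(false positive | positive) != false positive rate.
population = 1_000_000
prevalence = 1 / 1_000_000   # 1 in a million actually has the disease
fpr = 0.05                   # the test's false positive rate

true_positives = population * prevalence                 # 1 person
false_positives = (population - true_positives) * fpr    # ~50,000 people
p_false_given_positive = false_positives / (false_positives + true_positives)
print(p_false_given_positive)  # ~0.99998, i.e. 50,000/50,001, nowhere near 5/100
```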
Using a p-value threshold of 0.05 is similar to saying: I'm going to use a test that will flag a truly negative result as positive 5% of the time.
The author said: the chance that a positive result is a false positive == the false positive rate.
Correct: given that the null hypothesis is true, what's the probability of getting this result, or a more extreme one, by chance?
From Bayes' theorem you know that P(A|B) and P(B|A) are two different things.
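Concretely, with the disease-test numbers above: the 5% is P(positive | healthy), while P(healthy | positive) comes out near certainty once Bayes' theorem folds in the base rate:

```python
# Bayes' theorem makes the two conditionals explicit.
p_pos_given_healthy = 0.05        # false positive rate: P(positive | healthy)
p_healthy = 1 - 1 / 1_000_000     # prior: almost everyone is healthy
p_pos = p_pos_given_healthy * p_healthy + 1.0 * (1 / 1_000_000)  # total P(positive)

p_healthy_given_pos = p_pos_given_healthy * p_healthy / p_pos
print(p_healthy_given_pos)  # ~0.99998: P(healthy | positive) != P(positive | healthy)
```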
Parent: "5% chance it could be the same."
edit: Will the downvoter please explain yourself? p-values are tail probabilities, and points have zero measure in continuous random variables.
As a climber I see ego depletion happen all the time. You find a crumbly hold, or get harassed by an insect, or whatever else it is, and you conclude that the next move is the crux. Then other people climb it and nobody agrees with you: that move was one of the easier ones. Anecdata, of course; I just wish we could learn from the bad science and then be washed clean, rather than be haunted by it.
No, it means "I'm willing to ship something that, if it were not better than the alternative, would have had only a 5% chance of looking as good as it did."