I could write pages on this (I’ve certainly spoken for hours) but the adoption of a scientific research mindset is very limiting for A/B testing. You don’t need all the status quo bias of null hypothesis testing.
At the same time, it’s quite impressive how people are able to adapt. An organization experienced with A/B testing will start doing things like multivariate correction in their heads.
For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there whether you realize it or not. (People will end up weighing p-values against prior evidence anyway.)
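To make that concrete, here is roughly what the Bayesian read-out can look like with a simple Beta-Binomial model. The counts and the uniform prior are made up; the point is that you report P(variant beats control) and a credible interval instead of a p-value.

```python
# Minimal Bayesian A/B sketch: Beta-Binomial posteriors, Monte Carlo draws.
# All counts and the Beta(1, 1) prior are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: (conversions, visitors) for control (A) and variant (B).
conv_a, n_a = 120, 2400
conv_b, n_b = 145, 2380

# Posterior under a uniform prior is Beta(1 + conversions, 1 + non-conversions).
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

prob_b_better = (post_b > post_a).mean()
lift = (post_b - post_a) / post_a

print(f"P(B > A)             = {prob_b_better:.3f}")
print(f"Median relative lift = {np.median(lift):.2%}")
print(f"95% credible interval for lift: "
      f"[{np.percentile(lift, 2.5):.2%}, {np.percentile(lift, 97.5):.2%}]")
```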
0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in replication crisis) use much stricter values by default.
Adjust the confidence required to the cost of the change and the risk of harm. If you’re at the point of testing, the cost of the change may be near zero (e.g., a content tweak), it may be really high, or it may even be net negative!
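One way to operationalize "adjust the confidence to the cost" is a posterior expected-loss rule: ship when the conversion rate you expect to give up (in the worlds where the control is actually better) drops below a tolerance you pick from the cost of the change. A rough sketch with made-up numbers and a Beta-Binomial model:

```python
# Cost-aware stopping rule sketch: ship when the posterior expected loss of
# shipping the variant falls below a tolerance tied to the cost of the change.
# The counts, the prior, and the tolerance are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
conv_a, n_a = 120, 2400
conv_b, n_b = 145, 2380

post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

# Expected loss of shipping B: how much conversion rate we expect to give up
# in the scenarios where A is actually better.
expected_loss_ship_b = np.mean(np.maximum(post_a - post_b, 0.0))

# Cheap, easily reverted change -> loose tolerance; risky change -> tight one.
tolerance = 0.0005  # 0.05 percentage points of conversion, purely illustrative

print(f"Expected loss of shipping B: {expected_loss_ship_b:.5f}")
print("Ship B" if expected_loss_ship_b < tolerance else "Keep collecting data")
```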
But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05, anyway. This is easy to say, but don’t waste your time coming up with methods to squeeze out more signal. Just (just lol) make better changes to your product so that the methods don’t matter. If p=0.00001, that’s going to be a better signal than p=0.05 with every correction in this article.
If you’re going to pick any fanciness from the start (besides Bayes), make it anytime-valid methods. You’re certainly already going to be peeking (as you should), so have your analysis reflect that.
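For the curious, here is a sketch of one anytime-valid flavor: a mixture sequential probability ratio test (mSPRT) on the difference in conversion rates under a normal approximation. The mixing variance and the simulated data are assumptions for illustration; the property that matters is that you can peek after every batch of users without inflating the false positive rate.

```python
# mSPRT sketch: reject H0 (no difference) whenever the mixture likelihood
# ratio exceeds 1/alpha. Valid no matter how often you peek.
import numpy as np

def msprt_reject(x_a, x_b, alpha=0.05, tau2=1e-4):
    """Return True if H0 (no difference in rates) is rejected on the data so far.

    x_a, x_b: 0/1 conversion arrays for control and variant (equal length here).
    tau2: variance of the normal mixing distribution over the effect (assumed).
    """
    n = min(len(x_a), len(x_b))
    if n < 100:                      # too little data for the normal approximation
        return False
    diff = x_b[:n].mean() - x_a[:n].mean()
    p_pool = (x_a[:n].sum() + x_b[:n].sum()) / (2 * n)
    sigma2 = 2 * p_pool * (1 - p_pool)      # variance of one paired difference under H0
    # Mixture likelihood ratio Lambda_n; reject when it reaches 1/alpha.
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        n**2 * tau2 * diff**2 / (2 * sigma2 * (sigma2 + n * tau2))
    )
    return lam >= 1 / alpha

# Illustrative simulation: the variant truly converts a bit better.
rng = np.random.default_rng(1)
x_a = rng.binomial(1, 0.050, 50_000)
x_b = rng.binomial(1, 0.056, 50_000)

for n in range(1_000, 50_001, 1_000):        # "peek" after every 1,000 users per arm
    if msprt_reject(x_a[:n], x_b[:n]):
        print(f"Rejected H0 after {n} users per arm")
        break
else:
    print("Never rejected H0")
```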
You don't have to make the status quo the null hypothesis. If you make a change, you probably already think that your change is better or at least neutral, so make that the null. If you get a strong signal that your change is actually worse, rejecting the null, revert the change.
Not "only keep changes that are clearly good" but "don't keep changes that are clearly bad."
Having few users means that getting to stat sig will take longer (if it happens at all).
Sometimes you just need to trust your design/product sense, assert that some change you’re making is better, and push it without an experiment. Too often, people use experimentation for CYA reasons, so they can never be blamed for making a misstep.
The company has a large user base; it’s just that SaaS doesn’t have the same conversion numbers as, say, e-commerce.
Completely agree on the Bayesian point though, and the importance of defining the loss function. Getting people used to talking about the strength of the evidence rather than statistical significance is a massive win most of the time.
That’s in line with what I was saying so I’m not sure where I missed the point.
A p-value is a function of effect size, variance, and sample size. Bigger wins would be those that have a larger and more consistent effect, scaled to the number of users (or just get more users).
This was the part I was quibbling with. The size of the p-value is pretty much irrelevant unless you know how much data you are collecting. The p-values might always hover around 0.05 if you know the effects are likely large and you powered the study appropriately.
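To illustrate the point, the same two-proportion z-test yields very different p-values as you vary effect size and sample size (all numbers made up):

```python
# How the p-value depends on effect size and sample size, holding the test fixed.
from statsmodels.stats.proportion import proportions_ztest

scenarios = [
    ("small effect, small n", 0.050, 0.053, 2_000),
    ("small effect, large n", 0.050, 0.053, 200_000),
    ("large effect, small n", 0.050, 0.070, 2_000),
]

for label, p_a, p_b, n in scenarios:
    # Use the expected counts for each scenario rather than simulating noise.
    counts = [round(p_b * n), round(p_a * n)]
    z, p = proportions_ztest(count=counts, nobs=[n, n])
    print(f"{label:<24} z={z:5.2f}  p={p:.4g}")
```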
The degree of care can be different in less critical contexts, but then you shouldn’t lie to yourself about how much you care.
But keep a percentage of traffic in A/B/n testing as well.
This allows you to balance speed vs. certainty.
This is especially useful for something where the value of the choice is front-loaded, like headlines.
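One way to implement "keep a percentage in the test" is an epsilon-greedy allocation over headline variants; the click-through rates and the 10% exploration share below are illustrative assumptions, not a prescription:

```python
# Epsilon-greedy sketch: send most traffic to the current best headline,
# keep a fixed slice exploring the alternatives.
import numpy as np

rng = np.random.default_rng(2)
true_ctr = [0.040, 0.048, 0.052]           # unknown in practice; used only to simulate
clicks = np.zeros(3)
views = np.zeros(3)
EPSILON = 0.10                             # fraction of traffic still exploring

for _ in range(100_000):
    if rng.random() < EPSILON or views.min() == 0:
        arm = rng.integers(3)                 # explore: uniform over headlines
    else:
        arm = int(np.argmax(clicks / views))  # exploit: current best estimate
    views[arm] += 1
    clicks[arm] += rng.random() < true_ctr[arm]

print("views per headline :", views.astype(int))
print("estimated CTRs     :", np.round(clicks / views, 4))
```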
Sometimes someone just has to make imperfect decisions based on incomplete information, or make arbitrary judgment calls. And that’s totally fine… But it shouldn’t be confused with data-driven decisions.
The two kinds of decisions need to happen. They can both happen honestly.
I'm basically making the case that "Your startup deserves the same rigor [as medical testing]" is a pretty bold assertion, and that the reality is that most of us can get away with much less rigor and still get ahead in terms of improving our outcomes.
In other words, it's still A/B testing if your p-value is 0.10 instead of 0.05. There's nothing magical about the 0.05 number. Most startups could probably get away with a 20% chance of being wrong on any particular test and still come out ahead. (Note: this assumes that the thing you're testing is good science -- one thing we aren't talking about is how many tests actually change many variables at once, and maybe that's not great!)
Make your expectations explicit instead of implicit. 0.05 is completely arbitrary. If you are comfortable with a 50/50 chance of being right, make your threshold less rigorous.
It's not about space rocket type of rigor, but it's about a higher bar than the current state.
(Besides, Elon's rockets are failing left and right, in contrast to what NASA achieved in the 60s, so there are some lessons there too.)
I don't disagree with your statement, I just think you are addressing a different problem from A/B testing and statistical significance.
If that’s the difference between success and failure then that is pretty important to you as a business owner.
> do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive
That’s a reasonable, and in plenty of contexts the absolute best, approach to take. But don’t call it A/B testing, because it’s not.
But in my post, I specifically called out a line in OP's article that I disagreed with: (paraphrasing) "Your startup deserves the same rigor as medical testing."
To clarify -- and to support your point -- we're shipping software, not irreversible medical procedures. If you get it wrong, you sell fewer widgets /temporarily/ and you revert back to a known better solution. With medicine, there aren't necessarily take-backsies -- but there absolutely are in software. Reverting deploys is something all of us do quite regularly!
Is it A/B testing? Maybe, maybe not. I'm not a data scientist. But I think saying that your startup deserves the same rigor as a medical test is misleading at best and harmful at worst.
I just think companies should be more okay with educated risks, rather than waiting days, weeks, months for statistical significance on a feature that has little chance of actually having a negative impact. As you said elsewhere in the thread, for startups, stasis is death.
(BTW, I've read a lot of your other comments in the thread. I think we're pretty well aligned!)
I think we can realize another reason to just ship it. Startups need to be always moving. You need to keep turning the wheel to help keep everyone busy and keep them from fretting about your slow growth or high churn metrics. Startups need lots of fighting spirit. So it's still probably better to ship it rather than admit defeat and suffer bad vibes.
The reason we have this stuff in medicine is because it is genuinely important, and because a treatment often has bad side effects: it's worse to give someone a bad treatment than to give them nothing, which is the point of the Hippocratic oath. You don't need this for your dumb B2C app.
Most of us don't, indeed. So, still aligned with your perspective: it's good to take into consideration what we are currently working on and what the possible implications will be. Sometimes the line is not so obvious, though. If we design a library or framework that isn't tied to some inconsequential outcome, it's no longer obvious which policy makes the most sense.
The author didn't go into why companies do this (ignoring or misreading test results). Putting lack of understanding aside, my anecdotal experience from the time I worked as a data scientist boils down to a few major reasons:
- Wanting to be right. Being a founder requires high self-confidence, that feeling of "I know I'm right". But feeling right doesn't make one right, and there's plenty of evidence around that people will ignore evidence against their beliefs, even rationalize the denial (and yes, the irony of that statement is not lost on me);
- Pressure to show work: doing the umpteenth UI redesign is better than just saying "it's irrelevant" in your performance evaluation. If the result is inconclusive, the harm is smaller than not having anything to show - you are stalling the conclusion that your work is irrelevant by doing whatever. So you keep on pushing them and reframing the results into some BS interpretation just to get some more time.
Another thing that is not discussed enough is what all these inconclusive results would mean if properly interpreted. A long sequence of inconclusive UI redesign experiments should trigger a hypothesis like "does the UI even matter?" But again, those are existentially threatening questions for the people in the best position to come up with them. If any company out there were serious about being data-driven and scientific, they'd require tests everywhere, have external controls on the quality and rigour of those tests, and use them to make strategic decisions on where they invest and divest. At the very least, they'd take them as a serious part of their strategy input.
I'm not saying you can do everything based on tests, nor that you should - there are bets on the future, hypothesis-making about new scenarios, and things that are just too costly, or ethically or physically impossible, to test. But consistently testing and analysing test results could save a lot of work and money.
True, but it usually costs money to fix it. I think the themes of "this only matters if lives are on the line" or "it's too rigorous" are straw men.
We have limited resources -- time, money, people. We'd like to avoid deploying those resources badly. Statistical inference can be one way to give us more information so we avoid using our resources badly, but as you note, statistical inference also has costs: we have to spend resources to get the data we need to do the inference, plus other costs. We can estimate the costs of getting sufficient data using sample size estimation methods. For go/no-go decision-making, if the cost of getting the decision wrong isn't something like at least 10x the cost of doing the statistical inference, I don't think it's worth doing the inference. It may be worth doing the inference for _other_ reasons, but those reasons are out of scope.
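A back-of-the-envelope version of that comparison, using standard sample-size estimation; the traffic, costs, and the 10x rule of thumb are all assumed numbers for illustration:

```python
# Estimate the sample size a test needs, convert it to calendar time and cost,
# and compare against the cost of getting the decision wrong.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, target = 0.050, 0.055            # detect a 10% relative lift (assumed)
effect = proportion_effectsize(target, baseline)
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.8, alternative="two-sided")

visitors_per_day = 3_000                   # hypothetical traffic, split across 2 arms
days = 2 * n_per_arm / visitors_per_day

cost_per_day = 500                         # engineering + opportunity cost, made up
cost_of_inference = days * cost_per_day
cost_of_wrong_call = 20_000                # estimated downside of a bad decision, made up

print(f"~{n_per_arm:,.0f} users per arm, ~{days:.0f} days of traffic")
print(f"cost of running the test ~${cost_of_inference:,.0f} "
      f"vs. wrong-call cost ~${cost_of_wrong_call:,.0f}")
print("Worth testing" if cost_of_wrong_call >= 10 * cost_of_inference
      else "Probably just decide and move on")
```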
As an example, a common use of statistical inference in medical research is to compare the efficacy of a treatment with a placebo. Some of the motivation is to decide whether to invest more resources in developing the treatment, not because people will die if they get a false positive stating that the treatment is effective when it isn't.
> A lot of companies are, arguably, _too rigorous_ when it comes to testing.
My experience in industry has been the opposite. Companies like the idea of data-driven decision-making, but then they discover pain points. They should have some idea of how much of a change they're looking to detect (i.e., an effect size). They should estimate how much data they're likely to need to run their tests (i.e., sample size estimation). They have to consider other issues like model misfit, calibration, multiple-testing corrections, and so on. Then they also have to rig up the infra to be able to _do_ the testing, collect the data, analyze the results, and communicate the results to their internal stakeholders. These pain points are why companies like Eppo and StatSig exist -- A/B testing ends up being more high-touch than developers expect.
Messing up any one of these issues can yield "flaky tests," which developers hate. Failing to gather a sufficiently large sample size for a given effect size is a pretty common failure mode.
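For completeness, here is what a multiple-testing correction looks like in practice if you read several metrics off one experiment (or run many experiments); the p-values are made up, and Benjamini-Hochberg is just one common choice:

```python
# Correct a family of p-values before declaring wins.
from statsmodels.stats.multitest import multipletests

p_values = [0.002, 0.021, 0.049, 0.031, 0.18, 0.44]   # e.g. six metrics or tests

# 'fdr_bh' is the Benjamini-Hochberg false discovery rate procedure.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for raw, adj, win in zip(p_values, p_adjusted, reject):
    print(f"raw p={raw:.3f}  adjusted p={adj:.3f}  significant: {win}")
```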
> But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.
It's difficult to tell precisely what you mean by "maintain rigor" here. The only context I can gather is that whatever procedure you were using needed more data in order to satisfy the preconditions behind its nominal design criteria -- usually, its nominal false positive rate. I don't think this is an issue of rigor -- it's an issue of statistical modeling and correctness.
Sometimes, it's possible to use different methods that may require less data at the cost of more (or different) modeling assumptions. Failing to satisfy the assumptions of a test can increase its false positive rate. Whether that matters is really up to you.
> I do like their proposal for "peeking" and subsequent testing.
What the post is suggesting is not a proposal, but a standard class of frequentist statistical inference methods called sequential testing. Daniël Lakens has a good online textbook (https://lakens.github.io/statistical_inferences/) that briefly discusses these methods in Chapter 10 and provides further references.
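The simplest flavor looks something like this: plan a fixed number of equally spaced looks and compare each interim p-value to a stricter, Pocock-style boundary instead of 0.05. The boundary below is the commonly tabulated value for three looks at an overall two-sided alpha of 0.05 (about 0.0221); in practice you would compute it (e.g., with the rpact or gsDesign R packages) rather than hard-code it. The counts are made up.

```python
# Group-sequential sketch: three planned looks, constant Pocock-style boundary.
from statsmodels.stats.proportion import proportions_ztest

POCOCK_ALPHA_3_LOOKS = 0.0221   # per-look threshold, ~5% overall false positive rate

# Hypothetical cumulative counts (variant, control) at each planned look:
# (conversions, visitors) per arm.
looks = [
    ((205, 4_000), (175, 4_000)),
    ((425, 8_000), (355, 8_000)),
    ((640, 12_000), (535, 12_000)),
]

for i, ((c_b, n_b), (c_a, n_a)) in enumerate(looks, start=1):
    z, p = proportions_ztest(count=[c_b, c_a], nobs=[n_b, n_a])
    if p < POCOCK_ALPHA_3_LOOKS:
        print(f"look {i}: p={p:.4f} -> stop, significant at the Pocock boundary")
        break
    print(f"look {i}: p={p:.4f} -> not below {POCOCK_ALPHA_3_LOOKS}, keep going")
```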
> We're shipping software. We can change things if we get them wrong.
That's usually true -- as long as you have the resources needed to make those changes, and are willing to spend them that way.
> IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals.
While I don't disagree with the sentiment, I think you're conflating rigor with correctness here.
> If its goals are "stat sig on every test", then sure, treat it like someone might die if you're wrong.
I think that's a false equivalence. Even the American Statistical Association has issued a statement on p-values (see https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf) that includes "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."
> But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.
If those are your goals, just ship it; I don't think it makes sense to justify the effort to test in this situation, especially if, as you argue, it's financially feasible to roll back the change or pivot if it doesn't work.
But does it, really? A lot of companies sell... well, let's say "not important" stuff. Most companies don't cost peoples' lives when you get it wrong. If you A/B test user signups for a startup that sells widgets, people aren't living or dying based on the results. The consequences of getting it wrong are... you sell fewer widgets?
While I understand the overall point of the post -- and agree with it! -- I do take issue with this particular point. A lot of companies are, arguably, _too rigorous_ when it comes to testing.
At my last company, we spent 6 weeks waiting for stat sig. But within 48 hours, we had a positive signal. Conversion was up! Not statistically significant, but trending in the direction we wanted. But to "maintain rigor," we waited 6 weeks before turning it on... and the final numbers were virtually the same as the 48-hour numbers.
Note: I'm not advocating stopping tests as soon as something is trending in the right direction. The third scenario in the post points this out as a flaw! I do like their proposal for "peeking" and subsequent testing.
But, really, let's just be realistic about what level of "rigor" is required to make decisions. We aren't shooting rockets into space. We're shipping software. We can change things if we get them wrong. It's okay. The world won't end.
IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals. If its goals are "stat sig on every test," then sure, treat it like someone might die if you're wrong. (I would argue that you have the wrong goals, in this case, but I digress...)
But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.