I could write pages on this (I’ve certainly spoken for hours) but the adoption of a scientific research mindset is very limiting for A/B testing. You don’t need all the status quo bias of null hypothesis testing.
At the same time, it’s quite impressive how people are able to adapt. An organization experienced with A/B testing will start doing things like multivariate correction in their heads.
For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there whether you realize it or not. (People will end up weighing p-values against prior evidence anyway.)
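To make that concrete, here is roughly what the Bayesian read-out can look like with a simple Beta-Binomial model. The counts and the uniform prior are made up; the point is that you report P(variant beats control) and a credible interval instead of a p-value.

```python
# Minimal Bayesian A/B sketch: Beta-Binomial posteriors, Monte Carlo draws.
# All counts and the Beta(1, 1) prior are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: (conversions, visitors) for control (A) and variant (B).
conv_a, n_a = 120, 2400
conv_b, n_b = 145, 2380

# Posterior under a uniform prior is Beta(1 + conversions, 1 + non-conversions).
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

prob_b_better = (post_b > post_a).mean()
lift = (post_b - post_a) / post_a

print(f"P(B > A)             = {prob_b_better:.3f}")
print(f"Median relative lift = {np.median(lift):.2%}")
print(f"95% credible interval for lift: "
      f"[{np.percentile(lift, 2.5):.2%}, {np.percentile(lift, 97.5):.2%}]")
```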
0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in replication crisis) use much stricter values by default.
Adjust the confidence required to the cost of the change and the risk of harm. If you’re at the point of testing, the cost of the change may be near zero (e.g., a content tweak), it may be really high, or it may even be net negative!
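One way to operationalize "adjust the confidence to the cost" is a posterior expected-loss rule: ship when the conversion rate you expect to give up (in the worlds where the control is actually better) drops below a tolerance you pick from the cost of the change. A rough sketch with made-up numbers and a Beta-Binomial model:

```python
# Cost-aware stopping rule sketch: ship when the posterior expected loss of
# shipping the variant falls below a tolerance tied to the cost of the change.
# The counts, the prior, and the tolerance are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
conv_a, n_a = 120, 2400
conv_b, n_b = 145, 2380

post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

# Expected loss of shipping B: how much conversion rate we expect to give up
# in the scenarios where A is actually better.
expected_loss_ship_b = np.mean(np.maximum(post_a - post_b, 0.0))

# Cheap, easily reverted change -> loose tolerance; risky change -> tight one.
tolerance = 0.0005  # 0.05 percentage points of conversion, purely illustrative

print(f"Expected loss of shipping B: {expected_loss_ship_b:.5f}")
print("Ship B" if expected_loss_ship_b < tolerance else "Keep collecting data")
```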
But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05, anyway. This is easy to say, but don’t waste your time coming up with methods to squeeze out more signal. Just (just lol) make better changes to your product so that the methods don’t matter. If p=0.00001, that’s going to be a better signal than p=0.05 with every correction in this article.
If you’re going to pick any fanciness from the start (besides Bayes), make it anytime-valid methods. You’re certainly already going to be peeking (as you should), so have your analysis reflect that.
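For the curious, here is a sketch of one anytime-valid flavor: a mixture sequential probability ratio test (mSPRT) on the difference in conversion rates under a normal approximation. The mixing variance and the simulated data are assumptions for illustration; the property that matters is that you can peek after every batch of users without inflating the false positive rate.

```python
# mSPRT sketch: reject H0 (no difference) whenever the mixture likelihood
# ratio exceeds 1/alpha. Valid no matter how often you peek.
import numpy as np

def msprt_reject(x_a, x_b, alpha=0.05, tau2=1e-4):
    """Return True if H0 (no difference in rates) is rejected on the data so far.

    x_a, x_b: 0/1 conversion arrays for control and variant (equal length here).
    tau2: variance of the normal mixing distribution over the effect (assumed).
    """
    n = min(len(x_a), len(x_b))
    if n < 100:                      # too little data for the normal approximation
        return False
    diff = x_b[:n].mean() - x_a[:n].mean()
    p_pool = (x_a[:n].sum() + x_b[:n].sum()) / (2 * n)
    sigma2 = 2 * p_pool * (1 - p_pool)      # variance of one paired difference under H0
    # Mixture likelihood ratio Lambda_n; reject when it reaches 1/alpha.
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        n**2 * tau2 * diff**2 / (2 * sigma2 * (sigma2 + n * tau2))
    )
    return lam >= 1 / alpha

# Illustrative simulation: the variant truly converts a bit better.
rng = np.random.default_rng(1)
x_a = rng.binomial(1, 0.050, 50_000)
x_b = rng.binomial(1, 0.056, 50_000)

for n in range(1_000, 50_001, 1_000):        # "peek" after every 1,000 users per arm
    if msprt_reject(x_a[:n], x_b[:n]):
        print(f"Rejected H0 after {n} users per arm")
        break
else:
    print("Never rejected H0")
```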
You don't have to make the status quo the null hypothesis. If you make a change, you probably already think that your change is better or at least neutral, so make that the null. If you get a strong signal that your change is actually worse, rejecting the null, revert the change.
Not "only keep changes that are clearly good" but "don't keep changes that are clearly bad."
Having few users means that getting to stat sig will take longer (if it happens at all).
Sometimes you just need to trust your design/product sense, assert that some change you’re making is better, and push it without an experiment. Too often, people use experimentation for CYA reasons, so they can never be blamed for making a misstep.
The company has a large user base; it’s just that SaaS doesn’t have the same conversion numbers as, say, e-commerce.
Completely agree on the Bayesian point though, and the importance of defining the loss function. Getting people used to talking about the strength of the evidence rather than statistical significance is a massive win most of the time.
That’s in line with what I was saying so I’m not sure where I missed the point.
A p-value is a function of effect size, variance, and sample size. Bigger wins would be those that have a larger and more consistent effect, scaled to the number of users (or just get more users).
This was the part I was quibbling with. The size of the p-value is pretty much irrelevant unless you know how much data you are collecting. The p-values might always hover around 0.05 if you know the effects are likely large and you powered the study appropriately.
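To illustrate the point, the same two-proportion z-test yields very different p-values as you vary effect size and sample size (all numbers made up):

```python
# How the p-value depends on effect size and sample size, holding the test fixed.
from statsmodels.stats.proportion import proportions_ztest

scenarios = [
    ("small effect, small n", 0.050, 0.053, 2_000),
    ("small effect, large n", 0.050, 0.053, 200_000),
    ("large effect, small n", 0.050, 0.070, 2_000),
]

for label, p_a, p_b, n in scenarios:
    # Use the expected counts for each scenario rather than simulating noise.
    counts = [round(p_b * n), round(p_a * n)]
    z, p = proportions_ztest(count=counts, nobs=[n, n])
    print(f"{label:<24} z={z:5.2f}  p={p:.4g}")
```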
The degree of care can be different in less critical contexts, but then you shouldn’t lie to yourself about how much you care.
But keep a percentage of traffic in A/B/n testing as well.
This allows you to balance speed vs. certainty.
This is especially useful for something where the value of the choice is front-loaded, like headlines.
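One way to implement "keep a percentage in the test" is an epsilon-greedy allocation over headline variants; the click-through rates and the 10% exploration share below are illustrative assumptions, not a prescription:

```python
# Epsilon-greedy sketch: send most traffic to the current best headline,
# keep a fixed slice exploring the alternatives.
import numpy as np

rng = np.random.default_rng(2)
true_ctr = [0.040, 0.048, 0.052]           # unknown in practice; used only to simulate
clicks = np.zeros(3)
views = np.zeros(3)
EPSILON = 0.10                             # fraction of traffic still exploring

for _ in range(100_000):
    if rng.random() < EPSILON or views.min() == 0:
        arm = rng.integers(3)                 # explore: uniform over headlines
    else:
        arm = int(np.argmax(clicks / views))  # exploit: current best estimate
    views[arm] += 1
    clicks[arm] += rng.random() < true_ctr[arm]

print("views per headline :", views.astype(int))
print("estimated CTRs     :", np.round(clicks / views, 4))
```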
Sometimes someone just has to make imperfect decisions based on incomplete information, or make arbitrary judgment calls. And that’s totally fine… But it shouldn’t be confused with data-driven decisions.
The two kinds of decisions need to happen. They can both happen honestly.
I'm basically making the case that "Your startup deserves the same rigor [as medical testing]" is a pretty bold assertion, and that the reality is that most of us can get away with much less rigor and still get ahead in terms of improving our outcomes.
In other words, it's still A/B testing if your p-value is 0.10 instead of 0.05. There's nothing magical about the 0.05 number. Most startups could probably get away with a 20% chance of being wrong on any particular test and still come out ahead. (Note: this assumes that the thing you're testing is good science -- one thing we aren't talking about is how many tests actually change many variables at once, and maybe that's not great!)
Make your expectations explicit instead of implicit. 0.05 is completely arbitrary. If you are comfortable with a 50/50 chance of being right, make your threshold less rigorous.
It's not about space rocket type of rigor, but it's about a higher bar than the current state.
(Besides, Elon's rockets are failing left and right, in contrast to what NASA achieved in the 60s, so there are some lessons there too.)
I don't disagree with your statement, I just think you are addressing a different problem from A/B testing and statistical significance.
If that’s the difference between success and failure then that is pretty important to you as a business owner.
> do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive
That’s a reasonable, and in plenty of contexts the absolute best, approach to take. But don’t call it A/B testing, because it’s not.
But in my post, I specifically called out a line in OP's article that I disagreed with: (paraphrasing) "Your startup deserves the same rigor as medical testing."
To clarify -- and to support your point -- we're shipping software, not irreversible medical procedures. If you get it wrong, you sell fewer widgets /temporarily/ and you revert back to a known better solution. With medicine, there aren't necessarily take-backsies -- but there absolutely are in software. Reverting deploys is something all of us do quite regularly!
Is it A/B testing? Maybe, maybe not. I'm not a data scientist. But I think saying that your startup deserves the same rigor as a medical test is misleading at best and harmful at worst.
I just think companies should be more okay with educated risks, rather than waiting days, weeks, months for statistical significance on a feature that has little chance of actually having a negative impact. As you said elsewhere in the thread, for startups, stasis is death.
(BTW, I've read a lot of your other comments in the thread. I think we're pretty well aligned!)
I think we can realize another reason to just ship it. Startups need to be always moving. You need to keep turning the wheel to help keep everyone busy and keep them from fretting about your slow growth or high churn metrics. Startups need lots of fighting spirit. So it's still probably better to ship it rather than admit defeat and suffer bad vibes.
The reason we have this stuff in medicine is because it is genuinely important, and because a treatment often has bad side effects: it's worse to give someone a bad treatment than to give them nothing, which is the point of the Hippocratic oath. You don't need this for your dumb B2C app.
Most of us don't, indeed. So, still aligned with your perspective: it's good to take into consideration what we are currently working on and what the possible implications will be. Sometimes the line is not so obvious, though. If we design a library or framework that isn't tied to some inconsequential outcome, it's no longer obvious which policy makes the most sense.
The author didn't go into why companies do this (ignoring or misreading test results). Putting lack of understanding aside, my anecdotal experience from the time I worked as a data scientist boils down to a few major reasons:
- Wanting to be right. Being a founder requires high self-confidence, that feeling of "I know I'm right". But feeling right doesn't make one right, and there's plenty of evidence around that people will ignore evidence against their beliefs, even rationalize the denial (and yes, the irony of that statement is not lost on me);
- Pressure to show work: doing the umpteenth UI redesign is better than just saying "it's irrelevant" in your performance evaluation. If the result is inconclusive, the harm is smaller than not having anything to show - you are stalling the conclusion that your work is irrelevant by doing whatever. So you keep on pushing them and reframing the results into some BS interpretation just to get some more time.
Another thing that is not discussed enough is what all these inconclusive results would mean if properly interpreted. A long sequence of inconclusive UI redesign experiments should trigger a hypothesis like "does the UI even matter?" But again, those are existentially threatening questions for the people in the best position to come up with them. If any company out there were serious about being data-driven and scientific, they'd require tests everywhere, have external controls on the quality and rigour of those tests, and use them to make strategic decisions on where they invest and divest. At the very least, they'd take them as a serious part of their strategy input.
I'm not saying you can do everything based on tests, nor that you should - there are bets on the future, hypothesis-making about new scenarios, and things that are just too costly, or ethically or physically impossible, to test. But consistently testing and analysing test results could save a lot of work and money.
True, but it usually costs money to fix it. I think the themes of "this only matters if lives are on the line" or "it's too rigorous" are straw men.
We have limited resources -- time, money, people. We'd like to avoid deploying those resources badly. Statistical inference can be one way to give us more information so we avoid using our resources badly, but as you note, statistical inference also has costs: we have to spend resources to get the data we need to do the inference, plus other costs. We can estimate the costs of getting sufficient data using sample size estimation methods. For go/no-go decision-making, if the cost of getting the decision wrong isn't something like at least 10x the cost of doing the statistical inference, I don't think it's worth doing the inference. It may be worth doing the inference for _other_ reasons, but those reasons are out of scope.
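A back-of-the-envelope version of that comparison, using standard sample-size estimation; the traffic, costs, and the 10x rule of thumb are all assumed numbers for illustration:

```python
# Estimate the sample size a test needs, convert it to calendar time and cost,
# and compare against the cost of getting the decision wrong.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, target = 0.050, 0.055            # detect a 10% relative lift (assumed)
effect = proportion_effectsize(target, baseline)
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.8, alternative="two-sided")

visitors_per_day = 3_000                   # hypothetical traffic, split across 2 arms
days = 2 * n_per_arm / visitors_per_day

cost_per_day = 500                         # engineering + opportunity cost, made up
cost_of_inference = days * cost_per_day
cost_of_wrong_call = 20_000                # estimated downside of a bad decision, made up

print(f"~{n_per_arm:,.0f} users per arm, ~{days:.0f} days of traffic")
print(f"cost of running the test ~${cost_of_inference:,.0f} "
      f"vs. wrong-call cost ~${cost_of_wrong_call:,.0f}")
print("Worth testing" if cost_of_wrong_call >= 10 * cost_of_inference
      else "Probably just decide and move on")
```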
As an example, a common use of statistical inference in medical research is to compare the efficacy of a treatment with a placebo. Some of the motivation is to decide whether to invest more resources in developing the treatment, not because people will die if they get a false positive stating that the treatment is effective when it isn't.
> A lot of companies are, arguably, _too rigorous_ when it comes to testing.
My experience in industry has been the opposite. Companies like the idea of data-driven decision-making, but then they discover pain points. They should have some idea of how much of a change they're looking to detect (i.e., an effect size). They should estimate how much data they're likely to need to run their tests (i.e., sample size estimation). They have to consider other issues like model misfit, calibration, multiple-testing corrections, and so on. Then they also have to rig up the infra to be able to _do_ the testing, collect the data, analyze the results, and communicate the results to their internal stakeholders. These pain points are why companies like Eppo and StatSig exist -- A/B testing ends up being more high-touch than developers expect.
Messing up any one of these issues can yield "flaky tests," which developers hate. Failing to gather a sufficiently large sample size for a given effect size is a pretty common failure mode.
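For completeness, here is what a multiple-testing correction looks like in practice if you read several metrics off one experiment (or run many experiments); the p-values are made up, and Benjamini-Hochberg is just one common choice:

```python
# Correct a family of p-values before declaring wins.
from statsmodels.stats.multitest import multipletests

p_values = [0.002, 0.021, 0.049, 0.031, 0.18, 0.44]   # e.g. six metrics or tests

# 'fdr_bh' is the Benjamini-Hochberg false discovery rate procedure.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for raw, adj, win in zip(p_values, p_adjusted, reject):
    print(f"raw p={raw:.3f}  adjusted p={adj:.3f}  significant: {win}")
```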
> But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.
It's difficult to tell precisely what you mean by "maintain rigor" here. The only context I can gather is that whatever procedure you were using needed more data in order to satisfy the preconditions behind its nominal design criteria -- usually, its nominal false positive rate. I don't think this is an issue of rigor -- it's an issue of statistical modeling and correctness.
Sometimes, it's possible to use different methods that may require less data at the cost of more (or different) modeling assumptions. Failing to satisfy the assumptions of a test can increase its false positive rate. Whether that matters is really up to you.
> I do like their proposal for "peeking" and subsequent testing.
What the post is suggesting is not a proposal, but a standard class of frequentist statistical inference methods called sequential testing. Daniël Lakens has a good online textbook (https://lakens.github.io/statistical_inferences/) that briefly discusses these methods in Chapter 10 and provides further references.
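The simplest flavor looks something like this: plan a fixed number of equally spaced looks and compare each interim p-value to a stricter, Pocock-style boundary instead of 0.05. The boundary below is the commonly tabulated value for three looks at an overall two-sided alpha of 0.05 (about 0.0221); in practice you would compute it (e.g., with the rpact or gsDesign R packages) rather than hard-code it. The counts are made up.

```python
# Group-sequential sketch: three planned looks, constant Pocock-style boundary.
from statsmodels.stats.proportion import proportions_ztest

POCOCK_ALPHA_3_LOOKS = 0.0221   # per-look threshold, ~5% overall false positive rate

# Hypothetical cumulative counts (variant, control) at each planned look:
# (conversions, visitors) per arm.
looks = [
    ((205, 4_000), (175, 4_000)),
    ((425, 8_000), (355, 8_000)),
    ((640, 12_000), (535, 12_000)),
]

for i, ((c_b, n_b), (c_a, n_a)) in enumerate(looks, start=1):
    z, p = proportions_ztest(count=[c_b, c_a], nobs=[n_b, n_a])
    if p < POCOCK_ALPHA_3_LOOKS:
        print(f"look {i}: p={p:.4f} -> stop, significant at the Pocock boundary")
        break
    print(f"look {i}: p={p:.4f} -> not below {POCOCK_ALPHA_3_LOOKS}, keep going")
```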
> We're shipping software. We can change things if we get them wrong.
That's usually true -- as long as you have the resources needed to make those changes, and are willing to spend them that way.
> IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals.
While I don't disagree with the sentiment, I think you're conflating rigor with correctness here.
> If its goals are "stat sig on every test", then sure, treat it like someone might die if you're wrong.
I think that's a false equivalence. Even the American Statistical Association has issued a statement on p-values (see https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf) that includes "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."
> But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.
If those are your goals, just ship it; I don't think it makes sense to justify the effort to test in this situation, especially if, as you argue, it's financially feasible to roll back the change or pivot if it doesn't work.
But does it, really? A lot of companies sell... well, let's say "not important" stuff. Most companies don't cost peoples' lives when you get it wrong. If you A/B test user signups for a startup that sells widgets, people aren't living or dying based on the results. The consequences of getting it wrong are... you sell fewer widgets?
While I understand the overall point of the post -- and agree with it! -- I do take issue with this particular point. A lot of companies are, arguably, _too rigorous_ when it comes to testing.
At my last company, we spent 6 weeks waiting for stat sig. But within 48 hours, we had a positive signal. Conversion was up! Not statistically significant, but trending in the direction we wanted. But to "maintain rigor," we waited 6 weeks before turning it on... and the final numbers were virtually the same as the 48-hour numbers.
Note: I'm not advocating stopping tests as soon as something is trending in the right direction. The third scenario in the post points this out as a flaw! I do like their proposal for "peeking" and subsequent testing.
But, really, let's just be realistic about what level of "rigor" is required to make decisions. We aren't shooting rockets into space. We're shipping software. We can change things if we get them wrong. It's okay. The world won't end.
IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals. If its goals are "stat sig on every test," then sure, treat it like someone might die if you're wrong. (I would argue that you have the wrong goals, in this case, but I digress...)
But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.