I'm excited to introduce tea-tasting, a Python package for the statistical analysis of A/B tests.
It features Student's t-test, Bootstrap, variance reduction using CUPED, power analysis, and other statistical methods.
tea-tasting supports a wide range of data backends, including BigQuery, ClickHouse, PostgreSQL, Snowflake, Spark, and more, all thanks to Ibis.
I consider it ready for important tasks and use it for the analysis of switchback experiments in my work.
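A rough sketch of basic usage (see the docs for the exact API and options):

    # Example analysis of a simulated experiment dataset.
    import tea_tasting as tt

    data = tt.make_users_data(seed=42)  # built-in simulated dataset

    experiment = tt.Experiment(
        sessions_per_user=tt.Mean("sessions"),
        orders_per_session=tt.RatioOfMeans("orders", "sessions"),
        orders_per_user=tt.Mean("orders"),
        revenue_per_user=tt.Mean("revenue"),
    )

    result = experiment.analyze(data)
    print(result)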
For those who haven't read about Fisher's tea experiment: there was a woman who claimed she could tell whether the milk was put into the cup before or after the tea. Fisher didn't think so and developed the experimental technique to test this idea. Indeed she could, getting them all right, IIRC.
[1] see https://media.trustradius.com/product-downloadables/UP/GB/AD... for a discussion of the problems with a t-test. There is also a more detailed whitepaper from Optimizely somewhere
[1] https://github.com/assuncaolfi/savvi/
[2] https://openreview.net/forum?id=a4zg0jiuVi
1. Experiments with 3 or more variants are quite rare in my practice. I usually try to avoid them.
2. In my opinion, the Bonferroni correction is just wrong. It's too pessimistic. There are better methods though.
3. The choice of alpha is subjective. Why use a precise smart method to adjust a subjective parameter? Just choose another subjective alpha, a smaller one :)
But I can change my opinion if I see a good argument.
I agree that Bonferroni is often too pessimistic. If you Bonferroni correct you'll usually find nothing is significant. And I take your point that you could adjust the $\alpha$. But then of course, you can make things significant or not as you like by the choice.
False Discovery Rate is less conservative, and I have used it successfully in the past.
People have strong incentives to find significant results that can be rolled out, so you don't want that person choosing $\alpha$. They will also be peeking at the results every day of a weekly test, and wanting to roll it out if it bumps into significance. I just mention this because the most useful A/B libraries are ones that are resistant to human nature. PMs will talk about things being "almost significant" at 0.2 everywhere I've worked.
I'm considering the following:
- FWER: Holm–Bonferroni, Hochberg's step-up.
- FDR: Benjamini–Hochberg, Benjamini–Yekutieli.
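All four are available in statsmodels, for reference. A quick sketch with hypothetical p-values:

    # Adjusting a set of p-values with the FWER/FDR methods mentioned above.
    # p_values is a made-up list of per-metric (or per-variant) p-values.
    from statsmodels.stats.multitest import multipletests

    p_values = [0.003, 0.012, 0.048, 0.20]

    for method in ("holm", "simes-hochberg", "fdr_bh", "fdr_by"):
        reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
        print(method, p_adj.round(4), reject)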
I'm wondering if you'd like to accept a contribution for Bayesian A/B testing, based on this whitepaper[0] and developed in NumPy.
If so, we can chat at my email gbenatt92 at zohomail dot com, or I can open a draft PR to discuss the code and paper.
[0] https://vwo.com/downloads/VWO_SmartStats_technical_whitepape...
Regarding your question, first, I'd like to understand what problem you want to solve, and whether this approach will be useful for other users of tea-tasting.
At my company, we have very time-sensitive A/B tests that we have to run with very few data points (at most 20 conversions per week, after 1,000 or so failures).
We found out that Bayesian A/B testing was excellent for our needs, as it could be run with fewer data points than a regular A/B test for the sort of conversion changes we aim for. It gives a probability of group B converting better than A, and we can run checks to see if we should stop the test.
Regular A/B tests would take too long, and the significance of the test wouldn't make much sense because after a few weeks we would be comparing apples to oranges.
Most probably, in your case, higher sensitivity (or power) comes at the cost of higher type I error rate. And this might be fine. Sometimes making more changes and faster is more important than false positives. In this case, you can just use a higher p-value threshold in the NHST framework.
You might argue that the discrete type I error does not concern you. And that the potential loss in metric value is what matters. This might be true in your setting. But in real life scenarios, in most cases, there are additional costs that are not taken into account in the proposed solution: increased complexity, more time spent on development, implementation, and maintenance.
I suggest reading this old post by David Robinson: https://varianceexplained.org/r/bayesian-ab-testing/
While the approach might fit your setting, I don't believe most other users of tea-tasting would benefit from it. For the moment, I must decline your kind contribution.
But you can still use tea-tasting and perform the calculations described in the whitepaper. See the guide on how to define a custom metric with a statistical test of your choice: https://tea-tasting.e10v.me/custom-metrics/
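For reference, the core calculation from the whitepaper is a beta-binomial posterior comparison, which doesn't need any special machinery. A minimal sketch (the counts and the flat Beta(1, 1) prior are made up):

    # Posterior probability that variant B converts better than variant A,
    # assuming a Beta(1, 1) prior and made-up conversion counts.
    import numpy as np

    rng = np.random.default_rng(42)

    visitors_a, conversions_a = 1_000, 18
    visitors_b, conversions_b = 1_000, 25

    posterior_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, size=100_000)
    posterior_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, size=100_000)

    print("P(B > A) =", (posterior_b > posterior_a).mean())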
Not many people have enough traffic to A/B test small effects and reach significance without running the test for multiple years.
I don't use CUPED in my tests... how much can it reduce wait times?
What's really important if you want to improve a website via A/B testing is a constant stream of new hypotheses (i.e. new variants). You can call tests "early" so long as you have new tests lined up; it boils down to a classic exploitation/exploration problem. In fact, in early development, rapid iteration often yields superior results to waiting for significance.
As a website matures and gets closer to some theoretical optimal conversion point, it becomes increasingly important to wait until you are very certain of an improvement. But if you're just starting A/B testing, more iteration will yield greater success than more certainty.
Another way to say that is: you can randomly pick a winner
Taking a long time to reach "significance" just means there is a small difference between the two variants, so it's better to just choose one and then try the next challenger, which might have a larger difference.
In the early stages of running A/B tests, being 90% certain that one variant is superior is perfectly fine so long as you have another challenger ready. Conversely, in the later stages of a mature website, when you're searching for minor gains, you probably want a much higher level of certainty than the standard 95%.
In either case thinking in terms of arbitrary significance thresholds doesn't make that much sense for A/B testing.
But for incremental, smaller changes, calling early is probably gambling.
On top of this, logistic regression makes your units a lot more interpretable than just looking at differences in means, e.g. the odds of buying something are 1.1 times as high when you are assigned to group B.
Correct A/B testing should involve starting with an A/A test to validate the setup, building a basic causal model of what you expect the treatment impact to be, controlling for covariates, and finally ensuring that, when the causal factor is controlled for, the results change as expected.
But even the "experts" I've read in this area largely focus on statistical details that honestly don't matter (and if they do the change you're proposing is so small that you shouldn't be wasting time on it).
In practice if you need "statistical significance" to determine if change has made an impact on your users you're already focused on problems that are too small to be worth your time.
I think the dumb underlying question I have is: how does one do experimental design?
Edit: and if you aren't seeing giant, obvious improvements, try improving something else. (I get the idea that my B is going to be so obvious that there is no need to worry about stats; if it's not, that's a signal to change something else?)
Yes, one can analyze A/B tests in a regression framework. In fact, CUPED is equivalent to a linear regression with a single covariate.
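As a quick illustration on synthetic data (the covariate here is a pre-experiment version of the metric, and the numbers are arbitrary):

    # CUPED adjustment: subtract the part of the metric explained by a
    # pre-experiment covariate, which shrinks the variance of the estimate.
    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.normal(100, 20, size=10_000)          # pre-experiment covariate
    y = 0.8 * x + rng.normal(0, 10, size=10_000)  # in-experiment metric

    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    y_cuped = y - theta * (x - x.mean())

    # Variance drops by roughly the squared correlation between x and y.
    print(np.var(y, ddof=1), np.var(y_cuped, ddof=1))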
Would it be better? It depends on the definition of "better". There are several factors to consider. Scientific rigor is one of them. So is computational efficiency.
A/B tests are usually conducted at a scale of thousands of randomization units (actually, it's more like tens or hundreds of thousands). There are two consequences:
1. Computational efficiency is very important, especially if we take into account the number of experiments and the number of metrics. And pulling granular data into a Python environment and fitting a regression is much less efficient than calculating aggregated statistics like mean and variance (see the sketch after this list).
2. I didn't check, but I'm pretty sure that, at such a scale, the results of logistic and linear regressions will be very close, if not equal.
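For illustration, here's the kind of analysis that needs only aggregated statistics: Welch's t-test computed from per-variant summary stats (the numbers are made up):

    # Welch's t-test from aggregated statistics only, no granular data needed.
    from scipy.stats import ttest_ind_from_stats

    result = ttest_ind_from_stats(
        mean1=10.20, std1=4.1, nobs1=50_000,  # variant A
        mean2=10.35, std2=4.2, nobs2=50_000,  # variant B
        equal_var=False,                      # Welch's t-test
    )
    print(result.pvalue)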
And even if, for some reason, there is a real need to analyze a test using a logistic model, a multilevel model, or clustered errors, it's possible in tea-tasting via custom metrics: https://tea-tasting.e10v.me/custom-metrics/
This is not true. You almost never need to perform logistic regression on individual observations. Consider that estimating a single Bernoulli rv on N observations is the same as estimating a single binomial rv for k successes out of N. Most common statistical software (e.g. statsmodels) will support this grouped format.
If all of our covariates are discrete categories (which is typically the case for A/B tests), then you only need to run the regression on a number of rows equal to the number of unique configurations of the variables.
That is, if you're running an A/B test on 10 million users across 50 states and 2 variants, you only need 100 observations for your final model.
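A sketch of this grouped format with statsmodels (the counts and column names are made up):

    # Logistic regression on aggregated data: one row per unique configuration,
    # with the endog given as (successes, failures) pairs.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    grouped = pd.DataFrame({
        "variant": ["A", "A", "B", "B"],
        "state": ["CA", "NY", "CA", "NY"],
        "conversions": [120, 95, 140, 90],
        "users": [10_000, 8_000, 10_000, 8_000],
    })

    endog = np.column_stack([
        grouped["conversions"],
        grouped["users"] - grouped["conversions"],
    ])
    exog = sm.add_constant(
        pd.get_dummies(grouped[["variant", "state"]], drop_first=True).astype(float)
    )

    result = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
    print(result.summary())
    print(np.exp(result.params))  # odds ratios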
Interesting, I didn't know this about statsmodels. But maybe the documentation is a bit misleading: "A nobs x k array where nobs is the number of observations and k is the number of regressors". Source: https://www.statsmodels.org/stable/generated/statsmodels.gen...
I would be grateful for references on how to fit a logistic model in statsmodels using only aggregated statistics. Or not statsmodels, any references will do.
So that will be a bit different from R-style formulas using cbind, but yes, if you only have a few categories of data, using weights makes sense. (Even many of sklearn's functions allow you to pass in weights.)
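A minimal sketch of the weights approach, using statsmodels' freq_weights (the counts are made up):

    # Same logistic regression, but with one row per (variant, outcome) combination
    # and frequency weights instead of (successes, failures) pairs.
    import pandas as pd
    import statsmodels.api as sm

    rows = pd.DataFrame({
        "converted": [1, 0, 1, 0],
        "variant_b": [0, 0, 1, 1],
        "n": [215, 17_785, 230, 17_770],
    })
    exog = sm.add_constant(rows[["variant_b"]].astype(float))
    model = sm.GLM(rows["converted"], exog,
                   family=sm.families.Binomial(), freq_weights=rows["n"])
    print(model.fit().summary())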
I have not worked out a closed form for logit regression, but for Poisson regression you can get a closed form for the incidence rate ratio: https://andrewpwheeler.com/2024/03/18/poisson-designs-and-mi... So there is no need to use maximum likelihood at all in that scenario.
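For a two-group design, the closed form is just the ratio of the observed rates; a quick sketch with made-up counts and exposures:

    # Closed-form incidence rate ratio for a two-group Poisson design,
    # with a Wald confidence interval on the log scale.
    import math

    events_a, exposure_a = 130, 10_000
    events_b, exposure_b = 155, 10_000

    irr = (events_b / exposure_b) / (events_a / exposure_a)
    se_log_irr = math.sqrt(1 / events_a + 1 / events_b)
    ci = (irr * math.exp(-1.96 * se_log_irr), irr * math.exp(1.96 * se_log_irr))
    print(irr, ci)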
[1] https://www.pymc.io/projects/examples/en/latest/generalized_...
If NumPy is out of consideration, so is the entire scientific Python ecosystem. Python is not a fast language, and any kind of math-heavy algorithm is going to suffer significant performance penalties.
Python packaging is a mess, but compared to the issues with Torch or Nvidia stuff, NumPy has been a cakewalk, whether using pip, conda, Poetry, rye, etc.
https://en.wikipedia.org/wiki/Fisher%27s_exact_test?useskin=...
Alex Deng worked with Ron Kohavi on Microsoft's Analysis and Experimentation team and co-authored many important papers on the topic, including the paper about CUPED.
https://matheusfacure.github.io/python-causality-handbook/la...