Even at places that want to ruthlessly prioritize velocity over rigor I think it would be better to at least switch things up and worry more about effect size than p-value. Don't bother waiting to see if marginal effects are "significant" statistically if they aren't significant from the POV of "we need to do things that can 10x our revenue since we're a young startup."
That's because nobody learns how to do statistics and/or those who do are not really interested in it.
I taught statistics to biology students. Most of them treated the statistics (and programming) courses like chores. Out of 300-ish students per year, we had one or two who didn't leave uni mostly clueless about statistics.
For me, stats was something I had to re-learn years after graduating, once I realized its importance (not just practical, but also epistemological). During my university years, whatever interest I might have had got extinguished the second the TA started talking about those f-in urns filled with colored balls.
> those f-in urns filled with colored balls.
I did my Abitur [1] in 2005; back then that was high school material.
When I was teaching statistics we had to cut more and more content from the courses in favor of getting people up to speed on content that they should have known from school.
Also, calling them "urns". There are exactly two common usages of the word "urn" in Polish - the box you put your votes into during elections, and the vase for storing ashes of cremated people.
It's really the same problem as with math in school in general ("whatever is this even useful for?") - most people don't like doing abstract, self-contained puzzles that have no apparent utility but are high-stakes (you're being graded on them).
That argument is a strawman whenever it comes up, because it applies to every subject. High jump? The Napoleonic Wars? The molar weight of helium? English literature in the 19th century? What is any of that ever "useful" for? To understand the world in which you live. What a lack of education leads to is blatantly obvious with the current U.S. administration. It's not about each school lesson directly translating into monetary value in a later job, neither w.r.t. colored balls nor with knowing how the American Civil War started.
In the US, students are the paying customers. The consequence for not learning everything is lowered skills available for the job market (engineering) or life (philosophy?).
To me it is preferable that students who do not understand are not rated highly by the university (=do not get top marks), but “forcing” the students to learn statistics? That doesn’t make much sense.
Also, there’s nothing wrong with learning something after uni. Every skill I use in my job was developed post-degree. Really.
Only on paper. In many cases - I'd risk betting in the vast majority of cases - the actual paying customers are the parents.
> The consequence for not learning everything is lowered skills available for the job market (engineering) or life (philosophy?).
The ability to perceive and comprehend this kind of consequence is something that develops early in adulthood; some people get it in school, but others (again, I'd bet the majority) only halfway through university or even later.
On paper, you have students who're paying for education. In reality, their parents are paying an expected fee for an expected stage of life of their kids.
> Your hypothesis is: layout influences signup behavior.
I would expect the null hypothesis to then be that *layout does not influence signup behavior*. I would expect an ANOVA (or an equivalent linear model) to be what tests this hypothesis, where you test the 4 layouts (or the 4 new layouts plus a control?) as one factor. If you get a significant p-value (no multiple tests required), you go on with post-hoc tests to look into the comparisons between the different layouts (for 4 layouts, that's 6 pairwise tests). But then you can use ways to control for multiple comparisons that are not as strict as just dividing your threshold by the number of comparisons, e.g. Tukey's test.
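A minimal sketch of that workflow, assuming per-user 0/1 signup data for a control plus four layouts (the rates, sample sizes and library choices are my own made-up illustration, not from the article; with a binary outcome a chi-square test or logistic regression would be the more conventional omnibus test, ANOVA here is just to show the omnibus-then-post-hoc structure):

```python
# Omnibus test first, pairwise comparisons only if it is significant.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
groups = {name: rng.binomial(1, p, size=2000)          # 0/1 signups per user
          for name, p in [("control", 0.10), ("A", 0.10), ("B", 0.11),
                          ("C", 0.10), ("D", 0.12)]}

# One omnibus test of "layout does not influence signup behavior".
f_stat, p_omnibus = stats.f_oneway(*groups.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_omnibus:.4f}")

# Follow-up pairwise comparisons; Tukey's HSD controls the family-wise
# error rate less conservatively than dividing the threshold (Bonferroni).
if p_omnibus < 0.05:
    values = np.concatenate(list(groups.values()))
    labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
    print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```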
But here I assume there is a control (as in, some users are still presented the old layout?) and each layout is compared to that control? If I saw that distribution of p-values, I would just intuitively think the experiment is underpowered. P-values from tests of a true null are supposed to be distributed uniformly between 0 and 1, while these cluster around 0.05. It rather seems like a situation where it is hard to make inferences because of issues in the design of the experiment itself.
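A quick simulation of that point about p-value distributions, assuming repeated A/A-style comparisons where no real effect exists (baseline rate, sample sizes and test choice are arbitrary illustrations):

```python
# Under a true null, p-values should land roughly uniformly on [0, 1];
# a pile-up just around 0.05 is not what a true null produces.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pvals = []
for _ in range(5000):
    a = rng.binomial(1, 0.10, size=1000)   # control, 10% signup rate
    b = rng.binomial(1, 0.10, size=1000)   # "variant" drawn from the same rate
    pvals.append(stats.ttest_ind(a, b).pvalue)

# Roughly 10% of p-values should fall into each decile if the null holds.
hist, _ = np.histogram(pvals, bins=10, range=(0, 1))
print(hist / len(pvals))
```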
For example, I would rather have fewer layouts, driven by some expert design knowledge, than a lot of random-ish layouts. The first increases statistical power, because the fewer tests you investigate, the less you have to adjust your p-values. But also, the fewer layouts you have, the more users you have per group (as the test is between groups), which also increases statistical power. The article is not wrong overall about how to control p-values etc., but I think this knowledge is important not just to "do the right analysis" but, even more importantly, to understand the limitations of an experimental design and structure it in a way that it may succeed in telling you something. To this end, G*Power [0] is a useful tool that can, e.g., calculate the required sample size in advance based on the predicted effect size and the required power.
[0] https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psy...
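For the two-group, two-proportion case, the same kind of calculation G*Power does can be sketched with statsmodels (the 10% vs. 12% signup rates below are made-up planning numbers, not figures from the article):

```python
# How many users per group to detect a lift from a 10% to a 12% signup
# rate at alpha = 0.05 with 80% power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.12, 0.10)   # Cohen's h for the two rates
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(round(n_per_group))  # users needed in *each* group; more groups = more users total
```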