Even at places that want to ruthlessly prioritize velocity over rigor I think it would be better to at least switch things up and worry more about effect size than p-value. Don't bother waiting to see if marginal effects are "significant" statistically if they aren't significant from the POV of "we need to do things that can 10x our revenue since we're a young startup."
That's because nobody learns how to do statistics and/or those who do are not really interested in it.
I taught statistics to biology students. Most of them treated the statistics (and programming) courses like chores. Out of 300-ish students per year, we had one or two who didn't leave uni mostly clueless about statistics.
For me, stats was something I had to re-learn years after graduating, once I realized its importance (not just practical, but also epistemological). During my university years, whatever interest I might have had got extinguished the second the TA started talking about those f-in urns filled with colored balls.
> those f-in urns filled with colored balls.
I did my Abitur [1] in 2005; back then that was high school material.
When I was teaching statistics we had to cut more and more content from the courses in favor of getting people up to speed on content that they should have known from school.
Also, calling them "urns". There are exactly two common usages of the word "urn" in Polish - the box you put your votes into during elections, and the vase for storing ashes of cremated people.
It's really the same problem as with math in school in general ("whatever is this even useful for?") - most people don't like doing abstract, self-contained puzzles that have no apparent utility but are high-stakes (you're being graded on them).
That argument is a strawman whenever it comes up, because it applies to every subject. High jump? The Napoleonic Wars? The molar weight of helium? English literature in the 19th century? What is any of that ever "useful" for? To understand the world in which you live. What a lack of education leads to is blatantly obvious with the current U.S. administration. It's not about each school lesson directly translating into monetary value in a later job, neither w.r.t. colored balls nor with knowing how the American Civil War started.
In the US, students are the paying customers. The consequence for not learning everything is lowered skills available for the job market (engineering) or life (philosophy?).
To me it is preferable that students who do not understand are not rated highly by the university (=do not get top marks), but “forcing” the students to learn statistics? That doesn’t make much sense.
Also, there’s nothing wrong with learning something after uni. Every skill I use in my job was developed post-degree. Really.
Only on paper. In many cases - I'd risk betting in the vast majority of cases - the actual paying customers are the parents.
> The consequence for not learning everything is lowered skills available for the job market (engineering) or life (philosophy?).
The ability to perceive and comprehend this kind of consequence is something that develops early in adulthood; some people get it in school, but others (again, I'd bet the majority) only halfway through university or even later.
On paper, you have students who're paying for education. In reality, their parents are paying an expected fee for an expected stage of life of their kids.
> Your hypothesis is: layout influences signup behavior.
I would expect the null hypothesis to then be that *layout does not influence signup behavior*. I would expect an ANOVA (or an equivalent linear model) to be what tests this hypothesis, where you test the 4 layouts (or the 4 new layouts plus a control?) as one factor. If you get a significant p-value (no multiple tests required), you go on with post-hoc tests to look into the comparisons between the different layouts (for 4 layouts, that's 6 pairwise tests). But then you can use ways to control for multiple comparisons that are not as strict as just dividing your threshold by the number of comparisons, e.g. Tukey's test.
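A minimal sketch of that workflow, assuming per-user 0/1 signup data for a control plus four layouts (the rates, sample sizes and library choices are my own made-up illustration, not from the article; with a binary outcome a chi-square test or logistic regression would be the more conventional omnibus test, ANOVA here is just to show the omnibus-then-post-hoc structure):

```python
# Omnibus test first, pairwise comparisons only if it is significant.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
groups = {name: rng.binomial(1, p, size=2000)          # 0/1 signups per user
          for name, p in [("control", 0.10), ("A", 0.10), ("B", 0.11),
                          ("C", 0.10), ("D", 0.12)]}

# One omnibus test of "layout does not influence signup behavior".
f_stat, p_omnibus = stats.f_oneway(*groups.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_omnibus:.4f}")

# Follow-up pairwise comparisons; Tukey's HSD controls the family-wise
# error rate less conservatively than dividing the threshold (Bonferroni).
if p_omnibus < 0.05:
    values = np.concatenate(list(groups.values()))
    labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
    print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```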
But here I assume there is a control (as in, some users are still presented the old layout?) and each layout is compared to that control? If I saw that distribution of p-values, I would just intuitively think the experiment is underpowered. P-values from tests of a true null are supposed to be distributed uniformly between 0 and 1, while these cluster around 0.05. It rather seems like a situation where it is hard to make inferences because of issues in the design of the experiment itself.
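A quick simulation of that point about p-value distributions, assuming repeated A/A-style comparisons where no real effect exists (baseline rate, sample sizes and test choice are arbitrary illustrations):

```python
# Under a true null, p-values should land roughly uniformly on [0, 1];
# a pile-up just around 0.05 is not what a true null produces.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pvals = []
for _ in range(5000):
    a = rng.binomial(1, 0.10, size=1000)   # control, 10% signup rate
    b = rng.binomial(1, 0.10, size=1000)   # "variant" drawn from the same rate
    pvals.append(stats.ttest_ind(a, b).pvalue)

# Roughly 10% of p-values should fall into each decile if the null holds.
hist, _ = np.histogram(pvals, bins=10, range=(0, 1))
print(hist / len(pvals))
```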
For example, I would rather have fewer layouts, driven by some expert design knowledge, than a lot of random-ish layouts. The first increases statistical power, because the fewer tests you investigate, the less you have to adjust your p-values. But also, the fewer layouts you have, the more users you have per group (as the test is between groups), which also increases statistical power. The article is not wrong overall about how to control p-values etc., but I think this knowledge is important not just to "do the right analysis" but, even more importantly, to understand the limitations of an experimental design and structure it in a way that it may succeed in telling you something. To this end, G*Power [0] is a useful tool that can, e.g., calculate the required sample size in advance based on the predicted effect size and the required power.
[0] https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psy...
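For the two-group, two-proportion case, the same kind of calculation G*Power does can be sketched with statsmodels (the 10% vs. 12% signup rates below are made-up planning numbers, not figures from the article):

```python
# How many users per group to detect a lift from a 10% to a 12% signup
# rate at alpha = 0.05 with 80% power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.12, 0.10)   # Cohen's h for the two rates
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(round(n_per_group))  # users needed in *each* group; more groups = more users total
```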