Why You Shouldn’t be Waiting for Statistical Significance
When deciding if a test can be stopped there are a number of things to consider.
- “How long should I run the test for?”
- “How will I know when the test has ended?”
- “How many conversions do I need?”
- “Do I need a minimum amount of traffic to end this test?”
These are common questions that come up when working with companies on optimisation programmes. Unfortunately, as with most things in life, the answer isn’t as black and white as we might like it to be, and no matter how many books on conversion rate optimisation you read, it will never be an exact science.
In an ideal world every test would reach statistical significance in a sensible amount of time and a clear winner would be declared along with its associated uplift in conversion.
If the test has only been live for a few days and some experiments are already showing as statistically significant, it would not be advisable to stop the test (even with high traffic and a good sample size at your key metric). The reason is that almost all websites see different traffic levels and conversion rates across the week (and weekend). If the test has only been running for, say, Monday, Tuesday and Wednesday, then at best the result you’re observing is only relevant for those three days of the week. It could be that the winning experiment for those three days actually performs worse than the control on other days.
If an AB test has been live for a week then that’s certainly an improvement over running it for just a few days. That said, we won’t necessarily know whether that particular week was a ‘typical’ week. For example, there might have been campaigns (PPC, promo codes, TV advertising etc.) that influenced visitor activity, potentially skewing the type of visitor who arrives at the test page and altering their motivation to purchase (they might only be buying because they have a 50% discount, for example!).
For the above reasons it is recommended to run a test for at least two weeks, even if statistical significance and sufficient sample size are reached earlier. This allows you to compare one complete week against another complete week and spot any strange peaks or troughs that may indicate unusual visitor behaviour.
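The two-week rule can be expressed as a simple date check before any significance result is even considered. A minimal sketch in Python (the function names are illustrative, not from any testing tool):

```python
from datetime import date, timedelta

def min_stop_date(start_date: date, min_weeks: int = 2) -> date:
    """Earliest date a test should be considered for stopping,
    regardless of how early significance is reached."""
    return start_date + timedelta(weeks=min_weeks)

def can_consider_stopping(start_date: date, today: date, min_weeks: int = 2) -> bool:
    # Require at least `min_weeks` complete weeks so every weekday
    # (and the weekend) is represented the same number of times.
    return today >= min_stop_date(start_date, min_weeks)
```

A test launched on a Monday would therefore not be a candidate for stopping until the Monday two weeks later, however strong the early numbers look.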
Perhaps you’ve run your test for two weeks and you have a clear winner showing a strongly statistically significant result (e.g. above a 95% chance to beat control, or below 5% for a negative lift). Ticking these boxes alone isn’t enough, because sample size also needs to be considered. ConversionXL wrote a useful blog about how many conversions are required to stop a test.
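The “chance to beat control” figure can be approximated by comparing two conversion rates under a normal approximation. A minimal sketch, using only the standard library (this is a generic textbook approximation, not the method of any particular testing tool):

```python
import math

def chance_to_beat_control(conv_c: int, n_c: int,
                           conv_v: int, n_v: int) -> float:
    """Approximate probability that the variant's true conversion
    rate exceeds the control's, via a normal approximation to the
    difference of two proportions."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = (p_v - p_c) / se
    # Standard normal CDF expressed via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

With identical rates this returns 0.5 (a coin flip), and it only climbs above the 0.95 threshold mentioned above once the observed difference is large relative to the noise in both samples.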
Before the test was launched you should have defined the key metric for the test (for example clicks on add to basket, visitors reaching the payment page, or conversion through to confirmation – i.e. site conversion). Before declaring a result you need to review how much sample (how many conversions) you have received at your key metric, per experiment. For example, if you’re running an ABn test with four experiments, three of which have around 20 conversions at your primary metric while one shows 35 conversions, it might seem obvious which is the winner. The issue here is sample size: there simply isn’t enough to be confident that the lift will sustain, even if it shows as statistically significant. Leave the test running longer and use online tools such as this one or Evan Miller’s tool to get an idea of how much sample should be sufficient.
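Sample-size calculators such as Evan Miller’s are typically based on a two-proportion z-test power calculation. A rough sketch of that calculation, assuming conventional defaults of 95% significance and 80% power (the z-values are hard-coded for those defaults; this is an approximation, not the exact formula behind any specific tool):

```python
import math

def sample_per_variant(base_rate: float, relative_lift: float) -> int:
    """Approximate visitors needed per variant to detect a given
    relative lift over the base conversion rate, at 95% significance
    and 80% power (two-sided z-test approximation)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha, z_beta = 1.96, 0.84  # 95% confidence, 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

For a 5% base conversion rate and a 20% relative lift this lands in the region of eight thousand visitors per variant, which illustrates why a handful of conversions is nowhere near enough to call a winner.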
Stabilisation Graphs (and other tools)
Webtrends Optimize shows graphical representations of how the experiments within a test are stabilising over time. The testing tool you are using may have a similar useful feature. This should not be used in isolation but instead in conjunction with both sample size and duration to help determine how near you are to test completion.
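A stabilisation graph essentially plots each experiment’s cumulative conversion rate over time and lets you judge whether it has flattened out. A minimal sketch of the underlying series plus a simple flatness check (the window and tolerance are illustrative assumptions, not how Webtrends Optimize or any other tool defines stability):

```python
def cumulative_rates(daily_visitors: list, daily_conversions: list) -> list:
    """Cumulative conversion rate after each day - the series a
    stabilisation graph plots for one experiment."""
    rates, v_total, c_total = [], 0, 0
    for v, c in zip(daily_visitors, daily_conversions):
        v_total += v
        c_total += c
        rates.append(c_total / v_total)
    return rates

def has_stabilised(rates: list, window: int = 7, tolerance: float = 0.002) -> bool:
    """True if the cumulative rate moved less than `tolerance`
    (absolute) over the last `window` days."""
    if len(rates) < window:
        return False
    recent = rates[-window:]
    return max(recent) - min(recent) < tolerance
```

As the article stresses, a check like this should supplement duration and sample-size requirements, not replace them.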
Back to our original question – “Should You Wait for Statistical Significance to Declare a Test Result?”. As mentioned earlier, it isn’t an exact science, and no calculator can reliably predict how long every test will need to run. The above tools (duration, sample size and stabilisation graphs) should be used in conjunction with each other to help determine when an optimal or winning experiment can be declared.
Author: Phil Williams
Phil is the founder of CRO Converts. He has had the opportunity to create successful testing and personalisation strategies for many of the UK and Europe’s leading brands.