Live testing: can you trust the results?

So, live testing is often seen as a sensible way to boost a website’s performance, and it borrows heavily from the scientific method. Basically, you create different versions of a page or element – the default version and some alternatives – and then test them to see how they perform compared to the default. The version that does best becomes the new standard, and the process repeats.

But there’s a catch: the results of these tests aren’t always as straightforward as they seem. The tests themselves can be flawed, leading to false positives – and the problem runs deep. It becomes apparent during the experiment’s runtime, and it stems from a rather counter-intuitive belief.

The first thing that raises a red flag is the notion that design doesn’t play a big role in what works. Pages with minimal or no design have, in some cases, delivered better performance. It might seem a bit paradoxical, but it’s supported by the data, and it’s gained traction within the live testing community.

When running a test, you often see the same pattern repeating over and over again. One design might take the top spot for a week, only for another to replace it the next week. This keeps happening throughout the entire experiment, with the variants jostling for position. It’s also common for the top-performing variants to achieve a significant uplift, meaning that if the experiment were to end at that point, there would be a winner.

But some experiments behave very differently, with one variant consistently outperforming the others and showing a marked performance improvement throughout the campaign. We can divide experiments into two types – stable ones that show marked uplift and unstable ones that show only marginal uplift. The latter produce significant results, but the margins are small. The question is whether these results are real or simply generated by the underlying noise (variance within the sample).

Simulating A/B/N tests

One way to answer this question is to run tests with a slight modification: making every variant identical. If there’s still variation in the results, it must be an artefact of the test itself, and we can simulate such tests to see whether that alone is enough to generate the behaviour.

In each experiment, there’s a metric that we want to manipulate, such as the number of people buying tickets. This metric is our baseline conversion rate. We also have several variants – the basic requirement is two, but live testing often involves many more.

To remove any sampling bias that could distort the result, we randomly assign each member of the population to one of the dummy experiences. We then perform a significance test to see whether the variation in the number of converters within each group is significant compared to the default experience.
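As a sketch of that setup (the function name, population size, and baseline rate are illustrative assumptions, and a pooled two-proportion z-test stands in for whatever significance test a given tool actually uses), a dummy experiment with identical variants might look like this:

```python
import math
import random

def aa_test(population=20000, variants=4, baseline=0.05, alpha=0.05, seed=1):
    """Randomly assign an identical experience to each visitor, then z-test
    every variant's conversion rate against the default (variant 0)."""
    rng = random.Random(seed)
    visitors = [0] * variants
    conversions = [0] * variants
    for _ in range(population):
        v = rng.randrange(variants)      # random assignment to a dummy experience
        visitors[v] += 1
        if rng.random() < baseline:      # every variant shares the same true rate
            conversions[v] += 1
    results = []
    for v in range(1, variants):
        n1, n2 = visitors[0], visitors[v]
        p1, p2 = conversions[0] / n1, conversions[v] / n2
        pooled = (conversions[0] + conversions[v]) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p2 - p1) / se
        p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        results.append((v, p2 - p1, p_value, p_value < alpha))
    return results
```

Since every variant has the same true conversion rate, any variant flagged as significant here is, by construction, a false positive.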

Here is the simulation – take it for a spin:
http://www.paperst.co.uk/MVT/mvt.htm

Randomisation

Since we can’t control who sees each variation, we assign visitors randomly. Some groups will end up with more converters, others with fewer. One of these groups serves as our default, and we calculate the uplift (a positive or negative figure) relative to it. If the default happens to have relatively few converters, the chances of a positive uplift increase; if it has more, the opposite is true. Because the converters are distributed randomly, there is a range of variation due to sampling alone.
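To see that range concretely, here is a minimal sketch (the function name and sample sizes are made up for illustration) that draws two groups from the same true conversion rate and records the “uplift” of one over the other – any spread it reports is sampling noise alone:

```python
import random

def uplift_range(trials=500, group_size=2000, baseline=0.05, seed=2):
    """Draw a 'default' and a 'variant' group from the SAME true conversion
    rate, and record the percentage uplift of the variant over the default.
    Returns the smallest and largest uplift seen across all trials."""
    rng = random.Random(seed)
    uplifts = []
    for _ in range(trials):
        default = sum(rng.random() < baseline for _ in range(group_size))
        variant = sum(rng.random() < baseline for _ in range(group_size))
        uplifts.append((variant - default) / default * 100)  # % uplift
    return min(uplifts), max(uplifts)
```

With identical pages, the reported uplift still swings well into both positive and negative territory from one trial to the next.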

Significance Testing

So far so good. However, the problem arises when we are not just testing A against B but A against multiple variants, repeating the same approach. Let me explain further. Suppose I run a test on page A and page B, but I don’t get a significant result. I decide to rerun the test but still, no success. However, I’m determined to get the result I want, so I keep going until I get the desired outcome. Let’s say it takes me 20 attempts to achieve the desired result.
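This rerun-until-significant loop is easy to simulate. In the sketch below (all names and parameters are assumptions, with a pooled two-proportion z-test standing in for the significance test), both pages share the same true conversion rate, so every “significant” result the loop finds is a false positive:

```python
import math
import random

def two_proportion_p(c1, n1, c2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (c2 / n2 - c1 / n1) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def rerun_until_significant(group_size=2000, baseline=0.05, alpha=0.05,
                            seed=3, cap=200):
    """Keep rerunning an A-vs-B test where both pages are identical until a
    'significant' result appears; return how many attempts it took (or None
    if the cap is reached first)."""
    rng = random.Random(seed)
    for attempt in range(1, cap + 1):
        a = sum(rng.random() < baseline for _ in range(group_size))
        b = sum(rng.random() < baseline for _ in range(group_size))
        if two_proportion_p(a, group_size, b, group_size) < alpha:
            return attempt
    return None
```

At a 0.05 threshold the loop typically succeeds within a few dozen attempts, despite there being no real difference to find.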

Now the question arises: across the whole series of tests, is my chance of a false positive still 1 in 20? In reality, no – I simply kept going until the event was almost guaranteed to occur. The first time I ran the experiment, I had a 1/20 chance (0.05) of a dud result. The second time, I still had a 1/20 chance – that never changes. However, across the two attempts, the probability of getting a dud result in at least one of them is now 0.0975 (roughly 1/10).

When I run the experiment four times, the chance of getting a dud result increases to 0.185 (roughly 2/11). By the time I have run the experiment 20 times, the chance of getting a false positive result is 0.642 (about 2/3), and the likelihood keeps increasing with each subsequent test.
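The arithmetic behind these figures is the family-wise error rate: the chance of at least one false positive in n independent tests at significance level alpha is 1 − (1 − alpha)^n. A one-liner (the function name is hypothetical) reproduces the numbers above:

```python
def false_positive_risk(alpha=0.05, runs=(1, 2, 4, 20)):
    """Chance of at least one false positive after n independent tests,
    each run at significance level alpha: 1 - (1 - alpha) ** n."""
    return {n: round(1 - (1 - alpha) ** n, 4) for n in runs}
```

Running it gives roughly 0.05 for one test, 0.0975 for two, 0.1855 for four, and 0.6415 for twenty – matching the figures in the text.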

Conclusion

In conclusion, while live testing can be an effective tool to improve website performance, it’s essential to understand its limitations and the possibility of false positives. It’s crucial to ensure that the tests are properly designed and run to reduce the risk of generating misleading results.

