What you are getting wrong about AB Testing
Oct 29, 2017 09:00 · 740 words · 4 minutes read
AB Testing and Experimental Design
AB testing has become a staple of the internet today. Facebook’s mantra “move fast and break things” can be roughly translated as “run lots of AB tests”, and in my experience, data scientists and business analysts are routinely brought in to assess whether a new design is statistically better. In an era where fake news potentially represents a threat to global stability, it’s not surprising that the tech world has embraced the scientific method.
Or at least tried to.
The problem is that not everything can be controlled for scientific study, and many websites simply lack the proper experimental setup to run a useful AB test.
1. The significance of an insignificant result - misinterpreting insignificance
Consider the problem:
(A) 2343 visitors to the original website, 44 conversions (1.88% of visitors)
(B) 522 visitors to the test website, 12 conversions (2.30% of visitors)
p value: 0.53 (chi2)
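If you want to check these numbers yourself, here is a minimal sketch of the test, assuming a standard chi-squared test on the 2x2 table of conversions versus non-conversions (scipy is just my choice for illustration; the original analysis may have used a different tool):

```python
# Minimal sketch: chi-squared test on the conversion counts quoted above.
from scipy.stats import chi2_contingency

# Rows: original (A), test (B); columns: conversions, non-conversions.
table = [[44, 2343 - 44],
         [12,  522 - 12]]

# Skipping Yates' continuity correction reproduces the quoted p value (~0.53).
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"p = {p:.2f}")
```

With Yates’ correction (scipy’s default for 2x2 tables) the p value comes out even larger, which only strengthens the point: nothing close to significant.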
The statistician in you might look at these numbers and determine that the p value is definitely not significant. Had website B lost even a few conversions, it would be performing worse than website A.
But then you realize that the test website (B) is technically performing 22.4% better!
I have seen smart people look at similar numbers and confidently tell their business partners that “the results are insignificant - the test failed.” In reality, that couldn’t be further from the truth: it was our experimental setup that failed. Why? Our experiment was not set up to detect a reasonable lift. If your setup can’t detect something as small as a 5% improvement, then you might want to reconsider whether AB testing is right for your needs.
2. Small sample sizes
In the example above, our test wasn’t big enough to detect a 20% lift, and probably should not have been performed in the first place. One way to remedy this is with a larger sample size. If our results in the example above were an order of magnitude larger, then we could just barely see a significant result (p < 0.05):
23430 visitors to the original website, 440 conversions (1.88% of visitors)
5220 visitors to the test website, 120 conversions (2.30% of visitors)
p value: 0.047 (chi2)
While we can detect a 20% lift, we would have needed a much larger sample to detect a 5% lift. Despite the fact that most businesses would be very happy with a 5% improvement in conversion rate, we would be forced to call the result insignificant.
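To put rough numbers on that, here is a power calculation sketched with statsmodels, assuming a two-sided test at alpha = 0.05 with 80% power (conventional defaults, not values from the original test) and the 1.88% baseline conversion rate from the example:

```python
# Rough sample size sketch: how many visitors per group are needed to detect
# a given relative lift over a 1.88% baseline conversion rate?
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.0188          # conversion rate of the original website
analysis = NormalIndPower()

for lift in (0.20, 0.05):  # 20% and 5% relative improvements
    effect = proportion_effectsize(baseline * (1 + lift), baseline)
    n = analysis.solve_power(effect_size=effect, alpha=0.05,
                             power=0.8, ratio=1.0)
    print(f"{lift:.0%} relative lift: ~{n:,.0f} visitors per group")
```

Under these assumptions, detecting a 20% lift takes on the order of ten thousand visitors per group, while a 5% lift takes well over a hundred thousand per group - which is exactly why the scaled-up example only just clears significance.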
3. Controlling your experiment
Sometimes it is not possible to run a fully controlled experiment. For example, the testing process itself can influence the behavior of users - a new design might produce a lift simply because it’s new, and that initial lift might fade over time. Similarly, a new design might perform poorly because your users don’t like change, even if it improves usability and long-term conversions. These lagging effects can cause differences between test results and real-world performance, potentially invalidating the test.
Changes also don’t exist in isolation - just because the last test produced a significant lift, it doesn’t mean the same change would be beneficial in a future design. Nor does it mean that two changes which each provide a 5% lift would compound to a 10.25% lift when coupled together (1.05 × 1.05 = 1.1025) - the interaction could just as easily dampen or reverse the gains. Because AB testing usually follows a series of incremental tweaks, the interaction effects of your changes are usually not accounted for, and you could get different design outcomes depending on the sequence in which changes were made.
If not AB testing, then what?
Ideally, data scientists should be approached in the design/planning stage to see if it would even be possible to run a test. If it’s not possible, the business needs to figure out new, potentially qualitative methods for assessing design changes.
The issue comes down to this: unless you are getting thousands of conversions, it will be hard to run an AB test with enough power to detect a realistic lift.
For smaller websites, you might want to reframe your goal so that you collect more data. Perhaps you measure clicks instead of conversions. If you can’t scale your data, you will need to find non-statistical methods for analyzing the problem. Maybe you add some customer experience tools (I like fullstory) to your website to watch users interact with the new design. Or simply apply design best practices and collect honest feedback on new designs as much as possible.