How to: Run an experiment
Part 3 of a four-part series on how to identify, sell, run and report on experiments
How to: Run an experiment ← You’re Here
How to Experiment:
The easiest mistake to make with experimentation is to set the experiment up in a way where nothing can be learned. The vast majority of wasted experiments come down to improper setup.
In particular, there are three key areas to consider when designing an experiment:
The people in the experiment.
The structure of the experiment.
The tracking and measurement of the experiment.
This guide will dive into each area and what to consider at each level.
The people in the experiment:
When you set out to conduct an experiment you’re generally trying to prove that something is true for a large group of people.
The experiment then tests your assumption about that population. In most cases, you can't (or shouldn't) test on the full population, as doing so can have adverse effects on the business. If you send a new test email to all your users, for example, you might be exposing everyone to a poorer experience, and you'll also have to wait before you can test a new version of the same email. To avoid this problem, you can build sample groups from the larger population and test on those groups.
Sampling:
By taking a sample group from the larger population, you can get results from a smaller group that are very likely to hold for the full population. At the same time, by breaking your population into smaller groups, you'll be able to run more iterations of the experiment than if you tested on the full population from the start.
To create a sample you have to define two things:
The characteristics of the people in the sample.
Practically speaking, what properties does this group share that make it distinct from other users?
Ideally, these characteristics are things you can easily use to segment your users in order to build the sample.
The number of people you need in the sample to get accurate results.
The sample size calculation for the majority of experiments can be done using a simple sample size calculator (here’s one from SurveyMonkey that I’ve used in the past).
These calculators help ensure that your sample is big enough that its results are very likely to be representative of what you'd see if the full population were exposed to the experiment.
They also can help you better understand how statistical analysis is done and how to interpret your results.
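For most experiments, the underlying math is simple enough to sketch yourself. Here's a minimal Python version of what those calculators compute for a proportion-style metric, assuming the usual defaults (maximum variance, i.e. p = 0.5) and applying the standard finite population correction:

```python
from math import ceil
from statistics import NormalDist

def sample_size(population: int, confidence: float = 0.95,
                margin_of_error: float = 0.05) -> int:
    """Sample size for a proportion, with a finite population correction.
    Assumes maximum variance (p = 0.5), the default most calculators use."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z-score for the confidence level
    n0 = (z ** 2) * 0.25 / (margin_of_error ** 2)       # infinite-population size
    return ceil(n0 / (1 + (n0 - 1) / population))       # finite population correction

print(sample_size(100_000))  # ~383: population size barely matters when it's large
print(sample_size(200))      # ~132: nearly the whole population (see below)
```

Note how the required sample barely moves for large populations but approaches the full population size when the population is small, which is exactly the dynamic the "very few people" question below deals with.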
What if my population is made up of lots of different groups?
In some cases, you might be dealing with a population made up of different sub-groups each with their own unique characteristics.
In this case, you have to ask yourself whether your hypothesis could be influenced by the particular properties of those groups. If it could be, the next question is whether that warrants a separate experiment or whether the differentiation can simply be something you include in your analysis. In most cases, it can just be a variable in your analysis, as it won't materially affect the experiment design. If your experiment is designed specifically for one sub-group, though, you should probably separate the user types.
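If you go the variable-in-your-analysis route, one common way to keep the sample honest is stratified sampling: draw from each sub-group in proportion to its share of the population, so the mix in your sample mirrors the real one. A minimal sketch, where the segment names and user records are hypothetical:

```python
import random

# Hypothetical user records, each tagged with the sub-group it belongs to
users = [
    {"id": i, "segment": random.choice(["free", "pro", "enterprise"])}
    for i in range(10_000)
]

def stratified_sample(users: list, total_size: int) -> list:
    """Sample each segment in proportion to its share of the population,
    so the sub-group mix in the sample mirrors the full population's."""
    by_segment = {}
    for user in users:
        by_segment.setdefault(user["segment"], []).append(user)
    sample = []
    for members in by_segment.values():
        quota = round(total_size * len(members) / len(users))
        sample.extend(random.sample(members, quota))
    return sample  # rounding can leave the total off by a user or two

sample = stratified_sample(users, total_size=400)
```

Keeping the segment on each sampled user also gives you the variable you'll need when you slice the results later.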
What if there are very few people in my population?
In some cases, the population is too small to build a proper sample. Here you're generally better off testing with the full population, since the required sample size will end up very close to the full population size anyway.
Sometimes you can shift your experiment to focus on frequency to build confidence, for example by exposing the same user group to the experience several times over a longer window of time. That said, I've found that going with the full population has tended to be the right path in these scenarios.
Focusing on a very small number of users:
There are cases (as I’ll cover below) where you’re trying to understand how to build something new. In these cases, sometimes your best path is to focus on how to manually support your intended experience for a small number of people and learn from their experience.
The thing to keep in mind in this context is that learnings from a small group carry higher risk: the people you're working with may be outliers, and the group may be too small to cover the full range of behavior you'd see in the larger population.
This shouldn’t discourage you from attempting this approach as you can always bring in a bigger sample to try your new experience later on.
The structure of the experiment:
The structure of the experiment is the mechanism through which the experiment is conducted. There are some common templates that people gravitate towards for conducting experiments:
White Glove Service / Concierge Treatment:
Here you’re manually supporting an experience in order to understand what the same experience would look like and how it would perform if it were built.
Imagine you’re trying to understand if it’s worth building a recommendations feature for your product. In this design, you’d manually curate recommendations for a small number of users yourself. This would teach you both how to best make recommendations as well as the user impact of having recommendations in their experience.
Best suited for: Validating if something is worth building.
Downsides: You’re generally limited in your sample size since you’re manually supporting this experiment.
A/B/Multivariate Testing:
These are scenarios where you’re running one or more versions of the same thing against one another to determine which one performs best. Within this format there are two main kinds of design:
Big swing experimentation:
Here you’re testing out a completely different concept from the existing implementation.
Imagine you have an onboarding experience with mediocre performance. In this design, you come up with a completely different approach for your onboarding experience and test it in parallel with your current flow. This lets you try to establish a higher floor of performance than you previously had. You won't know exactly why it's better (because you're testing a totally new version and can't measure what about it made it better), but better is better, so you'll take it.
Best suited for: Cases where you feel like a big pivot is needed but aren’t ready to replace what you had.
Downsides: Your new version has a tendency to underperform if it's going up against something that has been deeply optimized, and you won't learn why it performed better or worse.
Variable-based optimization:
Here you’re focusing on a key variable, changing it, and testing the changed version against your existing version.
Imagine you have something that works pretty well but you feel like it can be better. This approach allows you to try to improve but still know exactly what led to the improvement by focusing changes on only a single variable.
Best suited for: Optimizing something that’s already working.
Downsides: If you’re trying to quickly improve something this generally won’t be an effective approach as it’s highly focused on a single variable and it can take many changes to improve significantly.
Buffet Style Experiment:
Here you’re presenting the customer with lots of options in a cohesive way and seeing what options they pick.
Imagine you have multiple filters you're thinking of building but you're trying to figure out which one to build first. In this experiment design, you find a way to expose the customer to all the options you're considering and see which ones they gravitate to the most. Engagement with each option tells you which ones are most immediately compelling to the user.
Best suited for: Figuring out what things users gravitate towards.
Downsides: It’s very easy to structure the experiment in a way where there’s a bias towards one of the options (leading customers to gravitate towards it0 which can mislead you.
Experiment Duration:
When it comes to experiment duration the key questions to consider are:
How long would it take for all users in my experiment to complete the experience or experiment path?
Once that time has elapsed, have I achieved meaningful results?
For A/B types of experiments, you can use SurveyMonkey’s statistical significance calculator to determine if your results are significant.
Statistical significance tells you that the results you've seen with your sample group are likely to hold for the full population if it were exposed to the same conditions.
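For conversion-style metrics, what these calculators typically run is a standard two-proportion z-test. Here's a minimal sketch of that test (the counts are made up, and SurveyMonkey's exact implementation may differ):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided two-proportion z-test: is the difference in conversion
    rates between variants A and B bigger than chance would explain?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

# Hypothetical example: 100/1000 conversions (control) vs 130/1000 (variant)
print(two_proportion_p_value(100, 1000, 130, 1000))  # ~0.036, significant at 95%
```

A p-value below 0.05 is the conventional bar for significance at the 95% confidence level.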
For other kinds of experiments, if you've built the sample correctly, you should be able to assume that the results you've seen are representative of what you'd see with the full population.
If you don't have clear results, you can either add more people to the experiment or run it for longer. If your results don't seem better or worse than your baseline, you may be able to conclude that the new version is not better and focus on coming up with new experiments.
Experiment Medium and Fidelity:
For experiments around features, it can sometimes feel like you’re blocked from experimenting because you’re unable to deploy experiments in the same medium or with the same level of fidelity as your ultimate feature idea.
In many cases, you might be able to successfully experiment using other mediums (email being one of my favorites); you just have to ensure that it's reasonable to assume the behavior would translate across mediums (e.g., people would engage with the email in a similar way to how they'd engage with the in-product version of the feature).
For in-product experimentation, consider investing in feature flagging or other similar techniques to be capable of experimenting with select groups of users. In my experience, these capabilities have significantly increased the range of experiments available and have yielded the most accurate results since they allow you to get in-product engagement.
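If you're not ready to adopt a feature-flagging tool, the core mechanic is worth understanding anyway: deterministically bucket each user by hashing their ID, so the same user always lands in the same variant. A minimal sketch; the experiment name, split, and user ID are hypothetical:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket a user into a variant. The same user always
    gets the same answer, and hashing the experiment name alongside the ID
    keeps buckets independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

# Hypothetical usage: gate a new onboarding flow for half of your users
variant = assign_variant(user_id="user_42", experiment="onboarding_v2")
```

Deterministic assignment matters because a user who flips between variants mid-experiment contaminates both groups.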
In terms of development, try to have higher fidelity on the customer-facing portion of the experiment and hack together anything that's not visible (it'll be cheaper, and if the experiment doesn't pan out you'll likely have to throw it away anyway).
Ultimately, it's important to remember that the experiment only needs to be a good enough proxy for a potential feature or process. It won't ever perform exactly like the real thing, but it will ideally give you insight into how the real thing may behave if built.
The tracking and measurement of the experiment:
Sometimes, in the excitement to try things, we forget that without the right data at the end of the experiment, we won't be able to learn anything from it.
Before the experiment starts you should have a clear understanding of:
Key metrics and how to track them
What success and failure look like
Key metrics and how to track them
The experiment is essentially an exercise in extracting data that is reliable enough for analysis and drawing conclusions. Without that data, the experiment is throwaway work.
Generally, in terms of metrics, I like to have:
One or two key metrics that are the clear guide for the experiment’s success or failure.
A few additional metrics that are interesting and could affect my understanding of the results.
You should be able to retrieve all of these metrics once the experiment is finished, which means that before starting you should know exactly how you will retrieve them.
The key is to set up your data analysis process (how you will acquire those metrics and analyze them) before the experiment starts, not during or after (I can't stress this enough). Every time I've neglected this rule, I've found myself missing key data points in my analysis and have had to do further experimentation.
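One lightweight way to hold yourself to this is to write the plan down as a concrete artifact before launch, with every metric naming its source. A hypothetical sketch (the experiment, metrics, thresholds, and sources are all made up):

```python
# A hypothetical pre-launch analysis plan. Writing it as data forces you to
# answer "how will I retrieve this?" for every metric before the experiment runs.
ANALYSIS_PLAN = {
    "experiment": "onboarding_v2",
    "key_metrics": [
        {
            "name": "activation_rate",
            "definition": "users completing onboarding / users entering it",
            "source": "events table, saved query dashboards/onboarding_v2.sql",
            "success_threshold": 0.40,  # what success looks like, decided up front
        },
    ],
    "secondary_metrics": [
        {
            "name": "time_to_first_action",
            "definition": "median minutes from signup to first core action",
            "source": "events table",
        },
    ],
}
```

The success_threshold field anticipates the next section: deciding what success looks like belongs in the same pre-launch plan.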
Quantitative + Qualitative
Many times I like to couple my quantitative analysis with qualitative analysis for a more well-rounded picture. Here you can often include surveys as part of your experiment (at key points that don’t disrupt the experiment) and schedule interviews/sessions with customers to further discuss their experience with the experiment. If you have session tracking tools, those can be very valuable for understanding user behavior in your experiment (although nothing beats getting feedback from customers directly).
What success and failure look like
It’s always valuable to determine what success and failure look like for your experiment as well as determining if you’d be happy with the success state. This success/failure should be based on the key metric(s) you’ve selected and be something tangible like hitting a certain threshold on that metric(s).
Once you actually run the experiment, coming back to this success and failure state can be enlightening. Evaluating your results against your initial expectations can help clarify what you’ve learned and how your expectations have changed since running the experiment.
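Continuing the hypothetical plan sketched earlier, the wrap-up can be as simple as comparing observed results against the thresholds you committed to before launch:

```python
# Hypothetical wrap-up, reusing the ANALYSIS_PLAN sketch from above
observed = {"activation_rate": 0.37}  # pulled from your tracking at experiment end

for metric in ANALYSIS_PLAN["key_metrics"]:
    result = observed[metric["name"]]
    verdict = "success" if result >= metric["success_threshold"] else "below threshold"
    print(f"{metric['name']}: {result} ({verdict})")
```

Whatever the verdict, the gap between the threshold you set and the result you got is where most of the learning lives.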