NERVICO · digital-product · 11 min read
A/B Testing for Digital Products: A Practical Guide
How to design, execute, and analyze A/B tests that generate reliable product decisions. Applied statistics, common mistakes, tools, and real cases for product teams.
Most product teams believe they do A/B testing. What they actually do is change something, look at a number, and declare a winner. That is not experimentation. It is confirmation bias with extra steps.
A proper A/B test requires a clear hypothesis, a calculated sample size, a defined execution period, and rigorous statistical analysis. Without these elements, the conclusions are no more reliable than flipping a coin. Worse still: they provide a false sense of certainty that leads to wrong decisions made with confidence.
This article explains how to do A/B testing properly. Not the simplified version from marketing tutorials. The version that generates reliable product decisions.
Why A/B Testing Matters (And Why It Is Done Poorly)
The Problem With Opinion-Based Decisions
In most product teams, important decisions are made like this: someone proposes a change, it is debated in a meeting, the person with the most seniority or charisma wins, and it is implemented. If it works, it is attributed to good intuition. If it does not work, another explanation is sought.
A/B testing eliminates this model. Instead of debating who is right, both options are tested with real users and the data decides. It sounds simple. It is not.
Why Most A/B Tests Are Invalid
Insufficient sample size. The most frequent error. A test with 200 users per variant can reliably detect only very large differences: at a 10% base rate, the variant would have to roughly double conversion. Most product improvements are nowhere near that. They are 2-5%. To detect them you need thousands of users, often tens of thousands.
Peeking problem. You look at results before the test ends and declare a winner when you see a difference that "looks significant." This dramatically inflates the false positive rate. If you check for significance 10 times during a test and stop at the first "win," your error rate is not 5%. It is closer to 20%, and it keeps growing with every extra look.
Poorly chosen metrics. You test a button color measuring clicks. The red button gets more clicks. You declare victory. But clicks do not measure conversion. Users clicked the red button because it was confusing, not because they wanted to buy from you.
Inadequate duration. A 3-day test is distorted by day-of-week effects. A test that does not cover at least one complete user behavior cycle (typically 2-4 weeks) produces results that do not replicate.
The Statistical Fundamentals You Need to Understand
You do not need a PhD in statistics. You need to understand four concepts.
1. Statistical Significance (P-Value)
Statistical significance answers this question: "If there were no real difference between the two variants, what is the probability of observing a difference at least as large as the one I observed?"
The industry standard is a significance threshold (alpha) of 0.05: you call a result significant when its p-value falls below 0.05, which means accepting a 5% probability of declaring a winner when there is no real difference (a false positive, or Type I error).
What the p-value does not tell you: it does not tell you the magnitude of the difference. It does not tell you whether the difference is relevant to your business. It does not tell you the probability that variant B is better than A.
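To make the definition concrete, here is a minimal sketch of a two-sided p-value for two conversion rates, using the standard two-proportion z-test with a pooled rate. The function name and the counts are hypothetical, and the normal approximation assumes samples large enough for it to hold.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if there is no real difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a difference at least this extreme, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 500 of 10,000 users convert on A, 560 of 10,000 on B.
p_value = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"p-value: {p_value:.3f}")
```

With these illustrative counts, a 12% relative lift still lands just above the 0.05 threshold, which is a useful reminder of how much traffic small effects require.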
2. Statistical Power
Statistical power answers: "If there is a real difference, what is the probability that my test will detect it?"
The standard is 80% power, meaning that if variant B is truly better by at least the minimum effect you planned for, you have an 80% probability of detecting it. The remaining 20% is the probability of a false negative (Type II error): variant B is better but the test does not detect it.
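Power can be estimated before a test starts. The sketch below uses the usual normal approximation for a two-sided test on two proportions; the function name and inputs are illustrative assumptions, and it ignores the negligible chance of a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(base_rate: float, relative_lift: float,
                      users_per_variant: int, alpha: float = 0.05) -> float:
    """Approximate probability of detecting a true lift of the given size."""
    target_rate = base_rate * (1 + relative_lift)
    # Standard error of the difference between the two observed rates.
    std_err = sqrt((base_rate * (1 - base_rate)
                    + target_rate * (1 - target_rate)) / users_per_variant)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(target_rate - base_rate) / std_err - z_crit)


# Hypothetical plan: 5% base rate, 10% relative lift, 5,000 users per variant.
power = approximate_power(0.05, 0.10, 5_000)
print(f"power: {power:.0%}")
```

Under these assumptions the power is roughly 20%: four out of five times, a real 10% relative improvement would go undetected.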
3. Minimum Sample Size
Sample size depends on three factors:
- Base conversion rate: to detect the same relative improvement, you need more users if your current rate is 5% than if it is 30%
- Minimum detectable effect: how much improvement do you need to detect? A 1% relative improvement requires far more users than a 20% relative improvement
- Desired significance and power: the standard choices are a significance threshold of 0.05 and power of 0.80; stricter values require more users
Practical example:
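| Base rate | Minimum detectable effect | Users per variant (approx.) |
|---|---|---|
| 5% | 10% relative (5% to 5.5%) | 31,200 |
| 5% | 20% relative (5% to 6%) | 8,200 |
| 10% | 10% relative (10% to 11%) | 14,700 |
| 10% | 20% relative (10% to 12%) | 3,800 |
These figures assume a two-sided test at a 0.05 significance threshold with 80% power. Calculators differ slightly depending on the approximation they use.
If your product has 1,000 daily active users and you need about 31,000 per variant (62,000 in total), the test will take at least two months, and longer if many of those daily users are returning visitors who are already in the test. This is not a tooling problem. It is a scale problem you must consider before designing the test.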
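The three factors above combine in a standard normal-approximation formula for comparing two proportions. The following is a minimal sketch of that formula, not the exact method of any particular calculator; the function name is hypothetical, and real calculators use slightly different approximations, so expect small differences.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(base_rate: float, relative_lift: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in each variant for a two-sided test."""
    target_rate = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance requirement
    z_power = NormalDist().inv_cdf(power)           # power requirement
    variance = base_rate * (1 - base_rate) + target_rate * (1 - target_rate)
    effect = target_rate - base_rate                # absolute difference to detect
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# 10% base rate, 20% relative lift (10% to 12%).
print(users_per_variant(0.10, 0.20))
```

Because the effect size is squared in the denominator, halving the minimum detectable effect roughly quadruples the required sample.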
4. Confidence Interval
The confidence interval gives you the likely range of the real effect. "Variant B improves conversion between 2% and 8% with 95% confidence" is more useful than "variant B has a p-value of 0.03."
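As an illustration, here is a minimal sketch of a confidence interval for the absolute difference between two conversion rates (a simple Wald interval). The function name and counts are hypothetical; note that it reports the difference in percentage points, not as a relative lift.

```python
from math import sqrt
from statistics import NormalDist


def difference_confidence_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                                   confidence: float = 0.95) -> tuple[float, float]:
    """Confidence interval for (rate of B) minus (rate of A)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Unpooled standard error: each variant keeps its own observed rate.
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err


# Hypothetical counts: 500 of 10,000 convert on A, 560 of 10,000 on B.
low, high = difference_confidence_interval(500, 10_000, 560, 10_000)
print(f"difference: {low:+.2%} to {high:+.2%}")
```

With these counts the interval spans zero, so the data are compatible with both a small loss and a lift of more than one percentage point.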
How to Design a Correct A/B Test
Step 1: Formulate a Clear Hypothesis
A hypothesis is not "let us test if the green button converts better." A hypothesis is:
"Changing the CTA from 'Sign up' to 'Start for free' will increase the registration rate on the landing page by at least 10% relative, because the word 'free' reduces perceived friction of commitment."
The hypothesis must include: what you are changing, what metric you expect to change, in what direction, by how much (minimum), and why you believe it will happen.
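One way to enforce those five elements is to record every hypothesis in a fixed structure before the test is built. The class and field names below are hypothetical, a sketch of the idea rather than a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    change: str                 # what you are changing
    metric: str                 # the metric you expect to move
    direction: str              # "increase" or "decrease"
    min_relative_effect: float  # smallest effect worth detecting, e.g. 0.10
    rationale: str              # why you believe it will happen

    def statement(self) -> str:
        return (
            f"{self.change} will {self.direction} {self.metric} "
            f"by at least {self.min_relative_effect:.0%} relative, "
            f"because {self.rationale}."
        )


cta_test = Hypothesis(
    change='Changing the CTA from "Sign up" to "Start for free"',
    metric="the landing page registration rate",
    direction="increase",
    min_relative_effect=0.10,
    rationale='the word "free" reduces perceived friction of commitment',
)
print(cta_test.statement())
```

A hypothesis that cannot fill every field is not ready to be tested.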
Step 2: Choose the Primary Metric (And Only One)
The primary metric is what determines whether the test is a success or failure. It must be one. If you measure five independent metrics at a 0.05 threshold, the probability of at least one false positive rises from 5% to about 23%.
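The arithmetic behind that inflation is short enough to check directly. This sketch assumes the metrics are statistically independent; real product metrics are usually correlated, which makes the true figure somewhat lower. The function name is hypothetical.

```python
def chance_of_any_false_positive(num_metrics: int, alpha: float = 0.05) -> float:
    """Probability that at least one of several independent metrics
    crosses the significance threshold by chance alone."""
    return 1 - (1 - alpha) ** num_metrics


for num_metrics in (1, 5, 10, 20):
    rate = chance_of_any_false_positive(num_metrics)
    print(f"{num_metrics:>2} metrics: {rate:.0%}")
```

The same reasoning explains why secondary metrics should inform the decision rather than determine it.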
Rules for choosing the primary metric:
- Must be directly connected to business value
- Must be influenceable by the change you are making
- Must have sufficient volume to reach significance in a reasonable time
Secondary metrics (you can have several but with caution):
- Guardrail metrics: must not worsen (for example, error rate, load time)
- Exploratory metrics: do not determine the test result but provide context
Step 3: Calculate the Required Sample Size
Before starting. Not after. Use a sample size calculator (Evan Miller, Optimizely, or any standard calculator). Enter your base rate, the minimum effect you want to detect, and the significance and power levels.
If the required sample size is larger than your available traffic in a reasonable period (2-4 weeks), you have three options: look for a larger effect (test more radical changes), switch to a metric with higher volume, or do not run the test and make the decision with other methods.
Step 4: Define the Test Duration
Minimum one full week to capture day-of-week variations. Ideally two weeks or more. Never end a test before reaching the calculated sample size, even if the results "look clear."
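Both constraints (sample size and a minimum number of full weeks) can be combined into a single planning number. The sketch below is one possible way to do it; the function name and the traffic figures are hypothetical, and it assumes every counted user is new to the test.

```python
from math import ceil


def planned_duration_days(users_per_variant: int, daily_new_users: int,
                          num_variants: int = 2, min_days: int = 7) -> int:
    """Days to run: enough for the sample size, never under the minimum,
    rounded up to whole weeks so every weekday is equally represented."""
    days_for_sample = ceil(users_per_variant * num_variants / daily_new_users)
    days = max(days_for_sample, min_days)
    return ceil(days / 7) * 7


# Hypothetical plan: 4,000 users per variant, 1,000 new users entering per day.
print(planned_duration_days(4_000, 1_000))
```

Rounding up to whole weeks avoids a test that, for example, contains two Mondays but only one Sunday.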
Step 5: Configure Randomization
Users must be assigned to variants randomly. This seems obvious but there are subtleties:
- Assignment must be persistent: a user who sees variant B today must see variant B tomorrow
- Assignment must be independent of other active tests
- Do not assign by session (a user could see both variants in different sessions)
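A common way to satisfy these requirements is deterministic hashing of a stable user identifier together with the experiment name. The sketch below assumes you have such an identifier; the function name, user IDs, and experiment name are all hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a user to a variant for one experiment."""
    # Including the experiment name gives each test its own independent split.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # The first 8 bytes of the hash act as a stable pseudo-random number.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# The same user always lands in the same variant of the same experiment.
print(assign_variant("user-42", "cta-copy"))
print(assign_variant("user-42", "cta-copy"))
```

Because the assignment is computed rather than stored, it stays persistent across sessions and devices as long as the user identifier does.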
Step 6: Execute Without Interfering
Once the test is running, do not touch it. Do not change the design of variant B midway through the test. Do not add traffic from a marketing campaign to only one variant. Do not look at results every hour.
Step 7: Analyze With Rigor
When the test reaches the calculated sample size and minimum duration, analyze:
- Does the primary metric have statistical significance?
- Does the confidence interval include the minimum effect you considered relevant?
- Did the guardrail metrics remain stable?
- Are there segment differences that merit investigation?
What to Test (And What Not To)
High-Impact Tests
Complete flows, not isolated elements. Testing a new onboarding flow against the current one has more impact potential than testing button color. High-impact tests change the user experience significantly.
Value propositions. How you communicate what your product does affects conversion more than any visual change. Test messages, not pixels.
Pricing and packaging. How you structure your plans and prices has a direct impact on conversion and revenue. But be careful: showing different prices to different users has ethical and legal implications.
Feature removal. Sometimes less is more. Test whether removing a rarely used feature improves the overall experience (less confusion, less cognitive load).
Tests That Are Probably a Waste of Time
Minor cosmetic changes. The exact button color, font size by 2px, border radius of a card. These changes rarely produce detectable effects and need enormous sample sizes.
Tests without hypotheses. "Let us test these two versions to see what happens" is not a test. It is random exploration.
Tests on low-traffic pages. If the page receives 100 visits per day, it will take months to reach significance for any reasonable effect.
Tools for A/B Testing
For Teams Starting Out
- PostHog: open source, feature flags and A/B testing integrated, statistical analysis
- GrowthBook: open source, integrates with any data source, Bayesian analysis
For Growing Teams
- Statsig: automated statistical analysis, good integration with product analytics
- Eppo: warehouse-native, works with your existing data
- VWO: visual interface for frontend tests without code
For Mature Teams
- Optimizely: complete experimentation platform, advanced statistics
- LaunchDarkly + analysis tool: separates flag management from results analysis
The choice depends on three factors: test volume, team technical resources, and budget. If you run fewer than 5 tests per month, an open source tool is sufficient.
Advanced Mistakes That Experienced Teams Make
The Multiple Testing Problem
If you run 20 tests per month at a 0.05 significance threshold, you should expect about one false positive per month even if none of the changes has any real effect. By the end of the year, you will have declared several features "winners" that improve nothing.
Solution: adjust your expectations. Not all tests will produce significant results. A ratio of 1 positive test per 5-7 tests is normal for mature teams.
The Peeking Problem in Detail
Every time you look at intermediate results and decide whether to continue, you are performing an implicit statistical test. If you look 10 times, your real false positive rate is close to 20% rather than 5%, and continuous monitoring pushes it higher still.
Solutions:
- Define duration and sample size before starting and do not look until they are reached
- Use sequential testing methods designed to allow intermediate analysis
- Use Bayesian methods, which are less sensitive to continuous monitoring, although a stopping rule defined in advance is still advisable
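The effect of peeking can be observed with a small simulation of A/A tests, where both variants share the same true conversion rate and every declared winner is therefore a false positive. This is a sketch under simple assumptions (a z-test at each look, stopping at the first significant result); all names and parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(looks: int, users_per_look: int, rate: float = 0.10,
                        alpha: float = 0.05, simulations: int = 1000,
                        seed: int = 42) -> float:
    """Share of A/A tests that are declared significant at some look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        conv_a = conv_b = n = 0
        for _look in range(looks):
            # Both variants have the same true rate, so any "win" is noise.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            pooled = (conv_a + conv_b) / (2 * n)
            std_err = sqrt(pooled * (1 - pooled) * 2 / n)
            if std_err > 0 and abs(conv_b - conv_a) / n / std_err > z_crit:
                false_positives += 1  # stopped early and declared a winner
                break
    return false_positives / simulations


# Same total sample (1,000 users per variant), analyzed once or ten times.
single_look = false_positive_rate(looks=1, users_per_look=1000)
ten_looks = false_positive_rate(looks=10, users_per_look=100)
print(f"one look at the end: {single_look:.1%}")
print(f"ten looks, stop at first win: {ten_looks:.1%}")
```

Running it shows the single-look rate staying near the nominal 5% while the ten-look rate lands well above it, even though the data and the threshold are identical.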
Simpson's Paradox
The test result changes when you segment the data. Variant B wins overall, but when you look by device, variant A wins on both mobile and desktop. Variant B only wins in the aggregate because it received a larger share of its traffic from the segment that converts better regardless of variant (mobile, in this example).
Solution: always review results by key segments (device, traffic source, new vs returning users). If results are inconsistent across segments, investigate before declaring a winner.
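A small numeric example makes the reversal easier to see. The counts below are invented purely for illustration: variant A has the higher rate on each device, yet variant B has the higher rate overall because most of its users are on the better-converting device.

```python
# Hypothetical (conversions, users) per variant and device.
results = {
    "A": {"mobile": (40, 200), "desktop": (80, 800)},
    "B": {"mobile": (152, 800), "desktop": (18, 200)},
}


def segment_rate(variant: str, segment: str) -> float:
    conversions, users = results[variant][segment]
    return conversions / users


def overall_rate(variant: str) -> float:
    conversions = sum(c for c, _ in results[variant].values())
    users = sum(u for _, u in results[variant].values())
    return conversions / users


def winner(rate_of) -> str:
    """Variant with the highest rate according to the given rate function."""
    return max(results, key=rate_of)


for segment in ("mobile", "desktop"):
    print(segment, "winner:", winner(lambda v: segment_rate(v, segment)))
print("overall winner:", winner(overall_rate))
```

A split this uneven across devices would also be a sign that randomization is not working as intended, which is worth checking before reading anything into the result.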
Novelty Effect and Primacy Effect
Novelty effect: users interact more with something new simply because it is new. The effect disappears over time. If you measure only the first two weeks, you overestimate the real impact.
Primacy effect: returning users are accustomed to the current version. Any change generates temporary friction. If you measure only the first two weeks, you underestimate the real impact.
Solution: extend test duration. Segment by new vs returning users. The real effect typically stabilizes after 2-3 weeks.
Building an Experimentation Culture
From One-Off Tests to an Experimentation Program
Isolated A/B testing generates one-off insights. An experimentation program generates cumulative learning.
Elements of a mature program:
- Prioritized experiment backlog. A list of hypotheses ordered by expected impact and implementation ease.
- Regular cadence. A minimum number of active tests per sprint or month.
- Results documentation. Each test documented with hypothesis, result, learning, and decision made.
- Learning reviews. Monthly review of what was learned from tests, including those that did not show significant results.
What to Do When You Do Not Have Enough Traffic
Most products do not have the traffic of Netflix or Booking.com. This does not mean you cannot experiment.
Alternatives to classic A/B testing:
- Qualitative tests: usability tests with 5-10 users give you insights into why something works or not, even though you cannot measure how much
- Fake door tests: measure interest in a feature before building it. Add a button that says "Coming soon" and measure how many users click
- Painted door tests: similar to fake door but with an explanation of what the feature would do and a form to express interest
- Pre/post analysis: measure metrics before and after a change. Less rigorous than an A/B test but better than measuring nothing
Conclusion
Proper A/B testing is one of the most powerful tools a product team has. And poorly done A/B testing is one of the most sophisticated forms of self-deception.
The difference between the two lies in rigor: clear hypotheses, calculated sample sizes, adequate durations, correct statistical analyses, and the discipline to accept that many tests will not produce significant results.
If your product has the necessary traffic, build a serious experimentation program. If it does not, use alternative validation methods and reserve A/B testing for decisions with the highest potential impact.
In both cases, stop declaring winners based on numbers that "look good." Data does not lie, but poorly done analyses do.