NERVICO · digital-product · 11 min read

A/B Testing for Digital Products: A Practical Guide

How to design, execute, and analyze A/B tests that generate reliable product decisions. Applied statistics, common mistakes, tools, and real cases for product teams.

Most product teams believe they do A/B testing. What they actually do is change something, look at a number, and declare a winner. That is not experimentation. It is confirmation bias with extra steps.

A proper A/B test requires a clear hypothesis, a calculated sample size, a defined execution period, and rigorous statistical analysis. Without these elements, the conclusions are no more reliable than flipping a coin. Worse still: they provide a false sense of certainty that leads to wrong decisions made with confidence.

This article explains how to do A/B testing properly. Not the simplified version from marketing tutorials. The version that generates reliable product decisions.

Why A/B Testing Matters (And Why It Is Done Poorly)

The Problem With Opinion-Based Decisions

In most product teams, important decisions are made like this: someone proposes a change, it is debated in a meeting, the person with the most seniority or charisma wins, and it is implemented. If it works, it is attributed to good intuition. If it does not work, another explanation is sought.

A/B testing eliminates this model. Instead of debating who is right, both options are tested with real users and the data decides. It sounds simple. It is not.

Why Most A/B Tests Are Invalid

Insufficient sample size. The most frequent error. A test with 200 users per variant cannot detect differences smaller than 10-15% with reasonable statistical significance. Most product improvements are not 15%. They are 2-5%. To detect them you need thousands of users.

Peeking problem. You look at results before the test ends and declare a winner as soon as you see a difference that “looks significant.” This dramatically inflates the false positive rate. If you check for significance 10 times during a test and stop at the first significant result, your error rate is not 5%. It is closer to 20%, and it keeps climbing with every additional look.

Poorly chosen metrics. You test a button color measuring clicks. The red button gets more clicks. You declare victory. But clicks do not measure conversion. Users may have clicked the red button because it was confusing, not because they wanted to buy from you.

Inadequate duration. A 3-day test is distorted by day-of-week effects. A test that does not cover at least one complete user behavior cycle (typically 2-4 weeks) produces results that do not replicate.

The Statistical Fundamentals You Need to Understand

You do not need a PhD in statistics. You need to understand four concepts.

1. Statistical Significance (P-Value)

Statistical significance answers this question: “If there were no real difference between the two variants, what is the probability of observing a difference at least as large as the one I observed?”

The industry standard is a significance threshold of 0.05: you call a result significant when the p-value falls below 0.05, which means you accept a 5% probability of declaring a winner when there is no real difference (false positive or Type I error).

What the p-value does not tell you: it does not tell you the magnitude of the difference. It does not tell you whether the difference is relevant to your business. It does not tell you the probability that variant B is better than A.
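As a rough illustration of where a p-value comes from, the sketch below runs a two-sided two-proportion z-test using only the Python standard library. The function name and the conversion counts are hypothetical, and the normal approximation it relies on assumes reasonably large samples; a testing platform may use a different method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if there is no real difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Probability of a difference at least this large, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 500 of 10,000 users convert on A, 560 of 10,000 on B.
p_value = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"p-value: {p_value:.3f}")
```

With these made-up counts, a 12% relative lift still lands slightly above the 0.05 threshold, which shows how easily a difference that "looks clear" can fail the test.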

2. Statistical Power

Statistical power answers: “If there is a real difference, what is the probability that my test will detect it?”

The standard is 80% power, meaning that if variant B is truly better, you have an 80% probability of detecting it. The remaining 20% is the probability of a false negative (Type II error): variant B is better but the test does not detect it.
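To make power concrete, here is a minimal sketch that approximates the power of a two-sided test on two conversion rates. It uses a normal approximation and ignores the negligible chance of a significant result in the wrong direction; the function name and the example rates are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(p1, p2, n_per_variant, alpha=0.05):
    """Approximate probability of detecting a true change from p1 to p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error assuming no difference (both variants convert at p1).
    se_null = sqrt(2 * p1 * (1 - p1) / n_per_variant)
    # Standard error assuming the true rates are p1 and p2.
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    return NormalDist().cdf((abs(p2 - p1) - z_alpha * se_null) / se_alt)


# Hypothetical scenario: 5% base rate, true rate of 6% for variant B.
small_test = approximate_power(0.05, 0.06, n_per_variant=200)
print(f"Power with 200 users per variant: {small_test:.0%}")
```

Under these assumptions, 200 users per variant gives well under a 15% chance of detecting a real lift from 5% to 6%, so a "no difference" result from such a test says very little.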

3. Minimum Sample Size

Sample size depends on three factors:

  • Base conversion rate: if your current rate is 5%, you need more users than if it is 30% to detect the same relative improvement
  • Minimum detectable effect: how much improvement do you need to detect? A 1% relative improvement requires far more users than a 20% relative improvement
  • Desired significance and power: the standard values are a 0.05 significance level and 0.80 power, and stricter values require more users

Practical example:

| Base rate | Minimum detectable effect | Users per variant (approx.) |
| --- | --- | --- |
| 5% | 10% relative (5% to 5.5%) | 30,200 |
| 5% | 20% relative (5% to 6%) | 7,700 |
| 10% | 10% relative (10% to 11%) | 14,300 |
| 10% | 20% relative (10% to 12%) | 3,600 |

If your product has 1,000 daily active users and you need about 30,000 per variant, you will need roughly two months to complete the test. This is not a tooling problem. It is a scale problem you must consider before designing the test.
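The three factors can be combined in a few lines of code. The sketch below uses one common normal-approximation formula for comparing two proportions with a two-sided test; online calculators may use slightly different formulas or defaults, so treat the output as an estimate. The function name and parameters are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(base_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_mde)  # rate you want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p1 * (1 - p1))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Example inputs: 10% base rate with a 10% and a 20% relative effect.
for mde in (0.10, 0.20):
    n = sample_size_per_variant(0.10, mde)
    print(f"base 10%, relative MDE {mde:.0%}: {n:,} users per variant")
```

For a 10% base rate and a 10% relative effect this returns roughly 14,300 users per variant. Halving the detectable effect roughly quadruples the requirement, which is why small improvements are so expensive to measure.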

4. Confidence Interval

The confidence interval gives you the likely range of the real effect. “Variant B improves conversion between 2% and 8% with 95% confidence” is more useful than “variant B has a p-value of 0.03.”
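A confidence interval for the difference between two conversion rates can be sketched as follows. This is the simple Wald interval based on a normal approximation, which is an assumption that works for large samples; the function name and the counts are hypothetical. Note that the interval is expressed as an absolute difference, not a relative one.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Interval for the absolute difference in conversion rate (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each variant keeps its own observed rate.
    standard_error = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * standard_error, diff + z * standard_error


# Hypothetical counts: 500 of 10,000 users convert on A, 560 of 10,000 on B.
low, high = diff_confidence_interval(500, 10_000, 560, 10_000)
print(f"95% CI for the absolute difference: [{low:+.4f}, {high:+.4f}]")
```

With these made-up counts the interval spans zero, so the data is compatible both with a small loss and with a lift of more than one percentage point. That range is the information a bare p-value hides.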

How to Design a Correct A/B Test

Step 1: Formulate a Clear Hypothesis

A hypothesis is not “let us test if the green button converts better.” A hypothesis is:

“Changing the CTA from ‘Sign up’ to ‘Start for free’ will increase the registration rate on the landing page by at least 10% relative, because the word ‘free’ reduces perceived friction of commitment.”

The hypothesis must include: what you are changing, what metric you expect to change, in what direction, by how much (minimum), and why you believe it will happen.

Step 2: Choose the Primary Metric (And Only One)

The primary metric is what determines whether the test is a success or failure. It must be one. If you measure five metrics, the probability of a false positive rises from 5% to 23%.
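The 23% figure follows from simple arithmetic if you assume the metrics are independent, which is a simplification because real metrics are usually correlated. The sketch below reproduces it and also shows the Bonferroni correction, one conservative adjustment that this article does not prescribe but that is commonly used when several metrics must be tested.

```python
def family_wise_error_rate(num_metrics, alpha=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_alpha(num_metrics, alpha=0.05):
    """Stricter per-metric threshold that keeps the overall rate near alpha."""
    return alpha / num_metrics


for metrics in (1, 5, 10):
    rate = family_wise_error_rate(metrics)
    threshold = bonferroni_alpha(metrics)
    print(f"{metrics} metrics: {rate:.1%} false positive risk, "
          f"Bonferroni threshold {threshold:.4f}")
```

The correction costs power: a stricter threshold per metric means a larger sample for the same detectable effect, which is another reason to commit to a single primary metric.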

Rules for choosing the primary metric:

  • Must be directly connected to business value
  • Must be influenceable by the change you are making
  • Must have sufficient volume to reach significance in a reasonable time

Secondary metrics (you can have several but with caution):

  • Guardrail metrics: must not worsen (for example, error rate, load time)
  • Exploratory metrics: do not determine the test result but provide context

Step 3: Calculate the Required Sample Size

Before starting. Not after. Use a sample size calculator (Evan Miller, Optimizely, or any standard calculator). Enter your base rate, the minimum effect you want to detect, and the significance and power levels.

If the required sample size is larger than your available traffic in a reasonable period (2-4 weeks), you have three options: look for a larger effect (test more radical changes), switch to a metric with higher volume, or do not run the test and make the decision with other methods.

Step 4: Define the Test Duration

Minimum one full week to capture day-of-week variations. Ideally two weeks or more. Never end a test before reaching the calculated sample size, even if the results “look clear.”
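The duration rule can be expressed as a small calculation: take the days needed to reach the sample size, enforce the one-week minimum, and round up to whole weeks so every weekday is covered equally. The sketch below is a simplification that assumes every daily user is a new, unique participant, which overstates usable traffic for products with many returning users. All names are hypothetical.

```python
from math import ceil


def planned_duration_days(users_per_variant, daily_users, variants=2, min_days=7):
    """Days to run a test, rounded up to complete weeks."""
    days_for_sample = ceil(users_per_variant * variants / daily_users)
    days = max(days_for_sample, min_days)
    # Round up to full weeks so each day of the week appears equally often.
    return ceil(days / 7) * 7


# Hypothetical plan: 14,300 users per variant with 2,000 new users per day.
print(planned_duration_days(14_300, 2_000))
```

In this hypothetical plan the sample size alone needs 15 days, so the test is scheduled for three full weeks.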

Step 5: Configure Randomization

Users must be assigned to variants randomly. This seems obvious but there are subtleties:

  • Assignment must be persistent: a user who sees variant B today must see variant B tomorrow
  • Assignment must be independent of other active tests
  • Do not assign by session (a user could see both variants in different sessions)
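One common way to satisfy these three requirements is deterministic hashing: hash the user ID together with the experiment name and use the result to pick a variant. The sketch below is one possible approach, not the method of any particular tool; the function name, the ID format, and the experiment names are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Assign a user to a variant deterministically.

    The same user always gets the same variant within an experiment, and
    including the experiment name makes assignments across experiments
    effectively independent of each other.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# The assignment depends on the user ID, never on the session.
print(assign_variant("user-42", "cta-copy"))
print(assign_variant("user-42", "pricing-page"))
```

Because nothing is stored, any service that knows the user ID and the experiment name can recompute the assignment. The trade-off is that anonymous visitors need a stable identifier, such as a long-lived cookie, or they will be re-randomized.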

Step 6: Execute Without Interfering

Once the test is running, do not touch it. Do not change the design of variant B midway through the test. Do not add traffic from a marketing campaign to only one variant. Do not look at results every hour.

Step 7: Analyze With Rigor

When the test reaches the calculated sample size and minimum duration, analyze:

  1. Does the primary metric have statistical significance?
  2. Does the confidence interval include the minimum effect you considered relevant?
  3. Did the guardrail metrics remain stable?
  4. Are there segment differences that merit investigation?

What to Test (And What Not To)

High-Impact Tests

Complete flows, not isolated elements. Testing a new onboarding flow against the current one has more impact potential than testing button color. High-impact tests change the user experience significantly.

Value propositions. How you communicate what your product does affects conversion more than any visual change. Test messages, not pixels.

Pricing and packaging. How you structure your plans and prices has a direct impact on conversion and revenue. But be careful: showing different prices to different users has ethical and legal implications.

Feature removal. Sometimes less is more. Test whether removing a rarely used feature improves the overall experience (less confusion, less cognitive load).

Tests That Are Probably a Waste of Time

Minor cosmetic changes. The exact button color, font size by 2px, border radius of a card. These changes rarely produce detectable effects and need enormous sample sizes.

Tests without hypotheses. “Let us test these two versions to see what happens” is not a test. It is random exploration.

Tests on low-traffic pages. If the page receives 100 visits per day, it will take months to reach significance for any reasonable effect.

Tools for A/B Testing

For Teams Starting Out

  • PostHog: open source, feature flags and A/B testing integrated, statistical analysis
  • GrowthBook: open source, integrates with any data source, Bayesian analysis

For Growing Teams

  • Statsig: automated statistical analysis, good integration with product analytics
  • Eppo: warehouse-native, works with your existing data
  • VWO: visual interface for frontend tests without code

For Mature Teams

  • Optimizely: complete experimentation platform, advanced statistics
  • LaunchDarkly + analysis tool: separates flag management from results analysis

The choice depends on three factors: test volume, team technical resources, and budget. If you run fewer than 5 tests per month, an open source tool is sufficient.

Advanced Mistakes That Experienced Teams Make

The Multiple Testing Problem

If you run 20 tests per month at a 0.05 threshold, expect roughly one false positive per month among the tests where the change has no real effect. By the end of the year, you will have declared as “winners” several features that improve nothing.

Solution: adjust your expectations. Not all tests will produce significant results. A ratio of 1 positive test per 5-7 tests is normal for mature teams.

The Peeking Problem in Detail

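Every time you look at intermediate results and decide whether to continue, you are performing an implicit statistical test. If you look 10 times and stop at the first significant result, your real false positive rate is roughly 20% rather than 5%, and more frequent looks push it higher still.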

Solutions:

  • Define duration and sample size before starting and do not look until they are reached
  • Use sequential testing methods designed to allow intermediate analysis
  • Use Bayesian methods, which tolerate continuous monitoring better, although stopping the moment a threshold is crossed can still inflate errors
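The inflation caused by peeking is easy to reproduce with a simulation of A/A tests, where there is no real difference and every "winner" is a false positive. The sketch below is a simplified model: each batch of traffic between two looks contributes one standard normal draw to the test statistic, which is the large-sample approximation for equally sized batches. The function name, seed, and simulation count are arbitrary choices.

```python
import random
from math import sqrt

Z_CRITICAL = 1.96  # two-sided threshold for a 0.05 significance level


def false_positive_rate(looks, simulations=5000, seed=7):
    """Share of A/A tests declared significant at any of the looks."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(simulations):
        cumulative = 0.0
        for look in range(1, looks + 1):
            # New batch of data since the previous look.
            cumulative += rng.gauss(0.0, 1.0)
            # Z-statistic on all data collected so far.
            z = cumulative / sqrt(look)
            if abs(z) > Z_CRITICAL:
                false_positives += 1
                break  # the team stops the test and declares a winner
    return false_positives / simulations


single_look = false_positive_rate(looks=1)
ten_looks = false_positive_rate(looks=10)
print(f"1 look: {single_look:.1%}   10 looks: {ten_looks:.1%}")
```

Under this model a single analysis at the end stays near the nominal 5%, while stopping at the first significant result across ten looks produces a false positive in roughly one test out of five.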

Simpson’s Paradox

The test result changes when you segment the data. Variant B wins overall, but when you look by device, variant A wins on both mobile and desktop. Variant B wins in the aggregate only because it received a larger share of mobile traffic, which in this example converts at a higher rate for both variants.

Solution: always review results by key segments (device, traffic source, new vs returning users). If results are inconsistent across segments, investigate before declaring a winner.
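A small numeric example makes the paradox easier to see. The counts below are invented for illustration: variant A has the higher conversion rate on each device, yet variant B has the higher rate once the segments are added together.

```python
# Hypothetical counts: (users, conversions) per device and variant.
data = {
    "mobile": {"A": (200, 40), "B": (800, 152)},   # A: 20%, B: 19%
    "desktop": {"A": (800, 40), "B": (200, 8)},    # A: 5%,  B: 4%
}


def rate(users, conversions):
    return conversions / users


def segment_winners(data):
    """Variant with the highest conversion rate inside each segment."""
    return {
        segment: max(variants, key=lambda name: rate(*variants[name]))
        for segment, variants in data.items()
    }


def aggregate_winner(data):
    """Variant with the highest conversion rate after pooling all segments."""
    totals = {}
    for variants in data.values():
        for name, (users, conversions) in variants.items():
            total_users, total_conversions = totals.get(name, (0, 0))
            totals[name] = (total_users + users, total_conversions + conversions)
    return max(totals, key=lambda name: rate(*totals[name]))


print("Per segment:", segment_winners(data))
print("Aggregate:", aggregate_winner(data))
```

With proper randomization the device mix should be nearly identical across variants, so a split as uneven as this one is itself a warning sign that assignment or tracking is broken.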

Novelty Effect and Primacy Effect

Novelty effect: users interact more with something new simply because it is new. The effect disappears over time. If you measure only the first two weeks, you overestimate the real impact.

Primacy effect: returning users are accustomed to the current version. Any change generates temporary friction. If you measure only the first two weeks, you underestimate the real impact.

Solution: extend test duration. Segment by new vs returning users. The real effect typically stabilizes after 2-3 weeks.

Building an Experimentation Culture

From One-Off Tests to an Experimentation Program

Isolated A/B testing generates one-off insights. An experimentation program generates cumulative learning.

Elements of a mature program:

  1. Prioritized experiment backlog. A list of hypotheses ordered by expected impact and implementation ease.
  2. Regular cadence. A minimum number of active tests per sprint or month.
  3. Results documentation. Each test documented with hypothesis, result, learning, and decision made.
  4. Learning reviews. Monthly review of what was learned from tests, including those that did not show significant results.

What to Do When You Do Not Have Enough Traffic

Most products do not have the traffic of Netflix or Booking.com. This does not mean you cannot experiment.

Alternatives to classic A/B testing:

  • Qualitative tests: usability tests with 5-10 users give you insights into why something works or not, even though you cannot measure how much
  • Fake door tests: measure interest in a feature before building it. Add a button that says “Coming soon” and measure how many users click
  • Painted door tests: similar to fake door but with an explanation of what the feature would do and a form to express interest
  • Pre/post analysis: measure metrics before and after a change. Less rigorous than an A/B test but better than measuring nothing

Conclusion

Proper A/B testing is one of the most powerful tools a product team has. And poorly done A/B testing is one of the most sophisticated forms of self-deception.

The difference between the two lies in rigor: clear hypotheses, calculated sample sizes, adequate durations, correct statistical analyses, and the discipline to accept that many tests will not produce significant results.

If your product has the necessary traffic, build a serious experimentation program. If it does not, use alternative validation methods and reserve A/B testing for decisions with the highest potential impact.

In both cases, stop declaring winners based on numbers that “look good.” Data does not lie, but poorly done analyses do.


Need help designing your experimentation program?

At NERVICO we help product teams build rigorous A/B testing programs. From metric definition to results interpretation, we can help you make product decisions based on evidence, not opinions.
