Reading Results

Understand your experiment data, know when to stop a test, and make confident decisions.

Dashboard overview

Each experiment's results page shows:

  • Variant breakdown — conversion rates for each variant, side by side
  • Daily metrics chart — conversion trends over time
  • Statistical analysis — significance, lift, confidence intervals
  • Sample size — total exposures and conversions per variant

Results auto-refresh every 30 seconds. You can also manually refresh to see the latest data.

Key metrics explained

Conversion rate

The percentage of exposed users who completed your goal action:

Conversion Rate = (Conversions / Exposures) × 100

For example, if 1,000 users saw the blue button and 50 clicked it, the conversion rate is 5.0%.
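
If you want to reproduce that number outside the dashboard, the arithmetic is a one-liner (a standalone sketch, not part of the CADENCE SDK):

```typescript
// Conversion rate as a percentage: (conversions / exposures) * 100
function conversionRate(conversions: number, exposures: number): number {
  return (conversions / exposures) * 100;
}

conversionRate(50, 1000); // 5.0, matching the blue-button example above
```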

Statistical significance (p-value)

The p-value answers: "If there were no real difference between variants, how likely would we see this result by chance?"

| p-value | Meaning |
|---------|---------|
| p < 0.01 | Very strong evidence of a real difference |
| p < 0.05 | Strong evidence (industry standard threshold) |
| p < 0.10 | Weak evidence — probably need more data |
| p > 0.10 | Not significant — no reliable difference detected |

CADENCE uses a Z-test for proportions to calculate significance. This is the standard statistical test for comparing conversion rates between two groups.
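
You don't need to run this test yourself, but if you want to sanity-check a dashboard number, a standard two-proportion Z-test can be sketched as follows. The normal CDF here uses the Abramowitz-Stegun approximation, so the p-value may differ from the dashboard's in the last decimal places:

```typescript
// Two-proportion Z-test: is the difference between two conversion rates
// statistically significant?
function twoProportionZTest(
  controlConversions: number, controlExposures: number,
  treatmentConversions: number, treatmentExposures: number
): { z: number; pValue: number } {
  const pControl = controlConversions / controlExposures;
  const pTreatment = treatmentConversions / treatmentExposures;

  // Pooled proportion and standard error under the "no difference" hypothesis
  const pPooled =
    (controlConversions + treatmentConversions) /
    (controlExposures + treatmentExposures);
  const se = Math.sqrt(
    pPooled * (1 - pPooled) * (1 / controlExposures + 1 / treatmentExposures)
  );

  const z = (pTreatment - pControl) / se;
  const pValue = 2 * (1 - standardNormalCdf(Math.abs(z))); // two-sided
  return { z, pValue };
}

// Standard normal CDF via the Abramowitz-Stegun erf approximation
// (accurate to roughly 1e-7, close enough for reading test results).
function standardNormalCdf(x: number): number {
  const y = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * y);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) *
      t +
      0.254829592) *
    t;
  const erf = 1 - poly * Math.exp(-y * y);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

twoProportionZTest(50, 1000, 72, 1000); // { z: ≈2.06, pValue: ≈0.04 }
```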

What p < 0.05 means

A p-value below 0.05 means that, if there were truly no difference between variants, a difference at least this large would show up less than 5% of the time by chance. It does NOT mean there's a 95% chance the treatment is better — that's a common misconception.

Lift percentage

How much better (or worse) the treatment is compared to control:

Lift = ((Treatment Rate - Control Rate) / Control Rate) × 100

A lift of +15% means the treatment's conversion rate is 15% higher than control. A negative lift means control is winning.
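
The same calculation in code (a standalone sketch; pass both rates in the same unit):

```typescript
// Relative lift of treatment over control, as a percentage.
// Rates can be proportions (0.05) or percentages (5.0), as long as both match.
function lift(treatmentRate: number, controlRate: number): number {
  return ((treatmentRate - controlRate) / controlRate) * 100;
}

lift(5.75, 5.0); // 15: the treatment converts 15% better than control
```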

Confidence intervals

The range of plausible values for the true lift. For example:

Lift: +12.3% (95% CI: +4.1% to +20.5%)

This means we're 95% confident the true improvement is somewhere between 4.1% and 20.5%. Narrower intervals mean more precise estimates, and the way to narrow them is to collect more data.
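
CADENCE reports the interval for you. If you want to reproduce it approximately, one common approach is a delta-method (Wald) interval for the ratio of conversion rates; the sketch below uses that approach and may not match the dashboard's exact method:

```typescript
// Approximate 95% confidence interval for the relative lift, using a
// delta-method (Wald) interval for the ratio of conversion rates.
function liftWith95CI(
  controlConversions: number, controlExposures: number,
  treatmentConversions: number, treatmentExposures: number
): { lift: number; lower: number; upper: number } {
  const pC = controlConversions / controlExposures;
  const pT = treatmentConversions / treatmentExposures;
  const ratio = pT / pC;

  // Variance of each rate, then of their ratio via the delta method
  const varC = (pC * (1 - pC)) / controlExposures;
  const varT = (pT * (1 - pT)) / treatmentExposures;
  const seRatio = ratio * Math.sqrt(varT / (pT * pT) + varC / (pC * pC));

  const z95 = 1.96; // critical value for 95% confidence
  return {
    lift: (ratio - 1) * 100,
    lower: (ratio - z95 * seRatio - 1) * 100,
    upper: (ratio + z95 * seRatio - 1) * 100,
  };
}

liftWith95CI(500, 10000, 560, 10000);
// ≈ { lift: 12.0, lower: -1.1, upper: 25.1 }: the interval crosses zero,
// so this result is not yet conclusive.
```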

Sample size

The number of users exposed to each variant. CADENCE shows both:

  • Exposures — users who saw the variant (from getVariant() calls)
  • Conversions — users who completed the goal (from trackConversion() calls)
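
For the two counts to line up, the exposure call and the conversion call must reference the same experiment and the same user. A minimal sketch, assuming a TypeScript SDK client; the import path, client object, and exact method signatures here are placeholders, so check your SDK reference:

```typescript
// Placeholder import: your real SDK package name and client setup will differ.
import { cadence } from "./cadence-client";

const experimentKey = "checkout-button-color"; // example experiment key
const userId = "user-123";                     // example user identifier

// Exposure: assigning a variant counts this user toward that variant's exposures.
const variant = cadence.getVariant(experimentKey, userId);
// ...render the experience for `variant`...

// Conversion: when the same user completes the goal action, credit the same
// experiment with the same user ID so the two counts can be joined.
function onPurchaseCompleted(): void {
  cadence.trackConversion(experimentKey, userId);
}
```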

When to stop a test

A test should run until ALL of these conditions are met:

  1. Statistical significance reached — p-value below 0.05
  2. Minimum sample size — at least 100 conversions per variant (not just exposures)
  3. Full business cycle — at least 7 days to account for day-of-week effects
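
If you want to encode those conditions as a guardrail in your own tooling, a quick check might look like the sketch below, using the thresholds listed above (adjust them to your pre-registered plan):

```typescript
// Guardrail: a test may stop only when all three conditions hold.
interface VariantStats {
  exposures: number;
  conversions: number;
}

function readyToStop(
  control: VariantStats,
  treatment: VariantStats,
  pValue: number,      // from the dashboard (or the Z-test sketch above)
  daysRunning: number
): boolean {
  const significant = pValue < 0.05;
  const enoughConversions =
    control.conversions >= 100 && treatment.conversions >= 100;
  const fullBusinessCycle = daysRunning >= 7;
  return significant && enoughConversions && fullBusinessCycle;
}
```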

Don't stop tests early

Stopping a test the moment it reaches significance is a form of p-hacking. Early results are volatile — a test that looks significant on day 3 may not be on day 7. Pre-set your test duration and stick to it.

What if I don't reach significance?

If after your planned duration the test isn't significant, that's a valid result. It means either:

  • There's no meaningful difference between variants
  • The difference is too small to detect with your sample size

Both are useful information. Document the result and move on.

Interpreting results

Clear winner

  • p-value < 0.05
  • Lift is meaningfully positive (e.g., > 5%)
  • Consistent trend over time in the daily chart

Action: Implement the winning variant. Document the learnings. Archive the test.

No clear winner

  • p-value > 0.10 after full test duration
  • Lift is near zero (e.g., -2% to +2%)

Action: Keep the simpler variant (usually control). The change doesn't matter enough to justify the complexity. Move on to a higher-impact test.

Surprising loser

  • p-value < 0.05
  • Lift is meaningfully negative

Action: Don't implement the treatment. This is valuable — you learned what NOT to do. Document why you think the treatment underperformed.

Borderline result

  • p-value between 0.05 and 0.10
  • Moderate lift (e.g., +5% to +10%)

Action: Consider extending the test if you have enough traffic. If not, treat it as inconclusive and make a judgment call based on the direction of the effect.

Communicating results

Raw p-values don't resonate with leadership. CADENCE's Impact View translates your test results into business language — estimated revenue, users retained, cumulative program value. If you're sharing results beyond your team, start there.

Impact View for executives

Instead of "p = 0.032 with +12% lift," Impact View shows "this test will generate an estimated $45K ARR." See the full Impact View guide to set it up.

Common mistakes

1. Stopping too early

The #1 mistake. Tests need time and data to produce reliable results. Set a minimum duration of 7 days and a minimum sample size up front, before you start checking results.

2. P-hacking

Checking results repeatedly and stopping when you see significance. The more you check, the more likely you'll see a false positive. Set your criteria upfront and check at the end.

3. Multiple comparisons

Running 10 A/B/C/D tests simultaneously increases the chance of false positives. With 20 comparisons each evaluated at p < 0.05, you'd expect about one false positive by chance alone (20 × 0.05 = 1).

4. Ignoring practical significance

A result can be statistically significant but practically meaningless. A 0.1% lift with p = 0.03 isn't worth implementing if it adds complexity.

5. Not documenting results

Every test — win, lose, or draw — is a learning. Document what you tested, what happened, and what you learned. Your team's testing knowledge compounds over time.

What to do after a test

  1. Record the result — Update the experiment status in CADENCE (implement winner or archive)
  2. Share with the team — Use Impact View exports for leadership reviews
  3. Iterate — If the treatment won, test a more extreme version. If it lost, test a different approach.
  4. Update your backlog — Use results to inform your next testing priorities

Next steps