Reading Results
Understand your experiment data, know when to stop a test, and make confident decisions.
Dashboard overview
Each experiment's results page shows:
- Variant breakdown — conversion rates for each variant, side by side
- Daily metrics chart — conversion trends over time
- Statistical analysis — significance, lift, confidence intervals
- Sample size — total exposures and conversions per variant
Results auto-refresh every 30 seconds. You can also manually refresh to see the latest data.
Key metrics explained
Conversion rate
The percentage of exposed users who completed your goal action:
Conversion Rate = (Conversions / Exposures) × 100
For example, if 1,000 users saw the blue button and 50 clicked it, the conversion rate is 5.0%.
Statistical significance (p-value)
The p-value answers: "If there were no real difference between variants, how likely would we see this result by chance?"
| p-value | Meaning |
|---------|---------|
| p < 0.01 | Very strong evidence of a real difference |
| p < 0.05 | Strong evidence (industry standard threshold) |
| p < 0.10 | Weak evidence — probably need more data |
| p > 0.10 | Not significant — no reliable difference detected |
CADENCE uses a Z-test for proportions to calculate significance. This is the standard statistical test for comparing conversion rates between two groups.
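The exact calculation CADENCE runs isn't shown on this page, but the two-proportion Z-test itself is easy to sketch. The snippet below (TypeScript, with illustrative counts) computes the pooled conversion rate, the Z statistic, and a two-sided p-value via a standard polynomial approximation of the normal CDF.

```typescript
// Sketch of a two-proportion Z-test, the kind of test described above.
// Counts are illustrative; CADENCE's internal implementation may differ.

interface VariantStats {
  exposures: number;   // users who saw the variant
  conversions: number; // users who completed the goal
}

// Standard normal CDF via a polynomial approximation (accurate to ~1e-7).
function normalCdf(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989422804014327 * Math.exp((-z * z) / 2);
  const p = d * t * (0.31938153 + t * (-0.356563782 + t * (1.781477937 +
            t * (-1.821255978 + t * 1.330274429))));
  return z >= 0 ? 1 - p : p;
}

function twoProportionZTest(control: VariantStats, treatment: VariantStats): number {
  const p1 = control.conversions / control.exposures;
  const p2 = treatment.conversions / treatment.exposures;
  // Pooled rate: the null hypothesis assumes both variants share one true conversion rate.
  const pooled = (control.conversions + treatment.conversions) /
                 (control.exposures + treatment.exposures);
  const se = Math.sqrt(pooled * (1 - pooled) *
                       (1 / control.exposures + 1 / treatment.exposures));
  const z = (p2 - p1) / se;
  // Two-sided p-value: probability of a |Z| at least this large by chance.
  return 2 * (1 - normalCdf(Math.abs(z)));
}

// Example: 1,000 exposures per variant, 50 vs 68 conversions.
const pValue = twoProportionZTest(
  { exposures: 1000, conversions: 50 },
  { exposures: 1000, conversions: 68 },
);
console.log(pValue.toFixed(3)); // "0.088", not below the 0.05 threshold
```

The pooled rate appears in the standard error because the null hypothesis treats both variants as draws from a single true conversion rate.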
What p < 0.05 means
A p-value below 0.05 means that, if there were no real difference, a result at least this extreme would occur less than 5% of the time by chance. It does NOT mean there's a 95% chance the treatment is better — that's a common misconception.
Lift percentage
How much better (or worse) the treatment is compared to control:
Lift = ((Treatment Rate - Control Rate) / Control Rate) × 100
A lift of +15% means the treatment's conversion rate is 15% higher than control. A negative lift means control is winning.
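As a quick sanity check, the lift formula takes only a couple of lines; the counts below are illustrative.

```typescript
// Illustrative lift calculation using the formula above.
const controlRate = 50 / 1000;   // 5.0% conversion rate
const treatmentRate = 58 / 1000; // 5.8% conversion rate

const lift = ((treatmentRate - controlRate) / controlRate) * 100;
console.log(`${lift > 0 ? "+" : ""}${lift.toFixed(1)}%`); // "+16.0%"
```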
Confidence intervals
The range of plausible values for the true lift. For example:
Lift: +12.3% (95% CI: +4.1% to +20.5%)
This means we're 95% confident the true improvement is somewhere between 4.1% and 20.5%. Narrower intervals mean more precise estimates, and the way to narrow them is to collect more data.
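CADENCE computes these intervals for you, but a rough sketch helps build intuition. The snippet below forms a Wald-style 95% interval on the absolute difference in conversion rates, then divides by the control rate to express it as relative lift. This ignores uncertainty in the control rate itself, so treat it as an approximation rather than CADENCE's exact method; all counts are illustrative.

```typescript
// Rough 95% interval for relative lift (approximation; exact method may differ).
function liftInterval(
  controlConv: number, controlExp: number,
  treatConv: number, treatExp: number,
): { lift: number; low: number; high: number } {
  const p1 = controlConv / controlExp;
  const p2 = treatConv / treatExp;
  // Unpooled standard error of the difference in proportions (Wald interval).
  const se = Math.sqrt((p1 * (1 - p1)) / controlExp + (p2 * (1 - p2)) / treatExp);
  const margin = 1.96 * se; // 1.96 is the z value for a 95% two-sided interval
  const toLift = (diff: number) => (diff / p1) * 100; // express as % relative to control
  return {
    lift: toLift(p2 - p1),
    low: toLift(p2 - p1 - margin),
    high: toLift(p2 - p1 + margin),
  };
}

console.log(liftInterval(1000, 20000, 1120, 20000));
// ≈ { lift: 12, low: 3.2, high: 20.8 }: still a wide range, so more data would help
```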
Sample size
The number of users exposed to each variant. CADENCE shows both:
- Exposures — users who saw the variant (from `getVariant()` calls)
- Conversions — users who completed the goal (from `trackConversion()` calls)
When to stop a test
A test should run until ALL of these conditions are met (see the sketch after this list):
- Statistical significance reached — p-value below 0.05
- Minimum sample size — at least 100 conversions per variant (not just exposures)
- Full business cycle — at least 7 days to account for day-of-week effects
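For intuition, the three conditions above collapse into a single check like the sketch below. The thresholds are the ones listed in this guide; substitute whatever criteria you pre-set for your own test.

```typescript
// Sketch of the three stopping conditions above as a single check.
// Thresholds come from this guide; adjust them to your own pre-set criteria.

interface TestStatus {
  pValue: number;
  conversionsPerVariant: number[]; // conversions (not exposures) for each variant
  daysRunning: number;
}

function meetsStoppingCriteria(status: TestStatus): boolean {
  const significant = status.pValue < 0.05;
  const enoughConversions = status.conversionsPerVariant.every((c) => c >= 100);
  const fullBusinessCycle = status.daysRunning >= 7;
  return significant && enoughConversions && fullBusinessCycle;
}

// Note: reaching your pre-set end date without significance is also a valid
// endpoint (see "What if I don't reach significance?" below).

// Example: significant, but only 4 days in, so keep the test running.
console.log(meetsStoppingCriteria({
  pValue: 0.03,
  conversionsPerVariant: [140, 162],
  daysRunning: 4,
})); // false
```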
Don't stop tests early
Stopping a test the moment it reaches significance is a form of p-hacking. Early results are volatile — a test that looks significant on day 3 may not be on day 7. Pre-set your test duration and stick to it.
What if I don't reach significance?
If the test isn't significant after its planned duration, that's still a valid result. It means one of two things:
- There's no meaningful difference between variants
- The difference is too small to detect with your sample size
Both are useful information. Document the result and move on.
Interpreting results
Clear winner
- p-value < 0.05
- Lift is meaningfully positive (e.g., > 5%)
- Consistent trend over time in the daily chart
Action: Implement the winning variant. Document the learnings. Archive the test.
No clear winner
- p-value > 0.10 after full test duration
- Lift is near zero (e.g., -2% to +2%)
Action: Keep the simpler variant (usually control). The change doesn't matter enough to justify the complexity. Move on to a higher-impact test.
Surprising loser
- p-value < 0.05
- Lift is meaningfully negative
Action: Don't implement the treatment. This is valuable — you learned what NOT to do. Document why you think the treatment underperformed.
Borderline result
- p-value between 0.05 and 0.10
- Moderate lift (e.g., +5% to +10%)
Action: Consider extending the test if you have enough traffic. If not, treat it as inconclusive and make a judgment call based on the direction of the effect.
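If it helps to see the four outcomes side by side, they map onto a rough decision rule like the sketch below. The thresholds are the illustrative ones from this section, not hard rules, and they don't replace looking at the daily chart.

```typescript
// Rough decision rule mirroring the four outcomes above.
// Thresholds are the illustrative ones from this section, not hard rules.
type Outcome = "clear winner" | "surprising loser" | "borderline" | "no clear winner";

function interpret(pValue: number, liftPct: number): Outcome {
  if (pValue < 0.05 && liftPct > 5) return "clear winner";
  if (pValue < 0.05 && liftPct < 0) return "surprising loser";
  if (pValue >= 0.05 && pValue < 0.1 && liftPct >= 5) return "borderline";
  return "no clear winner";
}

console.log(interpret(0.032, 12.3)); // "clear winner"
console.log(interpret(0.08, 7));     // "borderline"
console.log(interpret(0.4, 1.5));    // "no clear winner"
```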
Communicating results
Raw p-values don't resonate with leadership. CADENCE's Impact View translates your test results into business language — estimated revenue, users retained, cumulative program value. If you're sharing results beyond your team, start there.
Impact View for executives
Instead of "p = 0.032 with +12% lift," Impact View shows "this test will generate an estimated $45K ARR." See the full Impact View guide to set it up.
Common mistakes
1. Stopping too early
The #1 mistake. Tests need time and data to produce reliable results. Set a minimum duration of 7 days and a minimum sample size before you start checking results.
2. P-hacking
Checking results repeatedly and stopping when you see significance. The more you check, the more likely you'll see a false positive. Set your criteria upfront and check at the end.
3. Multiple comparisons
Running 10 A/B/C/D tests simultaneously increases the chance of false positives, because every extra variant is another comparison. With 20 comparisons at p < 0.05, you'd expect 1 false positive by chance alone.
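The arithmetic behind that claim is simple, and it's worth seeing how quickly the risk grows; the sketch below assumes the comparisons are independent.

```typescript
// Expected false positives, and the chance of at least one, at a 0.05 threshold.
// Assumes the 20 comparisons are independent.
const alpha = 0.05;
const comparisons = 20;

const expectedFalsePositives = comparisons * alpha;              // 1
const chanceOfAtLeastOne = 1 - Math.pow(1 - alpha, comparisons); // ≈ 0.64

console.log(expectedFalsePositives);        // 1
console.log(chanceOfAtLeastOne.toFixed(2)); // "0.64"
```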
4. Ignoring practical significance
A result can be statistically significant but practically meaningless. A 0.1% lift with p = 0.03 isn't worth implementing if it adds complexity.
5. Not documenting results
Every test — win, lose, or draw — is a learning. Document what you tested, what happened, and what you learned. Your team's testing knowledge compounds over time.
What to do after a test
- Record the result — Update the experiment status in CADENCE (implement winner or archive)
- Share with the team — Use Impact View exports for leadership reviews
- Iterate — If the treatment won, test a more extreme version. If it lost, test a different approach.
- Update your backlog — Use results to inform your next testing priorities
Next steps
- Impact View — Communicate results to leadership in business terms
- Creating Tests — Set up your next experiment
- Event Tracking — Deep dive into the event system
- Dashboard Guide — Navigate workspaces, projects, and experiments