Product Analytics
Frameworks for turning raw product data into ship/extend/kill decisions. Covers A/B testing, cohort retention, funnel analysis, and the statistical foundations needed to make those decisions with confidence.
Quick Reference
Category Rules Impact When to Use
A/B Test Evaluation 1 HIGH Comparing variants, measuring significance, shipping decisions
Cohort Retention 1 HIGH Feature adoption curves, day-N retention, engagement scoring
Funnel Analysis 1 HIGH Drop-off diagnosis, conversion optimization, stage mapping
Statistical Foundations 1 HIGH p-value interpretation, sample sizing, confidence intervals
Total: 4 rules across 4 categories
A/B Test Evaluation
Load rules/ab-test-evaluation.md for the full framework. Quick pattern:
Experiment: [Name]
Hypothesis: If we [change], then [primary metric] will [direction] by [amount] because [evidence or reasoning].
Sample size: [N per variant] — calculated for MDE=[X%], power=80%, alpha=0.05 Duration: [Minimum weeks] — never stop early (peeking bias)
Results: Control: [metric value] n=[count] Treatment: [metric value] n=[count] Lift: [+/- X%] p=[value] 95% CI: [lower, upper]
Decision: SHIP / EXTEND / KILL Rationale: [One sentence grounded in numbers, not gut feel]
Decision rules:
-
SHIP — p < 0.05, CI excludes zero, no guardrail regressions
-
EXTEND — trending positive but underpowered (add runtime, not reanalysis)
-
KILL — null result or guardrail degradation
See rules/ab-test-evaluation.md for sample size formulas, SRM checks, and pitfall list.
Cohort Retention
Load rules/cohort-retention.md for full methodology. Quick pattern:
-- Day-N retention cohort query SELECT DATE_TRUNC('week', first_seen) AS cohort_week, COUNT(DISTINCT user_id) AS cohort_size, COUNT(DISTINCT CASE WHEN activity_date = first_seen + INTERVAL '7 days' THEN user_id END) * 100.0 / COUNT(DISTINCT user_id) AS day_7_retention FROM user_activity GROUP BY 1 ORDER BY 1;
Retention benchmarks (SaaS):
-
Day 1: 40–60% is healthy
-
Day 7: 20–35% is healthy
-
Day 30: 10–20% is healthy
-
Flat curve after day 30 = product-market fit signal
See rules/cohort-retention.md for behavior-based cohorts, feature adoption curves, and engagement scoring.
Funnel Analysis
Load rules/funnel-analysis.md for full methodology. Quick pattern:
Funnel: [Name] — [Date Range]
Stage 1: [Aware / Land] → [N] users (entry) Stage 2: [Activate / Sign] → [N] users ([X]% from stage 1) Stage 3: [Engage / Use] → [N] users ([X]% from stage 2) ← biggest drop Stage 4: [Convert / Pay] → [N] users ([X]% from stage 3)
Overall conversion: [X]% Biggest drop-off: Stage 2→3 ([X]% loss) — investigate first
Optimization order: Fix the largest drop-off first. A 5-point improvement at a high-volume step is worth more than a 20-point improvement at a low-volume step.
See rules/funnel-analysis.md for segmented funnels, micro-conversion tracking, and prioritization patterns.
Statistical Foundations
Plain-English explanations of the stats every PM needs. Load references/stats-cheat-sheet.md for formulas and quick lookups.
p-value in plain English: The probability that you would see a result this extreme (or more extreme) if the change had zero effect. p=0.03 means a 3% chance you're looking at random noise. It does NOT mean "97% probability the change works."
Confidence interval in plain English: The range where the true effect probably lives. "Lift = +8%, 95% CI [+2%, +14%]" means you are fairly confident the real lift is somewhere between 2% and 14%. If the CI includes zero, you cannot claim a win.
Minimum Detectable Effect (MDE): The smallest lift you care about detecting. Setting MDE too small forces impractically large sample sizes. Anchor MDE to business value — if a 2% lift is not worth shipping, set MDE = 5%.
Statistical vs practical significance: A result can be statistically significant (p < 0.05) but practically meaningless (lift = 0.01%). Always check both. A 0.01% lift that costs 6 weeks of eng time is not a win.
Common Pitfalls
-
Peeking — stopping an experiment early because results look good inflates false-positive rate. Commit to a runtime before launch.
-
Multiple comparisons — testing 10 metrics at p < 0.05 means ~1 false positive by chance. Apply Bonferroni correction or pre-register your primary metric.
-
Sample Ratio Mismatch (SRM) — if variant group sizes differ from expected split by > 1%, your experiment is broken. Fix before analyzing results.
-
Novelty effect — new features get inflated engagement in week 1. Run experiments long enough to see settled behavior (minimum 2 full business cycles).
-
Simpson's paradox — aggregate results can reverse when segmented. Always check results by key segments (device, plan tier, geography).
Ship / Extend / Kill Framework
Signal Decision Action
p < 0.05, CI excludes zero, guardrails green SHIP Full rollout, update success metrics
Positive trend, underpowered (p = 0.10–0.15) EXTEND Add runtime, do not peek again
p > 0.15, flat or negative KILL Revert, document learnings, re-hypothesize
Guardrail regression, any p-value KILL Immediate revert regardless of primary metric
SRM detected INVALID Fix assignment bug, restart experiment
Related Skills
-
ork:product-frameworks — OKRs, KPI trees, RICE prioritization, PRD templates
-
ork:metrics-instrumentation — Event naming, metric definition, alerting setup
-
ork:brainstorm — Generate hypotheses and experiment ideas
-
ork:assess — Evaluate product quality and risks
References
-
rules/ab-test-evaluation.md — Hypothesis, sample size, significance, decision matrix
-
rules/cohort-retention.md — Cohort types, retention curves, SQL patterns
-
rules/funnel-analysis.md — Stage mapping, drop-off identification, optimization
-
references/stats-cheat-sheet.md — Formulas, test selection, power analysis
Version: 1.0.0 (March 2026)