Comparing Groups in Your Data: A/B Tests, Segments, and Cohorts
Use DataStoryBot to analyze group comparisons: test vs. control, region vs. region, cohort vs. cohort — with narrative explanations of what the differences mean.
Most interesting data questions are comparison questions. Is the new checkout flow better than the old one? Do enterprise customers behave differently from SMBs? Is the January cohort retaining better than December's?
A single number in isolation is useless. Revenue was $2.3M. Good? Bad? You can't tell without a comparison — to last quarter, to the forecast, to the other product line. Comparison creates meaning.
This article shows how to use DataStoryBot to run group comparisons on your CSV data. Upload a file with group labels (test/control, region, cohort, plan tier), steer the analysis toward comparison, and get back a narrative that explains what the differences mean and whether they matter.
What Makes a Good Comparison
A valid comparison needs three things:
Clear groups. The data must have a column that defines the groups — a treatment flag for A/B tests, a region column, a signup date for cohorts. If groups aren't explicit in the data, you'll need to define them in the steering prompt.
A metric to compare. Revenue, conversion rate, time-to-completion, retention at day 30. The metric should be meaningful for the business question you're asking.
Enough data per group. Comparing two groups of 5 users each tells you nothing. Statistical significance requires adequate sample size. DataStoryBot's Code Interpreter runs significance tests when the data supports them — but it can't manufacture statistical power from small samples.
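You can sanity-check that last requirement before uploading. A minimal pandas sketch (the `variant` column name and the 100-user threshold are illustrative, not anything DataStoryBot requires):

```python
import pandas as pd

def check_group_sizes(df, group_col, min_n=100):
    """Return per-group sample sizes and flag groups below a minimum."""
    sizes = df[group_col].value_counts()
    too_small = sizes[sizes < min_n]
    return sizes.to_dict(), sorted(too_small.index)

# Synthetic example: a lopsided test with an underpowered treatment arm
df = pd.DataFrame({"variant": ["control"] * 150 + ["treatment"] * 40})
sizes, flagged = check_group_sizes(df, "variant", min_n=100)
```

Here `flagged` would contain `"treatment"`, telling you the comparison will be underpowered before you spend time interpreting it.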
A/B Test Analysis
The most common comparison: treatment vs. control.
Upload Your Test Data
curl -X POST https://datastory.bot/api/upload \
-F "file=@checkout_ab_test.csv"
Typical A/B test CSV structure:
user_id,variant,converted,revenue,session_duration_sec
u001,control,0,0,145
u002,treatment,1,89.50,203
u003,treatment,1,124.00,178
u004,control,1,67.00,156
...
Steer Toward Comparison
curl -X POST https://datastory.bot/api/analyze \
-H "Content-Type: application/json" \
-d '{
"containerId": "ctr_abc123",
"steeringPrompt": "This is an A/B test dataset. Compare the treatment group against the control group on conversion rate and revenue. Run a statistical significance test and report whether the difference is significant at p<0.05."
}'
The steering prompt does three things: identifies the comparison structure (treatment vs. control), specifies the metrics (conversion rate, revenue), and requests statistical rigor (significance test).
What You Get Back
[
{
"id": 1,
"title": "Treatment Increases Conversion Rate by 3.2 Percentage Points (p=0.003)",
"summary": "The treatment group converted at 18.7% vs. 15.5% for control — a 3.2pp lift that is statistically significant (chi-squared test, p=0.003, n=12,847). Revenue per user is also higher: $14.20 vs. $11.80 (t-test, p=0.011)."
},
{
"id": 2,
"title": "Treatment Effect Is Stronger for Mobile Users",
"summary": "Segmenting by device type, the treatment effect on conversion is 4.8pp for mobile users but only 1.1pp for desktop. The mobile lift is significant (p<0.001); the desktop lift is not (p=0.34)."
}
]
DataStoryBot doesn't just compute the means — it runs the appropriate statistical test (chi-squared for proportions, t-test for continuous metrics), reports the p-value, and identifies interaction effects across subgroups. The narrative tells you what the numbers mean, not just what they are.
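DataStoryBot's internal computation isn't exposed, but the same two tests are easy to reproduce locally with scipy if you want to verify a result. A sketch on synthetic counts (all numbers here are illustrative):

```python
import numpy as np
from scipy import stats

# Two-proportion comparison via chi-squared test on a 2x2 table
#           converted  not converted
table = [[1200, 5200],   # treatment: 18.8% of 6,400 users
         [ 990, 5410]]   # control:   15.5% of 6,400 users
chi2, p_prop, dof, _ = stats.chi2_contingency(table)

# Welch's t-test (equal_var=False) on a continuous metric like revenue
rng = np.random.default_rng(0)
treatment_rev = rng.normal(14.2, 5.0, 1000)  # synthetic revenue per user
control_rev = rng.normal(11.8, 5.0, 1000)
t, p_rev = stats.ttest_ind(treatment_rev, control_rev, equal_var=False)
```

Welch's version of the t-test is the safer default because it doesn't assume the two groups have equal variance.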
The Full Narrative
curl -X POST https://datastory.bot/api/refine \
-H "Content-Type: application/json" \
-d '{
"containerId": "ctr_abc123",
"selectedStoryTitle": "Treatment Increases Conversion Rate by 3.2 Percentage Points (p=0.003)"
}'
The refined narrative includes:
- The headline result with statistical test details
- Effect size and confidence intervals
- Sample size per group
- Subgroup breakdowns (if the data supports them)
- Charts: conversion rate comparison bar chart, revenue distribution by group, and often a segment-level breakdown
Regional Comparison
Not every comparison is an experiment. Often you're comparing naturally occurring segments — regions, product lines, customer tiers.
import requests
BASE_URL = "https://datastory.bot/api"
# Upload
with open("sales_by_region.csv", "rb") as f:
upload = requests.post(f"{BASE_URL}/upload", files={"file": f})
container_id = upload.json()["containerId"]
# Analyze
stories = requests.post(f"{BASE_URL}/analyze", json={
"containerId": container_id,
"steeringPrompt": (
"Compare performance across regions. For each region, "
"analyze revenue, order count, average order value, "
"and customer count. Identify which regions outperform "
"or underperform the company average and explain why."
)
})
Regional comparisons are trickier than A/B tests because the groups aren't randomized. The West region might outperform because it has more customers, not because it's "better." DataStoryBot's narrative accounts for this by normalizing metrics (revenue per customer, rather than total revenue) and noting confounding factors.
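The normalization itself is a one-liner worth understanding. A sketch with made-up numbers showing why per-customer metrics can reverse a total-revenue ranking:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":   ["West", "West", "East", "East", "East"],
    "customer": ["c1", "c2", "c3", "c4", "c5"],
    "revenue":  [500, 300, 400, 350, 300],
})
per_region = sales.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    customers=("customer", "nunique"),
)
per_region["revenue_per_customer"] = (
    per_region["total_revenue"] / per_region["customers"]
)
```

East "wins" on total revenue (1,050 vs. 800) but West wins per customer (400 vs. 350). Which comparison matters depends on the business question, which is exactly what the steering prompt should spell out.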
Cohort Analysis
Cohort comparisons answer: "Are newer users behaving differently from older ones?"
stories = requests.post(f"{BASE_URL}/analyze", json={
"containerId": container_id,
"steeringPrompt": (
"This is user-level data with a signup_date column. "
"Create monthly cohorts based on signup_date and compare "
"them on: 30-day retention rate, average revenue per user, "
"and feature adoption rate. Identify which cohorts are "
"strongest and whether there's a trend over time."
)
})
Cohort analysis requires the Code Interpreter to construct the cohorts from raw data — grouping users by signup month, computing metrics for each cohort at equivalent time periods, and building the classic cohort retention table. This is exactly the kind of multi-step computation that Code Interpreter handles well.
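The core of that construction is straightforward in pandas. A minimal sketch, assuming user-level data where a day-30 activity flag has already been computed (the column names are hypothetical):

```python
import pandas as pd

users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "signup_date": pd.to_datetime(
        ["2026-01-05", "2026-01-20", "2026-02-03", "2026-02-14"]),
    "active_day_30": [1, 0, 1, 1],  # 1 if still active 30 days after signup
})

# Bucket users into monthly cohorts, then compute retention per cohort
users["cohort"] = users["signup_date"].dt.to_period("M")
retention = users.groupby("cohort")["active_day_30"].agg(["mean", "size"])
```

The `size` column matters as much as the retention rate: a cohort of 4 users tells you nothing, which is why the narrative reports sample sizes alongside the percentages.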
The narrative output might look like:
The January 2026 cohort retains 12 percentage points better at day 30 than the October 2025 cohort (44% vs. 32%). This improvement coincides with the onboarding redesign shipped in late December. Cohorts before December show flat retention (30-33%), while January through March show a step change (42-46%). The effect is strongest in the first 7 days — suggesting the new onboarding reduces early churn rather than improving long-term engagement.
Before/After Comparison
When you don't have a control group but you do have a clear intervention date:
stories = requests.post(f"{BASE_URL}/analyze", json={
"containerId": container_id,
"steeringPrompt": (
"Compare metrics before and after March 1, 2026. "
"This date marks a major product change. Analyze "
"whether key metrics (conversion rate, session duration, "
"support tickets) changed significantly after the intervention. "
"Use the pre-period as the baseline and test for significance."
)
})
Before/after analysis is weaker than A/B testing — you can't separate the intervention effect from other factors that changed at the same time. DataStoryBot's narrative will note this limitation, typically with language like "the change coincides with the intervention but external factors may contribute."
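The underlying computation is a split on the cutoff date followed by a two-sample test. A sketch on synthetic daily metrics (the 60-day windows and all values are illustrative):

```python
import numpy as np
from scipy import stats

# Synthetic daily conversion rates around a hypothetical March 1 cutoff
rng = np.random.default_rng(42)
pre = rng.normal(0.15, 0.01, 60)   # 60 days before the intervention
post = rng.normal(0.17, 0.01, 60)  # 60 days after

t, p = stats.ttest_ind(post, pre, equal_var=False)
lift = post.mean() - pre.mean()
```

A significant `p` here still can't rule out seasonality or concurrent changes, which is why the narrative hedges its causal language.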
Statistical Tests DataStoryBot Uses
The Code Interpreter selects the appropriate test based on the data:
| Comparison Type | Test | When Used |
|---|---|---|
| Two proportions (conversion rates) | Chi-squared test | A/B test on binary outcomes |
| Two group means | Welch's t-test | Revenue, duration, continuous metrics |
| Multiple group means | ANOVA + Tukey's HSD | 3+ regions or segments |
| Paired before/after | Paired t-test | Same users measured twice |
| Non-normal distributions | Mann-Whitney U test | Skewed revenue data, small samples |
You don't need to specify which test to use — the Code Interpreter inspects the data shape and selects appropriately. But if you know your data is non-normal (e.g., revenue data with heavy right skew), mention it in the steering prompt so the analysis skips parametric tests:
steering = (
"Revenue data is heavily right-skewed. Use non-parametric "
"tests (Mann-Whitney U) for comparing group medians rather "
"than t-tests on means."
)
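If you want to see locally why the non-parametric route matters, the Mann-Whitney U test from scipy is the standard choice for skewed data. A sketch on synthetic log-normal revenue:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Log-normal draws: heavy right skew, where means mislead and
# a rank-based test is more robust than a t-test
group_a = rng.lognormal(mean=3.0, sigma=1.0, size=500)
group_b = rng.lognormal(mean=3.3, sigma=1.0, size=500)
u, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
```

The test compares rank distributions rather than means, so a few huge orders in one group can't drive the result on their own.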
Common Pitfalls
Comparing unequal group sizes. If your treatment group has 10,000 users and control has 500, the comparison is still valid, but precision is limited by the smaller group: the 500-user arm's confidence interval is far wider, and it dominates the uncertainty of the difference. DataStoryBot reports sample sizes per group; check them.
Multiple comparisons. Comparing 8 regions on 5 metrics gives you 40 comparisons. At p<0.05, you'd expect 2 false positives by chance alone. DataStoryBot's Code Interpreter sometimes applies Bonferroni or FDR correction — if it doesn't and you're running many comparisons, mention it in the steering prompt.
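The Bonferroni correction mentioned above is simple enough to apply yourself as a sanity check: divide the significance threshold by the number of comparisons. A sketch with illustrative p-values:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that survive the Bonferroni-corrected threshold."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# 40 comparisons: two strong effects plus many that squeak under 0.05
pvals = [0.0004, 0.0009] + [0.03] * 38
flags = bonferroni_significant(pvals)
```

With 40 comparisons the corrected threshold is 0.00125, so the 38 results at p=0.03 (each "significant" in isolation) are correctly discarded, while the two genuinely strong effects survive.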
Simpson's paradox. An overall comparison might show one result while subgroup comparisons show the opposite. DataStoryBot often catches this by segmenting automatically, but not always. If you suspect confounding variables, ask for subgroup analysis explicitly.
Survivorship bias. Cohort analysis is especially susceptible — you're only measuring users who are still around. The January cohort might look better because the low-quality users already churned, leaving only engaged ones. The narrative should note the denominator at each measurement point.
Complete Python Example
import requests
BASE_URL = "https://datastory.bot/api"
def compare_groups(csv_path, groups_column, metrics, context=None):
"""Run a group comparison analysis."""
with open(csv_path, "rb") as f:
upload = requests.post(f"{BASE_URL}/upload", files={"file": f})
container_id = upload.json()["containerId"]
steering = (
f"Compare groups defined by the '{groups_column}' column. "
f"Analyze these metrics across groups: {', '.join(metrics)}. "
"Run appropriate statistical tests for each comparison. "
"Report effect sizes and significance levels."
)
if context:
steering += f" Additional context: {context}"
stories = requests.post(f"{BASE_URL}/analyze", json={
"containerId": container_id,
"steeringPrompt": steering
})
angles = stories.json()
print(f"Found {len(angles)} comparison stories:")
for a in angles:
print(f"\n [{a['id']}] {a['title']}")
print(f" {a['summary']}")
# Refine the primary comparison
report = requests.post(f"{BASE_URL}/refine", json={
"containerId": container_id,
"selectedStoryTitle": angles[0]["title"]
})
return report.json()
# A/B test
result = compare_groups(
"checkout_experiment.csv",
groups_column="variant",
metrics=["conversion_rate", "revenue", "cart_abandonment_rate"],
context="This is a randomized A/B test with equal allocation."
)
# Regional comparison
result = compare_groups(
"sales_q1.csv",
groups_column="region",
metrics=["revenue_per_customer", "order_frequency", "avg_order_value"],
context="Normalize by customer count since regions have very different sizes."
)
What to Read Next
For the statistical foundations behind group comparisons, how to use AI to analyze data covers the broader analytical framework.
For anomaly detection (finding what's wrong rather than what's different), see anomaly detection in CSV data.
For correlation analysis between continuous variables, read correlation discovery: which variables actually matter.
Or upload your own A/B test data to the DataStoryBot playground and steer it toward comparison to see how the narrative handles your specific dataset.
Ready to find your data story?
Upload a CSV and DataStoryBot will uncover the narrative in seconds.
Try DataStoryBot →