General · 8 min read

Comparing Groups in Your Data: A/B Tests, Segments, and Cohorts

Use DataStoryBot to analyze group comparisons: test vs. control, region vs. region, cohort vs. cohort — with narrative explanations of what the differences mean.

By DataStoryBot Team

Most interesting data questions are comparison questions. Is the new checkout flow better than the old one? Do enterprise customers behave differently from SMBs? Is the January cohort retaining better than December's?

A single number in isolation is useless. Revenue was $2.3M. Good? Bad? You can't tell without a comparison — to last quarter, to the forecast, to the other product line. Comparison creates meaning.

This article shows how to use DataStoryBot to run group comparisons on your CSV data. Upload a file with group labels (test/control, region, cohort, plan tier), steer the analysis toward comparison, and get back a narrative that explains what the differences mean and whether they matter.

What Makes a Good Comparison

A valid comparison needs three things:

Clear groups. The data must have a column that defines the groups — a treatment flag for A/B tests, a region column, a signup date for cohorts. If groups aren't explicit in the data, you'll need to define them in the steering prompt.

A metric to compare. Revenue, conversion rate, time-to-completion, retention at day 30. The metric should be meaningful for the business question you're asking.

Enough data per group. Comparing two groups of 5 users each tells you nothing. Statistical significance requires sample size. DataStoryBot's Code Interpreter runs significance tests when the data supports them — but it can't manufacture statistical power from small samples.
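If you want a rough sense of whether your groups are big enough before uploading, the standard two-proportion power calculation is easy to sketch. A minimal version (assuming a 5% two-sided significance level and 80% power, which give the z-values 1.96 and 0.84):

```python
from math import ceil

def required_n_per_group(p_control, lift, z_alpha=1.96, z_power=0.84):
    """Approximate per-group sample size needed to detect an absolute
    `lift` over a baseline conversion rate `p_control`, using the
    normal approximation for a two-proportion test."""
    p_treat = p_control + lift
    p_bar = (p_control + p_treat) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_power * (p_control * (1 - p_control)
                              + p_treat * (1 - p_treat)) ** 0.5) ** 2
    return ceil(numerator / lift ** 2)

# Detecting a 3pp lift over a 15% baseline needs ~2,400 users per group
print(required_n_per_group(0.15, 0.03))  # → 2400
```

Two groups of 5 users each fail this bar by orders of magnitude, which is why no steering prompt can rescue a tiny sample.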

A/B Test Analysis

The most common comparison: treatment vs. control.

Upload Your Test Data

curl -X POST https://datastory.bot/api/upload \
  -F "file=@checkout_ab_test.csv"

Typical A/B test CSV structure:

user_id,variant,converted,revenue,session_duration_sec
u001,control,0,0,145
u002,treatment,1,89.50,203
u003,treatment,1,124.00,178
u004,control,1,67.00,156
...

Steer Toward Comparison

curl -X POST https://datastory.bot/api/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "containerId": "ctr_abc123",
    "steeringPrompt": "This is an A/B test dataset. Compare the treatment group against the control group on conversion rate and revenue. Run a statistical significance test and report whether the difference is significant at p<0.05."
  }'

The steering prompt does three things: identifies the comparison structure (treatment vs. control), specifies the metrics (conversion rate, revenue), and requests statistical rigor (significance test).

What You Get Back

[
  {
    "id": 1,
    "title": "Treatment Increases Conversion Rate by 3.2 Percentage Points (p=0.003)",
    "summary": "The treatment group converted at 18.7% vs. 15.5% for control — a 3.2pp lift that is statistically significant (chi-squared test, p=0.003, n=12,847). Revenue per user is also higher: $14.20 vs. $11.80 (t-test, p=0.011)."
  },
  {
    "id": 2,
    "title": "Treatment Effect Is Stronger for Mobile Users",
    "summary": "Segmenting by device type, the treatment effect on conversion is 4.8pp for mobile users but only 1.1pp for desktop. The mobile lift is significant (p<0.001); the desktop lift is not (p=0.34)."
  }
]

DataStoryBot doesn't just compute the means — it runs the appropriate statistical test (chi-squared for proportions, t-test for continuous metrics), reports the p-value, and identifies interaction effects across subgroups. The narrative tells you what the numbers mean, not just what they are.
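If you want to sanity-check this kind of result locally, the same tests are available in SciPy. A sketch with illustrative numbers (not the dataset from the example above; assumes `scipy` and `numpy` are installed):

```python
import numpy as np
from scipy import stats

# Chi-squared test on conversion counts: a 2x2 contingency table with
# rows = control/treatment, columns = converted/not converted
table = np.array([[995, 5425],    # control
                  [1201, 5226]])  # treatment
chi2, p_prop, _, _ = stats.chi2_contingency(table)

# Welch's t-test on a continuous metric such as revenue per user
rng = np.random.default_rng(0)
control_rev = rng.gamma(2.0, 6.0, size=5000)    # skewed, mean ~12
treatment_rev = rng.gamma(2.0, 7.0, size=5000)  # skewed, mean ~14
t, p_mean = stats.ttest_ind(control_rev, treatment_rev, equal_var=False)

print(f"conversion: p={p_prop:.4f}, revenue: p={p_mean:.3g}")
```

The point of reproducing the tests isn't distrust of the tool: it's that a p-value you can recompute is one you can defend in a launch review.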

The Full Narrative

curl -X POST https://datastory.bot/api/refine \
  -H "Content-Type: application/json" \
  -d '{
    "containerId": "ctr_abc123",
    "selectedStoryTitle": "Treatment Increases Conversion Rate by 3.2 Percentage Points (p=0.003)"
  }'

The refined narrative includes:

  • The headline result with statistical test details
  • Effect size and confidence intervals
  • Sample size per group
  • Subgroup breakdowns (if the data supports them)
  • Charts: conversion rate comparison bar chart, revenue distribution by group, and often a segment-level breakdown

Regional Comparison

Not every comparison is an experiment. Often you're comparing naturally occurring segments — regions, product lines, customer tiers.

import requests

BASE_URL = "https://datastory.bot/api"

# Upload
with open("sales_by_region.csv", "rb") as f:
    upload = requests.post(f"{BASE_URL}/upload", files={"file": f})
container_id = upload.json()["containerId"]

# Analyze
stories = requests.post(f"{BASE_URL}/analyze", json={
    "containerId": container_id,
    "steeringPrompt": (
        "Compare performance across regions. For each region, "
        "analyze revenue, order count, average order value, "
        "and customer count. Identify which regions outperform "
        "or underperform the company average and explain why."
    )
})

Regional comparisons are trickier than A/B tests because the groups aren't randomized. The West region might outperform because it has more customers, not because it's "better." DataStoryBot's narrative accounts for this by normalizing metrics (revenue per customer, rather than total revenue) and noting confounding factors.
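The normalization itself is easy to reproduce locally if you want to verify the narrative. A minimal pandas sketch (the column names here are illustrative assumptions, not a required schema):

```python
import pandas as pd

df = pd.DataFrame({
    "region":      ["West", "West", "East", "East", "East"],
    "customer_id": ["c1", "c2", "c3", "c4", "c5"],
    "revenue":     [1200.0, 800.0, 500.0, 450.0, 600.0],
})

by_region = df.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    customers=("customer_id", "nunique"),
)
by_region["revenue_per_customer"] = (
    by_region["total_revenue"] / by_region["customers"]
)
# West leads on total revenue (2000 vs 1550), but per customer the
# ordering is what matters: 1000.0 vs ~516.7
print(by_region)
```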

Cohort Analysis

Cohort comparisons answer: "Are newer users behaving differently from older ones?"

stories = requests.post(f"{BASE_URL}/analyze", json={
    "containerId": container_id,
    "steeringPrompt": (
        "This is user-level data with a signup_date column. "
        "Create monthly cohorts based on signup_date and compare "
        "them on: 30-day retention rate, average revenue per user, "
        "and feature adoption rate. Identify which cohorts are "
        "strongest and whether there's a trend over time."
    )
})

Cohort analysis requires the Code Interpreter to construct the cohorts from raw data — grouping users by signup month, computing metrics for each cohort at equivalent time periods, and building the classic cohort retention table. This is exactly the kind of multi-step computation that Code Interpreter handles well.
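To see what that computation looks like under the hood, here is a minimal pandas sketch of monthly cohort construction and day-30 retention (the columns and the retention definition are illustrative assumptions, not what Code Interpreter necessarily runs):

```python
import pandas as pd

users = pd.DataFrame({
    "user_id":     ["u1", "u2", "u3", "u4", "u5", "u6"],
    "signup_date": pd.to_datetime(["2026-01-05", "2026-01-20", "2026-01-28",
                                   "2026-02-03", "2026-02-14", "2026-02-21"]),
    "last_active": pd.to_datetime(["2026-02-20", "2026-01-25", "2026-03-01",
                                   "2026-02-10", "2026-03-30", "2026-03-05"]),
})

# Cohort = signup month; retained at day 30 = still active 30+ days later
users["cohort"] = users["signup_date"].dt.to_period("M")
users["retained_d30"] = (
    users["last_active"] - users["signup_date"]
).dt.days >= 30

retention = users.groupby("cohort")["retained_d30"].mean()
print(retention)
```

A real cohort table repeats this at several horizons (day 7, 30, 90) and pivots cohorts against horizons, which is the multi-step part Code Interpreter automates.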

The narrative output might look like:

The January 2026 cohort's day-30 retention is 12 percentage points higher than the October 2025 cohort's (44% vs. 32%). This improvement coincides with the onboarding redesign shipped in late December. Cohorts before December show flat retention (30-33%), while January through March show a step change (42-46%). The effect is strongest in the first 7 days — suggesting the new onboarding reduces early churn rather than improving long-term engagement.

Before/After Comparison

When you don't have a control group but you do have a clear intervention date:

stories = requests.post(f"{BASE_URL}/analyze", json={
    "containerId": container_id,
    "steeringPrompt": (
        "Compare metrics before and after March 1, 2026. "
        "This date marks a major product change. Analyze "
        "whether key metrics (conversion rate, session duration, "
        "support tickets) changed significantly after the intervention. "
        "Use the pre-period as the baseline and test for significance."
    )
})

Before/after analysis is weaker than A/B testing — you can't separate the intervention effect from other factors that changed at the same time. DataStoryBot's narrative will note this limitation, typically with language like "the change coincides with the intervention but external factors may contribute."

Statistical Tests DataStoryBot Uses

The Code Interpreter selects the appropriate test based on the data:

| Comparison Type | Test | When Used |
| --- | --- | --- |
| Two proportions (conversion rates) | Chi-squared test | A/B test on binary outcomes |
| Two group means | Welch's t-test | Revenue, duration, continuous metrics |
| Multiple group means | ANOVA + Tukey's HSD | 3+ regions or segments |
| Paired before/after | Paired t-test | Same users measured twice |
| Non-normal distributions | Mann-Whitney U test | Skewed revenue data, small samples |

You don't need to specify which test to use — the Code Interpreter inspects the data shape and selects appropriately. But if you know your data is non-normal (e.g., revenue data with heavy right skew), mention it in the steering prompt so the analysis skips parametric tests:

steering = (
    "Revenue data is heavily right-skewed. Use non-parametric "
    "tests (Mann-Whitney U) for comparing group medians rather "
    "than t-tests on means."
)
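For a local sanity check of that non-parametric comparison, SciPy's implementation works directly on the raw values. A sketch with simulated right-skewed revenue (assumes `scipy` and `numpy` are installed):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Heavy right skew: lognormal revenue, treatment shifted up slightly
control = rng.lognormal(mean=2.0, sigma=1.0, size=2000)
treatment = rng.lognormal(mean=2.15, sigma=1.0, size=2000)

# Rank-based test: robust to the skew that would distort a t-test
stat, p = mannwhitneyu(control, treatment, alternative="two-sided")
print(f"U={stat:.0f}, p={p:.3g}")
```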

Common Pitfalls

Comparing unequal group sizes. If your treatment group has 10,000 users and control has 500, the comparison is technically valid, but the precision is limited by the smaller group: the confidence interval is driven almost entirely by the 500-user control arm. DataStoryBot reports sample sizes per group — check them.

Multiple comparisons. Comparing 8 regions on 5 metrics gives you 40 comparisons. At p<0.05, you'd expect 2 false positives by chance alone. DataStoryBot's Code Interpreter sometimes applies Bonferroni or FDR correction — if it doesn't and you're running many comparisons, mention it in the steering prompt.
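The Bonferroni correction itself is one line of arithmetic: divide the significance threshold by the number of comparisons. A minimal sketch with hypothetical p-values:

```python
def bonferroni(p_values, alpha=0.05):
    """Return which hypotheses survive Bonferroni correction."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# 40 comparisons: only p-values below 0.05/40 = 0.00125 survive
p_values = [0.003, 0.0009, 0.04, 0.2, 0.0001] + [0.5] * 35
flags = bonferroni(p_values)
print(flags[:5])  # → [False, True, False, False, True]
```

Note that p=0.003 and p=0.04, both "significant" in isolation, do not survive. FDR (Benjamini-Hochberg) is less conservative and often a better fit for exploratory work.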

Simpson's paradox. An overall comparison might show one result while subgroup comparisons show the opposite. DataStoryBot often catches this by segmenting automatically, but not always. If you suspect confounding variables, ask for subgroup analysis explicitly.
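A toy example makes the paradox concrete. In the (invented) data below, the treatment wins within every device segment but loses in aggregate, because treatment traffic is skewed toward mobile, the lower-converting segment:

```python
import pandas as pd

df = pd.DataFrame({
    "variant":   ["control", "control", "treatment", "treatment"],
    "device":    ["mobile", "desktop", "mobile", "desktop"],
    "users":     [100, 500, 500, 100],
    "converted": [10, 160, 75, 40],
})

# Aggregate view: treatment looks worse (19.2% vs 28.3%)
overall = df.groupby("variant").sum(numeric_only=True)
overall["rate"] = overall["converted"] / overall["users"]

# Segment view: treatment wins on both mobile (15% vs 10%)
# and desktop (40% vs 32%)
by_device = df.set_index(["device", "variant"])
by_device["rate"] = by_device["converted"] / by_device["users"]
print(overall["rate"], by_device["rate"], sep="\n\n")
```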

Survivorship bias. Cohort analysis is especially susceptible — you're only measuring users who are still around. The January cohort might look better because the low-quality users already churned, leaving only engaged ones. The narrative should note the denominator at each measurement point.

Complete Python Example

import requests

BASE_URL = "https://datastory.bot/api"

def compare_groups(csv_path, groups_column, metrics, context=None):
    """Run a group comparison analysis."""

    with open(csv_path, "rb") as f:
        upload = requests.post(f"{BASE_URL}/upload", files={"file": f})
    container_id = upload.json()["containerId"]

    steering = (
        f"Compare groups defined by the '{groups_column}' column. "
        f"Analyze these metrics across groups: {', '.join(metrics)}. "
        "Run appropriate statistical tests for each comparison. "
        "Report effect sizes and significance levels."
    )
    if context:
        steering += f" Additional context: {context}"

    stories = requests.post(f"{BASE_URL}/analyze", json={
        "containerId": container_id,
        "steeringPrompt": steering
    })
    angles = stories.json()

    print(f"Found {len(angles)} comparison stories:")
    for a in angles:
        print(f"\n  [{a['id']}] {a['title']}")
        print(f"      {a['summary']}")

    # Refine the primary comparison
    report = requests.post(f"{BASE_URL}/refine", json={
        "containerId": container_id,
        "selectedStoryTitle": angles[0]["title"]
    })

    return report.json()

# A/B test
result = compare_groups(
    "checkout_experiment.csv",
    groups_column="variant",
    metrics=["conversion_rate", "revenue", "cart_abandonment_rate"],
    context="This is a randomized A/B test with equal allocation."
)

# Regional comparison
result = compare_groups(
    "sales_q1.csv",
    groups_column="region",
    metrics=["revenue_per_customer", "order_frequency", "avg_order_value"],
    context="Normalize by customer count since regions have very different sizes."
)

What to Read Next

For the statistical foundations behind group comparisons, how to use AI to analyze data covers the broader analytical framework.

For anomaly detection (finding what's wrong rather than what's different), see anomaly detection in CSV data.

For correlation analysis between continuous variables, read correlation discovery: which variables actually matter.

Or upload your own A/B test data to the DataStoryBot playground and steer it toward comparison to see how the narrative handles your specific dataset.

Ready to find your data story?

Upload a CSV and DataStoryBot will uncover the narrative in seconds.

Try DataStoryBot →