general | March 24, 2026 | 8 min read

Distribution Analysis: Understanding the Shape of Your Data

What to do when your story is about how data is distributed (skewness, bimodality, long tails), and how DataStoryBot detects and explains these patterns.

By DataStoryBot Team

Averages lie. "Average order value: $87" sounds straightforward until you learn that 50% of orders are under $42 and the average is pulled up by a handful of $2,000+ enterprise purchases. The average is technically correct and practically useless for understanding your customers.

Distribution analysis tells you what averages hide: where values cluster, how spread out they are, whether there are multiple groups mixed together, and where the extreme values sit. It's the difference between "average response time is 200ms" and "90% of requests complete under 150ms, but 3% take over 2 seconds — and those are all from the legacy API endpoint."
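To make the gap between an average and percentiles concrete, here is a minimal sketch using Python's standard `statistics` module. The response times are made-up illustrative values, and the `summarize` helper is a hypothetical name, not part of DataStoryBot.

```python
import statistics

def summarize(values):
    """Return mean, median, and two tail percentiles for a list of numbers."""
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p90": cuts[89],
        "p99": cuts[98],
    }

# Hypothetical response times in ms: 97 fast requests and 3 slow ones
response_times = [100] * 97 + [2500] * 3
summary = summarize(response_times)
print(summary)
```

In this toy data the mean lands well above what a typical request sees, while the median and p90 show the typical case and p99 exposes the slow tail.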

This article shows how to use DataStoryBot to analyze distributions in your CSV data, interpret the results, and steer the analysis toward the distribution patterns that matter most.

Why Shape Matters

The distribution shape determines which statistics are meaningful:

Normal (bell curve): Mean and standard deviation tell the full story. Most common in natural phenomena, less common in business data.

Right-skewed (long right tail): Revenue, income, order values, response times. The mean is higher than the median. Use median and percentiles instead of mean. The tail contains your most important customers (or your worst performance problems).

Left-skewed (long left tail): Customer satisfaction scores (most people are happy, a few are very unhappy), test scores (most pass, some fail). The mean is lower than the median.

Bimodal (two peaks): Often indicates two distinct populations mixed together. Session duration might be bimodal — quick visits (bounces) and engaged visits — with few sessions in between.

Uniform (flat): Equal probability across the range. Rare in practice but sometimes seen in synthetic data or well-balanced sampling.

Long-tail / power law: A few values are very large; most are small. Page views per URL, word frequency, city populations. The top 1% might account for 50%+ of the total.
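One way to check the direction of skew numerically is the Fisher-Pearson skewness coefficient. The sketch below is a plain-Python illustration; the `tolerance` of 0.5 is a common rule of thumb assumed here, not a threshold DataStoryBot is known to use, and the sample values are invented.

```python
import statistics

def skewness(values):
    """Fisher-Pearson skewness coefficient (population form, g1)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical: no skew to measure
    return m3 / m2 ** 1.5

def skew_direction(values, tolerance=0.5):
    """Label the skew direction; `tolerance` is an assumed rule-of-thumb cutoff."""
    g1 = skewness(values)
    if g1 > tolerance:
        return "right-skewed"
    if g1 < -tolerance:
        return "left-skewed"
    return "roughly symmetric"

# Invented samples for illustration
print(skew_direction([1, 2, 3, 4, 5]))            # balanced around the mean
print(skew_direction([1, 1, 1, 2, 2, 3, 20]))     # one large value on the right
print(skew_direction([-20, -3, -2, -2, -1, -1, -1]))  # mirror image
```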

Running Distribution Analysis

import requests

BASE_URL = "https://datastory.bot/api"

with open("order_data.csv", "rb") as f:
    upload = requests.post(f"{BASE_URL}/upload", files={"file": f})
container_id = upload.json()["containerId"]

stories = requests.post(f"{BASE_URL}/analyze", json={
    "containerId": container_id,
    "steeringPrompt": (
        "Analyze the distribution of order_value. Report: "
        "the shape (normal, skewed, bimodal, long-tail), "
        "mean vs. median, key percentiles (p10, p25, p75, p90, p99), "
        "and any notable clusters or gaps. "
        "Visualize with a histogram and box plot."
    )
})

The steering prompt explicitly requests distribution-focused analysis. Without it, DataStoryBot might default to trend or comparison analysis.

Example Output

[
  {
    "id": 1,
    "title": "Order Values Follow a Right-Skewed Distribution with a $29 Cluster",
    "summary": "Median order value is $42, but mean is $87 — a 2.1x ratio indicating strong right skew. 68% of orders fall between $15 and $75, but the top 5% (above $340) account for 38% of total revenue. A sharp cluster at $29 corresponds to the Basic subscription tier."
  },
  {
    "id": 2,
    "title": "Response Time Distribution Is Bimodal — Two Distinct Performance Profiles",
    "summary": "Response times cluster around 120ms (fast path, 78% of requests) and 1,800ms (slow path, 15% of requests). The gap between 500ms and 1,200ms contains only 3% of requests, suggesting two distinct code paths rather than random variation."
  }
]

Interpreting Distribution Stories

Skewness

For positive-valued data, the mean-to-median ratio is the quickest indicator of right skew (a ratio below 1.0 points to left skew instead):

Ratio (Mean/Median)	Shape	Example
~1.0	Symmetric	Height, IQ scores
1.2-2.0	Moderately skewed	Household income
> 2.0	Heavily skewed	Startup valuations

DataStoryBot's narrative will state the skewness direction and degree:

Order values are heavily right-skewed (skewness coefficient: 3.2). The mean ($87) is 2.1x the median ($42), indicating a small number of high-value orders pull the average up significantly. For reporting purposes, median is a more representative measure of the "typical" order.
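The ratio table above can be sketched as a small helper. This is an illustration only: `classify_by_ratio` is a hypothetical name, the table does not say how to label ratios between 1.0 and 1.2 or below 1.0, so treating everything under 1.2 as "roughly symmetric" is an assumption, and the order values are invented to reproduce the $87 mean and $42 median from the example.

```python
import statistics

def classify_by_ratio(values):
    """Classify right skew from the mean/median ratio (positive data only)."""
    median = statistics.median(values)
    if median <= 0:
        raise ValueError("mean/median ratio needs a positive median")
    ratio = statistics.fmean(values) / median
    if ratio > 2.0:
        label = "heavily skewed"
    elif ratio >= 1.2:
        label = "moderately skewed"
    else:
        # Assumption: the table leaves ratios under 1.2 unspecified
        label = "roughly symmetric"
    return ratio, label

# Invented orders: nine at $42 and one at $492 give mean $87, median $42
orders = [42] * 9 + [492]
ratio, label = classify_by_ratio(orders)
print(f"mean/median = {ratio:.1f} -> {label}")
```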

Bimodality

Two peaks in the histogram usually mean two populations are mixed:

Session duration shows clear bimodality. Peak 1 at 8-15 seconds (45% of sessions — likely bounces or quick checks). Peak 2 at 180-300 seconds (28% of sessions — engaged users completing tasks). The valley between 45-120 seconds contains only 12% of sessions. This suggests two distinct usage patterns rather than a single continuous distribution.

The action item from a bimodal distribution is usually to segment: analyze each population separately rather than treating them as one group.
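Segmenting can be as simple as splitting at a cutoff inside the valley and summarizing each side separately. The sketch below assumes a hypothetical cutoff of 60 seconds and invented session durations; in practice the cutoff would come from inspecting the histogram.

```python
import statistics

def split_at_valley(values, cutoff):
    """Split values into the groups below and at-or-above the cutoff."""
    low = [v for v in values if v < cutoff]
    high = [v for v in values if v >= cutoff]
    return low, high

def describe(values):
    """Count, mean, and median for one group."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
    }

# Invented session durations in seconds: quick visits and engaged visits
sessions = [8, 10, 12, 15, 9, 11, 200, 240, 180, 300]
quick, engaged = split_at_valley(sessions, cutoff=60)  # cutoff is an assumption

print("all:    ", describe(sessions))
print("quick:  ", describe(quick))
print("engaged:", describe(engaged))
```

The combined mean falls in the empty valley and describes neither group, which is why the per-group summaries are the ones worth reporting.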

Long Tails

When a small percentage of values dominate the total:

The top 1% of customers generate 34% of revenue. The distribution follows an approximate power law — each doubling of the revenue threshold halves the number of customers above it. The bottom 50% of customers collectively generate only 8% of revenue.

Long-tail distributions require percentile-based analysis. When the distribution is this skewed, the mean and standard deviation are dominated by a handful of extreme values and describe almost no actual customer.
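A concentration figure like "the top 1% generate 34% of revenue" is a share-of-total calculation. Here is a minimal sketch; `top_share` is a hypothetical helper and the revenue figures are invented.

```python
import math

def top_share(values, top_percent):
    """Fraction of the total contributed by the largest `top_percent` of values."""
    ordered = sorted(values, reverse=True)
    # Always include at least one value, rounding the group size up
    k = max(1, math.ceil(len(ordered) * top_percent / 100))
    return sum(ordered[:k]) / sum(ordered)

# Invented revenue per customer: one large account and 99 small ones
revenue = [1000] + [10] * 99
share = top_share(revenue, top_percent=1)
print(f"Top 1% of customers generate {share:.0%} of revenue")
```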

Multi-Column Distribution Analysis

Analyze distributions across multiple columns simultaneously:

stories = requests.post(f"{BASE_URL}/analyze", json={
    "containerId": container_id,
    "steeringPrompt": (
        "Analyze the distribution of all numeric columns. "
        "For each column, report the shape, central tendency "
        "(mean, median), spread (IQR, standard deviation), "
        "and notable outliers. Create a summary table and "
        "individual histograms for the most interesting columns."
    )
})

This produces a diagnostic overview — useful for understanding a new dataset before diving into specific analyses.
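For a quick local sanity check of the same diagnostics, a per-column summary can be sketched with the standard library. This is not how DataStoryBot computes its overview; the `numeric_summary` helper, the column names, and the sample rows are all assumptions for illustration.

```python
import csv
import io
import statistics

def numeric_summary(csv_text):
    """Summarize every column whose values all parse as numbers."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    summary = {}
    for column in rows[0].keys():
        try:
            values = [float(row[column]) for row in rows]
        except ValueError:
            continue  # skip columns that contain non-numeric text
        if len(values) < 2:
            continue  # quartiles and stdev need at least two values
        q1, _, q3 = statistics.quantiles(values, n=4)
        summary[column] = {
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            "iqr": q3 - q1,
            "stdev": statistics.stdev(values),
        }
    return summary

# Invented sample data with two numeric columns and one text column
sample = """order_value,items,segment
10,1,smb
20,2,smb
30,1,smb
40,3,enterprise
500,9,enterprise
"""
summary = numeric_summary(sample)
for name, stats in summary.items():
    print(name, stats)
```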

Distribution Comparison Across Groups

The most actionable distribution analysis compares groups:

stories = requests.post(f"{BASE_URL}/analyze", json={
    "containerId": container_id,
    "steeringPrompt": (
        "Compare the distribution of order_value across customer segments "
        "(the 'segment' column). Use overlaid histograms or box plots "
        "to visualize differences. Test whether the distributions are "
        "statistically different using a Kolmogorov-Smirnov test."
    )
})

The narrative might reveal:

Enterprise order values have a dramatically different distribution than SMB. Enterprise: median $340, IQR $180-$620, right-skewed but with a clear mode at $500 (annual plan price). SMB: median $42, IQR $25-$78, heavily right-skewed with a long tail. The distributions are statistically distinct (KS test, D=0.67, p<0.001). Treating these segments as a single population masks the underlying behavior.

This kind of finding, that two segments have fundamentally different distributions, often changes how you build metrics. A blended average order value describes neither segment; you need segment-specific metrics.
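The D value in a two-sample Kolmogorov-Smirnov test is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes only that statistic in plain Python, with invented samples; for a p-value as well, SciPy's `scipy.stats.ks_2samp` is the usual tool.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The gap can only change at observed values, so checking those suffices
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Invented order values for two segments
smb = [25, 30, 42, 55, 78]
enterprise = [180, 340, 500, 500, 620]
print(ks_statistic(smb, enterprise))  # no overlap between the samples
print(ks_statistic(smb, smb))         # identical samples
```

D ranges from 0 (identical empirical distributions) to 1 (no overlap at all), so it doubles as a rough effect size alongside the p-value.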

Detecting Outliers from Distributions

Distribution analysis naturally surfaces outliers:

stories = requests.post(f"{BASE_URL}/analyze", json={
    "containerId": container_id,
    "steeringPrompt": (
        "Identify outliers in the distribution of transaction_amount. "
        "Use both IQR method (1.5x IQR beyond Q1/Q3) and z-score "
        "method (beyond 3 standard deviations). List the outlier "
        "values and check if they cluster in any dimension "
        "(time period, customer segment, product category)."
    )
})
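The two rules named in the prompt can be sketched locally as follows. The helper names and the transaction amounts are invented for illustration, and this is not a description of DataStoryBot's internal implementation.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Invented transaction amounts: twenty ordinary values and one extreme one
amounts = list(range(20, 30)) * 2 + [400]
print(iqr_outliers(amounts))
print(zscore_outliers(amounts))
```

The z-score rule depends on the mean and standard deviation, which the outliers themselves inflate, so on small or heavily skewed samples it can miss values the IQR rule catches. That is one reason to ask for both.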

For a deeper treatment of outlier detection, see anomaly detection in CSV data.

Practical Use Cases

Pricing Analysis

Upload transaction data, steer toward distribution of order values. The distribution reveals:

Where customers naturally cluster (price sensitivity points)
Whether your pricing tiers align with actual purchasing behavior
The revenue impact of the tail (high-value customers)

Performance Monitoring

Upload response time data, steer toward distribution analysis. The distribution reveals:

The actual user experience at various percentiles (p50, p95, p99)
Whether there's bimodality (two code paths with different performance)
The severity of the tail (how bad is the worst experience?)

Customer Segmentation

Upload customer metrics, steer toward multi-variable distribution. The distribution reveals:

Natural clusters in customer behavior
Whether your current segments are real (bimodal distributions) or artificial
Which metrics differentiate segments most clearly

Complete Python Example

import requests

BASE_URL = "https://datastory.bot/api"

def analyze_distribution(csv_path, column, context=None):
    """Run a distribution analysis on a specific column."""

    # Upload the CSV; the response carries the container ID used by later calls
    with open(csv_path, "rb") as f:
        upload = requests.post(f"{BASE_URL}/upload", files={"file": f})
    upload.raise_for_status()
    container_id = upload.json()["containerId"]

    # Build a steering prompt focused on the shape of one column
    steering = (
        f"Analyze the distribution of the '{column}' column in depth. "
        "Report: shape classification, mean vs. median, "
        "percentiles (p5, p10, p25, p50, p75, p90, p95, p99), "
        "skewness and kurtosis, outlier count using IQR method, "
        "and any clusters or gaps. Create a histogram with KDE "
        "overlay and a box plot."
    )
    if context:
        steering += f" Context: {context}"

    # Request story angles, steered toward distribution analysis
    stories = requests.post(f"{BASE_URL}/analyze", json={
        "containerId": container_id,
        "steeringPrompt": steering
    })
    stories.raise_for_status()
    angles = stories.json()
    if not angles:
        raise ValueError("No story angles were returned for this dataset")

    # Expand the first angle into a full report
    report = requests.post(f"{BASE_URL}/refine", json={
        "containerId": container_id,
        "selectedStoryTitle": angles[0]["title"]
    })
    report.raise_for_status()

    return report.json()

# Analyze order value distribution
result = analyze_distribution(
    "orders.csv",
    column="order_value",
    context="E-commerce orders. Most customers are consumers, "
            "but ~5% are resellers with bulk orders."
)
print(result["narrative"])

What to Read Next

For trend analysis (how values change over time vs. how they're distributed), see how to find trends in your data automatically.

For group comparisons that go beyond distributions, read comparing groups in your data: A/B tests, segments, and cohorts.

For the steering prompt techniques that focus analysis on distributions, see prompt engineering for data analysis.

Or upload a dataset to the DataStoryBot playground and try a distribution-focused steering prompt to see what shape your data takes.
