CSV Analysis at Scale: Processing Large Datasets via API
Strategies for analyzing large CSV files with DataStoryBot — the 50 MB limit, chunking, pre-aggregation, and when to pre-process before uploading.
DataStoryBot handles most CSV files without issue. A few thousand rows, a few dozen columns, a few megabytes — upload, analyze, done. But real-world data gets big. Transaction logs with millions of rows. IoT sensor data sampled every second. CRM exports with hundreds of columns.
The API has a 50 MB file upload limit. The container has a 300-second execution timeout. And Code Interpreter works best with datasets that fit comfortably in memory. This article covers the strategies for working within these constraints when your data exceeds them.
Understanding the Limits
File size: 50 MB maximum per upload. A CSV with 1 million rows and 10 columns is roughly 50-80 MB depending on data types. You'll hit the limit around 500K-1M rows for typical business data.
Execution time: Code Interpreter has approximately 300 seconds of execution time. Complex operations on large datasets — correlation matrices, group-by aggregations with many groups, chart generation with thousands of data points — can hit this limit.
Memory: The container has limited RAM. A 50 MB CSV expands to roughly 200-400 MB in memory as a pandas DataFrame (pandas stores data less compactly than CSV text). Analysis operations that create copies (sorting, grouping, pivoting) multiply memory usage further.
Chart rendering: Matplotlib can't usefully plot 1 million individual data points. Even if it could render them, the resulting image would be unreadable. Charts need aggregated or sampled data.
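You can sanity-check the memory expansion locally before uploading. A minimal sketch — the in-memory toy CSV and the 10K row count stand in for your real file and row count:

```python
import io
import pandas as pd

# Toy CSV standing in for a real file on disk.
csv_text = "id,region,revenue\n" + "\n".join(
    f"{i},west,{i * 1.5}" for i in range(10_000)
)

# Read only a slice of rows so the check stays cheap on huge files.
sample = pd.read_csv(io.StringIO(csv_text), nrows=5_000)

# deep=True counts the actual bytes held by string (object) columns.
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Extrapolate to the full row count to predict in-memory size.
estimated_mb = bytes_per_row * 10_000 / 1e6
print(f"~{bytes_per_row:.0f} bytes/row, ~{estimated_mb:.2f} MB for 10K rows")
```

If the estimate lands well above a few hundred megabytes, reach for one of the strategies below before uploading.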
Strategy 1: Pre-Aggregation
The most effective strategy. Aggregate your data to the granularity the analysis needs before uploading.
Before (raw data, 2M rows, 120 MB):
timestamp,user_id,event,page,duration_ms
2026-03-01 00:00:01,u001,pageview,/home,3200
2026-03-01 00:00:01,u002,pageview,/pricing,1800
...
After (daily aggregation, 90 rows, 12 KB):
date,total_pageviews,unique_users,avg_duration_ms,bounce_rate
2026-03-01,45230,12340,2850,0.34
2026-03-02,47891,13100,2730,0.31
...
import pandas as pd

# Pre-aggregate locally before uploading
df = pd.read_csv("raw_events.csv")
df["date"] = pd.to_datetime(df["timestamp"]).dt.date
daily = df.groupby("date").agg(
    total_pageviews=("event", "count"),
    unique_users=("user_id", "nunique"),
    avg_duration_ms=("duration_ms", "mean"),
    # Treat sessions shorter than 5 seconds as bounces
    bounce_rate=("duration_ms", lambda x: (x < 5000).mean()),
).reset_index()
daily.to_csv("daily_aggregated.csv", index=False)
# Upload daily_aggregated.csv to DataStoryBot
Pre-aggregation reduces 2M rows to 90 rows — a 22,000x reduction. The analysis is faster, cheaper, and produces better results because the signal is concentrated.
When to aggregate to what level:
- Daily metrics: aggregate to daily
- Weekly reports: aggregate to weekly
- Regional comparisons: aggregate to region
- Product analysis: aggregate to product × period
The rule: aggregate to the level of your question, not the level of your data.
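As a concrete case of aggregating to the level of the question: a weekly report wants weekly buckets, which pandas' `resample` produces directly. A sketch on toy daily data:

```python
import pandas as pd

# Toy daily data standing in for a raw export.
df = pd.DataFrame({
    "timestamp": pd.date_range("2026-03-01", periods=28, freq="D"),
    "revenue": range(28),
})

# Roll daily rows up into calendar-week buckets for a weekly report.
weekly = (
    df.set_index("timestamp")["revenue"]
      .resample("W")
      .sum()
      .reset_index()
)
print(weekly)
```

The same pattern works for any period: swap `"W"` for `"MS"` (month start) or `"QS"` (quarter start) to match the report's cadence.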
Strategy 2: Sampling
When you need row-level detail but the full dataset is too large, sample it.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")  # 5M rows

# Random sample — good for general pattern analysis
sample = df.sample(n=50000, random_state=42)
sample.to_csv("transactions_sample.csv", index=False)

# Stratified sample — preserves group proportions
sample, _ = train_test_split(
    df, train_size=50000, stratify=df["region"], random_state=42
)
sample.to_csv("transactions_stratified.csv", index=False)
Sampling works for:
- Distribution analysis (the shape is preserved)
- Correlation analysis (linear relationships are preserved)
- General trend detection (with enough samples per time period)
Sampling doesn't work for:
- Anomaly detection (rare events might not appear in the sample)
- Exact counts or totals (sample totals need to be scaled)
- Small-group analysis (a group with 100 rows in the full dataset might have 1 row in a 1% sample)
Use stratified sampling when your analysis involves group comparisons. Plain random sampling can under-represent small groups.
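The scaling caveat for totals is easy to get wrong: a sum over a 10% sample is roughly a tenth of the true total. A small sketch, with toy amounts standing in for real transactions:

```python
import pandas as pd

# Toy "population" of transactions (stands in for millions of rows).
full = pd.DataFrame({"amount": [10.0] * 900 + [200.0] * 100})

# 10% random sample — fine for distributions, misleading for raw totals.
sample = full.sample(frac=0.10, random_state=42)

# Scale sample totals back up by the inverse sampling fraction.
scale = len(full) / len(sample)
estimated_total = sample["amount"].sum() * scale
actual_total = full["amount"].sum()
print(f"estimated {estimated_total:.0f} vs actual {actual_total:.0f}")
```

Even after scaling, the estimate carries sampling error — which is why exact counts and totals belong in a pre-aggregation pass, not a sampled one.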
Strategy 3: Column Selection
Wide datasets (100+ columns) waste upload bandwidth and analysis time on columns you don't need.
# Select only relevant columns before upload
relevant_cols = [
    "date", "region", "product", "revenue",
    "units_sold", "customer_segment",
]
df = pd.read_csv("full_export.csv", usecols=relevant_cols)
df.to_csv("selected_columns.csv", index=False)
If you're not sure which columns matter, upload a small sample of the full-width data first and run a data quality audit:
steering = (
    "This is a sample of a larger dataset. For each column, "
    "report: data type, null rate, unique values, and whether "
    "the column appears useful for analysis. Recommend which "
    "columns to keep for a sales performance analysis."
)
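Producing that small full-width sample locally is cheap, since `read_csv`'s `nrows` reads only the top of the file. A sketch — the file names are illustrative, and the toy frame stands in for your real export:

```python
import pandas as pd

# Toy wide export standing in for a real 100+ column file.
wide = pd.DataFrame({f"col_{i}": range(5000) for i in range(50)})
wide.to_csv("full_export.csv", index=False)

# nrows reads only the first rows, so this stays cheap even at 100+ MB.
preview = pd.read_csv("full_export.csv", nrows=1000)
preview.to_csv("full_export_sample.csv", index=False)

# All columns, few rows — small enough to upload for the audit.
print(preview.shape)
```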
Then upload the full dataset with only the recommended columns.
Strategy 4: Chunking
For analyses that need to scan the full dataset but can be decomposed into independent parts:
import pandas as pd
import requests

BASE_URL = "https://datastory.bot/api"

def analyze_by_region(csv_path):
    """Analyze a large CSV by splitting it into regional chunks."""
    df = pd.read_csv(csv_path)
    regions = df["region"].unique()
    results = {}
    for region in regions:
        chunk = df[df["region"] == region]
        chunk_path = f"/tmp/chunk_{region}.csv"
        chunk.to_csv(chunk_path, index=False)
        with open(chunk_path, "rb") as f:
            upload = requests.post(f"{BASE_URL}/upload", files={"file": f})
        container_id = upload.json()["containerId"]
        stories = requests.post(f"{BASE_URL}/analyze", json={
            "containerId": container_id,
            "steeringPrompt": f"Analyze performance for the {region} region."
        })
        results[region] = stories.json()
    return results
Chunking works when your analysis can be partitioned along a natural dimension (region, product line, time period). It doesn't work for cross-chunk analysis — computing correlations across all regions requires the full dataset.
Strategy 5: Two-Pass Analysis
For datasets that need both summary-level and detail-level analysis:
Pass 1: Full dataset, aggregated. Upload the pre-aggregated version. Get the big-picture findings — trends, comparisons, anomalies at the summary level.
Pass 2: Focused slice, raw. Based on Pass 1's findings, extract and upload the specific slice that needs detail-level analysis.
# Pass 1: Aggregated analysis
daily_agg = aggregate_to_daily(full_data)
# Upload daily_agg → DataStoryBot → finds "March 15 had a revenue anomaly"
# Pass 2: Detail drill-down
march_15 = full_data[full_data["date"] == "2026-03-15"]
# Upload march_15 → DataStoryBot → finds "the anomaly was driven by 3 refunded orders"
This mimics how human analysts work: start with the dashboard (aggregated), then drill into the detail. Two API calls instead of one, but each operates on manageable data.
Compression and Format Tips
Remove unnecessary precision. Floats with 15 decimal places waste bytes. Round to the precision you need:
df["revenue"] = df["revenue"].round(2)
df["percentage"] = df["percentage"].round(4)
Drop empty columns. Columns that are 100% null add size without value:
df = df.dropna(axis=1, how="all")
Use efficient date formats. 2026-03-15 is 10 bytes. 2026-03-15T00:00:00.000000+00:00 is 32 bytes. If you don't need sub-day precision, use the short format.
Compress column names. customer_acquisition_cost_in_usd → cac_usd. Mention the full names in the steering prompt:
steering = "Column cac_usd = Customer Acquisition Cost in USD. ..."
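The four tips above can be combined into one shrink pass before export. A sketch on a toy frame — the column names and values are illustrative:

```python
import pandas as pd

# Toy frame standing in for a bloated export.
df = pd.DataFrame({
    "customer_acquisition_cost_in_usd": [12.3456789, 45.6789012],
    "revenue": [100.123456789, 200.987654321],
    "signup_ts": pd.to_datetime(
        ["2026-03-15T00:00:00+00:00", "2026-03-16T00:00:00+00:00"], utc=True
    ),
    "empty": [None, None],
})

df = df.dropna(axis=1, how="all")          # drop 100%-null columns
df["revenue"] = df["revenue"].round(2)     # trim float precision

# Short date format: 10 bytes instead of 32 per value
df["signup_date"] = df["signup_ts"].dt.strftime("%Y-%m-%d")
df = df.drop(columns="signup_ts")

# Compress long column names; explain them in the steering prompt
df = df.rename(columns={"customer_acquisition_cost_in_usd": "cac_usd"})

df.to_csv("compact.csv", index=False)
```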
When to Use a Different Tool
DataStoryBot is optimized for CSV analysis up to tens of millions of data points (after pre-aggregation). Beyond that, consider:
SQL databases for ad-hoc queries on billions of rows. Run your query, export the result as a CSV, then upload to DataStoryBot for narrative and visualization.
Spark/Dask for distributed computation on datasets that don't fit in memory. Pre-process with Spark, aggregate, then hand off to DataStoryBot.
Dedicated BI tools (Looker, Tableau) for interactive dashboards on large datasets with predefined metrics. Use DataStoryBot for the narrative analysis that dashboards can't provide.
The sweet spot for DataStoryBot: datasets where the interesting insights require statistical analysis and narrative explanation, not just counting and filtering. Pre-aggregate the counting and filtering work, then upload the condensed result for the smart analysis.
What to Read Next
For the basic CSV analysis workflow that this article extends, see how to analyze a CSV file automatically.
For handling data quality issues in your CSVs, read handling messy CSVs: nulls, mixed types, and encoding issues.
For building automated pipelines that handle large data regularly, see automating weekly data reports with DataStoryBot.
Ready to find your data story?
Upload a CSV and DataStoryBot will uncover the narrative in seconds.
Try DataStoryBot →