From Code Interpreter Prototype to Production API
How DataStoryBot went from a Code Interpreter experiment to a production API — architecture decisions, error handling, and lessons learned.
The prototype took a weekend. Three API calls — create container, upload CSV, call Responses API — and you get narrative plus charts. It felt like cheating.
The production system took three months. Not because Code Interpreter is hard to use, but because production means handling every edge case, recovering from every failure mode, and delivering consistent results when OpenAI's infrastructure has a bad hour.
This article covers the specific decisions and lessons from building DataStoryBot: what we changed, what we underestimated, and what we'd do differently.
The Prototype Was Twelve Lines
The original proof of concept was literally this:
const container = await openai.containers.create({
expires_after: { anchor: "last_activity", minutes: 20 }
});
await openai.containers.files.create(container.id, {
file: fs.createReadStream(csvPath)
});
const response = await openai.responses.create({
model: "gpt-4o",
input: [{ role: "user", content: "Analyze this data and create charts." }],
tools: [{ type: "code_interpreter", container: { id: container.id } }]
});
console.log(response.output);
This works. It genuinely produces useful analysis. But it will fall over in production in roughly fifteen different ways, and the prototype taught us nothing about which ones would hurt most.
Architecture Decision One: Separate the Upload from the Analysis
The first instinct is to bundle everything into a single endpoint: upload CSV, analyze, return results. The problem is that analysis takes 30-90 seconds, and the file upload is a prerequisite with a different failure profile.
We split them:
POST /api/upload — creates the container, uploads the CSV, extracts file metadata, returns containerId
POST /api/analyze — takes containerId, runs the Responses API call, returns story candidates
POST /api/refine — takes containerId and a selected story title, produces the full narrative and charts
This separation matters for three reasons.
You can pre-validate. When the file arrives at /upload, you check size, encoding, and whether pandas can actually parse it before touching the Responses API. A corrupted CSV or a file with 50 MB of content fails fast and cheaply, not 60 seconds into a model call.
You can show progress. A combined endpoint forces the UI to show a spinner for 90 seconds. Splitting means the upload step completes in 5-10 seconds, giving the user confirmation that their data arrived before the slow part begins.
The failure modes are different. Upload failures are mostly 4xx — file too large, bad encoding, not a CSV. Analysis failures are mostly timeouts, model errors, and empty results. Handling them in the same catch block produces terrible error messages.
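The pre-validation step in /upload can be sketched as a plain function. The limits and error messages below are illustrative, not the production values; validateUpload is a hypothetical helper:

```typescript
// Sketch of upload pre-validation: fail fast with clear 4xx-style
// errors before any container or Responses API call is made.
const MAX_BYTES = 50 * 1024 * 1024; // hard reject threshold
const WARN_BYTES = 15 * 1024 * 1024; // "analysis may be slow" threshold

interface UploadCheck {
  ok: boolean;
  warning?: string;
  error?: string;
}

function validateUpload(
  fileName: string,
  sizeBytes: number,
  firstChunk: Buffer
): UploadCheck {
  if (!fileName.toLowerCase().endsWith(".csv")) {
    return { ok: false, error: "Only CSV files are supported." };
  }
  if (sizeBytes > MAX_BYTES) {
    return { ok: false, error: "File exceeds the 50 MB limit." };
  }
  // Cheap encoding sanity check: a UTF-8 decode of the first chunk
  // should not produce replacement characters.
  const text = firstChunk.toString("utf8");
  if (text.includes("\uFFFD")) {
    return { ok: false, error: "File does not look like UTF-8 text." };
  }
  // A CSV header row should contain at least one delimiter.
  const header = text.split("\n")[0] ?? "";
  if (!header.includes(",")) {
    return { ok: false, error: "First line has no commas; is this really a CSV?" };
  }
  if (sizeBytes > WARN_BYTES) {
    return { ok: true, warning: "Large file: analysis may be slow." };
  }
  return { ok: true };
}
```

Each of these checks costs milliseconds; the model call they prevent costs a minute.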
Architecture Decision Two: Dual-Phase Analysis
The initial prompt asked for everything at once: "analyze the data, find three insights, generate charts for each." This created a model call that tried to do too much in one shot.
The production system uses two Responses API calls per analysis session:
Phase 1 — Story discovery. Ask GPT-4o to examine the data and identify three candidate story angles. The model writes exploratory code, reads the output, and returns structured JSON describing what it found: a title for each story, a two-sentence summary, and suggested chart types. This call typically completes in 25-40 seconds.
Phase 2 — Deep dive. With the user's selected story title, ask GPT-4o to write a full analysis of that specific angle: multi-paragraph narrative, 2-4 charts with consistent styling, and a filtered dataset if relevant. This call takes 30-60 seconds.
Why not run both phases automatically? Because phase 1 often surfaces angles the user didn't expect and might want to pivot toward. Showing candidates adds value. And if phase 2 fails, you still have phase 1 results — the user can retry a different story.
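Phase 1 returns structured JSON, which still needs defensive parsing before it reaches the UI. A sketch of that step — the StoryCandidate shape is an assumption based on the fields described above, and models occasionally wrap JSON in markdown fences:

```typescript
interface StoryCandidate {
  title: string;
  summary: string; // two-sentence summary
  chartTypes: string[]; // suggested chart types
}

// Defensively parse the model's phase-1 output: strip any markdown
// fence, then keep only objects with the expected shape.
function parseStoryCandidates(raw: string): StoryCandidate[] {
  const stripped = raw
    .replace(/^```(?:json)?\s*/m, "")
    .replace(/```\s*$/m, "");
  let parsed: unknown;
  try {
    parsed = JSON.parse(stripped);
  } catch {
    return []; // treat unparseable output as "no candidates"
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(
    (s): s is StoryCandidate =>
      typeof s === "object" &&
      s !== null &&
      typeof (s as any).title === "string" &&
      typeof (s as any).summary === "string" &&
      Array.isArray((s as any).chartTypes)
  );
}
```

Returning an empty array on malformed output lets the caller decide whether to retry, rather than crashing mid-session.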
Architecture Decision Three: Container Lifecycle as First-Class Concern
The 20-minute container TTL is not a limitation to work around. It is the core operational constraint every design decision must account for.
The TTL resets on activity. Any API call that references the container — uploading a file, running a Responses API call, downloading a file — resets the 20-minute clock. An active session can run indefinitely. But the moment there is a 20-minute gap, the container and everything in it is gone.
This produced several concrete design choices:
Store the original file path, not just the container ID. When a container expires and the user tries to refine a different story, you can re-upload automatically rather than forcing the user to repeat the file selection step.
// In the session store
interface AnalysisSession {
containerId: string;
originalFilePath: string; // kept so we can re-upload on expiry
stories: StoryCandidate[];
createdAt: Date;
lastActivityAt: Date;
}
Download charts immediately. Generated chart files exist only in the container. If the user comes back after 20 minutes to download a chart they saw earlier, it is gone. Download and store everything you care about in your own storage as soon as the Responses API call completes.
async function downloadAndStore(containerId: string, fileIds: string[]) {
const stored = [];
for (const fileId of fileIds) {
const content = await openai.containers.files.content(containerId, fileId);
const buffer = Buffer.from(await content.arrayBuffer());
// Store in S3, GCS, or local disk — anywhere persistent
const storedKey = await yourStorage.put(`charts/${fileId}.png`, buffer);
stored.push({ fileId, storedKey });
}
return stored;
}
Detect expiry explicitly. A 404 on any container-scoped request means the container is gone. Catch it distinctly from other 404s.
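One possible shape for that check — this assumes the caught error carries an HTTP status and a message mentioning the container, which matches the OpenAI Node SDK's APIError in our experience, but verify against the SDK version you run:

```typescript
// Sketch: distinguish container expiry from other 404s.
// The error shape here is an assumption about the SDK, not a contract.
function isContainerExpired(error: unknown): boolean {
  const err = error as { status?: number; message?: string };
  if (err?.status !== 404) return false;
  // Only treat 404s that reference a container as expiry; a missing
  // file or route also returns 404 and must be handled differently.
  return /container/i.test(err.message ?? "");
}
```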
The Challenges That Actually Bit Us
Non-Deterministic Output
The same CSV, the same prompt, different results every run. Not wrong results — just different ones. Different column chosen for the x-axis. Different color scheme. Different framing of the same finding.
This is largely fine for the core use case: the user wants to understand their data, and any valid analysis path serves that goal. But it creates problems in two specific places.
The first is automated testing. If you test that "the revenue trend chart has the title 'Monthly Revenue'", you will have a flaky test suite. We moved to semantic validation: does the narrative contain specific column names? Does the chart output contain a PNG with non-trivial file size? Does the structured JSON have the expected fields?
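Those semantic checks can be sketched as one validation pass. The thresholds and the validateAnalysisOutput name are illustrative:

```typescript
// Semantic validation: assert properties that hold across runs,
// never exact strings. Returns a list of problems (empty = pass).
function validateAnalysisOutput(
  narrative: string,
  chartBuffers: Buffer[],
  expectedColumns: string[]
): string[] {
  const problems: string[] = [];
  // The narrative should mention at least one real column name.
  if (!expectedColumns.some((c) => narrative.includes(c))) {
    problems.push("narrative mentions no known column");
  }
  for (const [i, buf] of chartBuffers.entries()) {
    // PNG magic bytes, plus a non-trivial size (blank charts are tiny).
    const isPng = buf.length > 8 && buf[0] === 0x89 && buf[1] === 0x50;
    if (!isPng) problems.push(`chart ${i} is not a PNG`);
    else if (buf.length < 5_000) problems.push(`chart ${i} suspiciously small`);
  }
  return problems;
}
```

Tests built on checks like these survive the model choosing a different color scheme or chart title on every run.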
The second is retry logic. If an analysis fails and you retry, you may get different results — not necessarily better ones. We limit analysis retries to one, and if the first retry also fails, we surface the error rather than retrying again.
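The retry-once policy reduces to a small generic wrapper (a sketch of the idea, not the production code):

```typescript
// Retry exactly once. A second attempt may clear a transient failure;
// further retries just burn time on a different-but-still-failing run.
async function retryOnce<T>(run: () => Promise<T>): Promise<T> {
  try {
    return await run();
  } catch {
    // One retry only; a second failure propagates to the caller,
    // where it is surfaced to the user rather than retried again.
    return await run();
  }
}
```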
For the complete guide to the Code Interpreter architecture, including how the model iterates on its own code when it hits errors, see the linked article.
Container Expiry Mid-Analysis
The most disruptive failure mode we encountered was a container expiring during the /refine call. Phase 1 (story discovery) completed. The user spent two minutes reading the candidates and selecting one. Phase 2 started. The container expired.
Why did it expire? The timer was running during the two minutes the user was reading. If they took longer than 20 minutes to select, the container would definitely be gone. Even at two minutes, there were edge cases.
The fix was to make the refine endpoint handle mid-session expiry with automatic re-upload:
async function refineWithRecovery(
session: AnalysisSession,
selectedStoryTitle: string
) {
try {
return await callRefineApi(session.containerId, selectedStoryTitle);
} catch (error) {
if (isContainerExpired(error)) {
// Re-upload and re-analyze transparently
const newContainerId = await reuploadFile(session.originalFilePath);
const newStories = await runPhaseOne(newContainerId);
session.containerId = newContainerId;
session.stories = newStories;
// Re-run refine with the new container
return await callRefineApi(newContainerId, selectedStoryTitle);
}
throw error;
}
}
This adds latency on recovery — the user waits for a re-upload and re-analysis. But it avoids a broken experience.
Large File Handling
Files above 15-20 MB start causing problems. The upload itself is slow. The model spends time just reading the file into a pandas DataFrame. Analysis of a 50 MB CSV with a million rows can time out before producing any output.
Our production handling:
Pre-validate size. Reject files above 50 MB at upload time with a clear error. For files between 15 and 50 MB, warn the user that analysis may be slow.
Detect row count before analysis. After upload but before the Responses API call, run a quick line count. If the file has more than 500,000 rows, the system prompt includes an instruction to sample the data before loading it fully:
# Injected into the system prompt for large files
# The file has approximately 800,000 rows.
# Load only a representative sample for analysis:
# df = pd.read_csv('/path/to/file.csv', nrows=50000)
# or use skiprows to sample evenly across the file.
Time the analysis call. Set a hard timeout of 120 seconds on the Responses API call. If it exceeds that, surface a timeout error rather than letting the request hang indefinitely.
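The hard timeout can be sketched with Promise.race (illustrative; the 120-second value comes from the text above):

```typescript
// Race the call against a timer. If the timer wins, surface a clear
// timeout error instead of letting the request hang indefinitely.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that Promise.race abandons the underlying request rather than cancelling it; pairing this with an abort signal passed to the HTTP client is worth considering.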
For more on sandboxed execution limits and what they mean for large files, including memory constraints inside the container, see the linked article.
Production Patterns
Health Checks That Actually Test the Stack
A health check endpoint that returns 200 because your Node.js process is running is useless. The meaningful question is whether the OpenAI stack is healthy.
Our health check runs a full, minimal Code Interpreter call — a tiny synthetic CSV, a single analysis instruction:
app.get("/health", async (req, res) => {
const start = Date.now();
try {
await runSyntheticAnalysis(); // 3-row CSV, simple prompt
res.json({
status: "healthy",
latencyMs: Date.now() - start,
timestamp: new Date().toISOString()
});
} catch (error) {
res.status(503).json({
status: "degraded",
error: error.message,
latencyMs: Date.now() - start
});
}
});
This runs every minute via an uptime monitor. When OpenAI has a bad hour, the health check catches it within a minute and triggers alerts. The synthetic analysis takes about 15-20 seconds, which tells you something useful about the current tail latency.
Graceful Degradation
When the health check is failing, new requests should fail fast rather than waiting 120 seconds to discover that OpenAI is degraded. We keep a circuit breaker in front of every Responses API call:
class CircuitBreaker {
private failures = 0;
private lastFailure: Date | null = null;
private readonly threshold = 5;
private readonly resetAfterMs = 60_000;
isOpen(): boolean {
if (this.failures < this.threshold) return false;
if (this.lastFailure && Date.now() - this.lastFailure.getTime() > this.resetAfterMs) {
this.failures = 0;
return false;
}
return true;
}
recordFailure() {
this.failures++;
this.lastFailure = new Date();
}
recordSuccess() {
this.failures = Math.max(0, this.failures - 1);
}
}
When the circuit is open, /analyze returns 503 immediately with a message telling the user to try again in a few minutes. The alternative — 10 users all hanging for 120 seconds on a failing API — consumes resources and produces a terrible experience.
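Wiring the breaker in front of a call might look like this — guarded is a hypothetical wrapper, written against a minimal breaker interface so any breaker implementation can slot in:

```typescript
// Minimal interface the wrapper needs; the CircuitBreaker class above
// satisfies it structurally.
interface Breaker {
  isOpen(): boolean;
  recordFailure(): void;
  recordSuccess(): void;
}

// Wrap a Responses API call so failures feed the breaker, and an open
// circuit fails fast instead of waiting out a long timeout.
async function guarded<T>(breaker: Breaker, call: () => Promise<T>): Promise<T> {
  if (breaker.isOpen()) {
    throw new Error("Service degraded; try again in a few minutes.");
  }
  try {
    const result = await call();
    breaker.recordSuccess();
    return result;
  } catch (error) {
    breaker.recordFailure();
    throw error;
  }
}
```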
Structured Logging
Every API call gets a correlation ID, and every log line includes it. When something goes wrong, you can reconstruct the entire session from the logs:
const logger = {
info(sessionId: string, event: string, data: Record<string, unknown>) {
console.log(JSON.stringify({
level: "info",
sessionId,
event,
...data,
ts: new Date().toISOString()
}));
}
};
// Usage
logger.info(sessionId, "container.created", { containerId: container.id });
logger.info(sessionId, "upload.completed", { fileId, fileSizeBytes, rowCount });
logger.info(sessionId, "analysis.started", { containerId, steeringPrompt: !!steering });
logger.info(sessionId, "analysis.completed", { storyCount: stories.length, elapsedMs });
Log the container ID with every line. It is the key that ties together everything that happened in a session, and it is what you need when debugging a user complaint about a specific analysis.
Idempotency
Retry logic on the client is only safe if the server handles duplicate requests correctly. If a client retries /analyze because it got a timeout, and the original request was still processing, you may get two analysis sessions running simultaneously against the same container.
We handle this with idempotency keys. The client generates a UUID per request and passes it as a header. The server caches in-progress and completed analyses by idempotency key for 10 minutes. A duplicate request returns the cached result rather than starting a second analysis.
const inProgress = new Map<string, Promise<AnalysisResult>>();
app.post("/api/analyze", async (req, res) => {
const idempotencyKey = req.headers["idempotency-key"] as string;
if (idempotencyKey && inProgress.has(idempotencyKey)) {
const result = await inProgress.get(idempotencyKey);
return res.json(result);
}
const analysisPromise = runAnalysis(req.body);
if (idempotencyKey) {
inProgress.set(idempotencyKey, analysisPromise);
analysisPromise.finally(() => {
setTimeout(() => inProgress.delete(idempotencyKey), 600_000);
});
}
res.json(await analysisPromise);
});
For a complete treatment of retry strategies and how to structure them safely, see error handling and retry patterns for data analysis APIs.
Lessons Learned
The model's self-correction is more reliable than you expect. Code Interpreter often writes code that fails on the first execution — wrong column name, unexpected data type, missing value handling. It reads the traceback, corrects the code, and retries. This happens automatically and invisibly. We wasted time adding application-level retry logic for analysis errors before we understood how much the model already handles internally.
Design for the 20-minute TTL from day one, not as an afterthought. Every architecture decision that ignores the container expiry will come back as a production bug. The session model, the storage of original file paths, the automatic re-upload on expiry — all of these are easier to build in from the start than to retrofit.
Timeouts need to be set at multiple layers. The HTTP client timeout, the request timeout middleware, and the circuit breaker threshold all need to agree on what "too slow" means. Mismatched timeouts produce mysterious behavior where requests seem to fail for no reason.
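One way to keep the layers in agreement is to derive every timeout from the innermost one, so they cannot drift independently (a sketch of the idea, not the production values):

```typescript
// Each outer layer gets headroom over the layer inside it, so the
// innermost timeout always fires first and produces the real error.
const API_CALL_TIMEOUT_MS = 120_000; // OpenAI call (innermost)
const REQUEST_TIMEOUT_MS = API_CALL_TIMEOUT_MS + 10_000; // server middleware
const CLIENT_TIMEOUT_MS = REQUEST_TIMEOUT_MS + 5_000; // HTTP client (outermost)
```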
Log the container ID everywhere. It is the session identifier for everything that happens between upload and chart download. Every log line that lacks it will make debugging harder.
Non-determinism is a feature, not a bug — until it isn't. Embrace it for the core analysis. But be explicit about where you need determinism (test assertions, reproducible reports) and handle those cases specifically.
The prototype is the easy part. The work is in the last ten percent — the expiry handling, the retry logic, the health checks, the logging. That work is where production systems are actually built.
Ready to find your data story?
Upload a CSV and DataStoryBot will uncover the narrative in seconds.
Try DataStoryBot →