# Optimizing Code Interpreter Prompts for Reliable JSON Output
The prompt engineering behind getting Code Interpreter to return parseable JSON alongside charts — with DataStoryBot's extraction patterns.
Code Interpreter returns text. It also generates charts. What it does not natively do is return machine-parseable JSON alongside both. If you need to display a chart, surface structured metadata about findings, and store results in a database — all from a single analysis call — you have an extraction problem.
This article covers the prompt engineering techniques and parsing patterns that make this work reliably. The examples draw from DataStoryBot's production pipeline, which needs to extract narrative text, chart metadata, key findings, and confidence signals from every Code Interpreter response.
## The Fundamental Problem
When Code Interpreter finishes an analysis, the response object looks like this:
```json
{
  "output": [
    {
      "type": "message",
      "content": [
        {
          "type": "output_text",
          "text": "I analyzed your sales data and found three key trends...\n\nRevenue grew 23% quarter-over-quarter, driven primarily by the West region..."
        }
      ]
    },
    {
      "type": "code_interpreter_call",
      "code": "import matplotlib.pyplot as plt\n...",
      "outputs": [
        {
          "type": "image",
          "image_url": "..."
        }
      ]
    }
  ]
}
```
The text is free-form prose. There is no structured metadata about what the model found, how confident it is, or which chart corresponds to which finding. If you want to render a data story with titled chart sections, store finding summaries in a database, or pass key metrics to a downstream system — you need to engineer that structure into the output yourself.
There are three approaches — a JSON block requested in the prompt, custom delimiters, and a second extraction pass — each with different trade-offs.
## Approach 1: Ask for JSON in the System Prompt
The simplest technique: instruct the model to include a JSON block in its response, and then extract it during parsing.
### Before (naive prompt)
```python
system_prompt = """
You are a data analyst. Analyze the uploaded CSV and identify the most important findings.
Generate charts to visualize the key patterns.
"""
```
**Typical output:**
```
I analyzed the sales data for Q1 2026. Revenue grew from $1.2M in January to $1.8M in March,
representing a 50% increase. The West region was the top performer, contributing 42% of total revenue.
I've generated a line chart showing the monthly trend and a bar chart breaking down revenue by region.
```
Parseable? No. You can display the prose, but you cannot extract `{"q1_revenue": 1800000, "top_region": "West", "growth_pct": 50}` from this without brittle regex or a second LLM call.
### After (structured output prompt)
````python
system_prompt = """
You are a data analyst. Analyze the uploaded CSV and identify the most important findings.
Generate charts to visualize the key patterns.

After your analysis, output a JSON block with the following structure. Output ONLY valid JSON
inside the code fence — no prose inside the block:

```json
{
  "summary": "One sentence summary of the most important finding",
  "key_metrics": {
    "metric_name": "metric_value"
  },
  "findings": [
    {
      "title": "Short finding title",
      "detail": "One to two sentence explanation",
      "chart_index": 0
    }
  ],
  "data_quality": {
    "row_count": 0,
    "null_percentage": 0.0,
    "warnings": []
  }
}
```

Include this JSON block at the end of your response, after all prose and charts.
"""
````
**Typical output:**
I analyzed the sales data for Q1 2026. Revenue grew 50% from January to March...
[line chart] [bar chart]
```json
{
  "summary": "Revenue grew 50% in Q1 2026, driven by West region outperformance",
  "key_metrics": {
    "q1_revenue": "$1.8M",
    "growth_rate": "50%",
    "top_region": "West",
    "top_region_share": "42%"
  },
  "findings": [
    {
      "title": "Strong Q1 Revenue Growth",
      "detail": "Revenue increased from $1.2M in January to $1.8M in March, with acceleration in the final month.",
      "chart_index": 0
    },
    {
      "title": "West Region Dominates",
      "detail": "The West region contributed 42% of total Q1 revenue, nearly double the next-highest region.",
      "chart_index": 1
    }
  ],
  "data_quality": {
    "row_count": 92,
    "null_percentage": 1.1,
    "warnings": []
  }
}
```
Now you have both the prose narrative and a parseable structure in a single response.
## Extraction Patterns
Once the model returns a response with an embedded JSON block, you need to reliably extract it. Several patterns handle this, with different robustness characteristics.
### Pattern 1: Code Fence Extraction (most common)
The model wraps JSON in a ```` ```json ```` code fence. This is the most stable marker to target:
```python
import re
import json

def extract_json_from_response(text: str) -> dict | None:
    # Match ```json ... ``` blocks
    pattern = r"```json\s*([\s\S]*?)\s*```"
    matches = re.findall(pattern, text, re.MULTILINE)
    if not matches:
        return None
    # Use the last match — the structured output block at the end
    candidate = matches[-1].strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```
Using the last match is important. If the model includes any JSON examples in its prose (explaining what it found), the fence regex will capture those too. The structured output block should always be the last one in the response.
### Pattern 2: Delimiter-Based Extraction (more explicit)
You can make the model use a custom delimiter that is unlikely to appear in prose:
```python
system_prompt = """
...your existing instructions...

After your analysis, output your structured findings between these exact delimiters:

---STRUCTURED_OUTPUT_START---
{ your json here }
---STRUCTURED_OUTPUT_END---
"""
```
Extraction:
```python
def extract_delimited_json(text: str) -> dict | None:
    start_marker = "---STRUCTURED_OUTPUT_START---"
    end_marker = "---STRUCTURED_OUTPUT_END---"
    start_idx = text.find(start_marker)
    end_idx = text.find(end_marker)
    if start_idx == -1 or end_idx == -1:
        return None
    json_str = text[start_idx + len(start_marker):end_idx].strip()
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        return None
```
Custom delimiters are more robust against hallucinated code blocks in the prose, but the model occasionally forgets them — especially on long analyses where the instruction appears early in the system prompt. Mitigate this by repeating the delimiter requirement at the end of the prompt: "Remember to include the structured output block between the delimiters at the end of your response."
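That mitigation can be folded into prompt assembly. A minimal sketch, with `build_system_prompt` as a hypothetical helper name (not part of any library):

```python
def build_system_prompt(base_instructions: str, schema_block: str) -> str:
    # Place the schema last, then repeat the delimiter requirement as the
    # final instruction: recency placement reduces omission on long analyses.
    reminder = (
        "Remember to include the structured output block between "
        "---STRUCTURED_OUTPUT_START--- and ---STRUCTURED_OUTPUT_END--- "
        "at the end of your response."
    )
    return f"{base_instructions.strip()}\n\n{schema_block.strip()}\n\n{reminder}"
```

The point is mechanical: the reminder is always the very last thing the model reads, no matter how long the base instructions grow.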
### Pattern 3: Fallback Chain
Neither pattern is perfectly reliable in isolation. DataStoryBot uses a fallback chain:
```python
def extract_structured_output(text: str) -> dict | None:
    # Try code fence extraction first
    result = extract_json_from_response(text)
    if result:
        return result
    # Then delimiter extraction
    result = extract_delimited_json(text)
    if result:
        return result
    # Last resort: scan for balanced top-level {...} spans. This is
    # brittle but catches cases where the model forgot the wrapper but
    # still output valid JSON. (A greedy regex like r'\{[\s\S]*\}' would
    # merge multiple objects into one unparseable match, so count braces
    # instead; a brace inside a JSON string will still miscount, but
    # this is only a fallback.)
    candidates = []
    depth, start = 0, 0
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                candidates.append(text[start:i + 1])
    for candidate in reversed(candidates):  # last is most likely the structured block
        try:
            parsed = json.loads(candidate)
            # Validate it looks like our schema
            if "findings" in parsed or "summary" in parsed:
                return parsed
        except json.JSONDecodeError:
            continue
    return None
```
## DataStoryBot's Schema Design
The schema you ask for matters as much as the extraction mechanism. A poorly designed schema produces inconsistent values; a well-designed one constrains the model into predictable output.
DataStoryBot's production schema for a story response:
```python
STORY_OUTPUT_SCHEMA = {
    "summary": "string — one sentence, present tense, specific numbers",
    "headline": "string — 8 words max, punchy finding statement",
    "findings": [
        {
            "title": "string — 4-6 words",
            "narrative": "string — 2-3 sentences, explain significance not just observation",
            "metric": "string | null — the primary number this finding rests on",
            "direction": "up | down | neutral | mixed",
            "chart_type": "line | bar | scatter | histogram | heatmap | none",
            "chart_index": "integer — 0-based index into the generated charts list"
        }
    ],
    "data_profile": {
        "row_count": "integer",
        "column_count": "integer",
        "date_range": "string | null — ISO dates if temporal data present",
        "null_rate": "float — 0.0 to 1.0"
    },
    "confidence": "high | medium | low",
    "confidence_reason": "string — what limits confidence, if not high"
}
```
Three design principles are embedded here:
**Use enums for categorical fields.** `direction: "up | down | neutral | mixed"` is far more reliable than asking for a free-form description of direction. The model picks from the list. If you ask for a free-form field, you get "increased significantly," "showed an upward trend," "grew," and "rose" — all meaning the same thing but requiring normalization logic downstream.

**Include `chart_index` explicitly.** Code Interpreter generates charts in sequence. If you want to associate finding #2 with the second chart, the model needs to track that mapping. Without an explicit `chart_index` field, you are left guessing.

**Make confidence explicit and require a reason.** A `confidence: "low"` with `confidence_reason: "Only 45 rows of data — trend findings may not be statistically meaningful"` gives you something to surface to users. Without it, all outputs look equally authoritative.
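These constraints only pay off if you enforce them after parsing. A sketch of a post-parse validator over the schema above (`validate_story_output` is a hypothetical name for illustration, not DataStoryBot's actual code; the enum sets mirror the schema):

```python
ALLOWED_DIRECTIONS = {"up", "down", "neutral", "mixed"}
ALLOWED_CHART_TYPES = {"line", "bar", "scatter", "histogram", "heatmap", "none"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}

def validate_story_output(parsed: dict, chart_count: int) -> list[str]:
    """Return a list of schema violations; an empty list means the output is usable."""
    errors = []
    if parsed.get("confidence") not in ALLOWED_CONFIDENCE:
        errors.append(f"bad confidence: {parsed.get('confidence')!r}")
    for i, finding in enumerate(parsed.get("findings", [])):
        if finding.get("direction") not in ALLOWED_DIRECTIONS:
            errors.append(f"finding {i}: bad direction {finding.get('direction')!r}")
        if finding.get("chart_type") not in ALLOWED_CHART_TYPES:
            errors.append(f"finding {i}: bad chart_type {finding.get('chart_type')!r}")
        idx = finding.get("chart_index")
        # chart_index must point at a chart that actually exists
        if not isinstance(idx, int) or not 0 <= idx < chart_count:
            errors.append(f"finding {i}: chart_index {idx!r} out of range")
    return errors
```

Returning a violation list rather than raising lets you decide per-field whether to degrade gracefully or retry.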
## Handling Malformed JSON
Even with good prompts, Code Interpreter occasionally returns malformed JSON. Common failure modes:
- **Trailing commas:** `{"key": "value",}` — invalid JSON, valid in JavaScript. Python's `json.loads` will reject this.
- **Truncated blocks:** long JSON objects sometimes get cut off if the response approaches token limits.
- **Escaped quotes inside strings:** `{"summary": "Revenue \"grew\" by 50%"}` is valid JSON, but the model sometimes forgets to escape and writes `"Revenue "grew" by 50%"` instead.
Handling these with a repair step before json.loads:
```python
import re
import json

def repair_json(text: str) -> str:
    # Remove trailing commas before } or ]
    text = re.sub(r',\s*([}\]])', r'\1', text)
    # This is the limit of safe regex repair — beyond this,
    # use a library or request a retry
    return text

def safe_json_loads(text: str) -> dict | None:
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        repaired = repair_json(text)
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            return None
```
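A quick sanity check of the trailing-comma repair (the helper is re-declared inline so the snippet runs on its own):

```python
import re
import json

def repair_json(text: str) -> str:
    # Remove trailing commas before } or ]
    return re.sub(r',\s*([}\]])', r'\1', text)

# Trailing commas at both nesting levels, as Code Interpreter sometimes emits
broken = '{"key_metrics": {"growth_rate": "50%",}, "warnings": [],}'
repaired = repair_json(broken)
assert json.loads(repaired) == {"key_metrics": {"growth_rate": "50%"}, "warnings": []}
```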
For truncation: if your extraction returns None after all fallback patterns, the best recovery path is a retry with an explicit instruction:
```python
retry_message = (
    "Your previous response was missing the structured JSON output block. "
    "Please output ONLY the JSON block now, with no prose. "
    "Use this exact schema: ..."
)
```
A targeted retry is cheaper than re-running the full analysis.
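The retry flow can be sketched as a small wrapper. This is illustrative only: `run_turn` stands in for whatever function sends a message on your existing conversation and returns the response text, and `extract` is any of the extraction functions above.

```python
def get_structured_output(run_turn, extract, retry_message: str, max_retries: int = 1):
    """run_turn(message | None) -> response text; extract(text) -> dict | None."""
    text = run_turn(None)  # the initial analysis turn
    parsed = extract(text)
    attempts = 0
    while parsed is None and attempts < max_retries:
        # Targeted retry: ask only for the JSON block, not a full re-analysis
        text = run_turn(retry_message)
        parsed = extract(text)
        attempts += 1
    return parsed
```

Capping retries matters: if the model omits the block twice in a row, something is wrong with the prompt, and looping further just burns tokens.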
## Prompt Placement Matters
Where you put the JSON instruction in your system prompt affects compliance. A few tested observations:
**Put the schema at the end of the system prompt.** The model processes the full system prompt, but the last instructions have higher recency weight. If your JSON schema instruction is buried in the middle of a long prompt, the model completes the analysis and then forgets to append the block.

**Repeat the critical instruction.** At the end of your system prompt, after the full schema definition, add a short reminder: "Always end your response with the structured JSON block wrapped in a ```json code fence. This is required." Repetition is not elegant, but it reduces omission rate by roughly half in testing.

**Keep the schema definition concise.** A 50-line schema definition in the system prompt is harder for the model to follow than a 15-line one. If your needs are complex, break the analysis into two calls: one for the analysis and charts, a second to extract structure from the first call's output.
## Using a Second Pass for Extraction
For applications where JSON reliability is critical, the most robust architecture uses two API calls:
```python
async def analyze_with_guaranteed_json(
    container_id: str,
    file_id: str,
    steering_prompt: str
) -> tuple[str, dict]:
    # First call: run the actual analysis, get prose + charts
    analysis_response = await client.responses.create(
        model="gpt-4o",
        tools=[{"type": "code_interpreter", "container": {"id": container_id}}],
        input=[
            {"role": "system", "content": ANALYSIS_SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "input_file", "file_id": file_id},
                {"type": "input_text", "text": steering_prompt}
            ]}
        ]
    )
    prose = extract_text_from_response(analysis_response)

    # Second call: extract structure from the prose
    extraction_response = await client.responses.create(
        model="gpt-4o-mini",  # smaller model is fine for extraction
        input=[
            {"role": "system", "content": EXTRACTION_SYSTEM_PROMPT},
            {"role": "user", "content": f"Extract structured data from this analysis:\n\n{prose}"}
        ]
    )
    # This second call uses gpt-4o-mini with a tight schema prompt.
    # Failure rate is near zero because the model is just parsing known prose
    structured = json.loads(extraction_response.output_text)
    return prose, structured
```
The two-call pattern has a cost: latency and tokens. But it decouples the analysis quality from the extraction reliability. The analysis model can focus entirely on the data; the extraction model can focus entirely on parsing. Use this pattern when the JSON is going into a database or driving downstream logic, and a bad parse would cause a visible failure.
For DataStoryBot's use case — rendering a data story UI — the single-call approach with a fallback chain is sufficient. A missing JSON block results in a degraded UI experience, not a broken transaction.
## Testing Your Extraction
Test extraction against a set of real responses before deploying. Key cases to cover:
- Normal response with JSON block at the end
- Response where the model added prose after the JSON block
- Response where the JSON block has trailing commas
- Response where the model used single quotes instead of double quotes
- Response where the JSON block was truncated mid-object
- Response with no JSON block at all (complete omission)
- Response with multiple JSON blocks (model included examples in prose)
Build a test fixture of actual Code Interpreter responses — not mocked strings — and run your extraction logic against them. The failure modes are distribution-specific; synthetic test cases will not surface them.
## What to Read Next
For the full Code Interpreter API surface that these prompts work against, see OpenAI Code Interpreter for Data Analysis: A Complete Guide.
For the Responses API workflow that delivers the prompts to the container, see Building a Code Interpreter Workflow with the Responses API.
For steering prompt patterns that shape what the model analyzes — as opposed to how it structures its output — see Prompt Engineering for Data Analysis: Steering Prompts That Work.
Ready to find your data story?
Upload a CSV and DataStoryBot will uncover the narrative in seconds.
Try DataStoryBot →