Streaming Structured Output Is Still Broken

Structured output has become a de facto expectation for LLM APIs, and streaming helps overcome long model latencies, but streaming structured output is still broken across all major providers.

I compared OpenAI, Anthropic, and Gemini, and each breaks developer expectations differently: parsed results disappear, token stats vanish, or you end up with unnecessary boilerplate.

TL;DR: We can stream text. We can produce structured output. But streaming structured output (incrementally, cleanly, and consistently) is still broken across all major APIs.

Motivation

I had a use case where an agent loop takes 8 to 10 seconds per step. By streaming the "action" first and letting the rest (like memory updates or long-term context) finish in the background, I can bring step time down to ~5 s, nearly a 2× speedup for free.

Streaming would make this possible, if only it worked consistently.

Gemini

Non-streaming works fine:

from google import genai
from google.genai import types
from pydantic import BaseModel

class Todo(BaseModel):
    title: str
    done: bool

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-3-pro",
    contents="Return a JSON array of 3 todos (title, done).",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=list[Todo],
    ),
)
print(resp.parsed)  # typed Todo objects, not raw text

But the streaming variant breaks expectations:

for ev in client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents="Stream a JSON array of 3 todos.",
    config=types.GenerateContentConfig(response_mime_type="application/json"),
):
    print(ev.text)  # raw text chunks only; there is no .parsed counterpart

You only get raw text chunks; no parsed output, no final token statistics. Switching from non-streaming to streaming silently breaks the API contract.
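
To restore roughly what resp.parsed gave us, you end up buffering the whole stream and validating at the end, which defeats the point of streaming. A sketch, reusing the Todo model and client from above (TypeAdapter is standard Pydantic):

from pydantic import TypeAdapter

# Buffer every chunk, then validate once at the end: the only way to get
# typed output back, at the cost of any incremental benefit.
chunks = []
for ev in client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents="Stream a JSON array of 3 todos.",
    config=types.GenerateContentConfig(response_mime_type="application/json"),
):
    chunks.append(ev.text or "")

todos = TypeAdapter(list[Todo]).validate_json("".join(chunks))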

OpenAI and Anthropic

OpenAI does better: its Responses API exposes typed event streams (response.output_text.delta, etc.), but you still have to rebuild the JSON yourself, incrementally. Anthropic streams partial JSON too, but it doesn't map cleanly onto Pydantic. I've used ijson and, more recently, JsonRiver (which helps), but routing events manually still adds a lot of boilerplate.
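
Here is roughly what that manual rebuild looks like with the OpenAI SDK (a sketch; the strict json.loads stands in for an incremental parser like ijson or JsonRiver, and shows exactly why you need one):

import json
from openai import OpenAI

client = OpenAI()
buf = ""

stream = client.responses.create(
    model="gpt-4o",
    input="Return a JSON array of 3 todos (title, done).",
    stream=True,
)
for event in stream:
    if event.type == "response.output_text.delta":
        buf += event.delta           # raw text, not parsed objects
        try:
            print(json.loads(buf))   # only succeeds once the JSON closes
        except json.JSONDecodeError:
            pass                     # mid-object: a strict parser gives nothing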

Pydantic-AI

Since most structured output already goes through Pydantic, Pydantic-AI feels like the right foundation. Here's an example with minimal nesting and a boolean field:

from pydantic_ai import Agent
from pydantic import BaseModel
from typing import Optional
import asyncio, datetime

class Contact(BaseModel):
    email: Optional[str] = None
    verified: bool = False

class Profile(BaseModel):
    name: str
    dob: Optional[datetime.date] = None
    contact: Optional[Contact] = None
    bio: Optional[str] = None

agent = Agent(
    "openai:gpt-4o",
    output_type=Profile,
    system_prompt="Extract a user profile (name, dob, contact, bio).",
)

async def main():
    async with agent.run_stream(
        "I'm Ben Parker, born 1990-01-28. My email is ben@dailybugle.com (verified)."
    ) as result:
        async for partial in result.stream_output():
            print(partial)

asyncio.run(main())

This works, but you can't tell whether a field still holds its default or has actually been updated, and nested changes are hard to route incrementally.
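
One obvious workaround is diffing consecutive partials. The helper below (diff_fields is a name I made up, not a Pydantic-AI API) detects top-level changes, but it still can't tell a model-emitted default apart from a field that simply hasn't streamed yet:

from typing import Any

def diff_fields(prev: dict[str, Any], curr: dict[str, Any]) -> dict[str, Any]:
    """Return only the top-level keys whose values changed between partials."""
    return {k: v for k, v in curr.items() if prev.get(k) != v}

# Inside the stream loop above:
#
#     prev: dict[str, Any] = {}
#     async for partial in result.stream_output():
#         curr = partial.model_dump()
#         print(diff_fields(prev, curr))  # e.g. {'contact': {...}} as it fills
#         prev = curr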

LangDiff

LangDiff uses event hooks per field. It's flexible but verbose, focused more on frontend interoperability and type safety than efficient streaming.

# Article, Section, and ArticleGenerationResponse are schema classes
# defined elsewhere; this shows LangDiff's per-field hook style.
ui = Article(sections=[])
response = ArticleGenerationResponse()

@response.section_titles.on_append
def on_section_title_append(title, index):
    # a new title started streaming: create its UI slot...
    ui.sections.append(Section(title="", content="", done=False))

    @title.on_append
    def on_title_append(chunk):
        # ...then route each text chunk into that slot
        ui.sections[index].title += chunk
A useful pattern, but too verbose to be practical for everyday structured output.

What's Missing

None of these fully solves the problem. This isn't a model limitation; it's an interface issue. An ideal solution should:

- stream typed, parsed objects instead of raw text chunks;
- preserve the non-streaming contract, including parsed output and final token statistics;
- plug directly into Pydantic models;
- report which fields changed in each update, nested fields included;
- keep routing boilerplate to a minimum.
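
To make that concrete, here's one shape such an interface could take. Everything here is hypothetical: StructuredUpdate and stream_structured don't exist in any SDK; they just name the pieces the list above asks for.

from dataclasses import dataclass
from typing import Any

@dataclass
class StructuredUpdate:
    partial: Profile              # typed, validated partial object
    changed_fields: set[str]      # e.g. {"contact.email"}, nested paths included
    usage: dict[str, Any] | None  # token stats, populated on the final update

# async for update in stream_structured(agent, prompt, output_type=Profile):
#     render(update.changed_fields, update.partial)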

I'm experimenting with a few ideas. Feel free to email me any ideas or feedback.