Streaming AI Response

Shows the AI response token by token as it is generated, instead of waiting for the full reply, which greatly improves perceived speed.

When to use this

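Use this for any user-facing AI response that takes longer than about 1 second to generate. Streaming can make a 10s response feel closer to 2s, because the user starts reading as soon as the first tokens arrive. Skip it for very short responses or batch processing.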

ai streaming sse websocket ux perceived-latency

✨ Built using these library patterns:

streaming-response

Step-by-step recipe

Copy this and paste into Cursor, Claude Code, or v0.

PATTERN: Streaming AI Response
INPUT: user_message, llm_request_params
OUTPUT: streamed_tokens (incremental), final_complete_response

STEPS:
  1. User submits message
  2. Show "AI is responding..." indicator + empty response container
  3. Client opens SSE (Server-Sent Events) connection to server; events then flow server → client
  4. Server calls LLM with stream: true
  5. LLM yields tokens one at a time (or small chunks)
  6. Server forwards each token to client via SSE event
  7. Client appends each token to response container in real-time
  8. Show subtle "typing" cursor at end of streaming text
  9. IF stream completes successfully → close SSE, finalize message in DB
  10. IF stream interrupted → save partial, mark "incomplete", offer retry

ERROR_HANDLING:
  - Connection drops mid-stream → save partial, show "Connection lost — [Continue]" button
  - LLM API error mid-stream → flush what we have, append "[error: continued differently]"
  - User navigates away → abort the upstream LLM call so no further tokens are generated or billed
  - Slow tokens (LLM lagging) → show "thinking..." state if no token in 5s

EXTENSION_POINTS:
  - Tool calls embedded in stream (composable_with: ["tool-calling"])
  - Citation links rendered as they appear (composable_with: ["rag-retrieval"])
  - Stop button to abort (saves cost on long unwanted responses)
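
The server side of steps 4 to 10 in the recipe above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not a definitive implementation: `sse_event`, `stream_response`, the event names `token`, `done`, and `error`, and the `saved` dict that stands in for the DB are all hypothetical, and the LLM is represented by any iterable of token strings.

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, data: dict) -> str:
    """Format one SSE frame: an event name, one JSON data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_response(tokens: Iterable[str], saved: dict) -> Iterator[str]:
    """Forward each token as an SSE frame and record the final or partial text.

    `saved` is a hypothetical stand-in for the message row in the DB.
    """
    parts = []
    try:
        for token in tokens:  # step 5: the LLM yields tokens
            parts.append(token)
            yield sse_event("token", {"text": token})  # step 6: forward to client
        # step 9: stream completed, finalize the message
        saved.update(text="".join(parts), status="complete")
        yield sse_event("done", {})
    except Exception:
        # step 10: stream interrupted, keep the partial text and offer retry
        saved.update(text="".join(parts), status="incomplete")
        yield sse_event("error", {"retry": True})
```

JSON-encoding the payload keeps each `data:` field on a single line even when a token contains a newline, which would otherwise split the SSE frame.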

States — how things change

State	Description	Transitions
Awaiting message	Idle	Submitted → Streaming
Streaming	Tokens flowing from LLM to client	Token received → Streaming; Stream ended → Complete; Connection lost → Partial saved
Complete	Full response received, persisted	User sends another → Streaming
Partial saved	Stream interrupted, partial response visible with retry option	User clicks Continue → Streaming; User dismisses → Awaiting message
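
The state table above maps directly onto a lookup table. The sketch below is one possible encoding in Python; the event names (`submitted`, `token_received`, and so on) and the `next_state` helper are hypothetical labels chosen for illustration.

```python
from enum import Enum


class State(Enum):
    AWAITING = "Awaiting message"
    STREAMING = "Streaming"
    COMPLETE = "Complete"
    PARTIAL = "Partial saved"


# (current state, event) -> next state, one entry per transition in the table
TRANSITIONS = {
    (State.AWAITING, "submitted"): State.STREAMING,
    (State.STREAMING, "token_received"): State.STREAMING,
    (State.STREAMING, "stream_ended"): State.COMPLETE,
    (State.STREAMING, "connection_lost"): State.PARTIAL,
    (State.COMPLETE, "user_sends_another"): State.STREAMING,
    (State.PARTIAL, "continue_clicked"): State.STREAMING,
    (State.PARTIAL, "dismissed"): State.AWAITING,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or raise ValueError for a transition the table does not list."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value!r} on {event!r}") from None
```

Rejecting unlisted transitions makes bugs such as a token arriving after the stream has completed fail loudly instead of silently corrupting the UI state.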

Easy-to-miss situations

The kinds of edge cases that break demos.

What if the user closes the tab during a long response?
Severity: medium
The LLM keeps generating tokens that you are paying for.
Suggested handling: When the SSE connection closes, the server detects the disconnect and aborts the upstream LLM call (for example with an AbortController). This saves cost and frees a rate-limit slot.
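
The abort-on-disconnect handling above can be sketched as follows. AbortController is a JavaScript API; this Python sketch swaps in the closest asyncio analogue, which is closing the upstream async generator. `fake_llm`, `relay`, `send`, and `is_disconnected` are hypothetical names, and a real LLM client would need its own cancellation hook.

```python
import asyncio
from typing import AsyncGenerator, Awaitable, Callable


async def fake_llm(generated: list) -> AsyncGenerator[str, None]:
    """Hypothetical upstream: records every token it generates (and would bill for)."""
    for i in range(1000):
        token = f"t{i} "
        generated.append(token)
        yield token
        await asyncio.sleep(0.01)


async def relay(
    upstream: AsyncGenerator[str, None],
    send: Callable[[str], Awaitable[None]],
    is_disconnected: Callable[[], bool],
) -> int:
    """Forward tokens until the client disconnects, then stop the upstream generator."""
    sent = 0
    try:
        async for token in upstream:
            if is_disconnected():
                break  # client is gone: stop reading tokens
            await send(token)
            sent += 1
    finally:
        # Closing the generator ends upstream generation; this plays the
        # role that AbortController.abort() plays in JavaScript.
        await upstream.aclose()
    return sent
```

In this sketch at most one extra token is generated after the disconnect, because the check runs once per token.
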
What if Markdown or code blocks render incorrectly while streaming?
Severity: medium
A half-rendered triple-backtick fence looks broken.
Suggested handling: Use a streaming-aware Markdown renderer (e.g., react-markdown with a custom plugin). Show raw text until the block closes, then re-render. Alternatively, hide code blocks until they are complete.
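
The "show raw text until the block closes" idea above can be sketched as a small buffer split. This is not how react-markdown works internally; `split_renderable` is a hypothetical helper, shown in Python, that a client could call on the accumulated text after each token. It only counts complete triple-backtick fences, so a fence that has arrived as just one or two backticks is not detected until the third arrives.

```python
FENCE = "`" * 3  # a triple-backtick code fence


def split_renderable(buffer: str) -> tuple:
    """Split streamed text into (safe_to_render, held_back).

    If the number of fences is odd, the last fence is still open, so everything
    from that fence onward is held back and shown as raw text (or hidden).
    """
    if buffer.count(FENCE) % 2 == 0:
        return buffer, ""
    start = buffer.rfind(FENCE)
    return buffer[:start], buffer[start:]
```

Once the closing fence arrives, the count becomes even and the whole buffer is returned as safe to render, which triggers the re-render.
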
What if the LLM gets stuck mid-response (no tokens for 30s)?
Severity: medium
The user stares at frozen text and doesn't know whether it will continue.
Suggested handling: After 5s without a token, show "AI is taking longer than usual..." After 30s, offer "Stop / Retry" buttons. After 60s, fail automatically and show a retry CTA.
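
The 5s, 30s, and 60s thresholds above reduce to a single function of the time since the last token. A minimal Python sketch follows; `stall_state` and the state labels it returns are hypothetical.

```python
def stall_state(idle_seconds: float) -> str:
    """Map seconds since the last token to the UI state the client should show."""
    if idle_seconds >= 60:
        return "failed"  # auto-fail and show a retry CTA
    if idle_seconds >= 30:
        return "offer_stop_retry"  # show "Stop / Retry" buttons
    if idle_seconds >= 5:
        return "slow_notice"  # show "AI is taking longer than usual..."
    return "streaming"
```

A client would typically call this from a timer and reset the idle clock each time a token arrives.
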
What if streaming is enabled but the response is very short (1 word)?
Severity: low
The streaming overhead (SSE setup) outweighs the benefit, and the result feels janky.
Suggested handling: Stream anyway for consistency, since the UI doesn't know the length in advance. Keep the typewriter cursor subtle so short responses don't look odd. Don't add artificial delay.
What if the user is behind a corporate proxy that buffers SSE?
Severity: low
Tokens arrive in big chunks instead of streaming, which defeats the purpose.
Suggested handling: Send a keep-alive ping every 1s in the stream. Set the Content-Type: text/event-stream and X-Accel-Buffering: no response headers. Test in corporate environments before launch.
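
The headers and keep-alive ping above might look like the following Python sketch. In the SSE format a line that starts with a colon is a comment that clients ignore, which makes it a convenient ping. `SSE_HEADERS`, `KEEPALIVE_FRAME`, and `next_write` are hypothetical names, and the `Cache-Control` header is an assumption that is not part of the handling above.

```python
from typing import Optional

SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "X-Accel-Buffering": "no",  # asks nginx-style reverse proxies not to buffer
    "Cache-Control": "no-cache",  # assumption: commonly sent with SSE, not in the recipe
}

# An SSE comment line: ignored by the client, but it keeps bytes flowing.
KEEPALIVE_FRAME = ": keep-alive\n\n"


def next_write(
    ready_frame: Optional[str],
    idle_seconds: float,
    ping_interval: float = 1.0,
) -> Optional[str]:
    """Decide what to write next: a real frame, a ping if idle long enough, or nothing."""
    if ready_frame is not None:
        return ready_frame
    if idle_seconds >= ping_interval:
        return KEEPALIVE_FRAME
    return None
```

Pings only help with proxies that flush on activity; a proxy that buffers the whole response regardless still needs to be caught by the pre-launch test.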

Composes well with

Combine these patterns when you need a richer flow.

chat-loop tool-calling rag-retrieval
