AI
Streaming AI Response
Shows the AI response token by token as it is generated instead of waiting for the full reply, which noticeably improves perceived speed.
When to use this
Use for any user-facing AI response that takes longer than about 1 second to generate. Streaming can make a 10s response feel closer to 2s. Skip it for very short responses or batch processing.
Step-by-step recipe
Copy this and paste into Cursor, Claude Code, or v0.
PATTERN: Streaming AI Response
INPUT: user_message, llm_request_params
OUTPUT: streamed_tokens (incremental), final_complete_response
STEPS:
1. User submits message
2. Show "AI is responding..." indicator + empty response container
3. Client opens an SSE (Server-Sent Events) connection to the server
4. Server calls LLM with stream: true
5. LLM yields tokens one at a time (or small chunks)
6. Server forwards each token to client via SSE event
7. Client appends each token to response container in real-time
8. Show subtle "typing" cursor at end of streaming text
9. IF stream completes successfully → close SSE, finalize message in DB
10. IF stream interrupted → save partial, mark "incomplete", offer retry
ERROR_HANDLING:
- Connection drops mid-stream → save partial, show "Connection lost - [Continue]" button
- LLM API error mid-stream → flush what we have, append "[error: continued differently]"
- User navigates away → stop billing for unused tokens (abort the upstream call)
- Slow tokens (LLM lagging) → show "thinking..." state if no token in 5s
EXTENSION_POINTS:
- Tool calls embedded in stream (composable_with: ["tool-calling"])
- Citation links rendered as they appear (composable_with: ["rag-retrieval"])
- Stop button to abort (saves cost on long unwanted responses)
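Steps 6 and 7 of the recipe (server forwards tokens as SSE events, client appends them) can be sketched as below. This is a minimal illustration, not a full SSE implementation: `SseParser` and `formatTokenEvent` are hypothetical names, the `{ token }` JSON payload shape is an assumption, and the parser assumes `\n` line endings and only handles `data:` fields and `:` comment lines.

```typescript
// Server side (sketch): frame one token as an SSE event. JSON-encoding the
// payload keeps newlines inside a token from breaking the event framing.
function formatTokenEvent(token: string): string {
  return `data: ${JSON.stringify({ token })}\n\n`;
}

// Client side (sketch): incremental parser. Network chunks can split an
// event anywhere, so incomplete text stays buffered until "\n\n" arrives.
class SseParser {
  private buffer = "";

  // Feed a raw chunk; returns the data payload of every completed event.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const payloads: string[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const rawEvent = this.buffer.slice(0, boundary);
      this.buffer = this.buffer.slice(boundary + 2);
      const dataLines: string[] = [];
      for (const line of rawEvent.split("\n")) {
        if (line.startsWith(":")) continue; // comment or keep-alive ping
        if (line.startsWith("data:")) {
          const value = line.slice(5);
          // The spec allows one optional space after the colon.
          dataLines.push(value.startsWith(" ") ? value.slice(1) : value);
        }
      }
      if (dataLines.length > 0) payloads.push(dataLines.join("\n"));
      boundary = this.buffer.indexOf("\n\n");
    }
    return payloads;
  }
}

// Usage: the first event is deliberately split across two chunks.
const parser = new SseParser();
let received = "";
const chunks = ['data: {"tok', 'en":"Hel"}\n\n' + formatTokenEvent("lo")];
for (const chunk of chunks) {
  for (const payload of parser.push(chunk)) {
    received += JSON.parse(payload).token; // append to response container
  }
}
```

In a browser, the built-in `EventSource` does this parsing for you. A hand-rolled parser like this is mainly useful when reading the stream through `fetch`, for example when the request must be a POST.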
States: how things change
| State | Description | Transitions |
|---|---|---|
| Awaiting message | Idle | |
| Streaming | Tokens flowing from LLM to client | |
| Complete | Full response received, persisted | |
| Partial saved | Stream interrupted, partial response visible with retry option | |
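The four states can be expressed as a small transition table. The state names come from the table above. The event names and the transitions themselves are assumptions inferred from recipe steps 1, 9, and 10, so treat this as one plausible reading rather than the pattern's definitive state machine.

```typescript
type StreamState = "awaiting_message" | "streaming" | "complete" | "partial_saved";
// Hypothetical event names, chosen for illustration.
type StreamEvent = "submit" | "stream_done" | "stream_interrupted" | "retry";

// Assumed transitions: anything not listed leaves the state unchanged.
const transitions: Record<StreamState, Partial<Record<StreamEvent, StreamState>>> = {
  awaiting_message: { submit: "streaming" },
  streaming: { stream_done: "complete", stream_interrupted: "partial_saved" },
  complete: { submit: "streaming" },
  partial_saved: { retry: "streaming", submit: "streaming" },
};

function nextState(state: StreamState, event: StreamEvent): StreamState {
  const target = transitions[state][event];
  return target === undefined ? state : target;
}
```

Keeping the transitions in data rather than scattered `if` statements makes it easier to spot impossible moves, such as `complete` jumping to `partial_saved`.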
Easy-to-miss situations
The kinds of edge cases that break demos.
What if the user closes the tab during a long response?
Medium: The LLM keeps generating tokens you're paying for.
Suggested handling: On SSE disconnect, the server detects the closed connection and aborts the upstream LLM call (for example with an AbortController). This saves cost and frees a rate-limit slot.
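A minimal sketch of the abort wiring, under stated assumptions: `fakeLlmStream` is a hypothetical stand-in for an LLM SDK stream that honors an `AbortSignal`, and `send` is a hypothetical callback that returns `false` once the client has disconnected. In a real server you would more likely call `controller.abort()` from the request's close handler and pass `controller.signal` to the SDK or `fetch` call.

```typescript
// Hypothetical upstream: yields tokens until the signal is aborted.
async function* fakeLlmStream(
  tokens: string[],
  signal: AbortSignal,
): AsyncGenerator<string> {
  for (const token of tokens) {
    if (signal.aborted) return; // upstream stops generating
    yield token;
  }
}

// Forwards tokens to the client and aborts upstream once the client is gone.
async function relayTokens(
  tokens: string[],
  send: (token: string) => boolean, // false means the client disconnected
): Promise<{ sent: string[]; aborted: boolean }> {
  const controller = new AbortController();
  const sent: string[] = [];
  for await (const token of fakeLlmStream(tokens, controller.signal)) {
    if (!send(token)) {
      controller.abort(); // stop paying for tokens nobody will read
      break;
    }
    sent.push(token);
  }
  return { sent, aborted: controller.signal.aborted };
}
```

The same `AbortController` can back the Stop button listed under EXTENSION_POINTS, since both cases reduce to cancelling the upstream call.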
What if Markdown/code blocks render incorrectly while streaming?
Medium: A half-rendered triple-backtick block looks broken.
Suggested handling: Use a streaming-aware Markdown renderer (e.g., react-markdown with a custom plugin). Show raw text until the block closes, then re-render. Alternatively, hide code blocks until they are complete.
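One way to approximate "show raw text until the block closes" is to split the streamed text at the last unclosed fence before handing it to the renderer. The helper below is a hypothetical sketch, not part of react-markdown. It only recognizes triple-backtick fences at the start of a line and ignores tilde fences and fences longer than three backticks.

```typescript
// Matches a line that opens or closes a triple-backtick fence.
const FENCE_LINE = /^\s*`{3}/;

// Splits streamed Markdown into a part that is safe to render as Markdown
// and a trailing part that sits inside a still-open fence (show it raw).
function splitAtOpenFence(text: string): { stable: string; pending: string } {
  const lines = text.split("\n");
  let openFenceLine = -1; // index of the currently open fence, or -1
  lines.forEach((line, i) => {
    if (FENCE_LINE.test(line)) {
      openFenceLine = openFenceLine === -1 ? i : -1; // toggle open/closed
    }
  });
  if (openFenceLine === -1) return { stable: text, pending: "" };
  return {
    stable: lines.slice(0, openFenceLine).join("\n"),
    pending: lines.slice(openFenceLine).join("\n"),
  };
}
```

Call it on the accumulated text after each token: render `stable` through the Markdown renderer and show `pending` as plain preformatted text until it becomes empty.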
What if the LLM gets stuck mid-response (no tokens for 30s)?
Medium: User stares at frozen text, doesn't know if it'll continue.
Suggested handling: After 5s no token, show "AI is taking longer than usual..." After 30s, offer "Stop / Retry" buttons. After 60s, auto-fail with retry CTA.
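The escalation above reduces to a pure function of the time since the last token. The thresholds are the ones from the text. The status names are hypothetical labels for the three UI responses.

```typescript
type StallStatus = "ok" | "slow_notice" | "offer_stop_retry" | "failed";

// Maps milliseconds since the last token to the UI state to show.
function stallStatus(msSinceLastToken: number): StallStatus {
  if (msSinceLastToken >= 60000) return "failed"; // auto-fail with retry CTA
  if (msSinceLastToken >= 30000) return "offer_stop_retry"; // Stop / Retry buttons
  if (msSinceLastToken >= 5000) return "slow_notice"; // "taking longer than usual"
  return "ok";
}
```

A client could evaluate this on a one-second timer and reset the clock whenever a token arrives, which keeps the timing logic separate from rendering.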
What if streaming is enabled but the response is super short (1 word)?
Low: Streaming overhead (SSE setup) outweighs the actual benefit, so it feels janky.
Suggested handling: Stream anyway for consistency (the UI doesn't know the length in advance). Make the typewriter cursor subtle so short responses don't look weird. Don't add artificial delay.
What if the user is behind a corporate proxy that buffers SSE?
Low: Tokens arrive in big chunks instead of streaming, defeating the purpose.
Suggested handling: Send a keep-alive ping every 1s in the stream. Set the response headers Content-Type: text/event-stream and X-Accel-Buffering: no (the latter is read by nginx; other proxies may need their own setting). Test in corporate environments before launch.
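A sketch of the headers and keep-alive ping, assuming a server that exposes some `write` function for the response body (hypothetical here). `Cache-Control: no-cache` is not mentioned in the text above. It is a common companion header for SSE, added here as an assumption.

```typescript
const sseHeaders: Record<string, string> = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache", // assumption: commonly sent alongside SSE
  "X-Accel-Buffering": "no", // read by nginx; other proxies may ignore it
};

// SSE comment line: clients ignore any line that starts with ":".
const KEEP_ALIVE_FRAME = ": ping\n\n";

// Writes a ping on a fixed interval; returns a function that stops it.
function startKeepAlive(
  write: (frame: string) => void,
  intervalMs: number = 1000,
): () => void {
  const timer = setInterval(() => write(KEEP_ALIVE_FRAME), intervalMs);
  return () => clearInterval(timer);
}
```

Call the returned stop function when the stream ends or the client disconnects, otherwise the timer keeps writing to a closed response.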