
AI

Streaming AI Response

Shows the AI response token by token as it is generated instead of waiting for the full reply, which dramatically improves perceived speed.


When to use this

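Use this for any user-facing AI response that takes longer than about 1 second to generate. Streaming can make a 10-second response feel closer to 2 seconds, because the first tokens appear almost immediately. Skip it for very short responses or batch processing.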

ai · streaming · sse · websocket · ux · perceived-latency
✨ Built using these library patterns:
streaming-response


    Flow diagram

    Step-by-step recipe

    Copy this and paste into Cursor, Claude Code, or v0.

    PATTERN: Streaming AI Response
    INPUT: user_message, llm_request_params
    OUTPUT: streamed_tokens (incremental), final_complete_response
    
    STEPS:
      1. User submits message
      2. Show "AI is responding..." indicator + empty response container
      3. Open SSE (Server-Sent Events) connection from server to client
      4. Server calls LLM with stream: true
      5. LLM yields tokens one at a time (or small chunks)
      6. Server forwards each token to client via SSE event
      7. Client appends each token to response container in real-time
      8. Show subtle "typing" cursor at end of streaming text
      9. IF stream completes successfully → close SSE, finalize message in DB
      10. IF stream interrupted → save partial, mark "incomplete", offer retry
    
    ERROR_HANDLING:
      - Connection drops mid-stream → save partial, show "Connection lost - [Continue]" button
      - LLM API error mid-stream → flush what we have, append "[error: continued differently]"
      - User navigates away → stop billing for unused tokens (abort the upstream call)
      - Slow tokens (LLM lagging) → show "thinking..." state if no token in 5s
    
    EXTENSION_POINTS:
      - Tool calls embedded in stream (composable_with: ["tool-calling"])
      - Citation links rendered as they appear (composable_with: ["rag-retrieval"])
      - Stop button to abort (saves cost on long unwanted responses)
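
Steps 3 to 7 of the recipe depend on the SSE wire format, so a small sketch of the framing may help. It assumes one token per SSE event and a final event named `done`. The names `encodeSseEvent`, `SseParser`, and `done` are illustrative and not part of the pattern. The parser is simplified: it only handles `\n` line endings, whereas the SSE format also allows `\r\n` and `\r`.

```typescript
// Server side: frame one token (or chunk) as an SSE event.
// Data containing newlines must be split across several "data:" lines.
function encodeSseEvent(data: string, event?: string): string {
  const lines: string[] = [];
  if (event) lines.push(`event: ${event}`);
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  return lines.join("\n") + "\n\n"; // a blank line terminates the event
}

interface SseEvent {
  event: string; // "message" when no explicit event name was sent
  data: string;
}

// Client side: network chunks can split an event anywhere, so buffer
// until a full event (terminated by a blank line) has arrived.
class SseParser {
  private buffer = "";

  // Feed a raw chunk; returns every event completed by this chunk.
  feed(chunk: string): SseEvent[] {
    this.buffer += chunk;
    const events: SseEvent[] = [];
    let idx: number;
    while ((idx = this.buffer.indexOf("\n\n")) !== -1) {
      const raw = this.buffer.slice(0, idx);
      this.buffer = this.buffer.slice(idx + 2);
      let event = "message";
      const data: string[] = [];
      for (const line of raw.split("\n")) {
        if (line.startsWith(":")) continue; // comment line, e.g. a keep-alive ping
        if (line.startsWith("event:")) {
          event = line.slice(6).trim();
        } else if (line.startsWith("data:")) {
          // Strip only ONE leading space so tokens like " world" keep theirs.
          data.push(line.slice(5).replace(/^ /, ""));
        }
      }
      if (data.length > 0) events.push({ event, data: data.join("\n") });
    }
    return events;
  }
}
```

On the client, each `message` event's `data` would be appended to the response container, and the `done` event would remove the typing cursor. In a browser, the built-in `EventSource` does this parsing for you. A hand-written parser like this is mainly useful when reading the stream from a `fetch` response body.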
    

    States: how things change

    | State | Description | Transitions |
    |---|---|---|
    | Awaiting message | Idle | Submitted → Streaming |
    | Streaming | Tokens flowing from LLM to client | Token received → Streaming; Stream ended → Complete; Connection lost → Partial saved |
    | Complete | Full response received, persisted | User sends another → Streaming |
    | Partial saved | Stream interrupted, partial response visible with retry option | User clicks Continue → Streaming; User dismisses → Awaiting message |
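
The states and transitions listed above can be encoded as a small lookup table. This is a minimal sketch. The identifier names (`StreamState`, `StreamEvent`, `nextState`) are illustrative. Ignoring an event that is not listed for the current state is an assumption, not something the pattern specifies.

```typescript
type StreamState = "awaiting_message" | "streaming" | "complete" | "partial_saved";

type StreamEvent =
  | "submitted"
  | "token_received"
  | "stream_ended"
  | "connection_lost"
  | "user_sends_another"
  | "user_clicks_continue"
  | "user_dismisses";

// One entry per state; each maps the events that state reacts to
// onto the state it moves to.
const transitions: Record<StreamState, Partial<Record<StreamEvent, StreamState>>> = {
  awaiting_message: { submitted: "streaming" },
  streaming: {
    token_received: "streaming",
    stream_ended: "complete",
    connection_lost: "partial_saved",
  },
  complete: { user_sends_another: "streaming" },
  partial_saved: {
    user_clicks_continue: "streaming",
    user_dismisses: "awaiting_message",
  },
};

// Assumption: events not listed for the current state leave it unchanged.
function nextState(state: StreamState, event: StreamEvent): StreamState {
  return transitions[state][event] ?? state;
}
```

Keeping the transitions in one table makes it harder for the UI to reach an unlisted combination, such as showing a retry button while tokens are still arriving.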

    Easy-to-miss situations

    The kinds of edge cases that break demos.

    • What if the user closes the tab during a long response?

      medium

      LLM keeps generating tokens you're paying for.

      Suggested handling: On SSE disconnect, the server detects the closed connection and aborts the upstream LLM call (use AbortController). This saves cost and frees the rate-limit slot.
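
A rough sketch of the abort wiring, under stated assumptions. `fakeLlmStream` is a hypothetical stand-in for a real streaming LLM client, and `ClientConnection` is a hypothetical minimal view of an SSE connection. Only `AbortController` and `AbortSignal` are real platform APIs here.

```typescript
// Hypothetical upstream: yields tokens until the signal is aborted.
async function* fakeLlmStream(
  tokens: string[],
  signal: AbortSignal
): AsyncGenerator<string> {
  for (const token of tokens) {
    if (signal.aborted) return; // upstream stops generating here
    yield token;
  }
}

// Hypothetical minimal view of the SSE connection to the browser.
interface ClientConnection {
  send(token: string): void;
  onClose(handler: () => void): void;
}

// Forward tokens to the client and abort the upstream call
// as soon as the client connection closes.
async function relayStream(
  tokens: string[],
  client: ClientConnection
): Promise<{ forwarded: number; aborted: boolean }> {
  const controller = new AbortController();
  client.onClose(() => controller.abort());

  let forwarded = 0;
  for await (const token of fakeLlmStream(tokens, controller.signal)) {
    client.send(token);
    forwarded++;
  }
  return { forwarded, aborted: controller.signal.aborted };
}
```

With a real HTTP-based client, the same `controller.signal` would typically be passed to the request, for example through the `signal` option of `fetch`. Closing the connection then cancels the request itself instead of only stopping the loop.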

    • What if Markdown/code blocks render incorrectly while streaming?

      medium

      Half-rendered triple-backtick looks broken.

      Suggested handling: Use a streaming-aware Markdown renderer (e.g., react-markdown with a custom plugin). Show raw text until the block closes, then re-render. Alternatively, hide code blocks until they are complete.
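
One way to approximate "show raw text until the block closes" is to split the text received so far at the last unclosed code fence. This is a simplified sketch, not a full Markdown parser. `splitAtOpenFence` is an illustrative name. It only recognizes backtick fences at the start of a line and ignores tilde fences and nested fences.

```typescript
// Built with repeat() so this example contains no literal fence marker.
const FENCE = "`".repeat(3);

// Split streamed Markdown into a part that is safe to render now and a
// trailing part (an unclosed code block) to show as raw text or hide.
function splitAtOpenFence(markdown: string): { renderable: string; pending: string } {
  const lines = markdown.split("\n");
  let openFenceLine = -1; // index of the currently unclosed fence, or -1

  for (let i = 0; i < lines.length; i++) {
    if (lines[i].trim().startsWith(FENCE)) {
      // A fence line either opens a block or closes the open one.
      openFenceLine = openFenceLine === -1 ? i : -1;
    }
  }

  if (openFenceLine === -1) return { renderable: markdown, pending: "" };
  return {
    renderable: lines.slice(0, openFenceLine).join("\n"),
    pending: lines.slice(openFenceLine).join("\n"),
  };
}
```

The client would call this on the accumulated text after each token, pass `renderable` to the Markdown renderer, and show `pending` as plain preformatted text until the closing fence arrives.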

    • What if the LLM gets stuck mid-response (no tokens for 30s)?

      medium

      User stares at frozen text, doesn't know if it'll continue.

      Suggested handling: After 5s without a token, show "AI is taking longer than usual..." After 30s, offer "Stop / Retry" buttons. After 60s, auto-fail with a retry CTA.
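
The 5s / 30s / 60s escalation can be written as a pure function of the time since the last token, which keeps it easy to test. The state names and the choice to treat each threshold as inclusive are assumptions made for this sketch.

```typescript
type StallState = "streaming" | "slow_notice" | "offer_stop_retry" | "failed";

// Thresholds taken from the suggested handling above, in milliseconds.
const SLOW_NOTICE_MS = 5000;
const OFFER_STOP_RETRY_MS = 30000;
const AUTO_FAIL_MS = 60000;

// Map the time since the last received token to the UI state to show.
// Assumption: a threshold is considered reached at exactly that many ms.
function stallState(msSinceLastToken: number): StallState {
  if (msSinceLastToken >= AUTO_FAIL_MS) return "failed";
  if (msSinceLastToken >= OFFER_STOP_RETRY_MS) return "offer_stop_retry";
  if (msSinceLastToken >= SLOW_NOTICE_MS) return "slow_notice";
  return "streaming";
}
```

A client could record a timestamp whenever a token arrives and call `stallState(Date.now() - lastTokenAt)` on a one-second timer, resetting to `"streaming"` as soon as the next token shows up.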

    • What if streaming is enabled but the response is super short (1 word)?

      low

      Streaming overhead (SSE setup) outweighs the actual benefit and can feel janky.

      Suggested handling: Stream anyway for consistency (UI doesn't know length in advance). Make the typewriter cursor subtle so short responses don't look weird. Don't add artificial delay.

    • What if the user is behind a corporate proxy that buffers SSE?

      low

      Tokens arrive in big chunks instead of streaming, defeating the purpose.

      Suggested handling: Send a keep-alive ping every 1s in the stream. Set the response headers Content-Type: text/event-stream and X-Accel-Buffering: no (the latter tells nginx not to buffer the response). Test in corporate environments before launch.
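
A small sketch of the headers and the keep-alive frame. `Cache-Control: no-cache` is an addition that is commonly sent with SSE responses but is not named in the pattern. `SSE_HEADERS`, `keepAliveFrame`, and `pingDue` are illustrative names. The ping uses an SSE comment line (a line starting with `:`), which clients ignore.

```typescript
// Response headers for an SSE endpoint.
const SSE_HEADERS: Record<string, string> = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache", // assumption: not in the pattern, commonly used with SSE
  "X-Accel-Buffering": "no", // asks nginx not to buffer this response
};

// An SSE comment frame: ignored by the client, but it keeps bytes flowing.
function keepAliveFrame(): string {
  return ": ping\n\n";
}

// True when nothing has been written for at least intervalMs,
// so real tokens reset the timer and pings fill only the quiet gaps.
function pingDue(lastWriteAtMs: number, nowMs: number, intervalMs: number = 1000): boolean {
  return nowMs - lastWriteAtMs >= intervalMs;
}
```

With Node's built-in `http` module, the headers could be sent with `res.writeHead(200, SSE_HEADERS)`, and a timer could write `keepAliveFrame()` whenever `pingDue` returns true. Some proxies may still buffer regardless of these headers, which is why testing in the target environment matters.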

    Composes well with

    Combine these patterns when you need a richer flow.
