
AI

RAG Document Retrieval

Adds your private documents to the AI's knowledge by finding relevant chunks via embedding similarity and injecting them into the prompt before the AI answers.


When to use this

When the AI needs to answer based on your specific data (company docs, support articles, product catalog) that wasn't in its training data.

ai, rag, retrieval, embedding, vector, knowledge
Built using these library patterns:
rag-retrieval

    Flow diagram

    Step-by-step recipe

    Copy this and paste into Cursor, Claude Code, or v0.

    PATTERN: RAG Document Retrieval
    INPUT: user_query, document_corpus
    OUTPUT: ai_response (grounded in retrieved docs)
    
    SETUP_STEPS (one-time per document):
      1. Split document into chunks (200-500 tokens each, overlap 50)
      2. Generate embedding for each chunk via embedding model
      3. Store chunks + embeddings in vector DB (pgvector, Pinecone)
      4. Repeat when documents change
    
    QUERY_STEPS (per user message):
      1. User sends query
      2. Generate embedding for the query
      3. Search vector DB for top-K similar chunks (cosine similarity)
      4. IF top similarity < threshold (e.g., 0.7) → no relevant docs found
      5. Construct prompt: system + retrieved chunks (cited) + user query
      6. Send to LLM with instruction "Answer based on the provided context. If not in context, say so."
      7. LLM generates response with citations to chunks
      8. Render response with clickable citations linking to source docs
    
    ERROR_HANDLING:
      - No chunks above threshold → tell user "I don't have info on that, try rephrasing or contact support"
      - Chunk too long for context window → trim or skip
      - Embedding API down → cache last queries, fall back to keyword search
      - Citations fabricated by LLM → validate cited chunk_id exists, strip fakes
    
    EXTENSION_POINTS:
      - Hybrid search (vector + keyword) for better recall
      - Reranking with cross-encoder before passing to LLM
      - Multi-query expansion (LLM generates 3 query variants)
      - Tool calling on retrieved data (composable_with: ["tool-calling"])
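
The chunking rule in SETUP_STEPS (windows of 200-500 tokens that share 50 tokens with their neighbor) can be sketched as a sliding window. This is a minimal illustration, not the pattern's reference implementation: the function name `chunk_tokens` is hypothetical, and a plain list of strings stands in for the output of a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: numbered placeholder strings stand in for real model tokens.
tokens = [f"w{i}" for i in range(700)]
chunks = chunk_tokens(tokens, chunk_size=300, overlap=50)
for index, chunk in enumerate(chunks):
    print(index, len(chunk), chunk[0], chunk[-1])
```

Each chunk would then be embedded and stored with its own chunk_id, so that citations can later point back to it.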
    

    States: how things change

    State             Description                                     Transitions
    Awaiting query    Idle, no active retrieval                       • Query received → Embedding query
    Embedding query   Converting query text to vector                 • Embedded → Searching chunks
    Searching chunks  Vector DB similarity search                     • Relevant found → LLM grounding
                                                                      • Nothing relevant → No-info response
    LLM grounding     LLM generating response with retrieved context  • Response ready → Awaiting query
    No-info response  Telling user no relevant info                   • User rephrases → Embedding query
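
The states and transitions listed above can be encoded as a lookup table, which makes illegal transitions fail loudly. This is a sketch under assumptions: the event names (`query_received`, `embedded`, and so on) are hypothetical identifiers derived from the transition labels, and `next_state` is an illustrative helper.

```python
from enum import Enum


class State(Enum):
    AWAITING_QUERY = "Awaiting query"
    EMBEDDING_QUERY = "Embedding query"
    SEARCHING_CHUNKS = "Searching chunks"
    LLM_GROUNDING = "LLM grounding"
    NO_INFO_RESPONSE = "No-info response"


# (current state, event) -> next state. Event names are assumptions.
TRANSITIONS = {
    (State.AWAITING_QUERY, "query_received"): State.EMBEDDING_QUERY,
    (State.EMBEDDING_QUERY, "embedded"): State.SEARCHING_CHUNKS,
    (State.SEARCHING_CHUNKS, "relevant_found"): State.LLM_GROUNDING,
    (State.SEARCHING_CHUNKS, "nothing_relevant"): State.NO_INFO_RESPONSE,
    (State.LLM_GROUNDING, "response_ready"): State.AWAITING_QUERY,
    (State.NO_INFO_RESPONSE, "user_rephrases"): State.EMBEDDING_QUERY,
}


def next_state(state, event):
    """Return the state that follows `event`, or raise if the move is not listed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {state.value!r}"
        ) from None


state = State.AWAITING_QUERY
for event in ["query_received", "embedded", "nothing_relevant", "user_rephrases"]:
    state = next_state(state, event)
    print(event, "->", state.value)
```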

    Easy-to-miss situations

    The kinds of edge cases that break demos.

    • What if the LLM cites chunks that don't exist (hallucination)?

      high

      User trusts citations, follows broken links, loses faith.

      Suggested handling: Validate every cited chunk_id exists in your DB before rendering. Strip fake citations and add a note. Track hallucination rate per model.
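
One way to validate citations is to scan the answer for citation markers and drop any whose chunk_id is unknown. The sketch below assumes the LLM was instructed to cite as `[chunk:<id>]`; that marker format, the function name `strip_fake_citations`, and the in-memory set standing in for a database lookup are all assumptions made for illustration.

```python
import re

# Assumed citation format: "[chunk:<id>]", optionally preceded by whitespace.
CITATION_RE = re.compile(r"\s*\[chunk:([A-Za-z0-9_-]+)\]")


def strip_fake_citations(answer, known_chunk_ids):
    """Remove citation markers whose chunk_id is not in known_chunk_ids.

    Returns the cleaned answer and the list of fabricated ids that were removed.
    """
    fabricated = []

    def keep_or_drop(match):
        chunk_id = match.group(1)
        if chunk_id in known_chunk_ids:
            return match.group(0)  # valid citation, keep it untouched
        fabricated.append(chunk_id)
        return ""  # unknown id, remove the marker

    cleaned = CITATION_RE.sub(keep_or_drop, answer)
    return cleaned, fabricated


known_ids = {"c1", "c2"}  # stands in for a lookup against the chunk store
answer = "Refunds take 5 days [chunk:c1]. Shipping is free [chunk:c9]."
cleaned, fabricated = strip_fake_citations(answer, known_ids)
print(cleaned)
print(fabricated)
```

The length of `fabricated` per response gives a simple signal for tracking the hallucination rate per model.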

    • What if a relevant document was updated, but the embedding wasn't refreshed?

      high

      AI returns outdated info confidently.

      Suggested handling: Auto re-embed on document save. Show "Last updated: X" on each cited chunk. Have a daily integrity check that flags stale embeddings.
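
A daily integrity check needs some way to tell that a chunk's text changed after it was embedded. One possible approach, which is an assumption here rather than something the pattern prescribes, is to store a content hash next to each embedding and compare it with the hash of the current text. The names `content_hash` and `find_stale_chunks` are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_chunks(current_chunks, stored_hashes):
    """Return ids of chunks whose text no longer matches the hash saved at embed time.

    current_chunks: {chunk_id: current text}
    stored_hashes:  {chunk_id: hash recorded when the embedding was generated}
    A chunk with no stored hash is also reported, since it was never embedded.
    """
    stale = []
    for chunk_id, text in current_chunks.items():
        if stored_hashes.get(chunk_id) != content_hash(text):
            stale.append(chunk_id)
    return stale


stored = {
    "a": content_hash("Refunds take 5 days."),
    "b": content_hash("Shipping costs $5."),
}
current = {
    "a": "Refunds take 5 days.",
    "b": "Shipping is free.",      # edited after embedding
    "c": "New careers page.",      # added but never embedded
}
print(find_stale_chunks(current, stored))
```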

    • What if the user asks about something completely unrelated to your docs?

      medium

      AI either makes things up OR returns weak chunks with low relevance.

      Suggested handling: Set a strict similarity threshold (0.7+). Below it, refuse to answer with "I don't have info on that. Try rephrasing or check our help center."
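
The threshold check can be sketched as a top-K cosine search that returns nothing when no chunk scores high enough. This is a toy illustration: the two-dimensional vectors stand in for real embeddings, `retrieve` and `NO_INFO_MESSAGE` are hypothetical names, and a suitable threshold depends on the embedding model, so 0.7 should be treated as a starting point to tune.

```python
import math

NO_INFO_MESSAGE = "I don't have info on that. Try rephrasing or check our help center."


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated instead of dividing by zero
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunks, top_k=3, threshold=0.7):
    """chunks: list of (chunk_id, embedding).

    Returns up to top_k (chunk_id, score) pairs scoring at least `threshold`,
    best first. An empty list means nothing relevant was found.
    """
    scored = [(cid, cosine_similarity(query_vec, emb)) for cid, emb in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(cid, score) for cid, score in scored[:top_k] if score >= threshold]


# Toy 2-D vectors; real embeddings have hundreds or thousands of dimensions.
chunks = [
    ("refunds", [0.9, 0.1]),
    ("shipping", [0.2, 0.9]),
    ("careers", [0.0, 1.0]),
]

hits = retrieve([1.0, 0.0], chunks)
print(hits if hits else NO_INFO_MESSAGE)

unrelated = retrieve([-1.0, 0.0], chunks)
print(unrelated if unrelated else NO_INFO_MESSAGE)
```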

    • What if a sensitive document is retrieved for the wrong user?

      high

      User A's RAG query returns User B's data, which is a privacy breach.

      Suggested handling: Filter retrieval at the DB level by user_id/tenant_id. Don't trust prompt-based isolation. Test multi-tenant scenarios in QA.
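
The key point is that the tenant filter runs before any similarity scoring, so another tenant's chunks can never reach the prompt. The sketch below uses an in-memory list as a stand-in for the database-level filter (for example, a WHERE clause on tenant_id). The `Chunk` dataclass and `search_for_tenant` are hypothetical names, and the embeddings are assumed to be unit-normalized so that the dot product equals cosine similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    embedding: tuple
    text: str


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_for_tenant(query_vec, chunks, tenant_id, top_k=3):
    """Rank only the chunks owned by tenant_id; other tenants' rows are never scored."""
    allowed = [c for c in chunks if c.tenant_id == tenant_id]
    allowed.sort(key=lambda c: dot(query_vec, c.embedding), reverse=True)
    return allowed[:top_k]


corpus = [
    Chunk("a1", "tenant_a", (1.0, 0.0), "Tenant A refund policy"),
    Chunk("b1", "tenant_b", (1.0, 0.0), "Tenant B salary data"),
    Chunk("a2", "tenant_a", (0.0, 1.0), "Tenant A shipping policy"),
]

results = search_for_tenant((1.0, 0.0), corpus, tenant_id="tenant_a", top_k=2)
for chunk in results:
    print(chunk.chunk_id, chunk.text)
```

A multi-tenant QA test can follow the same shape: seed two tenants with near-identical documents and assert that results for one never include chunk ids from the other.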

    • What if your document corpus has 1M+ chunks?

      medium

      Without proper indexing, search latency degrades and recall can drop.

      Suggested handling: Use an approximate nearest neighbor (ANN) index, such as HNSW in pgvector or IVF in other stores. Pre-filter by metadata (user, language, date) before the vector search to reduce the search space.

    Composes well with

    Combine these patterns when you need a richer flow.
