RAG Document Retrieval

Grounds the AI's answers in your private documents: it finds the most relevant chunks by embedding similarity and injects them into the prompt before the AI answers.

When to use this

When the AI needs to answer from your own data (company docs, support articles, product catalog) that wasn't in its training data.

ai rag retrieval embedding vector knowledge

✨ Built using these library patterns:

rag-retrieval

Step-by-step recipe

Copy this and paste into Cursor, Claude Code, or v0.

PATTERN: RAG Document Retrieval
INPUT: user_query, document_corpus
OUTPUT: ai_response (grounded in retrieved docs)

SETUP_STEPS (one-time per document):
  1. Split document into chunks (200-500 tokens each, with a 50-token overlap)
  2. Generate embedding for each chunk via embedding model
  3. Store chunks + embeddings in vector DB (pgvector, Pinecone)
  4. Re-run steps 1-3 whenever a document changes

QUERY_STEPS (per user message):
  1. User sends query
  2. Generate embedding for the query
  3. Search vector DB for top-K similar chunks (cosine similarity)
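  4. IF top similarity < threshold (e.g., 0.7; tune per embedding model) → treat as no relevant docs, skip the LLM call (see ERROR_HANDLING)
  5. Construct prompt: system instructions + retrieved chunks (each labeled with its chunk_id) + user query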
  6. Send to LLM with instruction "Answer based on the provided context. If not in context, say so."
  7. LLM generates response with citations to chunks
  8. Render response with clickable citations linking to source docs

ERROR_HANDLING:
  - No chunks above threshold → tell user "I don't have info on that, try rephrasing or contact support"
  - Chunk too long for context window → trim or skip
  - Embedding API down → reuse cached embeddings for recent queries, otherwise fall back to keyword search
  - Citations fabricated by LLM → validate cited chunk_id exists, strip fakes

EXTENSION_POINTS:
  - Hybrid search (vector + keyword) for better recall
  - Reranking with cross-encoder before passing to LLM
  - Multi-query expansion (LLM generates 3 query variants)
  - Tool calling on retrieved data (composable_with: ["tool-calling"])
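
The recipe above can be sketched end to end in plain Python. This is a minimal illustration, not a production implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the `index` dict stands in for a vector DB, and the function names and the `[chunk_id]` label format are assumptions made for this example.

```python
import math
import re
from collections import Counter


def chunk_tokens(tokens, size=200, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, index, k=3, threshold=0.7):
    """Return up to k (score, chunk_id, text) hits scoring at or above threshold.

    An empty list means "no relevant docs found" (QUERY_STEPS step 4).
    """
    q = toy_embed(query)
    scored = sorted(
        ((cosine(q, emb), cid, text) for cid, (text, emb) in index.items()),
        reverse=True,
    )
    return [hit for hit in scored[:k] if hit[0] >= threshold]


def build_prompt(query, hits):
    """Assemble instruction + labeled chunks + user query (QUERY_STEPS step 5)."""
    context = "\n".join(f"[{cid}] {text}" for _, cid, text in hits)
    return (
        "Answer based on the provided context. If not in context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

With a real embedding model the 0.7 threshold would need tuning, since similarity scores are not comparable across models. When `retrieve` returns an empty list, the caller would skip the LLM call and show the no-info message instead.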

States — how things change

State	Description	Transitions
Awaiting query	Idle, no active retrieval	Query received→Embedding query
Embedding query	Converting query text to vector	Embedded→Searching chunks
Searching chunks	Vector DB similarity search	Relevant found→LLM grounding; Nothing relevant→No-info response
LLM grounding	LLM generating response with retrieved context	Response ready→Awaiting query
No-info response	Telling user no relevant info	User rephrases→Embedding query
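
The table above maps directly onto a small transition lookup. The sketch below is one possible encoding; the event names (`query_received`, `embedded`, and so on) are identifiers derived from the transition labels for this example, not names defined by the pattern.

```python
from enum import Enum


class State(Enum):
    AWAITING_QUERY = "Awaiting query"
    EMBEDDING_QUERY = "Embedding query"
    SEARCHING_CHUNKS = "Searching chunks"
    LLM_GROUNDING = "LLM grounding"
    NO_INFO_RESPONSE = "No-info response"


# (current state, event) -> next state, one entry per transition in the table.
TRANSITIONS = {
    (State.AWAITING_QUERY, "query_received"): State.EMBEDDING_QUERY,
    (State.EMBEDDING_QUERY, "embedded"): State.SEARCHING_CHUNKS,
    (State.SEARCHING_CHUNKS, "relevant_found"): State.LLM_GROUNDING,
    (State.SEARCHING_CHUNKS, "nothing_relevant"): State.NO_INFO_RESPONSE,
    (State.LLM_GROUNDING, "response_ready"): State.AWAITING_QUERY,
    (State.NO_INFO_RESPONSE, "user_rephrases"): State.EMBEDDING_QUERY,
}


def next_state(state, event):
    """Return the next state, or raise if the event is not allowed here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state.value!r}"
        ) from None
```

Rejecting unknown transitions explicitly makes it easier to spot UI bugs such as a second query arriving while the first is still being embedded.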

Easy-to-miss situations

The kinds of edge cases that break demos.

What if the LLM cites chunks that don't exist (hallucination)?
Severity: high
Impact: Users trust the citations, follow broken links, and lose faith in the product.
Suggested handling: Validate that every cited chunk_id exists in your DB before rendering. Strip fake citations and add a note. Track the hallucination rate per model.
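
A minimal sketch of that validation step, assuming the LLM was instructed to cite chunks as `[chunk:<id>]`. That marker format and the function name are assumptions for illustration; adapt the regex to whatever citation syntax your prompt asks for.

```python
import re

# Assumed citation syntax: [chunk:<id>] with letters, digits, "_" or "-".
CITATION = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")


def strip_fake_citations(answer, known_chunk_ids):
    """Remove citations whose chunk_id is unknown; return (clean_text, removed_ids)."""
    removed = []

    def check(match):
        chunk_id = match.group(1)
        if chunk_id in known_chunk_ids:
            return match.group(0)  # keep valid citations untouched
        removed.append(chunk_id)
        return ""

    cleaned = CITATION.sub(check, answer)
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)        # collapse doubled spaces
    cleaned = re.sub(r"\s+([.,;:!?])", r"\1", cleaned)  # no space before punctuation
    return cleaned.strip(), removed
```

The `removed` list lets the caller append a note to the answer and log `len(removed)` per model to track the hallucination rate. In practice `known_chunk_ids` would be limited to the chunks actually retrieved for this query, which is stricter than checking the whole DB.
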
What if a relevant document was updated, but the embedding wasn't refreshed?
Severity: high
Impact: The AI confidently returns outdated information.
Suggested handling: Re-embed automatically on document save. Show "Last updated: X" on each cited chunk. Run a daily integrity check that flags stale embeddings.
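
One way to implement the integrity check is to store a hash of the chunk text alongside each embedding and compare it with the current text. The sketch below assumes that approach; the function names and the two dict shapes are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text, stored when the chunk is embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_chunks(current_texts, embedded_hashes):
    """Return chunk_ids whose current text no longer matches the embedded text.

    current_texts:   chunk_id -> text as it exists in the document store now
    embedded_hashes: chunk_id -> content_hash recorded at embedding time
    Chunks with no recorded hash (never embedded) are reported as stale too.
    """
    return sorted(
        cid
        for cid, text in current_texts.items()
        if embedded_hashes.get(cid) != content_hash(text)
    )
```

The returned ids can be queued for re-embedding by the daily job.
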
What if the user asks about something completely unrelated to your docs?
Severity: medium
Impact: The AI either makes things up or answers from weakly relevant chunks.
Suggested handling: Set a strict similarity threshold (0.7+). Below it, refuse to answer with "I don't have info on that, try rephrasing or check our help center."
What if a sensitive document is retrieved for the wrong user?
Severity: high
Impact: User A's query returns User B's data, which is a privacy breach.
Suggested handling: Filter retrieval at the DB level by user_id/tenant_id. Don't trust prompt-based isolation. Test multi-tenant scenarios in QA.
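
The key point is that the tenant filter runs before similarity ranking, so another tenant's chunks can never enter the candidate set. The in-memory sketch below illustrates the ordering; the `Chunk` type and `search_for_tenant` are hypothetical names, and in a real vector DB the filter would be part of the same query as the vector search.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    embedding: tuple


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_for_tenant(query_embedding, chunks, tenant_id, k=3):
    """Rank only the caller's own chunks; other tenants are excluded up front."""
    candidates = [c for c in chunks if c.tenant_id == tenant_id]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:k]
```

A useful QA test seeds two tenants with identical documents and asserts that a search for one tenant never returns a chunk_id belonging to the other.
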
What if your document corpus has 1M+ chunks?
Severity: medium
Impact: Exact search latency degrades, and a poorly tuned approximate index loses recall.
Suggested handling: Use an approximate nearest neighbor index (HNSW or IVFFlat in pgvector, similar options in other stores). Pre-filter by metadata (user, language, date) BEFORE vector search to reduce the search space.

Composes well with

Combine these patterns when you need a richer flow.

chat-loop tool-calling streaming-response
