RAG Document Retrieval
Adds your private documents to the AI's knowledge by finding relevant chunks via embedding similarity and injecting them into the prompt before the AI answers.
When to use this
When the AI needs to answer from your own data (company docs, support articles, product catalog) that was not in its training data.
Step-by-step recipe
Copy this and paste into Cursor, Claude Code, or v0.
PATTERN: RAG Document Retrieval
INPUT: user_query, document_corpus
OUTPUT: ai_response (grounded in retrieved docs)
SETUP_STEPS (one-time per document):
1. Split document into chunks (200-500 tokens each, with a 50-token overlap)
2. Generate embedding for each chunk via embedding model
3. Store chunks + embeddings in vector DB (pgvector, Pinecone)
4. Repeat steps 1-3 whenever a document changes
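The chunking step above can be sketched as a sliding window. This is a minimal illustration, not a definitive implementation: it assumes whitespace-separated words stand in for model tokens (a real pipeline would use the embedding model's tokenizer), and the names `chunk_tokens` and `chunk_text` are hypothetical.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text, chunk_size=300, overlap=50):
    # Assumption: words approximate tokens; swap in a real tokenizer if needed.
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, chunk_size, overlap)]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.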
QUERY_STEPS (per user message):
1. User sends query
2. Generate embedding for the query
3. Search vector DB for top-K similar chunks (cosine similarity)
4. IF top similarity < threshold (e.g., 0.7) → no relevant docs found
5. Construct prompt: system + retrieved chunks (cited) + user query
6. Send to LLM with instruction "Answer based on the provided context. If not in context, say so."
7. LLM generates response with citations to chunks
8. Render response with clickable citations linking to source docs
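Steps 3 to 5 above can be sketched as follows. This is an illustration under stated assumptions: an in-memory list of dicts stands in for the vector DB, each item has hypothetical `chunk_id`, `text`, and `embedding` keys, and the citation instruction added to the system text is an assumption rather than part of the recipe.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, top_k=3, threshold=0.7):
    """Return the top-K chunks, or [] when even the best match is below threshold."""
    scored = [(cosine_similarity(query_vec, item["embedding"]), item) for item in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:top_k]
    if not top or top[0][0] < threshold:
        return []  # step 4: no relevant docs found
    return [item for _, item in top]


def build_prompt(query, chunks):
    # Each chunk is prefixed with its id so the model can cite it.
    context = "\n".join(f"[{c['chunk_id']}] {c['text']}" for c in chunks)
    system = (
        "Answer based on the provided context. If not in context, say so. "
        "Cite chunk ids in square brackets."  # assumption: citation format
    )
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {query}"
```

An empty result from `retrieve` is the signal to skip the LLM call and show the no-info message instead.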
ERROR_HANDLING:
- No chunks above threshold → tell user "I don't have info on that, try rephrasing or contact support"
- Chunk too long for context window → trim or skip
- Embedding API down → cache last queries, fall back to keyword search
- Citations fabricated by LLM → validate cited chunk_id exists, strip fakes
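The keyword fallback for an unavailable embedding API might look like the sketch below. The `embed` and `vector_search` callables are hypothetical stand-ins for your own client code, and the keyword scorer is a deliberately naive term-overlap count, not a real full-text search.

```python
def keyword_search(query, chunks, top_k=3):
    """Rank chunks by how many query terms they share (naive fallback)."""
    terms = set(query.lower().split())
    scored = []
    for chunk in chunks:
        overlap = len(terms & set(chunk["text"].lower().split()))
        if overlap:
            scored.append((overlap, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def search_with_fallback(query, chunks, embed, vector_search, top_k=3):
    """Try vector search first; fall back to keywords if embedding fails."""
    try:
        query_vec = embed(query)
    except Exception:
        # Assumption: any failure here means the embedding API is unavailable.
        # Real code should catch the client's specific error types instead.
        return keyword_search(query, chunks, top_k), "keyword"
    return vector_search(query_vec, top_k), "vector"
```

Returning the search mode alongside the results lets the UI flag answers that came from the degraded path.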
EXTENSION_POINTS:
- Hybrid search (vector + keyword) for better recall
- Reranking with cross-encoder before passing to LLM
- Multi-query expansion (LLM generates 3 query variants)
- Tool calling on retrieved data (composable_with: ["tool-calling"])
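For the hybrid search extension, one common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The recipe does not name a fusion method, so this is a swapped-in technique shown only as a sketch; `k=60` is a conventional default, and the function name is hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
```

RRF only needs ranks, so it sidesteps the problem that cosine scores and keyword scores are on different scales.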
States: how things change
| State | Description | Transitions |
|---|---|---|
| Awaiting query | Idle, no active retrieval | |
| Embedding query | Converting query text to vector | |
| Searching chunks | Vector DB similarity search | |
| LLM grounding | LLM generating response with retrieved context | |
| No-info response | Telling user no relevant info | |
Easy-to-miss situations
The kinds of edge cases that break demos.
What if the LLM cites chunks that don't exist (hallucination)?
Severity: high. User trusts citations, follows broken links, loses faith.
Suggested handling: Validate every cited chunk_id exists in your DB before rendering. Strip fake citations and add a note. Track hallucination rate per model.
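The validation step could be sketched like this. It assumes citations appear in the answer as `[chunk_id]` in square brackets, which is a hypothetical format; adapt the pattern to whatever your prompt asks the model to emit.

```python
import re

# Assumption: citations look like [chunk_id], with ids made of letters,
# digits, underscores, or hyphens.
CITATION_RE = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def strip_fake_citations(answer, valid_ids):
    """Remove citations whose chunk_id is not in valid_ids.

    Returns the cleaned answer and the list of removed ids, which the
    caller can use to add a note or track a per-model hallucination rate.
    """
    removed = []

    def replace(match):
        chunk_id = match.group(1)
        if chunk_id in valid_ids:
            return match.group(0)
        removed.append(chunk_id)
        return ""

    cleaned = CITATION_RE.sub(replace, answer)
    cleaned = re.sub(r"\s+([.,;:])", r"\1", cleaned)  # tidy space left before punctuation
    cleaned = re.sub(r" {2,}", " ", cleaned).strip()
    return cleaned, removed
```

Pass the ids of the chunks actually sent in the prompt as `valid_ids`; that is stricter than checking the whole DB and also catches citations to real chunks the model never saw.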
What if a relevant document was updated, but the embedding wasn't refreshed?
Severity: high. AI returns outdated info confidently.
Suggested handling: Auto re-embed on document save. Show "Last updated: X" on each cited chunk. Have a daily integrity check that flags stale embeddings.
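One way to implement the integrity check is to store a content hash at embed time and compare it against the current document text. This is a sketch under assumptions: the two dicts are hypothetical stand-ins for your document store and embedding metadata.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(documents, embedded_hashes):
    """Return ids of documents whose embeddings are missing or out of date.

    documents: {doc_id: current_text}
    embedded_hashes: {doc_id: hash recorded when the doc was last embedded}
    """
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if embedded_hashes.get(doc_id) != content_hash(text)
    )
```

The same comparison can run on document save to decide whether re-embedding is needed at all, which avoids paying for embeddings when only metadata changed.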
What if the user asks about something completely unrelated to your docs?
Severity: medium. AI either makes things up OR returns weak chunks with low relevance.
Suggested handling: Set a strict similarity threshold (e.g., 0.7 or higher, tuned for your embedding model). Below it, refuse to answer with "I don't have info on that. Try rephrasing or check our help center."
What if a sensitive document is retrieved for the wrong user?
Severity: high. User A's RAG query returns User B's data, privacy breach.
Suggested handling: Filter retrieval at the DB level by user_id/tenant_id. Don't trust prompt-based isolation. Test multi-tenant scenarios in QA.
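The key point is that the tenant filter runs before similarity ranking, so another tenant's chunks are never candidates. The sketch below shows the idea with an in-memory list standing in for the vector DB; the `tenant_id` and `embedding` keys are hypothetical, and in a real database the filter would be a WHERE clause or metadata filter on the query itself.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_for_tenant(query_vec, store, tenant_id, top_k=3):
    """Rank only the chunks that belong to tenant_id."""
    # Filter first: chunks from other tenants are dropped before scoring,
    # so they cannot reach the prompt no matter how similar they are.
    candidates = [c for c in store if c["tenant_id"] == tenant_id]
    candidates.sort(key=lambda c: _cosine(query_vec, c["embedding"]), reverse=True)
    return candidates[:top_k]
```

A useful QA test is exactly the case where another tenant's chunk is the closest match, and asserting it never comes back.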
What if your document corpus has 1M+ chunks?
Severity: medium. Search latency degrades, recall drops without proper indexing.
Suggested handling: Use an approximate nearest neighbor index (for example HNSW or IVFFlat in pgvector, or the equivalent in other stores). Pre-filter by metadata (user, language, date) BEFORE vector search to reduce the search space.