Retrieval-augmented generation

Created: by Pradeep Gowda Updated: May 15, 2024 Tagged: RAG · llm

RAG is an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs’ generative process. [1]

Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information. Implementing RAG in an LLM-based question answering system has two main benefits: It ensures that the model has access to the most current, reliable facts, and that users have access to the model’s sources, ensuring that its claims can be checked for accuracy and ultimately trusted.
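As a rough illustration of the grounding step described above, here is a minimal sketch of a retrieve-then-prompt loop. It assumes a toy bag-of-words retriever in place of a real embedding model, and the names `retrieve`, `build_prompt`, and `DOCS` are hypothetical. The LLM call itself is left out: the prompt is what would be sent to the model.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase word/number tokens; a stand-in for a real embedding model.
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    # Number the sources so the model's claims can be traced back to them.
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )

# Hypothetical knowledge base for the example.
DOCS = [
    "RAPTOR builds a tree of recursive summaries for retrieval.",
    "Self-RAG uses reflection tokens to decide when to retrieve.",
    "Vespa is a search engine that can store vector embeddings.",
]

if __name__ == "__main__":
    question = "What are reflection tokens in Self-RAG?"
    print(build_prompt(question, retrieve(question, DOCS, k=1)))
```

Numbering the sources in the prompt is one simple way to give users access to the model's sources, the second benefit mentioned above.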

See the Survey paper - Gao, Yunfan, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. “Retrieval-augmented generation for large language models: A survey,” 2024. https://arxiv.org/abs/2312.10997.

Developing RAG Applications

Self RAG

Self RAG, by analogy with an open-book exam: for familiar topics, answer directly; for unfamiliar ones, open the reference book, find the relevant parts, sort and summarize them in your head, then write the answer on the exam paper.


Project page: https://selfrag.github.io

Asai, Akari, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. “Self-RAG: Learning to retrieve, generate, and critique through self-reflection,” 2023. https://arxiv.org/abs/2310.11511.

We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (SELF-RAG) that enhances an LM’s quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that SELF-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, SELF-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.
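The control flow in the abstract (retrieve on demand, then reflect on the passage and on the generation) can be sketched as below. This is not the authors' implementation: in Self-RAG the decisions come from reflection tokens emitted by the trained LM, whereas here they are plain Python callables. Every function name and the sample data are hypothetical, and the sketch takes the first acceptable candidate rather than ranking candidates.

```python
def self_rag_answer(question, generate, retrieve, wants_retrieval,
                    is_relevant, is_supported):
    """Return (answer, source). Retrieve only when needed, then keep the
    first answer whose passage is relevant and supports the answer."""
    if not wants_retrieval(question):
        return generate(question, None), None
    for passage in retrieve(question):
        if not is_relevant(question, passage):
            continue  # stand-in for an "is this passage relevant?" check
        answer = generate(question, passage)
        if is_supported(answer, passage):
            return answer, passage  # stand-in for an "is it supported?" check
    # Nothing usable was retrieved: fall back to an ungrounded answer.
    return generate(question, None), None

# Hypothetical stubs standing in for the language model and the retriever.
KNOWN = {"2+2": "4"}
PASSAGES = ["Paris is the capital of France.", "The Seine flows through Paris."]

def wants_retrieval(question):
    return question not in KNOWN

def retrieve(question):
    return PASSAGES

def is_relevant(question, passage):
    return "capital" in question and "capital" in passage

def generate(question, passage):
    if passage is None:
        return KNOWN.get(question, "unknown")
    return f"According to the source: {passage}"

def is_supported(answer, passage):
    return passage in answer

if __name__ == "__main__":
    print(self_rag_answer("What is the capital of France?", generate, retrieve,
                          wants_retrieval, is_relevant, is_supported))
```

Passing the checks in as callables keeps the loop independent of how the decisions are actually made, which is the part a trained model would replace.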

Hierarchical cluster and indexing RAG

An emerging technique to better represent your data for RAG/LLM applications is to not only chunk the data, but also hierarchically cluster and index it.

Read: Salmon Run: Hierarchical (and other) Indexes using LlamaIndex for RAG Content Enrichment

RAG from Scratch


RAG From Scratch: Indexing w/ RAPTOR

Sarthi, Parth, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. “RAPTOR: Recursive abstractive processing for tree-organized retrieval,” 2024. https://arxiv.org/abs/2401.18059.

We introduce the novel approach of recursively embedding, clustering, and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, our RAPTOR model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy.
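The bottom-up tree construction in the abstract can be sketched as follows. This is a toy under loud assumptions, not the RAPTOR code: `cluster` groups neighbouring nodes in fixed-size runs instead of clustering embeddings, and `summarize` truncates the joined text instead of calling an LLM. All function names are hypothetical.

```python
def summarize(texts, max_words=12):
    # Placeholder for an LLM summarizer: keep the first words of the joined text.
    words = " ".join(texts).split()
    return " ".join(words[:max_words])

def cluster(nodes, size=2):
    # Placeholder for embedding-based clustering: fixed-size runs of neighbours.
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def build_tree(chunks, size=2):
    """Return a list of levels: level 0 holds the raw chunks, and each
    higher level holds one summary per cluster of the level below."""
    if size < 2:
        raise ValueError("size must be at least 2 so each level shrinks")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        groups = cluster(levels[-1], size)
        levels.append([summarize(group) for group in groups])
    return levels

if __name__ == "__main__":
    tree = build_tree(["a b", "c d", "e f", "g h", "i j"])
    for depth, level in enumerate(tree):
        print(depth, level)
    # Retrieval can then search nodes from every level, not only the leaves.
    all_nodes = [node for level in tree for node in level]
    print(len(all_nodes))
```

Keeping every level lets a query match either a detailed leaf chunk or a higher-level summary, which is the "different levels of abstraction" idea in the quoted abstract.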

Deepdive – Building long context RAG with RAPTOR from scratch - YouTube; langchain/cookbook/RAPTOR.ipynb @ langchain-ai/langchain

RAG-enhanced MetaGPT


Building and Evaluating Advanced RAG Applications - DeepLearning.AI

“In this course, we’ll explore:”

  • Two advanced retrieval methods: Sentence-window retrieval and auto-merging retrieval that perform better compared to the baseline RAG pipeline. 
  • Evaluation and experiment tracking: A way to evaluate and iteratively improve your RAG pipeline’s performance.
  • The RAG triad: Context Relevance, Groundedness, and Answer Relevance, which are methods to evaluate the relevance and truthfulness of your LLM’s response.
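To make the first bullet concrete, here is a minimal sketch of sentence-window retrieval: match the query against individual sentences, then return the best sentence together with its neighbours as context. It assumes naive regex sentence splitting and word-overlap scoring in place of embeddings, and `sentence_window` is a hypothetical name rather than the course's or LlamaIndex's API.

```python
import re

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def sentence_window(text, query, window=1):
    """Find the sentence sharing the most words with the query and return
    it with `window` sentences of context on each side."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    q = words(query)
    best = max(range(len(sentences)), key=lambda i: len(q & words(sentences[i])))
    start = max(0, best - window)
    end = min(len(sentences), best + window + 1)
    return " ".join(sentences[start:end])

if __name__ == "__main__":
    doc = ("Cats sleep a lot. RAG grounds answers in retrieved text. "
           "It cites its sources. Dogs bark.")
    print(sentence_window(doc, "How does RAG ground answers?"))
```

Matching on single sentences keeps the search precise, while the surrounding window gives the LLM enough context to answer.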


Designing RAGs

Design choices you need to build high-performing RAG systems, across 5 main pillars (ISRSE):

  1. Indexing: Embedding external data into a vector representation.
  2. Storing: Persisting the indexed embeddings in a database.
  3. Retrieval: Finding relevant pieces in the stored data.
  4. Synthesis: Generating answers to user queries.
  5. Evaluation: Quantifying how good the RAG system is.
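The five pillars above can be mapped onto one small function each. This is a sketch under stated assumptions: word sets stand in for vector embeddings, a dict stands in for a database, `synthesize` only assembles the context an LLM would see, and evaluation is a simple retrieval hit rate. All names are hypothetical.

```python
import re

def index(text):
    # 1. Indexing: a word set stands in for a vector embedding.
    return frozenset(re.findall(r"\w+", text.lower()))

def store(docs):
    # 2. Storing: an in-memory dict stands in for a vector database.
    return {doc_id: (index(text), text) for doc_id, text in docs.items()}

def retrieve(db, query, k=1):
    # 3. Retrieval: rank stored documents by Jaccard overlap with the query.
    q = index(query)
    def jaccard(vec):
        union = q | vec
        return len(q & vec) / len(union) if union else 0.0
    ranked = sorted(db, key=lambda doc_id: jaccard(db[doc_id][0]), reverse=True)
    return ranked[:k]

def synthesize(db, query, doc_ids):
    # 4. Synthesis: build the text an LLM would receive (no model call here).
    context = " ".join(db[doc_id][1] for doc_id in doc_ids)
    return f"Q: {query}\nContext: {context}"

def evaluate(db, labelled, k=1):
    # 5. Evaluation: fraction of queries whose expected document is in the top k.
    hits = sum(expected in retrieve(db, query, k) for query, expected in labelled)
    return hits / len(labelled)

if __name__ == "__main__":
    db = store({
        "a": "Vector databases persist embeddings.",
        "b": "Reflection tokens control retrieval.",
    })
    query = "where are embeddings persisted?"
    print(synthesize(db, query, retrieve(db, query)))
    print(evaluate(db, [(query, "a"), ("what do reflection tokens do?", "b")]))
```

Keeping the stages as separate functions makes it easy to swap one design choice, such as the retriever, and re-run the evaluation.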


RAGs and Long Context LLMs

RAG for Long Context LLMs aka “Is RAG Really Dead” talk by Lance Martin of LangChainAI.

RAG Queries


llama_parse is an API created by LlamaIndex to parse and represent files for efficient retrieval and context augmentation with LlamaIndex frameworks. See the notebook example for an insurance document query, and the product page with screenshots of how to use it.


Personal data

Hands-On RAG guide for personal data with Vespa and LLamaIndex | Vespa Blog

See also

Reward Models

To research


  1. https://research.ibm.com/blog/retrieval-augmented-generation-RAG