What Is RAG and How Does It Work?

Large language models are everywhere, revolutionizing fields such as education, content generation, and even scientific publishing. However, these models have their limitations when generating accurate and relevant responses.

These limitations can be mitigated by adopting the retrieval-augmented generation approach. In today’s blog, we will cover the essentials of RAG, or Retrieval-Augmented Generation.

What Is Retrieval-Augmented Generation?

Retrieval-augmented generation is an approach that integrates external data retrieval into the generative process to enhance the accuracy and relevance of a model’s responses. We get it; it’s a lot to process, but don’t worry.

Let’s break it down with an everyday example:

Imagine you have a 5-year-old son who is curious about dinosaurs. He asks you, “What is the name of the largest dinosaur?” Off the top of your head, you say, “I read somewhere that it’s the T-rex.” It is an answer you remember from when you were five yourself, probably decades ago.

There are two problems with this answer.

First, the source: even though you confidently said you read it somewhere, you have no source or evidence to back up the answer.

Second, even if you remember correctly what you learned in the past, it may be out of date. Science keeps making discoveries, and one of them may have revealed new information about the largest dinosaur. These two concerns are precisely the challenges that retrieval-augmented generation aims to address.

Now, what would have happened if you had looked up the answer in a reputable source like Britannica? Then you would have been able to say, “Ah, the Patagotitan is the largest dinosaur that’s currently known.” The answer keeps changing as paleontologists discover more fossils.

Now that your answer is grounded in a source rather than hallucinated or made up, it is more credible and backed up by evidence. But you may still ask, “What does this have to do with LLMs or large language models?” Let’s translate the scenario to a large language model.

How would a large language model answer this question? Let us say a user asks about the biggest dinosaur that ever existed. A large language model would say, “Okay, from what I learned during training and stored in my parameters, the answer is the T-rex.” We already know that the answer is wrong, but the large language model is confident that “T-rex” is the answer.

Let’s see what happens when you add the retrieval-augmented part to the equation. Instead of relying solely on what the LLM knows, we add a content store, which can be open, like the internet, or closed, like a collection of documents. The LLM first goes to the content store and retrieves information relevant to the user’s query. With this retrieval-augmented approach, the response is not T-rex anymore; it is the Patagotitan.

How Does RAG Work?

RAG has two phases: retrieval and content generation. In the retrieval phase, algorithms search for and retrieve snippets of information relevant to the user’s prompt or question. The retrieved context can come from multiple data sources, such as document repositories, databases, or APIs. The retrieved context is then provided as input to a generator model, typically a large language model (LLM). The generator model uses the retrieved context to inform its generated text output, producing a response grounded in relevant facts and knowledge.
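The two phases can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the content store, the word-overlap ranking, and the `echo_llm` stand-in for a real model call are all assumptions made for the example.

```python
import re

# Hypothetical closed content store: a few snippets standing in for documents.
CONTENT_STORE = [
    "Patagotitan is the largest dinosaur that is currently known.",
    "Tyrannosaurus rex was a large carnivorous dinosaur.",
    "An embedding gives text a numerical representation.",
]


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, store, top_k=1):
    """Retrieval phase: rank snippets by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(store, key=lambda s: len(query_words & tokenize(s)), reverse=True)
    return ranked[:top_k]


def generate(query, context, llm):
    """Generation phase: pass the query plus the retrieved context to a generator model."""
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
    return llm(prompt)


def echo_llm(prompt):
    """Stand-in for a real LLM call (an assumption): returns the first context line."""
    return prompt.splitlines()[1]


question = "What is the name of the largest dinosaur?"
context = retrieve(question, CONTENT_STORE)
answer = generate(question, context, echo_llm)
print(answer)
```

In a real system, `echo_llm` would be replaced by a call to an actual model, and the word-overlap ranking by a proper search or embedding-based retriever.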

To make the formats compatible, both the document collection (the knowledge library) and the user-submitted queries are converted to numerical representations using embedding language models. Embedding is the process by which text is given a numerical representation in a vector space. RAG model architectures compare the embedding of the user query with the vectors in the knowledge library. The original user prompt is then augmented with relevant context from the most similar documents in the knowledge library. This augmented prompt is then sent to the foundation model.
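To make the embedding step concrete, here is a toy sketch. It assumes a simple bag-of-words count vector in place of a real embedding model, and cosine similarity as the comparison; the function names and the three-document library are hypothetical.

```python
import math
import re
from collections import Counter


def words(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def embed(text, vocabulary):
    """Toy embedding (an assumption): word counts over a fixed vocabulary."""
    counts = Counter(words(text))
    return [float(counts[term]) for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def augment_prompt(query, library, top_k=1):
    """Append the most similar library documents to the original user prompt."""
    vocabulary = sorted({term for text in library + [query] for term in words(text)})
    query_vector = embed(query, vocabulary)
    ranked = sorted(
        library,
        key=lambda doc: cosine_similarity(query_vector, embed(doc, vocabulary)),
        reverse=True,
    )
    context = "\n".join(ranked[:top_k])
    return f"{query}\n\nRelevant context:\n{context}"


library = [
    "Patagotitan is the largest dinosaur currently known.",
    "Tyrannosaurus rex was a large carnivorous dinosaur.",
    "Paris is the capital of France.",
]
query = "What is the largest dinosaur?"
prompt = augment_prompt(query, library)
print(prompt)
```

A real embedding model captures meaning rather than exact word matches, so it would also find relevant documents that share no words with the query. The flow is the same, though: embed, compare, and append the closest matches to the prompt.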

What does this look like? Let’s look at the diagram below using the scenario we indicated in the first part of this blog:

First, the user prompts the LLM with their question. A generative model on its own would respond with whatever it was trained to answer. The RAG framework, however, first retrieves the information relevant to the question and only then generates the answer.

Hopefully, you can see how RAG helps with the two LLM challenges we stated before. Let’s start with the out-of-date part: instead of retraining your model whenever new information comes up, all you have to do is add the new information to your data store or update what is already there, so the next time a user asks the question, the accurate details are already available.
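The idea of updating the data store instead of retraining can be illustrated with a small sketch. The `ContentStore` class, its `upsert` and `search` methods, and the document ID are hypothetical names chosen for this example; the two entries reuse the dinosaur scenario from earlier.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


class ContentStore:
    """Hypothetical in-memory data store keyed by document ID."""

    def __init__(self):
        self._documents = {}

    def upsert(self, doc_id, text):
        """Add a new document, or overwrite the one stored under the same ID."""
        self._documents[doc_id] = text

    def search(self, query):
        """Return the document sharing the most words with the query, or None if empty."""
        query_words = tokenize(query)
        return max(
            self._documents.values(),
            key=lambda text: len(query_words & tokenize(text)),
            default=None,
        )


store = ContentStore()
question = "What is the largest dinosaur?"

# Outdated entry, mirroring the remembered answer from the example above.
store.upsert("largest-dinosaur", "The largest dinosaur is the T-rex.")
before = store.search(question)

# New information arrives: update the store. The model itself is untouched.
store.upsert("largest-dinosaur", "The largest dinosaur currently known is the Patagotitan.")
after = store.search(question)

print(before)
print(after)
```

The same question yields a different retrieved context after the update, without any change to the model's parameters.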

As for the other problem, the source: since the large language model is now instructed to pay attention to the primary source data before responding, it can provide evidence for its answer. This makes it less likely to hallucinate or leak data, because it relies less on the information learned during training alone. It also lets us get a positive behavior from the model: knowing when to say “I don’t know” because the information you seek is not available.
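A rough sketch of both behaviors, citing a source and saying “I don’t know,” might look like the following. The source IDs, the stopword list, and the `min_overlap` threshold are assumptions for illustration. The retrieved snippet stands in for the answer an LLM would write from it.

```python
import re

# Small, hand-picked stopword list (an assumption) so common words do not count as matches.
STOPWORDS = {"a", "an", "in", "is", "of", "that", "the", "what"}

# Hypothetical source IDs mapped to their text.
SOURCES = {
    "encyclopedia/dinosaurs": "Patagotitan is the largest dinosaur that is currently known.",
    "handbook/embeddings": "An embedding gives text a numerical representation in a vector space.",
}


def content_words(text):
    """Return the set of lowercase words in the text, minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def answer_with_source(query, sources, min_overlap=1):
    """Return the best-matching snippet and its source, or admit that nothing matches."""
    query_words = content_words(query)
    best_id, best_score = None, 0
    for source_id, text in sources.items():
        score = len(query_words & content_words(text))
        if score > best_score:
            best_id, best_score = source_id, score
    if best_id is None or best_score < min_overlap:
        return {"answer": "I don't know.", "source": None}
    return {"answer": sources[best_id], "source": best_id}


known = answer_with_source("What is the largest dinosaur?", SOURCES)
unknown = answer_with_source("What is the capital of France?", SOURCES)
print(known)
print(unknown)
```

The first question is answered with a named source attached. The second finds nothing relevant in the store, so the sketch declines to answer instead of guessing.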

To Recap

Retrieval-Augmented Generation represents a significant step forward in language models’ capabilities, addressing previous systems’ limitations by integrating dynamic, external information sources into the generative process.

RAG is particularly useful for addressing knowledge cutoff and hallucination risks, and has many applications in question answering, chatbots, and customer service. By following best practices for implementing RAG, businesses can improve the accuracy and reliability of their LLM-generated responses, which in turn improves the overall user experience.

As this technology develops, we can expect even more sophisticated interactions from our AI systems, blurring the lines between human and machine-generated content.
