Evaluating and Optimizing RAG Applications

How to use Galileo Evaluate with RAG applications
Galileo Evaluate enables you to evaluate and optimize your Retrieval-Augmented Generation (RAG) application with out-of-the-box Tracing and Analytics.

Getting Started

The first step in evaluating your application is creating an evaluation run. To do this, run your evaluation set (i.e. a set of inputs that mimic the inputs you expect to get from users) through your RAG system and create a prompt run.
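As a rough illustration of what "running your evaluation set through your RAG system" produces, here is a minimal sketch. It does not use Galileo's API: `retrieve`, `generate`, `EvalRecord`, and `run_evaluation_set` are hypothetical stand-ins for your own retriever, LLM call, and logging step.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One row of an evaluation run: the input, the retrieved chunks, the response."""
    query: str
    chunks: list[str]
    response: str


def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Hypothetical retriever: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def generate(query: str, chunks: list[str]) -> str:
    """Hypothetical generator: stands in for the LLM call in a real system."""
    return f"Based on the context: {chunks[0]}" if chunks else "I don't know."


def run_evaluation_set(eval_set: list[str], corpus: list[str]) -> list[EvalRecord]:
    """Run every evaluation input through retrieval and generation."""
    records = []
    for query in eval_set:
        chunks = retrieve(query, corpus)
        records.append(EvalRecord(query, chunks, generate(query, chunks)))
    return records


corpus = [
    "Galileo Evaluate enables tracing for RAG applications.",
    "Prompt Tags help you remember what you tried with each experiment.",
    "Chunk Utilization helps you optimize your chunking strategy.",
]
eval_set = [
    "What do Prompt Tags help you remember?",
    "How do I optimize my chunking strategy?",
]
records = run_evaluation_set(eval_set, corpus)
```

Each record keeps the intermediate retrieval result alongside the final response, which is the information you need later to trace a poor answer back to the step that caused it.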

Logging experiments with LangChain callbacks

If you're using LangChain, follow these instructions or this code example to learn how to do this.

Logging experiments with the Python logger (for non-LangChain setups)

If you're using a different orchestration library or not using one at all, follow these instructions on how to log your run.

Keeping track of what changed in your experiment

As you start experimenting, you'll want to keep track of what you changed in each experiment. To do so, use Prompt Tags. Prompt Tags are tags you can add to a run (e.g. "embedding_model" = "voyage-2" on one run and "embedding_model" = "text-embedding-ada-002" on another) so you can later tell which configuration produced which results.
Read more about how to add Prompt Tags here.

Tracing your Retrieval System

Once you log your evaluation runs, you can go to the Galileo Console to analyze your workflow executions. For each execution, you'll be able to see what the input into the workflow was and what the final response was, as well as any intermediate results.
Clicking on any row will open the Expanded View for that node. The Retriever Node will show you all the chunks that your retriever returned. Once you start debugging your executions, this will allow you to trace poor-quality responses back to the step that went wrong.

Evaluating and Optimizing the performance of your RAG application

Galileo has out-of-the-box Guardrail Metrics to help you assess and evaluate the quality of your application. In addition, Galileo supports user-defined custom metrics. When logging your evaluation run, make sure to include the metrics you want computed for your run.
For RAG applications, we recommend using the following:

Context Adherence

Context Adherence (formerly known as Groundedness) measures whether your model's response was based purely on the context provided, i.e. the response didn't state any facts not contained in that context. For RAG users, Context Adherence is a measurement of hallucinations.
If a response is grounded in the context (i.e. it has a value of 1 or close to 1), it only contains information given in the context. If a response is not grounded (i.e. it has a value of 0 or close to 0), it's likely to contain facts not included in the context provided to the model.
To fix low Context Adherence values, we recommend (1) ensuring your context DB has all the information necessary to answer the question, and (2) adjusting the prompt to tell the model to stick to the information it's given in the context.
Note: This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute.

Context Relevance

Context Relevance measures how relevant (or similar) the context provided was to the user query. This metric requires {context} and {query} slots in your data, as well as embeddings for them (i.e. {context_embedding} and {query_embedding}).
Context Relevance is a relative metric. High Context Relevance values indicate significant similarity or relevance. Low Context Relevance values are a sign that you need to augment your knowledge base or vector DB with additional documents, modify your retrieval strategy, or use better embeddings.
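To make the idea of embedding-based similarity concrete, here is a minimal sketch using cosine similarity between a query embedding and a context embedding. Cosine similarity is an assumption made for illustration; this page does not state which similarity function Galileo uses, and the vectors below are toy values rather than real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for {query_embedding} and {context_embedding}.
query_embedding = [1.0, 0.0, 1.0]
similar_context_embedding = [2.0, 0.0, 2.0]      # same direction as the query
unrelated_context_embedding = [0.0, 1.0, 0.0]    # orthogonal to the query

high = cosine_similarity(query_embedding, similar_context_embedding)
low = cosine_similarity(query_embedding, unrelated_context_embedding)
```

Because the metric is relative, scores like these are most useful when compared across queries or runs rather than read against a fixed threshold.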

Completeness

If Context Adherence is your precision metric for RAG, Completeness is your recall. In other words, Completeness tries to answer the question: "Out of all the information in the context that's pertinent to the question, how much was covered in the answer?"
Low Completeness values indicate that your context contains information relevant to the question that was not included in the model's response.
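The precision/recall analogy can be sketched with plain sets. This is a toy illustration only, not how Galileo computes either metric: it assumes the facts in the response and in the context have already been extracted as strings, and the function names are hypothetical.

```python
def adherence_as_precision(response_facts: set[str], context_facts: set[str]) -> float:
    """Share of facts in the response that are supported by the context."""
    if not response_facts:
        return 1.0  # an empty response states nothing unsupported
    return len(response_facts & context_facts) / len(response_facts)


def completeness_as_recall(response_facts: set[str], relevant_context_facts: set[str]) -> float:
    """Share of the relevant facts in the context that the response covers."""
    if not relevant_context_facts:
        return 1.0  # nothing relevant to cover
    return len(response_facts & relevant_context_facts) / len(relevant_context_facts)


# Toy example: the context holds four facts relevant to the question,
# and the response repeats two of them without adding anything else.
relevant_context_facts = {"fact A", "fact B", "fact C", "fact D"}
response_facts = {"fact A", "fact B"}

precision_like = adherence_as_precision(response_facts, relevant_context_facts)
recall_like = completeness_as_recall(response_facts, relevant_context_facts)
```

In this example the response is fully grounded (every stated fact is in the context) but covers only half of the relevant information, which is the high-adherence, low-completeness situation described above.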

Chunk Attribution

Chunk Attribution is a chunk-level metric that denotes whether a chunk was or wasn't used by the model in generating the response. Attribution helps you more quickly identify why the model said what it did, without needing to read over the whole context.
Additionally, Attribution helps you optimize your retrieval strategy.

Chunk Utilization

Chunk Utilization measures how much of the text included in your chunk was used by the model to generate a response. Chunk Utilization helps you optimize your chunking strategy.
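The difference between Chunk Attribution (was the chunk used at all?) and Chunk Utilization (how much of it was used?) can be shown with a deliberately naive sketch. It treats a sentence as "used" only if it appears verbatim in the response, which is an assumption for illustration; this page does not describe how Galileo computes these metrics, and `chunk_usage` is a hypothetical helper.

```python
def split_sentences(text: str) -> list[str]:
    """Very rough sentence splitter: split on periods and drop empty pieces."""
    return [part.strip() for part in text.split(".") if part.strip()]


def chunk_usage(chunk: str, response: str) -> dict:
    """Return a binary attribution flag and a 0-1 utilization fraction for one chunk."""
    sentences = split_sentences(chunk)
    used = [s for s in sentences if s.lower() in response.lower()]
    total_chars = sum(len(s) for s in sentences)
    used_chars = sum(len(s) for s in used)
    return {
        "attributed": bool(used),
        "utilization": used_chars / total_chars if total_chars else 0.0,
    }


chunks = [
    "Paris is the capital of France. It hosts the Louvre.",
    "Berlin is the capital of Germany.",
]
response = "Paris is the capital of France."

usage = [chunk_usage(chunk, response) for chunk in chunks]
```

Here the first chunk is attributed but only partly utilized, which might suggest smaller chunks, while the second chunk is not attributed at all, which might suggest retrieving fewer or better-targeted chunks.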

Non-RAG-Specific Metrics

Other metrics such as Uncertainty and Correctness might be useful as well. If these don't cover all your needs, you can always write custom metrics.

Iterative Experimentation

Once you've identified a problem with your RAG application, try changing your retriever logic, prompt template, or model settings, then re-run your evaluation under the same project. Your project view will allow you to quickly compare evaluation runs and see which configuration of your system worked best.
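The comparison the project view performs can be pictured with a small sketch. The run names, tags, and metric values below are made up for illustration only, and `rank_runs` is a hypothetical helper rather than part of Galileo; in practice the values would come from your logged runs.

```python
# Made-up runs for illustration; real values come from your logged evaluation runs.
runs = [
    {
        "name": "run-1",
        "tags": {"embedding_model": "voyage-2"},
        "metrics": {"context_adherence": 0.80, "completeness": 0.60},
    },
    {
        "name": "run-2",
        "tags": {"embedding_model": "text-embedding-ada-002"},
        "metrics": {"context_adherence": 0.70, "completeness": 0.75},
    },
]


def rank_runs(runs: list[dict], metric: str) -> list[dict]:
    """Sort runs from best to worst on a single metric (higher is better)."""
    return sorted(runs, key=lambda run: run["metrics"][metric], reverse=True)


for metric in ("context_adherence", "completeness"):
    best = rank_runs(runs, metric)[0]
    print(metric, "->", best["name"], best["tags"])
```

Different metrics can favor different configurations, so it helps to decide which metric matters most for your application before picking a winner. The tags on each run tell you what actually changed between them.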