Monitoring your RAG application

Galileo Observe allows you to monitor your Retrieval-Augmented Generation (RAG) application with out-of-the-box Tracing and Analytics.

Getting Started

The first step is to integrate Galileo Observe into your application code. If you're using Langchain, follow the integration instructions here. If you're not using Langchain, or you're using a different kind of orchestration service, follow these instructions on how to log your run. For any RAG or multi-step application, make sure to log your retriever node as well as your LLM node.
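If you're using Langchain, the integration is typically a single callback handler. Here's a minimal sketch, assuming the galileo-observe Python package and an existing rag_chain; exact class and parameter names may vary by SDK version:

```python
# Minimal sketch of the Langchain integration (assumes the
# galileo-observe Python package; names may differ across versions).
from galileo_observe import GalileoObserveCallback

# The callback streams every chain execution -- including retriever
# and LLM nodes -- to your Galileo Observe project.
monitor_handler = GalileoObserveCallback(project_name="my_rag_app")

# Attach it to your existing RAG chain (rag_chain is a placeholder
# for your own chain) so each invocation is logged.
response = rag_chain.invoke(
    {"query": "What is our refund policy?"},
    config={"callbacks": [monitor_handler]},
)
```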

Tracing your Retrieval System

Once you start logging your data to Galileo Observe, you can go to the Galileo Console to analyze your workflow executions. For each execution, you'll be able to see what the original input and the final output of the workflow were, as well as all the steps that were taken in between.

Clicking on any row opens the Expanded View for that execution. The Retriever Node shows all the chunks your retriever returned. When you debug an execution, this lets you trace a poor-quality response back to the step that went wrong.
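If you're not using Langchain, logging the retriever node's chunks explicitly is what populates this view. The sketch below is illustrative only: the class and method names are hypothetical stand-ins for your logging client, not confirmed SDK signatures.

```python
# Illustrative sketch of logging a retriever node and an LLM node by hand.
# Class and method names here are hypothetical stand-ins; consult the
# SDK reference for the exact workflow-logging API.
from galileo_observe import GalileoObserveWorkflows  # hypothetical import

observe = GalileoObserveWorkflows(project_name="my_rag_app")

# One workflow per end-to-end execution: original input, final output,
# and every step in between.
workflow = observe.add_workflow(input="What is our refund policy?")

# Log the retriever node with the chunks it returned, so the Expanded
# View can display them.
workflow.add_retriever(
    input="What is our refund policy?",
    documents=["Refunds are issued within 30 days of purchase.",
               "Shipping costs are non-refundable."],
)

# Log the LLM node with its prompt and the model's response.
workflow.add_llm(
    input="Context: ...\n\nQuestion: What is our refund policy?",
    output="Refunds are issued within 30 days of purchase.",
    model="gpt-4o",
)

workflow.conclude(output="Refunds are issued within 30 days of purchase.")
observe.upload_workflows()
```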

Evaluating the performance of your RAG application

Galileo has out-of-the-box Guardrail Metrics to help you assess and evaluate the quality of your application. In addition, Galileo supports user-defined custom metrics. When logging your evaluation run, make sure to include the metrics you want computed for your run.
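As an illustration, with Galileo's Python client a run might declare its metrics roughly like this. The scorer identifiers and parameters below are assumptions that vary by SDK version; check the reference for the exact names:

```python
import promptquality as pq  # Galileo's Python evaluation client

# Hypothetical sketch: scorer identifiers and parameters vary by SDK
# version -- inspect pq.Scorers for the exact names available to you.
results = pq.run(
    template="Context: {context}\n\nQuestion: {query}",
    dataset="rag_eval_set.csv",
    scorers=[
        pq.Scorers.groundedness,  # Context Adherence
        pq.Scorers.context_relevance,
        pq.Scorers.completeness_gpt,
        pq.Scorers.chunk_attribution_utilization_gpt,
    ],
)
```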

For RAG applications, we recommend using the following:

Context Adherence

Context Adherence (formerly known as Groundedness) measures whether your model's response was based purely on the context provided, i.e. the response doesn't state any facts that aren't contained in that context. For RAG users, Context Adherence is a measurement of hallucinations.

If a response is grounded in the context (i.e. it has a value of 1 or close to 1), it only contains information given in the context. If a response is not grounded (i.e. it has a value of 0 or close to 0), it's likely to contain facts not included in the context provided to the model.

To fix low Context Adherence values, we recommend (1) ensuring your context DB has all the necessary info to answer the question, and (2) adjusting the prompt to tell the model to stick to the information it's given in the context.
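For (2), the instruction can be as simple as an explicit grounding clause in your system prompt. The wording below is just one illustrative option:

```python
# One illustrative way to phrase recommendation (2): instruct the model
# to answer only from the retrieved context.
SYSTEM_PROMPT = (
    "Answer the question using ONLY the information in the context below. "
    "If the context does not contain the answer, say you don't know. "
    "Do not add facts that are not in the context.\n\n"
    "Context:\n{context}\n\nQuestion: {query}"
)
```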

Note: This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute.

Context Relevance

Context Relevance measures how relevant (or similar) the context provided was to the user query. This metric requires {context} and {query} slots in your data, as well as embeddings for them (i.e. {context_embedding} and {query_embedding}).

Context Relevance is a relative metric. High values indicate that the retrieved context was closely related to the query. Low values are a sign that you need to augment your knowledge base or vector DB with additional documents, modify your retrieval strategy, or use better embeddings.
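The underlying idea is a similarity comparison between the two embeddings. The sketch below illustrates it with plain cosine similarity (illustrative only, not Galileo's exact implementation):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy three-dimensional embeddings; real ones have hundreds of dimensions.
query_embedding = np.array([0.12, 0.83, 0.54])
context_embedding = np.array([0.10, 0.80, 0.60])

# Values near 1 mean the retrieved context is close to the query.
print(cosine_similarity(query_embedding, context_embedding))
```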

Completeness

If Context Adherence is your precision metric for RAG, Completeness is your recall. In other words, it tries to answer the question: "Out of all the information in the context that's pertinent to the question, how much was covered in the answer?"

Low Completeness values indicate there's relevant information to the question included in your context that was not included in the model's response.

Chunk Attribution

Chunk Attribution is a chunk-level metric that denotes whether a chunk was or wasn't used by the model in generating the response. Attribution helps you more quickly identify why the model said what it did, without needing to read over the whole context.

Additionally, Attribution helps you optimize your retrieval strategy: if many of the chunks you retrieve are never attributed, for example, you can likely retrieve fewer chunks per query without hurting response quality.

Chunk Utilization

Chunk Utilization measures how much of the text in a chunk the model used to generate its response. Chunk Utilization helps you optimize your chunking strategy: consistently low utilization on attributed chunks suggests your chunks are longer than they need to be.
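To make the two chunk metrics concrete, here is a toy illustration (not Galileo's actual computation): attribution is a per-chunk yes/no, while utilization is the fraction of the chunk's text the response drew on.

```python
# Toy illustration (not Galileo's actual computation). Attribution is a
# per-chunk boolean; utilization is a fraction of the chunk's text.
chunks = [
    {"attributed": True,  "used_chars": 120, "total_chars": 600},
    {"attributed": True,  "used_chars": 90,  "total_chars": 550},
    {"attributed": False, "used_chars": 0,   "total_chars": 480},
]

for i, chunk in enumerate(chunks):
    utilization = chunk["used_chars"] / chunk["total_chars"]
    print(f"chunk {i}: attributed={chunk['attributed']}, "
          f"utilization={utilization:.0%}")

# Attributed chunks at ~20% utilization suggest smaller chunks would
# likely serve the model just as well.
```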

Non-RAG specific Metrics

Other metrics such as Uncertainty and Correctness might be useful as well. If these don't cover all your needs, you can always write custom metrics.
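As a sketch of what a custom metric can look like, here is a row-level executor paired with a run-level aggregator. The CustomScorer interface shown is an assumption; check the SDK reference for the exact registration API:

```python
import promptquality as pq

# Hypothetical sketch: the CustomScorer interface and the row object's
# fields are assumptions -- consult the SDK reference for the exact API.
def response_length(row) -> int:
    # Row-level score: number of characters in the model's response.
    return len(row.response)

def average_length(scores, indices) -> dict:
    # Aggregate row-level scores into a single run-level number.
    return {"average_response_length": sum(scores) / len(scores)}

length_scorer = pq.CustomScorer(
    name="Response Length",
    executor=response_length,
    aggregator=average_length,
)
```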
