Understand Galileo's Completeness Metric

The metric is intended for RAG workflows.

Definition: Measures how thoroughly your model's response covers the relevant information available in the provided context.

Completeness and Context Adherence are closely related, and designed to complement one another:

  • Context Adherence answers the question, "is the model's response consistent with the information in the context?"

  • Completeness answers the question, "is the relevant information in the context fully reflected in the model's response?"

In other words, if Context Adherence is "precision," then Completeness is "recall."

Consider this simple, stylized example that illustrates the distinction:

  • User query: "Who was Galileo Galilei?"

  • Context: "Galileo Galilei was an Italian astronomer."

  • Model response: "Galileo Galilei was Italian."

This response would receive a perfect Context Adherence score: everything the model said is supported by the context.

But this is not an ideal response. The context also specified that Galileo was an astronomer, and the user probably wants to know that information as well.

Hence, this response would receive a low Completeness score. Tracking Completeness alongside Context Adherence allows you to detect cases like this one, where the model is "too reticent" and fails to mention relevant information.

Calculation: Completeness is computed by sending additional requests to your LLM, using a carefully engineered chain-of-thought prompt that asks the model to determine what fraction of relevant information was covered in the response. The metric requests multiple distinct responses to this prompt, each of which produces an explanation along with a final numeric score between 0 and 1.

The Completeness score is an average over the individual scores.

We also surface one of the generated explanations. The surfaced explanation is chosen from the response whose individual score was closest to the average score over all the responses. For example, if we make 3 requests and receive the scores [0.4, 0.5, 0.6], the Completeness score will be 0.5, and the explanation from the second response will be surfaced.
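The aggregation described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not Galileo's actual implementation; the names `JudgeResponse` and `aggregate_completeness` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class JudgeResponse:
    score: float        # individual numeric score in [0, 1]
    explanation: str    # the chain-of-thought explanation for that score

def aggregate_completeness(responses: list[JudgeResponse]) -> tuple[float, str]:
    """Average the individual scores, and surface the explanation from the
    response whose score is closest to that average."""
    avg = sum(r.score for r in responses) / len(responses)
    closest = min(responses, key=lambda r: abs(r.score - avg))
    return avg, closest.explanation

# The example from the text: scores [0.4, 0.5, 0.6] average to 0.5,
# so the second response's explanation is surfaced.
responses = [
    JudgeResponse(0.4, "Missed most of the relevant facts."),
    JudgeResponse(0.5, "Covered about half of the context."),
    JudgeResponse(0.6, "Covered most, but not all, relevant facts."),
]
score, explanation = aggregate_completeness(responses)
```

With these inputs, `score` is 0.5 and the second response's explanation is returned, matching the worked example above.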

Usefulness: To fix low Completeness values, we recommend adjusting the prompt to tell the model to include all the relevant information it can find in the provided context.
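As a minimal illustration of such an adjustment (the exact wording and template variables here are hypothetical, not a Galileo-prescribed prompt), an explicit instruction can be appended to the generation prompt:

```python
# A hypothetical prompt tweak aimed at improving Completeness:
# explicitly instruct the model to use everything relevant in the context.
base_prompt = "Answer the user's question using the context below.\n"
completeness_nudge = (
    "Include ALL relevant information from the context in your answer; "
    "do not omit details that help address the question.\n"
)
prompt = base_prompt + completeness_nudge + "Context: {context}\nQuestion: {question}"
```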

Deep dive: to read more about the research behind this metric, see RAG Quality Metrics.

Note: This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute.
