Choosing your Guardrail Metrics

How to choose and understand your guardrail metrics

Galileo Metrics

Galileo has built a menu of Guardrail Metrics for you to choose from. These metrics are tailored to your use case and are designed to help you evaluate your prompts and models.

Galileo's Guardrail Metrics combine industry-standard metrics (e.g. BLEU, ROUGE-1, Perplexity) with metrics developed by Galileo's in-house ML Research Team (e.g. Uncertainty, Correctness, Context Adherence).

Here's a list of the metrics supported today:

  • Uncertainty - Measures the model's certainty in its generated responses. Uncertainty works at the response level as well as at the token level. It has shown a strong correlation with hallucinations (made-up facts, names, or citations).

  • Context Adherence - Measures whether your model's response was purely based on the context provided. This metric is intended for RAG users. This metric is computed by prompting a GPT model, and thus requires additional LLM calls to compute.

  • Completeness - Evaluates how comprehensively the response addresses the question using all the relevant information from the provided context. If Context Adherence is your RAG 'Precision' metric, Completeness is your RAG 'Recall'. This metric is also computed by prompting a GPT model, and thus requires additional LLM calls to compute.

  • Chunk Attribution - Measures the number of chunks a model uses when generating an output. By optimizing the number of chunks a model is retrieving, teams can improve output quality and system performance and avoid the excess costs of including unused chunks in prompts to LLMs. This metric requires Galileo to be hooked into your retriever step.

  • Chunk Utilization - Measures how much of each chunk was used by a model when generating an output, and helps teams right-size their chunks. This metric requires Galileo to be hooked into your retriever step.

  • Correctness - Measures whether the claims stated in the response are based on real facts. This metric requires additional LLM calls. Combined with Uncertainty, Correctness is a good way of uncovering hallucinations.

  • Context Relevance - Measures how relevant the context provided was to the user query. This metric is intended for RAG users. This metric requires {context} and {query} slots in your data, as well as embeddings for them (i.e. {context_embedding}, {query_embedding}).

  • Private Identifiable Information - This Guardrail Metric surfaces any instances of PII in your model's responses. We surface whether your text contains any credit card numbers, social security numbers, phone numbers, street addresses and email addresses.

  • Toxicity - Measures whether the model's responses contained any abusive, toxic or foul language.

  • Tone - Classifies the tone of the response into 9 different emotion categories: neutral, joy, love, fear, surprise, sadness, anger, annoyance, and confusion.

  • Sexism - Measures how 'sexist' a comment might be perceived, on a scale from 0 to 1 (1 being more sexist).

  • BLEU & ROUGE-1 - These metrics measure n-gram similarities between your Generated Responses and your Target output. These metrics require a {target} column in your dataset.

  • More coming very soon.
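
To make the n-gram similarity behind the BLEU & ROUGE-1 entry above more concrete, here is a minimal sketch of unigram overlap between a generated response and a target. It is a simplified illustration, not Galileo's implementation: the function name `unigram_overlap` is hypothetical, tokenization is plain whitespace splitting, and it omits BLEU's higher-order n-grams and brevity penalty. The recall value corresponds to the usual ROUGE-1 recall, and the precision value to a clipped unigram precision of the kind BLEU builds on.

```python
from collections import Counter


def unigram_overlap(generated: str, target: str) -> dict:
    """Clipped unigram overlap between a generated response and a target.

    Hypothetical helper for illustration only; real BLEU/ROUGE
    implementations add proper tokenization, stemming options,
    higher-order n-grams, and (for BLEU) a brevity penalty.
    """
    gen_tokens = generated.lower().split()
    tgt_tokens = target.lower().split()

    # Counter intersection keeps the minimum count of each shared token,
    # so a repeated word cannot be matched more often than it appears
    # in the other text.
    overlap = sum((Counter(gen_tokens) & Counter(tgt_tokens)).values())

    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(tgt_tokens) if tgt_tokens else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # The second argument plays the role of the {target} column.
    print(unigram_overlap("the cat sat on the mat", "the cat lay on the mat"))
```

Because the score compares the response against a reference text, it can only be computed when a {target} value exists for each row.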

A more thorough description of all Guardrail Metrics can be found here.
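
The Context Relevance entry in the list above requires embeddings for both the query and the context. This page does not say how those embeddings are compared, so the following is only a sketch of one common approach, cosine similarity, under that assumption. The function name `cosine_similarity` and the example vectors are hypothetical, and the actual computation Galileo performs may differ.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length.

    Assumption for illustration: a higher similarity between the
    query embedding and the context embedding suggests the context is
    more relevant to the query.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    # A zero vector has no direction, so treat the similarity as 0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up 3-dimensional vectors standing in for
    # {query_embedding} and {context_embedding}.
    query_embedding = [0.2, 0.7, 0.1]
    context_embedding = [0.25, 0.6, 0.05]
    print(cosine_similarity(query_embedding, context_embedding))
```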

Custom Metrics

To set up custom metrics for Galileo Observe projects, please see instructions and sample code snippet here.
