Choosing your Guardrail Metrics

How to choose and understand your guardrail metrics

Galileo Metrics

Galileo has built a menu of Guardrail Metrics for you to choose from. These metrics are tailored to your use case and are designed to help you evaluate your prompts and models.

Galileo's Guardrail Metrics combine industry-standard metrics (e.g. BLEU, ROUGE-1, Perplexity) with metrics developed by Galileo's in-house ML Research Team (e.g. Uncertainty, Correctness, Context Adherence).

Here's a list of the metrics supported today:

Output Quality Metrics:

  • Uncertainty - Measures the model's certainty in its generated responses. Uncertainty works at the response level as well as at the token level. It has shown a strong correlation with hallucinations such as made-up facts, names, or citations.

  • Correctness - Measures whether the facts stated in the response are true. This metric requires additional LLM calls. Combined with Uncertainty, Correctness is a good way of uncovering hallucinations.

  • BLEU & ROUGE-1 - These metrics measure n-gram similarities between your Generated Responses and your Target output. These metrics require a {target} column in your dataset.

  • Prompt Perplexity - Measures the perplexity of a prompt. Previous research has shown that as perplexity decreases, generations tend to increase in quality.
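
To make the n-gram and perplexity ideas above concrete, here is a minimal sketch in plain Python. It is not Galileo's implementation: the function names rouge_1 and perplexity, the whitespace tokenization, and the use of natural-log token probabilities are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def rouge_1(generated: str, target: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 (illustrative ROUGE-1).

    Assumes lowercased whitespace tokenization; real scorers often add
    stemming and more careful tokenization.
    """
    gen_tokens = generated.lower().split()
    tgt_tokens = target.lower().split()
    # Clipped overlap: a token counts at most as often as it appears in both.
    overlap = sum((Counter(gen_tokens) & Counter(tgt_tokens)).values())
    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(tgt_tokens) if tgt_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the negative mean natural-log token probability."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    # Hypothetical response and {target} pair.
    print(rouge_1("the cat sat", "the cat sat on the mat"))
    # Hypothetical per-token log probabilities for a prompt.
    print(perplexity([math.log(0.5)] * 4))
```

In this sketch, a generated response that repeats only part of the target scores high precision but lower recall, and a prompt whose tokens each have probability 0.5 has a perplexity of about 2.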

RAG Quality Metrics:

  • Context Adherence - Measures whether your model's response was purely based on the context provided. This metric is intended for RAG users. This metric is computed by prompting an LLM, and thus requires additional LLM calls to compute.

  • Completeness - Measures how thoroughly your model's response covered relevant information from the context provided. This metric is intended for RAG users. This metric is computed by prompting an LLM, and thus requires additional LLM calls to compute.

  • Chunk Attribution - Measures which individual chunks retrieved in a RAG workflow influenced your model's response. This metric is intended for RAG users. This metric is computed by prompting an LLM, and thus requires additional LLM calls to compute.

  • Chunk Utilization - For each chunk retrieved in a RAG workflow, measures the fraction of the chunk text that influenced your model's response. This metric is intended for RAG users. This metric is computed by prompting an LLM, and thus requires additional LLM calls to compute.

  • Context Relevance - Measures how relevant the context provided was to the user query. This metric is intended for RAG users. This metric requires {context} and {query} slots in your data, as well as embeddings for them (i.e. {context_embedding}, {query_embedding}).
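
As a rough illustration of two of the quantities above, the sketch below compares a query embedding with a context embedding and computes the fraction of a chunk covered by influential text. Both parts are assumptions: this page does not say how Context Relevance is scored from the embeddings (cosine similarity is one common choice), and in Galileo the influential portions of a chunk come from an LLM call, which is replaced here by hypothetical character spans. The function names cosine_similarity and chunk_utilization are illustrative only.

```python
import math
from typing import Iterable, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen for this sketch: zero vectors score 0.
    return dot / (norm_a * norm_b)


def chunk_utilization(chunk_length: int,
                      used_spans: Iterable[Tuple[int, int]]) -> float:
    """Fraction of a chunk's characters covered by (start, end) spans.

    Spans are half-open character offsets; overlapping spans are counted once.
    """
    if chunk_length <= 0:
        return 0.0
    covered = 0
    last_end = 0
    for start, end in sorted(used_spans):
        start = max(start, last_end)   # skip the part already counted
        end = min(end, chunk_length)   # clamp to the chunk boundary
        if end > start:
            covered += end - start
            last_end = end
    return covered / chunk_length


if __name__ == "__main__":
    # Hypothetical {query_embedding} and {context_embedding} values.
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
    # Hypothetical spans of a 10-character chunk that influenced the response.
    print(chunk_utilization(10, [(0, 3), (2, 5)]))
```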

Safety Metrics:

  • Private Identifiable Information (PII) - This Guardrail Metric surfaces any instances of PII in your model's responses. We surface whether your text contains any credit card numbers, social security numbers, phone numbers, street addresses, or email addresses.

  • Toxicity - Measures whether the model's responses contained any abusive, toxic, or foul language.

  • Tone - Classifies the tone of the response into 9 different emotion categories: neutral, joy, love, fear, surprise, sadness, anger, annoyance, and confusion.

  • Sexism - Measures how sexist a comment might be perceived to be, on a scale from 0 to 1 (1 being more sexist).

  • Prompt Injection - Detects and classifies various categories of prompt injection attacks.

  • More coming very soon.
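
To give a feel for what surfacing PII in a response involves, here is a deliberately simple, pattern-based sketch. It is not how Galileo detects PII: the regular expressions, the category names, and the find_pii function are all assumptions for illustration, they cover only US-style formats, and street addresses are left out because simple patterns handle them poorly.

```python
import re
from typing import Dict, List

# Simplified, hypothetical patterns; production detectors are far more robust.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "social_security_number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card_number": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "phone_number": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return every pattern match in the text, grouped by PII category.

    Categories with no matches are omitted from the result.
    """
    found = {}
    for category, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[category] = matches
    return found


if __name__ == "__main__":
    # Hypothetical model response containing an email and a phone number.
    print(find_pii("Reach me at jane@example.com or 555-123-4567."))
```

A response with no matches returns an empty dictionary, which a caller could treat as "no PII surfaced" under these simplified patterns.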

A more thorough description of all Guardrail Metrics can be found here.

If you want to set up your custom metrics, please see instructions here.
