Galileo Metrics

Galileo provides a menu of Guardrail Metrics to choose from. Pick the ones that fit your use case to evaluate your prompts and models.

Galileo’s Guardrail Metrics combine industry-standard metrics with metrics developed by Galileo’s in-house AI Research Team (e.g. Uncertainty, Correctness, Context Adherence).

Here’s a list of some of the metrics supported today:

  • Context Adherence - Measures whether your model’s response was grounded in the context provided. This metric is intended for RAG or context-based use cases and is a good measure for hallucinations.

  • Completeness - Evaluates how comprehensively the response addresses the question using all the relevant information from the provided context. If Context Adherence is your RAG ‘Precision’ metric, Completeness is your RAG ‘Recall’.

  • Chunk Attribution - Measures the number of chunks a model uses when generating an output. By optimizing the number of chunks a model is retrieving, teams can improve output quality and system performance and avoid the excess costs of including unused chunks in prompts to LLMs. This metric requires Galileo to be hooked into your retriever step.

  • Chunk Utilization - Measures how much of each chunk was used by a model when generating an output, and helps teams rightsize their chunk size. This metric requires Galileo to be hooked into your retriever step.

  • Instruction Adherence - Measures whether your model’s response followed the instructions given in the prompt.

  • Correctness - Measures whether the claims stated in the response are factually accurate. This metric requires additional LLM calls. Combined with Uncertainty, Correctness is a good way of uncovering hallucinations.

  • Prompt Injections - Identifies any adversarial attacks or prompt injections.

  • Private Identifiable Information (PII) - Surfaces any instances of PII in your model’s responses, flagging whether the text contains credit card numbers, social security numbers, phone numbers, street addresses, or email addresses.

  • Toxicity - Measures whether the model’s responses contained any abusive, toxic or foul language.

  • Tone - Classifies the tone of the response into 9 different emotion categories: neutral, joy, love, fear, surprise, sadness, anger, annoyance, and confusion.

  • Sexism - Measures how ‘sexist’ a comment might be perceived, as a score from 0 to 1 (1 being more sexist).

  • Uncertainty - Measures the model’s certainty in its generated responses. Uncertainty works at the response level as well as at the token level. It has shown a strong correlation with hallucinations such as made-up facts, names, or citations.

  • and more.
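
To make the difference between Chunk Attribution and Chunk Utilization concrete, here is a toy illustration. It is not Galileo’s implementation: the `ChunkUsage` record and both helper functions are hypothetical, and the sketch assumes something upstream has already decided which part of each retrieved chunk the response relied on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkUsage:
    """Hypothetical record for one retrieved chunk (not a Galileo type)."""
    text: str       # full chunk text that was placed in the prompt
    used_text: str  # part judged to have influenced the response ("" if none)


def chunk_attribution(chunks: List[ChunkUsage]) -> int:
    """Count the chunks that contributed anything to the response."""
    return sum(1 for chunk in chunks if chunk.used_text)


def chunk_utilization(chunk: ChunkUsage) -> float:
    """Fraction of one chunk's characters that were used, from 0.0 to 1.0."""
    if not chunk.text:
        return 0.0
    return len(chunk.used_text) / len(chunk.text)


# Two retrieved chunks: the first was half used, the second was ignored.
retrieved = [
    ChunkUsage(text="a" * 10, used_text="a" * 5),
    ChunkUsage(text="b" * 8, used_text=""),
]

attributed = chunk_attribution(retrieved)
utilizations = [chunk_utilization(chunk) for chunk in retrieved]
```

In this framing, a low attribution count relative to the number of retrieved chunks suggests retrieving fewer chunks, while low utilization on attributed chunks suggests the chunks are larger than they need to be.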

A more thorough description of all Guardrail Metrics can be found here.

Custom Metrics

To set up custom metrics for Galileo Observe projects, see the instructions and sample code snippet here.
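
As a rough idea of what the logic behind a custom metric can look like, the sketch below scores each response and then aggregates the scores. Both function names and the length-based rule are hypothetical, chosen only for illustration. How a metric is registered with Galileo Observe is not shown here, since that depends on the Galileo SDK and is covered by the linked instructions.

```python
from statistics import mean
from typing import List


def response_length_score(response: str, max_words: int = 100) -> float:
    """Hypothetical scorer: 1.0 within the word budget, lower as it is exceeded."""
    word_count = len(response.split())
    if word_count <= max_words:
        return 1.0
    return max_words / word_count


def aggregate_scores(scores: List[float]) -> float:
    """Average the per-response scores; an empty list yields 0.0."""
    return mean(scores) if scores else 0.0


# Score a small batch of responses against a 3-word budget.
responses = ["short and sweet", "this one runs to six words"]
scores = [response_length_score(r, max_words=3) for r in responses]
overall = aggregate_scores(scores)
```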