Guardrail Metrics

Guardrail Metrics are helpful for evaluating and monitoring LLMs and prompts in production.
Galileo's Guardrail Metrics are a combination of industry-standard metrics (e.g. BLEU, ROUGE-1, Perplexity) and metrics that come out of Galileo's in-house, state-of-the-art ML research (e.g. Uncertainty, Factuality, Groundedness).
Below is a list of the metrics currently supported, their definitions, and how they're calculated:
Uncertainty
Uncertainty measures how much the model is deciding randomly between multiple ways of continuing the output.
Uncertainty is measured at both the token level and the response level.
At the token level:
  • Low Uncertainty means the model is fairly confident about what to say next, given the preceding tokens
  • High Uncertainty means the model is unsure what to say next, given the preceding tokens
Uncertainty at the response level is simply the maximum token-level Uncertainty, over all the tokens in the model's response.
Some types of LLM hallucinations – particularly made-up names, citations, and URLs – are strongly correlated with Uncertainty. Monitoring Uncertainty can help you pinpoint these types of errors.
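For intuition, here is a minimal sketch of this idea. It is not Galileo's implementation: it simply approximates token-level uncertainty as the normalized entropy over the top-k log probabilities a model returns for each generated token, then takes the maximum across tokens as the response-level score.

```python
import math

def token_uncertainty(top_logprobs: dict) -> float:
    """Normalized entropy over the top-k next-token log probabilities.

    Returns ~0.0 when one candidate takes essentially all the probability
    mass and 1.0 when the top-k candidates are equally likely.
    """
    probs = [math.exp(lp) for lp in top_logprobs.values()]
    total = sum(probs)
    probs = [p / total for p in probs]  # renormalize over the observed top-k
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs)) if len(probs) > 1 else 0.0

def response_uncertainty(per_token_top_logprobs: list) -> float:
    """Response-level uncertainty: the maximum token-level value in the response."""
    return max(token_uncertainty(t) for t in per_token_top_logprobs)

# Example with two tokens' worth of top-3 log probabilities:
# response_uncertainty([{"the": -0.05, "a": -3.2, "an": -4.0},
#                       {"cat": -0.9, "dog": -1.1, "bird": -1.4}])
```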
Groundedness
Groundedness measures whether your model's response was based purely on the context provided, i.e. the response doesn't state any facts that aren't contained in that context.
In RAG use cases, the context consists of the retrieved documents.
Groundedness is a measurement of closed-domain hallucinations: cases where your model said things that were not provided in the context.
This contrasts with open-domain hallucinations – factual errors that don't relate to any specific documents or context. You can measure open-domain hallucinations using our Uncertainty and Factuality metrics.
If a response is grounded in the context (i.e. it has a value of 1 or close to 1), it only contains information given in the context. If a response is not grounded (i.e. it has a value of 0 or close to 0), it's likely to contain facts not included in the context provided to the model.
The numeric value of the score expresses our algorithm's level of confidence. Values near 0.5 indicate that our algorithm is unsure whether or not the response is grounded, while values near 1 or 0 indicate high confidence that the response was grounded (near 1) or not grounded (near 0).
The explanation of why something was deemed grounded can be seen upon hovering over the metric value.
To fix low Groundedness values, we recommend (1) ensuring your context DB has all the necessary info to answer the question, and (2) adjusting the prompt to tell the model to stick to the information it's given in the context.
Note: This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute.
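To make the mechanics concrete, here is a rough sketch of the multi-sample LLM-judge approach. The judge prompt, the generic `llm` callable, and the simple vote-averaging below are illustrative assumptions, not Galileo's actual prompt or aggregation:

```python
from statistics import mean
from typing import Callable

# Hypothetical judge prompt -- illustrative only.
JUDGE_PROMPT = """Context:
{context}

Response:
{response}

Does the response state only facts that are supported by the context above?
Answer with a single word: yes or no."""

def groundedness_score(context: str, response: str,
                       llm: Callable[[str], str], n_samples: int = 5) -> float:
    """Fraction of judge samples that consider the response grounded.

    Values near 1 or 0 reflect strong agreement across samples, while
    values near 0.5 mean the judge samples disagree.
    """
    prompt = JUDGE_PROMPT.format(context=context, response=response)
    votes = [llm(prompt).strip().lower().startswith("yes") for _ in range(n_samples)]
    return mean(float(v) for v in votes)
```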
Factuality
Factuality measures whether the facts stated in the response are accurate.
Combined with Uncertainty, Factuality is a good way of uncovering open-domain hallucinations: factual errors that don't relate to any specific documents or context. If you're using RAG, you can supplement these with Groundedness, which detects deviations from the provided context.
If the response is factual (i.e. it has a value of 1 or close to 1), the information is believed to be correct. If a response is not factual (i.e. it has a value of 0 or close to 0), it's likely to contain factual errors.
The numeric value of the score expresses our algorithm's level of confidence. Values near 0.5 indicate that our algorithm is unsure whether or not the response is factual, while values near 1 or 0 indicate high confidence that the response was factual (near 1) or not factual (near 0).
The explanation of why something was deemed factual or not can be seen upon hovering over the metric value.
Note: This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute.
Context Relevance
Context Relevance measures how relevant (or similar) the provided context was to the user query. This metric requires {context} and {query} slots in your data, as well as embeddings for them (i.e. {context_embedding} and {query_embedding}).
Context Relevance is a relative metric. High Context Relevance values indicate significant similarity or relevance. Low Context Relevance values can be a sign that you need to augment your knowledge base or vector DB with additional documents, modify your retrieval strategy or use better embeddings.
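As a sketch of the underlying idea (assuming a cosine-similarity comparison between the embeddings you supply; Galileo's exact scoring may differ):

```python
import numpy as np

def context_relevance(query_embedding: np.ndarray, context_embedding: np.ndarray) -> float:
    """Cosine similarity between the query and context embeddings.

    Higher values mean the retrieved context is semantically closer to the
    user query; low values suggest the retriever returned unrelated documents.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    c = context_embedding / np.linalg.norm(context_embedding)
    return float(np.dot(q, c))
```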
Personally Identifiable Information
This Guardrail Metric surfaces any instances of PII in your model's responses. We use regular expressions to flag any text that may include phone numbers, social security numbers, credit card numbers, street addresses, and email addresses.
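The sketch below shows the general regex-based approach; the patterns are simplified illustrations, not the exact expressions Galileo uses:

```python
import re

# Illustrative patterns only -- real-world PII detection needs broader
# coverage (formats, locales, street addresses, etc.).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def find_pii(text: str) -> dict:
    """Return any PII-like matches found in the text, keyed by type."""
    hits = {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
    return {name: matches for name, matches in hits.items() if matches}
```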
Toxicity
Toxicity measures whether the model's responses contained any abusive, toxic, or foul language. We leverage the unitary/toxic-bert model to output a score between 0 and 1 indicating how toxic a specific text input is.
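If you want to reproduce a comparable score locally with the same open-source model, a sketch along these lines should work (label names are taken from the model card; Galileo runs this server-side, so you don't need to host the model yourself):

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="unitary/toxic-bert")

def toxicity_score(text: str) -> float:
    """Probability (0-1) of the 'toxic' label for the given text."""
    # toxic-bert is multi-label, so request every label and apply a per-label sigmoid.
    scores = classifier(text, top_k=None, function_to_apply="sigmoid")
    return next(s["score"] for s in scores if s["label"] == "toxic")
```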
Tone
Galileo's Tone Guardrail performs sentiment analysis on your model's response. Each response is classified into one of 9 categories: neutral, joy, love, fear, surprise, sadness, anger, annoyance, and confusion. We utilize the IsaacZhy/roberta-large-goemotions model, which is trained on Google's GoEmotions dataset of 28 emotions, and subsample to 8 'parent' emotions (plus neutral) to give you the most consumable insights possible.
Sexism
Galileo's Sexism Guardrail measures sexism or misogyny as a score from 0 to 1. We leverage the annahaz/xlm-roberta-base-misogyny-sexism-tweets model. Higher values indicate a higher probability that the response is perceived as sexist.
BLEU & ROUGE-1
These metrics measure n-gram similarities between your generated responses and your target outputs. They require a {target} column in your dataset.
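To sanity-check these numbers locally, the open-source nltk and rouge-score packages compute comparable scores (a sketch; Galileo computes them for you whenever a target column is present):

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

def bleu_and_rouge1(generated: str, target: str) -> dict:
    """Sentence-level BLEU and ROUGE-1 F1 between a generated response and its target."""
    bleu = sentence_bleu(
        [target.split()],
        generated.split(),
        smoothing_function=SmoothingFunction().method1,  # avoids zero scores on short texts
    )
    rouge1 = (
        rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
        .score(target, generated)["rouge1"]
        .fmeasure
    )
    return {"bleu": bleu, "rouge1": rouge1}
```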
More coming very soon.