Choosing your Guardrail Metrics

How to choose and understand your guardrail metrics

Galileo Metrics

Galileo has built a menu of Guardrail Metrics for you to choose from. These metrics are tailored to your use case and are designed to help you evaluate your prompts and models.
Galileo's Guardrail Metrics combine industry-standard metrics (e.g. BLEU, ROUGE-1, Perplexity) with metrics developed by Galileo's in-house ML Research Team (e.g. Uncertainty, Factuality, Groundedness).
Here's a list of the metrics supported today:
  • Uncertainty - Measures the model's certainty in its generated responses. Uncertainty works at the response level as well as at the token level. It has shown a strong correlation with hallucinations, i.e. made-up facts, names, or citations.
  • Context Adherence - Measures whether your model's response was purely based on the context provided. This metric is intended for RAG users. This metric is computed by prompting a GPT model, and thus requires additional LLM calls to compute.
  • Correctness - Measures whether the facts stated in the response are based on real facts. This metric requires additional LLM calls. Combined with Uncertainty, Correctness is a good way of uncovering hallucinations.
  • Context Relevance - Measures how relevant the context provided was to the user query. This metric is intended for RAG users. This metric requires {context} and {query} slots in your data, as well as embeddings for them (i.e. {context_embedding}, {query_embedding}).
  • Private Identifiable Information - This Guardrail Metric surfaces any instances of PII in your model's responses. We surface whether your text contains any credit card numbers, social security numbers, phone numbers, street addresses and email addresses.
  • Toxicity - Measures whether the model's responses contained any abusive, toxic or foul language.
  • Tone - Classifies the tone of the response into 9 different emotion categories: neutral, joy, love, fear, surprise, sadness, anger, annoyance, and confusion.
  • Sexism - Measures how sexist a response might be perceived to be, on a scale from 0 to 1 (1 being more sexist).
  • BLEU & ROUGE-1 - These metrics measure n-gram similarities between your Generated Responses and your Target output. These metrics require a {target} column in your dataset.
  • More coming very soon.
A more thorough description of all Guardrail Metrics can be found here.
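To make the n-gram overlap behind BLEU and ROUGE-1 more concrete, here is a minimal sketch of a ROUGE-1 style score (unigram overlap between a generated response and a target). It is an illustration only, not Galileo's implementation: the whitespace tokenization, the lowercasing, and the function name rouge_1 are all assumptions made for this example.

```python
from collections import Counter


def rouge_1(generated: str, target: str) -> dict:
    """Unigram-overlap precision, recall and F1 between two strings.

    Illustrative only: tokenization is a plain lowercase whitespace split.
    """
    gen_tokens = generated.lower().split()
    tgt_tokens = target.lower().split()
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(tgt_tokens)).values())
    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(tgt_tokens) if tgt_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Five of the six unigrams match ("sat" vs. "lay" differ).
scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

The {target} column mentioned above plays the role of the target argument here: without a reference text there is nothing to compare the generated response against.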

Custom Metrics

Galileo LLM Studio supports Custom Metrics (programmatic or GPT-based) for all your Prompt and Monitor projects. Depending on where, when and how you want these metrics to be executed, you have the option to choose between Custom Scorers and Registered Scorers.
Key Differences

| Scorer Type | Custom Scorers | Registered Scorers |
| --- | --- | --- |
| Available for runs | Created via the Python client | Created from the Python client and UI |
| Sharing across the organization | Outside Galileo | Accessible within the Galileo console |
| Scorer Definition | Within the notebook | As an independent Python file |
| Execution Environment | Within your Python environment | |
| Python Libraries available | Any library within your virtual environment | Limited to a Galileo provided execution environment |
| Execution Resources | Any resources available to your local instance | Restricted by Galileo |

Custom Scorers

For example, let's say we wanted to create a custom metric that measured the length of the response. In our Python environment, we would define an executor function, an aggregator function, and create a CustomScorer object.
import promptquality as pq

def example_scorer(row) -> float:
    # Executor: called once per row, returns that row's score.
    return len(row.response)

def aggregator(scores, indices) -> dict:
    # You can have multiple aggregate summaries for your metric.
    return {
        'Total Response Length': sum(scores),
        'Average Response Length': sum(scores) / len(scores),
    }

my_scorer = pq.CustomScorer(name='Response Length', executor=example_scorer, aggregator=aggregator)
To use your scorer, pass it through the scorers parameter of pq.run or pq.run_sweep:
pq.run(..., 'my_dataset.csv', scorers=[my_scorer])
For more docs on custom metrics, visit our promptquality docs.
Once you complete a run, your Guardrail or Custom Metric values will start appearing in the Galileo UI to help you evaluate your LLM responses.
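Before attaching a custom scorer to a run, it can help to exercise the executor and aggregator locally. The sketch below does that for the response-length example without calling Galileo at all. FakeRow is a hypothetical stand-in for the row object (only the response attribute is modelled), and passing row positions as indices is an assumption about what that argument contains.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeRow:
    # Hypothetical stand-in for the row passed to the executor;
    # only the `response` attribute is modelled here.
    response: str


def example_scorer(row) -> int:
    # Executor: one score per row.
    return len(row.response)


def aggregator(scores: List[int], indices: List[int]) -> Dict[str, float]:
    # Aggregator: summary values computed from all row-level scores.
    return {
        "Total Response Length": sum(scores),
        "Average Response Length": sum(scores) / len(scores),
    }


rows = [FakeRow("Hello"), FakeRow("Hi there"), FakeRow("")]
row_scores = [example_scorer(r) for r in rows]
# Assumption: indices are simply the positions of the scored rows.
summary = aggregator(row_scores, list(range(len(rows))))
print(row_scores, summary)
```

Note that this aggregator divides by len(scores), so it would raise ZeroDivisionError on an empty list; a guard for that case may be worth adding if empty runs are possible.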

Registered Scorers

We also support registering a scorer such that it can be reused across various runs, projects and users within your organization.

Creating Your Registered Scorer

To define a registered scorer, create a Python file that has at least 2 functions and follow the function signatures as described below:
  1. scorer_fn: The scorer function is provided the row-wise inputs and is expected to generate outputs for each response. The expected signature for this function is:
    def scorer_fn(*, index: Union[int, str], response: str, **kwargs: Any) -> Union[float, int, bool, str, None]:
    We support floating-point, integer, boolean and string outputs. We also recommend ensuring your scorer_fn accepts **kwargs so that your registered scorers are forward-compatible.
  2. aggregator_fn: The aggregator function takes in an array of the row-wise outputs from your scorer and allows you to generate aggregates from those. The expected signature for the aggregator function is:
    def aggregator_fn(*, scores: List[Union[float, int, bool, str, None]]) -> Dict[str, Union[float, int, bool, str, None]]:
    Return each aggregated value you want to output as a key-value pair, with the key as the label and the value as the aggregate.
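As an illustration of the two signatures above, here is a sketch of what a scorer file might contain. The email check, the regular expression, and the aggregate labels are invented for this example. The extra keyword argument in the local call is hypothetical and only shows why accepting **kwargs keeps the function forward-compatible.

```python
import re
from typing import Any, Dict, List, Optional, Union

# Deliberately loose pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scorer_fn(*, index: Union[int, str], response: str, **kwargs: Any) -> Optional[bool]:
    """Return True if the response contains something that looks like an email."""
    if not response:
        return None  # no score for an empty response
    return bool(EMAIL_PATTERN.search(response))


def aggregator_fn(*, scores: List[Optional[bool]]) -> Dict[str, Union[float, int, None]]:
    # Ignore rows that produced no score before aggregating.
    scored = [s for s in scores if s is not None]
    flagged = sum(scored)
    return {
        "Responses With Email": flagged,
        "Share With Email": flagged / len(scored) if scored else None,
    }


# Local dry run; `some_future_field` is a made-up extra input that
# scorer_fn absorbs through **kwargs without breaking.
responses = ["Contact me at jane@example.com", "No contact details here", ""]
scores = [
    scorer_fn(index=i, response=r, some_future_field="ignored")
    for i, r in enumerate(responses)
]
summary = aggregator_fn(scores=scores)
print(scores, summary)
```

Because both functions use keyword-only parameters (the leading *), they have to be called with named arguments, as in the dry run above.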

Registering Your Scorer

Once you've created your scorer file, you can register it with the name and the scorer file:
registered_scorer = pq.register_scorer(scorer_name="my-scorer", scorer_file="/path/to/scorer/")
The name you choose here will be the name with which the values for this scorer appear in the UI later.

Using Your Registered Scorer

To use your scorer during a prompt run (or sweep), simply pass it in alongside any of the other scorers:
pq.run(..., scorers=[registered_scorer])
If you created your registered scorer in a previous session, you can also pass in the name of the scorer instead of the object:
pq.run(..., scorers=["my-scorer"])

Execution Environment

Your scorer will be executed in a Python 3.9 environment. The Python libraries available for your use are:
If you are using an ML model to make predictions, please ensure it is <= 500MB in size and uses either scikit-learn or tensorflow. We recommend optimizing it by using the ONNX Runtime if it is a larger model.
Please note that we regularly update the minor and patch versions of these packages. Major version updates are infrequent but if a library is critical to your scorer, please let us know and we'll provide 1+ week of warning before updating the major versions for those.
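If your scorer ships with a model file, a quick local check against the size limit mentioned above can catch problems before you register it. This is a minimal sketch: the helper name model_fits_limit is hypothetical, and it assumes "500MB" means 500 * 1024 * 1024 bytes, which the text above does not specify.

```python
import os
import tempfile

# Assumption: "500MB" is interpreted as 500 MiB.
MAX_MODEL_BYTES = 500 * 1024 * 1024


def model_fits_limit(path: str, limit: int = MAX_MODEL_BYTES) -> bool:
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit


# Demo with a 1 KiB temporary file standing in for a serialized model.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 1024)
    demo_path = tmp.name

fits = model_fits_limit(demo_path)
too_big = not model_fits_limit(demo_path, limit=512)
os.remove(demo_path)
print(fits, too_big)
```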


For the same example scorer that we created using Custom Scorer for response lengths, here's its Registered Scorer equivalent.
  1. Create a file:
    from typing import Dict, List, Union

    def scorer_fn(*, response: str, **kwargs) -> int:
        return len(response)

    def aggregator_fn(*, scores: List[int]) -> Dict[str, Union[int, float]]:
        return {
            "Total Response Length": sum(scores),
            "Average Response Length": sum(scores) / len(scores),
        }
  2. Register the scorer:
    pq.register_scorer("response_length", "")
  3. Use the scorer in your prompt run:
    pq.run(..., scorers=["response_length"])