Registering and Using Custom Metrics

Registered Metrics let your team define custom metrics (programmatic or GPT-based) for your Observe projects.

Creating Your Registered Scorer

To define a registered scorer, create a Python file that defines the following functions with the signatures described below:

  1. scorer_fn: The scorer function is provided the row-wise inputs and is expected to generate outputs for each response. The expected signature for this function is:

    def scorer_fn(*, index: Union[int, str], response: str, **kwargs: Any) -> Union[float, int, bool, str, None]:
        ...

    We support floating-point, integer, boolean, and string outputs. Your scorer_fn must accept **kwargs as the last parameter so that your registered scorer is forward-compatible.

  2. aggregator_fn: The aggregator function takes in an array of the row-wise outputs from your scorer and allows you to generate aggregates from those. The expected signature for the aggregator function is:

    def aggregator_fn(*, scores: List[Union[float, int, bool, str, None]]) -> Dict[str, Union[float, int, bool, str, None]]:
        ...

    Return each aggregated value that you want your scorer to output as a key-value pair, where the key is the label and the value is the aggregate.

  3. (Optional, but recommended) score_type: The score_type function defines the type of the score that your scorer generates. The expected signature for this function is:

    def score_type() -> Type[float] | Type[int] | Type[str] | Type[bool]:
        ...

    Note that the function returns a type object such as float, not a value of that type. Defining this function is necessary for sorting and filtering by scores to work correctly. If you don't define this function, the scorer is assumed to generate float scores by default.

  4. (Optional) scoreable_node_types_fn: If you want to restrict your scorer to only run on specific node types, you can define this function which returns a list of node types that your scorer should run on. The expected signature for this function is:

    def scoreable_node_types_fn() -> List[str]:
        ...

    If you don't define this function, your scorer will run on llm and chat nodes by default.
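
As an illustration of how the four functions fit together, here is a minimal sketch of a boolean scorer file. The marker phrase and the metric labels are hypothetical and chosen only for this example; the function names and signatures follow the descriptions above.

```python
from typing import Any, Dict, List, Optional, Type, Union

# Hypothetical phrase to look for; not part of the Galileo API.
REFUSAL_MARKER = "i cannot help"


def scorer_fn(*, index: Union[int, str], response: str, **kwargs: Any) -> bool:
    # Row-wise score: True when the response contains the marker phrase.
    return REFUSAL_MARKER in response.lower()


def aggregator_fn(*, scores: List[Optional[bool]]) -> Dict[str, Union[int, float, None]]:
    # Skip rows that produced no score before aggregating.
    valid = [s for s in scores if s is not None]
    flagged = sum(valid)  # True counts as 1
    return {
        "Refusal Count": flagged,
        "Refusal Rate": flagged / len(valid) if valid else None,
    }


def score_type() -> Type[bool]:
    # Return the type object itself, not a value such as True.
    return bool


def scoreable_node_types_fn() -> List[str]:
    # Restrict scoring to llm nodes only.
    return ["llm"]
```

Because score_type returns bool rather than relying on the float default, sorting and filtering can treat the column as true/false values.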

Registering Your Scorer

Once you've created your scorer file, register it under a name of your choice using our Python package, promptquality:

import promptquality as pq

pq.login({YOUR_GALILEO_URL})
registered_scorer = pq.register_scorer(scorer_name="My Scorer", scorer_file="/path/to/scorer/file.py")
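
Before registering, it may help to check locally that the file defines the required functions. The sketch below uses only the Python standard library; check_scorer_file is a hypothetical helper, not part of promptquality, and it checks only the requirements stated in this guide.

```python
import importlib.util
import inspect
from typing import List


def check_scorer_file(path: str) -> List[str]:
    """Return a list of problems found in a scorer file (empty if none)."""
    spec = importlib.util.spec_from_file_location("candidate_scorer", path)
    if spec is None or spec.loader is None:
        return [f"could not load {path}"]
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code

    problems = []
    # scorer_fn and aggregator_fn are the two functions not marked optional.
    for name in ("scorer_fn", "aggregator_fn"):
        if not callable(getattr(module, name, None)):
            problems.append(f"missing required function: {name}")

    # scorer_fn must accept **kwargs for forward compatibility.
    scorer = getattr(module, "scorer_fn", None)
    if callable(scorer):
        params = inspect.signature(scorer).parameters.values()
        if not any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
            problems.append("scorer_fn must accept **kwargs")
    return problems
```

Calling check_scorer_file("/path/to/scorer/file.py") returns an empty list when these checks pass. It does not verify anything about how the platform itself loads or runs the file.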

Execution Environment

Your scorer will be executed in a Python 3.10 environment. The Python libraries available for your use are:

numpy~=1.26.4
pandas~=2.2.2
pydantic~=2.7.1
scikit-learn~=1.4.2
tensorflow~=2.16.1

Please note that we regularly update the minor and patch versions of these packages. Major version updates are infrequent, but if a library is critical to your scorer, please let us know and we'll give you at least one week's notice before updating its major version.
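
To develop a scorer against similar library versions, you could compare your local installation with the pins listed above. This is a rough sketch using only the standard library: is_compatible and local_report are hypothetical helpers, the check follows the usual meaning of the ~= (compatible release) operator for three-part versions, and pre-release suffixes are ignored.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins copied from the list above; the hosted environment may be newer.
DOCUMENTED_PINS: Dict[str, str] = {
    "numpy": "1.26.4",
    "pandas": "2.2.2",
    "pydantic": "2.7.1",
    "scikit-learn": "1.4.2",
    "tensorflow": "2.16.1",
}


def release_tuple(version: str) -> Tuple[int, ...]:
    # Keep only the leading numeric segments, e.g. "2.2.2" -> (2, 2, 2).
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_compatible(local: str, pinned: str) -> bool:
    # "~=X.Y.Z" accepts versions >= X.Y.Z that share the X.Y prefix.
    have, want = release_tuple(local), release_tuple(pinned)
    prefix = want[:-1]
    return have >= want and have[: len(prefix)] == prefix


def local_report() -> Dict[str, Optional[bool]]:
    # None means the package is not installed locally.
    report: Dict[str, Optional[bool]] = {}
    for name, pinned in DOCUMENTED_PINS.items():
        try:
            report[name] = is_compatible(metadata.version(name), pinned)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

A False entry only means the local version falls outside the documented pin; it says nothing definite about the version currently deployed.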

The scorer_name you pass to pq.register_scorer is the name under which this scorer's values appear in the UI.

Using Your Registered Scorer

All your Registered Scorers are listed under the Custom Metrics section of your Project Settings. Use the On/Off switch to turn them on or off.

When your metrics are on, your registered scorer will be executed on new samples that get logged to Galileo Observe (Note: scorers don't run retroactively, so past samples will not be scored). For each added Scorer, you'll see a new column in your Data view.

Example

Here is the Registered Scorer equivalent of the response-length example that we created using Custom Scorer.

  1. Create a scorer.py file:

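from typing import Any, Dict, List, Type, Union


def scorer_fn(*, response: str, **kwargs: Any) -> int:
    # Row-wise score: the number of characters in the response.
    return len(response)


def aggregator_fn(*, scores: List[int]) -> Dict[str, Union[int, float]]:
    # Guard against an empty list so the average never divides by zero.
    total = sum(scores)
    return {
        "Total Response Length": total,
        "Average Response Length": total / len(scores) if scores else 0.0,
    }


def score_type() -> Type[int]:
    # Return the type object itself so scores sort and filter as integers.
    return int


def scoreable_node_types_fn() -> List[str]:
    # The same node types the scorer runs on by default.
    return ["llm", "chat"]

  2. Register the scorer:

    import promptquality as pq

    pq.register_scorer(scorer_name="response_length", scorer_file="scorer.py")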
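
To sanity-check the response-length scorer before registering it, you can call its functions locally. The sketch below repeats the two functions so that it runs on its own. The sample rows and the extra node_type keyword are hypothetical, and the way Galileo actually invokes the functions may differ.

```python
from typing import Any, Dict, List, Union


def scorer_fn(*, response: str, **kwargs: Any) -> int:
    return len(response)


def aggregator_fn(*, scores: List[int]) -> Dict[str, Union[int, float]]:
    total = sum(scores)
    return {
        "Total Response Length": total,
        "Average Response Length": total / len(scores) if scores else 0.0,
    }


# Hypothetical rows; the keys mirror the keyword arguments named in this guide.
rows = [
    {"index": 0, "response": "Hello"},
    {"index": 1, "response": "Hi there"},
]

# Pass an extra keyword to confirm that **kwargs absorbs arguments the scorer ignores.
scores = [scorer_fn(**row, node_type="llm") for row in rows]
summary = aggregator_fn(scores=scores)

print(scores)   # [5, 8]
print(summary)  # {'Total Response Length': 13, 'Average Response Length': 6.5}
```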
