Zero-Shot Prompting
Clone this notebook to create this run in your Galileo cluster: https://colab.research.google.com/drive/1LfFEe8MlZuKU_a41z8SEH0to4OOfrWnO
In this example, we will demonstrate how to integrate a topic detection model into a Galileo run through a Galileo CustomMetric.
Setup: Install Library and Set Up Variables
We will use promptquality, the Python client to interact with Galileo’s GenAI Studio: Evaluate.
! pip install promptquality
Next, we will set our Galileo cluster URL, API key, and project name to define where we want to log our results.
import os
import promptquality as pq
from google.colab import userdata
### Set variables and env variables ###
os.environ['GALILEO_API_KEY'] = GALILEO_API_KEY = userdata.get('GALILEO_API_KEY_DEMO')
os.environ['GALILEO_CONSOLE_URL'] = GALILEO_CONSOLE_URL = 'https://console.demo.rungalileo.io/'
GALILEO_PROJECT_NAME = 'hotpotqa_topicdetection'
# 🔭🌕 Logging in to the console
config = pq.login(os.environ['GALILEO_CONSOLE_URL'])
Construct Dataset: Subsample of HotpotQA
We will be using (a subsample of) HotpotQA, a public Q&A dataset with questions, contexts, and ground-truth aliases. HotpotQA has easy, medium, and hard questions that are challenging even for the most modern LLM releases.
In lieu of evaluating model responses against the ground truths, we can leverage Galileo’s metrics to gauge hallucinations.
import urllib.request
import pandas
import json
def parse_context(context):
    parsed_context = ""
    for item in context:
        title = item[0]
        contents = " ".join(item[1])
        parsed_context += f"{title}: {contents}\n"
    return parsed_context.strip()
url = 'http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_dev_fullwiki_v1.json'
with urllib.request.urlopen(url) as urlo:
json_data = json.load(urlo)
data = pandas.DataFrame(json_data)
data['parsed_context'] = data['context'].apply(parse_context)
dataset = {
    'question': data['question'].iloc[0:50].tolist(),
    'context': data['parsed_context'].iloc[0:50].tolist()
}
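For reference, each HotpotQA context entry is a list of [title, [sentences]] pairs. The short check below is purely illustrative (the sentences are made up) and shows what parse_context produces for one such entry.
# Illustrative only: a tiny, made-up context in the HotpotQA shape
# (a list of [title, [sentence, ...]] pairs) to show parse_context's output.
example_context = [
    ["Scott Derrickson", ["Scott Derrickson is an American film director."]],
    ["Ed Wood", ["Ed Wood was an American filmmaker."]],
]
print(parse_context(example_context))
# Scott Derrickson: Scott Derrickson is an American film director.
# Ed Wood: Ed Wood was an American filmmaker.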
Define our Classification Pipeline
We will use a bart-large checkpoint that has been trained on MultiNLI (MNLI), a dataset of sentence pairs annotated with textual entailment information. This makes it ideal as an off-the-shelf zero-shot topic classification model.
from transformers import pipeline
# HuggingFace will expect an environment variable 'HF_TOKEN' to download the model
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
pipe = pipeline(model="facebook/bart-large-mnli")
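As a quick sanity check (illustrative, using a made-up sentence), the zero-shot classification pipeline returns the candidate labels sorted by score along with the scores themselves:
# Illustrative sanity check: the pipeline returns a dict with the input
# 'sequence', 'labels' (sorted by score, highest first), and 'scores'.
example = pipe(
    "The team clinched the championship in overtime.",
    candidate_labels=["sports", "music", "science", "history", "technology"],
)
print(example["labels"][0], example["scores"][0])  # expected: "sports" plus its confidence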
Implementing our Pipeline as a Galileo CustomMetric
We will define a small space of candidate labels for this zero-shot topic detection task.
We also define an executor and an aggregator function. The executor is a row-level calculator, while the aggregator consolidates all of the calculated row values.
In this example, we want to publish both the top topic and its score, so the executor will serialize the JSON into a string. The aggregator will then evaluate that string to parse out the numeric label score for aggregation.
When we invoke our run, the executor and aggregator are computed within your Python runtime / notebook / application.
import ast
candidate_labels=["sports", "music", "science", "history", "technology"]
# The executor function is a row-level calculation function.
def executor_topicdetect(row) -> str:
    pipe_out = pipe(row.response, candidate_labels=candidate_labels)
    return json.dumps({'top_label': pipe_out['labels'][0], 'top_score': pipe_out['scores'][0]})

# The aggregator function takes the row-level calculations of all rows and performs some kind of aggregation (e.g. mean, median, P95).
def aggregator_topicdetect(scores, indices) -> dict:
    scores_parse = [float(ast.literal_eval(score)['top_score']) for score in scores]
    return {'Average Topic Score (Top Label)': sum(scores_parse) / len(scores_parse)}
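Before wiring these into a run, you can exercise them locally. Below is a minimal sketch, assuming only that each row object exposes the model response as a .response attribute (as the executor above expects).
# Illustrative local check: call the executor and aggregator directly on a
# mock row object instead of waiting for a Galileo run.
from types import SimpleNamespace

mock_rows = [SimpleNamespace(response="The orchestra premiered a new symphony last night.")]
row_scores = [executor_topicdetect(row) for row in mock_rows]
print(row_scores[0])  # JSON string with 'top_label' and 'top_score'
print(aggregator_topicdetect(row_scores, indices=list(range(len(row_scores)))))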
Galileo Evaluate
Finally, we will define the metrics we are interested in (Galileo’s built-in metrics) as well as our CustomMetric (Galileo’s CustomScorer class, which takes our executor and aggregator as inputs).
We will also define our prompt template with the placeholders {context} and {question}. These will be replaced by the values with the same keys in our dataset.
metrics = [
    pq.Scorers.context_adherence,
    pq.Scorers.correctness,
    pq.Scorers.latency,
    pq.Scorers.tone,
    pq.Scorers.sexist,
    pq.Scorers.pii,
    pq.Scorers.prompt_perplexity,
    pq.CustomScorer(name='Top Topic', executor=executor_topicdetect, aggregator=aggregator_topicdetect)
]
template = """
You are a knowledgeable assistant capable of answering a wide range of questions accurately and clearly. Given the following context, provide detailed and informative answers.
Context:
{context}
Question: {question}
"""
# Run our dataset
pq.run(project_name=GALILEO_PROJECT_NAME,
       template=template,
       dataset=dataset,
       scorers=metrics,
       settings=pq.Settings(model_alias=pq.SupportedModels.chat_gpt))
The run() execution will return a URL where you can inspect your run in the Galileo Evaluate UI.
In the run view below, you can see that the UI publishes both the row-level and aggregate calculations from our CustomMetric.