Galileo Evaluate®: Rapid Evaluation of Prompts, Chains, and RAG Systems

Galileo Evaluate is a powerful workbench for rapid, collaborative experimentation across LLMs, Prompts, RAG parameters, Vector Context, Retrievers, Chains/Agents, and more. Over time, Galileo becomes your team's centralized Prompt Store with automatic version control.

Core features

  • Prompt Templating and Versioning - One place to create, manage, and track all versions of your templates

  • Evaluation - Evaluate your prompts and mitigate hallucinations using Galileo's Guardrail Metrics.

  • Experiment tracking - Compare different runs and choose the best configuration based on your metrics.

Using the Prompt Playground

Creating a Prompt Run

Create a Prompt Run to evaluate the model's response to a template and a number of inputs. You can create runs through Galileo's Python library promptquality or through the Evaluate UI.

Below is a walkthrough of how to create a run through the UI:

Choosing your Template, Model, and Tune Settings

First, choose a Model and Tune Settings. The Evaluate UI lets you query popular LLM APIs; for custom or self-hosted LLMs, use the Python client.

Then, select a template. You can create a new template or load an existing one. All of your templates and their versions are tracked and can be updated from here. To add a variable slot to your template, wrap it in curly brackets (e.g. "{topic}"). You can upload a CSV file with a list of values for your slots, or manually enter values through the DATA section.
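Slot filling behaves like standard Python string formatting: each row of values expands the template into one prompt. A minimal plain-Python sketch (illustrative values, not the Galileo client):

```python
# Sketch of how a template's {slot} variables expand into prompts.
# Plain Python string formatting -- not the Galileo client itself.
template = "Explain {topic} to me like I'm a 5 year old"

# One value per variable slot, as you would enter in the DATA section
# or upload via CSV (hypothetical values for illustration).
rows = [{"topic": "Quantum Physics"}, {"topic": "Politics"}]

prompts = [template.format(**row) for row in rows]
print(prompts[0])  # Explain Quantum Physics to me like I'm a 5 year old
```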

Selecting metrics

Galileo has built a menu of Guardrail Metrics for you to choose from. These metrics are tailored to your use case and are designed to help you evaluate your prompts and models. Galileo's Guardrail Metrics are a combination of industry-standard metrics (e.g. BLEU, ROUGE-1, Perplexity) and an outcome of Galileo's in-house ML Research Team (e.g. Uncertainty, Factuality, Groundedness).
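To make the industry-standard side concrete, here is a simplified ROUGE-1 recall computation: the fraction of reference unigrams that also appear in the candidate response. This is only an illustration of what the metric measures, not Galileo's implementation:

```python
# Simplified ROUGE-1 recall: fraction of reference unigrams that also
# appear in the candidate. Illustrative only -- not Galileo's implementation.
def rouge1_recall(reference: str, candidate: str) -> float:
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for tok in ref_tokens if tok in cand_tokens)
    return overlap / len(ref_tokens)

# "sat" is the only reference token missing from the candidate: 5/6.
score = rouge1_recall("the cat sat on the mat", "the cat lay on the mat")
```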

For the list of metrics and their definitions, see the Guardrail Store section.

The same set of Guardrail Metrics can be used to monitor your LLM App once it's in production. See Galileo Observe for more details.

For more information on selecting metrics through the promptquality library see Choosing your metrics. For more information on creating custom metrics see Registering and Using Custom Metrics.
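Conceptually, a custom metric is a per-response scoring function plus an aggregation over the run. The sketch below shows that shape only; the function names are hypothetical, and the actual registration API is described in Registering and Using Custom Metrics:

```python
# Hypothetical shape of a custom metric: score each response, then
# aggregate into a run-level number. Names are illustrative; see
# "Registering and Using Custom Metrics" for the actual API.
def hedging_score(response: str) -> float:
    """1.0 if the model answers directly, 0.0 if it hedges."""
    hedges = ("i don't know", "i'm not sure", "i cannot")
    return 0.0 if any(h in response.lower() for h in hedges) else 1.0

def aggregate(scores: list[float]) -> float:
    """Average the per-response scores into a run-level metric."""
    return sum(scores) / len(scores) if scores else 0.0

run_score = aggregate([hedging_score(r) for r in
                       ["Paris is the capital of France.",
                        "I'm not sure about that."]])
```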

Creating Runs Programmatically

Alternatively, you can use promptquality to create runs from a Python notebook. After running pip install promptquality, you can create prompt runs. On execution, you will be prompted to log into your environment ({YOUR_GALILEO_URL}) using your API key.

import promptquality as pq


template = "Explain {topic} to me like I'm a 5 year old"

data = {"topic": ["Quantum Physics", "Politics", "Large Language Models"]}

# Create the run; Galileo prompts for your API key on first use.
pq.run(project_name='my_first_project',
       template=template,
       dataset=data,
       settings=pq.Settings(model_alias='ChatGPT (16K context)'))

The expected output looks like this:

Go to {YOUR_GALILEO_URL} to generate a new API Key 🔐 
Enter your API Key: {YOUR_KEY} 
👋 You have logged into 🔭 Galileo ({YOUR_GALILEO_URL}) as {YOUR_USERNAME}.
Prompt run complete!: 100% x/x [00:09<00:00, 1.19s/it]
🔭 View your prompt run on the Galileo console at: {SOME_URL}
PromptMetrics(total_responses=x, average_hallucination=y, ...)

Compare runs

Get a single-page view to easily compare metrics across different runs and choose the best configuration. This makes the prompt engineering process scientific and systematic.

Evaluating runs

Using a combination of the Guardrail Metrics chosen earlier and manual inspection, you can compare your Prompt Runs and find the best Prompt-Model-Settings combination for your use case. Galileo automatically suggests one with a crown icon. You can change the crowned run as you see fit.

Fix Your LLM Hallucinations

Hallucination means different things in different settings. In closed-book question answering, hallucinations concern the factual accuracy of the response: whether every fact presented is actually correct. In open-book (retrieval-augmented) scenarios, hallucinations concern how well the response aligns with the given context or the information in the retrieved documents. In essence, hallucinations occur when a model generates a response without substantial knowledge of, or confidence in, its accuracy.
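The open-book notion can be made concrete with a toy groundedness check: what fraction of the response is supported by the retrieved context? This is a deliberately naive token-overlap sketch, not Galileo's Groundedness metric:

```python
# Toy groundedness check for the open-book case: the fraction of the
# response's tokens that appear in the retrieved context. Deliberately
# naive -- Galileo's Groundedness metric is far more sophisticated.
def toy_groundedness(response: str, context: str) -> float:
    resp_tokens = response.lower().split()
    ctx_tokens = set(context.lower().split())
    if not resp_tokens:
        return 0.0
    supported = sum(1 for tok in resp_tokens if tok in ctx_tokens)
    return supported / len(resp_tokens)

context = "the eiffel tower is 330 metres tall"
grounded = toy_groundedness("the tower is 330 metres tall", context)
ungrounded = toy_groundedness("the tower opened in 1889", context)
```

A fully supported response scores 1.0, while one that introduces facts absent from the context scores lower, which is the intuition behind open-book hallucination metrics.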

To help you identify and quantify hallucinations, Galileo offers a comprehensive set of Hallucination Metrics. Additionally, Galileo's Guardrail Metrics provide insight into where the model may struggle and why.

Enterprise users often have their own definitions of what counts as a hallucination. Galileo accommodates this through Custom Metrics, and it incorporates Human Feedback and Ratings, letting you tailor Galileo to your specific requirements and your use case's particular definition of hallucination.

Galileo's Experimentation and Evaluation features enable you to systematically iterate on your prompts and models. This ensures a rigorous and scientific approach towards enhancing the quality of responses and effectively addressing challenges related to hallucinations.

In summary:

  • Hallucinations can have different meanings depending on the situation.

  • In closed-book question answering, hallucinations pertain to the accuracy and correctness of the facts provided in a response.

  • In open-book scenarios, hallucinations relate to the coherence of information and how well the facts align with the given context or documents.

  • Hallucinations occur when models lack sufficient knowledge or confidence, resulting in uncertain responses.

  • Galileo offers a comprehensive set of Hallucination Metrics to identify and measure hallucinations.

  • Galileo's Guardrail Metrics illuminate areas where the model may encounter difficulties, providing insights into the underlying reasons for hallucinations.

  • Enterprise users have unique interpretations of hallucinations, and Galileo supports Custom Metrics to incorporate their specific definitions and requirements.

  • Galileo enables the integration of Human Feedback and Ratings to customize the Prompt, aligning with the user's particular definition of hallucinations.

  • Galileo's Experimentation and Evaluation features facilitate a systematic approach to iterate on prompts and models, ensuring a rigorous and scientific process to tackle hallucination challenges and enhance response quality.
