Evaluate Prompts, Chains, and RAG Systems

Galileo Evaluate is a powerful workbench for rapid, collaborative experimentation across LLMs, Prompts, RAG parameters, Vector Context, Retrievers, Chains/Agents, and more. Over time, Galileo becomes your team's centralized Prompt Store with automatic version control.

Core features

  • Prompt Templating and Versioning - One place to create, manage, and track all versions of your templates
  • Evaluation - Evaluate your prompts and mitigate hallucinations using Galileo's Guardrail Metrics.
  • Experiment tracking - Compare runs side by side and choose the best configuration based on your metrics.

How to get started

Creating a Prompt Run

Create a Prompt Run to evaluate the model's responses to a template across a set of inputs. You can create runs through Galileo's Python library promptquality or through the Evaluate UI.
Below is a walkthrough of how to create a run through the UI:
Choosing your Template, Model, and Tune Settings
The first thing you do is choose a Model and Tune Settings. The Evaluate UI lets you query popular LLM APIs. For custom or self-hosted LLMs, you need to use the Python client.
Then, you select a template. You can create a new template or load an existing one. All of your templates and their versions are tracked and can be updated from here. To add a variable slot to your template, wrap it in curly brackets (e.g. {topic}). You can upload a CSV file with a list of values for your slots, or manually enter values through the DATA section.
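Conceptually, a template with slots is expanded against each row of input values, which is what happens when you upload a CSV or fill in the DATA section. A minimal Python sketch of that expansion (illustrative only, not Galileo's internal code):

```python
# Illustrative sketch of template-slot expansion, not Galileo's internal code.
template = "Explain {topic} to me like I'm a 5 year old"

# Each CSV row (or DATA section entry) supplies one value per slot.
rows = [{"topic": "Quantum Physics"}, {"topic": "Politics"}]

# One fully rendered prompt per input row.
prompts = [template.format(**row) for row in rows]

print(prompts[0])  # Explain Quantum Physics to me like I'm a 5 year old
```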

Selecting metrics

Galileo has built a menu of Guardrail Metrics for you to choose from. These metrics are tailored to your use case and are designed to help you evaluate your prompts and models. Galileo's Guardrail Metrics are a combination of industry-standard metrics (e.g. BLEU, ROUGE-1, Perplexity) and metrics developed by Galileo's in-house ML Research Team (e.g. Uncertainty, Factuality, Groundedness).
Below is the list of metrics we currently support:
  • Uncertainty
  • Perplexity
  • Context Adherence
  • Completeness
  • Chunk Attribution
  • Chunk Utilization
  • Context Relevance
  • Correctness
  • Tone
  • PII
  • BLEU
  • ROUGE-1
  • More coming very soon.
The same set of Guardrail Metrics can be used to monitor your LLM App once it's in production. See Galileo Observe for more details.
For more information on selecting metrics through the promptquality library see Choosing your metrics. For more information on creating custom metrics see Registering and Using Custom Metrics.
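To make one of the industry-standard metrics above concrete, ROUGE-1 recall measures the fraction of reference unigrams that also appear in the candidate response. A toy sketch (deliberately simplified; production implementations handle tokenization, stemming, and casing more carefully):

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    # Count unigrams in both texts (naive whitespace tokenization).
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Overlap is clipped: each reference token counts at most as often
    # as it appears in the candidate.
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

# 3 of the 6 reference unigrams appear in the candidate.
print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 0.5
```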

Creating Runs Programmatically

Alternatively, you can use promptquality to create runs from your Python notebook. After running pip install promptquality, you can create Prompt Runs programmatically. Upon execution, you will be prompted to log into {YOUR_GALILEO_URL} using your API key.
import promptquality as pq

template = "Explain {topic} to me like I'm a 5 year old"
data = {"topic": ["Quantum Physics", "Politics", "Large Language Models"]}
settings = pq.Settings(model_alias='ChatGPT (16K context)')
pq.run(template=template, dataset=data, settings=settings)
The expected output for this request should look like this:
Go to {YOUR_GALILEO_URL} to generate a new API Key 🔐
Enter your API Key: {YOUR_KEY}
👋 You have logged into 🔭 Galileo ({YOUR_GALILEO_URL}) as {YOUR_USERNAME}.
Prompt run complete!: 100% x/x [00:09<00:00, 1.19s/it]
🔭 View your prompt run on the Galileo console at: {SOME_URL}
PromptMetrics(total_responses=x, average_hallucination=y, ...)

Compare runs

Get a single-page view to easily compare metrics across different runs and choose the best configuration. This makes the prompt engineering process scientific and systematic.
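Under the hood, choosing the best configuration amounts to ranking runs by the metric you care about. A sketch of that comparison (the run data and metric values here are made up for illustration, not pulled from the Galileo API):

```python
# Hypothetical run summaries; in practice these come from your Evaluate runs.
runs = [
    {"name": "run-a", "context_adherence": 0.81},
    {"name": "run-b", "context_adherence": 0.93},
    {"name": "run-c", "context_adherence": 0.88},
]

# Pick the run that maximizes the metric of interest.
best = max(runs, key=lambda r: r["context_adherence"])
print(best["name"])  # run-b
```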

Evaluating runs

Using a combination of the Guardrail Metrics chosen earlier and manual inspection, you can compare your Prompt Runs and find the best Prompt-Model-Settings combination for your use case. Galileo automatically suggests a winner, marked with a crown icon. You can change the crowned run as you see fit.

How can Evaluate help you fix Hallucinations?

Hallucination means different things in different use cases and to different people. In closed-book question answering, hallucination may refer to the accuracy of information, or Correctness (i.e., whether all the facts in the response are indeed factually correct). In open-book scenarios, hallucination is more often tied to the grounding of information, or Context Adherence (i.e., whether the facts presented in the response align with the documents or context supplied to the model). In general, hallucinations happen when a model is forced to generate a response despite lacking substantial knowledge of, or confidence in, the answer (Uncertainty).
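The intuition behind uncertainty-style metrics is that when a model assigns low probability to the tokens it emits, it is, in effect, guessing. The sketch below computes perplexity from token log probabilities; it illustrates the general idea only and is not Galileo's actual Uncertainty metric:

```python
import math

def perplexity(token_logprobs):
    # Perplexity is exp of the negative mean token log probability:
    # higher values mean the model was less confident in its own output.
    avg = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-avg)

confident = [-0.1, -0.2, -0.1]   # high-probability tokens
uncertain = [-2.5, -3.0, -2.0]   # low-probability tokens

print(perplexity(confident) < perplexity(uncertain))  # True
```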
Galileo offers a comprehensive suite of Hallucination Metrics to help you identify and measure hallucinations. Additionally, Galileo's Guardrail Metrics are built to help you shed light on why and where the model may be struggling.
Enterprise users often have their own unique interpretations of what constitutes hallucinations. Galileo supports Custom Metrics and incorporates Human Feedback and Ratings, empowering you to tailor Galileo Prompt to align with your specific needs and the particular definition of hallucinations relevant to your use case.
With Galileo's Experimentation and Evaluation features, you can systematically iterate on your prompts and models, ensuring a rigorous and scientific approach to improving the quality of responses and addressing hallucination-related challenges.