Prompt
Galileo Prompt is a powerful tool to help you experiment with and systematically evaluate prompts in order to find the best combination of prompt template, model, and parameters for your generative AI application.
- Prompt Templating and Versioning - One place to create, manage, and track all versions of your templates.
- Evaluation - Evaluate your prompts and mitigate hallucinations using Galileo's Guardrail Metrics.
- Experiment Tracking - Compare different runs and choose the best one based on your metrics.

Create a Prompt Run to evaluate the model's response to a template and a number of inputs. You can create runs through Galileo's Python library `promptquality` or through the Prompt Inspector UI. Below is a walkthrough of how to create a run through the UI:
Choosing your Template, Model, and Tune Settings
The first thing you do is choose a Model and Tune Settings. The Prompt Inspector UI lets you query popular LLM APIs. For custom or self-hosted LLMs, you need to use the Python client.
Then, you select a template. You can create a new template or load an existing one. All of your templates and their versions are tracked and can be updated from here. To add a variable slot to your template, wrap it in curly brackets (e.g. {topic}). You can upload a CSV file with a list of values for your slots, or manually enter values through the DATA section.
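For example, a template with two slots pairs with a CSV whose column headers match the slot names. Below is a minimal sketch; the slot names, values, and file name are illustrative, not part of the product:

```python
# Illustrative only: a two-slot template and the matching CSV upload.
# Column headers are assumed to correspond to the slot names in curly brackets.
template = "Write a {tone} explanation of {topic}."

csv_contents = """tone,topic
formal,Quantum Physics
playful,Large Language Models
"""

# Save the file, then upload it in the DATA section of the Prompt Inspector UI.
with open("prompt_inputs.csv", "w") as f:
    f.write(csv_contents)
```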

Galileo has built a menu of Guardrail Metrics for you to choose from. These metrics are tailored to your use case and are designed to help you evaluate your prompts and models. Galileo's Guardrail Metrics combine industry-standard metrics (e.g. BLEU, ROUGE-1, Perplexity) with metrics developed by Galileo's in-house ML Research Team (e.g. Uncertainty, Factuality, Groundedness).

Below is the list of metrics we currently support:
- Uncertainty
- Perplexity
- Groundedness
- Factuality
- Context Relevance
- QA Relevance
- Tone
- PII
- BLEU
- ROUGE-1
- More coming very soon.
The same set of Guardrail Metrics can be used to monitor your LLM App once it's in production. See LLM Monitor for more details.
Alternatively, you can use `promptquality` to create runs from your Python notebook. After running `pip install promptquality`, you can create a prompt run as follows:

```python
import promptquality as pq

pq.login({YOUR_GALILEO_URL})  # replace with your Galileo console URL

template = "Explain {topic} to me like I'm a 5 year old"
data = {"topic": ["Quantum Physics", "Politics", "Large Language Models"]}

pq.run(project_name='my_first_project',
       template=template,
       dataset=data,
       settings=pq.Settings(model_alias='ChatGPT (16K context)',
                            temperature=0.8,
                            max_tokens=400))
```
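To have your Guardrail Metrics computed as part of the run, you can also pass a list of scorers. The snippet below extends the run above and is only a sketch: it assumes your installed version of promptquality exposes a scorers argument on pq.run and a Scorers enum with these member names, so check the library reference for the exact spelling:

```python
# A sketch, assuming pq.run accepts a `scorers` list and promptquality
# exposes a Scorers enum with these members; names may differ by version.
# `template` and `data` are the variables defined in the snippet above.
pq.run(project_name='my_first_project',
       template=template,
       dataset=data,
       scorers=[pq.Scorers.uncertainty,     # low-confidence generations
                pq.Scorers.factuality,      # closed-book factual accuracy
                pq.Scorers.groundedness],   # adherence to supplied context
       settings=pq.Settings(model_alias='ChatGPT (16K context)',
                            temperature=0.8,
                            max_tokens=400))
```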
Get a single-page view to easily compare metrics across different runs and choose the best configuration. This makes the prompt engineering process scientific and systematic.

Using a combination of the Guardrail Metrics chosen earlier and manual inspection, you can compare your Prompt Runs and find the best Prompt-Model-Settings combination for your use case. Galileo automatically suggests one, marking it with a crown icon. You can change the crowned run as you see fit.
Hallucination means different things in different use cases and for different people. In closed-book question answering, hallucinations may pertain to the accuracy of information, or Factuality (i.e., whether all the facts in the response are factually correct). In open-book scenarios, hallucinations are more likely linked to the grounding of information, or Groundedness (i.e., whether the facts in the response align with the documents or context supplied to the model). Generally, hallucinations happen when a model is forced to generate a response despite lacking substantial knowledge of, or confidence in, its answer (Uncertainty).
Galileo offers a comprehensive suite of Hallucination Metrics to help you identify and measure hallucinations. Additionally, Galileo's Guardrail Metrics are built to help you shed light on why and where the model may be struggling.
Enterprise users often have their own unique interpretations of what constitutes hallucinations. Galileo supports Custom Metrics and incorporates Human Feedback and Ratings, empowering you to tailor Galileo Prompt to align with your specific needs and the particular definition of hallucinations relevant to your use case.
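As an illustration, a custom metric can be as simple as a Python function that scores each response. The sketch below is hypothetical and is not the Custom Metrics registration API itself; see the Custom Metrics docs for how to wire such a function into Galileo:

```python
# Hypothetical per-response metric: the share of sentences containing a
# hedging phrase, which one team might treat as a hallucination signal.
# How you register this with Galileo depends on the Custom Metrics API.
HEDGES = ("i think", "probably", "as far as i know", "i'm not sure")

def hedging_rate(response: str) -> float:
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    if not sentences:
        return 0.0
    hedged = sum(any(h in s.lower() for h in HEDGES) for s in sentences)
    return hedged / len(sentences)
```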
With Galileo's Experimentation and Evaluation features, you can systematically iterate on your prompts and models, ensuring a rigorous and scientific approach to improving the quality of responses and addressing hallucination-related challenges.