Identifying Hallucinations
How to use Galileo Prompt Inspector to find Hallucinations
Hallucination can have many definitions. In the realm of closed-book question answering, hallucinations may pertain to the accuracy of information, or Factuality (i.e., whether all the facts in the response are indeed factually correct). In open-book scenarios, hallucinations might be linked to the grounding of information, or Groundedness (i.e., whether the facts presented in the response align with the documents or context supplied to the model). Generally, hallucinations happen when a model is forced to generate a response despite lacking sufficient knowledge of, or confidence in, that response (Uncertainty).
Regardless of your use case and definition of hallucination, Galileo aims to help you identify and resolve hallucinations.
Galileo offers a comprehensive suite of Hallucination Metrics to help you identify and measure hallucinations. Galileo's Guardrail Metrics are built to help you shed light on why and where the model may be struggling.
Uncertainty measures the model's certainty in its generated tokens. Because uncertainty works at the token level, it can be a great way of identifying where in the response the model started hallucinating.
When prompted for citations of papers on the phenomenon of "Human & AI collaboration", OpenAI's ChatGPT responds with this:

ChatGPT's response to a prompt asking for citations. Low, Medium, and High Uncertainty tokens are colored Green, Yellow, and Red, respectively.
A quick Google search reveals that the cited paper doesn't exist, and the arXiv link takes us to a completely unrelated paper.
While not every 'high uncertainty' token (shown in red) will contain hallucinations, and not every hallucination will contain high uncertainty tokens, we've seen a strong correlation between the two. Looking for Uncertainty is usually a good first step in identifying hallucinations.
Note: Uncertainty requires log probabilities and only works for certain models for now.
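If your model exposes token log probabilities, you can build an intuition for how token-level uncertainty surfaces suspect spans. The sketch below is purely illustrative and is not Galileo's exact formula; the thresholds and log probabilities are made up for the example.

```python
import math

# Illustrative sketch (not Galileo's formula): turn per-token log
# probabilities into an uncertainty score and bucket each token.
# The (token, logprob) pairs can come from any model that exposes
# log probabilities; the thresholds below are arbitrary examples.

def token_uncertainty(logprob: float) -> float:
    """Map a log probability to an uncertainty score in [0, 1)."""
    return 1.0 - math.exp(logprob)

def flag_uncertain_tokens(tokens_with_logprobs, high=0.7, medium=0.4):
    """Label each token as low / medium / high uncertainty."""
    flagged = []
    for token, logprob in tokens_with_logprobs:
        u = token_uncertainty(logprob)
        level = "high" if u >= high else "medium" if u >= medium else "low"
        flagged.append((token, round(u, 2), level))
    return flagged

# Example with made-up log probabilities for a fabricated citation:
response = [("According", -0.05), ("to", -0.01), ("Smith", -1.9),
            ("et", -0.2), ("al.", -0.1), ("(2021)", -2.3), (",", -0.02)]
for token, u, level in flag_uncertain_tokens(response):
    print(f"{token!r:12} uncertainty={u:.2f} ({level})")
```

In practice, the Galileo console renders this per-token view for you, with the red spans corresponding to the highest-uncertainty tokens.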
Groundedness measures whether your model's response was based purely on the context provided, i.e., the response doesn't state any facts not contained in that context. For RAG users, Groundedness is a measurement of hallucinations.
If a response is grounded in the context (i.e. it has a value of 1 or close to 1), it only contains information given in the context. If a response is not grounded (i.e. it has a value of 0 or close to 0), it's likely to contain facts not included in the context provided to the model.

Hovering over the Groundedness value shows the rationale behind the score
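Metrics like this are typically computed by asking a separate LLM to judge whether the response is supported by the retrieved context. Below is a minimal sketch of that idea, assuming a hypothetical `ask_llm` helper that sends a prompt to your model of choice and returns its text reply; it is not Galileo's implementation, and the judge prompt is illustrative.

```python
# Minimal sketch of a groundedness-style check, not Galileo's implementation.
# `ask_llm` is a placeholder for whatever chat-completion call you use; it is
# assumed to take a prompt string and return the model's text response.

GROUNDEDNESS_PROMPT = """You are verifying a RAG response.
Context:
{context}

Response:
{response}

Does the response contain any claim that is NOT supported by the context?
Answer with a single number between 0 and 1, where 1 means fully grounded
(every claim is supported) and 0 means not grounded at all."""

def groundedness_score(context: str, response: str, ask_llm) -> float:
    prompt = GROUNDEDNESS_PROMPT.format(context=context, response=response)
    raw = ask_llm(prompt)
    try:
        return max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        # If the judge doesn't return a clean number, treat it as unscored.
        return float("nan")
```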
Factuality measures whether the claims stated in the response are factually correct. This metric requires additional LLM calls.
If the response is factual (i.e. it has a value of 1 or close to 1), the information is believed to be correct. To compute this, we prompt a GPT model (e.g. ChatGPT or GPT-4) 5 times and take the majority value as the result. The explanation of why something was deemed factual or not can be seen by hovering over the metric value.

Note: Because Factuality relies on external Large Language Models, its results are only as good as those models' knowledge base.
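The majority-vote scheme described above can be sketched as follows. The judge prompt and the `ask_llm` helper are placeholders rather than Galileo's actual prompt; only the sample-several-times-and-take-the-majority logic mirrors the description above.

```python
from collections import Counter

# Sketch of the majority-vote scheme: query a judge model several times and
# take the most common verdict. `ask_llm` is again a placeholder for your
# chat-completion call; the judge prompt below is illustrative.

FACTUALITY_PROMPT = """Are the factual claims in the following response correct?
Answer only "yes" or "no".

Response:
{response}"""

def factuality_by_majority(response: str, ask_llm, n_samples: int = 5) -> float:
    votes = []
    for _ in range(n_samples):
        answer = ask_llm(FACTUALITY_PROMPT.format(response=response))
        votes.append(1 if answer.strip().lower().startswith("yes") else 0)
    majority, _count = Counter(votes).most_common(1)[0]
    return float(majority)  # 1.0 = judged factual, 0.0 = judged not factual
```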
Enterprise users often have their own unique interpretations of what constitutes a hallucination. Galileo supports Custom Metrics and incorporates Human Feedback and Ratings, so you can tailor Galileo Prompt to your specific needs and to the definition of hallucinations relevant to your use case.
With Galileo's Experimentation and Evaluation features, you can systematically iterate on your prompts and models, ensuring a rigorous and scientific approach to improving the quality of responses and addressing hallucination-related challenges.