Definition: Measures whether a given model response is factual or not. Correctness (f.k.a. Factuality) is a good way of uncovering open-domain hallucinations: factual errors that don't relate to any specific documents or context. A high Correctness score means the response is more likely to be accurate, whereas a low score indicates a high probability of hallucination.
If the score is 1 or close to 1, the response is considered factual and its information is believed to be correct. If the score is 0 or close to 0, the response is considered not factual and is likely to contain factual errors.
Calculation: Correctness is computed by sending additional requests to an OpenAI LLM, using a carefully engineered chain-of-thought prompt that asks the model to judge whether or not the response was factually accurate. The metric requests multiple distinct responses to this prompt, each of which produces an explanation along with a final judgment: yes or no. The Correctness score is the number of "yes" responses divided by the total number of responses.
We also surface one of the generated explanations. The surfaced explanation is always chosen to align with the majority judgment among the responses. In other words, if the score is greater than 0.5, the explanation will provide an argument that the response is factual; if the score is less than 0.5, the explanation will provide an argument that it is not factual.
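The scoring and explanation-selection steps described above can be sketched roughly as follows. This is an illustrative sketch only, not the product's implementation: the `Judgment` type and the `correctness` function are hypothetical names, the judge responses are assumed to have been collected already, picking the first explanation on the majority side is an assumption, and so is returning no explanation when the score is exactly 0.5 (a case the description does not cover).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Judgment:
    """One judge response (hypothetical type): a yes/no verdict plus its explanation."""
    is_factual: bool
    explanation: str


def correctness(judgments: List[Judgment]) -> Tuple[float, Optional[str]]:
    """Return (score, surfaced explanation) for a list of judge responses."""
    if not judgments:
        raise ValueError("at least one judgment is required")

    # Score: number of "yes" verdicts divided by the total number of verdicts.
    yes_count = sum(1 for j in judgments if j.is_factual)
    score = yes_count / len(judgments)

    # Assumption: an exact tie has no majority, so no explanation is surfaced.
    if score == 0.5:
        return score, None

    # Surface an explanation that agrees with the majority verdict.
    # Assumption: take the first one on the majority side.
    majority_is_factual = score > 0.5
    explanation = next(
        j.explanation for j in judgments if j.is_factual == majority_is_factual
    )
    return score, explanation


# Usage with made-up judge outputs.
sample = [
    Judgment(True, "The stated date matches the historical record."),
    Judgment(False, "The date could not be confirmed."),
    Judgment(True, "The claim is consistent with well-known facts."),
]
score, explanation = correctness(sample)
print(score, explanation)
```

With two "yes" verdicts out of three, the score is above 0.5, so the surfaced explanation argues that the response is factual.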
Usefulness: Flag and examine responses according to whether they are likely to be factual or hallucinated. Once such responses are found, you can take precautionary measures to fix likely hallucinated responses and address areas where your model is struggling.
The explanation of why a response was deemed factual or not appears when you hover over the metric value.
Note: This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute.