ChainPoll

ChainPoll is a powerful, flexible technique for LLM-based evaluation that is unique to Galileo. It is used to power multiple metrics across the Galileo platform.

This page provides a friendly overview of what ChainPoll is and what makes it different.

For a deeper, more technical look at the research behind ChainPoll, check out our paper Chainpoll: A high efficacy method for LLM hallucination detection.

ChainPoll = Chain + Poll

ChainPoll involves two core ideas, which make up the two parts of its name:

  • Chain: Chain-of-thought prompting

  • Poll: Prompting an LLM multiple times

Let's cover these one by one.

Chain

Chain-of-thought prompting (CoT) is a simple but powerful way to elicit better answers from a large language model (LLM).

A chain-of-thought prompt is simply a prompt that asks the LLM to write out its step-by-step reasoning process before stating its final answer. For example:

  • Prompt without CoT:

    • "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"

  • Prompt with CoT:

    • "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? Think step by step, and present your reasoning before giving the answer."

While this might seem like a small change, it often dramatically improves the accuracy of the answer.
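
To make this concrete, here is a minimal sketch of the two prompting styles using the OpenAI Python client. The model name and wording are illustrative choices, not the exact prompts Galileo uses.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

question = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

# Without CoT: the model must answer immediately, on the spot.
direct = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
)

# With CoT: the model is asked to reason step by step before answering.
cot = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": question
        + " Think step by step, and present your reasoning before giving the answer.",
    }],
)

print(direct.choices[0].message.content)
print(cot.choices[0].message.content)
```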

Why does CoT work?

To better understand why CoT works, consider that the same trick also works for human beings!

If someone asks you a complex question, you will likely find it hard to answer immediately, on the spot. You'll want some time to think about it -- which could mean thinking silently, or talking through the problem out loud.

Asking an LLM for an answer without using CoT is like asking a human to answer a question immediately, on the spot, without pausing to think. This might work if the human has memorized the answer, or if the question is very straightforward.

For complex or difficult questions, it's useful to take some time to reflect before answering, and CoT allows the LLM to do this.

Poll

ChainPoll extends CoT prompting by soliciting multiple, independently generated responses to the same prompt, and aggregating these responses.

Here's why this is a good idea.

As we all know, LLMs sometimes make mistakes. And these mistakes can occur randomly, rather than deterministically. If you ask an LLM the same question twice, you will often get two contradictory answers.

This is equally true of the reasoning generated by LLMs when prompted with CoT. If you ask an LLM the same question multiple times, and ask it to explain its reasoning each time, you'll often get a random mixture of valid and invalid arguments.

But here's the key observation: "a random mixture of valid and invalid arguments" is more useful than it sounds! Because:

  • All valid arguments end up in the same place: the right answer.

  • But an invalid argument can lead anywhere.

This turns the randomness of LLM generation into an advantage.

If we generate a diverse range of arguments, we'll get many different arguments that lead to the right answer -- because any valid argument leads there. We'll also get some invalid arguments, but they'll end up all over the place, not concentrated around any one answer. (Some of them may even produce the right answer by accident!)

This idea -- generate diverse reasoning paths with CoT, and let the right answer "bubble to the top" -- is sometimes referred to as self-consistency.

It was introduced in this paper, as a method for solving math and logic problems with LLMs.
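
In code, "generating diverse reasoning paths" simply means sampling the same CoT prompt several times at a non-zero temperature. Here is a minimal sketch; the model, temperature, and sample count are illustrative assumptions, not Galileo's settings.

```python
from openai import OpenAI

client = OpenAI()

cot_prompt = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now? "
    "Think step by step, and present your reasoning before giving the answer."
)

# Sample several independent chains of thought from the same prompt.
# n=5 and temperature=0.7 are illustrative choices, not Galileo's settings.
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": cot_prompt}],
    n=5,
    temperature=0.7,
)

reasoning_paths = [choice.message.content for choice in completion.choices]
```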

From self-consistency to ChainPoll

Although ChainPoll is closely related to self-consistency, there are a few key differences. Let's break them down.

Self-consistency is a technique for picking a single best answer. It uses majority voting: the most common answer among the different LLM outputs is selected as the final answer of the entire procedure.

By contrast, ChainPoll works by averaging over the answers produced by the LLM to produce a score.

Most commonly, the individual answers are True-or-False, and so the average can be interpreted as the fraction of True answers among the total set of answers.

For example, in our Context Adherence metric, we ask an LLM whether a response was consistent with a set of documents. We might get a set of responses like this:

  1. A chain of thought ending in the conclusion that Yes, the answer was supported

  2. A different chain of thought ending in the conclusion that Yes, the answer was supported

  3. A third chain of thought ending in the conclusion that No, the answer was not supported

In this case, we would average the three answers and return a score of 0.667 (=2/3) to you.

The majority voting approach used in self-consistency would round this off to Yes, since that's the most common answer. But this misses some of the information present in the underlying answers.

By giving you an average, ChainPoll conveys a sense of the evaluating LLM's level of certainty. In this case, while the answer is more likely to be Yes than No, the LLM is not entirely sure, and that nuance is captured in the score.
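
The difference between the two aggregation strategies fits in a few lines. This is a schematic sketch: it assumes each chain of thought has already been parsed into a True-or-False verdict.

```python
def self_consistency(verdicts: list[bool]) -> bool:
    """Majority vote: pick the single most common answer."""
    return sum(verdicts) > len(verdicts) / 2

def chainpoll_score(verdicts: list[bool]) -> float:
    """ChainPoll: average the verdicts into a fractional score."""
    return sum(verdicts) / len(verdicts)

# The Context Adherence example above: two "Yes" verdicts, one "No".
verdicts = [True, True, False]
print(self_consistency(verdicts))  # True  (rounds off to "Yes")
print(chainpoll_score(verdicts))   # 0.666... (keeps the uncertainty)
```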

Additionally, self-consistency has primarily been applied to "discrete reasoning" problems like math and code. While ChainPoll can be applied to such problems, we've found it also works much more broadly, for almost any kind of question that can be posed in a yes-or-no form.

Frequently asked questions

How does ChainPoll compare to the methods used by other LLM evaluation tools, like RAGAS and TruLens?

We cover this in detail in the section below on The ChainPoll advantage.

ChainPoll involves requesting multiple responses. Isn't that slow and expensive?

Not as much as you might think!

We use batch requests to LLM APIs to generate ChainPoll responses, rather than generating the responses one-by-one. Because all requests in the batch have the same prompt, the API provider can process them more efficiently: the prompt only needs to be run through the LLM once, and the results can be shared across all of the sequences being generated.

This efficiency improvement often corresponds to better latency or lower cost from the perspective of the API consumer (and ultimately, you).

For instance, with the OpenAI API -- our default choice for ChainPoll -- a batch request for 3 responses from the same prompt will be billed for:

  • All the output tokens across all 3 responses

  • All the input tokens in the prompt, counted only once (not 3 times)

Compared to simply making 3 separate requests, this cuts down on the cost of the prompt by 2/3.
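
Here is the arithmetic as a back-of-the-envelope sketch. The token counts below are made up purely for illustration.

```python
prompt_tokens = 800   # hypothetical length of the evaluation prompt
output_tokens = 150   # hypothetical length of each generated response
n = 3                 # number of responses requested

# Three separate requests: the prompt is billed once per request.
separate = n * prompt_tokens + n * output_tokens   # 2,850 billable tokens

# One batched request (n=3): the prompt is billed once, outputs for all three.
batched = prompt_tokens + n * output_tokens        # 1,250 billable tokens
```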

What LLMs does Galileo use with ChainPoll? Why those?

By default, we use OpenAI's latest version of GPT-3.5-Turbo.

Although GPT-3.5-Turbo can be less accurate than more powerful LLMs such as GPT-4, it's much faster and cheaper. We've found that using it with ChainPoll closes a significant fraction of the accuracy gap between it and GPT-4, while remaining much faster and less expensive.

That said, GPT-4 and other state-of-the-art LLMs can also benefit from ChainPoll.

Sounds simple enough. Couldn't I just build this myself?

Galileo continually invests in research aimed at improving the quality and efficiency of ChainPoll, as well as rigorously measuring these outcomes.

For example, in the initial research that produced ChainPoll, we found that the majority of available datasets used in earlier research on hallucination detection did not meet our standards for relevance and quality; in response, we created our own benchmark called RealHall.

By using Galileo, you automatically gain access to the fruits of these ongoing efforts, including anything we discover and implement in the future.

Additionally, Galileo ChainPoll metrics are integrated naturally with the rest of the Galileo platform. You won't have to worry about how to scale up ChainPoll requests, how to persist ChainPoll results to a database, or how to track ChainPoll metrics alongside other information you log during LLM experiments or in production.

How do I interpret the scores?

ChainPoll scores are averages over multiple True-or-False answers. You can interpret them as a combination of two pieces of information:

  • An overall inclination toward Yes or No, and

  • A level of certainty/uncertainty.

For example:

  • A score of 0.667 means that the evaluating LLM said Yes 2/3 of the time, and No 1/3 of the time.

    • In other words, its overall inclination was toward Yes, but it wasn't totally sure.

  • A score of 1.0 would indicate the same overall inclination, with higher confidence.

Likewise, 0.333 is "inclined toward No, but not sure," and 0 is "inclined toward No, with higher confidence."
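
A tiny helper makes this reading explicit. This is purely illustrative: the wording and thresholds are a simplification, not part of the Galileo API.

```python
def interpret(score: float) -> str:
    """Read a ChainPoll score as an inclination plus a rough level of certainty."""
    inclination = "Yes" if score >= 0.5 else "No"
    certainty = "high" if score in (0.0, 1.0) else "moderate"
    return f"Inclined toward {inclination}, with {certainty} confidence"

print(interpret(0.667))  # Inclined toward Yes, with moderate confidence
print(interpret(0.0))    # Inclined toward No, with high confidence
```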

It's important to understand that a lower ChainPoll score doesn't necessarily correspond to lower quality, particularly on the level of a single example. ChainPoll scores are best used either:

  • As a guide for your own explorations, pointing out things in the data for you to review, or

  • As a way to compare entire runs to one another in aggregate.

The ChainPoll advantage

ChainPoll is unique to Galileo. In this section, we'll explore how it differs from the approaches used in products like RAGAS and TruLens, and what makes ChainPoll more effective.

ChainPoll vs. RAGAS

RAGAS offers a Faithfulness score, which has a similar purpose to Galileo's Context Adherence score.

Both of these scores evaluate whether a response is consistent with the information in a context, such as the chunks provided by a RAG retriever.

However, under the hood, the two scores work very differently.

To compute Faithfulness, RAGAS calls an LLM in two distinct steps:

  1. The LLM is asked to break the response down into one or more granular statements.

    1. In this step, the LLM can only see the response, not the context.

  2. The LLM is given the statements and the context, and is asked to judge whether or not each statement is consistent with the context.

    1. In this step, the LLM can see the context, but not the original response. Instead, it only sees the statements that were written in step 1.

The scores for each statement (0 for inconsistent, 1 for consistent) are averaged over statements to produce a score.
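
Schematically, the procedure looks something like this. This is a simplified sketch of the steps described above, not RAGAS's actual code; `split_into_statements` and `is_supported` are hypothetical stand-ins for the two LLM calls.

```python
def ragas_faithfulness(response: str, context: str, llm) -> float:
    """Simplified sketch of the two-step procedure described above."""
    # Step 1: the LLM sees only the response and breaks it into statements.
    statements = llm.split_into_statements(response)      # hypothetical helper

    # Step 2: the LLM sees the context and each statement, but not the
    # original response. Each verdict is 1 (consistent) or 0 (inconsistent).
    verdicts = [llm.is_supported(statement, context)       # hypothetical helper
                for statement in statements]

    # The per-statement verdicts are averaged into the final score.
    return sum(verdicts) / len(verdicts)
```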

This procedure can go wrong in a few different ways, none of which apply to ChainPoll.

Statement breakdowns can be misleading

By breaking down the response into statements and judging the statements separately, RAGAS can ignore the way that different parts of the response are related.

An LLM response is not just a disconnected list of "statements," any more than this article is. It may make a complex claim or argument that loses its structure when broken down in this way.

Consider this example, from a dataset related to Covid-19 that we use internally at Galileo.

An LLM was given a set of documents describing medical studies, and asked

What important risk factors to infection were found during the second case-controlled study?

It responded:

The important risk factors to infection found during the second case-controlled study were hospitalization in the preceding 90 days, residency in a nursing home, and antibiotic use.

As it happened, this response was not consistent with the documents. Although all the factors cited as risk factors were mentioned in the documents, they weren't actually said to be risk factors during the second case-controlled study.

Galileo Context Adherence caught this error, assigning the response a score of 0. Here is one of the chain-of-thought explanations it produced:

The response claims that the important risk factors to infection found during the second case-controlled study were hospitalization in the preceding 90 days, residency in a nursing home, and antibiotic use. 

To verify this claim, we need to check the specific risk factors mentioned in the second case-controlled study document. However, the second case-controlled study document does not provide specific risk factors related to infection. It mainly focuses on data collection and limitations of the study. Therefore, the claim that hospitalization in the preceding 90 days, residency in a nursing home, and antibiotic use were important risk factors found during the study is not supported by the documents.

But RAGAS Faithfulness assigns this response a score of 1.0 (perfect).

To see what went wrong, let's review the 4 statements that RAGAS generated from the response, along with its reasoning for judging each one consistent.

Statement 1/4
The second case-controlled study identified several important risk factors to infection.

Reasoning
The passage mentions that a case-control study was conducted to identify risk factors for multi-drug resistant infection in the pediatric intensive care unit (PICU).

Verdict
1 (Consistent)

---

Statement 2/4
These risk factors include hospitalization in the preceding 90 days.

Reasoning
The passage states that hospitalization in the preceding 90 days was a risk factor for infection with a resistant pathogen.

Verdict
1 (Consistent)

---

Statement 3/4
Residency in a nursing home was also found to be a significant risk factor.

Reasoning
The passage mentions that residency in a nursing home was an independent predictor of infection with a resistant pathogen.

Verdict
1 (Consistent)

---

Statement 4/4
Additionally, antibiotic use was identified as an important risk factor.

Reasoning
The passage states that antibiotic use was one of the main contents collected and analyzed in the study.

Verdict
1 (Consistent)

When RAGAS broke down the response into statements, it omitted key information that made the answer inconsistent.

Some of the statements are about the second case-controlled study, and some are about risk factors. Taken in isolation, each of these statements is arguably true.

But none of them captures the claim that the original LLM got wrong: that these risk factors were identified, not just in any study, but in the second case-controlled study.

ChainPoll allows the LLM to assess the entire input at once and come to a holistic judgment of it. By contrast, RAGAS fragments its reasoning into a sequence of disconnected steps, performed in isolation and without access to complete information.

This causes RAGAS to miss subtle or complex errors, like the one in the example above. But, given the increasing intelligence of today's LLMs, subtle and complex errors are precisely the ones you need to be worried about.

RAGAS does not handle refusals sensibly

Second, RAGAS Faithfulness is unable to produce meaningful results when the LLM refuses to answer.

In RAG, an LLM will sometimes respond with a refusal that claims it doesn't have enough information: an answer like "I don't know" or "Sorry, that wasn't mentioned in the context."

Like any LLM response, these are sometimes appropriate and sometimes inappropriate:

  • If the requested information really wasn't in the retrieved context, the LLM should say so, not make something up.

  • On the other hand, if the information was there, the LLM should not assert that it wasn't there.

In our tests, RAGAS Faithfulness always assigns a score of 0 to these kinds of refusal answers.

This is unhelpful: refusal answers are often desirable in RAG, because no retriever is perfect. If the answer isn't in your context, you don't want your LLM to make one up.

Indeed, in this case, saying "the answer wasn't in the context" is perfectly consistent with the context: the answer really was not there!

Yet RAGAS claims these answers are inconsistent.

Why? Because it is unable to break down a refusal answer into a collection of statements that look consistent with the context.

Typically, it produces no statements at all, and then returns a default score of 0. In other cases, it might produce a statement like "I don't know" and then assess this statement as "not consistent" since it doesn't make sense outside its original context as an answer to a question.

ChainPoll handles these cases gracefully: it assesses them like any other answer, checking whether they are consistent with the context or not. Here's an example:

The LLM response was

The provided context does not contain information about where the email was published. Therefore, it is not possible to determine where the email was published based on the given passages.

The Galileo Context Adherence score was 1, with an explanation of

The provided documents contain titles and passages that do not mention the publication details of an email. Document 1 lists an 'Email address' under the passage, but provides no information about the publication of an email. Documents 2, 3, and 4 describe the coverage of the Ebola Virus Disease outbreak and mention various countries and aspects of newspaper writings, but do not give any details about where an email was published. Hence, the context from these documents does not contain the necessary information to answer the question regarding the publication location of the email. The response from the large language model accurately reflects this lack of information.

RAGAS does not explain its answers

Although RAGAS does generate explanations internally (see the examples above), these are not surfaced to the user.

Moreover, as you can see above, they are briefer and less illuminating than ChainPoll explanations.

(We produced the examples above by adding callbacks to RAGAS to capture the requests it was making, and then following identifiers in the requests to link the steps together. You don't get any of that out of the box.)

ChainPoll vs. TruLens

TruLens offers a Groundedness score, which targets similar needs to Galileo Context Adherence and RAGAS Faithfulness: evaluating whether a response is consistent with a context.

As we saw above with RAGAS, although these scores look similar on the surface, there are important differences in what they actually do.

TruLens Groundedness works as follows:

  1. The response is split up into sentences.

  2. An LLM is given the list of sentences, along with the context. It is asked to:

    1. quote the part of the context (if any) that supports the sentence

    2. rate the "information overlap" between each sentence and the context on a 0-to-10 scale.

  3. The scores are mapped to a range from 0 to 1, and averaged to produce an overall score.
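
In outline, that procedure looks something like this. Again, this is a simplified sketch, not TruLens's actual implementation; `rate_overlap` is a hypothetical stand-in for the LLM call.

```python
def trulens_groundedness(response: str, context: str, llm) -> float:
    """Simplified sketch of the procedure described above."""
    # Step 1: split the response into sentences (a naive split, for illustration).
    sentences = [s.strip() for s in response.split(".") if s.strip()]

    # Step 2: for each sentence, the LLM quotes supporting evidence and rates
    # "information overlap" on a 0-to-10 scale (hypothetical helper).
    ratings = [llm.rate_overlap(sentence, context) for sentence in sentences]

    # Step 3: map the ratings to the 0-to-1 range and average them.
    return sum(rating / 10 for rating in ratings) / len(ratings)
```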

We've observed several failure modes of this procedure that don't apply to ChainPoll.

TruLens does not use chain-of-thought reasoning

Although TruEra uses the term "chain of thought" when describing what this metric does, the LLM is not actually asked to present a step-by-step argument.

Instead, it is merely asked to give a direct quotation from the context, then (somehow) assign a score to the "information overlap" associated with this quotation. It doesn't get any chance to "think out loud" about why any given quotation might, or might not, really constitute supporting evidence.

For example, here's what TruLens produces for the second case-controlled study example we reviewed above with RAGAS:

Statement Sentence: The important risk factors to infection found during the second case-controlled study were hospitalization in the preceding 90 days, residency in a nursing home, and antibiotic use. 

Supporting Evidence: pathogen isolated in both study groups, but there was a higher prevalence of MDR pathogens in patients with risk factors compared with those without. Of all the risk factors, hospitalization in the preceding 90 days 1.90 to 12.4, P = 0.001) and residency in a nursing home were independent predictors of infection with a resistant pathogen and mortality. 

Score: 8

The LLM quotes a passage that mentions the factors cited as risk factors in the response, without first stopping to think -- like ChainPoll does -- about whether the document actually says these are risk factors in the second case-controlled study.

Then, perhaps because the quoted passage is relatively long, it assigns it a score of 8/10. Yet this response is not consistent with the context.

TruLens uses an ambiguous grading system

You might have noticed another odd thing about the example just above. Even if the evidence really had been supporting evidence (which it wasn't), why "8 out of 10"? Why not 7/10, or 9/10, or 10/10?

There's no good answer to this question. TruLens does not provide the LLM with a clear grading guide explaining exactly what makes an answer an "8/10" as opposed to a mere "7/10", and so on.

Instead, it only tells the LLM to "Output a number between 0-10 where 0 is no information overlap and 10 is all information is overlapping."

If you were given this instruction, would you know how to decide when to give an 8 vs. a 7 vs. a 9? The LLM is as confused as you are.

As a result, the ratings computed inside the TruLens Groundedness score often vary whimsically, without apparent meaning. In our testing, we've observed these numbers varying widely across the 0-to-10 scale when we re-run the scorer with the same input, even when the LLM cites the same supporting quotation (or a very similar one).

At Galileo, we've observed that LLMs are often confused by numerical rating systems, and produce much more reliable and intelligible results when asked for a simple Yes-or-No answer.

This makes intuitive sense: in the absence of detailed guidelines about what each numeric score means, a 0-to-10 rating scale is inherently ambiguous and ill-defined.

Sentence breakdowns can be misleading

As we noted above when discussing RAGAS, imposing a breakdown into granular units like sentences or statements -- and then grading each unit separately -- can miss the way the units are logically connected.

Thus, TruLens can exhibit errors similar to the one we covered in the section on RAGAS when dealing with complex answers.

TruLens gets confused by formatting

In our testing, we noticed that TruLens Groundedness would often generate strange, malformed output that included grades for some sentences from the context instead of the response.

Here's one example. The response was

Bacillus subtilis can be used as an oral vaccine by expressing the COE antigen of PEDV and delivering it to the GI tract to stimulate mucosal immune responses. It can protect antigens from the harsh digestive environment and enhance immune regulation and gut health in newborn piglets. Additionally, it has adjuvant activity and can deliver heterologous antigens to the GI tract, providing additional immunity stimulation.

There are 3 sentences here. TruLens is supposed to assign a score to each one, together with a supporting quotation. But here is what it actually produced:

STATEMENT 0:
Statement Sentence: Bacillus subtilis can be used as an oral vaccine by expressing the COE antigen of PEDV and delivering it to the GI tract to stimulate mucosal immune responses. 
Supporting Evidence: NOTHING FOUND 
Score: 0


STATEMENT 1:
Statement Sentence: Oral administration can protect the gut and stimulate the common mucosal immune system. 
Supporting Evidence: As a superior way of mucosal immunization, oral administration can protect the gut and stimulate the common mucosal immune system. 
Score: 10 

Statement Sentence: Oral immunization often delivers large amounts of antigens to prevent diarrheal diseases. 
Supporting Evidence: Oral immunization often delivers large amounts of antigens to prevent the diarrheal diseases. 
Score: 10 

Statement Sentence: Bacillus subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Supporting Evidence: Currently, Bacillus subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Score: 10 

Statement Sentence: Orally administered Bacillus subtilis can enhance immune regulation and gut health in pigs. 
Supporting Evidence: Moreover, oral administration of B. subtilis could elicit humoral and cellular immune responses to the maintenance of gut homeostasis by dendritic cells. 
Score: 5


STATEMENT 2:
Statement Sentence: Oral administration can protect the gut and stimulate the common mucosal immune system. 
Supporting Evidence: As a superior way of mucosal immunization, oral administration can protect the gut and stimulate the common mucosal immune system. 
Score: 10 

Statement Sentence: Orally administered B. subtilis could enhance immune regulation and gut health in pigs. 
Supporting Evidence: Moreover, oral administration of B. subtilis could elicit humoral and cellular immune responses to the maintenance of gut homeostasis by dendritic cells. 
Score: 7 

Statement Sentence: B. subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Supporting Evidence: Currently, Bacillus subtilis is widely used as a vaccine delivery system for its unique characteristics. 
Score: 10 

After the first statement, things go off the rails. The sentences listed under "STATEMENT 1" and "STATEMENT 2" don't appear in the response at all. And, nonsensically, the LLM has written multiple "Statement Sentences" under each of the "STATEMENT" headings.

In a case like this, the TruLens codebase assumes that each STATEMENT heading only has one score under it, and ends up picking the first one listed. Here, it ended up with the scores [0, 10, 10] for the three statements. But the latter two scores are nonsense -- they're not about sentences from the response at all.

We tracked this issue down to formatting.

Our context included multiple paragraphs and documents, which were separated by line breaks. It turns out that TruLens' prompt format also uses line breaks to delimit sections of the prompt. Apparently, the LLM became confused about which line breaks meant what.

Replacing line breaks with spaces fixed the problem in this case. But you shouldn't have to worry about this kind of thing at all. Line breaks are not an exotic edge case, after all.

The prompt formats we use for Galileo ChainPoll metrics rely on a more robust delimiting strategy, including reformatting your output in some cases where needed. This prevents such issues from arising with ChainPoll.
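
As a generic illustration (not Galileo's actual prompt format), one robust approach is to wrap each section in explicit tags, so that line breaks inside the content can't be mistaken for section boundaries:

```python
def build_judgment_prompt(context: str, response: str) -> str:
    """Illustrative only: delimit sections with explicit tags instead of line breaks."""
    return (
        "<context>\n" + context + "\n</context>\n"
        "<response>\n" + response + "\n</response>\n"
        "Is the response consistent with the context? "
        "Think step by step, then answer Yes or No."
    )
```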
