Integrating Evaluate into my existing application

If you already have a prototype or an application you're looking to run experiments and evaluations over, Galileo Evaluate allows you to hook into it and log the inputs, outputs, and any intermediate steps to Galileo for further analysis.

Before creating a run, you'll want to make sure you have an evaluation set (a set of questions / sample inputs you want to run through your prototype for evaluation). Your evaluation set should be consistent across runs.
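For example, an evaluation set can be as simple as a fixed list of sample inputs kept alongside your code (a hypothetical sketch; the questions are illustrative):

# A hypothetical evaluation set: a fixed list of sample inputs that you
# re-use for every run so that results stay comparable across experiments.
evaluation_set = [
    "What is your refund policy?",
    "How do I reset my password?",
    "Which plans include SSO support?",
]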

Haven't written any code yet? Are you looking for a no-code way of testing out models and templates for your use case? Check out Creating Prompt Runs.

There are a few ways you can integrate your existing application depending on how you built it:

LangChain

Galileo supports logging chains from LangChain. To log these chains, you'll need to use the callback from our Python client, promptquality.

To log your data, first log in:

import promptquality as pq
pq.login({YOUR_GALILEO_URL})

After that, you can set up the GalileoPromptCallback:

galileo_handler = pq.GalileoPromptCallback(
    project_name=<project-name>, scorers=[<list-of-scorers>]
)
  • project_name: each "run" will appear under this project. Choose a name that'll help you identify what you're evaluating.

  • scorers: This is the list of metrics you want to evaluate your run over. Check out Galileo Guardrail Metrics and Custom Metrics for more information.
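For example, a callback configured for a hypothetical support-bot project might look like the sketch below. The project name and scorer choices are illustrative, and the scorer names exposed by pq.Scorers may differ by client version; check Galileo Guardrail Metrics for the exact list.

import promptquality as pq

galileo_handler = pq.GalileoPromptCallback(
    project_name="support-bot-evaluation",  # illustrative project name
    # Illustrative metric choices; pick the scorers that fit your use case.
    scorers=[pq.Scorers.context_adherence, pq.Scorers.correctness],
)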

Executing and Logging

Next, run your chain over your evaluation set and log the results to Galileo.

When you execute your chain (with run, invoke, or batch), include the callback instance created earlier in the callbacks, as shown below:

If using .run():

chain.run(<inputs>, callbacks=[galileo_handler])

If using .invoke():

chain.invoke(inputs, config=dict(callbacks=[galileo_handler]))

If using .batch():

chain.batch(..., config=dict(callbacks=[galileo_handler]))

Important: Once you've finished executing over your dataset, tell Galileo the run is complete by calling:

galileo_handler.finish()

The finish step uploads the run to Galileo and starts the execution of the scorers server-side. This step will also display the link you can use to interact with the run on the Galileo console.

A full example can be found here.

Note 1: Make sure to set the callback at execution time, not at definition time, so that the callback is invoked for all nodes of the chain.

Note 2: We recommend using .invoke instead of .batch, because LangChain reports latencies for the entire batch rather than for each individual chain execution.
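Putting these pieces together, a minimal sketch of a full evaluation loop (assuming the chain, evaluation_set, and galileo_handler defined above) could look like:

# Attach the callback at execution time so every node in the chain is logged.
for sample in evaluation_set:
    chain.invoke(sample, config=dict(callbacks=[galileo_handler]))

# Upload the run to Galileo and kick off server-side scoring.
galileo_handler.finish()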

Custom Logging

If you're not using an orchestration library, or you're using one other than LangChain, we provide a similar interface for uploading executions that don't use a callback mechanism. To log your runs with Galileo, start with the same typical flow of logging into Galileo:

import promptquality as pq
pq.login({YOUR_GALILEO_URL})

Then, for each step of your sequence (or node in the chain), construct a NodeRow:

from promptquality import NodeType, NodeRow

rows = [
    NodeRow(node_id=..., chain_root_id=..., node_type=<NodeType>)
]

For example, you can log your retriever and LLM nodes with the snippet below.

from promptquality import NodeType, NodeRow
import uuid

rows = []

CHAIN_ROOT_ID = uuid.uuid4()  # Randomly generated UUID
rows.append(
    NodeRow(
        node_id=CHAIN_ROOT_ID,
        chain_root_id=CHAIN_ROOT_ID,  # UUID of the 'parent' node
        step=0,  # an integer indicating which step this node is
        node_input=...,  # input into your overall sequence or chain
        node_output=...,  # output of your overall sequence or chain
        latency=...,  # latency of this step/node, in nanoseconds
        node_type=NodeType.chain,  # Can be chain, retriever, llm, chat, agent, tool
    )
)

rows.append(
    NodeRow(
        node_id=uuid.uuid4(),  # Randomly generated UUID
        chain_root_id=CHAIN_ROOT_ID,  # UUID of the 'parent' node
        step=1,  # an integer indicating which step this node is
        node_input=...,  # input into your retriever
        node_output=...,  # serialized output of the retriever (i.e. json.dumps([{"page_content": "doc_1", "metadata": {"key": "val"}}, {"page_content": "doc_2", "metadata": {"key": "val"}}, ...]))
        latency=...,  # latency of this step/node, in nanoseconds
        node_type=NodeType.retriever,  # Can be chain, retriever, llm, chat, agent, tool
    )
)

rows.append(
    NodeRow(
        node_id=uuid.uuid4(),  # Randomly generated UUID
        chain_root_id=CHAIN_ROOT_ID,  # UUID of the 'parent' node
        step=2,  # an integer indicating which step this node is
        node_input=...,  # input into your llm (i.e. user query + relevant contexts passed in as a string)
        prompt=...,  # input into your llm (i.e. user query + relevant contexts passed in as a string)
        node_output=...,  # output of the llm passed in as a string
        response=...,  # output of the llm passed in as a string
        latency=...,  # latency of this step/node, in nanoseconds
        node_type=NodeType.llm,  # Can be chain, retriever, llm, chat, agent, tool
    )
)

We recommend randomly generating node_id and chain_root_id (e.g. with uuid.uuid4()). Set the id of a 'parent' node as the chain_root_id of its children.

When your execution completes, log that data to Galileo:

pq.chain_run(rows, project_name=<project-name>, scorers=[<list-of-scorers>])

Once that's complete, this step will display the link to access the run in your Galileo Console.

Logging metadata

If you are logging chains from LangChain, metadata values (such as chunk-level metadata for the retriever) will be automatically included.

For custom chains, metadata values can be logged by dumping metadata along with page_content as demonstrated below.

from promptquality import NodeType, NodeRow
import json
import uuid

retriever_output = [
    {"page_content": "chunk 1 content", "metadata": {"key": "value"}},
    {"page_content": "chunk 2 content", "metadata": {"key": "value"}},
]

rows = []

rows.append(
    NodeRow(
        node_id=uuid.uuid4(),  # Randomly generated UUID
        chain_root_id=...,  # UUID of the 'parent' node
        step=...,  # an integer indicating which step this node is
        node_type=NodeType.retriever,
        node_input="the query to the retriever",
        node_output=json.dumps(retriever_output),
    )
)

Running multiple experiments in one go

If you want to run multiple experiments in one go (e.g. use different templates, experiment with different retriever params, etc.), check out Chain Sweeps.
