Creating Prompt Runs

A Prompt Run is a quick and easy way to test a combination of template, model, and model settings for your use case. To create a Prompt Run, you'll need:

  • An Evaluation Set - a list of user queries / inputs that you want to run your evaluation over

  • A template / model combination you'd like to try.

You can create prompt runs via the Playground UI or via Python.

If you already have an application or prototype you're looking to evaluate, Prompt Runs are not for you. Instead, we recommend integrating Evaluate into your existing application.

Creating a Prompt Run via Python

  1. Pip install promptquality so you can create runs from your Python notebook.

  2. Next, execute promptquality.run() as shown below:

import promptquality as pq

# Authenticate against your Galileo cluster (replace with your console URL).
pq.login("YOUR_GALILEO_URL")

# Template variables go in curly braces and are filled in from the dataset.
template = "Explain {topic} to me like I'm a 5 year old"

# Each key maps a template variable to the list of values to evaluate.
data = {"topic": ["Quantum Physics", "Politics", "Large Language Models"]}

pq.run(project_name='my_first_project',
       template=template,
       dataset=data,
       settings=pq.Settings(model_alias='ChatGPT (16K context)',
                            temperature=0.8,
                            max_tokens=400))
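Once the run finishes, open the project in the Galileo console to inspect each generated response alongside its metric scores (the SDK typically prints a link to the run as part of its output).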

Creating a Prompt Run via the Playground UI

  1. Log in to the Galileo console

  2. Create a New Project via the "+" button.

    1. Give your project a Name, or choose Galileo's proposed name

    2. Select "Evaluate"

    3. Click on Create Project

This will take you to the Galileo Playground. Next, choose a template, a model, and hyperparameter settings.

Choosing a Template and Model, and Tuning Hyperparameters

  1. Choose an LLM and adjust its hyperparameter settings. For custom or self-hosted LLMs, follow the section Setting Up Your Custom LLMs.

  2. Give your template a name, or select a pre-defined template

  3. Enter a Prompt. Put variables in curly braces, e.g. {topic}.

  4. Add Data: there are two ways to add data

    1. Upload a CSV, with the first row containing the variable names and each subsequent row containing their values (see the example below)

    2. Manually add data by clicking on "+ Add data"
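For the {topic} template above, an uploaded CSV would look like the illustrative sample below. The header row must match your template's variable names exactly; with multiple variables, add one column per variable.

topic
Quantum Physics
Politics
Large Language Models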

Choosing Your Guardrail Metrics

Galileo offers a comprehensive selection of Guardrail Metrics for monitoring your LLM (Large Language Model) app in production. Choose metrics based on your specific use case so you can evaluate your prompts and models effectively. Our Guardrail Metrics include:

  • Industry-Standard Metrics: These include well-known metrics such as BLEU (Bilingual Evaluation Understudy), ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation), and Perplexity, which are essential for assessing the linguistic quality of generated text.

  • Metrics from Galileo's ML Research Team: Developed through rigorous research, our team has introduced innovative metrics like Uncertainty, Correctness, and Context Adherence. These metrics are designed to evaluate the reliability and authenticity of the generated content, ensuring it meets high standards of accuracy and relevance.

For detailed information on each metric and how it can be used to monitor your LLM app in production, refer to the List of Metrics available through Galileo's platform.
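If you're creating runs from Python rather than the UI, you can request Guardrail Metrics when calling pq.run(). The sketch below assumes your promptquality version exposes a Scorers enum and a scorers argument on pq.run(); exact member names have changed between SDK releases, so verify them against the SDK reference before running.

import promptquality as pq

# Assumption: the scorers argument and these Scorers members exist in your
# promptquality version; check the SDK reference for the exact names.
pq.run(project_name='my_first_project',
       template="Explain {topic} to me like I'm a 5 year old",
       dataset={"topic": ["Quantum Physics", "Politics", "Large Language Models"]},
       settings=pq.Settings(model_alias='ChatGPT (16K context)',
                            temperature=0.8,
                            max_tokens=400),
       scorers=[pq.Scorers.correctness,
                pq.Scorers.context_adherence,
                pq.Scorers.uncertainty])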

Looking to build more complex systems?

If you're building more complex systems, e.g. an application that leverages RAG, Agents, or other multi-step workflows, check out how to use Galileo with RAG or Galileo with Agents.
