Automatic Prompt Optimization

A detailed walkthrough of our Automatic Prompt Optimization client.

Our automatic prompt optimizer returns a prompt optimized for your specific data, removing the need for manual prompt engineering.

This page is a tutorial on how to deploy and use the prompt optimizer available in promptquality.


The code below demonstrates how to optimize a prompt used in a Retrieval-Augmented Generation (RAG) application. Here is a brief overview of each step and component involved:

  1. Initial Prompt: This is the starting point that includes placeholders (slots) for inputs.

    1. Ex. for RAG applications, the prompt can be "For the given context: {context}. Answer the question: {question}"

  2. Dataset: This dataset should contain columns corresponding to the prompt's slots and optionally an expected answer column labeled target.

    1. Ex. for our RAG example, the dataset should have the columns "context", "question", and "target".

  3. Evaluation Criteria: These criteria define how to evaluate the model's output. We provide extensive examples below for a number of use cases, with and without expected answers. Note: providing expected answers will likely improve results.

    1. Ex. for our RAG example: "Does the output semantically align with the expected answer? While we do not need a perfect syntactical match, the LLM response should convey the expected answer. If it does, give it a 1; if it does not, give it a 0."

  4. Task Description: A concise description of the task that helps tailor the prompt to the specific application.

    1. Ex. "The task is to answer user provided questions given context that should have the answer."

  5. Configuration and Execution: The Python code sets up the PromptOptimizationConfiguration with the prompt, criteria, task description, and other parameters (like iterations and model aliases). The optimize_prompt function uses this configuration, along with the dataset, to perform the optimization.

  6. Fetching Results: After optimization, the fetch_prompt_optimization_result function retrieves the optimized prompts and their ratings.

By following these steps, you can refine your prompts to improve the performance of your AI model in specific applications.
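As a rough illustration of the dataset requirement in step 2, the sketch below checks that every slot in a prompt has a matching dataset column and whether a target column is present. It uses only the Python standard library; the helper names `prompt_slots` and `missing_columns` and the one-row sample CSV are hypothetical and are not part of promptquality.

```python
import csv
import io
from string import Formatter


def prompt_slots(prompt: str) -> set[str]:
    """Return the named placeholders (slots) found in a format-style prompt."""
    return {field for _, field, _, _ in Formatter().parse(prompt) if field}


def missing_columns(prompt: str, columns: list[str]) -> set[str]:
    """Return the slots that have no dataset column with the same name."""
    return prompt_slots(prompt) - set(columns)


initial_prompt = "For the given context: {context}. Answer the question: {question}"

# A made-up, in-memory CSV standing in for a real dataset file.
sample_csv = io.StringIO(
    "context,question,target\n"
    '"Paris is the capital of France.","What is the capital of France?","Paris"\n'
)
columns = list(csv.DictReader(sample_csv).fieldnames or [])

missing = missing_columns(initial_prompt, columns)  # empty when every slot has a column
has_target = "target" in columns                    # expected answers are optional
```

Running a check like this before submitting a job can catch a misspelled column name early, rather than after a long optimization run.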

Here is how the full workflow could look in code:

from pathlib import Path

from promptquality.constants.models import Models
from promptquality.prompt_optimization import optimize_prompt
from promptquality.types.prompt_optimization import PromptOptimizationConfiguration

initial_prompt = "For the given context: \n{context}. Answer the question: {question}?"
evaluation_criteria = (
    "Does the output semantically align with the expected answer? "
    "While we do not need a perfect syntactical match, the LLM response "
    "should convey the expected answer. If it does, give it a 1; if it "
    "does not, give it a 0."
)
task_description = (
    "The task is answering user-provided questions given context "
    "that should have the answer."
)

po_config = PromptOptimizationConfiguration(
    # The prompt, criteria, task description, and iteration settings from
    # step 5 also belong here; those arguments are omitted in this listing.
    num_val_rows=200,  # if this is more than 0, you must supply a validation dataset below
    generation_model_alias="ChatGPT (4K context)",
    evaluation_model_alias="ChatGPT (4K context)",
)

# Arguments omitted in this listing: the configuration and the dataset(s).
output = optimize_prompt(...)

# Wait 20 minutes to 1 hour, depending on the size of the dataset.
from promptquality.prompt_optimization import fetch_prompt_optimization_result

# Arguments omitted in this listing.
prompts_and_ratings = fetch_prompt_optimization_result(...)

Cost Calculation

For a dataset of 40 rows, each iteration makes the following number of calls to OpenAI's API:

  • 40 for generation

  • 40 for evaluation

  • 5 for gradient calculation

  • 1 for gradient summarization

  • 3 for editing the prompt

  • 30 for picking a new prompt

There is a fixed cost of 39 calls per iteration; only the generation and evaluation calls change as the dataset size changes. You can therefore estimate your cost for a given number of iterations. For all datasets we have tested, the cost of 10 iterations has stayed below $2 with GPT 3.5.
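The arithmetic above can be sketched as a small calculator. This is an estimate only: it assumes one generation call and one evaluation call per row, as in the 40-row breakdown, and it does not account for any extra calls made on a validation dataset. The function names are hypothetical and are not part of promptquality.

```python
# Fixed calls per iteration: gradient calculation, gradient summarization,
# prompt editing, and picking a new prompt.
FIXED_CALLS_PER_ITERATION = 5 + 1 + 3 + 30  # 39


def calls_per_iteration(num_rows: int) -> int:
    """One generation call and one evaluation call per row, plus the fixed calls."""
    return 2 * num_rows + FIXED_CALLS_PER_ITERATION


def total_calls(num_rows: int, iterations: int) -> int:
    """Estimated number of API calls for a whole optimization run."""
    return iterations * calls_per_iteration(num_rows)


per_iteration = calls_per_iteration(40)  # 40 + 40 + 39 = 119
full_run = total_calls(40, 10)           # 10 iterations of 119 calls = 1190
```

Multiplying the call count by your model's per-call cost gives a rough budget before you start a run.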

We recommend a dataset of at least 30 rows. We also recommend providing a validation dataset, as it allows us to check that your prompt has improved once training has completed.
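If you have a single dataset, one way to obtain a validation set is to hold out a fraction of the rows. The sketch below is a minimal example using only the standard library; the `split_rows` helper, the 20% hold-out fraction, and the placeholder rows are assumptions for illustration, not promptquality features or recommendations from this guide.

```python
import random


def split_rows(rows, val_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    num_val = int(len(shuffled) * val_fraction)
    return shuffled[num_val:], shuffled[:num_val]


# Placeholder rows with the columns used in the RAG example.
rows = [
    {"context": f"context {i}", "question": f"question {i}", "target": f"answer {i}"}
    for i in range(50)
]
train_rows, val_rows = split_rows(rows)
```

With 50 rows this leaves 40 for training, above the recommended minimum of 30, and 10 for validation. Remember that a non-zero num_val_rows in the configuration requires a validation dataset to be supplied.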


The optimizer assumes the following:

  • You have an OpenAI API key stored in your production Galileo environment, which we will use to query models.

  • You have access to GPT-4o. We only use it for editing the prompt, so it should consume only 20-30 calls. If you do not have access, we fall back to GPT 3.5.

Example Evaluation Criteria Templates

Applications with target answers

Note: the prompt fed to the evaluation LLM refers to the target column as "expected answer". For best results, use the same wording in your criteria.

  • RAG: "Does the LLM output match the expected answer? If the model says it does not have enough context to answer the question, give it a 0. Otherwise, judge whether a human would grade the output as matching the expected answer. Adding context around the answer is fine as long as the answer is correct according to the expected answer. If it does match, give it a 1. If it does not, give it a 0."

  • Math: "Does the output align with the expected answer? The questions are math questions. Check if the answer matches the expected answer. Give it a 1 if a math teacher would consider the answer correct. Give it a 0 if the answer is incorrect. Do not worry about intermediate calculations, only the final answer."

  • General Reasoning: "Does the output align with the expected answer? Check if the logic presented makes sense and the final answer could reasonably be judged as matching the expected answer. Give it a 1 if a well-educated adult would consider the answer correct. Give it a 0 if the answer is incorrect."

Applications without target answers

  • RAG: "Does the LLM answer the question completely, based only on the information given? Be harsh and make sure the system adheres to and only uses the information given in the context. If it completely adheres to the context, give it a 1; otherwise give it a 0. It is ok to acknowledge the context does not have the answer, but make absolutely sure that the answer to the question is nowhere in the context before giving a 1. Be very harsh."

  • Chat bot assistant: "Act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider how helpful, thoughtful, informative and thorough an answer is. Only give perfect answers a 1. Give everything else a 0."
