Using Evaluation Metrics for Quantitative Analysis of Prompts

When adding test cases, you can optionally add one or more Evaluation Metrics (evals) to each test case.

In your workshop, you'll have one or more prompts and one or more test cases. Each test case contains a set of input values for the variables used in the prompt, and every test case runs against every prompt.

The evals also run for each test case, and each eval returns a score between 0 and 1. Each eval lets you enter a target value that is used to measure the quality of the AI-generated response to the prompt.

You can run your test cases and see every response along with its eval scores. In addition to visually inspecting the AI-generated responses for qualitative analysis, you can use the eval scores for quantitative analysis, which makes it easier to modify prompts iteratively and confirm that each change actually improves the scores.

The scores also let you compare AI-generated results over time, which helps you monitor for model drift.
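
To make the workflow concrete, here is a rough sketch of the loop described above: every test case runs against every prompt, and every eval scores the resulting response between 0 and 1. The generate function and the eval objects are hypothetical stand-ins for illustration, not part of any specific API.

def run_test_cases(prompts, test_cases, evals, generate):
    # generate(prompt, case) is a hypothetical helper that fills the prompt's
    # {{ variables }} from the test case and returns the model's response.
    results = []
    for prompt in prompts:
        for case in test_cases:  # case is a dict of variable values, e.g. {"text": "..."}
            response = generate(prompt, case)
            # Each eval compares the response to its target value and returns a score from 0 to 1.
            scores = {e.name: e.score(response, e.target) for e in evals}
            results.append({"prompt": prompt, "case": case, "response": response, "scores": scores})
    return results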

You can use any of the following Metric Types:

Exact Match

For an Exact Match evaluation, enter the exact expected response in the target value field. This is helpful for classification or sentiment analysis tasks. For example, if your prompt was:
prompt = "Classify the text into neutral, negative or positive. 
{{ text }}
Sentiment:
"
For a test case where the text input variable value is "This movie is definitely one of my favorite movies of its kind", you could enter "positive" in the target value field. The evaluation scores a "1" if the AI-generated response exactly matches "positive" for this test case; otherwise, it scores a "0".

The Exact Match metric type is ideal for classification tasks where the response must match an expected label exactly.
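
The scoring itself amounts to a string comparison. As a rough sketch rather than the platform's actual implementation, an exact-match check in Python might look like this (stripping surrounding whitespace is an assumption, not documented behavior):

def exact_match_score(response: str, target: str) -> int:
    # Return 1 when the response matches the target value exactly, else 0.
    # Trimming surrounding whitespace is an assumption for illustration.
    return 1 if response.strip() == target.strip() else 0

print(exact_match_score("positive", "positive"))                     # 1
print(exact_match_score("The sentiment is positive.", "positive"))   # 0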

Regex Match 

For the Regex Match evaluation, enter a regex pattern in the target value field to specify the format or characteristics the response should satisfy. This is particularly useful for checking whether responses adhere to an expected pattern while allowing flexibility in the wording. For example, if your prompt is:
prompt = "Provide a summary of the customer's feedback in three words or less. 
{{ text }}
Summary:
"
To check that the response is exactly three words, no more and no less, you could enter the regex pattern ^\b(\w+\b\s?){3}$ in the target value field. The evaluation scores a "1" if the AI's response matches the pattern and a "0" if it does not.
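
As an illustration only (the platform's internal matching semantics are not documented here), a Python check along these lines could apply that pattern:

import re

# Hypothetical sketch of a Regex Match check using Python's re module.
# The ^ and $ anchors in the target pattern force the whole response
# to consist of exactly three words.
THREE_WORDS = r"^\b(\w+\b\s?){3}$"

def regex_match_score(response: str, pattern: str) -> int:
    # Return 1 if the response matches the regex pattern, else 0.
    return 1 if re.search(pattern, response.strip()) else 0

print(regex_match_score("Fast friendly service", THREE_WORDS))          # 1
print(regex_match_score("The delivery was far too slow", THREE_WORDS))  # 0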

The regex match metric type is useful for ensuring responses meet specific formatting rules or contain certain elements without requiring exact text matches.

Another example is using a regex pattern to check whether a word or phrase is included in the AI-generated text. For example, if you specified that the AI should cite its sources in the format "citation:", you could enter the regex pattern \b(citation:) in the target value field. The evaluation scores a "1" if the AI-generated response matches this pattern and a "0" if it does not.
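
The same style of check covers this contains-style pattern; the response text below is made up purely for illustration:

import re

# Look for the "citation:" marker anywhere in the response.
CITATION = r"\b(citation:)"

response = "Employees accrue 1.5 vacation days per month. citation: HR handbook, section 4"
print(1 if re.search(CITATION, response) else 0)  # 1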

Similarity 

The Similarity evaluation method allows for a more nuanced assessment of the AI's response by comparing the semantic closeness of the AI-generated text to a target text. This method is valuable for tasks where the exact wording may vary but the underlying meaning or intent should be consistent. For example, if your prompt is:

prompt = "Summarize the main idea of the following text in a single sentence.
{{ text }}
Summary:
"
If the variable value is a block of text about a complex topic, you would enter a concise summary that captures the essence of the topic in the target value field. For instance, if the topic is renewable energy, you might enter "Renewable energy sources are essential for sustainable future energy solutions." as the target text. The similarity evaluation then computes how closely the AI's summary matches the meaning of your target summary, scoring closer to "1" for highly similar responses and closer to "0" for responses that diverge significantly in content or meaning.

This evaluation type is especially suited for summarization, paraphrasing, creative writing, email copy, marketing copy, or any task where conveying the same idea in different words is acceptable or desired.
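
The exact similarity model the tool uses isn't specified here. As a rough approximation, you could compute cosine similarity between sentence embeddings, for example with the open-source sentence-transformers library; the model name and example texts below are assumptions chosen for illustration:

from sentence_transformers import SentenceTransformer, util

# Rough approximation of a Similarity eval: embed the target text and the
# AI response, then take the cosine similarity of the two embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption

target = "Renewable energy sources are essential for sustainable future energy solutions."
response = "A sustainable energy future depends on adopting renewable power sources."

embeddings = model.encode([target, response])
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(round(score, 2))  # closer to 1 means the meanings are highly similar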
