A consistent challenge for prompt engineers is the probabilistic nature of the responses generated by Large Language Models (LLMs). LLMs, by design, generate responses based on the probabilities derived from the data they were trained on. This means that even with the same prompt template, an LLM might produce different responses at different times, with different template variables, or under other slightly varied conditions.
This probabilistic behavior can introduce unpredictability in the responses, making it challenging for prompt engineers to test and compare various prompt versions. Also, as LLM providers update their models over time, it can be difficult to monitor ongoing consistency and reliability, especially for applications requiring high accuracy and precision. Engineers must therefore employ various strategies and evaluations to mitigate these challenges and refine prompts to guide the model towards generating the desired output more consistently.
This article explores Shiro's comprehensive approach to evaluating AI-generated responses when testing prompts, which combines both quantitative and qualitative assessments to help engineering teams refine and monitor prompt performance for their AI-powered applications.
Qualitative Assessment
Through a combination of approaches, we aim to build a holistic view of AI-generated responses across various prompts, comparing the quality and relevance of the outputs. The primary methods available for qualitative assessment of the responses generated by various test cases include:
- Visual Inspection
- Implicit and Explicit Feedback
- AI-powered Analysis
Visual Inspection: The initial step in our evaluation process involves a qualitative assessment, primarily through visual inspection of the AI-generated responses across various iterations. This approach allows us to gauge the immediate quality and relevance of responses based on different prompts and variables. By examining the nuances in AI responses, we can identify areas for improvement in understanding and generating contextually appropriate outputs.
Implicit and Explicit Feedback: Once a model is deployed, incorporating user feedback becomes crucial. We can collect both explicit feedback, such as "thumbs up" or "thumbs down" reactions, and implicit feedback, which can be inferred from user interactions, such as whether they continue to ask related questions or if their query was resolved. This user-centric feedback mechanism enables continuous model refinement and enhances response accuracy.
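As an illustration of how this feedback might be captured, the record below models explicit and implicit signals for a single response. The class and field names are hypothetical and not part of Shiro's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseFeedback:
    """Hypothetical feedback record attached to one AI-generated response."""
    response_id: str
    explicit_rating: Optional[bool] = None  # True = thumbs up, False = thumbs down
    follow_up_questions: int = 0            # implicit signal: related questions asked afterwards
    query_resolved: Optional[bool] = None   # implicit signal: did the user's query get resolved?
```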
AI-powered Analysis: To further our qualitative assessment, we are developing a feature that employs an additional call to an AI for bias evaluation. For instance, we might utilize the GPT-4 model to assess potential bias in responses generated by GPT-3.5. This process involves a specific prompt designed to determine the presence of bias in the text, expecting a simple "yes" or "no" answer. Through this method, we can ensure that responses not only meet accuracy standards but also adhere to principles of fairness and neutrality.
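A minimal sketch of such a check is shown below, using the openai Python client. The prompt wording, model choices, and function name are illustrative assumptions, not Shiro's internal implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

BIAS_PROMPT = (
    "Does the following text contain biased language or unfair assumptions? "
    "Answer with only 'yes' or 'no'.\n\nText:\n{text}"
)


def is_biased(response_text: str) -> bool:
    """Ask a second, stronger model whether a generated response appears biased."""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": BIAS_PROMPT.format(text=response_text)}],
        temperature=0,
    )
    answer = result.choices[0].message.content.strip().lower()
    return answer.startswith("yes")
```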
Quantitative Assessment
While qualitative analysis of AI-generated responses is critical, quantitative metrics can provide more structured and objective data for measuring prompt effectiveness. By defining a set of metrics and setting target values for each, prompt engineers can iteratively adjust prompts by tweaking language, parameters, or even models to improve these scores.
This iterative process can greatly improve response accuracy and reliability. Metrics also provide an objective way to measure prompt responses over time and to monitor for model drift as the underlying models are updated by LLM providers.
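For example, a test run could be gated on per-metric targets along these lines; the metric names and threshold values below are placeholders for illustration, not Shiro defaults.

```python
# Hypothetical per-metric targets for a prompt's test suite.
TARGETS = {"cosine_similarity": 0.8, "json_validation": 1.0}


def meets_targets(scores: dict[str, float], targets: dict[str, float]) -> bool:
    """Return True only if every tracked metric reaches its target value."""
    return all(scores.get(metric, 0.0) >= target for metric, target in targets.items())


# Iterate on the prompt until the averaged eval scores clear the targets.
print(meets_targets({"cosine_similarity": 0.84, "json_validation": 1.0}, TARGETS))  # True
```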
Evals
To streamline the evaluation process, Shiro incorporates Evaluation Metrics (evals) into your prompt engineering testing framework. These evals, applied to each test case, return a score between 0 and 1, providing a quantifiable measure of the AI-generated response's quality.
- Cosine Similarity
- Exact Match
- Regex Match
- JSON Validation
Cosine Similarity: This metric evaluates the semantic closeness of an AI-generated response to a pre-defined target text, which is valuable for tasks where conveying the same idea in different words is acceptable. By converting both texts into vector embeddings and calculating the cosine similarity between them, we establish a benchmark for response quality. The metric returns a score between 0 and 1, where 0 is not at all similar and 1 is very similar; for example, a score above 0.7 typically indicates a high degree of similarity between two pieces of text.
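A sketch of how such a score can be computed, assuming the sentence-transformers library for embeddings (Shiro's actual embedding model and pipeline may differ):

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works here; this one is a common lightweight choice.
model = SentenceTransformer("all-MiniLM-L6-v2")


def cosine_similarity_score(response: str, target: str) -> float:
    """Embed both texts and return their cosine similarity, clamped to the 0-1 range."""
    embeddings = model.encode([response, target], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return max(0.0, min(1.0, score))


print(cosine_similarity_score("The meeting is at noon.", "We meet at 12 pm."))  # typically a high score
```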
Exact Match: This metric is most useful for tasks requiring precise answers. Exact match is ideal for classification tasks and semantic analysis, or for validating specific formats where precise responses are critical. An exact match scores a 1; any other response scores a 0.
Regex Match: This metric ensures responses meet specific formatting or content requirements. It is particularly useful for checking whether responses adhere to an expected pattern while allowing some flexibility in the answer. The evaluation scores a 1 if the AI-generated response matches the specified pattern and a 0 if it does not.
JSON Validation: This metric is essential for prompts where the AI is instructed to respond with JSON. The response is checked for validity: a parseable JSON response scores a 1, and any response containing invalid JSON scores a 0. We are also working on a feature that evaluates JSON responses against an expected schema or checks for specific keys.
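A minimal sketch of how these three binary evals (exact match, regex match, and JSON validation) can be scored; the helper functions below are illustrative rather than Shiro's internal code.

```python
import json
import re


def exact_match(response: str, expected: str) -> float:
    """Score 1 only when the response matches the expected answer exactly."""
    return 1.0 if response.strip() == expected.strip() else 0.0


def regex_match(response: str, pattern: str) -> float:
    """Score 1 when the response matches the expected pattern (e.g. a date format)."""
    return 1.0 if re.search(pattern, response) else 0.0


def json_validation(response: str) -> float:
    """Score 1 when the response parses as valid JSON."""
    try:
        json.loads(response)
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```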
Performance
In addition to the quantitative eval metrics, Shiro also tracks the following Performance Metrics for each test case.
- Latency
- Prompt Tokens
- Completion Tokens
Latency: A metric to measure how long it takes in milliseconds for the API call to the model to execute and return a complete response.
Prompt Tokens: The number of tokens in the computed prompt body (which combines the prompt template with the variable input values provided for the test case).
Completion Tokens: The number of tokens in the AI-generated response.
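These values can be captured around a single non-streaming model call, as in the sketch below using the openai client. Shiro records these metrics automatically, so this code is only an illustration of what gets tracked.

```python
import time

from openai import OpenAI

client = OpenAI()


def timed_completion(prompt_body: str) -> dict:
    """Call the model once and record latency plus prompt/completion token counts."""
    start = time.perf_counter()
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt_body}],
    )
    latency_ms = (time.perf_counter() - start) * 1000  # time to return a complete response
    return {
        "latency_ms": round(latency_ms),
        "prompt_tokens": result.usage.prompt_tokens,          # tokens in the computed prompt body
        "completion_tokens": result.usage.completion_tokens,  # tokens in the AI-generated response
    }
```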
By employing Shiro's blend of quantitative and qualitative assessments, prompt engineers can significantly improve prompt effectiveness and reliability. This approach not only improves quality but also lays the groundwork for continuous improvement through user feedback and iterative testing. As we further refine these evaluation techniques, the potential for creating highly responsive, accurate, and unbiased AI applications becomes increasingly attainable.