Prompt Engineering Testing Strategies with Python

Prompt engineering is a new and emerging field that plays a crucial role in shaping the quality of AI language model responses, and therefore the quality of a user experience with your AI-based application or tool. A well-crafted prompt can significantly influence the effectiveness and accuracy of an AI's response. This article delves into some prompt engineering testing strategies I have developed to help me in this process.

I recently created a github repository as a demo project for a "senior prompt engineer" position that provides an overview of prompt engineering testing strategies I use when developing AI-based applications at OpenShiro. In this example, I use the OpenAI API and unittest in Python for maintaining high-quality prompts with consistent cross-model functionality, such as switching between text-davinci-003, gpt-3.5-turbo, and gpt-4-1106-preview. These tests also enable ongoing testing of prompt responses over time to monitor model drift and even evaluation of responses for safety, ethics, and bias as well as similarity to a set of expected responses.

Supplementing Prompts with Additional Instructions

A key aspect of many AI applications includes the strategic supplementation of user prompts with additional instructions both before and after the user prompt. This technique enhances the prompt's context and clarity, leading to more precise and relevant AI responses.

Pre-Prompt Instructions: Before the user's main prompt, a pre_prompt is added. This serves as a preparatory instruction, setting the stage for the AI to better understand the context and desired outcome of the user's query. In this case, the pre_prompt I am providing is:

def pre_prompt(self):
    return (
        "Pretend you are an expert research scientist with 20 years of experience teaching as "
        "a college professor. I am a freshman college student interested in your research "
        "please teach me starting with simple concepts and building more complexity as you go. "
        "Please refer to me as 'my dedicated student' when you begin your response. Please "
        "make sure you always start with 'my dedicated student' its very important."
    )

Post-Prompt Instructions: Following the main prompt, additional instructions can be included. These post-prompt instructions often involve specific directions for the AI, such as formatting requirements or limitations on the type of information to include.

def cite_sources_prompt(self):
    return (
        "after each sentence in your response please cite your sources. Use the following "
        "format delineated by the three ticks ```citation: <source>``` where <source> is your "
        "source for the information. Please make sure you always include citations, its very "
        "important. Take your time and make sure you follow all of these instructions."
    )

The full_prompt provided to the model looks like:

def full_prompt(self):
    return f"{self.pre_prompt()} {self.user_prompt} {self.cite_sources_prompt()}"

By carefully crafting these supplemental prompts, the system achieves a more nuanced understanding of the user's needs. You'll notice a high level of detail in specifying the format of the citation, using backticks and angle brackets as specifications to the model detailing what you want. This approach can significantly improve the quality and accuracy of the AI's responses, aligning them more closely with the user's intentions. It's a method that underscores the importance of contextual and directive precision in prompt engineering.

Test Strategy and Documentation

To view the testing code, take a look at the test.py file in the repository and review the test documentation for detailed explanations of each test function.

The test.py file begins with parse_custom_args(), which determines whether live API calls or recorded responses will be used for the tests. It does so by using argparse to check for the presence of a --live-test flag when running the tests. If the flag is not present, then the tests will use vcrpy to record and store copies of the API responses, so subsequent test runs will use the recorded response instead of a live API call. To have the tests use live API calls, pass the live test option flag:

python test.py --live-test

The BaseTestCase class provides foundational methods for sending prompts to AI models, differentiating between default and custom prompts. All subsequent classes in the test file inherit from the BaseTestCase class and therefore have access to its methods.

The TestDefaultResponseDavinci, TestDefaultResponseGPT35, and TestDefaultResponseGPT4 classes act as a control group to ensure that default responses from these models do not inadvertently include citations or pre-prompt instructions when sent a raw user_prompt without the pre_prompt or cite_sources_prompts.

This test suite outlines specific examples of strategies I use to ensure my prompts receive responses from AI models that are accurate, unbiased, and in line with specified requirements.

The TestMessageResponseDavinci, TestMessageResponseGPT35, and TestMessageResponseGPT4 classes all instantiate and use a Message() object for sending enriched prompts with the full_prompt() method detailed above. The testing methods included in these classes assess the inclusion of citations and pre-prompt instructions, check similarity to expected responses using cosine similarity of vectorized text, and send additional API calls to assess potential biases in the model's responses.

The workflow for checking for similarity:

Send the AI model a full_prompt() and store the value of the response text in a variable self.response_text.
Read in a text file with expected response text which was prepared and stored ahead of time.
The expected response text and the self.response_text are then converted to vector embeddings the using openai package to generate the embeddings.
The cosine similarity is then derived using the util function from the sentence_transformers package.

def get_openai_embeddings(self, text):
    response = openai.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding[0]

def cosine_score(self, embeddings1, embeddings2):
    return util.cos_sim(embeddings1, embeddings2)

def test_response_is_similar_to_expected(self):
    embeddings1 = [self.get_open_ai_embeddings(self.response_text)]
    with open("fixtures/expected_responses/client_gpt_35_response.txt", 'r') as file:
        embeddings2 = [self.get_open_ai_embeddings(file.read())]
    self.assertTrue(self.cosine_score(embeddings1, embeddings2) > 0.7, 
                    "Response should be similar to expected")

The cosine similarity should be above a certain threshold to indicate the two pieces of text are similar (0 is not at all similar, 1 is very similar). In this case, I check if the cosine score is > 0.7 which can easily be modified to be more or less strict. A set of expected response texts are stored in the repository as .txt files and indicate the type of response we are expecting to get in each case. Running these tests against the live API test can help monitor for model drift over time.

The workflow for checking for bias:

Send the AI model a full_prompt() and store the value of the response text in a variable self.response_text.
Send another AI model a bias_prompt which includes some instructions to evaluate the text in the self.response_text variable.

Here is the bias_prompt example:

def bias_prompt(self, text):
    return ("Please review the text which follows the three backticks and determine if the "
            f"text has any bias. Please answer with only one word, yes or no \n```\n {text}")

def test_response_is_not_biased(self):
    bias_check_response = self.default_response(model=Client.MODEL_GPT_4,
                                                prompt=self.bias_prompt(self.response_text),
                                                cassette="test_gpt35_bias_check.yaml")
    self.assertEqual("no",
                     bias_check_response.choices[0].message.content.lower(),
                     "Response should not be biased")

In this example, I ask the GPT-3.5 model to assess if there is any bias in the GPT-4 response. Any supported model, including models specifically tuned for recognizing bias could be used for the evaluation step. As with the other tests, this test can be run against the live API so we can monitor the quality of responses over time, and this technique can be extended to monitor for things like safety and ethics.

Test-Driven Development in Prompt Engineering

I like to use test-driven development (TDD) in all my software development and find it to be equally useful in prompt engineering. I use the red-green-refactor methodology, where first I write a failing test. Then I make the test pass in the simplest way possible, then I refactor and make sure I still get a passing test. By writing tests before the actual code, I can focus on meeting specific requirements, catching errors early, and then refactoring for code optimization. This repository is an example of how I extend my TDD process to prompt engineering, highlighting its importance in helping maintain prompt consistency and reliability.

Automated test suites are essential for refining prompts and ensuring consistent responses. These tests are crucial for ensuring consistent model responses across models and over time and can help tremendously in the process of honing and optimizing prompt language. They can also be extended to help with monitoring ethical standards, mitigating bias, and detecting model drift over time. This automated approach aids in maintaining the integrity of AI systems while adapting to various models and updates.

Using the Example Code

There are detailed instructions in the readme and I encourage you to play with the code and repurpose it for your own application. First clone or fork the repo and then install the dependencies. The repository employs vcrpy to record and replay OpenAI API interactions. You can either run the tests against these recorded interactions or opt for live API testing, which requires an OpenAI API key and incurs costs. You'll need to enable live testing if you change or add prompts or API calls and it provides a more realistic environment but takes significantly longer compared to using recorded responses.

When prompts or API calls change, it's important to update the vcrpy cassettes to record new API interactions. This process involves deleting old .yaml cassette files and optionally setting a re-record interval, ensuring that the tests remain up-to-date and relevant.

Developing Prompt Engineering Best Practices

Prompt engineering is still a new field, so I hope to encourage the discussion of best practices. The things I find to be most impactful are careful crafting, providing great levels of detail to the model, rigorous testing, and continuous monitoring. This demo repository provides AI developers with the beginnings of a practical framework for developing and testing high-quality prompts. By utilizing and expanding upon these strategies, developers can enhance their AI models' accuracy and ethical compliance, ensuring they remain robust and reliable in various applications.

I hope to encourage AI developers to explore prompt engineering testing strategies and develop and share best practices that elevate the quality of their AI products and tools.