"Prompt Engineering" Needs Prompt Evaluation

Introduction

Prompts are the input we provide to a large language model to get output that helps with our problem. For example, if we want to classify text as "Positive" or "Negative", we might provide a prompt like "Classify this \"{{text}}\" as \"Positive\" or \"Negative\"." However, once you have a prompt, you should have some sense of how well it is working, and that requires a process for evaluating it.

Prompt evaluation

The exact means by which you evaluate your prompt depend on your problem. If your problem is sentiment classification, as described earlier, you'll probably want a validation set to tune your prompt against, and then a held-out test set that you run your prompt against once you're satisfied with its results on the validation set.
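To make that concrete, here is a minimal sketch in Python of what that loop can look like. The prompt template, the validation examples, and the call_llm function are all placeholders for illustration; call_llm stands in for whatever model client you actually use.

PROMPT_TEMPLATE = 'Classify this "{text}" as "Positive" or "Negative".'

validation_set = [
    {"text": "I love this product", "label": "Positive"},
    {"text": "It broke after two days", "label": "Negative"},
]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your actual model client (OpenAI, Anthropic, a local model, ...).
    raise NotImplementedError

def evaluate(examples) -> float:
    # Fraction of validation examples where the model output matches the label exactly.
    correct = 0
    for example in examples:
        prompt = PROMPT_TEMPLATE.format(text=example["text"])
        prediction = call_llm(prompt).strip()
        if prediction == example["label"]:
            correct += 1
    return correct / len(examples)

# Tune the prompt against validation accuracy, then run the untouched test set once at the end.

The key discipline is that the test set is only scored once, after prompt tuning is finished, so it gives an honest estimate of real-world performance.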

Alternatively, if you have a text generation problem, you might want to use a large language model to evaluate the output generated with your prompt. For example, let's say you want to generate a summary of customer reviews for a product. The evaluation prompts we might use could be binary classification prompts such as the following:

Does the following summary of customer reviews contain any offensive language?

{{summary}}

Please only output "Yes" or "No".

The issue here, of course, is that we're using an LLM to evaluate an LLM, which isn't ideal, but there really isn't a better way to do this at scale (the other option is human review). Also note that you want to use a "strong" LLM for this task: a highly capable, general-purpose model. An example of a strong LLM is GPT-4.
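As a rough sketch, assuming a placeholder judge_llm function standing in for a call to a strong model, the offensive-language check above could be wired up like this:

JUDGE_PROMPT = (
    "Does the following summary of customer reviews contain any offensive language?\n\n"
    "{summary}\n\n"
    'Please only output "Yes" or "No".'
)

def judge_llm(prompt: str) -> str:
    # Placeholder: swap in a call to the strongest general-purpose model you have access to.
    raise NotImplementedError

def summary_is_clean(summary: str) -> bool:
    verdict = judge_llm(JUDGE_PROMPT.format(summary=summary)).strip().lower()
    return verdict.startswith("no")  # "No" means the judge found no offensive language

In practice you would run this check over a batch of generated summaries and track the pass rate over time, and you can add further binary judge prompts (faithfulness to the reviews, completeness, tone) in the same pattern.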

As an aside, if your problem is code generation, you can maintain a suite of test cases for your problem and run that suite against the generated code. This will tell you how effective your prompt's code output actually is.
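A minimal sketch of that idea, assuming the prompt asks for a function named add and using a hand-written set of test cases, might look like the following (exec is used here purely for illustration; real generated code should be run in a sandbox):

generated_code = """
def add(a, b):
    return a + b
"""

test_cases = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]

def run_suite(code: str) -> float:
    # Execute the generated code and report the fraction of test cases that pass.
    namespace = {}
    exec(code, namespace)  # in real use, isolate generated code in a sandbox instead
    fn = namespace["add"]
    passed = sum(1 for args, expected in test_cases if fn(*args) == expected)
    return passed / len(test_cases)

print(run_suite(generated_code))  # 1.0 means every test case passed

The pass rate across many generation attempts then becomes the metric you tune your prompt against.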

Additional ideas

Another important practice for understanding your prompt's performance, and for continuing to grow your test suite, is to log the inputs to your prompt and the outputs it produces, along with some way to determine whether each output meets your needs. Collecting user feedback is one way to do this. For example, when Midjourney generates an image it presents five buttons: four let you choose one of the generated images, and one tells Midjourney that you reject the output. The first four buttons are positive feedback for the model; the last button is negative feedback. Collecting user feedback in an analogous manner helps you add new test cases to your prompt's test suite.
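A minimal logging sketch along those lines, with a made-up file name and record shape, could be as simple as appending JSON lines with a feedback field:

import json
import time

def log_interaction(prompt_input, output, feedback=None):
    # feedback might be "accepted" or "rejected", analogous to Midjourney's buttons.
    record = {
        "timestamp": time.time(),
        "input": prompt_input,
        "output": output,
        "feedback": feedback,
    }
    with open("prompt_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

# Rejected outputs are good candidates for new test cases in your evaluation suite.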

As unscientific as it may be, it can also be valuable to interact with your LLM deployment directly and perform exploratory testing to determine whether you feel comfortable with the system's performance. In a similar fashion, it can be valuable to monitor user engagement with your deployment to check whether it matches your expectations.

Conclusion

Generally speaking, any deployment of generative AI should have some means of evaluating whether it performs up to your standards. That could entail using a strong LLM to evaluate the output of your system, using traditional classification metrics (e.g. precision), or something else entirely, but some means of evaluation is necessary. On top of that, collecting user feedback and performing exploratory testing are necessary.