One of the key ingredients for effectively implementing an LLM pipeline is a way to evaluate its efficacy. That is, you need to evaluate the final output, which is the product not just of the LLM itself or the prompt, but of the interaction between the LLM, the prompt, and settings such as temperature or the minimum and maximum number of tokens.
Consider the boilerplate code to access the GPT API (autogenerated):
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")
# The original snippet was truncated here; the arguments below are
# completed with representative defaults.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarise the following document: ..."}],
    temperature=1,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)
There are seven arguments in the function call that creates the response, each of which alters the final output. Choosing the optimal combination depends on being able to evaluate and differentiate the outputs produced by different values of these arguments.
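As a hedged illustration, comparing outputs across settings might start with a small sweep like the one below. Here `generate` is a stand-in for the real API call (so the sketch runs offline), and the grid of temperature and max-token values is arbitrary:

```python
from itertools import product

def generate(prompt, temperature, max_tokens):
    # Stand-in for the real openai.ChatCompletion.create call; it simply
    # echoes the settings so the sweep is runnable without an API key.
    return f"[temperature={temperature}, max_tokens={max_tokens}] {prompt}"

# Sweep a small grid of settings and collect the outputs side by side,
# so they can be evaluated against one another.
prompt = "Summarise the following document: ..."
outputs = {
    (t, m): generate(prompt, temperature=t, max_tokens=m)
    for t, m in product([0.2, 0.7, 1.0], [128, 256])
}

for settings, text in sorted(outputs.items()):
    print(settings, text)
```

With the real API call swapped in, the collected outputs become the raw material for whatever evaluation you choose, whether human judgement or an automated metric.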
This is a different problem from the LLM evaluations most commonly found in papers or on LLM makers' websites. While the LLM you're using may be able to pass the bar exam or a similar test advertised in these sources, that doesn't mean your pipeline, with the prompt you created and the settings you chose, will necessarily summarise a collection of legal documents the way you need.
This is especially the case when you are building a pipeline for an external user, and therefore can't adjust the prompt on the fly. For example, suppose you embed an LLM solution via an API and use a basic prompt skeleton to generate descriptions of particular items, such as those in a catalogue. There are two levels of suitability to consider:
Firstly, are the answers you generate fit for purpose?
Secondly, can you rely on the answers continuing to be fit for purpose with future iterations?
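The two levels above can be sketched as a pair of checks. The `generate` stub and the acceptance criteria here are illustrative assumptions, not a real evaluation rubric; in practice `generate` would be the full pipeline (prompt skeleton, LLM, and settings):

```python
def generate(item_name):
    # Placeholder for the real pipeline call; deterministic here so the
    # sketch is runnable.
    return f"{item_name}: a durable, well-made addition to any home."

def fit_for_purpose(description, item_name):
    # Level 1: judge a single answer against illustrative acceptance
    # criteria (mentions the item, sensible length).
    return item_name in description and 20 < len(description) < 400

def regression_check(item_names):
    # Level 2: rerun the same fixed test set (e.g. after a model or
    # prompt change) and confirm every answer still passes level 1.
    return all(fit_for_purpose(generate(name), name) for name in item_names)

test_items = ["Thermo Flask 500ml", "Enamel Mug"]
print(regression_check(test_items))  # True when every item still passes
```

The value of the second check is that it can be rerun unchanged whenever the model, prompt, or settings are updated, turning a one-off judgement into an ongoing guarantee.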
In a sense, the first can be assessed by looking at one or several answers in isolation. If you judge them to be suitable, you're over the line. However, to assess the long-term reliability of the…