Unsupervised LLM Evaluations

Practitioner’s guide to judging outputs of large language models

23 hours ago

<TLDR>

Evaluating AI-generated outputs is critical for building robust applications of large language models because it allows complex AI applications to be split into simple stages with built-in error control.

It is relatively straightforward to evaluate generative outputs in a supervised mode, where the “right answers” can be computed or supplied by human evaluators.

At the same time, in many practical LLM applications the supervised approach is too restrictive, and there is a need for evaluations capable of tackling open-ended questions. The simplest way to build an unsupervised evaluator is to ask an LLM to evaluate itself. However, the ability of generative models to detect errors in their own output is not well understood.

We demonstrate that the quality of self-evaluations can be improved with iterative self-reflection. Similar to the “Chain of Thought” technique, this method trades compute at inference for the robustness of the final result.

</TLDR>

Link to Google Colab notebook with examples:

https://colab.research.google.com/drive/1q_dChQBMbnUXZ377JVwYsjvn7lZ_7qlZ?usp=sharing

Image source: Flux 1. Pro model prompted for “robot evaluating other robots”

Introduction

When building processing pipelines using large language models, an often-mentioned issue is the quality of generated outputs. If a good evaluation process is in place, it can highlight cases of poor performance and trigger LLM fine-tuning, prompt adjustments, escalation to human agents, or all of these actions at once.

Here is a typical workflow that uses evaluations for training: an LLM goes over the input dataset, and any output discrepancies detected by the evaluator are used to generate synthetic data to fine-tune the model. The application is deployed only when the target quality metrics are met.

Image by the author: Evaluation loop for LLM fine-tuning

Using LLM evaluators in production is very similar — except that detected discrepancies are usually sent to a human agent to ensure the workflow can continue despite raising an error flag.
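The training loop described above can be sketched roughly as follows. This is a minimal illustration rather than the implementation behind the diagram: `generate`, `evaluate`, and `fine_tune` are hypothetical callables supplied by the caller, and the 0.95 quality target is an arbitrary placeholder.

```python
from typing import Callable, List, Tuple


def evaluation_loop(
    inputs: List[str],
    generate: Callable[[str], str],        # hypothetical: wraps the LLM under test
    evaluate: Callable[[str, str], bool],  # hypothetical: True if the output is acceptable
    fine_tune: Callable[[List[Tuple[str, str]]], None],  # hypothetical: consumes flagged cases
    target_quality: float = 0.95,          # assumed threshold, not from the article
    max_rounds: int = 5,
) -> bool:
    """Return True once the share of acceptable outputs reaches the target."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    for _ in range(max_rounds):
        flagged = []
        for item in inputs:
            output = generate(item)
            if not evaluate(item, output):
                flagged.append((item, output))
        quality = 1 - len(flagged) / len(inputs)
        if quality >= target_quality:
            return True      # quality target met: ready to deploy
        fine_tune(flagged)   # e.g. build synthetic training data from the failures
    return False
```

In a production setting, the same skeleton would pass the flagged cases to a human agent instead of calling `fine_tune`.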

However, building a good LLM evaluator is not trivial. The complexity of this problem stems from two practical restrictions:

First, it is highly desirable to minimize human involvement in evaluations. For example, imagine a chatbot interacting with a user and failing to handle a common colloquial pattern of ellipsis (answering with one word instead of a full sentence):

Bot: Is that correct?

User: correct

Bot: Sorry, I didn’t get that. Please try again.

User: yes it is correct

Given this dialog section, a human can easily spot the deficiency in the chatbot’s response and suggest a course of fine-tuning. However, to find this problem, a human evaluator would have to read the entire dialog (which can be very long). This approach does not work at scale, which means we should strive for evaluation without humans.

Second, the process of judging the LLM output without knowing the “ground truth” is comparable in complexity to the original task. This means a state-of-the-art LLM can (at most) employ an evaluator with similar capabilities (most likely itself), thus raising questions about the validity of such evaluation.

Supervised evaluations

If we look at the well-studied ways to evaluate LLMs today, we will notice they mostly center on supervised or semi-supervised use cases.

If the training dataset comes with “ground truth” answers, evaluation becomes trivial — and can even drive optimization frameworks like DSPy. The same is true when testing an enterprise LLM app against historical cases handled by human agents, where the “ground truth” equates to the judgments of those agents.
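As a small illustration of why the “ground truth” case is easy, here is one possible metric: normalized exact-match accuracy. The choice of metric and the normalization rule are assumptions made for this sketch; real applications often need a softer comparison.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are not counted as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Share of predictions that match the reference answer after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```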

Another opportunity to check the output against the “ground truth” comes when the LLM output can be formally verified on its own — such as computer code that can be compiled and tested. Despite the fact that a computer program can be written in many different ways, the correct code should pass the tests regardless of the chosen implementation path.
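A minimal sketch of this kind of formal verification is shown below. The helper `passes_tests` is hypothetical and assumes the generated code defines a function with a known name. It runs the code with `exec`, which is only acceptable inside a sandbox when the code comes from a model.

```python
from typing import Any, Iterable, Tuple


def passes_tests(source: str, func_name: str, cases: Iterable[Tuple[tuple, Any]]) -> bool:
    """Execute generated source and check the named function against (args, expected) pairs."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: run untrusted code only in a sandbox
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False
```

Two different implementations of the same function pass the same test cases, which is exactly the property that makes generated code easy to evaluate.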

Cases where the generative output cannot be formally verified usually require adding a human into the loop. For example, RLHF can be used to rate LLM outputs according to ordinal human preferences and thus steer the network toward complicated and nuanced policies.

Unsupervised self-evaluations

Meanwhile, there are many open-ended evaluation cases where the “ground truth” approach cannot be implemented, and RLHF is too lengthy or too costly. This explains the interest in unsupervised self-evaluation techniques.

So, assuming we have an open-ended LLM evaluation question that would normally require human involvement — like “how can this chatbot improve” — what can be done to automate?

An economical evaluation harness can be built if we assume that contemporary large language models with rich semantic representations are inherently capable of self-evaluations. This means you can simply ask the model to evaluate its own output, or use another LLM for the same task to avoid cross-contamination in their training sets.
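The naïve version of this harness takes only a few lines. In the sketch below, `llm` and `judge` are hypothetical callables that take a prompt and return text (for example, thin wrappers around a chat completion API), and the judging prompt is an assumption made for illustration.

```python
from typing import Callable, Optional, Tuple

# Assumed judging prompt; the wording is illustrative only.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Is the answer correct? Reply with exactly one word: True or False."
)


def self_evaluate(
    question: str,
    llm: Callable[[str], str],                      # hypothetical: prompt in, text out
    judge: Optional[Callable[[str], str]] = None,   # optional second model acting as judge
) -> Tuple[str, bool]:
    """Generate an answer, then ask the same (or another) model to judge it."""
    answer = llm(question)
    verdict = (judge or llm)(JUDGE_TEMPLATE.format(question=question, answer=answer))
    return answer, verdict.strip().lower().startswith("true")
```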

Unfortunately, a naïve implementation of LLM self-judging may fail. There are two possible reasons for this.

The first reason is the most obvious: an LLM is not guaranteed to produce coherent outputs on subjects not reflected in its training set or unreachable for retrieval tools (like web search). For example, consider this Google Maps snippet, which depicts an unincorporated town called Mineral in California and a small body of water (Martin Creek) that runs through it.

Mineral, CA vicinity (source: Google Maps)

The town of Mineral does not have its own Wikipedia page, and most LLMs do not use GIS data in training, so when asked which creek passes through the town, an LLM can easily hallucinate:

GPT-4o hallucination example

Mill Creek is a real water body that skirts the (much larger) census-designated place of which Mineral is a part. However, Mill Creek does not pass through the town of Mineral proper, so GPT-4o’s answer is incorrect.

Intuitively, a failure of plain LLM inference that occurs because the subject lies outside the model’s knowledge and beyond the reach of common tools (like web search) should not be easy for the same model to detect.

The second reason for failing self-evals is somewhat more subtle and is driven by the probabilistic nature of LLM outputs. It is easy to demonstrate in cases where evaluation should include multiple observations. For example, consider this dialog between a user and a chatbot:

Bot: I am a chatbot to help you find the best mobile phone plan.

User: I need at least 500Gb of data and priced below $50

Bot: The “Ultimate Freedom” plan delivers uncapped Internet access for $79.99

User: This might actually work.

Bot: Do you want to change the current plan?

User: Is “Ultimate Freedom” domestic only ? Can I have international roaming?

Bot: Sorry, I didn’t get that. Please try again.

User: I am done for now. Please do not make any changes.

Bot: I am a chatbot to help you find the best mobile phone plan.

If we repeatedly ask an LLM to list the improvements that can be made to the bot based on this dialog, we will notice the output list is not stable between runs. One version may look like this:

{["Ensure the bot can handle follow-up questions after suggesting a plan.", 
"Improve bot's recognition of farewell messages.",
"Make sure the bot can assist with additional features or package inquiries."]}

While another may look like this:

{["Ensure the bot can switch contexts and answer follow-up questions.", 
"Enable the bot to recognize when a conversation ends.",
"Improve the bot's ability to provide options for additional services",
"Support input about cost limitations, e.g. users stating a desired price."]}

Even though there is significant overlap between these two answers, it is clear that an exhaustive list of suggestions (similar to what a human expert would produce) was not generated in either case.
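One rough way to quantify this instability is to count how many suggestions in one run have a close counterpart in the other. The sketch below uses character-level similarity from the standard library as a crude stand-in for semantic similarity. Both the method and the 0.6 threshold are assumptions, and an embedding-based comparison would likely work better in practice.

```python
from difflib import SequenceMatcher
from typing import List


def best_match(item: str, others: List[str]) -> float:
    """Highest string-similarity ratio between an item and any item of the other run."""
    return max(
        (SequenceMatcher(None, item.lower(), other.lower()).ratio() for other in others),
        default=0.0,
    )


def run_agreement(run_a: List[str], run_b: List[str], threshold: float = 0.6) -> float:
    """Share of items across both runs that have a close counterpart in the other run."""
    total = len(run_a) + len(run_b)
    if total == 0:
        return 1.0
    matched = sum(best_match(a, run_b) >= threshold for a in run_a)
    matched += sum(best_match(b, run_a) >= threshold for b in run_b)
    return matched / total
```

A score well below 1.0 across repeated runs indicates that no single run can be trusted to be exhaustive.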

The unexpected power of self-reflection

Once we outline the typical failure modes for evaluations, it may seem like using an LLM to judge itself is a bad idea. After all, this sounds like asking a diligent student to re-check their own answers. Since a good student does not make many typos, re-checking merely reflects existing knowledge and should not result in improvements.

However, this is where our intuition about LLMs may go awfully wrong.

In fact, most LLMs are capable of corrective self-evaluation, even when the subject lies outside of their knowledge base.

To illustrate this phenomenon, let us return to the example of GPT-4o hallucinating about the body of water crossing the town of Mineral, CA. Interestingly enough, this particular hallucination can be dispelled during self-evaluation:

Self-evaluation in GPT-4o is capable of reversing hallucinations

So where is the magic?

In this example, the LLM does not have the knowledge or the tools to get the correct answer, so it hallucinates the “most plausible” completion. However, when asked to evaluate itself, it arrives at the conclusion that the facts it can access do not corroborate its previous statement. Even though GPT-4o does not know the correct answer, it can dismiss the incorrect one.

A more sophisticated model (like OpenAI o1) may be slightly harder to treat in the same way because it tends to produce more nuanced responses:

Hallucination in OpenAI o1 is more nuanced.

Instead of hallucinating a completion on the subject it cannot verify, o1 may choose to answer a question it was never asked, such as “Which primary body of water runs near Mineral, CA?”. This evasion means that a direct self-evaluation prompt along the lines of “evaluate as True or False” may fail.

However, a more deliberative way of asking for self-evaluation can still be successful, even if it takes multiple iterations.

This ability of LLMs to self-reflect in an iterative way is, of course, well-known and is somewhat taken for granted in applications like code generation. Here we are just extending the same technique to self-evaluation.
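A deliberative, multi-round self-evaluation of a single claim might look like the sketch below. The prompt wording, the `VERDICT:` convention, and the `llm` callable are all assumptions made for illustration and are not the prompts used in the screenshots.

```python
from typing import Callable, List, Optional, Tuple

# Assumed reflection prompt; earlier replies are fed back as notes.
REFLECT_TEMPLATE = (
    "Claim: {claim}\n"
    "Previous reflections:\n{notes}\n"
    "List what you can and cannot verify about the claim, then finish with a "
    "final line that reads 'VERDICT: True', 'VERDICT: False', or 'VERDICT: Unsure'."
)


def reflect_until_verdict(
    claim: str,
    llm: Callable[[str], str],  # hypothetical: prompt in, text out
    max_rounds: int = 4,
) -> Tuple[Optional[bool], List[str]]:
    """Ask the model to reflect repeatedly until it commits to True or False."""
    notes: List[str] = []
    for _ in range(max_rounds):
        prompt = REFLECT_TEMPLATE.format(claim=claim, notes="\n".join(notes) or "(none)")
        reply = llm(prompt).strip()
        notes.append(reply)
        last_line = reply.splitlines()[-1].lower() if reply else ""
        if last_line.startswith("verdict: true"):
            return True, notes
        if last_line.startswith("verdict: false"):
            return False, notes
    return None, notes  # the model never committed to a verdict
```

Allowing an explicit “Unsure” answer gives an evasive model a way to defer instead of forcing a premature True or False.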

The “expected” power of memoization

The same idea of iterative reflection is also applicable to LLM tasks that tend to produce incomplete outputs. If we revisit the bot dialog example and allow an LLM to iterate on a memoized list of improvements, we will observe that the model is rarely “satisfied” with the result on the first attempt.

In other words, if we formulate a prompt like this:

iterative_prompt = """
Consider the following dialog between the user and the chatbot.
The bot's goal is to suggest a cheaper mobile plan based on the information the user provides.
The user's responses are not guaranteed to be consistent or coherent at all times.

This dialog was evaluated by an LLM and this evaluation is provided below.

Your job is to assess the quality of evaluation and respond with "success"=True and repeat the original action list if there is nothing significant to add.
If there is something missing in evaluation, respond with "success"=False and a new list of action items to create better user experience integrating the old list with new suggestions. Make sure the list items are unique and not repetitive.

"""

Then it would typically take 2–4 passes over the list of improvements until the LLM converges on recommendations and declares the evaluation task to be successful:

🍩 
success='False' action_items=['Enable bot to understand user inquiries about add-on packages related to international calls.', "Improve bot's understanding to handle informal or casual goodbyes such as 'byebye'."]
🍩
success='False' action_items=['Enable bot to understand user inquiries about add-on packages related to international calls.', "Improve bot's understanding to handle informal or casual goodbyes such as 'byebye'.", "Enhance the bot's capability to suggest plans that are closer to the user's budget, such as recommending plans around $10 instead of $14 when the user specifies a $10 budget."]
🍩
success='False' action_items=['Enable bot to understand user inquiries about add-on packages related to international calls.', "Improve bot's understanding to handle informal or casual goodbyes such as 'byebye'.", "Enhance the bot's capability to suggest plans that are closer to the user's budget, such as recommending plans around $10 instead of $14 when the user specifies a $10 budget.", 'Ensure the bot confirms if the user is interested in plans without inclusive international minutes given their travel habits.', 'Add functionality for the bot to suggest alternative communication methods like VoIP for international calls if budget constraints are strict.', "Improve the bot's ability to suggest plans that balance cost with user requirements, such as considering travel habits and required features."]
🍩
success='True' action_items=['Enable bot to understand user inquiries about add-on packages related to international calls.', "Improve bot's understanding to handle informal or casual goodbyes such as 'byebye'.", "Enhance the bot's capability to suggest plans that are closer to the user's budget, such as recommending plans around $10 instead of $14 when the user specifies a $10 budget.", 'Ensure the bot confirms if the user is interested in plans without inclusive international minutes given their travel habits.', 'Add functionality for the bot to suggest alternative communication methods like VoIP for international calls if budget constraints are strict.', "Improve the bot's ability to suggest plans that balance cost with user requirements, such as considering travel habits and required features."]
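The control flow that produces a trace like the one above can be sketched as follows. Here `evaluate` is a hypothetical wrapper around one LLM call that sends the prompt, the dialog, and the current list, and returns the parsed structured result. The `Evaluation` type, the boolean `success` field (the trace above prints it as a string), and the pass limit are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    success: bool
    action_items: List[str] = field(default_factory=list)


def refine_action_items(
    dialog: str,
    action_items: List[str],
    evaluate: Callable[[str, List[str]], Evaluation],  # hypothetical: one LLM call per pass
    max_passes: int = 10,                              # assumed safety limit
) -> Tuple[List[str], int]:
    """Iterate on the memoized list until the model declares success."""
    items = list(action_items)
    for passes in range(1, max_passes + 1):
        result = evaluate(dialog, items)
        # Memoize the latest list, dropping exact duplicates while keeping order.
        items = list(dict.fromkeys(result.action_items))
        if result.success:
            return items, passes
    return items, max_passes
```

The returned pass count is what makes it possible to see how much deliberation each dialog needed.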

After this initial “warm-up” over one dialog, we can feed the model with more sample dialogs and see what happens.

Much like a human evaluator, the GPT-4o model decides that many dialog samples do not warrant new recommendations (a single model run is enough), yet some may trigger much longer deliberation:

Number of LLM invocations per first 50 dialogs from ExpBot dataset until convergence (chart by the author)

The final result will be a fairly exhaustive list of recommendations on improving the chatbot:

Final recommendations: 

["Improve the bot's ability to avoid repetitive greetings and restarts when the user's input is vague or repeated, creating a more fluid conversation flow.",
"Enhance the bot's active listening skills to acknowledge user needs and concerns before suggesting starting over, to better handle user dissatisfaction.",
"Include a function allowing users to ask follow-up questions for more details about the suggested plan, such as data overage charges and roaming fees.",
"Develop a mechanism for the bot to detect and correct minor typographical errors and currency symbol mismatches in user inputs.",
"Provide alternative suggestions that might not fit all criteria but offer significant savings or benefits in other areas based on the provided user data.",
"Implement a feedback system enabling users to rate the accuracy or helpfulness of the plan suggestion provided, allowing for iterative improvements.",
"Incorporate a bot training mechanism to ensure it can handle responses that are non-standard in format or include extraneous details not directly related to the plan.",
"Add the ability for the bot to suggest seeking human assistance when complex queries or dissatisfaction arise that the bot cannot resolve.",
"Enhance the bot's language processing capabilities to accurately interpret various phrasings and informal expressions from the user.",
"Increase the bot's capability for dynamic clarification requests, creating a smoother interaction flow.",
"Refine the bot's ability to verify user information effectively to reduce misunderstandings and user frustration.",
"Improve the bot's handling of unrealistic and inconsistent user inputs to guide the conversation back to relevant queries.",
"Integrate a process for flagging nonsensical data entries and guide the user toward providing accurate information.",
"Provide clearer explanations or breakdowns of the suggested plan's features, especially if different from the user's mentioned requirements.",
"Improve response to questions unrelated to starting new calculations to avoid redundant loops."]

Some technical notes on this example:

  1. For simplicity, we have combined the evaluation and generation into one prompt which relies on OpenAI’s structured outputs to produce the desired result.
  2. The inherent limitation of memoization is a requirement to process samples sequentially. This can take some time on a long dataset, and also blocks us from using cheap inference via call batching.

To further improve the performance, we can take advantage of the fact that most samples in a dataset do not generate new insights. This means we can produce the initial list of recommendations by iterating over a small subset of samples sequentially, and serve the rest of the dataset in parallel via the DataChain library (or in a batch with the OpenAI API) to flag the “interesting” cases. This can shave 30–50% off the time (or expense) budget, depending on your preferences.
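The three-phase idea can be sketched with the standard library alone. The version below uses a thread pool in place of DataChain or the OpenAI batch API, and the `refine` and `is_covered` callables, the warm-up size, and the worker count are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def evaluate_dataset(
    dialogs: Sequence[str],
    refine: Callable[[str, List[str]], List[str]],  # hypothetical: sequential memoized refinement
    is_covered: Callable[[str, List[str]], bool],   # hypothetical: one LLM call, True if nothing to add
    warmup: int = 10,                               # assumed size of the sequential subset
    workers: int = 8,
) -> Tuple[List[str], List[int]]:
    """Return the final recommendations and the indices of the flagged dialogs."""
    items: List[str] = []
    # Phase 1: build the initial list sequentially on a small subset.
    for dialog in dialogs[:warmup]:
        items = refine(dialog, items)
    # Phase 2: check the remaining dialogs in parallel against a frozen snapshot.
    rest = list(dialogs[warmup:])
    snapshot = list(items)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        covered = list(pool.map(lambda d: is_covered(d, snapshot), rest))
    flagged = [warmup + i for i, ok in enumerate(covered) if not ok]
    # Phase 3: only the flagged ("interesting") dialogs go through refinement again.
    for index in flagged:
        items = refine(dialogs[index], items)
    return items, flagged
```

Because the parallel phase reads a frozen snapshot, it never mutates the memoized list, so only the flagged dialogs pay the cost of sequential processing.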

Conclusion

LLMs can and should be used for unsupervised evaluations (including self-evaluations). The fine print is that this requires a well-thought-out approach, which often comes down to iteratively improving and refining the judgments.

Here is a link to the sample implementation in Google Colab:

https://colab.research.google.com/drive/1q_dChQBMbnUXZ377JVwYsjvn7lZ_7qlZ?usp=sharing