Evaluate anything you want | Creating advanced evaluators with LLMs

Discover how to build custom LLM evaluators for specific real-world needs.


Image generated by DALLE-3 | Robot Inspections in the isometric style

Given the rapid advancements in LLM “chains”, agents, chatbots, and other text-generation applications, evaluating the performance of language models is crucial for understanding their capabilities and limitations. It is especially important to be able to adapt those metrics to your business goals.

While standard metrics like perplexity, BLEU, and sentence distance provide a general indication of model performance, in my experience they often fall short of capturing the nuances and specific requirements of real-world applications.
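To see why, consider how an n-gram metric scores two answers that mean the same thing but share almost no wording. The snippet below is a minimal illustration of my own (not from this article) using NLTK’s sentence-level BLEU; the example sentences are invented.

```python
# Minimal sketch: BLEU penalises a semantically correct paraphrase
# because it shares almost no n-grams with the reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The capital of France is Paris.".split()
candidate = "Paris is the French capital.".split()

score = sentence_bleu(
    [reference],
    candidate,
    smoothing_function=SmoothingFunction().method1,  # avoid zero n-gram counts
)
print(f"BLEU: {score:.3f}")  # very low, despite the answer being correct
```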

For example, take a simple RAG QA application. When building a question-answering system, the factors of the so-called RAG Triad, such as context relevance, groundedness in facts, and language consistency between the query and the response, matter just as much. Standard metrics simply cannot capture these nuanced aspects effectively.

This is where LLM-based “blackbox” metrics come in handy. While the idea may sound naive, the concept is quite compelling: these metrics use large language models themselves to evaluate the quality and other aspects of the generated text. By using a pre-trained language model as a “judge”, we can assess the generated text according to the model’s understanding of language and a set of pre-defined criteria.
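As a rough sketch of the idea (my own illustration, not the article’s final pipeline), an LLM judge is just a prompt that states the criteria, passes in the question, retrieved context, and answer, and asks the model for a structured verdict. The model name and prompt wording below are placeholder assumptions.

```python
# Minimal LLM-as-a-judge sketch. Assumes the OpenAI Python SDK (>= 1.x)
# and an OPENAI_API_KEY in the environment; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator for a RAG QA system.
Score the ANSWER to the QUESTION on each criterion from 1 (poor) to 5 (excellent):
- context_relevance: is the retrieved CONTEXT relevant to the QUESTION?
- groundedness: is the ANSWER supported by the CONTEXT?
- language_consistency: is the ANSWER in the same language as the QUESTION?

QUESTION: {question}
CONTEXT: {context}
ANSWER: {answer}

Reply as JSON: {{"context_relevance": int, "groundedness": int, "language_consistency": int, "comment": str}}"""

def judge(question: str, context: str, answer: str) -> str:
    # Ask the judge model for scores; temperature 0 keeps scoring deterministic.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever judge model you trust
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    return response.choices[0].message.content
```

In practice you would parse the returned JSON and log the scores alongside each run, which is exactly what the pipeline later in this article does.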

In this article, I will walk through an end-to-end example of constructing the prompt, running the evaluation, and tracking the results.

Since LangChain is more or less the de-facto standard framework for building chatbots and RAG, I will build the example application on it: it is easy to integrate into an MVP and it ships with simple evaluation capabilities. However, you can use any other framework you want to build your own.
The main value of this article is the pipeline and the prompts.
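For reference, here is a hedged sketch of what LangChain’s built-in criteria evaluator looks like. The exact import paths, evaluator names, and result fields depend on your installed version, so verify this against the LangChain docs rather than treating it as a definitive API; the model and criterion text are my own placeholders.

```python
# Sketch of LangChain's "criteria" evaluator (check against your LangChain version).
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

judge_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # placeholder judge model

# A custom criterion phrased as a yes/no question about the output.
evaluator = load_evaluator(
    "criteria",
    llm=judge_llm,
    criteria={"groundedness": "Is the submission fully supported by the input context?"},
)

result = evaluator.evaluate_strings(
    prediction="Paris is the capital of France.",
    input="Context: France's capital city is Paris.\nQuestion: What is the capital of France?",
)
print(result)  # typically a dict with the judge's reasoning, a value, and a 0/1 score
```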

How to do it?

Let’s dive into the code and explore the process of creating custom evaluators. We’ll walk through a few key examples and discuss their implementations.

Example #1 |…