Evaluate anything you want | Creating advanced evaluators with LLMs

Discover how to build custom LLM evaluators for specific real-world needs.


Image generated by DALLE-3 | Robot Inspections in the isometric style

Given the rapid advancements in LLM “chains”, agents, chatbots, and other text-generation applications, evaluating the performance of language models is crucial for understanding their capabilities and limitations. It is especially important to be able to adapt those metrics to your business goals.

While standard metrics like perplexity, BLEU, and sentence distance provide a general indication of model performance, in my experience they often fall short of capturing the nuances and specific requirements of real-world applications.
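To see why, consider how an n-gram metric scores two answers that mean the same thing but share almost no wording. The snippet below is a minimal illustration of my own (not from this article) using NLTK’s sentence-level BLEU; the example sentences are invented.

```python
# Minimal sketch: BLEU penalises a semantically correct paraphrase
# because it shares almost no n-grams with the reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The capital of France is Paris.".split()
candidate = "Paris is the French capital.".split()

score = sentence_bleu(
    [reference],
    candidate,
    smoothing_function=SmoothingFunction().method1,  # avoid zero n-gram counts
)
print(f"BLEU: {score:.3f}")  # very low, despite the answer being correct
```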

For example, take a simple RAG QA application. When building a question-answering system, the factors of the so-called RAG Triad, such as context relevance, groundedness in facts, and language consistency between the query and the response, matter just as much. Standard metrics simply cannot capture these nuanced aspects effectively.

This is where LLM-based “blackbox” metrics come in handy. While the idea may sound naive, the concept is quite compelling: these metrics use large language models themselves to evaluate the quality and other aspects of the generated text. By using a pre-trained language model as a “judge”, we can assess the generated text according to the model’s understanding of language and a set of pre-defined criteria.
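As a rough sketch of the idea (my own illustration, not the article’s final pipeline), an LLM judge is just a prompt that states the criteria, passes in the question, retrieved context, and answer, and asks the model for a structured verdict. The model name and prompt wording below are placeholder assumptions.

```python
# Minimal LLM-as-a-judge sketch. Assumes the OpenAI Python SDK (>= 1.x)
# and an OPENAI_API_KEY in the environment; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator for a RAG QA system.
Score the ANSWER to the QUESTION on each criterion from 1 (poor) to 5 (excellent):
- context_relevance: is the retrieved CONTEXT relevant to the QUESTION?
- groundedness: is the ANSWER supported by the CONTEXT?
- language_consistency: is the ANSWER in the same language as the QUESTION?

QUESTION: {question}
CONTEXT: {context}
ANSWER: {answer}

Reply as JSON: {{"context_relevance": int, "groundedness": int, "language_consistency": int, "comment": str}}"""

def judge(question: str, context: str, answer: str) -> str:
    # Ask the judge model for scores; temperature 0 keeps scoring deterministic.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever judge model you trust
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    return response.choices[0].message.content
```

In practice you would parse the returned JSON and log the scores alongside each run, which is exactly what the pipeline later in this article does.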

In this article, I will walk through an end-to-end example of constructing the prompt, running the evaluation, and tracking the results.

Since LangChain is more or less the de-facto standard framework for building chatbots and RAG, I will build the example application on it: it is easy to integrate into an MVP and it ships with simple evaluation capabilities. However, you can use any other framework you want to build your own.
The main value of this article is the pipeline and the prompts.
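For reference, here is a hedged sketch of what LangChain’s built-in criteria evaluator looks like. The exact import paths, evaluator names, and result fields depend on your installed version, so verify this against the LangChain docs rather than treating it as a definitive API; the model and criterion text are my own placeholders.

```python
# Sketch of LangChain's "criteria" evaluator (check against your LangChain version).
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

judge_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # placeholder judge model

# A custom criterion phrased as a yes/no question about the output.
evaluator = load_evaluator(
    "criteria",
    llm=judge_llm,
    criteria={"groundedness": "Is the submission fully supported by the input context?"},
)

result = evaluator.evaluate_strings(
    prediction="Paris is the capital of France.",
    input="Context: France's capital city is Paris.\nQuestion: What is the capital of France?",
)
print(result)  # typically a dict with the judge's reasoning, a value, and a 0/1 score
```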

How to do it?

Let’s dive into the code and explore the process of creating custom evaluators. We’ll walk through a few key examples and discuss their implementations.

Example #1 |…