How to Create a RAG Evaluation Dataset From Documents

Automatically create domain-specific datasets in any language using LLMs

Our automatically generated RAG evaluation dataset on the Hugging Face Hub (PDF input file from the European Union licensed under CC BY 4.0). Image by the author

In this article, I will show you how to create your own RAG evaluation dataset consisting of contexts, questions, and answers from documents in any language.
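To make the target format concrete, here is a hypothetical example of what a single row in such a dataset might look like. The field names and the sample text are illustrative assumptions, not a fixed standard:

```python
# A hypothetical single row of a RAG evaluation dataset.
# Field names ("context", "question", "answer") are illustrative.
sample = {
    "context": (
        "The European Union has 27 member states. The most recent country "
        "to join was Croatia, which became a member on 1 July 2013."
    ),
    "question": "Which country was the most recent to join the European Union?",
    "answer": "Croatia, which joined the European Union on 1 July 2013.",
}

# Each field should be a non-empty string so evaluation code can rely on the schema.
assert all(isinstance(v, str) and v for v in sample.values())
```

A full dataset is then simply a list of such rows, one per generated question.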

Retrieval-Augmented Generation (RAG) [1] is a technique that allows LLMs to access an external knowledge base.

By uploading PDF files and storing them in a vector database, we can retrieve this knowledge via a vector similarity search and then insert the retrieved text into the LLM prompt as additional context.

This provides the LLM with new knowledge and reduces the possibility of the LLM making up facts (hallucinations).
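The retrieval step described above can be sketched in a few lines of pure Python. This is a toy stand-in, not a production implementation: a real pipeline would use a learned embedding model and a vector database, but here a simple bag-of-words vector and cosine similarity play both roles so the example runs without any dependencies:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a word-count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (the 'vector search')."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


documents = [
    "Croatia joined the European Union on 1 July 2013.",
    "The capital of France is Paris.",
]
question = "When did Croatia join the EU?"
context = retrieve(question, documents, k=1)[0]

# The retrieved text is inserted into the LLM prompt as additional context.
prompt = f"Answer using only the context below.\n\nContext: {context}\n\nQuestion: {question}"
print(prompt)
```

Swapping `embed` for a real embedding model and the `sorted` call for a vector-database query gives the pipeline shown in the figure below, without changing the overall structure.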

The basic RAG pipeline. Image by the author from the article “How to Build a Local Open-Source LLM Chatbot With RAG”

However, there are many parameters we need to set in a RAG pipeline, and researchers are always suggesting new improvements. How do we know which parameters to choose and which methods will really improve performance for our particular use case?

This is why we need a validation/dev/test dataset to evaluate our RAG pipeline. The dataset should be from the domain we are interested…