Why Does Position-Based Chunking Lead to Poor Performance in RAGs?

How to implement semantic chunking and gain better results.

Neighbors could still be different.

Language models come with a context limit. For newer OpenAI models, this is around 128k tokens, roughly 80k English words. That may sound like plenty for most use cases. Still, large production-grade applications often need to reference more than 80k words, not to mention images, tables, and other unstructured information.
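If you want to check that token-to-word ratio on your own text, here is a minimal sketch using OpenAI's tiktoken tokenizer. The library choice and the cl100k_base encoding are assumptions for illustration, not something the argument depends on:

```python
# Rough token counting with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI models

text = "Language models come with a context limit."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
# English prose averages roughly 0.75 words per token, which is where
# the ~80k-word estimate for a 128k-token window comes from.
```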

Even when everything fits inside the context window, padding it with irrelevant information makes LLM performance drop significantly.

This is where RAG helps. RAG retrieves the relevant information from an embedded source and passes it as context to the LLM. To retrieve that 'relevant information,' we must first divide the documents into chunks. Thus, chunking plays a vital role in a RAG pipeline.
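To make that retrieval step concrete, here is a minimal sketch of the flow. The `embed()` function is a hypothetical placeholder; a real pipeline would call an embedding model and use a vector store instead of a plain list:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function: a real pipeline would call an
    embedding model (OpenAI, sentence-transformers, etc.)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)  # placeholder vector, for illustration only

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Chunk the documents and embed each chunk ahead of time.
chunks = ["Chunk one text...", "Chunk two text...", "Chunk three text..."]
chunk_vectors = [embed(c) for c in chunks]

# 2. At query time, embed the question and retrieve the most similar chunks.
query = "What does the document say about refunds?"
query_vector = embed(query)
scores = [cosine(query_vector, v) for v in chunk_vectors]
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# 3. Pass the retrieved chunks to the LLM as context.
prompt = (
    "Answer using only this context:\n"
    + "\n".join(top_chunks)
    + f"\n\nQuestion: {query}"
)
print(prompt)
```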

Chunking lets the RAG system retrieve specific pieces of a large document. However, small changes in the chunking strategy can significantly affect the responses the LLM generates.
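The 'position-based' chunking in the title is the naive version of this step: cutting the text every N characters (or tokens) regardless of meaning. A short sketch, with example sizes chosen only to keep the output readable, shows how easily it splits a sentence across two chunks, so neighboring chunks end up carrying different, incomplete ideas:

```python
def position_based_chunks(text: str, chunk_size: int = 60, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character windows, ignoring meaning."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = (
    "Our refund policy lasts 30 days. After 30 days, we cannot offer "
    "a refund or exchange. Gift cards are non-refundable."
)

for i, chunk in enumerate(position_based_chunks(doc)):
    print(f"chunk {i}: {chunk!r}")
# The second sentence is cut mid-thought, so a retrieved chunk may start
# or end in the middle of an idea and lose the detail the query needs.
```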