Duplicate Detection with GenAI

How using LLMs and GenAI techniques can improve de-duplication

Ian Ormesher

Published in

Towards Data Science

5 min read

13 hours ago

—

2D UMAP Musicbrainz 200K nearest neighbour plot

Customer data is often stored as records in Customer Relations Management systems (CRMs). Data which is manually entered into such systems by one of more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasing complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. But it is possible to use the latest advancements in Large Language Models and Generative AI to vastly improve the identification and repair of duplicated records. On common benchmark datasets I found an improvement in the accuracy of data de-duplication rates from 30 percent using NLP techniques to almost 60 percent using my proposed method.

I want to explain the technique here in the hope that others will find it helpful and use it for their own de-duplication needs. It’s useful for other scenarios where you wish to identify duplicate records, not just for Customer data. I also wrote and published a research paper about this which you can view on Arxiv, if you want to know more in depth:

Duplicate Detection with GenAI

Customer data is often stored as records in Customer Relations Management systems (CRMs). Data which is manually…

arxiv.org

Traditional Approach

The task of identifying duplicate records is often done by pairwise record comparisons and is referred to as “Entity Matching” (EM). Typical steps of this process would be:

Data Preparation
Candidate Generation
Blocking
Matching
Clustering

Data Preparation

Data preparation is the cleaning of the data and involves such things as removing non-ASCII characters, capitalisation and tokenising the text. This is an important and necessary step for the NLP matching algorithms later in the process which don’t work well with different cases or non-ASCII characters.

Candidate Generation

In the usual EM method, we would produce candidate records by combining all the records in the table with themselves to produce a cartesian product. You would remove all combinations which are of a row with itself. For a lot of the NLP matching algorithms comparing row A with row B is equivalent to comparing row B with row A. For those cases you can get away with keeping just one of those pairs. But even after this, you’re still left with a lot of candidate records. In order to reduce this number a technique called “blocking” is often used.

Blocking

The idea of blocking is to eliminate those records that we know could not be duplicates of each other because they have different values for the “blocked” column. As an example, If we were considering customer records, a potential column to block on could be something like “City”. This is because we know that even if all the other details of the record are similar enough, they cannot be the same customer if they’re located in different cities. Once we have generated our candidate records, we then use blocking to eliminate those records that have different values for the blocked column.

Matching

Following on from blocking we now examine all the candidate records and calculate traditional NLP similarity-based attribute value metrics with the fields from the two rows. Using these metrics, we can determine if we have a potential match or un-match.

Clustering

Now that we have a list of candidate records that match, we can then group them into clusters.

Proposed Method

There are several steps to the proposed method, but the most important thing to note is that we no longer need to perform the “Data Preparation” or “Candidate Generation” step of the traditional methods. The new steps become:

Create Match Sentences
Create Embedding Vectors of those Match Sentences
Clustering

Create Match Sentences

First a “Match Sentence” is created by concatenating the attributes we are interested in and separating them with spaces. As an example, let’s say we have a customer record which looks like this:

We would create a “Match Sentence” by concatenating with spaces the name1, name2, name3, address and city attributes which would give us the following:

“John Hartley Smith 20 Main Street London”

Create Embedding Vectors

Once our “Match Sentence” has been created it is then encoded into vector space using our chosen embedding model. This is achieved by using “Sentence Transformers”. The output of this encoding will be a floating-point vector of pre-defined dimensions. These dimensions relate to the embedding model that is used. I used the all-mpnet-base-v2 embedding model which has a vector space of 768 dimensions. This embedding vector is then appended to the record. This is done for all the records.

Clustering

Once embedding vectors have been calculated for all the records, the next step is to create clusters of similar records. To do this I use the DBSCAN technique. DBSCAN works by first selecting a random record and finding records that are close to it using a distance metric. There are 2 different kinds of distance metrics that I’ve found to work:

L2 Norm distance
Cosine Similarity

For each of those metrics you choose an epsilon value as a threshold value. All records that are within the epsilon distance and have the same value for the “blocked” column are then added to this cluster. Once that cluster is complete another random record is selected from the unvisited records and a cluster then created around it. This then continues until all the records have been visited.

Experiments and Results

I used this approach to identify duplicate records with customer data in my work. It produced some very nice matches. In order to be more objective, I also ran some experiments using a benchmark dataset called “Musicbrainz 200K”. It produced some quantifiable results that were an improvement over standard NLP techniques.

Visualising Clustering

I produced a nearest neighbour cluster map for the Musicbrainz 200K dataset which I then rendered in 2D using the UMAP reduction algorithm:

2D UMAP Musicbrainz 200K nearest neighbour plot

Resources

I have created various notebooks that will help with trying the method out for yourselves:

GitHub – ianormy/genai_duplicate_detection_paper: Resources and notebooks to accompany the…

Resources and notebooks to accompany the Duplicate Detection using GenAI paper …

github.com

Duplicate Detection with GenAI paper: [2406.15483] Duplicate Detection with GenAI
GitHub resources: https://github.com/ianormy/genai_duplicate_detection_paper
all-mpnet-base-v2 embedding model: https://huggingface.co/sentence-transformers/all-mpnet-base-v2/
Sentence Transformers: https://huggingface.co/sentence-transformers/
UMAP python package: https://pypi.org/project/umap-learn/
Benchmark datasets for entity resolution: https://dbs.uni-leipzig.de/research/projects/benchmark-datasets-for-entity-resolution/
Musicbrainz 200K dataset: https://dbs.uni-leipzig.de/files/datasets/saeedi/musicbrainz-200-A01.csv.dapo

Duplicate Detection with GenAI

How using LLMs and GenAI techniques can improve de-duplication

Duplicate Detection with GenAI

Customer data is often stored as records in Customer Relations Management systems (CRMs). Data which is manually…

Traditional Approach

Data Preparation

Candidate Generation

Blocking

Matching

Clustering

Proposed Method

Create Match Sentences

Create Embedding Vectors

Clustering

Experiments and Results

Visualising Clustering

Resources

GitHub – ianormy/genai_duplicate_detection_paper: Resources and notebooks to accompany the…

Resources and notebooks to accompany the Duplicate Detection using GenAI paper …

Apple’s next-gen iPhone 17 Air rumors: A17 chip, 8GB RAM, single 48MP rear camera, and more

What To Do As A Startup Without A Market | HackerNoon

Western Digital Introduces 4 TB microSDUC, 8 TB SDUC, and 16 TB External SSDs

Supercharge Your Portfolio with Future Tech Stocks!

Join us for Profitable Insights & Expert Tips!

Duplicate Detection with GenAI

How using LLMs and GenAI techniques can improve de-duplication

Duplicate Detection with GenAI

Customer data is often stored as records in Customer Relations Management systems (CRMs). Data which is manually…

Traditional Approach

Data Preparation

Candidate Generation

Blocking

Matching

Clustering

Proposed Method

Create Match Sentences

Create Embedding Vectors

Clustering

Experiments and Results

Visualising Clustering

Resources

GitHub – ianormy/genai_duplicate_detection_paper: Resources and notebooks to accompany the…

Resources and notebooks to accompany the Duplicate Detection using GenAI paper …

Apple’s next-gen iPhone 17 Air rumors: A17 chip, 8GB RAM, single 48MP rear camera, and more

What To Do As A Startup Without A Market | HackerNoon

Western Digital Introduces 4 TB microSDUC, 8 TB SDUC, and 16 TB External SSDs

Supercharge Your Portfolio with Future Tech Stocks!

Join us for Profitable Insights & Expert Tips!

Subscribe