Techniques for Chat Data Analytics with Python

Part II: Topic Extraction with BERTopic

Photo by Mikechie Esparagoza, obtained from Pexels.com

In the first part of this series, I introduced you to my artificially created friend John, who was nice enough to provide us with his chats with five of the closest people in his life. We used just the metadata, such as who sent messages at what time, to visualize when John met his girlfriend, when he had fights with one of his best friends, and which family members he should write to more often. If you haven’t read the first part of the series, you can find it here.

What we haven’t covered yet, and will dive into now, is an analysis of the actual messages. To do that, we will use the chat between John and Maria to identify the topics they discuss. And of course, we will not go through the messages one by one and classify them by hand. Instead, we will use the Python library BERTopic to extract the topics their conversation revolves around.
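As a rough preview of what that looks like in code, here is a minimal sketch. It assumes the chat is available as a list of dictionaries with "sender" and "text" keys, which is a hypothetical format and not necessarily how John's data is stored. The helper names prepare_documents and extract_topics, and the min_words threshold, are also made up for illustration. Only BERTopic, fit_transform and get_topic_info come from the library itself.

```python
from typing import Dict, List


def prepare_documents(messages: List[Dict[str, str]], min_words: int = 3) -> List[str]:
    """Keep only messages long enough to plausibly carry a topic.

    The message format (dicts with "sender" and "text" keys) is an
    assumption made for this sketch.
    """
    docs = []
    for message in messages:
        text = message["text"].strip()
        # Very short replies such as "ok" rarely say anything about a topic.
        if len(text.split()) >= min_words:
            docs.append(text)
    return docs


def extract_topics(docs: List[str]):
    """Fit a BERTopic model with default settings on the given documents."""
    # Imported here so the preparation step runs without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic(language="english")
    topics, _probabilities = topic_model.fit_transform(docs)
    return topic_model, topics


# Hypothetical sample messages in the assumed format.
sample_chat = [
    {"sender": "John", "text": "Shall we book the flights to Lisbon this weekend?"},
    {"sender": "Maria", "text": "ok"},
    {"sender": "Maria", "text": "Yes, and I found a nice hotel near the beach."},
    {"sender": "John", "text": "haha"},
]

docs = prepare_documents(sample_chat)
print(docs)
```

With a real chat of several hundred messages or more, calling extract_topics(docs) and then get_topic_info() on the returned model would list the topics that were found. The four-message sample above is only there to show the filtering step. It is far too small for the default BERTopic settings to fit.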

What is BERTopic?

BERTopic is a topic modeling technique introduced by Maarten Grootendorst that uses transformer-based embeddings, specifically BERT embeddings, to generate coherent and interpretable topics from large collections of documents. It was designed to overcome the limitations of traditional topic modeling approaches like LDA (Latent Dirichlet Allocation), which often struggle to handle short…