Choose the Right One: Evaluating Topic Models for Business Intelligence

A Python tutorial for evaluating bigram topic models in customer email classification.


Topic models are used in businesses to classify brand-related text datasets (such as product and site reviews, surveys, and social media comments) and to track how customer satisfaction metrics change over time.

There are many topic models to choose from: the widely used BERTopic by Maarten Grootendorst (2022), the recent FASTopic presented at last year’s NeurIPS (Xiaobao Wu et al., 2024), the Dynamic Topic Model by Blei and Lafferty (2006), or the fresh semi-supervised Seeded Poisson Factorization model (Prostmaier et al., 2025).

When we train topic models on customer texts for a business use case, we often get results that are not identical and are sometimes even conflicting. In business, imperfections cost money, so engineers should place into production the model that provides the best solution and solves the problem most effectively. And just as quickly as new topic models appear on the market, methods for evaluating their quality with new metrics evolve as well.

This practical tutorial focuses on bigram topic models, which provide more relevant information and better identify key qualities and problems for business decisions than single-word models (“delivery” vs. “poor delivery”, “stomach” vs. “sensitive stomach”, etc.). On the one hand, bigram models are more detailed; on the other, many evaluation metrics were not originally designed to evaluate them. To provide more background in this area, we will explore in detail:

  • How to evaluate the quality of bigram topic models
  • How to prepare an email classification pipeline in Python. 

Our example use case will show how bigram topic models (BERTopic and FASTopic) help prioritize email communication with customers on certain topics and reduce response times.

1. What are topic model quality indicators?

The evaluation task should target the ideal state:

The ideal topic model should produce topics where words or bigrams (two consecutive words) in each topic are highly semantically related and distinct for each topic.

In practice, this means that the words predicted for each topic appear semantically related to a human judge, and there is little duplication of words between topics.

It is standard to calculate a set of metrics for each trained model and compare them to make a qualified decision on which model to place into production or use for a business decision.

  • Coherence metrics evaluate how well the words discovered by a topic model make sense to humans (have similar semantics in each topic).
  • Topic diversity measures how different the discovered topics are from one another. 

Bigram topic models work well with these metrics:

  • NPMI (Normalized Point-wise Mutual Information) uses co-occurrence probabilities estimated in a reference corpus to calculate a score in [-1, 1] for each word (or bigram) predicted by the model. Read [1] for more details.

The reference corpus can be either internal (the training set) or external (e.g., an external email dataset). A large, external, and comparable corpus is a better choice because it can help reduce bias in the training set. Because this metric works with word frequencies, the training set and the reference corpus should be preprocessed in the same way (i.e., if we remove numbers and stopwords from the training set, we should also do so in the reference corpus). The aggregate model score is the average of the word (bigram) scores across topics.
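As a rough illustration of the calculation (not the code from the repo), here is a minimal sketch of how NPMI can be estimated from document frequencies; the `npmi` helper and the toy reference corpus are made up for the example.

```python
import math

def npmi(pair: tuple[str, str], docs: list[set[str]]) -> float:
    """NPMI of a word pair, estimated from document frequencies in a reference corpus."""
    n = len(docs)
    p_x = sum(pair[0] in d for d in docs) / n                      # P(w1)
    p_y = sum(pair[1] in d for d in docs) / n                      # P(w2)
    p_xy = sum(pair[0] in d and pair[1] in d for d in docs) / n    # P(w1, w2)
    if p_xy == 0:
        return -1.0        # the words never co-occur -> minimum score
    if p_xy == 1:
        return 1.0         # the words always co-occur -> maximum score
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)                                   # normalize to [-1, 1]

# toy reference corpus: each document as a set of preprocessed tokens
reference = [{"poor", "delivery", "refund"}, {"delivery", "late"}, {"sensitive", "stomach"}]
print(npmi(("poor", "delivery"), reference))   # scores the bigram "poor delivery"
```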

  • SC (Semantic Coherence) does not need a reference corpus. It uses the same dataset as was used to train the topic model. Read more in [2].

Let’s say a topic model predicts the Top 4 words “apple”, “banana”, “juice”, “smoothie” for one topic. SC then looks at all pairs of these words, going from left to right: starting with the first word {apple, banana}, {apple, juice}, {apple, smoothie}, then the second word {banana, juice}, {banana, smoothie}, and finally the last pair {juice, smoothie}. For each pair, it counts the number of documents in the training set that contain both words and divides it by the number of documents that contain the first word. The overall SC score for a model is the mean of all topic-level scores.

Image 1. Semantic coherence by Mimno et al. (2011) illustration. Image by author.
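A toy sketch of this counting (my own simplification that follows the plain ratio described above; the original formula in Mimno et al. [7] additionally applies a logarithm and +1 smoothing, and the toy corpus is made up):

```python
from itertools import combinations

def topic_sc(top_words: list[str], docs: list[set[str]]) -> float:
    """Semantic coherence of one topic, following the left-to-right counting above."""
    scores = []
    for w1, w2 in combinations(top_words, 2):             # {apple, banana}, {apple, juice}, ...
        d_first = sum(w1 in d for d in docs)              # documents containing the first word
        d_both = sum(w1 in d and w2 in d for d in docs)   # documents containing both words
        if d_first:
            scores.append(d_both / d_first)
    return sum(scores) / len(scores)

topic = ["apple", "banana", "juice", "smoothie"]          # Top 4 words of one topic
train = [{"apple", "banana", "juice"}, {"apple", "smoothie"}, {"banana", "juice", "smoothie"}]
print(topic_sc(topic, train))                             # the model SC is the mean over topics
```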

PUV (Percentage of Unique Words) calculates the share of unique words (bigrams) across the topics of the model. PUV = 1 means that each topic in the model contains only unique bigrams. Values close to 1 indicate a well-shaped, high-quality model with little word overlap between topics [3].
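PUV is straightforward to compute; here is a minimal sketch (the dictionary of predicted bigrams per topic is a made-up example):

```python
def puv(topic_bigrams: dict[str, list[str]]) -> float:
    """Share of unique bigrams across all topics of one model."""
    all_bigrams = [b for bigrams in topic_bigrams.values() for b in bigrams]
    return len(set(all_bigrams)) / len(all_bigrams)

# two topics sharing one bigram -> PUV = 3/4
print(puv({"T0": ["poor delivery", "late delivery"],
           "T1": ["sensitive stomach", "poor delivery"]}))
```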

The closer the SC and NPMI scores are to 0, the more coherent the model is (the bigrams predicted by the topic model for each topic are semantically similar). The closer PUV is to 1, the easier the model is to interpret and use, because bigrams do not overlap between topics.

2. How can we prioritize email communication with topic models?

A large share of customer communication, not only in e-commerce businesses, is now handled by chatbots and personal client portals. Yet it is still common to communicate with customers by email. Many email providers offer developers broad flexibility in their APIs to customize their email platforms (e.g., MailChimp, SendGrid, Brevo). This is where topic models make mailing more flexible and effective.

In this use case, the pipeline takes incoming emails as input and uses the trained topic classifier to categorize their content. The outcome is the classified topic that the Customer Care (CC) Department sees next to each email. The main objective is to allow the CC staff to prioritize the categories of emails and reduce the response time to the most sensitive requests (those that directly affect margin-related KPIs or OKRs).

Image 2. Topic model pipeline illustration. Image by author.
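As a rough sketch of the classification step of this pipeline (the model path, the label mapping, and the helper name are hypothetical; a FASTopic model would be wired in analogously), it can look like this:

```python
from bertopic import BERTopic

# hypothetical topic labels agreed with the CC team, keyed by topic id
TOPIC_LABELS = {-1: "Unclassified", 0: "Time Delays", 1: "Latency Issues"}

topic_model = BERTopic.load("models/cc_email_bertopic")    # hypothetical path to the trained model

def classify_incoming_email(body: str) -> str:
    """Return the topic label the CC staff sees next to an incoming email."""
    topics, _ = topic_model.transform([body])              # predict the topic id for the email body
    return TOPIC_LABELS.get(topics[0], "Unclassified")
```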

3. Data and model set-ups

We will train FASTopic and BERTopic to classify emails into 8 and 10 topics and evaluate the quality of all model specifications. Read my previous TDS tutorial [4] on topic modeling with these cutting-edge topic models.

As a training set, we use a synthetically generated Customer Care Email dataset available on Kaggle with a GPL-3 license. The prefiltered data covers 692 incoming emails and looks like this:

Image 3. Customer Care Email dataset. Image by author.

3.1. Data preprocessing

Cleaning text in the right order is essential for topic models to work in practice because it minimizes the bias of each cleaning operation. 

Numbers are typically removed first, followed by emojis, unless we need them for special tasks such as sentiment extraction. Stopwords for one or more languages are removed afterward, followed by punctuation, so that stopwords don’t break up into two tokens (“we’ve” -> “we” + “’ve”). Additional tokens (company and people’s names, etc.) are then removed from the cleaned data before lemmatization, which unifies tokens with the same semantics.

Image 4. General preprocessing steps for topic modeling. Image by author

“Delivery” and “deliveries”, “box” and “Boxes”, or “Price” and “prices” share the same word root, but without lemmatization, topic models would treat them as separate tokens. That’s why customer emails should be lemmatized in the last step of preprocessing.
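A minimal sketch of this cleaning order in Python (assuming the emoji package and spaCy’s en_core_web_sm model are installed; steps 3-6 are collapsed into a single spaCy pass, and the custom token list is a made-up example):

```python
import re
import emoji     # pip install emoji
import spacy     # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
CUSTOM_TOKENS = {"acme"}                                   # hypothetical company/people names

def clean_email(text: str) -> str:
    text = re.sub(r"\d+", " ", text.lower())               # 1. remove numbers
    text = emoji.replace_emoji(text, replace=" ")          # 2. remove emojis
    lemmas = [
        tok.lemma_ for tok in nlp(text)
        if not tok.is_stop                                 # 3. remove stopwords
        and not tok.is_punct                               # 4. remove punctuation
        and tok.lemma_ not in CUSTOM_TOKENS                # 5. remove additional tokens
        and tok.lemma_.strip()
    ]
    return " ".join(lemmas)                                # 6. lemmatized, cleaned text

print(clean_email("We've had 3 delayed deliveries from ACME 😞"))
```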

Text preprocessing is model-specific:

  • FASTopic works with clean data on input; some cleaning (stopwords) can be done during the training. The simplest and most effective option is the Washer, a no-code app for text data cleaning in text mining projects.
  • BERTopic: the documentation recommends that “removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context to create accurate embeddings”. For this reason, cleaning operations should be included in the model training.

3.2. Model compilation and training

You can check the full code for FASTopic and BERTopic’s training with bigram preprocessing and cleaning in this repo. My previous TDS tutorials [4] and [5] explain all the steps in detail.
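For orientation only, here is a condensed sketch of the two training setups (a simplification of the repo code: `emails` and `cleaned_emails` are assumed to hold the raw and preprocessed email bodies, and the parameter values are illustrative):

```python
from bertopic import BERTopic
from fastopic import FASTopic
from sklearn.feature_extraction.text import CountVectorizer

# BERTopic: keep raw text for the embeddings; cleaning and bigram extraction
# happen inside the vectorizer, as the documentation recommends
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
bertopic_model = BERTopic(vectorizer_model=vectorizer, nr_topics=8, min_topic_size=20)
bertopic_topics, _ = bertopic_model.fit_transform(emails)

# FASTopic: trained on the pre-cleaned texts; bigram preprocessing is done beforehand
fastopic_model = FASTopic(8)
top_bigrams, doc_topic_dist = fastopic_model.fit_transform(cleaned_emails)
```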

We train both models to classify 8 topics in the customer email data. A simple inspection of the topic distribution shows that FASTopic distributes the incoming emails quite evenly across topics. BERTopic classifies emails unevenly, keeping outliers (uncategorized emails) in T-1 and a large share of incoming emails in T0.

Image 5: Topic distribution, email classification. Image by author.

Here are the predicted bigrams for both models with topic labels:

Image 6: Models’ predictions. Image by author.

Because the email corpus is a synthetic LLM-generated dataset, the naive labelling of the topics for both models shows topics that are:

  • Comparable: Time Delays, Latency Issues, User Permissions, Deployment Issues, Compilation Errors,
  • Differing: Unclassified (BERTopic classifies outliers into T-1), Improvement Suggestions, Authorization Errors, Performance Complaints (FASTopic), Cloud Management, Asynchronous Requests, General Requests (BERTopic)

For business purposes, topics should be labelled by the company’s insiders who know the customer base and the business priorities.

4. Model evaluation

If three out of eight classified topics are labeled differently, then which model should be deployed? Let’s now evaluate the coherence and diversity for the trained BERTopic and FASTopic T-8 models.

4.1. NPMI

We need a reference corpus to calculate an NPMI for each model. The Customer IT Support Ticket Dataset from Kaggle, distributed with Attribution 4.0 International license, provides comparable data to our training set. The data is filtered to 11923 English email bodies. 

  1. Calculate an NPMI for each bigram in the reference corpus with this code.
  2. Merge the bigrams predicted by FASTopic and BERTopic with their NPMI scores from the reference corpus. The fewer NaNs there are in the table, the more accurate the metric is.
Image 7: NPMI coherence evaluation. Image by author.

3. Average NPMIs within and across topics to get a single score for each model.
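A condensed sketch of steps 2 and 3 with pandas (the DataFrames, column names, and toy values are my own stand-ins; the repo code may differ):

```python
import pandas as pd

# toy stand-ins for the step 1 outputs; the real tables come from the repo code
reference_npmi = pd.DataFrame({"bigram": ["poor delivery", "late delivery"], "npmi": [0.42, 0.31]})
predicted = pd.DataFrame({
    "model": ["FASTopic", "FASTopic", "BERTopic"],
    "topic": [0, 0, 0],
    "bigram": ["poor delivery", "slow response", "late delivery"],
})

scored = predicted.merge(reference_npmi, on="bigram", how="left")   # step 2: NaN = not in the reference corpus
model_npmi = scored.groupby("model")["npmi"].mean()                 # step 3: average per model (NaNs skipped)
print(model_npmi)
```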

4.2. SC

With SC, we assess the context and semantic similarity of the bigrams predicted by a topic model by examining how often they co-occur with each other in the training corpus. To do so, we:

  1. Create a document-term matrix (DTM) with a count of how many times each bigram appears in each document.
  2. Calculate topic-level SC scores by looking up co-occurrences of the bigrams predicted by the topic models in the DTM.
  3. Average topic SC to a model SC score.
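A sketch of these three steps, using the same simplified ratio as in the toy illustration above (the function and argument names are mine; see the repo for the exact implementation):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def model_sc(topic_bigrams: dict[str, list[str]], docs: list[str]) -> float:
    """Average semantic coherence over all topics, computed from a bigram DTM."""
    vectorizer = CountVectorizer(ngram_range=(2, 2))
    dtm = (vectorizer.fit_transform(docs) > 0).toarray()            # 1. binary document-term matrix
    vocab = vectorizer.vocabulary_
    topic_scores = []
    for bigrams in topic_bigrams.values():                          # 2. co-occurrences per topic
        pair_scores = []
        for i, b1 in enumerate(bigrams):
            for b2 in bigrams[i + 1:]:
                if b1 in vocab and b2 in vocab:
                    d_first = dtm[:, vocab[b1]].sum()                        # docs with the first bigram
                    d_both = (dtm[:, vocab[b1]] & dtm[:, vocab[b2]]).sum()   # docs with both bigrams
                    if d_first:
                        pair_scores.append(d_both / d_first)
        if pair_scores:
            topic_scores.append(sum(pair_scores) / len(pair_scores))
    return float(np.mean(topic_scores))                             # 3. model-level SC
```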

4.3. PUV

The PUV topic diversity metric checks for duplicate bigrams across the topics of a model.

  1. Join bigrams into tokens by replacing spaces with underscores in the FASTopic and BERTopic tables of predicted bigrams.
Image 8: Topic diversity illustration. Image by author.

2. Calculate topic diversity as the count of distinct tokens divided by the total count of tokens in the tables for both models.

4.4. Model comparison

Let’s now summarize the coherence and diversity evaluation in Image 9. BERTopic models are more coherent but less diverse than FASTopic models. The differences are not very large, but BERTopic suffers from an uneven distribution of incoming emails in the pipeline (see the charts in Image 5): around 32% of classified emails fall into T0, and 15% into T-1, which covers the unclassified outliers. The models are trained with a minimum of 20 tokens per topic; increasing this parameter prevents the models from training, probably because of the small data size.

For this reason, FASTopic is a better choice for topic modelling in email classification with small training datasets.

Image 9: Topic model evaluation metrics. Image by author.

The last step is to deploy the model with topic labels in the email platform to classify incoming emails:

Image 10. Topic model classification pipeline, output. Image by author.

Summary

Coherence and diversity metrics compare models trained on the same dataset with a similar setup and cleaning strategy; their absolute values cannot be compared with results from different training sessions. What they offer is a relative comparison of various model specifications, which helps us decide which model should be deployed in the pipeline. Topic model evaluation should always be the last step before model deployment in business practice.

How does customer care benefit from the topic modelling exercise? After the topic model is put into production, the pipeline sends a classified topic for each email to the email platform that Customer Care uses for communicating with customers. With a limited staff, it is now possible to prioritize and respond faster to the most sensitive business requests (such as “time delays” and “latency issues”), and change priorities dynamically. 

Data and the complete code for this tutorial are here.


Petr Korab is a Python Engineer and Founder of Text Mining Stories with over eight years of experience in Business Intelligence and NLP.

Acknowledgments: I thank Tomáš Horský (Lentiamo, Prague), Martin Feldkircher, and Viktoriya Teliha (Vienna School of International Studies) for useful comments and suggestions.

References

[1] Blei, D. M., Lafferty, J. D. 2006. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning (pp. 113–120).

[2] Dieng, A. B., Ruiz, F. J. R., Blei, D. M. 2020. Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics, 8:439–453.

[3] Grootendorst, M. 2022. BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv preprint: 2203.05794.

[4] Korab, P. Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code. Towards Data Science. 22.1.2025. Accessible from: link.

[5] Korab, P. Topic Modelling with BERTopic in Python. Towards Data Science. 4.1.2024. Accessible from: link.

[6] Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., Luu, A. T. 2024. FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. arXiv preprint: 2405.17978.

[7] Mimno, D., Wallach, H. M., Talley, E., Leenders, M., McCallum, A. 2011. Optimizing Semantic Coherence in Topic Models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.

[8] Prostmaier, B., Vávra, J., Grün, B., Hofmarcher, P. 2025. Seeded Poisson Factorization: Leveraging domain knowledge to fit topic models. arXiv preprint.
