Contextualized Topic Modeling with Python (EACL2021)
Combining BERT and friends with Neural Variational Topic Models

In this blog post, I discuss our latest published paper on topic modeling:
Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. European Chapter of the Association for Computational Linguistics (EACL). https://arxiv.org/abs/2004.07737
Suppose we have a small set of documents in Portuguese that is not large enough to reliably run standard topic modeling algorithms, but we have enough English documents in the same domain. With our cross-lingual zero-shot topic model (ZeroShotTM), we can first learn topics on the English documents and then predict topics for the Portuguese ones (as long as we use pre-trained representations that account for both English and Portuguese).
We also release a Python package that can be used to run topic modeling! Take a look at our GitHub page and at our Colab tutorial!

Warm-Up: What is Topic Modeling?
Topic models allow us to extract meaningful patterns from text, making it easier to skim large collections and to understand the latent topic distributions that live underneath. Say you want a bird's-eye view of a set of documents you have at hand; reading them one by one is not a great idea, right?
Topic models can help: they look at your document collection and they extract recurrent themes.

Some Details
Topic models usually make two main assumptions.
First of all, a document can talk about different topics in different proportions. For example, imagine that we have three topics, i.e. “human being”, “evolution” and “diseases”. A document can talk a little about humans, a little about evolution, and the rest about diseases. In probabilistic terms, we can say that it talks 20% about humans, 20% about evolution, and 60% about diseases. This can easily be expressed as a multinomial distribution over the topics. This probability distribution is called the document-topic distribution.

Secondly, a topic in topic modeling is not just an unordered list of words: in a topic like “animal, cat, dog, puppy”, each word of the vocabulary contributes with a specific weight. In other words, a topic too can be expressed as a multinomial distribution, where the words with the highest probability are the ones that contribute the most to that topic. This distribution is called the word-topic distribution.
The most well-known topic model is LDA (Blei et al., 2003), which also assumes that the words in a document are independent of each other, i.e., that a document is represented as a Bag of Words (BoW). Several topic models have been proposed over the years, addressing a wide variety of problems and tasks. State-of-the-art topic models include neural topic models based on variational autoencoders (VAEs) (Kingma & Welling, 2014).
Two Problems with Standard Topic Modeling
However, these models usually suffer from two limitations:
- Once trained, most topic models cannot deal with unseen words, because they are based on Bag-of-Words (BoW) representations, which cannot account for missing terms.
- It is difficult to apply topic models to multilingual corpora without combining the vocabularies of multiple languages (Mimno et al., 2009; Jagarlamudi & Daumé, 2010), which makes the task computationally expensive and offers no support for zero-shot learning.
How do we solve this?
Contextualized Topic Model: inviting BERT and friends to the table
Our new neural topic model, ZeroShotTM, takes care of both problems we just illustrated. ZeroShotTM is a neural variational topic model that is based on recent advances in language pre-training (for example, contextualized word embedding models such as BERT).
Yes, you got it right! We propose to combine deep-learning-based topic models with recent embedding techniques such as BERT or XLM.

A pre-trained representation of the documents is passed to the neural architecture and then used to reconstruct the original BoW of the document. Once the model is trained, ZeroShotTM can generate the representations of test documents and predict their topic distributions, even if those documents contain words that were unseen during training.
Moreover, if we use a multilingual pre-trained representation during training, we get a significant advantage at test time. Using representations that share the same embedding space allows the model to learn topic representations that are shared by documents in different languages. A trained model can then predict the topics of documents in languages that were unseen during training.
We extend a neural topic model, ProdLDA (Srivastava & Sutton, 2017), which is based on a Variational Autoencoder. ProdLDA takes as input the BoW representation of a document and learns the two parameters μ and σ² of a Gaussian distribution. A continuous latent representation is sampled from this distribution and then passed through a softmax, yielding the document-topic distribution. This distribution is then used to reconstruct the original BoW representation of the document.
Instead of using the BoW representation of a document as the model input, we pass its pre-trained contextualized representation to the neural architecture and use it to reconstruct the document's original BoW.
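To make the architecture concrete, here is a minimal PyTorch sketch of the idea. It is not the package's actual implementation (which also uses batch normalization, dropout, learned priors, and the full VAE objective), and the class and parameter names below are made up for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContextualizedNeuralTopicModel(nn.Module):
        # A ProdLDA-style VAE that encodes a contextualized document embedding
        # into a Gaussian, samples a latent vector, turns it into a
        # document-topic distribution, and reconstructs the document's BoW.
        def __init__(self, contextual_size, vocab_size, n_topics, hidden_size=100):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(contextual_size, hidden_size), nn.Softplus(),
                nn.Linear(hidden_size, hidden_size), nn.Softplus(),
            )
            self.mu = nn.Linear(hidden_size, n_topics)       # mean of the Gaussian
            self.log_var = nn.Linear(hidden_size, n_topics)  # log-variance of the Gaussian
            self.beta = nn.Linear(n_topics, vocab_size)      # topic-word decoder

        def forward(self, contextual_embedding):
            h = self.encoder(contextual_embedding)
            mu, log_var = self.mu(h), self.log_var(h)
            # reparameterization trick: sample the continuous latent representation
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
            theta = F.softmax(z, dim=-1)                     # document-topic distribution
            word_dist = F.softmax(self.beta(theta), dim=-1)  # reconstructed BoW distribution
            return word_dist, mu, log_var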

TL;DR
ZeroShotTM has two main advantages: 1) it can handle test-set words that are missing from the training set, and 2) it inherits the multilingual capabilities of recent pre-trained multilingual models.
With standard topic models, you often have to remove from the test set those words that are missing from the training set. Since we rely on a contextualized model instead, we can use it to build the representation of a test document and exploit the full power of these embeddings. Moreover, standard multilingual topic modeling requires handling multiple language vocabularies.
ZeroShotTM can be trained on data in English (a data-rich language) and tested on data in a lower-resource language. For example, you can train it on English Wikipedia documents and test it on completely unseen Portuguese documents, as we show in the paper.
Contextualized Topic Modeling: A Python Package
We have built an entire package around this model. You can run the topic models and get results with a few lines of code. On the package homepage, we have different Colab Notebooks that can help you run experiments.
You can follow the example here or directly on Colab.
Depending on the use case, you might want to use a specific contextualized model. In this case, we are going to use the distiluse-base-multilingual-cased model.
When using topic models, it is usually a good idea to pre-process the text. However, when we are dealing with contextualized models such as BERT or the Universal Sentence Encoder, it might be better not to do much pre-processing, since these models are contextual and rely heavily on the surrounding context.
Today’s Task
We are going to train a topic model on English Wikipedia documents and predict the topics for Italian documents.
The first thing you need to do is install the package. From the command line you can run:
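The package is published on PyPI as contextualized-topic-models; since the API evolves, check the GitHub page for instructions matching the version you install.

    pip install contextualized-topic-models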
Be sure to install the correct PyTorch version for your system and also, if you want to use CUDA, install the version that supports CUDA (you might find it easier to use colab).
Data
Let’s download some data; we are going to get it from an online repository. These two commands will download the English abstracts and the Italian documents.
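For example (the URLs below are placeholders; the actual files are linked from the package's GitHub page and the Colab tutorial):

    wget https://example.com/english_documents.txt    # English abstracts (placeholder URL)
    wget https://example.com/italian_documents.txt    # Italian documents (placeholder URL)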
If you open the English file, the first document you see should contain the following text:
The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades, it was not until The Niagara...
Now, let’s apply some pre-processing. We need this step to create the bag of words that our model will use to represent the topics. Nevertheless, we will still use the non-pre-processed dataset to generate the representations with the distiluse-base-multilingual-cased model.
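Here is a minimal sketch of this step, assuming the English abstracts have been read into a list of strings (in practice, the package ships its own pre-processing utilities you can use instead; the file name below is a placeholder):

    import string
    from collections import Counter

    def simple_preprocess(documents, vocabulary_size=2000):
        # lowercase, strip punctuation, and keep only the most frequent words
        tokenized = [
            doc.lower().translate(str.maketrans("", "", string.punctuation)).split()
            for doc in documents
        ]
        counts = Counter(word for doc in tokenized for word in doc)
        vocabulary = {word for word, _ in counts.most_common(vocabulary_size)}
        return [" ".join(w for w in doc if w in vocabulary) for doc in tokenized]

    # placeholder filename for the English file downloaded above
    with open("english_documents.txt", encoding="utf-8") as f:
        documents = [line.strip() for line in f]

    preprocessed_documents = simple_preprocess(documents)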
Let’s Train Folks
Our pre-processed documents will contain only the 2K most frequent words; this optimization step lets us remove words that are probably not very meaningful. The TopicModelDataPreparation object allows us to create the dataset used to train the topic model. Then, with our ZeroShotTM object, we can train the model.
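Roughly, the training code looks like the following. This is a sketch based on the package README at the time of writing; constructor and parameter names (ZeroShotTM, TopicModelDataPreparation, bow_size, contextual_size, n_components) may differ slightly across package versions:

    from contextualized_topic_models.models.ctm import ZeroShotTM
    from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

    # build the BoW and the contextualized representations of the training documents
    tp = TopicModelDataPreparation("distiluse-base-multilingual-cased")
    training_dataset = tp.fit(text_for_contextual=documents,
                              text_for_bow=preprocessed_documents)

    # 512 is the embedding size of distiluse-base-multilingual-cased
    ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=512, n_components=50)
    ctm.fit(training_dataset)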
Playing with the model
One thing we can quickly check is the topics. Running the following method should print some of them.
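Assuming your version of the package exposes get_topic_lists (it returns the top words of each topic), this is all you need:

    ctm.get_topic_lists(5)   # top 5 words for each topic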
Here are a few of them. They are noice, ain’t they?
[['house', 'built', 'located', 'national', 'historic'],
['family', 'found', 'species', 'mm', 'moth'],
['district', 'village', 'km', 'county', 'west'],
['station', 'line', 'railway', 'river', 'near'],
['member', 'politician', 'party', 'general', 'political'],
 ...]
If you want a better visualization, you can get one like this:
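One option is pyLDAvis; treat the snippet below as a sketch, since get_ldavis_data_format is the helper exposed by more recent versions of the package and may not match the version you have installed:

    import pyLDAvis as vis

    lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)
    ctm_vis = vis.prepare(**lda_vis_data)
    vis.display(ctm_vis)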

Yea, but where’s the multilingual?
So now, we need to do some inference. We could of course predict the topics of new, unseen English documents, but why don’t we try Italian instead?
The procedure is similar to what we have already seen; in this case, we collect the topic distribution of each Italian document and extract its most probable topic.
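A sketch of this step, again following the package README; transform and get_doc_topic_distribution are assumptions about the installed version, and the file name is a placeholder:

    import numpy as np

    # raw, unprocessed Italian documents; no BoW is needed at test time
    with open("italian_documents.txt", encoding="utf-8") as f:   # placeholder filename
        italian_documents = [line.strip() for line in f]

    testing_dataset = tp.transform(text_for_contextual=italian_documents)

    # document-topic distribution for each Italian document
    topic_distributions = ctm.get_doc_topic_distribution(testing_dataset, n_samples=10)
    most_probable_topics = np.argmax(topic_distributions, axis=1)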
The document contained the following text, about a movie:
Bound - Torbido inganno (Bound) è un film del 1996 scritto e diretto da Lana e Lilly Wachowski. È il primo film diretto dalle sorelle Wachowski, prima del grande successo cinematografico e mediatico di Matrix, che avverrà quasi quattro anni dopo.
(In English: Bound is a 1996 film written and directed by Lana and Lilly Wachowski. It is the first film directed by the Wachowski sisters, before the huge cinematic and media success of The Matrix, which would come almost four years later.)
Look at the predicted topic:
['film', 'produced', 'released', 'series', 'directed', 'television', 'written', 'starring', 'game', 'album']
Makes sense, right?
That’s it for this blog post. Feel free to send me an email if you have questions or if there is something you’d like to discuss :) You can also find me on Twitter.
References
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. JMLR.
Jagarlamudi, J., & Daumé, H. (2010). Extracting multilingual topics from unaligned comparable corpora. ECIR.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
Mimno, D., Wallach, H., Naradowsky, J., Smith, D. A., & McCallum, A. (2009). Polylingual topic models. EMNLP.
Srivastava, A., & Sutton, C. (2017). Autoencoding variational inference for topic models. ICLR.
Credits
The content of this blog post was written mainly by me and by Silvia Terragni. Many thanks to Debora Nozza and Dirk Hovy for their comments on a previous version of this article.