Exploring Textual Data using LDA

Make sense of unstructured text data by applying machine learning principles.


I recently completed my first machine learning project at work and decided to apply the same methods to a project of my own. The work project revolved around automatically classifying textual data using Latent Dirichlet Allocation (LDA).

LDA is an unsupervised machine learning model in the natural language processing arena. Because of its unsupervised nature, LDA does not require a labeled training set. This makes it ideal for certain use cases, or for when large, labeled textual datasets are not readily available.

LDA is used chiefly for topic modeling: clustering text documents by similarity. Document size can range from as small as a single word (not ideal) to as large as an entire publication. The content of the LDA clusters is determined by the terms (words) in each document and the frequency, and sometimes even order (using n-grams), in which they appear. Documents deemed similar to each other are clustered together, and we assume each cluster is representative of a topic, although we do not know what that topic is until after the cluster has been created. It’s important to point out that the model understands neither the content nor the context of the documents in these clusters and therefore cannot actually give the clusters a topic label. It instead “labels” each cluster with an integer index from 0 to n-1, n being the number of topics we tell the model to look for. A human, or a very smart aquatic mammal, is required to analyze the clusters and determine how each cluster should be labeled.

In this post, we’ll clean some Twitter data and write an LDA model to cluster that data. We’ll then use pyLDAvis to generate an interactive visualization of the clusters.

Key dependencies: pandas, nltk, gensim, numpy, pyLDAvis

Here are some definitions to be familiar with beforehand:

  1. document: a text object (e.g. a tweet)
  2. dictionary: a list of all unique tokens (words, terms) in our collection of documents, each with a unique integer identifier
  3. bag-of-words: a collection of all of our documents, each document reduced to a list of (term id, count) tuples, one tuple for each unique word in the document — using gensim’s doc2bow, each tuple holds the term’s unique integer id at index 0 and the number of times it occurs in the document at index 1. (e.g. the document “the box was in the bigger box” would be reduced to something like [(“the”, 2), (“box”, 2), (“was”, 1), (“in”, 1), (“bigger”, 1)], but with each “term” substituted by the term’s unique dictionary id — see the short example after this list)
  4. coherence score: a float value ranging from 0 to 1 used to evaluate how well our model and number of clusters fit our data
  5. cluster: a node representing a group of documents, an inferred topic
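
To make the bag-of-words idea concrete, here is a tiny sketch (my own toy example, not part of the project code) using gensim’s Dictionary and doc2bow on the sentence above; the exact integer ids depend on the dictionary:

from gensim.corpora import Dictionary

toy_docs = [["the", "box", "was", "in", "the", "bigger", "box"]]
toy_dict = Dictionary(toy_docs)        # assigns each unique token an integer id
print(toy_dict.token2id)               # e.g. {'bigger': 0, 'box': 1, 'in': 2, 'the': 3, 'was': 4}
print(toy_dict.doc2bow(toy_docs[0]))   # e.g. [(0, 1), (1, 2), (2, 1), (3, 2), (4, 1)]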

Earlier this year, I began collecting a couple hundred thousand political tweets, the end goal being to run various analyses on the tweets and their metadata leading up to the 2020 U.S. presidential election.

This post’s dataset will consist of 3,500 tweets that mention at least one of the following: “@berniesanders”, “@kamalaharris”, “@joebiden”, “@ewarren” (the Twitter handles for Bernie Sanders, Kamala Harris, Joe Biden, and Elizabeth Warren respectively). I collected these tweets in early November 2019 and have made them available for download here. We’ll explore this data and attempt to figure out what people were tweeting about in early November.

I won’t delve into how to collect the tweets, but I’ve included the code I used below. Successfully running it requires Twitter API credentials, accessed here through tweepy. I did not collect retweets, nor did I collect tweets that were not written in English (the model would require much more tuning to accommodate multiple languages).

import csv
import tweepy
from tweepy import StreamListener  # StreamListener is part of tweepy 3.x

class Streamer(StreamListener):
    def __init__(self):
        super().__init__()
        self.limit = 1000  # Number of tweets to collect.
        self.statuses = []  # Pass each status here.

    def on_status(self, status):
        if status.retweeted or "RT @" in status.text or status.lang != "en":
            return  # Skip re-tweets and non-English tweets.
        if len(self.statuses) < self.limit:
            self.statuses.append(status)
            print(len(self.statuses))  # Get count of statuses.
        if len(self.statuses) == self.limit:
            with open("/tweet_data.csv", "w") as file:
                writer = csv.writer(file)  # Saving data to csv.
                for status in self.statuses:
                    writer.writerow([status.id, status.text,
                                     status.created_at, status.user.name,
                                     status.user.screen_name,
                                     status.user.followers_count,
                                     status.user.location])
            print(self.statuses)
            print(f"\n*** Limit of {self.limit} met ***")
            return False
        if len(self.statuses) > self.limit:
            return False

streaming = tweepy.Stream(auth=setup.api.auth, listener=Streamer())

items = ["@berniesanders", "@kamalaharris", "@joebiden", "@ewarren"] # Keywords to track

stream_data = streaming.filter(track=items)

This writes the tweet text along with its metadata (id, created date, name, username, follower count, and location) to a csv named tweet_data.csv.

import pandas as pd

df = pd.read_csv(r"/tweet_data.csv", names=["id", "text", "date", "name", "username", "followers", "loc"])

Now that we have our data packed into a neat csv, we can begin prepping it for our LDA machine learning model. Text data is typically considered unstructured and requires cleaning before meaningful analysis can be conducted. Tweets are particularly messy due to their inconsistent nature. For example, any given Twitter user may tweet in full sentences one day and then single words and hashtags the next. Another user may only tweet links, and another may only tweet hashtags. On top of that, there are grammatical and spelling errors that users leave in, intentionally or not, as well as colloquial terms that would not appear in a standard English dictionary.

Cleaning

We’ll remove all punctuation marks, special characters and url links and then apply lower() to each tweet. This brings some level of consistency to our documents (remember each tweet is treated as a document). I’ve also removed instances of “berniesanders”, “kamalaharris”, “joebiden”, and “ewarren” as they will skew our term frequencies since each document will contain at least one of these items.

import string

ppl = ["berniesanders", "kamalaharris", "joebiden", "ewarren"]

def clean(txt):
    # Strip punctuation and lowercase.
    txt = str(txt).translate(str.maketrans("", "", string.punctuation)).lower()
    # Drop links and the candidates' handles (build a new list rather than
    # removing items mid-iteration, which can skip elements).
    tokens = [t for t in txt.split() if "http" not in t and t not in ppl]
    return " ".join(tokens)

df.text = df.text.apply(clean)

Below are the packages we need to import to prepare our data before feeding it to our model. I’ll include these imports when writing the code for the data prep as well.

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS as stopwords
import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer as lemm, SnowballStemmer as stemm
from nltk.stem.porter import *
import numpy as np
np.random.seed(0)

We’ve already cleaned our documents a bit, but now we need to lemmatize and stem them. Lemmatization reduces each word to its dictionary form (its lemma), so inflected verbs, for example, collapse to their base form. Stemming trims words down to a root by stripping suffixes, which may not leave a real word. Luckily, nltk has both a lemmatizer and a stemmer that we can leverage.
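
As a quick illustration (a toy example of my own, not part of the pipeline; exact outputs can vary slightly with nltk versions), here is what each step does on its own:

from nltk.stem import WordNetLemmatizer as lemm, SnowballStemmer as stemm

print(lemm().lemmatize("ran", pos="v"))            # "run"   -- dictionary form of the verb
print(stemm(language="english").stem("running"))   # "run"   -- suffix stripped
print(stemm(language="english").stem("studies"))   # "studi" -- stems need not be real words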

LDA involves a stochastic process, meaning our model draws random numbers during training, hence the numpy import. Adding numpy.random.seed(0) makes our model reproducible, as it will draw the same random numbers instead of generating new ones every time the code is run.
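
As an aside, gensim’s LDA classes also accept their own random_state argument; the snippet below is only an optional variation on the LdaMulticore call we make later, shown to pin the model’s internal sampling as well:

# Optional variation (uses the bow and dictionary objects we build later):
lda = gensim.models.LdaMulticore(bow, num_topics=5, id2word=dictionary,
                                 passes=2, workers=2, random_state=0)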

Gensim’s STOPWORDS is a built-in collection of terms deemed irrelevant or likely to muddy our bag-of-words. In NLP, “stopwords” refers to the terms we do not want our model to pick up, and this collection will be used to remove them from our documents. We can print(stopwords) to view the terms that will be removed.

Here are the terms in stopwords.

[Screenshot: the full list of gensim STOPWORDS terms]

For this model, we’ll leave the stopwords list untouched but in some cases, it may be necessary to add specific terms we want our model to ignore. The below code is one way to add terms to stopwords.

stopwords = stopwords.union(set(["add_term_1", "add_term_2"]))

Lemmatizing and Stemming

Let’s write some code for our data prep.

import warnings 
warnings.simplefilter("ignore")
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS as stopwords
import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer as lemm, SnowballStemmer as stemm
from nltk.stem.porter import *
import numpy as np
np.random.seed(0)

Initialize a stemmer.

stemmer = stemm(language="english")

Write a function that will both lemmatize and stem our documents. GeeksforGeeks has examples of using nltk for both lemmatizing and stemming.

def lemm_stemm(txt):
    return stemmer.stem(lemm().lemmatize(txt, pos="v"))

Write a function that will remove stopwords from our documents while also applying lemm_stemm().

def preprocess(txt):
    return [lemm_stemm(token) for token in simple_preprocess(txt)
            if token not in stopwords and len(token) > 2]

Assign our cleaned and prepared documents to a new variable.

proc_docs = df.text.apply(preprocess)
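
For example (a made-up sentence of my own; the exact stems may differ slightly by nltk/gensim version):

sample = "Voters are rallying for the candidates!"
print(preprocess(sample))   # roughly ['voter', 'ralli', 'candid'] -- stopwords dropped, the rest lemmatized and stemmed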

Now that we’ve prepared our data, we can begin writing our model.

Dictionary

As mentioned in the Introduction, a dictionary (in LDA) is a list of all unique terms that occur throughout our collection of documents. We’ll be going with gensim’s corpora package to construct our dictionary.

dictionary = gensim.corpora.Dictionary(proc_docs)
dictionary.filter_extremes(no_below=5, no_above= .90)
len(dictionary)

The filter_extremes() parameters serve as a second line of defense against stopwords and other commonly used terms that add little substance to the meaning of a sentence. Playing around with these parameters can help fine-tune the model. I won’t go into detail here, but I’ve included the screenshot below from gensim’s dictionary documentation explaining the parameters.

[Screenshot: gensim’s documentation for Dictionary.filter_extremes()]
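
In short (my paraphrase of those docs; keep_n is shown with its documented default), the call above does the following:

# no_below : keep tokens that appear in at least this many documents (absolute count)
# no_above : drop tokens that appear in more than this fraction of all documents
# keep_n   : after the two filters above, keep only the most frequent tokens (None keeps everything)
dictionary.filter_extremes(no_below=5, no_above=0.90, keep_n=100000)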

Our dictionary has 972 unique tokens (terms).
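
To spot-check what made it into the dictionary, you can peek at its token-to-id mapping (the specific tokens and ids will of course depend on the data):

print(len(dictionary))                          # 972 in this run
print(list(dictionary.token2id.items())[:10])   # a few (token, id) pairs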


Bag-of-Words

As stated in the Introduction, a bag-of-words (in LDA) is a collection of all our documents, each broken down into (term id, count) tuples: a term’s dictionary identifier paired with the number of times it occurs in that document.

n = 5  # Number of clusters we want to fit our data to
bow = [dictionary.doc2bow(doc) for doc in proc_docs]
lda = gensim.models.LdaMulticore(bow, num_topics=n, id2word=dictionary, passes=2, workers=2)
print(bow)

Let’s see how our clusters are forming by looking at the key terms that define them.

for id, topic in lda.print_topics(-1):
    print(f"TOPIC: {id} \n WORDS: {topic}")
[Output: the top key terms and weights for each of the five topics]

Looking at each topic cluster we can get an idea of what they represent. Take a look at Topic 1 and Topic 4.

Regarding Topic 1: the key terms “cenkuygur” and “anakasparian” refer to Cenk Uygur and Ana Kasparian, co-hosts of The Young Turks (a political news and commentary show). Topic 1 also includes the key terms “right”, “trump”, and “nra”.

On November 15, there was a school shooting at Saugus High School near Santa Clarita, California. There was significant media coverage and online buzz regarding this tragic event. The Young Turks (TYT) is a vocal proponent of stricter gun laws and regularly butts heads with the NRA and other gun-rights groups. TYT has even spearheaded a pledge campaign called #NeverNRA.

This topic cluster can be labelled as “TYT vs the NRA”, or something similar.

Regarding Topic 4: the terms “cenkuygur” and “anakasparian” appear again in Topic 4, alongside “theyoungturk”, referring to The Young Turks, and “berni”, referring to Bernie Sanders.

On November 12, Cenk Uygur put out a public endorsement for candidate Bernie Sanders. This endorsement was repeated by TYT’s Twitter account. Bernie Sanders then publicly thanked them for the endorsement. Also, on November 14, Mr. Uygur announced he was running for Congress. Both these developments garnered notable attention on Twitter.

This topic cluster can be labelled as “TYT & Bernie Sanders”, or something similar.


There are similar explanations for the other topic clusters as well.

Most good machine learning models and applications have a feedback loop, a way to evaluate the model’s performance, scalability, and overall quality. In the topic modeling space, we use coherence scores, which measure how semantically consistent the top terms of each topic are. As I mentioned in the Introduction, coherence is a float value between 0 and 1. We’ll use gensim for this as well.

# Eval via coherence scoring
from gensim import corpora, models
from gensim.models import CoherenceModel
from pprint import pprint
coh = CoherenceModel(model=lda, texts= proc_docs, dictionary = dictionary, coherence = "c_v")
coh_lda = coh.get_coherence()
print("Coherence Score:", coh_lda)

We’ve received a coherence score of 0.44. This isn’t the best, but it isn’t too bad either, and it was achieved without any fine-tuning. Digging into our parameters and testing the outcomes should yield a higher score. There isn’t an official threshold for scoring; my coherence score goal is typically around 0.65. See this article and this Stack Overflow thread for more on coherence scoring.
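
One simple way to start that tuning (a sketch of my own, not part of the original workflow) is to train a model per candidate topic count and compare their coherence scores:

# Reuses bow, dictionary, and proc_docs from above.
def coherence_for(num_topics):
    model = gensim.models.LdaMulticore(bow, num_topics=num_topics,
                                       id2word=dictionary, passes=2, workers=2)
    cm = CoherenceModel(model=model, texts=proc_docs,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()

for k in (3, 5, 8, 12):
    print(k, round(coherence_for(k), 3))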

Visualize with pyLDAvis

Lastly, we can visualize our clusters using pyLDAvis. This package creates an intertopic distance map, with the clusters plotted along an x and y axis. The map can be opened in Jupyter by calling pyLDAvis.display(), or in the browser by calling pyLDAvis.show().

import pyLDAvis.gensim as pyldavis
import pyLDAvis
lda_display = pyldavis.prepare(lda, bow, dictionary)
pyLDAvis.show(lda_display)
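
If you would rather share the visualization as a file, pyLDAvis can also write it out as standalone HTML (the filename here is just an example):

pyLDAvis.save_html(lda_display, "lda_clusters.html")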

Here is a screenshot of our pyLDAvis distance map.

[Screenshot: the pyLDAvis intertopic distance map for our five clusters]

Hovering over each cluster brings up the frequency of its key terms within that cluster (in red) alongside the frequency of those same terms across the entire collection of documents (in blue). This is an effective way of displaying findings to stakeholders.

Conclusion

Here’s all the code I used above, including the code I used to generate the word cloud and the code I used to collect the tweet data.

### All Dependencies ###

import csv
import tweepy
from tweepy import StreamListener  # StreamListener is part of tweepy 3.x
import pandas as pd
from wordcloud import WordCloud as cloud
import matplotlib.pyplot as plt
import string
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS as stopwords
import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer as lemm, SnowballStemmer as stemm
from nltk.stem.porter import *
import numpy as np
np.random.seed(0)
from gensim import corpora, models
from gensim.models import CoherenceModel
from pprint import pprint
import pyLDAvis.gensim as pyldavis
import pyLDAvis

### Word Cloud ###

df = pd.read_csv(r"/tweet_data.csv", names=["id", "text", "date", "name",
"username", "followers", "loc"])

def clean(txt):
    # Drop links before building the word cloud.
    return " ".join(t for t in str(txt).split() if "http" not in t)

text = (df.text.apply(clean))

wc = cloud(background_color='white', colormap="tab10").generate(" ".join(text))

plt.axis("off")
plt.text(2, 210, "Generated using word_cloud and this post's dataset.", size = 5, color="grey")

plt.imshow(wc)
plt.show()

### Stream & Collect Tweets ###

class Streamer(StreamListener):
    def __init__(self):
        super().__init__()
        self.limit = 1000  # Number of tweets to collect.
        self.statuses = []  # Pass each status here.

    def on_status(self, status):
        if status.retweeted or "RT @" in status.text or status.lang != "en":
            return  # Skip re-tweets and non-English tweets.
        if len(self.statuses) < self.limit:
            self.statuses.append(status)
            print(len(self.statuses))  # Get count of statuses.
        if len(self.statuses) == self.limit:
            with open("/tweet_data.csv", "w") as file:
                writer = csv.writer(file)  # Saving data to csv.
                for status in self.statuses:
                    writer.writerow([status.id, status.text,
                                     status.created_at, status.user.name,
                                     status.user.screen_name,
                                     status.user.followers_count,
                                     status.user.location])
            print(self.statuses)
            print(f"\n*** Limit of {self.limit} met ***")
            return False
        if len(self.statuses) > self.limit:
            return False

streaming = tweepy.Stream(auth=setup.api.auth, listener=Streamer())

items = ["@berniesanders", "@kamalaharris", "@joebiden", "@ewarren"] # Keywords to track

stream_data = streaming.filter(track=items)

### Data ###

df = pd.read_csv(r"/tweet_data.csv", names= ["id", "text", "date", "name",
"username", "followers", "loc"])

### Data Cleaning ###

ppl = ["berniesanders", "kamalaharris", "joebiden", "ewarren"]

def clean(txt):
    # Strip punctuation and lowercase.
    txt = str(txt).translate(str.maketrans("", "", string.punctuation)).lower()
    # Drop links and the candidates' handles.
    tokens = [t for t in txt.split() if "http" not in t and t not in ppl]
    return " ".join(tokens)

df.text = df.text.apply(clean)

### Data Prep ###

# print(stopwords)

# If you want to add to the stopwords list: stopwords = stopwords.union(set(["add_term_1", "add_term_2"]))

### Lemmatize and Stem ###

stemmer = stemm(language="english")

def lemm_stemm(txt):
    return stemmer.stem(lemm().lemmatize(txt, pos="v"))

def preprocess(txt):
    return [lemm_stemm(token) for token in simple_preprocess(txt)
            if token not in stopwords and len(token) > 2]

proc_docs = df.text.apply(preprocess)

### LDA Model ###

dictionary = gensim.corpora.Dictionary(proc_docs)
dictionary.filter_extremes(no_below=5, no_above= .90)
# print(dictionary)

n = 5  # Number of clusters we want to fit our data to
bow = [dictionary.doc2bow(doc) for doc in proc_docs]
lda = gensim.models.LdaMulticore(bow, num_topics=n, id2word=dictionary, passes=2, workers=2)
# print(bow)

for id, topic in lda.print_topics(-1):
    print(f"TOPIC: {id} \n WORDS: {topic}")

### Coherence Scoring ###

coh = CoherenceModel(model=lda, texts= proc_docs, dictionary = dictionary, coherence = "c_v")
coh_lda = coh.get_coherence()
print("Coherence Score:", coh_lda)

lda_display = pyldavis.prepare(lda, bow, dictionary)
pyLDAvis.show(lda_display)

LDA is a great model for exploring textual data, although it requires a good amount of optimization (depending on the use case) to be used in production. The gensim, nltk, and pyLDAvis packages are invaluable when writing, evaluating, and displaying models.

Thanks a bunch for allowing me to share, more to come. 😃
