A step by step mathematical and code-based guide on demystifying TF-IDF values by calculating them on a mystic poem by Rumi.
As you skimmed through the title and subtitle above, you have already formed the idea that this post is fairly technical. Most likely, you are going to skim through this post first, glance at the titles before deciding whether you want to read the entire article. Maybe, the formula and the code will entice you to read this article. Or, maybe it will cause you to put off reading this post. As you can see, you have a great mechanism to process text quite efficiently.
In the field of Natural Language Processing, we are in trying to get the algorithms to understand the text. Most algorithms are mathematical functions.So, they need a representation of text as numbers to work effectively.This article drills down into a pretty old mechanism to represent text as numbers called TF-IDF a.k.a Term Frequency Inverse Document Frequency.
So, what is TF-IDF?
Intuitively, to understand what text is about, we look for words that occur frequently. Term frequency covers that aspect by capturing the number of times each word occurs in the text. However, in this article, there are 200+ occurrences of the word ‘the’ and only 55 occurrences of the word TF (which includes the code used as well). To downgrade the relative importance of words that occur all too frequently, an inverse weighting is introduced to scale down the words that occur too frequently. This inverse weighting is referred to as Inverse Document Frequency. Together, TF-IDF captures the relative importance of words in a set of documents or a collection of texts.
There are many great articles written on the intuition behind TF-IDF and not many written on how to derive the exact values for TF-IDF. The focus of this article is to piece together various calculations involved and provide how to derive each step with python code so that you can derive it on the texts you are working with.
In the field of Natural Language Processing, discoveries of word embedding in 2013 and language models in 2018 have changed the landscape and led to many exciting developments in NLP. So, it seems a bit strange that here in 2020, I have chosen to write 2000+ words about TF IDF which was first formulated in the 1970s. Here are my 2 reasons on why I think it is good to understand the underlying calculations.
- Understanding TF-IDF makes it one step easier to understand and interpret the results of algorithms you apply on top of TF-IDF.
- In an applied business context, the text classification problem is one of the common problems in NLP. In text classification problems, the algorithms have to predict the topic based on a predefined set of topics it has trained on. In 2018, Google released a text classification framework based on 450K experiments on a few different text sets. Based on the 450K experiments, Google found that when the number of samples/number of words < 1500, TF IDF was the best way to represent text. When you have a smallish sample size for a relatively common problem, it helps to try out TF IDF.
We will be using a beautiful poem by the mystic poet and scholar Rumi as our example corpus. First, we will calculate TF IDF values for the poem using TF IDF Vectorizer from the sklearn package. Then, we will pull apart the various components and work through various steps involved in calculating TF-IDF values. Mathematical calculations and Python code will be provided for each step.
So, let’s go!
To illustrate the concept of TF-IDF, we need a corpus. A corpus is a collection of documents. In a typical Natural Language Processing problem, a corpus can vary from a list of call center logs/ transcripts, a list of social media feedback to a large collection of research documents.
To illustrate the various steps involved, we need to keep the corpus as small as possible so that the matrices fit neatly on a page. I chanced upon this quote / beautiful poem from the 13th-century Persian poet and Sufi mystic Rumi (Jalāl ad-Dīn Muhammad Rūmī) and it fits our use case perfectly. So, we will be using this poem as our list of documents with each sentence considered as a document.
Isn’t that a really beautiful poem? It’s good to uplift ourselves as we power through code and a little math.
We represent the 8 sentences above as a list of strings in Python.
corpus = ["you were born with potential",
"you were born with goodness and trust",
"you were born with ideals and dreams",
"you were born with greatness",
"you were born with wings",
"you are not meant for crawling, so don't",
"you have wings",
"learn to use them and fly"
We will be decimating the beautiful poem into mysterious decimals in this step. But, hey, after all, we are trying to demystify these decimals by understanding the calculations involved in TF-IDF. As mentioned before, it is quite easy to derive through sklearn package.
#transform the tf idf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer()
X_train_tf_idf = tf_idf_vect.fit_transform(corpus)
terms = tf_idf_vect.get_feature_names()
In the matrix below, each row represents a sentence from the above poem. Each column represents a unique word in the poem in alphabetical order. As you can see, there are lot of zeros in the matrix. So, a memory-efficient sparse matrix is used for representing this. I have converted it to a data frame for ease of visualization.
Let us interpret the numbers we have received so far. As you may have noticed, the words “you were born” are repeated throughout the poem. So, we anticipate that these words will not be getting high TF-IDF scores. If you look at the values for those three words, you can see that most often they get between 0.2 and 0.3.
Let us look at Document 0— You were born with potential. The word potential stands out. If you look at the various TF-IDF values in the first row in the matrix, you will see that the word potential has the highest TF-IDF value.
Let us look at Document 4 (row 5): You were born with wings. Again, same as before, the word “wings” has the highest value in that sentence.
Notice that the word “wings” appears also in Document 6. TF-IDF value for the word “wings” in Document 6 is different to TF-IDF value for the word “wings” in Document 4. In Document 6, the word “wings” is deemed less important than the word “have” in Document 6.
The objective of this article is to look at how the above TF-IDF values can be calculated from scratch. We will be focusing on applying the calculations on the words wings and potential in particular, to derive the values highlighted in red in the matrix displayed above.
We will break apart the various components and then put them back together. We will do this in three steps:
- Step 1: Derive term frequency values from scratch
- Step 2: Derive inverse document frequency values from scratch
- Step 3: Aggregate the above two values using multiplication and normalization
The term frequency is pretty straight forward. It is calculated as the number of times the words/terms appear in a document.
For the sentences, “you were born with potential” and “you were born with wings”, the following are the term frequency values.
Extending the same, to all 8 sentences in the poem, we get the following word count matrix. As before, rows represent sentences and columns represent words in the poem ordered alphabetically. Count Vectorizer returns a sparse matrix that is converted to a data frame for the ease of visualization.
The above is calculated by the following code.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(corpus)
terms = count_vect.get_feature_names()
Let us find out term frequency for the words potential and wings.
- Term Frequency for the word potential in Doc 0= 1
- Term Frequency for the word wings in Doc 4= 1
- Term Frequency for the word wings in Doc 6 = 1
However, if we just take the term frequency as a measure of importance, we fall into one major pitfall as represented by the tweet below from Terrible Maps representing the most frequent word in each state in the United States.
We need a mechanism to tone down the relative importance of the words that occur all too frequently in all the documents. Enter the Inverse Document Frequency. Intuitively, if a word appears in all documents, then it may not play such a big part in differentiating between the documents.
Similar to Term Frequency,
Document Frequency(term t) = number of documents with the term t/ total number of documents = d(t)/n
Inverse Document Frequency = total number of documents / number of documents with the term t = n / d(t)
If a word appears in all the documents, we want it at the bottom of the range of 0–1. So, a logarithmic scale intuitively makes sense to be used here as log 1 is 0. However, there are some practical considerations such as avoiding the infamous divide by 0 error, 1 is added to the denominator.
Inverse Document frequency for the default settings in TF IDF vectorizer in sklearn is calculated as below (default settings have
smooth_idf=True that adds “1” to the numerator and denominator as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions).
- n is the total number of documents in the document set.
- d(t) is the number of documents in the document set that contain term.
Let us identify the individual values for the words — potential and wings.
- Number of Documents = 8
- Number of documents in the corpus that contain the word potential = 1
- Number of documents in the corpus that contain the word wings = 2
Applying the formula for Inverse Document Frequency, we get
IDF values for the poem can be obtained by running the following Python code.
# explore idf
# idf_ attribute can be used to extract IDF values
# transpose the 1D IDF array to convert to a dataframe to make it easy to visualise
df_idf = idf2df(vectorizer.idf_[:,np.newaxis].T ,terms)
We obtain the values as shown below and we can cross-check the values from our calculations for the words potential and wings. Note that there is only one IDF value for a word in the corpus.
As the name implies TF-IDF is a combination of Term Frequency(TF) and Inverse Document Frequency(IDF), obtained by multiplying the 2 values together. The sklearn implementation then applies normalization on the product between TF and IDF. Let us look at each of those steps in detail.
Step 3 a: Multiply TF and IDF
In multiplying the 2 matrices together, we take an element-wise multiplication of Term Frequency Matrix and Inverse Document Frequency. Consider the first sentence — “You were born with potential”. To find the product of TF and IDF for this sentence, it is calculated as below.
This can be done by the following code for the entire data frame:
df_mul = df_count.mul(df_idf.to_numpy())
Applying it to the corpus, we get the following matrix:
You may notice that the product of TF and IDF can be above 1. Now, the last step is to normalize these values so that TF-IDF values always scale between 0 and 1.
Step 3 b: Normalize the product of TF and IDF
In day to day life, when we want to normalize the values so that we can compare them easily, we use percentages or proportions. So, we could hypothetically calculate a proportion of TF-IDF values across various words in a sentence. Please note that both TF and IDF are non-negative values as the lowest value possible for Term Frequency and Inverse Document Frequency is 0. So, taking out proportions would be equivalent to what is known as L1 normalization. In L1 normalization each element in a vector(think various TF-IDF values of a sentence) is divided by sum of absolute values all elements. There is an option to L1 normalize the values in sklearn, but that is not the default setting.
In sklearn, the default normalization applied is L2 normalization. Easiest way to think about L2 normalization is to think about the length of a line or Pythagoras theorem with one of the corners of the triangle at the origin.
In the diagram above, the length of the line is 5. In this case, the line is a 1D vector. When vectors are n-dimensional, the length of the vector is similar to the length of a line but extended to n dimensions. So, if a vector v is composed of n-elements, the length of the vector is calculated as
In L2 normalization, we are essentially dividing the vector by the length of the vector. For a more mathematical explanation of L1 and L2 norm, please refer to Wikipedia.
To apply L2 norm, for each of the sentences we need to calculate the square root of the sum of squares of the product of TF and IDF.
It can be done in Python as below:
from sklearn.preprocessing import Normalizer
df_mul.iloc[:,:] = Normalizer(norm='l2').fit_transform(df_mul)
Here are the values obtained:
Finally, we are ready to calculate the final TF-IDF scores!
TF-IDF for the word potential in you were born with potential (Doc 0): 2.504077 / 3. 66856427 = 0.682895
TF-IDF for the word wings in you were born with wings (Doc 4) = 2.098612/ 3. 402882126 = 0.616716
TF-IDF for the word wings in you have wings (Doc 6) = 2.098612/ 3. 452116387 = 0.607744
This can be calculated with the following code.
from sklearn.preprocessing import Normalizer
df_mul.iloc[:,:] = Normalizer(norm=’l2').fit_transform(df_mul)display(HTML(df_mul.to_html()))
When you run the above code, you get the results as below, which is same as the matrix in the Looking Ahead to the Output section above.
So, far we have done the following
The main limitation of TF IDF is that word order which is an important part of understanding the meaning of a sentence is not considered in TF-IDF.
Also, document length can introduce a lot of variance in the TF IDF values.
The following are the assumptions and conventions used in this post.
- Rounding — For the ease of visualization, decimals are rounded to 4 decimal places. I have taken care not to round the numbers that are called out as examples. Any rounding changes you may spot in various cells are because of that.
- As mentioned in the sklearn documentation, there is a slight difference between most text-book formula for IDF and the implementation in sklearn.
- For simplicity, I have used default settings for sklearn TF IDF vectorizer. There are ways to alter this such as a) Use L1 normalization instead of L2 normalization & b) set smooth_idf=False which does not add “1” to the numerator and denominator.
- Stop words are not removed in the example above. For various use cases, relevant stop words can be removed.
- I have used such a small corpus only for illustration purposes. In real-world use cases, the corpus, vocabulary and matrices shown in the illustration are much larger.
Jupyter notebook used for calculating TF-IDF values is available in GitHub.
 scikit-learn documentation
Text Classification Framework, Google, 2018