tf-idf Model for Page Ranking - GeeksforGeeks

tf-idf stands for term frequency-inverse document frequency. The tf-idf weight is often used in information retrieval and text mining. Variations of the tf-idf weighting scheme are often used by search engines in scoring and ranking a document's relevance given a query. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus (dataset).

How to Compute: tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered more important. Typically, the tf-idf weight is composed of two terms:

  1. Normalized Term Frequency (tf)
  2. Inverse Document Frequency (idf)

Let us take 3 documents to show how this works.

Doc 1: Ben studies about computers in Computer Lab.

Doc 2: Steve teaches at Brown University.

Doc 3: Data Scientists work on large datasets.

Let's say we are doing a search on these documents with the following query: Data Scientists

The query is a free text query. It means a query in which the terms of the query are typed freeform into the search interface, without any connecting search operators.

Step 1: Computing the Term Frequency (tf)

Term frequency indicates the number of occurrences of a particular term t in document d. Therefore,

tf(t, d) = N(t, d)

where
tf(t, d) = term frequency for a term t in document d
N(t, d) = number of times a term t occurs in document d
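As a quick illustration (this snippet is not part of the original article), the raw counts N(t, d) can be obtained with Python's collections.Counter. Note that a plain whitespace split keeps "computers" and "Computer" as separate tokens; the tables below fold them into a single Computer entry, which would need an extra stemming/normalization step.

Python3

from collections import Counter

doc1 = "Ben studies about computers in Computer Lab."

# Lower-case and drop the trailing period before splitting into terms
terms = doc1.lower().replace(".", "").split()

# N(t, doc1) for every term t in the document
print(Counter(terms))
# Counter({'ben': 1, 'studies': 1, 'about': 1, 'computers': 1, 'in': 1, 'computer': 1, 'lab': 1})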

We can see that as a term appears more often in a document it becomes more important, which is logical. We can use a vector to represent the document in the bag of words model, since the ordering of terms is not important. There is an entry for each unique term in the document, with the value being its term frequency. Given below are the terms and their frequencies [N(t, d)] in each of the documents.

tf for document 1:

Doc 1   Ben   Studies   Computer   Lab
tf      1     1         2          1

Vector Space Representation for Doc 1: [1, 1, 2, 1]

tf for document 2:

Doc 2   Steve   teaches   Brown   University
tf      1       1         1       1

Vector Space Representation for Doc 2: [1, 1, 1, 1]

tf for document 3:

Doc 3   Data   Scientists   work   large   datasets
tf      1      1            1      1       1

Vector Space Representation for Doc 3: [1, 1, 1, 1, 1]

Thus, the representation of documents as vectors in a common vector space is known as the Vector Space Model, and it is fundamental to information retrieval. Since term frequency relies on raw occurrence counts, longer documents will be favored. To avoid this, we normalize the term frequency:

tf(t, d) = N(t, d) / ||D||

where
||D|| = total number of terms in the document

||D|| for each document:

Document   ||D||
Doc 1      7
Doc 2      5
Doc 3      6
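As a quick check (this snippet is illustrative and not from the original article), ||D|| is simply the length of the token list obtained by splitting each document on whitespace:

Python3

docs = ["Ben studies about computers in Computer Lab.",
        "Steve teaches at Brown University.",
        "Data Scientists work on large datasets."]

for i, doc in enumerate(docs, start=1):
    # ||D|| = total number of terms in the document
    print("Doc", i, "||D|| =", len(doc.lower().split()))
# Doc 1 ||D|| = 7
# Doc 2 ||D|| = 5
# Doc 3 ||D|| = 6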

Given below are the normalized term frequencies for all the documents, i.e. [N(t, d) / ||D||].

Normalized tf for document 1:

Doc 1           Ben     studies   Computer   Lab
Normalized tf   0.143   0.143     0.286      0.143

Vector Space Representation for Document 1: [0.143, 0.143, 0.286, 0.143]

Normalized tf for document 2:

Doc 2           Steve   teaches   Brown   University
Normalized tf   0.2     0.2       0.2     0.2

Vector Space Representation for Document 2: [0.2, 0.2, 0.2, 0.2]

Normalized tf for document 3:

Doc 3           Data    Scientists   work    large   datasets
Normalized tf   0.167   0.167        0.167   0.167   0.167

Vector Space Representation for Document 3: [0.167, 0.167, 0.167, 0.167, 0.167]

The function below in Python computes the normalized tf:

Python3

def termFrequency(term, doc):
    """
    Input: term: Term in the Document, doc: Document
    Return: Normalized tf: Number of times term occurs
            in document / Total number of terms in the document
    """
    # Splitting the document into individual terms
    terms_in_doc = doc.lower().split()

    # Number of times the term occurs in the document
    term_in_document = terms_in_doc.count(term.lower())

    # Total number of terms in the document
    len_of_document = float(len(terms_in_doc))

    # Normalized Term Frequency
    normalized_tf = term_in_document / len_of_document

    return normalized_tf
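For example (illustrative usage, not part of the original article), the function reproduces the normalized tf of the term Data in Document 3:

Python3

doc3 = "Data Scientists work on large datasets."

print(termFrequency("Data", doc3))   # 1 / 6 = 0.1666... (0.167 in the table above)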

Step 2: Compute the Inverse Document Frequency – idf

idf measures how important a term is. The main purpose of doing a search is to find the relevant documents matching the query. Since tf considers all terms equally important, we cannot use term frequencies alone to calculate the weight of a term in a document. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones. Logarithms help us solve this problem. First, find the document frequency of a term t by counting the number of documents containing the term:

df(t) = N(t)

where
df(t) = document frequency of a term t
N(t) = number of documents containing the term t

Term frequency is the occurrence count of a term in one particular document only, while document frequency is the number of different documents the term appears in, so it depends on the whole corpus. Now let's look at the definition of inverse document frequency. The idf of a term is the number of documents in the corpus divided by the document frequency of the term.

idf(t) = N / df(t) = N / N(t)

We expect more frequent terms to be considered less important, but the raw ratio N / df(t) seems too harsh a factor. Therefore, we take the logarithm (with base 2) of the inverse document frequency. So, the idf of a term t becomes:

idf(t) = log(N / df(t))

This is better, and since log is a monotonically increasing function we can safely use it. Let’s compute IDF for the term Computer:

idf(computer) = log(Total Number Of Documents / Number Of Documents with term Computer in it)

There are 3 documents in all: Document1, Document2, Document3.

The term Computer appears in Document1 only.

idf(computer) = log(3 / 1) = 1.5849

Given below are the idf values for all the terms occurring in the documents, where N(t) is the number of documents in which the term appears:

Term         N(t)   idf = log(N/N(t))
Ben          1      log(3/1) = 1.5849
Studies      1      log(3/1) = 1.5849
Computer     1      log(3/1) = 1.5849
Lab          1      log(3/1) = 1.5849
Steve        1      log(3/1) = 1.5849
Teaches      1      log(3/1) = 1.5849
Brown        1      log(3/1) = 1.5849
University   1      log(3/1) = 1.5849
Data         1      log(3/1) = 1.5849
Scientists   1      log(3/1) = 1.5849
Work         1      log(3/1) = 1.5849
Large        1      log(3/1) = 1.5849
Datasets     1      log(3/1) = 1.5849

Given below is the function in Python to calculate idf:

Python3

from math import log2


def inverseDocumentFrequency(term, allDocs):
    """
    Input: term: Term in the Document,
           allDocs: List of all documents
    Return: Inverse Document Frequency (idf) for term
            = Logarithm ((Total Number of Documents) /
              (Number of documents containing the term))
    """
    num_docs_with_given_term = 0

    # Iterate through all the documents and count the
    # documents in which the term appears
    for doc in allDocs:
        if term.lower() in doc.lower().split():
            num_docs_with_given_term += 1

    if num_docs_with_given_term > 0:
        # Total number of documents
        total_num_docs = len(allDocs)

        # Calculating the idf; log base 2 matches the worked
        # example above (log(3/1) = 1.5849)
        idf_val = log2(float(total_num_docs) / num_docs_with_given_term)
        return idf_val
    else:
        return 0
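For example (illustrative usage only), with the three documents above:

Python3

docs = ["Ben studies about computers in Computer Lab.",
        "Steve teaches at Brown University.",
        "Data Scientists work on large datasets."]

print(inverseDocumentFrequency("Data", docs))   # log2(3 / 1) = 1.5849...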

Step 3: tf-idf Scoring

Now that we have defined both tf and idf, we can combine them to produce the final score of a term t in document d. Therefore,

tf-idf(t, d) = tf(t, d) * idf(t)

For each term in the query, multiply its normalized term frequency by its idf for each document. In Document 3, for the term Data, the normalized term frequency is 0.167 and its idf is 1.5849. Multiplying them together we get 0.2646. Given below are the tf * idf calculations for Data and Scientists in all the documents.

Term         Doc 1   Doc 2   Doc 3
Data         0       0       0.2646
Scientists   0       0       0.2646
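Putting the two helper functions together, this short sketch (assuming the termFrequency and inverseDocumentFrequency functions defined above) reproduces the scores in the table up to rounding:

Python3

docs = ["Ben studies about computers in Computer Lab.",
        "Steve teaches at Brown University.",
        "Data Scientists work on large datasets."]
query = "Data Scientists"

for term in query.split():
    idf = inverseDocumentFrequency(term, docs)
    for i, doc in enumerate(docs, start=1):
        # tf-idf(t, d) = tf(t, d) * idf(t)
        print(term, "Doc", i, round(termFrequency(term, doc) * idf, 4))
# Data       -> Doc 1: 0.0, Doc 2: 0.0, Doc 3: 0.2642 (the table above rounds to 0.2646)
# Scientists -> Doc 1: 0.0, Doc 2: 0.0, Doc 3: 0.2642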

We can then use a similarity measure (e.g., the cosine similarity method) to find the similarity between the query and each document. With cosine similarity, the smaller the angle between the two vectors, the higher the similarity. Using the formula given below we can find the similarity between any two documents, say d1 and d2.

Cosine Similarity (d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)

Dot product(d1, d2) = d1[0] * d2[0] + d1[1] * d2[1] + ... + d1[n] * d2[n]
||d1|| = square root(d1[0]^2 + d1[1]^2 + ... + d1[n]^2)
||d2|| = square root(d2[0]^2 + d2[1]^2 + ... + d2[n]^2)
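A minimal cosine-similarity helper is sketched below for illustration (the function name and vector arguments are our own, not from the original article); it implements the formula above for two equal-length vectors d1 and d2:

Python3

from math import sqrt

def cosine_similarity(d1, d2):
    # Dot product of the two vectors
    dot = sum(a * b for a, b in zip(d1, d2))

    # Euclidean norms ||d1|| and ||d2||
    norm_d1 = sqrt(sum(a * a for a in d1))
    norm_d2 = sqrt(sum(b * b for b in d2))

    if norm_d1 == 0 or norm_d2 == 0:
        return 0.0
    return dot / (norm_d1 * norm_d2)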
