Understanding TF-IDF in NLP.
TF-IDF, short for Term Frequency–Inverse Document Frequency, is a numerical statistic intended to reflect how important a word is to a document within a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modelling. The TF-IDF value increases proportionally with the number of times a word appears in the document, and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.
TF-IDF is generally preferred over Bag-of-Words, in which every word is represented as a 1 or 0 depending on whether it appears in a sentence. TF-IDF instead gives each word its own weight, which captures how important each word is relative to the others.
Let's understand these two terms separately:
- TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document), and,
- IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing: IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
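The two formulas above translate directly into code. Here is a minimal sketch (function names `tf` and `idf` are my own; each "document" is just a short string, as in the example below):

```python
import math

def tf(term, document):
    # TF: occurrences of the term divided by the total number of terms.
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

def idf(term, documents):
    # IDF: log of (total documents / number of documents containing the term).
    containing = sum(1 for doc in documents if term.lower() in doc.lower().split())
    return math.log(len(documents) / containing)

docs = ["Good Boy", "Good Girl", "Good Boy Girl"]
print(tf("Good", docs[0]))          # 0.5
print(round(idf("Good", docs), 3))  # 0.0   (log(3/3))
print(round(idf("Boy", docs), 3))   # 0.405 (log(3/2))
```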
Now, let's jump into an example. Consider these three sentences:
1. He is a Good Boy
2. She is a Good Girl
3. Both are Good Boy and Girl, respectively
After applying regular expressions, stop-word removal, and other functions from the NLTK library, we get a cleaned-up version of these three sentences:
1. Good Boy
2. Good Girl
3. Good Boy Girl
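This clean-up step can be sketched as follows. For brevity, the sketch uses a small hand-written stop-word set in place of NLTK's full `stopwords.words("english")` corpus (which would first require `nltk.download("stopwords")`):

```python
import re

# Tiny stand-in for NLTK's English stop-word list.
STOP_WORDS = {"he", "she", "is", "are", "a", "and", "both", "respectively"}

sentences = [
    "He is a Good Boy",
    "She is a Good Girl",
    "Both are Good Boy and Girl, respectively",
]

cleaned = []
for sentence in sentences:
    # Keep letters only, lowercase everything, then drop the stop-words.
    words = re.sub(r"[^A-Za-z]", " ", sentence).lower().split()
    cleaned.append(" ".join(w for w in words if w not in STOP_WORDS))

print(cleaned)  # ['good boy', 'good girl', 'good boy girl']
```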
Now, let's consider the TF (Term Frequency) computation. Take the word "Good" in Sentence 1. As we know,
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
So, the word "Good" appears once in Sentence 1, and Sentence 1 ("Good Boy") contains 2 terms in total, so the TF value of "Good" in Sentence 1 is TF("Good") = 1/2 = 0.5.
Now, let's consider the TF value of each word with respect to each sentence, in a tabular format:

| Word | Sentence 1 | Sentence 2 | Sentence 3 |
|------|------------|------------|------------|
| Good | 1/2 = 0.5  | 1/2 = 0.5  | 1/3 ≈ 0.333 |
| Boy  | 1/2 = 0.5  | 0          | 1/3 ≈ 0.333 |
| Girl | 0          | 1/2 = 0.5  | 1/3 ≈ 0.333 |

So, we can see the TF value of each word with respect to each sentence.
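These per-sentence TF values can be computed in a couple of lines, using the cleaned sentences from above as word lists:

```python
docs = [["good", "boy"], ["good", "girl"], ["good", "boy", "girl"]]
vocab = ["good", "boy", "girl"]

# TF of each word in each sentence: count in sentence / sentence length.
tf_table = [{w: doc.count(w) / len(doc) for w in vocab} for doc in docs]

for i, row in enumerate(tf_table, start=1):
    print(f"Sentence {i}:", row)
# Sentence 1: {'good': 0.5, 'boy': 0.5, 'girl': 0.0}
```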
Let's consider the second part of TF-IDF, that is, the IDF (Inverse Document Frequency) of each word.
As we know,
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
Again, let's consider the word "Good". We know that the total number of sentences (documents) is 3, and the word "Good" occurs in all 3 sentences, so the number of documents containing the term "Good" is 3. So, the IDF value of the word "Good" is log(3/3) = 0. Computing the IDF value of each word in the same way gives the following table:

| Word | IDF |
|------|-----|
| Good | log(3/3) = 0 |
| Boy  | log(3/2) ≈ 0.405 |
| Girl | log(3/2) ≈ 0.405 |
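The same IDF values can be sketched in code, reusing the cleaned word lists from the example:

```python
import math

docs = [["good", "boy"], ["good", "girl"], ["good", "boy", "girl"]]
vocab = ["good", "boy", "girl"]

# IDF: log(total documents / number of documents containing the word).
idf_table = {
    w: math.log(len(docs) / sum(1 for doc in docs if w in doc))
    for w in vocab
}
print({w: round(v, 3) for w, v in idf_table.items()})
# {'good': 0.0, 'boy': 0.405, 'girl': 0.405}
```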
Now we have both the TF (Term Frequency) and the IDF (Inverse Document Frequency) values for each word in each sentence.
So, finally, the TF-IDF value for each word is TF(value) × IDF(value).
Let's present the TF-IDF value of each word in tabular form:

| Word | Sentence 1 | Sentence 2 | Sentence 3 |
|------|------------|------------|------------|
| Good | 0 | 0 | 0 |
| Boy  | 0.5 × 0.405 ≈ 0.203 | 0 | 0.333 × 0.405 ≈ 0.135 |
| Girl | 0 | 0.5 × 0.405 ≈ 0.203 | 0.333 × 0.405 ≈ 0.135 |
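Putting the two steps together, the full TF-IDF computation for the example is just the element-wise product of the TF and IDF values:

```python
import math

docs = [["good", "boy"], ["good", "girl"], ["good", "boy", "girl"]]
vocab = ["good", "boy", "girl"]

# IDF per word, then TF-IDF = TF * IDF for every word in every sentence.
idf = {w: math.log(len(docs) / sum(1 for d in docs if w in d)) for w in vocab}
tfidf = [{w: (d.count(w) / len(d)) * idf[w] for w in vocab} for d in docs]

for i, row in enumerate(tfidf, start=1):
    print(f"Sentence {i}:", {w: round(v, 3) for w, v in row.items()})
# Sentence 1: {'good': 0.0, 'boy': 0.203, 'girl': 0.0}
```

In practice, libraries such as scikit-learn's `TfidfVectorizer` implement this same idea, though with a smoothed IDF formula and vector normalization by default, so their numbers will not match this hand computation exactly.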
In conclusion, the word "Good" appears in all 3 sentences, so its TF-IDF value is zero everywhere, while the word "Boy" appears in only 2 of the 3 sentences. As a result, in Sentence 1 the value (importance) of the word "Boy" is higher than that of the word "Good".
So TF-IDF gives a specific value, or importance, to each word in a paragraph, and terms with higher weights are considered more important. This is why TF-IDF has largely replaced Bag-of-Words, whose major disadvantage is that it assigns the same value to every word that occurs in a sentence.
TF-IDF was invented for document search and can be used to deliver results that are most relevant to a query. Imagine you have a search engine and somebody searches for "Dog". The results will be displayed in order of relevance: the most relevant dog-related articles will be ranked higher, because TF-IDF gives the word "Dog" a higher score in them.
Well, this is how TF-IDF (Term Frequency–Inverse Document Frequency) works!
If you liked this blog, do leave a "Clap", and share it with your friends and colleagues.