In this article we will see how to calculate the similarity between two documents or two strings using Euclidean distance (ED) or cosine similarity (CS). Measuring similarity is a building block of clustering, which is in turn a part of data mining.
What is Clustering in Data Mining?
Clustering is a technique in which you group similar things together into clusters on the basis of some of their properties. In machine learning and data mining, we often deal with raw or untagged data that we need to sort, and clustering techniques let us sort or tag it.
Euclidean distance and cosine similarity are two very simple but popular measures used in distance-based clustering.
Distance and similarity move in opposite directions: if two things are highly similar, the distance between them is small. In other words, if two things are far apart, they are not very similar.
We are going to take some tweets as an example and learn to use Euclidean distance and cosine similarity to calculate document similarity. The overall procedure is:
- Tokenize every document that exists in the corpus. Tokenization means breaking a sentence or document into words on the basis of spaces and punctuation marks.
- Using the result of the above step, form a WordList that contains all the terms that exist in the corpus. Remember that every word in a document is a token, while a term is a unique word: in the collection of terms, each word appears only once.
- Form a term-document incidence matrix. It is a two-dimensional matrix that shows whether a document contains a term or not. Technically, each row of this matrix represents a document vector.
- Apply the distance formula to the document vectors to calculate their document distance.
- Store the calculated similarity and distance values in a distance matrix.
- Print that matrix.
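The distance formula in step 4 depends on the measure chosen. For Euclidean distance, the standard definition for two document vectors P and Q can be written as below. The symbols are labels assumed here for illustration: W is the number of terms in the WordList, and P_i is the entry of P for the i-th term.

```latex
d(P, Q) = \sqrt{\sum_{i=1}^{W} (P_i - Q_i)^2}
```

With 0/1 incidence vectors, this is the square root of the number of terms that appear in exactly one of the two documents.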
Suppose we have a corpus C which has 5 documents:
We need to tokenize every document di ∈ C. Since the documents contain no punctuation marks, we simply split each di on spaces.
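A minimal Python sketch of this tokenization step might look as follows. The three tweets are hypothetical stand-ins and are not the article's corpus, and the function name `tokenize` is chosen here for illustration.

```python
# Hypothetical example tweets; these are NOT the article's corpus.
corpus = [
    "the game was great",
    "the movie was boring",
    "great game great crowd",
]


def tokenize(document):
    """Split a document into tokens on whitespace (no punctuation handling)."""
    return document.split()


# One list of tokens per document, in corpus order.
tokens_per_doc = [tokenize(document) for document in corpus]

for number, tokens in enumerate(tokens_per_doc, start=1):
    print(f"d{number}: {tokens}")
```

Splitting on whitespace alone is enough only because these documents contain no punctuation; real tweets would usually need extra cleaning first.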
From the tokenized data above, form a WordList that contains only terms, i.e. unique words:
We can assign indexes to these terms as follows:
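As a rough sketch, the WordList and its indexes could be built in Python like this. The token lists are hypothetical, and `build_word_list` and `term_index` are names assumed for this example.

```python
# Hypothetical tokenized documents; these are NOT the article's corpus.
tokens_per_doc = [
    ["the", "game", "was", "great"],
    ["the", "movie", "was", "boring"],
    ["great", "game", "great", "crowd"],
]


def build_word_list(tokenized_docs):
    """Return the unique terms, in order of first appearance."""
    seen = set()
    word_list = []
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                word_list.append(token)
    return word_list


word_list = build_word_list(tokens_per_doc)

# Map each term to its position in the WordList.
term_index = {term: position for position, term in enumerate(word_list)}

print(word_list)
print(term_index)
```

Keeping the order of first appearance makes the indexes reproducible from run to run, which a plain `set` would not guarantee.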
The term incidence matrix is a two-dimensional array that shows the documents and the terms that occur in them.
Consider each row of the following matrix as a document and each column as a term. The term incidence matrix has dimension D × W, where D is the number of documents and W is the number of terms in the WordList. We put 1 against a term if it exists in the document, otherwise 0.
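One way to build such a D × W matrix in Python is sketched below. The WordList and documents are hypothetical examples rather than the article's data, and `incidence_matrix` is a name assumed here.

```python
# Hypothetical WordList and tokenized documents; NOT the article's corpus.
word_list = ["the", "game", "was", "great", "movie", "boring", "crowd"]
tokens_per_doc = [
    ["the", "game", "was", "great"],
    ["the", "movie", "was", "boring"],
    ["great", "game", "great", "crowd"],
]


def incidence_matrix(tokenized_docs, word_list):
    """Build a D x W matrix: 1 if the term occurs in the document, else 0."""
    matrix = []
    for tokens in tokenized_docs:
        present = set(tokens)  # repeated words count only once
        matrix.append([1 if term in present else 0 for term in word_list])
    return matrix


matrix = incidence_matrix(tokens_per_doc, word_list)

for number, row in enumerate(matrix, start=1):
    print(f"d{number}: {row}")
```

Each printed row is the document vector for one document, with one column per term in the WordList.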
Using Cosine Similarity For Document Similarity
If we want to use cosine similarity instead of Euclidean distance, then its formula is as follows:
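In its standard form, with W the number of terms in the WordList, cosine similarity is the dot product of the two document vectors divided by the product of their lengths:

```latex
\cos(P, Q) = \frac{\sum_{i=1}^{W} P_i Q_i}{\sqrt{\sum_{i=1}^{W} P_i^2}\,\sqrt{\sum_{i=1}^{W} Q_i^2}}
```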
Here P and Q are the two documents being compared, and Pi is the value of the ith term in the matrix row for document P. So, if we want to calculate the cosine similarity of documents d2 and d3, then:
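As a sketch of this kind of calculation in Python, the function below applies the cosine similarity formula to two 0/1 document vectors. The vectors `p` and `q` are hypothetical and are not the article's d2 and d3; returning 0.0 when a vector is all zeros is a convention assumed here.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical document vectors (NOT the article's d2 and d3).
p = [1, 1, 1, 1, 0, 0, 0]
q = [0, 1, 0, 1, 0, 0, 1]

similarity = cosine_similarity(p, q)
print(round(similarity, 4))  # 0.5774
```

The two vectors share two terms, so the dot product is 2; the lengths are 2 and √3, which gives 2 / (2√3) ≈ 0.5774.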
You have to calculate the distance of every document from every other document, one pair at a time. This matrix is merely an example with dummy values.
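The pairwise calculation can be sketched in Python as below, here using Euclidean distance. The document vectors are hypothetical, so the printed values are not the article's dummy values, and the function names are assumptions made for this example.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(vectors):
    """Return a D x D matrix of distances between every pair of vectors."""
    size = len(vectors)
    return [
        [euclidean_distance(vectors[i], vectors[j]) for j in range(size)]
        for i in range(size)
    ]


# Hypothetical document vectors (NOT the article's corpus).
vectors = [
    [1, 1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 1],
]

distances = distance_matrix(vectors)

for row in distances:
    print("  ".join(f"{value:.2f}" for value in row))
```

The result is symmetric with zeros on the diagonal, so only the pairs above the diagonal actually need to be computed; the sketch computes all of them for simplicity.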
Now this matrix can be used to conclude a number of things. If the distance between two documents is very small, we can say that they are very similar.
Similarly, if the distance between two documents is large, we can say that they are not very similar.
Euclidean distance and cosine similarity can be used to measure how far apart or how alike two documents are. These techniques are generally used for clustering when you have unordered or untagged data.
These two techniques can be used in a number of ways and appear in many machine learning and AI topics. Some applications are:
- E-commerce recommendation algorithms based on product similarity.
- Sentiment analysis using the similarity between two positive or negative texts.
- Diagnosing a disease from given parameters by comparing them with previous symptoms for similarity.
I want to acknowledge my teacher, Dr. Muhammad Yaseen Khan, who taught me these two techniques during my Data Structures and Algorithms course. This article was originally his document, and I am sharing it on the internet so that others can learn as well.