Last Updated on May 29, 2023 by Jay
This tutorial will show you what cosine similarity is and how to calculate it in Python.
What is cosine similarity?
In Natural Language Processing (NLP), cosine similarity is a measure used to determine how similar two documents or texts are, even when their lengths might differ. Imagine each unique word within the documents as a dimension in a high-dimensional space. Each document is then represented as a vector in this space, where the direction of the vector is determined by the words used in the document and the magnitude of the vector by the frequency of the words. Cosine similarity measures the cosine of the angle between these two vectors. If the vectors are very close in orientation (small angle), the cosine of the angle will be close to 1, indicating high similarity. If the vectors are orthogonal (90 degrees apart), they have no common words, so the cosine is 0, indicating no similarity. This concept is central to many applications, such as information retrieval, text mining, and document clustering.
That was a high-level description of cosine similarity. It could be confusing for someone new to the field, but it’s an important definition to know. We’re going to walk through a very simple example to demonstrate the concept.
2 words example
Consider the following chart:
The x-axis represents how many times the word “Hello” shows up in a sentence, and the y-axis represents how many times the word “World” shows up.
We’ll plot the following two sentences on the chart.
Each word appears once in the first sentence, “Hello World”, so we can draw a line from point (0,0) to point (1,1). This (1,1) is called a vector, a simple one, of course. See the purple arrow in the below graph.
Similarly, for the second sentence “Hello”, we can draw a vector (1,0). The 0 on the y-axis is because this sentence doesn’t contain the word “World” in it. See the green arrow in the below graph.
We know the angle between the purple and green vectors is 45°. The cosine of 45° is approximately 0.7071, as we know from high school trigonometry class.
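We can verify this number with a quick NumPy sketch (my own illustration, not part of the original tutorial):

```python
import numpy as np

hello_world = np.array([1, 1])  # "Hello World": one "Hello", one "World"
hello = np.array([1, 0])        # "Hello": one "Hello", no "World"

# cosine of the angle = dot product divided by the product of the vector lengths
cos_angle = hello_world @ hello / (np.linalg.norm(hello_world) * np.linalg.norm(hello))
print(round(cos_angle, 4))  # 0.7071
```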
Possible cosine values
Since we are plotting word counts here, we can never go into negative territory. The angle between the two vectors will be between 0° and 90°, which means the cosine values will range from 0 to 1. A value of 0 means no similarity, while a value of 1 means the two texts are highly similar.
Let’s next look at another two sentences:
The plot shows the two vectors are orthogonal, i.e. at a right angle, or 90°.
Cos(90°) gives 0, which means the two sentences are not similar. This is no surprise, given that the two sentences contain entirely different words.
What happens if we compare the same two sentences? We should expect them to be very similar, with a cosine similarity of 1. Let’s take a look.
When two sentences are the same, their plot will overlap. Meaning the angle between the two vectors will be 0°.
The cosine of 0° is 1, confirming they are highly similar.
More complex example
We are able to plot the previous examples because each sentence contains only two words. Each word takes up a dimension. If a sentence has 3 words, we’ll need a 3-D plot. If a sentence has 100 different words, we’d need a 100-dimensional plot, which is impossible to draw.
Linear algebra comes to the rescue!
We can use the following formula to calculate cosine similarity:

cos(θ) = (A · B) / (‖A‖ × ‖B‖) = (Σ AiBi) / (√(Σ Ai²) × √(Σ Bi²))

Both A and B are n-dimensional vectors. Ai and Bi are the ith elements of vectors A and B, respectively.
We can think of Ai and Bi as the word counts. Using the previous “Hello World” and “Hello” sentences as an example:

|Sentence|“Hello” count|“World” count|
|“Hello World”|1|1|
|“Hello”|1|0|
Plugging the above word counts into the formula, we get the same result as cos(45°): (1×1 + 1×0) / (√2 × √1) = 1/√2 ≈ 0.7071.
We can now use the above formula to solve the cosine similarity for any number of dimensions. Let’s consider the following two sentences:
“I love Python”
“Python is great and I love Python”
Let’s create the word count table:

|Sentence|“I”|“love”|“Python”|“is”|“great”|“and”|
|“I love Python”|1|1|1|0|0|0|
|“Python is great and I love Python”|1|1|2|1|1|1|
Plugging these counts into the formula gives a cosine similarity of 4 / (√3 × 3) ≈ 0.7698, indicating high similarity between the two sentences.
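We can check this value with a quick NumPy calculation (my own verification, not part of the original tutorial):

```python
import numpy as np

a = np.array([1, 1, 1, 0, 0, 0])  # "I love Python"
b = np.array([1, 1, 2, 1, 1, 1])  # "Python is great and I love Python"

# dot product = 4, norms = sqrt(3) and 3
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 4))  # 0.7698
```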
Cosine Similarity Python Implementation
As you can see, the cosine similarity calculation is not hard with some linear algebra. Still, we shouldn’t implement the calculation ourselves, because it has already been done in many places. I’m going to use the sklearn library to demonstrate how to calculate cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

s1 = 'hello world'
s2 = 'hello'

cv = CountVectorizer()
sparse_matrix = cv.fit_transform([s1, s2]).toarray()
df = pd.DataFrame(sparse_matrix, columns=cv.get_feature_names_out())
df
We use the CountVectorizer object to create the word count table. It will also fill in the count values for each word.
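The similarity itself then comes from sklearn’s cosine_similarity function applied to the count matrix. This is my reconstruction of the call the tutorial’s result refers to; the sentences match the earlier example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv = CountVectorizer()
counts = cv.fit_transform(['hello world', 'hello'])

# pairwise cosine similarity between the two sentence vectors
sim = cosine_similarity(counts)
print(round(sim[0, 1], 4))  # 0.7071
```

The result is a 2×2 matrix: the diagonal is each sentence compared with itself (always 1), and the off-diagonal entries hold the similarity between the two sentences.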
This gives a cosine similarity of 0.7071, which matches the cos(45°) we calculated previously.
Stop words are words like “and”, “is”, “I”, “him”, etc. which are presumed to be uninformative in representing the content of a text. In many cases, we can remove those stop words from our sentences without losing the meaning of the text.
To remove stop words, we can include the stop_words='english' argument in the CountVectorizer constructor.
cv = CountVectorizer(stop_words='english')

a1 = "I love Python"
a2 = "Python is great and I love Python"

sparse_matrix = cv.fit_transform([a1, a2])
df2 = pd.DataFrame(sparse_matrix.toarray(), columns=cv.get_feature_names_out())
df2
As we can see in the above word count table, most of the stop words were dropped, and we are left with only the words that provide useful information.
The cosine similarity calculated from the simplified table is actually higher than what we would have calculated using the full word count table. This is because the noise (the stop words) was removed, leaving only the more informative words, which the two sentences largely share.
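To check that claim, we can run cosine_similarity on the stop-word-filtered counts (a sketch of my own; the exact value depends on which words appear in sklearn’s built-in English stop-word list):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv = CountVectorizer(stop_words='english')
counts = cv.fit_transform(["I love Python", "Python is great and I love Python"])

# similarity between the two sentences after stop-word removal
sim = cosine_similarity(counts)[0, 1]
print(round(sim, 4))  # higher than the 0.7698 from the full word count table
```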
That’s an overview of what cosine similarity is and how to calculate it in Python!