Last Updated on March 30, 2021 by Jay
When we deal with data from different sources, we often encounter one problem – some strings could be spelled differently, but they have the same meaning. We humans can identify the meanings right away, but machines cannot. This short tutorial will cover how to find similar strings using Python.
Machines don’t really understand the human language…
For example, we have two sentences:
>>> s1 = "The leaf is green, and the pumpkin is orange, and the apple is red, and the banana is yellow."
>>> s2 = "the leaf is green, banana yellow, apple red, pumpkin orange"
We humans can tell that s1
and s2
have the same meaning. But in the eyes of a machine, they are two different sentences.
>>> s1 == s2
False
So how can we let the machine recognize that these two sentences mean the same thing? Applying machine learning technique is one way, but today we’ll show something much easier to use.
Introducing The Levenshtein Distance
It turns out that there’s a formula called Levenshtein distance that measures the minimum of single-character edits required to change one string into the other string. This distance is named after the Soviet mathematician Vladimir Levenshtein. In this tutorial, we will leave out all the theoretical details, but if you are interested, Wikipedia is a good starting point.
Fuzzy String Matching In Python
The appropriate terminology for finding similar strings is called a fuzzy string matching. We are going to use a library called fuzzywuzzy. Although it has a funny name, it a very popular library for fuzzy string matching. The fuzzywuzzy library can calculate the Levenshtein distance, and it has a few other powerful functions to help us with fuzzy string matching.
Let’s start from something easy, like comparing two words. Then we’ll move on to more complicated scenarios, such as comparing multiple sentences. Here we go!
Let’s first install the library. If you are new to this blog and need help with installing Python and libraries, read this tutorial here.
pip install fuzzywuzzy
Single Word Example
Starting with the simplest example. What do you think the computer views these two words: “Bezos” and “bezos”? Of course, they are not the same. Because of the capital letter “B” does not equal to the lower case “b”!
>>> "Bezos" == "bezos"
False
It’s rather easy to match these two words. we can simply make both words all lower cases (or upper cases), then compare again. We can use the String lower()
/ upper()
methods directly on any given string data.
>>> "Bezos".lower() == "bezos".lower()
True
>>> "Bezos".upper() == "bezos".upper()
True
Multiple Words Example
You might already know that “Jeff” is the short name for “Jeffery”, but apparently machines don’t know that yet.
>>> n1 = "Jeff Bezos"
>>> n2 = "Jeffery Bezos"
>>> n1 == n2
False
Now, using the fuzzy string matching technique, we get a number of 87. The ratio()
method calculates a Levenshtein distance ratio. Personally, I interpret these ratios as “probability of similarity”. The higher this ratio, the more similar the two words are.
from fuzzywuzzy import fuzz
>>> fuzz.ratio(n1, n2)
87
Let’s now introduce another variable n3 = “Bezos”, and then calculate the Levenshtein distance ratio with both “Jeff Bezos” and “Jeffery Bezos”.
The matching results are very poor. This is because by definition of the Levenshtein distance, it takes many edits to change “Bezos” to either “Jeff Bezos” or “Jeffery Bezos”.
n3 = "Bezos"
>>> fuzz.ratio(n1, n3)
67
>>> fuzz.ratio(n2, n3)
56
We can use fuzzywuzzy’s partial_ratio()
method to help in this situation. The partial_ratio()
method tries to match substrings within each sample string. In this case, because “Bezos” exist in both n1 or n2, we get a match value of 100 in both cases.
>>> fuzz.partial_ratio(n1, n3)
100
>>> fuzz.partial_ratio(n2, n3)
100
Sentence Example
Let’s take it a step further and compare two sentences.
>>> s3 = "The leaf is green, and the pumpkin is orange."
>>> s4 = "THE PUMPKIN IS ORANGE, AND THE LEAF IS GREEN"
>>> fuzz.ratio(s3.lower(), s4.lower())
47
>>> fuzz.partial_ratio(s3.lower(),s4.lower())
64
Note one sentence contains both upper and lower case letters, and the other sentence contains only upper case letters. We need to help the ratio
function a little bit by converting both sentences into lower case letters.
The performance is still very poor with the ratio()
methods. Remember, the higher the ratio, the closer the match. The partial_ratio()
method doesn’t help too much in this case either. This is because the two sentences have reversed order, and neither string is a subset of the other.
Do not worry, because the fuzzywuzzy library provides other powerful functions for string comparison!
>>> fuzz.token_sort_ratio(s3, s4)
100
To see why the token_sort_ratio()
function gives a perfect match, we need to understand what the function does, which is the following things:
- “Tokenizes” the sentences. Meaning it breaks up sentences into individual words
- Converts all words into lower cases
- Removes punctuations
- Sorts the string tokens (i.e. individual words) alphabetically
- Joins the individual words then compare the newly formed strings
It’s clear that by the end of the above process, the two strings s3
and s4
are essentially the same, no wonder we got a perfect match!
Sentence Example – Continued
Now let’s review our original question at the beginning of this tutorial. We’ll first try the token_sort_ratio
function.
>>> s1 = "The leaf is green, and the pumpkin is orange, and the apple is red, and the banana is yellow."
>>> s2 ="the leaf is green, banana yellow, apple red, pumpkin orange"
>>> fuzz.token_sort_ratio(s3, s4)
77
This time, the token_sort_ratio()
gives a ratio of 77. The reason is that the s1
sentence contains a lot more duplicated words than the s2
sentence, such as “and”, “the”, and “is”. Those words appeared several times in the s1
sentence, but only once in the s2
sentence. However, those duplicated words do not provide additional meaning to the first sentence, we need to help the machine filter them.
fuzzywuzzy provides another useful function to solve this problem: token_set_ratio()
.
This function is very similar to token_sort_ratio()
. The only difference between the two token functions is that instead of sorting the individual tokens, it applies a set()
to the list of individual words. So what does a set() do in Python? The following example explains it well – a set
can contain only unique items. This feature makes set
a convenient way to remove duplicated items from an iterable.
>>> li = [1,2,2,3,3,3,4,4,4,4]
>>> set(li)
{1, 2, 3, 4}
Let’s try it on the two strings s1
and s2
. Again, a beautifully perfect match! This is because the token_set_ratio()
has remove all duplicated words, and what’s left in the two strings matched pretty well.
>>> fuzz.token_set_ratio(s1, s2)
100
What’s next?
Now we are equipped with the knowledge of fuzzy string matching in Python. In the next tutorial, we’ll walk through some examples how to use fuzzy string matching together with pandas and Excel.
This article is exactly the help I needed with a recent python/excel obstacle, thank you for writing this! Looking forward to the next one.
Hi Some Guy on the Internet 🙂
Thank you! Comments like this keeps me motivated to write more stuff!
The next article will cover how to integrate fuzzy string matching with pandas. It’s already in the works and I just need to polish it more before publishing. Please stay tuned!
Jay