# How to find similar strings using Python

Sharing is caring!

Last Updated on March 30, 2021 by Jay

When we deal with data from different sources, we often encounter one problem – some strings could be spelled differently, but they have the same meaning. We humans can identify the meanings right away, but machines cannot. This short tutorial will cover how to find similar strings using Python.

## Machines don’t really understand the human language…

For example, we have two sentences:

``````>>> s1 = "The leaf is green, and the pumpkin is orange, and the apple is red, and the banana is yellow."
>>> s2 = "the leaf is green, banana yellow, apple red, pumpkin orange"``````

We humans can tell that `s1` and `s2 `have the same meaning. But in the eyes of a machine, they are two different sentences.

``````>>> s1 == s2
False``````

So how can we let the machine recognize that these two sentences mean the same thing? Applying machine learning technique is one way, but today we’ll show something much easier to use.

## Introducing The Levenshtein Distance

It turns out that there’s a formula called Levenshtein distance that measures the minimum of single-character edits required to change one string into the other string. This distance is named after the Soviet mathematician Vladimir Levenshtein. In this tutorial, we will leave out all the theoretical details, but if you are interested, Wikipedia is a good starting point.

### Fuzzy String Matching In Python

The appropriate terminology for finding similar strings is called a fuzzy string matching. We are going to use a library called fuzzywuzzy. Although it has a funny name, it a very popular library for fuzzy string matching. The fuzzywuzzy library can calculate the Levenshtein distance, and it has a few other powerful functions to help us with fuzzy string matching.

Let’s start from something easy, like comparing two words. Then we’ll move on to more complicated scenarios, such as comparing multiple sentences. Here we go!

Let’s first install the library. If you are new to this blog and need help with installing Python and libraries, read this tutorial here.

``pip install fuzzywuzzy``

### Single Word Example

Starting with the simplest example. What do you think the computer views these two words: “Bezos” and “bezos”? Of course, they are not the same. Because of the capital letter “B” does not equal to the lower case “b”!

``````>>> "Bezos" == "bezos"
False``````

It’s rather easy to match these two words. we can simply make both words all lower cases (or upper cases), then compare again. We can use the String `lower()` / `upper()` methods directly on any given string data.

``````>>> "Bezos".lower() == "bezos".lower()
True

>>> "Bezos".upper() == "bezos".upper()
True``````

### Multiple Words Example

You might already know that “Jeff” is the short name for “Jeffery”, but apparently machines don’t know that yet.

``````>>> n1 = "Jeff Bezos"
>>> n2 = "Jeffery Bezos"
>>> n1 == n2
False``````

Now, using the fuzzy string matching technique, we get a number of 87. The `ratio()` method calculates a Levenshtein distance ratio. Personally, I interpret these ratios as “probability of similarity”. The higher this ratio, the more similar the two words are.

``````from fuzzywuzzy import fuzz

>>> fuzz.ratio(n1, n2)
87``````

Let’s now introduce another variable n3 = “Bezos”, and then calculate the Levenshtein distance ratio with both “Jeff Bezos” and “Jeffery Bezos”.

The matching results are very poor. This is because by definition of the Levenshtein distance, it takes many edits to change “Bezos” to either “Jeff Bezos” or “Jeffery Bezos”.

``````n3 = "Bezos"
>>> fuzz.ratio(n1, n3)
67
>>> fuzz.ratio(n2, n3)
56``````

We can use fuzzywuzzy’s `partial_ratio()` method to help in this situation. The `partial_ratio()` method tries to match substrings within each sample string. In this case, because “Bezos” exist in both n1 or n2, we get a match value of 100 in both cases.

``````>>> fuzz.partial_ratio(n1, n3)
100

>>> fuzz.partial_ratio(n2, n3)
100``````

### Sentence Example

Let’s take it a step further and compare two sentences.

``````>>> s3 = "The leaf is green, and the pumpkin is orange."
>>> s4 = "THE PUMPKIN IS ORANGE, AND THE LEAF IS GREEN"
>>> fuzz.ratio(s3.lower(), s4.lower())
47
>>> fuzz.partial_ratio(s3.lower(),s4.lower())
64``````

Note one sentence contains both upper and lower case letters, and the other sentence contains only upper case letters. We need to help the `ratio` function a little bit by converting both sentences into lower case letters.

The performance is still very poor with the `ratio()` methods. Remember, the higher the ratio, the closer the match. The `partial_ratio()` method doesn’t help too much in this case either. This is because the two sentences have reversed order, and neither string is a subset of the other.

Do not worry, because the fuzzywuzzy library provides other powerful functions for string comparison!

``````>>> fuzz.token_sort_ratio(s3, s4)
100``````

To see why the `token_sort_ratio()` function gives a perfect match, we need to understand what the function does, which is the following things:

• “Tokenizes” the sentences. Meaning it breaks up sentences into individual words
• Converts all words into lower cases
• Removes punctuations
• Sorts the string tokens (i.e. individual words) alphabetically
• Joins the individual words then compare the newly formed strings

It’s clear that by the end of the above process, the two strings `s3` and `s4` are essentially the same, no wonder we got a perfect match!

### Sentence Example – Continued

Now let’s review our original question at the beginning of this tutorial. We’ll first try the `token_sort_ratio` function.

``````>>> s1 = "The leaf is green, and the pumpkin is orange, and the apple is red, and the banana is yellow."
>>> s2 ="the leaf is green, banana yellow, apple red, pumpkin orange"
>>> fuzz.token_sort_ratio(s3, s4)
77``````

This time, the `token_sort_ratio()` gives a ratio of 77. The reason is that the `s1` sentence contains a lot more duplicated words than the `s2` sentence, such as “and”, “the”, and “is”. Those words appeared several times in the `s1` sentence, but only once in the `s2` sentence. However, those duplicated words do not provide additional meaning to the first sentence, we need to help the machine filter them.

fuzzywuzzy provides another useful function to solve this problem: `token_set_ratio()`.

This function is very similar to `token_sort_ratio()`. The only difference between the two token functions is that instead of sorting the individual tokens, it applies a `set()` to the list of individual words. So what does a set() do in Python? The following example explains it well – a `set` can contain only unique items. This feature makes `set` a convenient way to remove duplicated items from an iterable.

``````>>> li = [1,2,2,3,3,3,4,4,4,4]
>>> set(li)
{1, 2, 3, 4}``````

Let’s try it on the two strings `s1` and `s2`. Again, a beautifully perfect match! This is because the `token_set_ratio()` has remove all duplicated words, and what’s left in the two strings matched pretty well.

``````>>> fuzz.token_set_ratio(s1, s2)
100``````

## What’s next?

Now we are equipped with the knowledge of fuzzy string matching in Python. In the next tutorial, we’ll walk through some examples how to use fuzzy string matching together with pandas and Excel.