Use fuzzy string matching in pandas

If your job requires working with large amount of data in Excel, and needing to match up certain records from time to time, you must have experienced the pain trying to match records with slightly different spellings. This tutorial might save you some time and lots of headaches! We talked about fuzzy string matching previously, now let’s try to use it together with pandas. Today we’ll walk through how to do fuzzy matching within dataframes.

The idea is that given two (or more) datasets, each contains a column of unique key identifiers that we can use to match up records. However, those unique key identifiers might have different spellings. This could be resulted from typo, or maybe the data is from different sources, etc.

If we were going to do this kind of matching in Excel, we can use the last argument in the vlookup formula to get an approximate match.

A simple example

Feel free to download the example Excel workbook to follow along. Although the information is outdated (Bezos no longer the richest person on earth), it doesn’t affect the demonstration so I won’t bother with refreshing to the most up-to-date data.

We have two datasets on the top 10 richest people on earth. One dataset contains their names, net worth, age, and nationality. The other dataset contains their source of wealth. We want to combine these two datasets, but some names are spelled differently between these two datasets, such as “Jeff Bezos” and “Jeffery Bezos”.

top10_richest_1
top10_richest_1
top10_richest_2
top10_richest_2

We will begin by loading the two datasets into Python.

import pandas as pd

df1 = pd.read_excel('Top 10 richest.xlsx', sheet_name= "Sheet1")
df2 = pd.read_excel('Top 10 richest.xlsx', sheet_name= "Sheet2")

A look at the process module

The process module from the fuzzywuzzy library is handy in this case, since it allows to compute the similarity score for a given keyword against a vector of string values. Let’s try an example with “Jeffery Bezos”. Since we want only the closest match, and ignore everything else, we’ll use the extractOne() function.

In the code below we are trying to find the best match for “Jeff Bezos” from the “Name” column in the df2.

Not surprisingly, it found the best match out of all names, which is “Jeffery Bezos”, with a similarity score of 87. The “4” at the end of the result represents the index of the matching record. i.e. “Jeffery Bezos” is the 5th name in the df2, thus an index of 4 (Python index starts from 0 remember?). Thanks to our reader Roy who pointed this out to me!

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

process.extractOne("Jeff Bezos", df2['Name'])
('Jeffery Bezos', 87, 4)

To make the output look a little cleaner, I’m going to convert the df2['Name'] column into a list, then compute again. Now we are left with the best match, and a similarity score. Much cleaner this time!

process.extractOne("Jeff Bezos", df2['Name'].to_list())
('Jeffery Bezos', 87)

Now, what if there’s no match? The extractOne will always return the best matching results out of all matches, even if the similarity score is very low (i.e. not a true match). Take a look at the below example, “Jack Ma” doesn’t exist in the first dataframe, let’s see what happens if we try to find a “close match”. The limit = 10 argument tells the function to return the first 10 matches, which means we want to see the matching result for everyone.

process.extract("Jack Ma", df1['Name'].to_list(), limit = 10)

[('Alice Walton', 42),
 ('Bernard Arnault\xa0and family', 39),
 ('Amancio Ortega', 39),
 ('Mark Zuckerberg', 39),
 ('Warren Buffett', 26),
 ('Larry Ellison', 26),
 ('Jeff Bezos', 24),
 ('Jim Walton', 24),
 ('S. Robson Walton', 14),
 ('Bill Gates', 12)]

It’s nice that the extract function already sorts the results from highest to lowest. “Alice Walton” has the highest score (42) out of all ten people. Now if we blindly use extractOne, this is what will happen:

process.extractOne("Jack Ma", df1['Name'].to_list())

('Alice Walton', 42)

Hmm… we all know Alice Walton is not Jack Ma! How sad that the computer doesn’t know that.

Setting a threshold for fuzzy matching results

There’s an easy fix to this. extractOne takes an optional argument called score_cutoff, which is set to 0 by default if omitted. We can increase this score cutoff value such that a match is considered to be valid (and returned) only if the score is higher than the threshold, otherwise no result will be returned. When you run the below code, nothing will show up because it returned None. If you do a print() statement you’ll see it.

process.extractOne("Jack Ma", df1['Name'].to_list(), score_cutoff=80)

print(process.extractOne("Jack Ma", df1['Name'].to_list(), score_cutoff=80))
None

Now we know how to do fuzzy matching for one entry, let’s apply that to all the records in a dataframe column. Remember, we should always try to vectorize operations in pandas, and never use a for/while loop due to its poor performance. We have previously talked about this point in the Replicate Excel VLOOKUP, HLOOKUP, XLOOKUP in Python tutorial, and the vectorized solution is to leverage pandas.apply() method, and a Python lambda function.

pandas.apply() refresher

pandas.apply() method basically applies a function to the entire column of a pandas dataframe, the syntax goes like this:

pd.apply(func)

In this case, the process.extractOne() method (from fuzzywuzzy) is the function we want to apply for. In order to use process.extractOne() inside apply(), we can put the extractOne() method into a lambda function.

Lambda function refresher

Python lambda function doesn’t require a name, and can take any number of arguments and returns an expression. The syntax goes like this:

lambda arguments: expression

We will take the “Name” column from df1, then fuzzy match to the “Name” column from df2. If we find a valid match from df2, we’ll place the matched value back into df1, in a column named “name_from_df2”.

df1['name_from_df2'] = df1['Name'].apply(lambda x: process.extractOne(x, df2['Name'].to_list(),score_cutoff=80))

>>> df1[['Name','name_from_df2']]
                         Name                   name_from_df2
0                  Jeff Bezos             (Jeffery Bezos, 87)
1                  Bill Gates            (Bill Gates III, 95)
2  Bernard Arnault and family  (Bernard Arnauld & Family, 88)
3              Warren Buffett           (Warrant Buffett, 90)
4               Larry Ellison            (Larry Ellison, 100)
5              Amancio Ortega      (Amancio Ortega Gaona, 95)
6             Mark Zuckerberg          (Mark Zuckerberg, 100)
7                  Jim Walton                            None
8                Alice Walton             (Alice Walton, 100)
9            S. Robson Walton             (Robson Walton, 95)

In the above code, we can consider the x is each element of the name list from the first dataframe, df1. We have effectively (fuzzy) matched names from df1 and df2. Given the score_cutoff value of 80, we are also confident that only true matches will be returned.

As we saw in the output above, the process.extractOne() returns a tuple containing the matched result and a similarity score. However, we are only interested in the matched value, not the similarity score, so we’ll keep just the names.

temp_name_list = df1['name_from_df2'].to_list()
temp_name_list = [_[0] if _ != None else None for _ in temp_name_list]
df1['name_from_df2'] = temp_name_list

The name_from_df2 column from df1 now contains the same values as the names in df2, except for the records where the fuzzy match failed (similarity score < 80). We have essentially transfer names from df2 into df1. Finally we can use the merge operation to combine the two dataframes.

df1 = df1.merge(df2, left_on = 'temp_name', right_on = 'Name', suffixes=('','_2'))
df1.drop(['Name_2','temp_name'],axis=1, inplace=True)

The last line df1.drop() is just to remove extra name columns that we created along the way.

Putting it together

Now you understand how to use fuzzy matching together with pandas. This technique has helped me a ton and saved me lots of time in the past. I hope you find it helpful as well!

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process


df1 = pd.read_excel('Top 10 richest.xlsx', sheet_name= "Sheet1")
df2 = pd.read_excel('Top 10 richest.xlsx', sheet_name= "Sheet2")


df1['name_from_df2'] = df1['Name'].apply(lambda x: process.extractOne(x, df2['Name'].to_list(),score_cutoff=80))
name_from_df2_list = df1['name_from_df2'].to_list()
name_from_df2_list = [_[0] if _ != None else None for _ in name_from_df2_list]
df1['name_from_df2'] = name_from_df2_list

df1 = df1.merge(df2, left_on = 'name_from_df2', right_on = 'Name', suffixes=('','_2'))
df1.drop(['Name_2','name_from_df2'],axis=1, inplace=True)

5 comments

  1. Excellent article. The “4” returned from extractOne seems to be the index to the matching entry.

    1. I think you are absolutely right!
      Thank you for pointing that out!
      Just updated the article with that new knowledge 🙂

Comments are closed.