If your job requires working with large amount of data in Excel, and needing to match up certain records from time to time, you must have experienced the pain trying to match records with slightly different spellings. This tutorial might save you some time and lots of headaches! We talked about fuzzy string matching previously, now let’s try to use it together with pandas. Today we’ll walk through how to do fuzzy matching within dataframes.
The idea is that given two (or more) datasets, each contains a column of unique key identifiers that we can use to match up records. However, those unique key identifiers might have different spellings. This could be resulted from typo, or maybe the data is from different sources, etc.
If we were going to do this kind of matching in Excel, we can use the last argument in the vlookup formula to get an approximate match.
A simple example
Feel free to download the example Excel workbook to follow along. Although the information is outdated (Bezos no longer the richest person on earth), it doesn’t affect the demonstration so I won’t bother with refreshing to the most up-to-date data.
We have two datasets on the top 10 richest people on earth. One dataset contains their names, net worth, age, and nationality. The other dataset contains their source of wealth. We want to combine these two datasets, but some names are spelled differently between these two datasets, such as “Jeff Bezos” and “Jeffery Bezos”.
We will begin by loading the two datasets into Python.
import pandas as pd df1 = pd.read_excel('Top 10 richest.xlsx', sheet_name= "Sheet1") df2 = pd.read_excel('Top 10 richest.xlsx', sheet_name= "Sheet2")
A look at the process module
process module from the
fuzzywuzzy library is handy in this case, since it allows to compute the similarity score for a given keyword against a vector of string values. Let’s try an example with “Jeffery Bezos”. Since we want only the closest match, and ignore everything else, we’ll use the
In the code below we are trying to find the best match for “Jeff Bezos” from the “Name” column in the
Not surprisingly, it found the best match out of all names, which is “Jeffery Bezos”, with a similarity score of 87. The “4” at the end of the result represents the index of the matching record. i.e. “Jeffery Bezos” is the 5th name in the
df2, thus an index of 4 (Python index starts from 0 remember?). Thanks to our reader Roy who pointed this out to me!
from fuzzywuzzy import fuzz from fuzzywuzzy import process process.extractOne("Jeff Bezos", df2['Name']) ('Jeffery Bezos', 87, 4)
To make the output look a little cleaner, I’m going to convert the
df2['Name'] column into a list, then compute again. Now we are left with the best match, and a similarity score. Much cleaner this time!
process.extractOne("Jeff Bezos", df2['Name'].to_list()) ('Jeffery Bezos', 87)
Now, what if there’s no match? The
extractOne will always return the best matching results out of all matches, even if the similarity score is very low (i.e. not a true match). Take a look at the below example, “Jack Ma” doesn’t exist in the first dataframe, let’s see what happens if we try to find a “close match”. The
limit = 10 argument tells the function to return the first 10 matches, which means we want to see the matching result for everyone.
process.extract("Jack Ma", df1['Name'].to_list(), limit = 10) [('Alice Walton', 42), ('Bernard Arnault\xa0and family', 39), ('Amancio Ortega', 39), ('Mark Zuckerberg', 39), ('Warren Buffett', 26), ('Larry Ellison', 26), ('Jeff Bezos', 24), ('Jim Walton', 24), ('S. Robson Walton', 14), ('Bill Gates', 12)]
It’s nice that the extract function already sorts the results from highest to lowest. “Alice Walton” has the highest score (42) out of all ten people. Now if we blindly use
extractOne, this is what will happen:
process.extractOne("Jack Ma", df1['Name'].to_list()) ('Alice Walton', 42)
Hmm… we all know Alice Walton is not Jack Ma! How sad that the computer doesn’t know that.
Setting a threshold for fuzzy matching results
There’s an easy fix to this.
extractOne takes an optional argument called
score_cutoff, which is set to 0 by default if omitted. We can increase this score cutoff value such that a match is considered to be valid (and returned) only if the score is higher than the threshold, otherwise no result will be returned. When you run the below code, nothing will show up because it returned None. If you do a print() statement you’ll see it.
process.extractOne("Jack Ma", df1['Name'].to_list(), score_cutoff=80) print(process.extractOne("Jack Ma", df1['Name'].to_list(), score_cutoff=80)) None
Now we know how to do fuzzy matching for one entry, let’s apply that to all the records in a dataframe column. Remember, we should always try to vectorize operations in pandas, and never use a
for/while loop due to its poor performance. We have previously talked about this point in the Replicate Excel VLOOKUP, HLOOKUP, XLOOKUP in Python tutorial, and the vectorized solution is to leverage
pandas.apply() method, and a Python lambda function.
pandas.apply() method basically applies a function to the entire column of a pandas dataframe, the syntax goes like this:
In this case, the
process.extractOne() method (from
fuzzywuzzy) is the function we want to apply for. In order to use
apply(), we can put the
extractOne() method into a lambda function.
Lambda function refresher
Python lambda function doesn’t require a name, and can take any number of arguments and returns an expression. The syntax goes like this:
lambda arguments: expression
We will take the “Name” column from df1, then fuzzy match to the “Name” column from df2. If we find a valid match from df2, we’ll place the matched value back into df1, in a column named “name_from_df2”.
df1['name_from_df2'] = df1['Name'].apply(lambda x: process.extractOne(x, df2['Name'].to_list(),score_cutoff=80)) >>> df1[['Name','name_from_df2']] Name name_from_df2 0 Jeff Bezos (Jeffery Bezos, 87) 1 Bill Gates (Bill Gates III, 95) 2 Bernard Arnault and family (Bernard Arnauld & Family, 88) 3 Warren Buffett (Warrant Buffett, 90) 4 Larry Ellison (Larry Ellison, 100) 5 Amancio Ortega (Amancio Ortega Gaona, 95) 6 Mark Zuckerberg (Mark Zuckerberg, 100) 7 Jim Walton None 8 Alice Walton (Alice Walton, 100) 9 S. Robson Walton (Robson Walton, 95)
In the above code, we can consider the
x is each element of the name list from the first dataframe, df1. We have effectively (fuzzy) matched names from df1 and df2. Given the
score_cutoff value of 80, we are also confident that only true matches will be returned.
As we saw in the output above, the
process.extractOne() returns a tuple containing the matched result and a similarity score. However, we are only interested in the matched value, not the similarity score, so we’ll keep just the names.
temp_name_list = df1['name_from_df2'].to_list() temp_name_list = [_ if _ != None else None for _ in temp_name_list] df1['name_from_df2'] = temp_name_list
The name_from_df2 column from
df1 now contains the same values as the names in
df2, except for the records where the fuzzy match failed (similarity score < 80). We have essentially transfer names from df2 into df1. Finally we can use the merge operation to combine the two dataframes.
df1 = df1.merge(df2, left_on = 'temp_name', right_on = 'Name', suffixes=('','_2')) df1.drop(['Name_2','temp_name'],axis=1, inplace=True)
The last line
df1.drop() is just to remove extra name columns that we created along the way.
Putting it together
Now you understand how to use fuzzy matching together with pandas. This technique has helped me a ton and saved me lots of time in the past. I hope you find it helpful as well!
import pandas as pd from fuzzywuzzy import fuzz from fuzzywuzzy import process df1 = pd.read_excel('Top 10 richest.xlsx', sheet_name= "Sheet1") df2 = pd.read_excel('Top 10 richest.xlsx', sheet_name= "Sheet2") df1['name_from_df2'] = df1['Name'].apply(lambda x: process.extractOne(x, df2['Name'].to_list(),score_cutoff=80)) name_from_df2_list = df1['name_from_df2'].to_list() name_from_df2_list = [_ if _ != None else None for _ in name_from_df2_list] df1['name_from_df2'] = name_from_df2_list df1 = df1.merge(df2, left_on = 'name_from_df2', right_on = 'Name', suffixes=('','_2')) df1.drop(['Name_2','name_from_df2'],axis=1, inplace=True)