I work in the insurance industry and deal with lots of data every day. One time, I was tasked with merging multiple Excel files into one “master spreadsheet”. Each Excel file has different data fields on insurance policies, such as policy number, age, gender, insured amounts, etc. These files have one column in common: the policy IDs. In the past I would just use Excel vlookup formulas, or PowerQuery’s merge data function. Those tools used to work well; however, they become a liability when we need to deal with large datasets. And this is where Python can help.
Note this article talks about merging Excel files with a common ID but different data fields. You might also be interested in appending Excel files, which is used to aggregate files with the same format.
This tutorial assumes that you have a basic understanding of Python and the pandas library. If you feel lost, this series of articles might help.
The Excel files
Below are a few mock-up spreadsheets; you can also grab a copy here and follow along. These are very small datasets, just for the purpose of the demonstration.
We notice the “PolicyID” columns contain what is known as a “Unique Key Identifier”, which can be used to link up policies among the three spreadsheets. Being proficient in Excel, my first reaction was: this is easy, vlookup will do the job. I can use vlookups to find values for each PolicyID and bring all data fields into one spreadsheet! It turned out to be a bad idea, since I was dealing with hundreds of thousands of records, and I spent about a whole day building a massive spreadsheet with millions of vlookup and other formulas.
It was one of the worst Excel files that I’ve ever created. The spreadsheet was 150MB in size, and it took about 30 minutes to re-calculate whenever I made a change. What a waste of time and effort. Just terrible.
I wish I had known Python back then; it could have saved me tons of time and headaches (i.e., waiting for the spreadsheet to re-calculate).
The Python🐍 way
As usual, we start off by importing our favorite library – pandas. Then we read all three Excel files into Python.
import pandas as pd

df_1 = pd.read_excel('sheet1.xlsx')
df_2 = pd.read_excel('sheet2.xlsx')
df_3 = pd.read_excel('sheet3.xlsx')
As a reminder, a pandas dataframe is a tabular data object which looks exactly like an Excel spreadsheet – rows, columns and cells!
Pandas has a .merge() method to merge multiple datasets efficiently.
df_combine = df_1.merge(df_2, left_on='PolicyID', right_on='ID')
df_combine = df_combine.merge(df_3, on='PolicyID')
The first merge()
df_1 is called the left dataframe, and df_2 is called the right dataframe. Merging the two basically means bringing all the data from both dataframes together, matching each record in df_1 to its counterpart in df_2 using a common unique key.
Just like the Excel vlookup formula, except that we’ve achieved the same result with 1 line of code instead of millions of formulas!
Notice that in the first Excel file, the column “PolicyID” contains the policy numbers, while in the second Excel file they are in the column “ID”. So we had to specify that, for the left dataframe (df_1), we want to use the “PolicyID” column as the unique key, and for the right dataframe (df_2), the “ID” column.
There are the same number of records in df_1 and df_2, so we get a one-to-one match and can bring the two dataframes together.
The second merge()
We take the result from the first merge operation, then merge it with df_3. This time, because both dataframes share the common column “PolicyID”, we just need to specify it with on='PolicyID'. The final combined dataframe has 8 rows and 11 columns.
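A quick sketch of this second step, again with invented data standing in for the intermediate result and df_3:

```python
import pandas as pd

# Hypothetical stand-ins: the result of the first merge, and df_3
df_combine = pd.DataFrame({'PolicyID': [1, 2, 3], 'Age': [34, 45, 29]})
df_3 = pd.DataFrame({'PolicyID': [1, 2], 'Premium': [500, 750]})

# Both frames share the "PolicyID" column, so a single on= suffices
result = df_combine.merge(df_3, on='PolicyID')
print(result.shape)  # (2, 3) -- only the matching keys survive
```

Notice the result shrinks to the keys present in both frames, which foreshadows the row-count observation below.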
A few interesting observations about the final combined dataframe:
- Both PolicyID (from df_1) and ID (from df_2) got brought into the dataframe; we’ll have to drop one to clean up the data.
- There are two “Account Value” columns: Account Value_x (from df_2) and Account Value_y (from df_3). When two columns have identical names, by default pandas appends the suffixes “_x”, “_y”, etc. to the column names. We can change these with the optional suffixes=('_x', '_y') argument of .merge().
- There are only 8 rows in the final dataframe. This is because df_3 has only 8 records, and by default .merge() performs an “inner” merge, which uses the intersection of keys from both frames, similar to a SQL inner join.
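The last two observations can be demonstrated in a few lines. The sketch below uses made-up frames to show the default suffixes, a custom suffixes= argument, and how how='outer' keeps every key instead of only the intersection:

```python
import pandas as pd

# Two frames sharing a "PolicyID" key and a clashing "Account Value" column
left = pd.DataFrame({'PolicyID': [1, 2, 3], 'Account Value': [100, 200, 300]})
right = pd.DataFrame({'PolicyID': [2, 3, 4], 'Account Value': [250, 350, 450]})

# Default inner merge: keeps only keys in both frames,
# disambiguates the clashing column with _x / _y
inner = left.merge(right, on='PolicyID')
print(sorted(inner['PolicyID'].tolist()))  # [2, 3]
print(sorted(inner.columns))  # includes 'Account Value_x', 'Account Value_y'

# Custom suffixes, plus an outer merge that keeps every key from either side
outer = left.merge(right, on='PolicyID', how='outer',
                   suffixes=('_old', '_new'))
print(len(outer))  # 4 -- keys 1, 2, 3, 4
```

The missing values introduced by the outer merge show up as NaN, so an outer join is the option to reach for when dropped rows would be a problem.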
Putting it All Together
import pandas as pd

# load datasets from Excel files
df_1 = pd.read_excel('sheet1.xlsx')
df_2 = pd.read_excel('sheet2.xlsx')
df_3 = pd.read_excel('sheet3.xlsx')

# merge datasets
df_combine = df_1.merge(df_2, left_on='PolicyID', right_on='ID')
df_combine = df_combine.merge(df_3, on='PolicyID')

# output back into Excel
df_combine.to_excel('combine_df.xlsx')
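As a small optional cleanup before exporting, we can drop the redundant ID column noted earlier and keep pandas’ row index out of the output file. A sketch with made-up data:

```python
import pandas as pd

# Hypothetical merged result carrying both key columns
df_combine = pd.DataFrame({
    'PolicyID': [1, 2],
    'ID': [1, 2],  # duplicate key carried over from df_2
    'Age': [34, 45],
})

# Drop the redundant key column
df_combine = df_combine.drop(columns='ID')
print(df_combine.columns.tolist())  # ['PolicyID', 'Age']

# index=False keeps the pandas row index out of the Excel file
# df_combine.to_excel('combine_df.xlsx', index=False)
```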
Merge Performance: Excel vs Python
You are probably already familiar with Excel and know how slow it can be if you have thousands of lookup formulas. I want to show you how fast Python can merge two large datasets.
- The two dataframes come from two csv files, which are 27MB and 25MB in size, respectively.
- 2.03 seconds to load both files into Python.
- 1.26 seconds to merge the two datasets together.
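If you want to reproduce a comparable measurement, the sketch below times a merge on synthetic data; the row count is an assumption, and the resulting timings will vary by machine:

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-ins for two large files (size is an assumption)
n = 500_000
rng = np.random.default_rng(0)
df_a = pd.DataFrame({'PolicyID': np.arange(n),
                     'Age': rng.integers(18, 80, n)})
df_b = pd.DataFrame({'PolicyID': np.arange(n),
                     'Value': rng.random(n)})

# Time only the merge itself
start = time.perf_counter()
merged = df_a.merge(df_b, on='PolicyID')
elapsed = time.perf_counter() - start
print(f"Merged {len(merged):,} rows in {elapsed:.2f} seconds")
```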
You can boost your work efficiency by 1,000x by using Python to handle your data needs. Now you know how, so it’s time to start using it!