How to merge multiple Excel files in Python

I work in the insurance industry and deal with lots of data every day. One time, I got tasked to merge multiple Excel files into one “master spreadsheet”. Each Excel file has different data fields on insurance policies, such as policy number, age, gender, insured amounts, etc. These files have one column in common, which is the policy IDs. In the past I would just use Excel and vlookup formulas, or PowerQuery’s merge data function. Those tool used to work well; however, they become a liability when we need to deal with large datasets. And this is where Python can help.

Note this article talks about merging Excel files with with a common ID but different data fields. You might be also interested in appending Excel files, which is used to aggregate files with the same format.

This tutorial assumes that you have a basic understanding of Python and the pandas library. If you feel lost, this series of articles might help.

The Excel files

Below are a few mock-up spreadsheets, you can also go to here to grab a copy and follow along. These are very small datasets just for the purpose of the demonstration.

Example spreadsheet 1
Dataset/Spreadsheet 1: Basic policyholder information
Example spreadsheet 2
Dataset/Spreadsheet 2: Policyholder insured amount and type

Example spreadsheet 3
Dataset/Spreadsheet 3: Policyholder death report

We notice the “PolicyID” columns contain what is known as a “Unique Key Identifier”, which can be used to link up policies among the three spreadsheets. Being proficient in Excel, my first reaction was: this is easy, vlookup will do the job. I can use vlookups to find values for each PolicyID, and bring all data fields into one spreadsheet! It turned out to be a bad idea since I was dealing with hundreds of thousands of records, and I spent about a whole day to build a massive spreadsheet with millions of vlookup and other formulas.

It was one of the worst Excel files that I’ve ever created. The spreadsheet was 150MB in size, and it takes about 30 minutes to re-calculate whenever I make a change. What a waste of time and effort. Just terrible.

I wish I knew Python back then, it could have saved me tons of time and headaches. (i.e. waiting for the spreadsheet to re-calculate)

The Python🐍 way

As usual, we start off by importing our favorite library – pandas. Then we read all three Excel files into Python.

import pandas as pd
df_1 = pd.read_excel('sheet1.xslx')
df_2 = pd.read_excel('sheet2.xslx') 
df_3 = pd.read_excel('sheet3.xslx') 

As a reminder, a pandas dataframe is a tabular data object which looks exactly like an Excel spreadsheet – rows, columns and cells!

pandas dataframe just looks like an Excel spreadsheet
Pandas dataframe just looks like an Excel spreadsheet

Pandas has a method .merge() to merge multiple datasets efficiently.

df_combine = df_1.merge(df_2, left_on='PolicyID', right_on = "ID")
df_combine = df_combine.merge(df_3, on='PolicyID')

The first merge()

Here, df_1 is called the left dataframe, and the df_2 is called the right dataframe. Merging df_2 with df_1 basically means we are bringing all the data from the two dataframes together, matching each record from df_2 to df_1, using a common unique key.

Just like the Excel vlookup formula, except that we’ve achieved the same result with 1 line of code instead of millions of formulas!

Notice that in the first Excel file, the column “PolicyID” contains policy numbers. In contrast, in the second Excel file, the column “ID” contains the policy numbers, so we had to specify that, for the left dataframe (df_1), we want to use “PolicyID” column as the unique key. For the right dataframe (df_2), we want to use “ID” column as the unique key.

There is the same number of records in both df_1 and df_2, so we can get a one-to-one match and bring both dataframes together.

combined dataframe
Combined dataframe, total 21 rows and 8 columns

The second merge()

We take the result from the first merge operation, then merge with another df_3. This time, because both dfs have the same common column “PolicyID”, we just need to specify it with on = 'PolicyID'. The final combined dataframe has 8 rows and 11 columns.

final combined dataframe
Merge multiple Excel files into one dataset using Python

A few interesting observations about the final combined dataframe:

  1. Both PolicyID (from df_1) and ID (from df_2) got brought into the dataframe, we’ll have to drop one to clean up the data.
  2. There are two “Account Value” columns. Account Value_x (from df_2) and Account Value_y (from df_3). When there are two identical columns, by default, pandas will assign suffix “_x”, “_y”, etc. to the end of column names. We can change the suffix by using the optional argument suffixes=('_x', '_y') inside the merge() method.
  3. There are only 8 rows in the final dataframe. This is due to df_3 has only 8 records. And by default, the merge() performs an “inner” merge which use intersection of keys from both frames, similar to a SQL inner join.

Putting it All Together

import pandas as pd

#load datasets from Excel files
df_1 = pd.read_excel('sheet1.xlsx')
df_2 = pd.read_excel('sheet2.xlsx')
df_3 = pd.read_excel('sheet3.xlsx')

#merge datasets
df_combine = df_1.merge(df_2, left_on='PolicyID', right_on = "ID")
df_combine = df_combine.merge(df_3, on='PolicyID')

#output back into Excel

Merge Performance: Excel vs Python

You are probably already familiar with Excel and know how slow it can be if you have thousands of lookup formulas. I want to show you how fast Python can merge two large datasets.

Superior Python merge performance
Superior Python merge performance
  • The two dataframes come from two csv files, which are 27MB and 25MB in size, respectively.
  • 2.03 seconds to load both files into Python.
  • 1.26 seconds to merge the two datasets together.

You can boost your work efficiency by x1000 times by using Python to handle your data needs. Now you know how, so it’s time to start using it!