Last Updated on July 14, 2022 by Jay
In this tutorial, we’ll learn how to use Python to concatenate/append/merge/combine multiple Excel files.
Programming languages keep evolving, and Python/pandas are no exception. At the time of writing, the latest pandas version is v1.4.1. If you followed my other tutorial on how to combine Excel files, you will want to learn the method I’m about to show in this tutorial because the df.append() method is going away soon.
TL;DR
Use concat() to replace append() when combining data vertically, and use merge() to combine horizontally.
>>> pd.__version__
'1.4.1'
>>> df = pd.DataFrame()
>>> df.append(df)
Warning (from warnings module):
File "<pyshell#3>", line 1
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Library
For this tutorial, we’ll use the pandas library which is a 3rd party library. The standard process to install any 3rd party library is to use pip, a program that came with the Python installation. Go to your command prompt and type:
pip install pandas
#if you want to upgrade your pandas to the latest version, type:
pip install pandas --upgrade
Terminologies
Before going into the details, let’s get some terminologies out of the way. There are two ways of concatenating/combining data: vertically and horizontally.
- concatenating vertically – add data to the bottom of another data
- concatenating horizontally – add data to the “right” side of another data
Also, note I’ll be using the words concatenate and combine interchangeably throughout the tutorial.
Loading Excel Files Into Python
This is fairly simple and can be done with the read_excel() or read_csv() method, depending on your file type.
import pandas as pd
df1 = pd.read_excel('file1.xlsx')
df2 = pd.read_excel('file2.xlsx')
Use Python To Concatenate Excel Files Vertically
This is the old append() method, which will be replaced with concat() going forward. Combining Excel files vertically means we take data from 1 file/sheet, and put it at the bottom of the data on another file/sheet. Graphically it looks like the following:
The concat() method
concat() does exactly the same thing as append(), thus it’s going to replace append completely. Note the concat() is a pandas method, not a Dataframe method. You might be used to writing df.append(). However, if you follow that habit and write df.concat(), you’ll get an error because the Dataframe class doesn’t have a method called concat.
AttributeError: 'DataFrame' object has no attribute 'concat'
The top class pandas has the concat method. There are several key arguments:
pd.concat(objs, axis=0, join='outer')
- The 1st argument inside concat is a list/array of Series of Dataframe objects. Of course, we can pass in more than 2 items at once.
- The 2nd argument is axis to concatenate along. By default this is 0, meaning index/rows. Leaving this default value or setting axis = 0 will combine data vertically.
- The 3rd argument is join that specifies how to handel indexes on other axis. By default this is ‘outer’. In this case (concat rows), it means we’ll keep all columns from both dataframes. If we set this to ‘inner’, the resulting dataframe will only have the columns that are common to both df1 and df2. (In this case, the “Name” column)
To achieve the same results as the old append(), simply just pass in a list of dataframes, and leave all other arguments as default.
pd.concat([df1,df2,df1])
The append() method
This used to be my go-to approach when using Python to concatenate Excel files, although the concat() method also existed. Check out this tutorial here if you still want to learn more about append(), I will also update that tutorial to reflect the deprecation.
Use Python To Concatenate Excel Files Horizontally
This can be achieved using either concat() or merge() method. Combining Excel files horizontally means we combine data from multiple sources together using a unique key. Essentially it’s like adding more data (columns) to the existing data. Graphically it looks like the below, note here we use only unique keys (the ID column) from the table on the left; therefore, records 5 and 6 got dropped after the combination.
The concat() method
Although this method can combine data horizontally, it lacks flexibility compared with the merge() method.
The same arguments apply in this case, we’ll discuss just the three important ones:
- The 1st argument inside concat is a list/array of Series of Dataframe objects. Of course, we can pass in more than 2 items at once.
- The 2nd argument is axis to concatenate along. By default this is 0, meaning index/rows. To concatenate data along the column axis, set axis = 1.
- The 3rd argument is join that specifies how to handel indexes on other axis. By default this is ‘outer’. When concatenating by column axis, this argument controls how to handle the rows during concatenation. If we set this to ‘inner’, the resulting dataframe will only have the rows that are common to both df1 and df2. In our example, row ID 5 and 6 only exist in the right table, therefore they got dropped in the final dataframe.
pd.concat([df1,df2],axis = 1)
The above code uses the default join = ‘outer’ by leaving out the argument. Therefore it keeps all records from both dataframes.
There are a few problems with this concat method when combining data horizontally:
- The resultign dataframe has duplicated columns that we have to take clean up
- It only uses the index from both df as the unique key. As we see in this above example, the 3rd row is matched by the index, not by ID column. This method lacks the flexibility to specify what unique key to use.
Given these limitations, I recommend always use the merge() method when combining data horizontally.
The merge() method
The merge method is a Dataframe method, not a pandas method. The dataframe that initiates the merge is called the “left” dataframe and the dataframe to add is called the “right” dataframe (first argument inside the method).
The syntax is as the following, I only list the important arguments here:
left_df.merge(right_df, how='inner', on=None, left_on=None, right_on=None, suffixes=('_x','_y')
- how controls the type of merge, possible chocies are “left”, “right”, “outer”, “inner”, and “cross”. Default is “inner”.
- on specifies the column to use as the unique key to merge. This can be a single column or a list of them. Use this argument if the unique keys have the same names.
- left_on specifies the unique keys/columns to use from the left dataframe for the merge. Useful when left and right dataframes contain different column names.
- right_on specifies the unique keys/columns to use from the right dataframe for the merge. Useful when left and right dataframes contain different column names.
- suffixes adds suffixes to columns that has the same names. By default it will add _x to the columns in left df, and _y to the columns in the right df.
As the above example shows, merge() method provides lots of flexibility and is more versatile when it comes to merging files/data. If the above short example doesn’t answer your questions, I have another more detailed tutorial on how to use the merge() method.
Conclusion
If you have an existing Python process to concatenate Excel files, you might want to update your append() to concat() before getting the nasty error. If you are new, then don’t worry and you are good to follow this tutorial!