Use Python to combine multiple Excel files into one master spreadsheet

March 26, 2020 Jay Beginner, machine learning, Office Automation, Python

Last Updated on April 9, 2023 by Jay

In this short tutorial, I’ll show you how to use Python to combine multiple Excel files into one master spreadsheet. Imagine that you have dozens of Excel files with the same data fields, and your job is to aggregate sheets from those files. Manually doing this job is super inefficient, and Python will save you a lot of time in the long run, so let’s all work smarter!

Note that this article talks about appending Excel files with the same format/data fields. Merging multiple dataset is a different task.

If you are new to Python, this series Integrate Python with Excel offers some tips on how to use Python to supercharge your Excel spreadsheets.

The workflow

To solve the problem, we’ll need to follow the below work flow:

Identify the files we need to combine
Get data from the file
Move data from step 2) to a master dataset (we will call it “dataframe”)
Report 2-3 for the number of files
Save the master dataset into an Excel spreadsheet

Import libraries

Alright, let’s see how to code the above work flow in Python. For this exercise, we’ll need to use two Python libraries: os and pandas. If you want to follow along, feel free to grab the source code and files used in this tutorial from here. Although you can combine as many Excel files as you wish, we’ll use three files to demonstrate the process.

If you need help with installing Python or libraries, here’s a guide on how to do that.

os library gives a way of using operating system dependent functionalities. Such as manipulating folder and file paths. We use this library to get all the Excel file names, including their paths.

pandas library is the gold standard for data analysis and manipulation. It is fast, powerful, and flexible. We use this library to load Excel data into Python, manipulate data, and recreate the master spreadsheet.

We’ll start by importing these two libraries. Then find the current working directory, as well as all the file names within the directory.

import os
import pandas as pd
cwd = os.path.abspath('') 
files = os.listdir(cwd)

Getting current working directory and files within it

The variable cwd shows the path to the current working directory, and the variable files is a list of all the file names within the current working directory. Notice there are non-Excel files, and we don’t want to open those, so we’ll handle that soon.

Combine multiple Excel files into one spreadsheet

Next, we create an empty dataframe df for storing the data for master spreadsheet. We loop through all the files within the current working directory, but only process the Excel files whose name ends with “.xlsx”. This is done by this line of code
if file.endswith('.xlsx'):

pd.read_excel() will read Excel data into Python and store it as a pandas DataFrame object. Be aware that this method reads only the first tab/sheet of the Excel file by default. If your Excel file contains more than 1 sheet, continue reading to the next section.

df.append() will append/combine data from one file to another. Think about copying a block of data from one Excel file and pasting it into another. Instead of opening up Excel, data is stored inside your computer’s memory.

df = pd.DataFrame()
for file in files:
     if file.endswith('.xlsx'):
         df = df.append(pd.read_excel(file), ignore_index=True) 
df.head()

The above code does the following:

Loop through all the files in the current working directory, determine if a file is Excel by checking the file name ends with “.xlsx”.
If yes, read the file content (data), and append/add it to the master dataframe variable called df.
Save the master dataframe into an Excel spreadsheet.

We can examine the master dataframe by checking df.head(), which shows the first 5 rows of the data.

Checking the first 5 rows of data in the dataframe

Seems good! Just another quick check to make sure we have loaded everything in the DataFrame. df.shape will show us the dimension (36 rows, 5 columns) of the data:

Everything looks good, so let’s output the data back into Excel. The last line df.to_excel() will do that.

Combine multiple sheets from the same Excel file

I talked about the two techniques to read multiple sheets from the same Excel file, so I won’t repeat it. However, I’ll walk through an example here with a slightly different setting.

We have 2 files each contains a number of sheets. We don’t know how many sheets are in each file, but we know the format is the same for all sheets. Our goal is to aggregate all sheets into one spreadsheet (and one file).

The workflow is similar:

Get all Excel files
Loop through the Excel files
For each file, loop through all sheets
Read each sheet into a dataframe, then combine all dataframes together.

df_total = pd.DataFrame()
for file in files:  # loop through Excel files
    if file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(file)
        sheets = excel_file.sheet_names
        for sheet in sheets: # loop through sheets inside an Excel file
            df = excel_file.parse(sheet_name = sheet)
            df_total = df_total.append(df)
df_total.to_excel('combined_file.xlsx')

Putting it all together

Below is the full code put together. 10 lines of code will help you combine all your Excel files or sheets into one master spreadsheet. Enjoy!

import os
import pandas as pd
cwd = os.path.abspath('') 
files = os.listdir(cwd)  

## Method 1 gets the first sheet of a given file
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file), ignore_index=True) 
df.head() 
df.to_excel('total_sales.xlsx')



## Method 2 gets all sheets of a given file
df_total = pd.DataFrame()
for file in files:                         # loop through Excel files
    if file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(file)
        sheets = excel_file.sheet_names
        for sheet in sheets:               # loop through sheets inside an Excel file
            df = excel_file.parse(sheet_name = sheet)
            df_total = df_total.append(df)
df_total.to_excel('combined_file.xlsx')

25 comments

Aayushi Jajodia says:

February 7, 2022 at 4:54 pm

Hi, is there a way to combine 2 different data frames from 2 different excels/csvs into different sheets of a single workbook?

Reply
1. Jay says:
  
  February 7, 2022 at 9:53 pm
  
  Hi Aayushi,
  
  Yes there is! I’ll make a tutorial about it shortly, please stay stuned!
  
  Reply
2. Jay says:
  
  February 9, 2022 at 12:06 am
  
  Just published:
  https://pythoninoffice.com/combine-excel-sheets-using-python/
  
  Hope it helps. Let me know if you have other questions!
  
  Reply
sankari says:

February 15, 2022 at 12:45 pm

Hi Jay I followed the step in combining multiple excel files with multiple sheets .I am getting empty combined file .do i need to mention the sheet names ? what i am missing.

Thanks in advance

Reply
1. Jay says:
  
  February 15, 2022 at 5:44 pm
  
  Hi Sankari,
  
  It depends on what your excel files look like. If you can share a screenshot of the file or the actual spreadsheets and your code, that’ll help solve your problem.
  
  Feel free to use this Discord server to post your screenshot/files: https://discord.gg/ChP87jJY7c
  
  Reply
Muhammad says:

February 17, 2022 at 12:39 pm

Hi Jay, I have an excel file which has sheets formatted differently, I am trying to make it so that these sheets are read and combined into another excel file in which I want the sheet to be continuous flowing together after formatting has been made the same for them….

Reply
1. Jay says:
  
  February 22, 2022 at 11:05 pm
  
  Muhammad,
  
  Thank you for dropping by!
  If you can share an example of your sheets that will make it a lot easier to answer your question.
  Feel free to reach out at my email: pythoninoffice@gmail.com
  or discord: https://discord.gg/ChP87jJY7c
  
  Reply
Rick says:

February 25, 2022 at 2:46 pm

Hi Jay,

I have a process as follows

use 2 separate csv files in a folder and run a power query in a separate excel to filter results by date.

then next step is to use a combination of macro buttons in an excel to filter the combined data.

Next the process is to cut and paste separate ranges to three separate excel files.

Tedious much…. any help here would be greatly appreciated

Reply
1. Jay says:
  
  February 25, 2022 at 10:49 pm
  
  Hi Rick,
  
  Thanks for dropping by!
  
  I was in a similar situation a few years ago, that we had a combination of PQ, VBA, and Excel formulas. The process worked fine but it was tedious and slow (takes 2-3 days to update everything). Then we coded that process in Python and now it takes 30 mins.
  
  I’m confident that the process you outlined can be automated with Python. But I’ll need more details before I know how to help.
  
  Reply
Pingback: Use Python To Concatenate Excel Files - Python In Office
Supriya says:

May 3, 2022 at 10:39 am

Hi Jay, I have multiple excel files with multiple sheets in it and also I need to have one master sheet with combining only one sheet from each excel file and also need to eliminate few rows in it. how do I do this?
can you please help me?

Reply
1. Jay says:
  
  May 7, 2022 at 11:00 pm
  
  Hi Supriya,
  
  Thanks for stopping by.
  
  For your problem, you can use the example in the tutorial with some tweaks.
  1. When you read the Excel file into Python, make sure to include a sheet_name argument to specify which sheet you want from the file.
  2. You’ll have to do some filtering to eliminate a few rows, but it really depends on what you need to remove.
  
  Hope that helps. You can also send me further questions or your code to my email address: pythoninoffice@gmail.com.
  
  Reply
Sarthak says:

June 3, 2022 at 4:21 pm

Hi Jay,

I have multiple excel files with multiple sheets in them. I have to map the data from all these excel files into one master excel file based on particular cells.

For example: Getting data from Cell B5 in sheet1 in all excel workbooks to cell I5 and onwards in the master excel file.
Then later on getting data from Cell B7 from sheet 2 into Cell H5 and onwards in the master excel file and so on.

I am stuck with this! Hope you can help!

Reply
1. Jay says:
  
  June 3, 2022 at 6:17 pm
  
  Hi Sarthak,
  Thanks for your question. For your task I would suggest using a library like openpyxl/xlsxwriter/xlwings.
  In fact, we have a tutorial (still in progress) that touches on a similar task: copying data from excel files and then combining them into one file. Stay tuned and I’ll drop you a message once that’s available!
  
  Regards,
  Jay
  
  Reply
Fetene Muluken says:

August 1, 2022 at 3:32 am

Hello, I have two types of meteorological data, i,.e station and gridded data both are the same format but station data has missing value and there is no missing value in gridded data. So I want to fill these missing value from gridded data by direct replace the value. My question is how to replace the gridded value in to the missing value in python??

Regards!
Fetene

Reply
1. Jay says:
  
  August 4, 2022 at 10:33 pm
  
  Hi Fetene,
  
  Thanks for dropping by. If you could share an example of the data that will help me form a more concrete answer. There are many ways to fill missing values, and it depends on how you want to fill them. Let me know.
  
  Reply
O Magro says:

August 3, 2022 at 5:22 pm

If you have the time and dont mind, (new to python) How would you go about taking multiple files, but only extracting specific columns or depending on the best use scenario, ignoring certain columns? Also would the master file bug out if there are columns not present in one but not the other?

Reply
1. Jay says:
  
  August 4, 2022 at 10:44 pm
  
  Hi O Magro,
  
  Of course, I’m always happy to help someone new to Python (I’ve been there myself!)
  
  We can keep (or drop) certain columns in a dataframe (i.e. data in the Excel file). This tutorial here talks about how to drop columns: https://pythoninoffice.com/drop-column-pandas/
  
  The master file will not bug out if your individual files contain different columns. Let’s say your file_1 contains a column “a”, and file_2 contains a column “b”. When you combine them, the master file will contain both columns “a” and “b”. However, column “b” will be empty for the data records originally from file_1, and similarly file_2 will contain an empty column “a”.
  
  Hope that helps.
  
  Reply
jeremy says:

September 12, 2022 at 12:13 am

Hello, could I choose the files by filedialog?
Thanks.

Reply
Andnet says:

September 17, 2022 at 8:46 am

Hi,
Real-life data come with inconsistent headers and in large sizes.
How do you go about it if you create a single pandas data frame, for example, from 150 excel files, with inconsistent headers?
I know that in this situation, the appending adds up as a new column and creates big dimension.
This is a real challenge. The concept you raised is great and your presentation is good, too.
Thanks
And

Reply
1. Jay says:
  
  September 29, 2022 at 12:33 am
  
  Hi Andnet,
  
  Very good question and indeed data is never consistent in the real world.
  I have used two approaches for the situations you described:
  1. If I know all the possible header names, I’d create a dictionary to map them into a consistent name. For example, we can create mapping = {'unique id': 'id', 'UNIQUE_ID':'id', 'member_id':'id'} to map those three columns “unique id”, “UNIQUE_ID” and “member_id” into name “id”.
  2. If not feasible to know all potential header names. I’d use a fuzzy string match. This involves solving a levenshtein distance (# of steps required to change from one string to another) to find the closet string matches. You can read about the string matching here: https://pythoninoffice.com/how-to-find-similar-strings-using-python
  And how to use string matching in pandas here: https://pythoninoffice.com/use-fuzzy-string-matching-in-pandas
  
  These 2 approaches don’t always work perfectly, but in most cases they helped me get 90% of the job done.
  
  Reply
sabbyyyy says:

October 4, 2022 at 2:47 am

Hi !

How can I merge multiple worksheets from different workbook .

For example :
wb1 has worksheet : Sheet1 , Sheet2, Sheet3
wb2 has worksheet : Sheet1, Sheet2, Sheet3
and so on..

How can I merge sheet1 from all the work book together and then followed by sheet2 merge them all together into one total workbook.

Sorry if my question is complicated.

Reply
1. Jay says:
  
  October 6, 2022 at 11:47 am
  
  No problem and not complicated at all.
  Assuming all files have the same format and number of sheets, I would do something like this:
  
  file_list = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx'] sheet_names = pd.ExcelFile('file1.xlsx').sheet_names
  
  with pd.ExcelWriter('file.xlsx') as writer:
  for s in sheet_names:
  temp_df = pd.DataFrame()
  for f in file_list:
  temp_df = pd.concat([temp_df, pd.read_excel(f, sheet_name = s)])
  temp_df.to_excel(writer, sheet_name = s)
  
  Note that the pd.append() method is being deprecated and not available in more recent version of pandas, so you should switch to .concat(). I will update this tutorial soon.
  
  Reply
Rita says:

November 18, 2022 at 2:40 am

Hi there,
I would like to combine different sheets from different excel files (same format) to one sheet of one excel file. I tried to leverage below code but it only combined 1st sheet in each file, not all the sheets. Kindly advise.

import pandas as pd
import glob

path = “C\doc”

filenames = glob.glob(path +”\*.xlsx’)

df_total = pd.DataFrame()

for file in files:
if file.endswith(‘.xlsx’):
excel_file = pd.ExcelFile(file)
sheets = excel_file.sheet_names
for sheet in sheets:
df = excel_file.parse(sheet_name = sheet)
df_total = df_total.append(df)

df_total.to_excel(“C\combined_file.xlsx”,index=False)

Reply
Rita says:

November 18, 2022 at 2:50 am

Hi there,

I want to combine different sheets from different excel files into one sheet in one file. I tried to leverage below coodes but it can only combined 1st sheet in each file. Kindly advise.

import pandas as pd
import glob

path = “C\Doc”

filenames = glob.glob(path +”\*.xlsx”)

df_total = pd.DataFrame()

for file in files:
if file.endswith(‘.xlsx’):
excel_file = pd.ExcelFile(file)
sheets = excel_file.sheet_names
for sheet in sheets:
df = excel_file.parse(sheet_name = sheet)
df_total = df_total.append(df)
df_total.to_excel(“C\doc\combined.xlsx”,index=false)

Reply

The workflow

Import libraries

Combine multiple Excel files into one spreadsheet

Combine multiple sheets from the same Excel file

Putting it all together

Leave a Reply Cancel reply

Leave a Reply