Use Python to combine multiple Excel files into one master spreadsheet

Sharing is caring!

Last Updated on April 9, 2023 by Jay

In this short tutorial, I’ll show you how to use Python to combine multiple Excel files into one master spreadsheet. Imagine that you have dozens of Excel files with the same data fields, and your job is to aggregate sheets from those files. Manually doing this job is super inefficient, and Python will save you a lot of time in the long run, so let’s all work smarter!

Note that this article talks about appending Excel files with the same format/data fields. Merging multiple dataset is a different task.

If you are new to Python, this series Integrate Python with Excel offers some tips on how to use Python to supercharge your Excel spreadsheets.

The workflow

To solve the problem, we’ll need to follow the below work flow:

  1. Identify the files we need to combine
  2. Get data from the file
  3. Move data from step 2) to a master dataset (we will call it “dataframe”)
  4. Report 2-3 for the number of files
  5. Save the master dataset into an Excel spreadsheet

Import libraries

Alright, let’s see how to code the above work flow in Python. For this exercise, we’ll need to use two Python libraries: os and pandas. If you want to follow along, feel free to grab the source code and files used in this tutorial from here. Although you can combine as many Excel files as you wish, we’ll use three files to demonstrate the process.

If you need help with installing Python or libraries, here’s a guide on how to do that.

os library gives a way of using operating system dependent functionalities. Such as manipulating folder and file paths. We use this library to get all the Excel file names, including their paths.

pandas library is the gold standard for data analysis and manipulation. It is fast, powerful, and flexible. We use this library to load Excel data into Python, manipulate data, and recreate the master spreadsheet.

We’ll start by importing these two libraries. Then find the current working directory, as well as all the file names within the directory.

import os
import pandas as pd
cwd = os.path.abspath('') 
files = os.listdir(cwd) 

The variable cwd shows the path to the current working directory, and the variable files is a list of all the file names within the current working directory. Notice there are non-Excel files, and we don’t want to open those, so we’ll handle that soon.

Combine multiple Excel files into one spreadsheet

Next, we create an empty dataframe df for storing the data for master spreadsheet. We loop through all the files within the current working directory, but only process the Excel files whose name ends with “.xlsx”. This is done by this line of code
if file.endswith('.xlsx'):

pd.read_excel() will read Excel data into Python and store it as a pandas DataFrame object. Be aware that this method reads only the first tab/sheet of the Excel file by default. If your Excel file contains more than 1 sheet, continue reading to the next section.


df.append() will append/combine data from one file to another. Think about copying a block of data from one Excel file and pasting it into another. Instead of opening up Excel, data is stored inside your computer’s memory.

df = pd.DataFrame()
for file in files:
     if file.endswith('.xlsx'):
         df = df.append(pd.read_excel(file), ignore_index=True) 
df.head()

The above code does the following:

  1. Loop through all the files in the current working directory, determine if a file is Excel by checking the file name ends with “.xlsx”.
  2. If yes, read the file content (data), and append/add it to the master dataframe variable called df.
  3. Save the master dataframe into an Excel spreadsheet.

We can examine the master dataframe by checking df.head(), which shows the first 5 rows of the data.

Seems good! Just another quick check to make sure we have loaded everything in the DataFrame. df.shape will show us the dimension (36 rows, 5 columns) of the data:

Everything looks good, so let’s output the data back into Excel. The last line df.to_excel() will do that.

Combine multiple sheets from the same Excel file

I talked about the two techniques to read multiple sheets from the same Excel file, so I won’t repeat it. However, I’ll walk through an example here with a slightly different setting.

We have 2 files each contains a number of sheets. We don’t know how many sheets are in each file, but we know the format is the same for all sheets. Our goal is to aggregate all sheets into one spreadsheet (and one file).

The workflow is similar:

  1. Get all Excel files
  2. Loop through the Excel files
  3. For each file, loop through all sheets
  4. Read each sheet into a dataframe, then combine all dataframes together.
df_total = pd.DataFrame()
for file in files:  # loop through Excel files
    if file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(file)
        sheets = excel_file.sheet_names
        for sheet in sheets: # loop through sheets inside an Excel file
            df = excel_file.parse(sheet_name = sheet)
            df_total = df_total.append(df)
df_total.to_excel('combined_file.xlsx')

Putting it all together

Below is the full code put together. 10 lines of code will help you combine all your Excel files or sheets into one master spreadsheet. Enjoy!

import os
import pandas as pd
cwd = os.path.abspath('') 
files = os.listdir(cwd)  

## Method 1 gets the first sheet of a given file
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file), ignore_index=True) 
df.head() 
df.to_excel('total_sales.xlsx')



## Method 2 gets all sheets of a given file
df_total = pd.DataFrame()
for file in files:                         # loop through Excel files
    if file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(file)
        sheets = excel_file.sheet_names
        for sheet in sheets:               # loop through sheets inside an Excel file
            df = excel_file.parse(sheet_name = sheet)
            df_total = df_total.append(df)
df_total.to_excel('combined_file.xlsx')

25 comments

  1. Hi, is there a way to combine 2 different data frames from 2 different excels/csvs into different sheets of a single workbook?

  2. Hi Jay I followed the step in combining multiple excel files with multiple sheets .I am getting empty combined file .do i need to mention the sheet names ? what i am missing.

    Thanks in advance

  3. Hi Jay, I have an excel file which has sheets formatted differently, I am trying to make it so that these sheets are read and combined into another excel file in which I want the sheet to be continuous flowing together after formatting has been made the same for them….

  4. Hi Jay,

    I have a process as follows

    use 2 separate csv files in a folder and run a power query in a separate excel to filter results by date.

    then next step is to use a combination of macro buttons in an excel to filter the combined data.

    Next the process is to cut and paste separate ranges to three separate excel files.

    Tedious much…. any help here would be greatly appreciated

    1. Hi Rick,

      Thanks for dropping by!

      I was in a similar situation a few years ago, that we had a combination of PQ, VBA, and Excel formulas. The process worked fine but it was tedious and slow (takes 2-3 days to update everything). Then we coded that process in Python and now it takes 30 mins.

      I’m confident that the process you outlined can be automated with Python. But I’ll need more details before I know how to help.

  5. Hi Jay, I have multiple excel files with multiple sheets in it and also I need to have one master sheet with combining only one sheet from each excel file and also need to eliminate few rows in it. how do I do this?
    can you please help me?

    1. Hi Supriya,

      Thanks for stopping by.

      For your problem, you can use the example in the tutorial with some tweaks.
      1. When you read the Excel file into Python, make sure to include a sheet_name argument to specify which sheet you want from the file.
      2. You’ll have to do some filtering to eliminate a few rows, but it really depends on what you need to remove.

      Hope that helps. You can also send me further questions or your code to my email address: pythoninoffice@gmail.com.

  6. Hi Jay,

    I have multiple excel files with multiple sheets in them. I have to map the data from all these excel files into one master excel file based on particular cells.

    For example: Getting data from Cell B5 in sheet1 in all excel workbooks to cell I5 and onwards in the master excel file.
    Then later on getting data from Cell B7 from sheet 2 into Cell H5 and onwards in the master excel file and so on.

    I am stuck with this! Hope you can help!

    1. Hi Sarthak,
      Thanks for your question. For your task I would suggest using a library like openpyxl/xlsxwriter/xlwings.
      In fact, we have a tutorial (still in progress) that touches on a similar task: copying data from excel files and then combining them into one file. Stay tuned and I’ll drop you a message once that’s available!

      Regards,
      Jay

  7. Hello, I have two types of meteorological data, i,.e station and gridded data both are the same format but station data has missing value and there is no missing value in gridded data. So I want to fill these missing value from gridded data by direct replace the value. My question is how to replace the gridded value in to the missing value in python??

    Regards!
    Fetene

    1. Hi Fetene,

      Thanks for dropping by. If you could share an example of the data that will help me form a more concrete answer. There are many ways to fill missing values, and it depends on how you want to fill them. Let me know.

  8. If you have the time and dont mind, (new to python) How would you go about taking multiple files, but only extracting specific columns or depending on the best use scenario, ignoring certain columns? Also would the master file bug out if there are columns not present in one but not the other?

    1. Hi O Magro,

      Of course, I’m always happy to help someone new to Python (I’ve been there myself!)

      We can keep (or drop) certain columns in a dataframe (i.e. data in the Excel file). This tutorial here talks about how to drop columns: https://pythoninoffice.com/drop-column-pandas/

      The master file will not bug out if your individual files contain different columns. Let’s say your file_1 contains a column “a”, and file_2 contains a column “b”. When you combine them, the master file will contain both columns “a” and “b”. However, column “b” will be empty for the data records originally from file_1, and similarly file_2 will contain an empty column “a”.

      Hope that helps.

  9. Hi,
    Real-life data come with inconsistent headers and in large sizes.
    How do you go about it if you create a single pandas data frame, for example, from 150 excel files, with inconsistent headers?
    I know that in this situation, the appending adds up as a new column and creates big dimension.
    This is a real challenge. The concept you raised is great and your presentation is good, too.
    Thanks
    And

    1. Hi Andnet,

      Very good question and indeed data is never consistent in the real world.
      I have used two approaches for the situations you described:
      1. If I know all the possible header names, I’d create a dictionary to map them into a consistent name. For example, we can create mapping = {'unique id': 'id', 'UNIQUE_ID':'id', 'member_id':'id'} to map those three columns “unique id”, “UNIQUE_ID” and “member_id” into name “id”.
      2. If not feasible to know all potential header names. I’d use a fuzzy string match. This involves solving a levenshtein distance (# of steps required to change from one string to another) to find the closet string matches. You can read about the string matching here: https://pythoninoffice.com/how-to-find-similar-strings-using-python
      And how to use string matching in pandas here: https://pythoninoffice.com/use-fuzzy-string-matching-in-pandas

      These 2 approaches don’t always work perfectly, but in most cases they helped me get 90% of the job done.

  10. Hi !

    How can I merge multiple worksheets from different workbook .

    For example :
    wb1 has worksheet : Sheet1 , Sheet2, Sheet3
    wb2 has worksheet : Sheet1, Sheet2, Sheet3
    and so on..

    How can I merge sheet1 from all the work book together and then followed by sheet2 merge them all together into one total workbook.

    Sorry if my question is complicated.

    1. No problem and not complicated at all.
      Assuming all files have the same format and number of sheets, I would do something like this:

      file_list = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']
      sheet_names = pd.ExcelFile('file1.xlsx').sheet_names

      with pd.ExcelWriter('file.xlsx') as writer:
      for s in sheet_names:
      temp_df = pd.DataFrame()
      for f in file_list:
      temp_df = pd.concat([temp_df, pd.read_excel(f, sheet_name = s)])
      temp_df.to_excel(writer, sheet_name = s)

      Note that the pd.append() method is being deprecated and not available in more recent version of pandas, so you should switch to .concat(). I will update this tutorial soon.

  11. Hi there,
    I would like to combine different sheets from different excel files (same format) to one sheet of one excel file. I tried to leverage below code but it only combined 1st sheet in each file, not all the sheets. Kindly advise.

    import pandas as pd
    import glob

    path = “C\doc”

    filenames = glob.glob(path +”\*.xlsx’)

    df_total = pd.DataFrame()

    for file in files:
    if file.endswith(‘.xlsx’):
    excel_file = pd.ExcelFile(file)
    sheets = excel_file.sheet_names
    for sheet in sheets:
    df = excel_file.parse(sheet_name = sheet)
    df_total = df_total.append(df)

    df_total.to_excel(“C\combined_file.xlsx”,index=False)

  12. Hi there,

    I want to combine different sheets from different excel files into one sheet in one file. I tried to leverage below coodes but it can only combined 1st sheet in each file. Kindly advise.

    import pandas as pd
    import glob

    path = “C\Doc”

    filenames = glob.glob(path +”\*.xlsx”)

    df_total = pd.DataFrame()

    for file in files:
    if file.endswith(‘.xlsx’):
    excel_file = pd.ExcelFile(file)
    sheets = excel_file.sheet_names
    for sheet in sheets:
    df = excel_file.parse(sheet_name = sheet)
    df_total = df_total.append(df)
    df_total.to_excel(“C\doc\combined.xlsx”,index=false)

Leave a Reply

Your email address will not be published. Required fields are marked *