Last Updated on July 14, 2022 by Jay
We discussed how to read data from a single Excel file. Next we’ll learn how to read multiple Excel files into Python using the pandas library.
My personal approach are the following two ways, and depending on the situation I prefer one way over the other.
Method 1: Get Files From Folder – PowerQuery style
Excel PowerQuery has a feature “Get Data From Folder” that allows us load all files from a specific folder. We can do this easily in Python. The workflow goes like this:
- Given a folder, find all files within it.
- Narrow down the file selection, which files do I need to load?
- Load data from the selected files, one by one.
To achieve the above workflow, we’ll need
os library provides ways to interact with your computer’s operating system, such as finding out what files exist in a folder.
os.listdir() returns a list of all file names (string) within a specific folder. Once we have the list of file names, we can iterate through them and load data into Python.
import os import pandas as pd folder = r'C:\Users\JZ\Desktop\PythonInOffice\python_excel_series_read_multiple_excel_files' files = os.listdir(folder) for file in files: if file.endswith('.xlsx'): df = pd.read_excel(os.path.join(folder,file)) >>> files ['data.pdf', 'File_1.xlsx', 'File_2.xlsx', 'File_3.xlsx', 'header-background.PNG', 'multiple files.py']
Our working folder contains various file types (PDf, Excel, Image, and Python files). But the
file.endswith('.xlsx') makes sure that we read only the Excel files into Python.
os.path.join() provides an efficient way to create file path. This should always be used where possible, instead of
folder + "\" + file.
Method 2: Using an Excel input file
The second method requires us to have a separate Excel file acts as an “input file”. It contains links to individual files that we intend to read into Python. To replicate the example we just walked through, we need to create an Excel file looks like the below, essentially just a column with links to other files.
I like this method a lot, because:
- I can organize and store information (file names, links, etc) in an environment (spreadsheet) I’m familiar with.
- If I need to update or add new files to be read, I just need to update the input file. No coding change is required.
The workflow is similar to the previous method. First we need to let Python know the file paths, which can be obtained from the input file.
df_files = pd.read_excel('Excel_input.xlsx') >>> df_files File path 0 C:\Users\JZ\Desktop\PythonInOffice\python_exce... 1 C:\Users\JZ\Desktop\PythonInOffice\python_exce... 2 C:\Users\JZ\Desktop\PythonInOffice\python_exce... >>>
This is basically a simple dataframe with only one column, that contains the file links. Now we can iterate through the list and read Excel files.
for file in df_files['File path']: df = pd.read_excel(file)
When to use Get Files From Folder vs Excel Input File
I ask two simple questions when determining which method to use.
- Does the source folder contain extra files that I don’t need?
- For example, if a folder contains 20 csv files, and I need only 10 of them. It’s probably easier to use the Excel Input File method. Editing an Excel Input file is much easier and faster than writing code to handle different scenarios in Python.
- However, if the folder contains 50 files, of which 20 are csv, and I need them all. Then I’ll use the Get File From Folder method, because we can easily select all the .csv files from the list of files.
- Do all the files live inside the same folder?
- If files are in different folders, it makes more sense to use an Excel Input File to store the file paths.
How can I see the data frame of all the files loaded at once?
Thanks for your question. This tutorial shows how to iterate through each file and load data into Python. Are you trying to combine all the Excel files into one spreadsheet using Python? If that’s the case, you can check out this tutorial here that talks about it: