Last Updated on July 14, 2022 by Jay
I will show you how to split an Excel file into multiple files using Python in this short tutorial. Splitting an Excel file is a common task, and pretty easy to do by hand. However, if the file contains lots of data and many categories, this task will become repetitive, which means we need a solution to automate it!
Library
We need two libraries for this exercise, pandas and openpyxl. Type the following into the command prompt to install them:
pip install pandas openpyxl
The pandas library is used for working with data (filtering to be specific in our exercise), and openpyxl is used to create new Excel files.
Sample Excel File
For convenience, you can download the sample data from this github repo:
Or just read it directly into a pandas dataframe:
df = pd.read_csv(https://raw.githubusercontent.com/pythoninoffice/blog_example_code/main/split_excel_file/master_file.csv)
df.head()
order_id sales_amount sale_product country
0 1 756 RAM Canada
1 2 760 Harddrive Australia
2 3 763 Laptop Australia
3 4 707 Monitor Australia
4 5 577 Harddrive Australia
The dataset contains sales information for an imaginary computer shop – the sales amount, the product, and the country. Our task is to split the data into different files based on the sale_product column.
The underlying mechanism is simple:
- First, we read the data into Python/pandas.
- Second, apply a filter to group data into different categories.
- Last but not least, save the groups of data into different Excel files.
Filter Data
It’s easy to filter data in a pandas dataframe. There are several ways but we’ll use the most straightforward one.
Let’s say we want to filter on data by selecting all Laptop sales.
df.loc[df['sale_product'] == 'Laptop']
order_id sales_amount sale_product country
2 3 763 Laptop Australia
7 8 881 Laptop UK
9 10 738 Laptop USA
10 11 765 Laptop Australia
13 14 974 Laptop Canada
16 17 956 Laptop Australia
26 27 984 Laptop Australia
28 29 608 Laptop UK
Finding Categories
Next, we need to extract the categories from the data. They are basically values inside the sale_product column. We can simply return all the unique values from that column.
df['sale_product'].unique()
array(['RAM', 'Harddrive', 'Laptop', 'Monitor', 'GPU'], dtype=object)
Split An Excel Into Multiple Tabs
As shown above, the unique values from sale_product column are inside an array, which means we can loop through it to retrieve each value, such as “Laptop”, “Harddrive”, etc. Then, we can use those values as the filter condition to split the dataset. Finally, we can save each dataset into a separate tab within the same Excel file.
#creat new sheets
with pd.ExcelWriter('new_file.xlsx') as new_file:
for p in df['sale_product'].unique():
df.loc[df['sale_product'] == p].to_excel(new_file, sheet_name = p, index=False)
Split An Exel Into Multiple File Using Python
In case we need to split the data into different Excel files (instead of tabs). We can modify the above code a little bit, and just output data from each category into its own files.
#create new files
for p in df['sale_product'].unique():
df.loc[df['sale_product'] == p].to_excel(f'{p}_sales.xlsx', index=False)
Additional Resources
Filter a pandas dataframe (think Excel filters but more powerful)
How to Filter Pandas Dataframe by Date
Filter Dataframe With Pandas Query Method – With Examples