Python pandas read Excel files

Sharing is caring!

Last Updated on August 19, 2020 by Jay

Previously, we looked at Excel and Python side by side. Now is the time to dive into the details. The first thing when working with data is loading it into Python, so we’ll start by using Python pandas to read Excel files.

pandas is the de facto standard for data manipulate within the Python programming language. If you work with data in any form using Python, you need pandas. To get pandas, simply do a pip install.

pip install pandas --upgrade

pandas library provides several convenient methods to read from different data sources, including Excel and CSV files. We’ll explore two methods here: pd.read_excel() and pd.read_csv().

pd.read_excel() method and arguments

The read_excel() method contains about two dozens of arguments, most of which are optional. For starters, we’ll only look at a few arguments here.

read_excel(io, sheet_name=0, header=0, names=None, usecols=None)
  • io is usually one of two things: a string representing a file path, or an ExcelFile object, more on this topic in a later post.
  • sheet_name can be either string or integer, this is the sheet you want pandas to read.
  • header usually an integer to tell pandas which row of the sheet to use as the dataframe header.
  • names usually a list of names that you can use as the column header
  • usecols can be integer, string or a list to instruct pandas to only extract certain columns from the Excel file

Let’s look at a few examples using this sample workbook.

By convention, “pd” is short for “pandas”, and “df” is short for “dataframe”.

import pandas as pd
df= pd.read_excel('users.xlsx')

io and sheet_name

pd.read_excel('users.xlsx') is the simplest form, which (by default) will give us the first sheet of the input Excel file, which is the “User_info” sheet.

pf.read_excel('users.xlsx', sheet_name = 'purchase') means we’ll get the 2nd sheet, which is named “purchase”.

pf.read_excel('users.xlsx', sheet_name = [0,2]) will return the first and third sheet of the Excel file. The returned value is a dictionary of dataframes.

header

If for any reason, data on your Excel sheet doesn’t start from row 1, you can use header to tell pandas “hey, the header of this data is at row X”. The fourth sheet in the sample Excel file starts from row 5. Reading that sheet without special instruction, pandas will think our data doesn’t have column names.

df = pd.read_excel('users.xlsx', sheet_name = 3)
Non-standard header – data doesn’t start from row 1

That was not good and the dataframe requires some cleanup. Instead, we can change the code up a little bit by specifying the header argument. Remember Python uses 0-based index, so the 5th row has an index of 4. Much better this time.

df = pd.read_excel('users.xlsx', sheet_name = 3, header = 4)
Specifying the header row in read_excel()

names

If you don’t like the header names in the source Excel file, feel free to create your own by using the names argument.

df = pd.read_excel('users.xlsx', sheet_name = 3, header = 4, names = ['Customer_ID','Customer_Name','Customer_Purchase','Purchase_Date'])
Customize header names

usecols

By specifying usecols, we are limiting the Excel columns to load into Python, which is a good practice if you have a large dataset and you don’t require all the columns. The below example will read only the customer name and purchase columns into Python.

df = pd.read_excel('users.xlsx', sheet_name = 3, header = 4, usecols = "B:C")
Specify the columns we want

pd.read_csv() method and arguments

As the name suggests, this method reads a csv file.

CSV stands for “comma separated value”, so a .csv file is basically a text file with values separated by commas. It means that you can also use this method to read any .txt file into Python.

The argument for read_csv() is similar to read_excel(), so we won’t repeat them here. However, one argument is worth pointing out – sep or delimiter. It’s used to tell pandas what delimiter to use to separate the data. Use this sample text file to see that you can basically use any character as the delimiter.

df = pd.read_csv('question-separated-value.txt',sep='?')
question mark (?) separated values text
We can use “?” as the delimiter!

In the next post, we’ll explore how to read multiple sheets (from the same file) at the same time.

Leave a Reply

Your email address will not be published. Required fields are marked *