Last Updated on August 19, 2020 by Jay
Previously, we looked at Excel and Python side by side. Now is the time to dive into the details. The first thing when working with data is loading it into Python, so we’ll start by using Python pandas to read Excel files.
pandas
is the de facto standard for data manipulate within the Python programming language. If you work with data in any form using Python, you need pandas
. To get pandas, simply do a pip install.
pip install pandas --upgrade
pandas
library provides several convenient methods to read from different data sources, including Excel and CSV files. We’ll explore two methods here: pd.read_excel()
and pd.read_csv()
.
pd.read_excel() method and arguments
The read_excel() method contains about two dozens of arguments, most of which are optional. For starters, we’ll only look at a few arguments here.
read_excel(io, sheet_name=0, header=0, names=None, usecols=None)
io
is usually one of two things: a string representing a file path, or an ExcelFile object, more on this topic in a later post.sheet_name
can be either string or integer, this is the sheet you want pandas to read.header
usually an integer to tell pandas which row of the sheet to use as the dataframe header.names
usually a list of names that you can use as the column header- usecols can be integer, string or a list to instruct pandas to only extract certain columns from the Excel file
Let’s look at a few examples using this sample workbook.
By convention, “pd” is short for “pandas”, and “df” is short for “dataframe”.
import pandas as pd
df= pd.read_excel('users.xlsx')
io and sheet_name
pd.read_excel('users.xlsx')
is the simplest form, which (by default) will give us the first sheet of the input Excel file, which is the “User_info” sheet.
pf.read_excel('users.xlsx', sheet_name = 'purchase')
means we’ll get the 2nd sheet, which is named “purchase”.
pf.read_excel('users.xlsx', sheet_name = [0,2])
will return the first and third sheet of the Excel file. The returned value is a dictionary of dataframes.
header
If for any reason, data on your Excel sheet doesn’t start from row 1, you can use header to tell pandas “hey, the header of this data is at row X”. The fourth sheet in the sample Excel file starts from row 5. Reading that sheet without special instruction, pandas will think our data doesn’t have column names.
df = pd.read_excel('users.xlsx', sheet_name = 3)
That was not good and the dataframe requires some cleanup. Instead, we can change the code up a little bit by specifying the header
argument. Remember Python uses 0-based index, so the 5th row has an index of 4. Much better this time.
df = pd.read_excel('users.xlsx', sheet_name = 3, header = 4)
names
If you don’t like the header names in the source Excel file, feel free to create your own by using the names
argument.
df = pd.read_excel('users.xlsx', sheet_name = 3, header = 4, names = ['Customer_ID','Customer_Name','Customer_Purchase','Purchase_Date'])
usecols
By specifying usecols
, we are limiting the Excel columns to load into Python, which is a good practice if you have a large dataset and you don’t require all the columns. The below example will read only the customer name and purchase columns into Python.
df = pd.read_excel('users.xlsx', sheet_name = 3, header = 4, usecols = "B:C")
pd.read_csv() method and arguments
As the name suggests, this method reads a csv file.
CSV stands for “comma separated value”, so a .csv file is basically a text file with values separated by commas. It means that you can also use this method to read any .txt file into Python.
The argument for read_csv()
is similar to read_excel()
, so we won’t repeat them here. However, one argument is worth pointing out – sep
or delimiter
. It’s used to tell pandas what delimiter to use to separate the data. Use this sample text file to see that you can basically use any character as the delimiter.
df = pd.read_csv('question-separated-value.txt',sep='?')
In the next post, we’ll explore how to read multiple sheets (from the same file) at the same time.