Last Updated on July 14, 2022 by Jay
Being able to slice and dice the data is essential for working with data. Similar to the filter in Excel, we can also apply a filter on a pandas dataframe. The only difference is that the filter in Python (pandas) is much more powerful and efficient. I bet you do remember the last time you applied a filter to a 500k-row Excel spreadsheet, which probably took 30 mins of your life. If you are tired of Excel’s poor performance in some aspects, maybe it’s time to consider learning a more efficient tool – Python.
This tutorial is part of the “Integrate Python with Excel” series, you can find the table of content here for easier navigation.
I’m splitting this topic into two sub-posts. Because I think it’s essential to understand how pandas achieve dataframe filtering, and how to use advanced filtering techniques. Without making this post a super long one, it makes more sense to split the material I want to cover.
Prepare a dataframe for demo
Again, we’ll be using the S&P 500 company dataset for this tutorial. First, we’ll fire up pandas and load the data from Wikipedia.
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
Filter dataframe by a single criteria
To select companies in the “Information Technology” category from the SP500 list. We can use .loc[]
to achieve this. Note on the square brackets instead of the parenthesis () used here. The basic notation goes like this:
df.loc [ column == 'criteria' ]
df_tech = df.loc[df['GICS Sector'] == 'Information Technology']
>>> df_tech
Symbol Security ... CIK Founded
4 ACN Accenture plc ... 1467373 1989
6 ADBE Adobe Inc. ... 796343 1982
7 AMD Advanced Micro Devices Inc ... 2488 1969
13 AKAM Akamai Technologies Inc ... 1086222 1998
38 APH Amphenol Corp ... 820313 1932
.. ... ... ... ... ...
488 WDC Western Digital ... 106040 1970
489 WU Western Union Co ... 1365135 1851
497 XRX Xerox ... 108772 1906
498 XLNX Xilinx ... 743988 1984
501 ZBRA Zebra Technologies ... 877212 1969
[71 rows x 9 columns]
The result is a new dataframe containing 71 companies that belong to the “Information Technology” sector. This dataframe includes all columns from the original dataset. The best part is that we can use this as a standalone table (dataframe) without needing extra steps (e.g., If we do filter in Excel, we need to copy it to another sheet or delete other rows to make it into “one table”).
If you don’t need all the columns in the new dataframe, simply pass the required column names into .loc[]
. For example, to select only Symbol, Security, and Category, we can do the following. Note it only returns the 3 columns we have specified.
df_tech_select_columns = df.loc[df['GICS Sector'] == 'Information Technology', ['Symbol','Security', 'GICS Sector']]
>>> df_tech_select
Symbol Security GICS Sector
4 ACN Accenture plc Information Technology
6 ADBE Adobe Inc. Information Technology
7 AMD Advanced Micro Devices Inc Information Technology
13 AKAM Akamai Technologies Inc Information Technology
38 APH Amphenol Corp Information Technology
.. ... ... ...
488 WDC Western Digital Information Technology
489 WU Western Union Co Information Technology
497 XRX Xerox Information Technology
498 XLNX Xilinx Information Technology
501 ZBRA Zebra Technologies Information Technology
[71 rows x 3 columns]
What’s going on under the hood
It’s important to know what’s really going on under the hood, which will help us understand how to use filtering in pandas.
The criteria we pass into the loc[]
, df['GICS Sector'] == 'Information Technology'
is actually a boolean index, which is a list of True or False values. The above line of code translates to this: for every single row, if the GICS Sector is “Information Technology”, evaluate True, otherwise False.
To better visualize this idea, let me show how this looks like in Excel. I know we can just apply a filter on the GICS Sector column, then choose the desired sector. But that doesn’t really help us understand what’s going on, so bear with me here – I’ll show this from the first principle.
Now take a look at the below Excel screenshot. I’ve created a new column called “IT or not”. I also used a simple IF formula to evaluate whether a row has “Information Technology” as the sector. The formula returns either 1 or 0 based on the sector. Effectively, I’m checking the value for every single row. Once the formula check is done, I can filter on the column “IT or not”, and then select all rows with a value of 1.
Python utilizes a somewhat similar approach, let’s examine what a boolean index really is.
>>> df['GICS Sector'] == 'Information Technology'
0 False
1 False
2 False
3 False
4 True
...
500 False
501 True
502 False
503 False
504 False
Name: GICS Sector, Length: 505, dtype: bool
Note the bottom of the above code snippet – Length: 505
. The above line of code creates a list those length is the same as the dataframe itself, and it’s filled with True or False. Which is basically what we just did inside Excel. When you pass this boolean index into df.loc[]
, it will return only the rows where there’s a True value. (i.e. selecting the 1 from the Excel filter) Rows with a value of False will be dropped as a result.
In real life, we often need to filter on multiple criteria, in the next post we’ll take a look at how to do some of the advanced filtering in pandas.