Filter a pandas dataframe (think Excel filters but more powerful)

Sharing is caring!

Last Updated on July 14, 2022 by Jay

Being able to slice and dice the data is essential for working with data. Similar to the filter in Excel, we can also apply a filter on a pandas dataframe. The only difference is that the filter in Python (pandas) is much more powerful and efficient. I bet you do remember the last time you applied a filter to a 500k-row Excel spreadsheet, which probably took 30 mins of your life. If you are tired of Excel’s poor performance in some aspects, maybe it’s time to consider learning a more efficient tool – Python.

This tutorial is part of the “Integrate Python with Excel” series, you can find the table of content here for easier navigation.

I’m splitting this topic into two sub-posts. Because I think it’s essential to understand how pandas achieve dataframe filtering, and how to use advanced filtering techniques. Without making this post a super long one, it makes more sense to split the material I want to cover.

Prepare a dataframe for demo

Again, we’ll be using the S&P 500 company dataset for this tutorial. First, we’ll fire up pandas and load the data from Wikipedia.

import pandas as pd
df = pd.read_html('')[0]

Filter dataframe by a single criteria

To select companies in the “Information Technology” category from the SP500 list. We can use .loc[] to achieve this. Note on the square brackets instead of the parenthesis () used here. The basic notation goes like this:

df.loc [ column == 'criteria' ]

df_tech = df.loc[df['GICS Sector'] == 'Information Technology']

>>> df_tech
    Symbol                    Security  ...      CIK Founded
4      ACN               Accenture plc  ...  1467373    1989
6     ADBE                  Adobe Inc.  ...   796343    1982
7      AMD  Advanced Micro Devices Inc  ...     2488    1969
13    AKAM     Akamai Technologies Inc  ...  1086222    1998
38     APH               Amphenol Corp  ...   820313    1932
..     ...                         ...  ...      ...     ...
488    WDC             Western Digital  ...   106040    1970
489     WU            Western Union Co  ...  1365135    1851
497    XRX                       Xerox  ...   108772    1906
498   XLNX                      Xilinx  ...   743988    1984
501   ZBRA          Zebra Technologies  ...   877212    1969

[71 rows x 9 columns]

The result is a new dataframe containing 71 companies that belong to the “Information Technology” sector. This dataframe includes all columns from the original dataset. The best part is that we can use this as a standalone table (dataframe) without needing extra steps (e.g., If we do filter in Excel, we need to copy it to another sheet or delete other rows to make it into “one table”).

If you don’t need all the columns in the new dataframe, simply pass the required column names into .loc[]. For example, to select only Symbol, Security, and Category, we can do the following. Note it only returns the 3 columns we have specified.

df_tech_select_columns = df.loc[df['GICS Sector'] == 'Information Technology', ['Symbol','Security', 'GICS Sector']]

>>> df_tech_select
    Symbol                    Security             GICS Sector
4      ACN               Accenture plc  Information Technology
6     ADBE                  Adobe Inc.  Information Technology
7      AMD  Advanced Micro Devices Inc  Information Technology
13    AKAM     Akamai Technologies Inc  Information Technology
38     APH               Amphenol Corp  Information Technology
..     ...                         ...                     ...
488    WDC             Western Digital  Information Technology
489     WU            Western Union Co  Information Technology
497    XRX                       Xerox  Information Technology
498   XLNX                      Xilinx  Information Technology
501   ZBRA          Zebra Technologies  Information Technology

[71 rows x 3 columns]

What’s going on under the hood

It’s important to know what’s really going on under the hood, which will help us understand how to use filtering in pandas.

The criteria we pass into the loc[], df['GICS Sector'] == 'Information Technology' is actually a boolean index, which is a list of True or False values. The above line of code translates to this: for every single row, if the GICS Sector is “Information Technology”, evaluate True, otherwise False.

To better visualize this idea, let me show how this looks like in Excel. I know we can just apply a filter on the GICS Sector column, then choose the desired sector. But that doesn’t really help us understand what’s going on, so bear with me here – I’ll show this from the first principle.

Now take a look at the below Excel screenshot. I’ve created a new column called “IT or not”. I also used a simple IF formula to evaluate whether a row has “Information Technology” as the sector. The formula returns either 1 or 0 based on the sector. Effectively, I’m checking the value for every single row. Once the formula check is done, I can filter on the column “IT or not”, and then select all rows with a value of 1.

An Excel example to show how filters are created in Python

Python utilizes a somewhat similar approach, let’s examine what a boolean index really is.

>>> df['GICS Sector'] == 'Information Technology'

0      False
1      False
2      False
3      False
4       True
500    False
501     True
502    False
503    False
504    False
Name: GICS Sector, Length: 505, dtype: bool

Note the bottom of the above code snippet – Length: 505. The above line of code creates a list those length is the same as the dataframe itself, and it’s filled with True or False. Which is basically what we just did inside Excel. When you pass this boolean index into df.loc[], it will return only the rows where there’s a True value. (i.e. selecting the 1 from the Excel filter) Rows with a value of False will be dropped as a result.

In real life, we often need to filter on multiple criteria, in the next post we’ll take a look at how to do some of the advanced filtering in pandas.

Leave a Reply

Your email address will not be published. Required fields are marked *