Last Updated on March 4, 2022 by Jay
When doing queries we often need to filter a pandas dataframe by a list of values instead of a single value.
Sample Dataset
import numpy as np
import pandas as pd
df = pd.DataFrame({'date':pd.date_range(start='2021-12-01', periods=10, freq='MS'),
'country':['USA','India','Germany','France','Canada','Netherland',
'UK','Singapore', 'Australia', 'Canada'],
'a': np.random.randint(10, size=10),
'b': np.random.randint(10, size=10)})
df
date country a b
0 2021-12-01 USA 8 7
1 2022-01-01 India 7 7
2 2022-02-01 Germany 7 4
3 2022-03-01 France 2 6
4 2022-04-01 Canada 3 9
5 2022-05-01 Netherland 3 4
6 2022-06-01 UK 2 7
7 2022-07-01 Singapore 3 9
8 2022-08-01 Australia 7 6
9 2022-09-01 Canada 1 4
Filter By Using A Boolean Index
A boolean index is essentially a list of True and False values. This method gives the most flexibility and control.
Let’s filter data to have records with country = Canada or USA, note we need to use the bitwise OR — “|” between the two criteria.
To help understand what a boolean index is, see the below example, which returns a list/Series of True or False values.
(df['country'] == 'Canada') | (df['country'] == 'USA')
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
Name: country, dtype: bool
Then, we pass this Series of True and False values into the pandas dataframe to filter:
df[(df['country'] == 'Canada') | (df['country'] == 'USA')]
date country a b
0 2021-12-01 USA 8 7
4 2022-04-01 Canada 3 9
9 2022-09-01 Canada 1 4
However, when we have a long list, the above method becomes awkward to use – we need to write the criteria many times to include all items in the list. This is where the isin() method becomes handy.
Filter By Using Pandas isin() Method On A List
In Python we can check if an item is in a list by using the in keyword:
'Canada' in ['Canada', 'USA', 'India']
True
However, this doesn’t work in pandas. Fortunately, there’s the isin() method. We just need to pass in the list of values we want to filter by:
df[df['country'].isin(['Canada', 'USA', 'India'])]
date country a b
0 2021-12-01 USA 8 7
1 2022-01-01 India 7 7
4 2022-04-01 Canada 3 9
9 2022-09-01 Canada 1 4
The isin() method works on numbers too:
df[df['a'].isin([2,3,4,5])]
date country a b
3 2022-03-01 France 2 6
4 2022-04-01 Canada 3 9
5 2022-05-01 Netherland 3 4
6 2022-06-01 UK 2 7
7 2022-07-01 Singapore 3 9
Filter By Using Pandas query() Method
The pandas query() method takes a String expression as the filter criteria. For example in the case of a single value:
df.query("country == 'Canada'")
date country a b
4 2022-04-01 Canada 3 9
9 2022-09-01 Canada 1 4
For multiple values, we can either the normal logical operators or the bitwise operators. The following examples return the same result:
df.query("country == 'Canada' | country == 'USA'")
df.query("country == 'Canada' or country == 'USA'")
date country a b
0 2021-12-01 USA 8 7
4 2022-04-01 Canada 3 9
9 2022-09-01 Canada 1 4
Finally, we can’t use the isin as a part of the String expression. We have to use the keyword in:
df.query("country in ['Canada','USA','India']")
date country a b
0 2021-12-01 USA 8 7
1 2022-01-01 India 7 7
4 2022-04-01 Canada 3 9
9 2022-09-01 Canada 1 4
df.query("a in [2,3,4,5]")
date country a b
3 2022-03-01 France 2 6
4 2022-04-01 Canada 3 9
5 2022-05-01 Netherland 3 4
6 2022-06-01 UK 2 7
7 2022-07-01 Singapore 3 9
Additional Resources
9 Examples on How to Filter Dataframe with Pandas Query() Method
How to Filter Pandas Dataframe Using Boolean Index