Last Updated on March 1, 2022 by Jay
There are a few ways to filter a pandas dataframe by date. Some methods are easier while others are more flexible. Let’s take a look at them.
Sample Dataset
We’ll use a hypothetical stock price dataframe that contains the date, and stock open, high, low, close prices for individual dates.
df = pd.DataFrame({'date':['2021-12-01', '2022-01-01','2022-02-01','2022-03-01','2022-04-01','2022-05-01'],
'open':[100, 101, 102, 103, 104, 105],
'high':[1000, 1001, 1002, 1003, 1004, 1005],
'low':[0, 1,2,3,4,5],
'close':[200, 201,202,203,204,205]})
date open high low close
0 2021-12-01 100 1000 0 200
1 2022-01-01 101 1001 1 201
2 2022-02-01 102 1002 2 202
3 2022-03-01 103 1003 3 203
4 2022-04-01 104 1004 4 204
5 2022-05-01 105 1005 5 205
Note we used text values when creating the date column. Therefore it’s a String data type, we can check it to confirm. A dtype of ‘O’ means Object or simply String value.
df['date'].dtype
dtype('O')
Method 1: Filter dataframe by date string value
I find this method funny while convenient. You can use logical comparison (greater than, less than, etc) with string values. The below filters the dataframe by selecting dates after ‘2022-03-01’.
# Days after (not including) 20222-03-01
df[df['date'] > '2022-03-01']
date open high low close
4 2022-04-01 104 1004 4 204
5 2022-05-01 105 1005 5 205
We can also filter the dataframe by providing a date range using the bitwise and & operator. Note that we have to use parathesis around the two conditions, or it will throw an error. The below filters the dataframe by selecting dates between 2022-02-01 and 2022-04-01 inclusive.
# Days between 2022-02-01 and 2022-04-01 inclusive
df[(df['date'] >= '2022-02-01') & (df['date'] <= '2022-04-01')]
date open high low close
2 2022-02-01 102 1002 2 202
3 2022-03-01 103 1003 3 203
4 2022-04-01 104 1004 4 204
Using partial dates
Warning – using partial dates for comparison is somewhat confusing. I do not recommend this approach until you know the caveats.
To filter all dates after the year 2021, we can do so by just passing in the string value ‘2022’ into the comparison. Note, > actually means >= greater than and equal to in this case… Therefore, only the first record which is in 2021 got dropped.
df[df['date'] > '2022']
date open high low close
1 2022-01-01 101 1001 1 201
2 2022-02-01 102 1002 2 202
3 2022-03-01 103 1003 3 203
4 2022-04-01 104 1004 4 204
5 2022-05-01 105 1005 5 205
df[df['date'] >= '2022']
date open high low close
1 2022-01-01 101 1001 1 201
2 2022-02-01 102 1002 2 202
3 2022-03-01 103 1003 3 203
4 2022-04-01 104 1004 4 204
5 2022-05-01 105 1005 5 205
We can take this further by adding the month into the string value. However, this also can cause confusion sometimes. Note the subtle difference in the below code and results.
I would say the first one is correct, because 2022-02-01 is not > 2022-02-01 technically, so it shouldn’t be included in the selection.
#correct selection
df[df['date'] > '2022-02-01']
date open high low close
3 2022-03-01 103 1003 3 203
4 2022-04-01 104 1004 4 204
5 2022-05-01 105 1005 5 205
#incorrect selection
df[df['date'] > '2022-02']
date open high low close
2 2022-02-01 102 1002 2 202
3 2022-03-01 103 1003 3 203
4 2022-04-01 104 1004 4 204
5 2022-05-01 105 1005 5 205
Method 2: Filter dataframe by datetime
While filtering dataframe by “string” date value is convenient, it lacks flexibility in many cases. To gain full control of the date data, we should consider converting the column into a datetime date type first. We are going to overwrite the original “string” date with the datetime values by assigning the new data back to the date column.
df['date'] = pd.to_datetime(df['date'])
print(df['date'].dtype)
datetime64[ns]
Now our date is a datetime, and we can access each of the year, month, day easily by using the .dt accessor. For example, we can access the month of each date using dt.month. This way, all dates with month > 2 (Feb) are selected.
Similarly, we can access the year and days by calling dt.year and dt.day, respectively.
In addition to filtering with .dt accessors, we can still use string values. They work the same as before:
df[(df['date'] >= '2022-02-01') & (df['date'] <= '2022-04-01')]
date open high low close
2 2022-02-01 102 1002 2 202
3 2022-03-01 103 1003 3 203
4 2022-04-01 104 1004 4 204
One comment