Last Updated on July 14, 2022 by Jay
Sometimes we need to perform data binning and the pandas method between() can help us achieve that goal. Data binning refers to the process in which we place data into discrete intervals or bands/bins like the below example.
We’ll create some random samples that present 100 persons’ age and their net worth in monetary amounts. Then we’ll bin the data into different buckets by age.
import pandas as pd
import numpy as np
age = np.random.randint(0,121, size=100)
net_worth = np.random.randint(100,10000, size=100)
df = pd.DataFrame({'Age':age, 'Net_Worth':net_worth})
df
Age Net_Worth
0 101 2095
1 15 1331
2 100 3138
3 62 518
4 23 8429
.. ... ...
95 35 7325
96 37 3105
97 22 1887
98 115 2582
99 62 3217
[100 rows x 2 columns]
The Between() Method
The pandas’ between method checks whether data is between two values, and it has the following syntax:
between(left, right, inclusive=’both’)
- left: the lower endpoint for a band/range
- right: the higher endpoint for a band/range
- inclusive: whether we want to include the left and right endpoints. Can take the values of “both”, “neither”, “left” or “right”
The method returns a boolean index containing a list of True and False values.
Let’s check which record is between the ages of 0 to 20:
df['Age'].between(0, 20, inclusive='right')
0 False
1 True
2 False
3 False
4 False
...
95 False
96 False
97 False
98 False
99 False
Name: Age, Length: 100, dtype: bool
You probably noticed that the between method is essentially equivalent to the following:
(df['Age'] > 0) & (df['Age'] <= 20)
Assigning Bins
Now we can check whether data is within a band with the help of the boolean index. We also need to use the loc method to access/assign values for individual records matching the criteria. Note the NaN values are because we haven’t assigned a band to them yet.
df.loc[df['Age'].between(left=0, right=20, inclusive='right'), 'band'] = '(0, 20]'
Age Net_Worth band
0 101 2095 NaN
1 15 1331 (0, 20]
2 100 3138 NaN
3 62 518 NaN
4 23 8429 NaN
.. ... ... ...
95 35 7325 NaN
96 37 3105 NaN
97 22 1887 NaN
98 115 2582 NaN
99 62 3217 NaN
[100 rows x 3 columns]
Basically, the loc property returns only record matching the criteria, i.e. the True values from the boolean index. See the below example which returns only the records that are between 0 and 20 years old.
df.loc[df['Age'].between(left=0, right=20, inclusive='right')]
Age Net_Worth band
1 15 1331 (0, 20]
6 8 4230 (0, 20]
9 17 1794 (0, 20]
20 13 9135 (0, 20]
36 9 5553 (0, 20]
52 13 5538 (0, 20]
55 8 7323 (0, 20]
60 2 8619 (0, 20]
62 17 7349 (0, 20]
73 8 9320 (0, 20]
77 15 4368 (0, 20]
79 16 3203 (0, 20]
83 15 2325 (0, 20]
90 14 7467 (0, 20]
93 20 3457 (0, 20]
Binning Data With Pandas Between Method
Let’s say we want to put all data into the following age bins. Note the parenthesis “()” means not inclusive and square brackets “[]” mean inclusive. so (0, 20] means from ages of 1 to 20.
- (0, 20]
- (20,60]
- (60, 90]
- (90, 120]
# age intervals
age_band = [0,20,60,90,120]
# create age bins
age_bins = []
for i in range(4):
age_bins.append([age_band[i],age_band[i+1]])
[[0, 20], [20, 60], [60, 90], [90, 120]]
# binning the data
for b in age_bins:
df.loc[df['Age'].between(left=b[0], right=b[1], inclusive='right'), 'band'] = f'({b[0]},{b[1]}]'
Age Net_Worth band
0 101 2095 (90,120]
1 15 1331 (0,20]
2 100 3138 (90,120]
3 62 518 (60,90]
4 23 8429 (20,60]
.. ... ... ...
95 35 7325 (20,60]
96 37 3105 (20,60]
97 22 1887 (20,60]
98 115 2582 (90,120]
99 62 3217 (60,90]
[100 rows x 3 columns]
Unfortunately, there’s no easy way to bin data using the between and loc methods. Although using a loop is not too bad, this approach can become inefficient when handling a large number of bins because we need to repeat the process N (number of bins) times. An easier approach to bin data is to use the pandas’ cut method. You can find below a link to that tutorial.