Binning Data With Pandas Between Method

Sharing is caring!

Last Updated on July 14, 2022 by Jay

Sometimes we need to perform data binning and the pandas method between() can help us achieve that goal. Data binning refers to the process in which we place data into discrete intervals or bands/bins like the below example.

Pandas Binning Between Method

We’ll create some random samples that present 100 persons’ age and their net worth in monetary amounts. Then we’ll bin the data into different buckets by age.

import pandas as pd
import numpy as np

age = np.random.randint(0,121, size=100)
net_worth = np.random.randint(100,10000, size=100)

df = pd.DataFrame({'Age':age, 'Net_Worth':net_worth})
df
    Age  Net_Worth
0   101       2095
1    15       1331
2   100       3138
3    62        518
4    23       8429
..  ...        ...
95   35       7325
96   37       3105
97   22       1887
98  115       2582
99   62       3217

[100 rows x 2 columns]

The Between() Method

The pandas’ between method checks whether data is between two values, and it has the following syntax:

between(left, right, inclusive=’both’)

  • left: the lower endpoint for a band/range
  • right: the higher endpoint for a band/range
  • inclusive: whether we want to include the left and right endpoints. Can take the values of “both”, “neither”, “left” or “right”

The method returns a boolean index containing a list of True and False values.

Let’s check which record is between the ages of 0 to 20:

df['Age'].between(0, 20, inclusive='right')
0     False
1      True
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Name: Age, Length: 100, dtype: bool

You probably noticed that the between method is essentially equivalent to the following:

(df['Age'] > 0) & (df['Age'] <= 20)

Assigning Bins

Now we can check whether data is within a band with the help of the boolean index. We also need to use the loc method to access/assign values for individual records matching the criteria. Note the NaN values are because we haven’t assigned a band to them yet.

df.loc[df['Age'].between(left=0, right=20, inclusive='right'), 'band'] = '(0, 20]'
    Age  Net_Worth     band
0   101       2095      NaN
1    15       1331  (0, 20]
2   100       3138      NaN
3    62        518      NaN
4    23       8429      NaN
..  ...        ...      ...
95   35       7325      NaN
96   37       3105      NaN
97   22       1887      NaN
98  115       2582      NaN
99   62       3217      NaN

[100 rows x 3 columns]

Basically, the loc property returns only record matching the criteria, i.e. the True values from the boolean index. See the below example which returns only the records that are between 0 and 20 years old.

df.loc[df['Age'].between(left=0, right=20, inclusive='right')]
    Age  Net_Worth     band
1    15       1331  (0, 20]
6     8       4230  (0, 20]
9    17       1794  (0, 20]
20   13       9135  (0, 20]
36    9       5553  (0, 20]
52   13       5538  (0, 20]
55    8       7323  (0, 20]
60    2       8619  (0, 20]
62   17       7349  (0, 20]
73    8       9320  (0, 20]
77   15       4368  (0, 20]
79   16       3203  (0, 20]
83   15       2325  (0, 20]
90   14       7467  (0, 20]
93   20       3457  (0, 20]

Binning Data With Pandas Between Method

Let’s say we want to put all data into the following age bins. Note the parenthesis “()” means not inclusive and square brackets “[]” mean inclusive. so (0, 20] means from ages of 1 to 20.

  • (0, 20]
  • (20,60]
  • (60, 90]
  • (90, 120]
# age intervals
age_band = [0,20,60,90,120]

# create age bins
age_bins = []
for i in range(4):
    age_bins.append([age_band[i],age_band[i+1]])
[[0, 20], [20, 60], [60, 90], [90, 120]]

# binning the data
for b in age_bins:
    df.loc[df['Age'].between(left=b[0], right=b[1], inclusive='right'), 'band'] = f'({b[0]},{b[1]}]'
    Age  Net_Worth      band
0   101       2095  (90,120]
1    15       1331    (0,20]
2   100       3138  (90,120]
3    62        518   (60,90]
4    23       8429   (20,60]
..  ...        ...       ...
95   35       7325   (20,60]
96   37       3105   (20,60]
97   22       1887   (20,60]
98  115       2582  (90,120]
99   62       3217   (60,90]

[100 rows x 3 columns]

Unfortunately, there’s no easy way to bin data using the between and loc methods. Although using a loop is not too bad, this approach can become inefficient when handling a large number of bins because we need to repeat the process N (number of bins) times. An easier approach to bin data is to use the pandas’ cut method. You can find below a link to that tutorial.

Additional Resources

Binning Data With Pandas Cut Method

Filter Pandas Dataframe

Leave a Reply

Your email address will not be published. Required fields are marked *