Binning Data With Pandas Cut Method

Last Updated on July 14, 2022 by Jay

Sometimes we need to perform data binning and pandas provides a convenient method cut for exactly that purpose. Essentially we are putting data into discrete intervals or bands/bins like the below example.

In the following simple dataset, we have a group of 100 people with their ages and net worth in dollar amounts. We want to put these people into different age bands and perform analysis.

import pandas as pd
import numpy as np

age = np.random.randint(0,121, size=100)
net_worth = np.random.randint(100,10000, size=100)

df = pd.DataFrame({'Age':age, 'Net_Worth':net_worth})
df
    Age  Net_Worth
0    73       2860
1    69       6222
2    21       6004
3     1       1259
4    26       7637
..  ...        ...
95   95       9031
96  113       9527
97  118       2823
98   79        604
99   62       7079

[100 rows x 2 columns]

Now let’s create the age bins, aka age bands. We can simply use a list of integer numbers such as the below to construct the bins:

age_band = [0,20,30,40,50,65,70,90]

These integer numbers act as the lower and upper edges for the bands. For example:

(0, 20]
(20, 30]
(30, 40]
…

There’s currently a problem with this age band, we’ll see in a second.

Pandas Binning With Cut()

Let’s perform a simple cut operation. Note the first argument should be a 1-dimensional array type object. The result is a pandas series containing the age bands for each record like the following:

pd.cut(df['Age'], bins=age_band)

0     (70.0, 90.0]
1     (60.0, 70.0]
2     (20.0, 30.0]
3      (0.0, 20.0]
4     (20.0, 30.0]
          ...     
95             NaN
96             NaN
97             NaN
98    (70.0, 90.0]
99    (60.0, 70.0]
Name: Age, Length: 100, dtype: category
Categories (7, interval[int64, right]): [(0, 20] < (20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 90]]

We can store this age band column into the dataframe to retain the band information for each record. Taking a look at the below results: the first record age is 73; therefore, it’s been placed into the (70,90] age band. We also notice some NaN values at the bottom of the dataset.

df['band'] = pd.cut(df['Age'], bins=age_band)
    Age  Net_Worth          band
0    73       2860  (70.0, 90.0]
1    69       6222  (60.0, 70.0]
2    21       6004  (20.0, 30.0]
3     1       1259   (0.0, 20.0]
4    26       7637  (20.0, 30.0]
..  ...        ...           ...
95   95       9031           NaN
96  113       9527           NaN
97  118       2823           NaN
98   79        604  (70.0, 90.0]
99   62       7079  (60.0, 70.0]

[100 rows x 3 columns]

Bins Should Cover The Minimum And Maximum Values

The reason for the NaN values is that the bin we created doesn’t cover the maximum value in the dataset. To bin all the records, we need to include a data point for the maximum age of 120. Essentially we are adding another age band 90~120. Binning the data again will create age bands for all records like the below.

age_band = [0,20,30,40,50,65,70,90, 120]

df['band'] = pd.cut(df['Age'], bins=age_band)
    Age  Net_Worth       band
0    73       2860   (70, 90]
1    69       6222   (65, 70]
2    21       6004   (20, 30]
3     1       1259    (0, 20]
4    26       7637   (20, 30]
..  ...        ...        ...
95   95       9031  (90, 120]
96  113       9527  (90, 120]
97  118       2823  (90, 120]
98   79        604   (70, 90]
99   62       7079   (50, 65]

[100 rows x 3 columns]

Left Or Right Edge Inclusive

By default, the bins are “right inclusive” meaning the bins includes the rightmost edge. For example, the age band for the first data point is (70,90]. Note the different brackets used here. A parenthesis “(” means does not include; while a square bracket “]” means include. So (70,90] means from ages 71 to age 90 (inclusive).

We can control whether to include the left or the right endpoints by using the right argument. The below example becomes left-inclusive as suggested by the square bracket “[” on the left-hand side of the band column.

df['band'] = pd.cut(df['Age'], bins=age_band, right=False)
    Age  Net_Worth       band
0    73       2860   [70, 90)
1    69       6222   [65, 70)
2    21       6004   [20, 30)
3     1       1259    [0, 20)
4    26       7637   [20, 30)
..  ...        ...        ...
95   95       9031  [90, 120)
96  113       9527  [90, 120)
97  118       2823  [90, 120)
98   79        604   [70, 90)
99   62       7079   [50, 65)

[100 rows x 3 columns]

Add New Bin Labels

We can also change the bin labels instead of using the band/range by passing a list of desired labels to the labels argument. Note that the length of the labels must be equal to the number of bins. For example:

df['band'] = pd.cut(df['Age'], bins=age_band, right=False, labels=list('ABCDEFGH'))
print(df)
    Age  Net_Worth band
0    73       2860    G
1    69       6222    F
2    21       6004    B
3     1       1259    A
4    26       7637    B
..  ...        ...  ...
95   95       9031    H
96  113       9527    H
97  118       2823    H
98   79        604    G
99   62       7079    E

[100 rows x 3 columns]

Summarize Data By Bins

Now we have created bins for all the records, we can summarize the data using the groupby method. The below example shows the average net worth of each age group. Note the data was randomly generated so it makes little sense, but the technique remains legitimate.

df['band'] = pd.cut(df['Age'], bins=age_band)
df.groupby(by='band').agg({'Net_Worth':'mean'})

             Net_Worth
band                  
(0, 20]    4615.470588
(20, 30]   5014.142857
(30, 40]   6230.818182
(40, 50]   4785.714286
(50, 65]   4608.000000
(65, 70]   2967.500000
(70, 90]   4267.937500
(90, 120]  4926.241379

Additional Resources

How To Use Pandas Groupby To Summarize Data