Last Updated on July 14, 2022 by Jay
Sometimes we need to perform data binning and pandas provides a convenient method cut for exactly that purpose. Essentially we are putting data into discrete intervals or bands/bins like the below example.
In the following simple dataset, we have a group of 100 people with their ages and net worth in dollar amounts. We want to put these people into different age bands and perform analysis.
import pandas as pd
import numpy as np
age = np.random.randint(0,121, size=100)
net_worth = np.random.randint(100,10000, size=100)
df = pd.DataFrame({'Age':age, 'Net_Worth':net_worth})
df
Age Net_Worth
0 73 2860
1 69 6222
2 21 6004
3 1 1259
4 26 7637
.. ... ...
95 95 9031
96 113 9527
97 118 2823
98 79 604
99 62 7079
[100 rows x 2 columns]
Now let’s create the age bins, aka age bands. We can simply use a list of integer numbers such as the below to construct the bins:
age_band = [0,20,30,40,50,65,70,90]
These integer numbers act as the lower and upper edges for the bands. For example:
- (0, 20]
- (20, 30]
- (30, 40]
- …
There’s currently a problem with this age band, we’ll see in a second.
Pandas Binning With Cut()
Let’s perform a simple cut operation. Note the first argument should be a 1-dimensional array type object. The result is a pandas series containing the age bands for each record like the following:
pd.cut(df['Age'], bins=age_band)
0 (70.0, 90.0]
1 (60.0, 70.0]
2 (20.0, 30.0]
3 (0.0, 20.0]
4 (20.0, 30.0]
...
95 NaN
96 NaN
97 NaN
98 (70.0, 90.0]
99 (60.0, 70.0]
Name: Age, Length: 100, dtype: category
Categories (7, interval[int64, right]): [(0, 20] < (20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 90]]
We can store this age band column into the dataframe to retain the band information for each record. Taking a look at the below results: the first record age is 73; therefore, it’s been placed into the (70,90] age band. We also notice some NaN values at the bottom of the dataset.
df['band'] = pd.cut(df['Age'], bins=age_band)
Age Net_Worth band
0 73 2860 (70.0, 90.0]
1 69 6222 (60.0, 70.0]
2 21 6004 (20.0, 30.0]
3 1 1259 (0.0, 20.0]
4 26 7637 (20.0, 30.0]
.. ... ... ...
95 95 9031 NaN
96 113 9527 NaN
97 118 2823 NaN
98 79 604 (70.0, 90.0]
99 62 7079 (60.0, 70.0]
[100 rows x 3 columns]
Bins Should Cover The Minimum And Maximum Values
The reason for the NaN values is that the bin we created doesn’t cover the maximum value in the dataset. To bin all the records, we need to include a data point for the maximum age of 120. Essentially we are adding another age band 90~120. Binning the data again will create age bands for all records like the below.
age_band = [0,20,30,40,50,65,70,90, 120]
df['band'] = pd.cut(df['Age'], bins=age_band)
Age Net_Worth band
0 73 2860 (70, 90]
1 69 6222 (65, 70]
2 21 6004 (20, 30]
3 1 1259 (0, 20]
4 26 7637 (20, 30]
.. ... ... ...
95 95 9031 (90, 120]
96 113 9527 (90, 120]
97 118 2823 (90, 120]
98 79 604 (70, 90]
99 62 7079 (50, 65]
[100 rows x 3 columns]
Left Or Right Edge Inclusive
By default, the bins are “right inclusive” meaning the bins includes the rightmost edge. For example, the age band for the first data point is (70,90]. Note the different brackets used here. A parenthesis “(” means does not include; while a square bracket “]” means include. So (70,90] means from ages 71 to age 90 (inclusive).
We can control whether to include the left or the right endpoints by using the right argument. The below example becomes left-inclusive as suggested by the square bracket “[” on the left-hand side of the band column.
df['band'] = pd.cut(df['Age'], bins=age_band, right=False)
Age Net_Worth band
0 73 2860 [70, 90)
1 69 6222 [65, 70)
2 21 6004 [20, 30)
3 1 1259 [0, 20)
4 26 7637 [20, 30)
.. ... ... ...
95 95 9031 [90, 120)
96 113 9527 [90, 120)
97 118 2823 [90, 120)
98 79 604 [70, 90)
99 62 7079 [50, 65)
[100 rows x 3 columns]
Add New Bin Labels
We can also change the bin labels instead of using the band/range by passing a list of desired labels to the labels argument. Note that the length of the labels must be equal to the number of bins. For example:
df['band'] = pd.cut(df['Age'], bins=age_band, right=False, labels=list('ABCDEFGH'))
print(df)
Age Net_Worth band
0 73 2860 G
1 69 6222 F
2 21 6004 B
3 1 1259 A
4 26 7637 B
.. ... ... ...
95 95 9031 H
96 113 9527 H
97 118 2823 H
98 79 604 G
99 62 7079 E
[100 rows x 3 columns]
Summarize Data By Bins
Now we have created bins for all the records, we can summarize the data using the groupby method. The below example shows the average net worth of each age group. Note the data was randomly generated so it makes little sense, but the technique remains legitimate.
df['band'] = pd.cut(df['Age'], bins=age_band)
df.groupby(by='band').agg({'Net_Worth':'mean'})
Net_Worth
band
(0, 20] 4615.470588
(20, 30] 5014.142857
(30, 40] 6230.818182
(40, 50] 4785.714286
(50, 65] 4608.000000
(65, 70] 2967.500000
(70, 90] 4267.937500
(90, 120] 4926.241379