Last Updated on July 14, 2022 by Jay
In Python, the pandas groupby() function provides a convenient way to summarize data in any way we want. The function actually does more than just summarize data. We’ll walk through a real-life example of how to use the function, then take a deeper dive into what’s actually happening behind the scenes – the so-called “split-apply-combine” process.
A simple tool to track your credit card spending
Almost everyone has a credit card nowadays. It’s so convenient to use – with one quick tap or swipe, the transaction is done. However, at the end of each payment period, do you ever wonder “where the heck did I spend all this money?” The goal is to gain insights from our credit card transaction data. Hopefully after this exercise we’ll learn something about our own spending habits, and maybe we can form an action plan to help improve our personal finance situation.
I’ve attached a mocked-up credit card statement here if you want to download it and follow along with the tutorial.
Let’s examine what data is available. First of all, we’ll load the file into the Python environment. Note that in the read_csv line I’ve included a parse_dates argument to indicate that the “Transaction Date” column contains datetime-type data. This will make later processing easier.
import pandas as pd
df = pd.read_csv('cc_statement.csv', parse_dates=['Transaction Date'])
If we don’t specify this parse_dates argument, pandas will likely read the column as text data. Another way to convert text into the datetime type is to use the following:
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])
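As a quick sketch of what this conversion does (using a tiny made-up frame rather than the actual statement file), we can check the column’s dtype before and after:

```python
import pandas as pd

# Tiny made-up example, not the actual statement data
df = pd.DataFrame({'Transaction Date': ['2020-01-02', '2020-12-30']})

# Before conversion, the column is plain text (object dtype)
assert df['Transaction Date'].dtype == object

# Convert, then confirm we now have a datetime64 column
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])
print(df['Transaction Date'].dtype)  # datetime64[ns]
```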
The following quick checks show that there are 533 rows / transactions. We have transaction dates, purchase description, purchase category and amount (debit means cash outflow/our spending, credit means cash inflow/credit card payment).
>>> df.shape
(533, 7)
>>> df.columns
Index(['Transaction Date', 'Posted Date', 'Card No.', 'Description',
'Category', 'Debit', 'Credit'],
dtype='object')
Using the describe() function on the “Transaction Date” column also reveals that we are working with 2020 full-year data (min = 2020-01-02, and max = 2020-12-30). The datetime_is_numeric argument helps pandas understand that we are working with datetime-type data.
>>> df['Transaction Date'].describe(datetime_is_numeric=True)
count 533
mean 2020-07-20 08:36:01.350844160
min 2020-01-02 00:00:00
25% 2020-03-21 00:00:00
50% 2020-08-17 00:00:00
75% 2020-10-23 00:00:00
max 2020-12-30 00:00:00
Name: Transaction Date, dtype: object
Adding more information to our data
We will go ahead and add two more columns for the day and month of each transaction. Because we have already specified that the “Transaction Date” column is a datetime-like object, we can access these properties via the .dt accessor, which allows for vectorized operations, i.e. the appropriate way to work with data in pandas.
df['Day'] = df['Transaction Date'].dt.day_name()
df['Month'] = df['Transaction Date'].dt.month
day_name() returns the name of the day, e.g. Monday to Sunday, whereas month returns a numeric value for the given month (1–12).
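A minimal sketch of the .dt accessor in action, on a couple of hypothetical dates rather than the statement data:

```python
import pandas as pd

# Two made-up dates, just to illustrate the .dt accessor
dates = pd.to_datetime(pd.Series(['2020-01-03', '2020-07-14']))

print(dates.dt.day_name())  # Friday, Tuesday
print(dates.dt.month)       # 1, 7
```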
Summarizing data with pandas groupby
Unorganized transaction data doesn’t provide much value, but when we organize and summarize them in a meaningful way, we can gain more insights on our spending habits. Let’s take a look at the following examples.
In the below example, we first group our data by day of the week, then we specify the column we want to look at – “Debit” – and finally we perform an operation – count or sum – on the “Debit” column of the grouped data. The following summary tells us that we shop the most (in terms of number of transactions) on Fridays, whereas we spend the most (in terms of dollars) on Sundays.
daily_spend_count = df.groupby('Day')['Debit'].count()
daily_spend_sum = df.groupby('Day')['Debit'].sum()
>>> daily_spend_count
Day
Friday 90
Monday 81
Saturday 62
Sunday 71
Thursday 79
Tuesday 62
Wednesday 55
Name: Debit, dtype: int64
>>> daily_spend_sum
Day
Friday 4879.35
Monday 4767.77
Saturday 4308.41
Sunday 6338.73
Thursday 5837.16
Tuesday 3781.82
Wednesday 6151.71
Name: Debit, dtype: float64
We can actually combine the above two lines of code into one by using the .agg() method of the groupby object. We simply pass a dictionary into agg(). The dictionary keys are the data columns we want to work on, and the dictionary values (either a single value or a list) are the operations we want to perform.
daily_spend = df.groupby('Day').agg({'Debit':['sum','count']})
>>> daily_spend
Debit
sum count
Day
Friday 4879.35 90
Monday 4767.77 81
Saturday 4308.41 62
Sunday 6338.73 71
Thursday 5837.16 79
Tuesday 3781.82 62
Wednesday 6151.71 55
With the dictionary approach, we can’t change the resulting column names without a follow-up rename() call. To set the column names directly within the agg() method, we use named aggregation, where:
- The keywords are the new column names
- The values are pd.NamedAgg namedtuples, with the first argument specifying the column and the second specifying the operation
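A minimal sketch of that pattern, using made-up numbers standing in for the statement data:

```python
import pandas as pd

# Toy data standing in for the credit card statement
df = pd.DataFrame({'Category': ['Dining', 'Dining', 'Insurance'],
                   'Debit': [10.0, 20.0, 238.3]})

# Named aggregation: keyword = new column name, value = pd.NamedAgg
spend = df.groupby('Category').agg(
    Total_spend=pd.NamedAgg(column='Debit', aggfunc='sum'),
    Avg_spend=pd.NamedAgg(column='Debit', aggfunc='mean'),
)
print(spend)
```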
Since pd.NamedAgg is a namedtuple, regular tuples are allowed as well, so we can simplify the above even further:
spend_category = df.groupby('Category').agg(Total_spend=('Debit','sum'),
Avg_spend=('Debit','mean'))
>>> spend_category
Total_spend Avg_spend
Category
Dining 1212.05 28.858333
Entertainment 369.03 92.257500
Fee/Interest Charge 372.10 31.008333
Gas/Automotive 1218.81 32.940811
Health Care 3198.70 71.082222
Insurance 238.30 238.300000
Internet 2999.96 39.999467
Merchandise 24126.63 102.666511
Other 50.00 50.000000
Other Services 346.42 38.491111
Other Travel 1237.88 38.683750
Payment/Credit 0.00 NaN
Phone/Cable 615.99 102.665000
Professional Services 79.08 79.080000
Grouping by multiple columns
Remember that our goal is to hopefully gain some insights from our spending data and try to improve our personal finance situation. The “Fee/Interest Charge” category looks suspicious, and we also want to see if we can reduce some of the “Entertainment” expenses, so let’s break those expenses down into monthly figures.
We’ll select only “Entertainment” and “Fee/Interest Charge” from the category and examine the new dataset. If you need help understanding how to select a subset of data in pandas, read this tutorial here.
smaller_df = df.loc[(df['Category']=='Fee/Interest Charge') | (df['Category']=='Entertainment')]
monthly_by_category = smaller_df.groupby(['Category','Month'])['Debit'].sum()
>>> monthly_by_category
Category Month
Entertainment 2 220.00
7 121.80
8 27.23
Fee/Interest Charge 1 29.49
2 27.94
3 29.35
4 56.03
5 136.46
9 30.23
10 0.17
11 61.78
12 0.65
Name: Debit, dtype: float64
It seems like we have paid a “Fee/Interest Charge” almost every month. That’s an obvious red flag and a bad spending habit – maybe we didn’t pay off the credit card on time, which resulted in those interest charges. We should pay more attention to payment deadlines and be more diligent about paying on time!
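As a side note, if the stacked two-level output above is hard to scan, one option is to unstack the “Month” level into columns. A sketch on toy data (the real result would have one column per month in the data):

```python
import pandas as pd

# Toy stand-in for the two-category, multi-month breakdown
df = pd.DataFrame({'Category': ['Entertainment', 'Fee/Interest Charge', 'Fee/Interest Charge'],
                   'Month': [2, 1, 2],
                   'Debit': [220.00, 29.49, 27.94]})

monthly = df.groupby(['Category', 'Month'])['Debit'].sum()

# Pivot the Month index level into columns: one row per category
print(monthly.unstack('Month'))
```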
Now you have a basic understanding of how to summarize data with the pandas groupby function. The next section covers a more advanced topic: what’s under the hood when we use that function.
Pandas groupby: a split-apply-combine process
In essence, groupby refers to a process that involves one or more of the following steps:
- Split – split the data into groups
- Apply – apply an operation to each group (from the split step) independently
- Combine – combine the results together
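The three steps above can be sketched manually on toy data (pandas’ groupby does all of this for us in one call):

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A'], 'Debit': [1.0, 2.0, 3.0]})

# Split: one sub-frame per category
groups = {name: sub for name, sub in df.groupby('Category')}

# Apply: sum each group's Debit column independently
sums = {name: sub['Debit'].sum() for name, sub in groups.items()}

# Combine: collect the per-group results back into a single Series
result = pd.Series(sums, name='Debit')
print(result)  # A: 4.0, B: 2.0
```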
Split the dataset
Splitting data happens at the groupby() stage. Let’s split the data by spending category; the result is actually a pandas DataFrameGroupBy object. It’s hard to visualize what this object is if you simply print it out:
>>> category = df.groupby('Category')
>>> print(category)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028E72E17E50>
The good news is that we can iterate over the GroupBy
object to see what’s inside. The full output is way too long so here I’m only showing one of the items:
for i in category:
    print(i)
.........
('Entertainment', Transaction Date Posted Date Card No. ... 10 11 12
283 2020-08-01 2020-08-03 1 ... 0.0 0.0 0.0
284 2020-08-01 2020-08-03 2 ... 0.0 0.0 0.0
313 2020-07-17 2020-07-18 4 ... 0.0 0.0 0.0
469 2020-02-01 2020-02-03 1 ... 0.0 0.0 0.0
[4 rows x 21 columns])
.........
Notice the parentheses around this item? It looks like a tuple that contains a string and a dataframe… let’s confirm that by printing out the type of each item within the GroupBy object.
>>> for i in category:
...     print(f'{type(i)} contains: 1) {type(i[0])} 2) {type(i[1])}')
<class 'tuple'> contains: 1) <class 'str'> 2) <class 'pandas.core.frame.DataFrame'>
<class 'tuple'> contains: 1) <class 'str'> 2) <class 'pandas.core.frame.DataFrame'>
..........
Now we have confirmation! The GroupBy object contains a bunch of tuples (one for each group). Within each tuple, the first element is the category name, and the second element is the subset of data belonging to that particular category. This is the splitting step.
Instead of iterating over it, we can also access the split dataset using pandas’ built-in attributes or methods. For example, the groups attribute gives us a dictionary containing the group names (dictionary keys) and the index locations of the rows belonging to each group.
>>> category.groups
{'Dining': [8, 15, 44, 46, 67, 98, 126, 134, 150, 182, 251, 256, 261, 274, 277, 278,....
....
'Phone/Cable': [39, 87, 144, 211, 266, 286], 'Professional Services': [191]}
To get a particular group, simply use get_group().
>>> category.get_group('Fee/Interest Charge')
Transaction Date Debit Day Month
30 2020-12-19 0.65 Saturday 12
79 2020-11-20 61.78 Friday 11
.....
447 2020-02-20 27.94 Thursday 2
497 2020-01-20 29.00 Monday 1
498 2020-01-20 0.49 Monday 1
>>>
Apply an operation
Once we have the split dataset, we can easily apply an operation on a subset of data. To calculate the total spending of the “Fee/Interest Charge” group, we can simply sum up the “Debit” column.
>>> category.get_group('Fee/Interest Charge')['Debit'].sum()
372.1
You might have also noticed that we can use the .loc approach to achieve exactly the same result as the above groupby method. However, the .loc approach only performs the operation on one group at a time, whereas the groupby approach automatically applies the same operation to each group.
>>> df.loc[df['Category'] == 'Fee/Interest Charge']['Debit'].sum()
372.1
If we were to replicate the split & apply process using the .loc approach, it would look like the following. Let’s also compare .loc with the groupby method. It’s very clear that the latter is easier to use and also puts the results back into a dataframe structure, which is much more convenient for further processing!
>>> for c in df['Category'].unique():
...     print(c + ": " + str(df.loc[df['Category'] == c]['Debit'].sum().round(2)))
Internet: 2999.96
Other Travel: 1237.88
Gas/Automotive: 1218.81
Merchandise: 24126.63
Dining: 1212.05
Health Care: 3198.7
Fee/Interest Charge: 372.1
Phone/Cable: 615.99
Payment/Credit: 0.0
Other Services: 346.42
Professional Services: 79.08
Entertainment: 369.03
Other: 50.0
Insurance: 238.3
>>> df.groupby('Category')['Debit'].sum()
Category
Dining 1212.05
Entertainment 369.03
Fee/Interest Charge 372.10
Gas/Automotive 1218.81
Health Care 3198.70
Insurance 238.30
Internet 2999.96
Merchandise 24126.63
Other 50.00
Other Services 346.42
Other Travel 1237.88
Payment/Credit 0.00
Phone/Cable 615.99
Professional Services 79.08
Name: Debit, dtype: float64
Combining the results
Finally, the combining step is easy to visualize from the results we obtained above: it puts the per-group results back into a dataframe and displays them in a more meaningful way. The cosmetics are not the focus of this tutorial, so we won’t go into that part.
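For instance, if you’d rather have the group labels as a regular column instead of the index, reset_index() does exactly that (toy data again, not the statement file):

```python
import pandas as pd

df = pd.DataFrame({'Category': ['Dining', 'Dining', 'Other'],
                   'Debit': [5.0, 7.0, 3.0]})

# The combined result is a Series indexed by Category;
# reset_index() turns it into a two-column dataframe
combined = df.groupby('Category')['Debit'].sum().reset_index()
print(combined)
```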