Last Updated on July 14, 2022 by Jay
In Python, the pandas groupby() function provides a convenient way to summarize data in any way we want. The function actually does more than just summarize data. We’ll walk through a real-life example of how to use the function, then take a deeper dive into what’s actually happening behind the scenes – the so-called “split-apply-combine” process.
A simple tool to track your credit card spending
Almost everyone has a credit card nowadays. It’s so convenient to use – with one quick tap or swipe, the transaction is done. However, at the end of each payment period, do you ever wonder “where the heck did I spend all this money?” The goal is to gain insights from our credit card transaction data. Hopefully after this exercise we’ll learn something about our own spending habits, and maybe we can form an action plan to help improve our personal finance situation.
I’ve attached a mocked-up credit card statement here if you want to download it and follow along with the tutorial.
Let’s examine what data is available. First of all, we’ll load the file into the Python environment. Note that in the read_csv line I’ve included a parse_dates argument to indicate that the “Transaction Date” column contains datetime-type data. This will make later processing easier.
import pandas as pd
df = pd.read_csv('cc_statement.csv', parse_dates=['Transaction Date'])
If we don’t specify this parse_dates argument, pandas will likely read the column as text data. Another way to convert text into the datetime type is to use the following:
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])
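As a quick sketch of what this conversion does (using a tiny made-up frame rather than the actual statement file), we can check the column’s dtype before and after:

```python
import pandas as pd

# Tiny made-up example, not the actual statement data
df = pd.DataFrame({'Transaction Date': ['2020-01-02', '2020-12-30']})

# Before conversion, the column is plain text (object dtype)
assert df['Transaction Date'].dtype == object

# Convert, then confirm we now have a datetime64 column
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])
print(df['Transaction Date'].dtype)  # datetime64[ns]
```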
The following quick checks show that there are 533 rows / transactions. We have transaction dates, purchase description, purchase category and amount (debit means cash outflow/our spending, credit means cash inflow/credit card payment).
>>> df.shape
(533, 7)
>>> df.columns
Index(['Transaction Date', 'Posted Date', 'Card No.', 'Description',
'Category', 'Debit', 'Credit'],
dtype='object')
Using the describe() function on the “Transaction Date” column also reveals that we are working with 2020 full-year data (min = 2020-01-02, and max = 2020-12-30). The datetime_is_numeric argument helps pandas understand that we are working with datetime-type data.
>>> df['Transaction Date'].describe(datetime_is_numeric=True)
count 533
mean 2020-07-20 08:36:01.350844160
min 2020-01-02 00:00:00
25% 2020-03-21 00:00:00
50% 2020-08-17 00:00:00
75% 2020-10-23 00:00:00
max 2020-12-30 00:00:00
Name: Transaction Date, dtype: object
Adding more information to our data
We will go ahead and add two more columns for the day and month of each transaction. Because we have already specified that the “Transaction Date” column is a datetime-like object, we can access these properties via the .dt accessor, which allows for vectorized operations, i.e. the appropriate way to work with data in pandas.
df['Day'] = df['Transaction Date'].dt.day_name()
df['Month'] = df['Transaction Date'].dt.month
day_name() returns the name of the day, e.g. Monday to Sunday, whereas month returns a numeric value for the given month (1–12).
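A minimal sketch of the .dt accessor in action, on a couple of hypothetical dates rather than the statement data:

```python
import pandas as pd

# Two made-up dates, just to illustrate the .dt accessor
dates = pd.to_datetime(pd.Series(['2020-01-03', '2020-07-14']))

print(dates.dt.day_name())  # Friday, Tuesday
print(dates.dt.month)       # 1, 7
```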
Summarizing data with pandas groupby
Unorganized transaction data doesn’t provide much value, but when we organize and summarize them in a meaningful way, we can gain more insights on our spending habits. Let’s take a look at the following examples.
In the below example, we first group our data by day of the week, then we specify the column we want to look at – “Debit” – and finally we perform an operation – count or sum – on the “Debit” column of the grouped data. The following summary tells us that we shop the most (in terms of number of transactions) on Fridays, whereas we spend the most (in terms of dollars) on Sundays.
daily_spend_count = df.groupby('Day')['Debit'].count()
daily_spend_sum = df.groupby('Day')['Debit'].sum()
>>> daily_spend_count
Day
Friday 90
Monday 81
Saturday 62
Sunday 71
Thursday 79
Tuesday 62
Wednesday 55
Name: Debit, dtype: int64
>>> daily_spend_sum
Day
Friday 4879.35
Monday 4767.77
Saturday 4308.41
Sunday 6338.73
Thursday 5837.16
Tuesday 3781.82
Wednesday 6151.71
Name: Debit, dtype: float64
We can actually combine the above two lines of code into one by using the .agg() method of the groupby object. We simply pass a dictionary into agg(). The dictionary keys are the data columns we want to work on, and the dictionary values (either a single value or a list) are the operations we want to perform.
daily_spend = df.groupby('Day').agg({'Debit':['sum','count']})
>>> daily_spend
Debit
sum count
Day
Friday 4879.35 90
Monday 4767.77 81
Saturday 4308.41 62
Sunday 6338.73 71
Thursday 5837.16 79
Tuesday 3781.82 62
Wednesday 6151.71 55
With the dictionary approach, we can’t change the resulting column names without a follow-up rename() call. To set the column names directly within the agg() method, we use named aggregation, where:
- The keywords are the new column names
- The values are pd.NamedAgg namedtuples, with the first argument specifying the column and the second specifying the operation
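A minimal sketch of that pattern, using made-up numbers standing in for the statement data:

```python
import pandas as pd

# Toy data standing in for the credit card statement
df = pd.DataFrame({'Category': ['Dining', 'Dining', 'Insurance'],
                   'Debit': [10.0, 20.0, 238.3]})

# Named aggregation: keyword = new column name, value = pd.NamedAgg
spend = df.groupby('Category').agg(
    Total_spend=pd.NamedAgg(column='Debit', aggfunc='sum'),
    Avg_spend=pd.NamedAgg(column='Debit', aggfunc='mean'),
)
print(spend)
```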
Since pd.NamedAgg is a namedtuple, regular tuples are allowed as well, so we can simplify the above even further:
spend_category = df.groupby('Category').agg(Total_spend=('Debit','sum'),
Avg_spend=('Debit','mean'))
>>> spend_category
Total_spend Avg_spend
Category
Dining 1212.05 28.858333
Entertainment 369.03 92.257500
Fee/Interest Charge 372.10 31.008333
Gas/Automotive 1218.81 32.940811
Health Care 3198.70 71.082222
Insurance 238.30 238.300000
Internet 2999.96 39.999467
Merchandise 24126.63 102.666511
Other 50.00 50.000000
Other Services 346.42 38.491111
Other Travel 1237.88 38.683750
Payment/Credit 0.00 NaN
Phone/Cable 615.99 102.665000
Professional Services 79.08 79.080000
Grouping by multiple columns
Remember that our goal is to hopefully gain some insights from our spending data and try to improve our personal finance situation. The “Fee/Interest Charge” category looks suspicious, and we also want to see if we can reduce some of the “Entertainment” expenses, so let’s break those expenses down into monthly figures.
We’ll select only “Entertainment” and “Fee/Interest Charge” from the category and examine the new dataset. If you need help understanding how to select a subset of data in pandas, read this tutorial here.
smaller_df = df.loc[(df['Category']=='Fee/Interest Charge') | (df['Category']=='Entertainment')]
monthly_by_category = smaller_df.groupby(['Category','Month'])['Debit'].sum()
>>> monthly_by_category
Category Month
Entertainment 2 220.00
7 121.80
8 27.23
Fee/Interest Charge 1 29.49
2 27.94
3 29.35
4 56.03
5 136.46
9 30.23
10 0.17
11 61.78
12 0.65
Name: Debit, dtype: float64
It seems like we have paid a “Fee/Interest Charge” almost every month. That’s an obvious red flag and a bad spending habit – maybe we didn’t pay off the credit card on time, which resulted in those interest charges. We should pay more attention to payment deadlines and be more diligent about paying on time!
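As a side note, if the stacked two-level output above is hard to scan, one option is to unstack the “Month” level into columns. A sketch on toy data (the real result would have one column per month in the data):

```python
import pandas as pd

# Toy stand-in for the two-category, multi-month breakdown
df = pd.DataFrame({'Category': ['Entertainment', 'Fee/Interest Charge', 'Fee/Interest Charge'],
                   'Month': [2, 1, 2],
                   'Debit': [220.00, 29.49, 27.94]})

monthly = df.groupby(['Category', 'Month'])['Debit'].sum()

# Pivot the Month index level into columns: one row per category
print(monthly.unstack('Month'))
```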
Now you have a basic understanding of how to summarize data with the pandas groupby function. The next section covers a more advanced topic: what’s under the hood when we use that function.
Pandas groupby: a split-apply-combine process
In essence, groupby refers to a process that involves one or more of the following steps:
- Split – split the data into groups
- Apply – apply an operation to each group (from the split step) independently
- Combine – combine the results together
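The three steps above can be sketched manually on toy data (pandas’ groupby does all of this for us in one call):

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A'], 'Debit': [1.0, 2.0, 3.0]})

# Split: one sub-frame per category
groups = {name: sub for name, sub in df.groupby('Category')}

# Apply: sum each group's Debit column independently
sums = {name: sub['Debit'].sum() for name, sub in groups.items()}

# Combine: collect the per-group results back into a single Series
result = pd.Series(sums, name='Debit')
print(result)  # A: 4.0, B: 2.0
```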
Split the dataset
Splitting data happens at the groupby() stage. Let’s split the data by spending category; the result is actually a pandas DataFrameGroupBy object. It’s hard to visualize what this object is if you simply print it out:
>>> category = df.groupby('Category')
>>> print(category)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028E72E17E50>
The good news is that we can iterate over the GroupBy
object to see what’s inside. The full output is way too long so here I’m only showing one of the items:
for i in category:
    print(i)
.........
('Entertainment', Transaction Date Posted Date Card No. ... 10 11 12
283 2020-08-01 2020-08-03 1 ... 0.0 0.0 0.0
284 2020-08-01 2020-08-03 2 ... 0.0 0.0 0.0
313 2020-07-17 2020-07-18 4 ... 0.0 0.0 0.0
469 2020-02-01 2020-02-03 1 ... 0.0 0.0 0.0
[4 rows x 21 columns])
.........
Notice the parentheses around this item? It looks like a tuple that contains a string and a dataframe… let’s confirm that by printing out the type of each item within the GroupBy object.
>>> for i in category:
...     print(f'{type(i)} contains: 1) {type(i[0])} 2) {type(i[1])}')
<class 'tuple'> contains: 1) <class 'str'> 2) <class 'pandas.core.frame.DataFrame'>
<class 'tuple'> contains: 1) <class 'str'> 2) <class 'pandas.core.frame.DataFrame'>
..........
Now we have confirmation! The GroupBy object contains a bunch of tuples (one for each group). Within each tuple, the first element is the category name, and the second element is the subset of data belonging to that particular category. This is the splitting step.
Instead of iterating over it, we can also access the split dataset using pandas’ built-in attributes or methods. For example, the groups attribute gives us a dictionary containing the group names (dictionary keys) and the index locations of the rows belonging to each group.
>>> category.groups
{'Dining': [8, 15, 44, 46, 67, 98, 126, 134, 150, 182, 251, 256, 261, 274, 277, 278,....
....
'Phone/Cable': [39, 87, 144, 211, 266, 286], 'Professional Services': [191]}
To get a particular group, simply use get_group().
>>> category.get_group('Fee/Interest Charge')
Transaction Date Debit Day Month
30 2020-12-19 0.65 Saturday 12
79 2020-11-20 61.78 Friday 11
.....
447 2020-02-20 27.94 Thursday 2
497 2020-01-20 29.00 Monday 1
498 2020-01-20 0.49 Monday 1
>>>
Apply an operation
Once we have the split dataset, we can easily apply an operation on a subset of data. To calculate the total spending of the “Fee/Interest Charge” group, we can simply sum up the “Debit” column.
>>> category.get_group('Fee/Interest Charge')['Debit'].sum()
372.1
You might have also noticed that we can use the .loc approach to achieve exactly the same result as the above groupby method. However, the .loc approach only performs the operation on one group at a time, whereas the groupby approach automatically applies the same operation to each group.
>>> df.loc[df['Category'] == 'Fee/Interest Charge']['Debit'].sum()
372.1
If we were to replicate the split & apply process using the .loc approach, it would look like the following. Let’s also compare .loc with the groupby method. It’s very clear that the latter is easier to use and also puts the results back into a dataframe structure, which is much more convenient for further processing!
>>> for c in df['Category'].unique():
...     print(c + ": " + str(df.loc[df['Category'] == c]['Debit'].sum().round(2)))
Internet: 2999.96
Other Travel: 1237.88
Gas/Automotive: 1218.81
Merchandise: 24126.63
Dining: 1212.05
Health Care: 3198.7
Fee/Interest Charge: 372.1
Phone/Cable: 615.99
Payment/Credit: 0.0
Other Services: 346.42
Professional Services: 79.08
Entertainment: 369.03
Other: 50.0
Insurance: 238.3
>>> df.groupby('Category')['Debit'].sum()
Category
Dining 1212.05
Entertainment 369.03
Fee/Interest Charge 372.10
Gas/Automotive 1218.81
Health Care 3198.70
Insurance 238.30
Internet 2999.96
Merchandise 24126.63
Other 50.00
Other Services 346.42
Other Travel 1237.88
Payment/Credit 0.00
Phone/Cable 615.99
Professional Services 79.08
Name: Debit, dtype: float64
Combining the results
Finally, the combining step is easy to visualize from the results we obtained above: it puts the per-group results back into a dataframe and displays them in a more meaningful way. The cosmetics are not the focus of this tutorial, so we won’t go into that part.
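For instance, if you’d rather have the group labels as a regular column instead of the index, reset_index() does exactly that (toy data again, not the statement file):

```python
import pandas as pd

df = pd.DataFrame({'Category': ['Dining', 'Dining', 'Other'],
                   'Debit': [5.0, 7.0, 3.0]})

# The combined result is a Series indexed by Category;
# reset_index() turns it into a two-column dataframe
combined = df.groupby('Category')['Debit'].sum().reset_index()
print(combined)
```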