SUMIF And COUNTIF In Pandas

This tutorial will walk through how to perform the SUMIF and COUNTIF Excel functions using the Python pandas library.

SUMIF is probably one of the most commonly used functions in Excel. As its name suggests, the function adds numbers when specific criteria are met.

Example Dataset

For this tutorial, I’ll use an interesting dataset that I found from Kaggle. It contains noise complaint calls related to “loud music/party” that the NYC police received in 2016. Let’s find out where to have some fun in NYC!

**NOTE** I modified the dataset a little bit (the “Location Type” column) to show the sumifs example later on. If you download the original dataset from Kaggle, you won’t replicate the results we show in this tutorial.

For convenience, I’ve uploaded the dataset to my Github, and you can read the file directly with pandas (see below code).

If you prefer to download from the source, click here: https://www.kaggle.com/somesnm/partynyc. Then download the “bar_locations.csv” file.

To load the dataset:

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/pythoninoffice/pandas_sumif_tut/main/modified_bar_locations.csv')
Read data into pandas
Read data into pandas

The dataset and the labeling are pretty straightforward, so I won’t explain what they are. If you need clarification, feel free to leave a comment down below.

SUMIF In Pandas

.Using boolean indexing

We’ll find out the total number of calls for the Manhattan borough.

The boolean index is a very common technique in pandas. Essentially it applies a filter to the dataframe and selects only the records that meet our criteria. For example, if we want all records from the Manhattan borough:

df[df['Borough']=='MANHATTAN']
Select rows using pandas boolean index
Select rows using pandas boolean index

Out of the entire dataset, we see 1,076 records from Manhattan.

Inside the df[], this expression df['Borough']=='MANHATTAN' returns a full-length (2,440 entries) list of True or False values, hence the name “boolean index”. Only the records with True values will return once we pass this boolean index into the df[]. That’s how we got the 1,076 entries in the above screenshot.

Python pandas boolean index
Python pandas boolean index

With the filtered dataframe, we can select the num_calls column and calculate the sum().

df[df['Borough']=='MANHATTAN']['num_calls'].sum()

What if we want to find out the number of complaint calls for all 5 boroughs of New York? We could use the above method and loop through the five borough names and calculate one by one, but that’s a bit inefficient.

.Using groupby() method

pandas library has a groupby() method that allows easy operations (e.g., sum) on groups. To use this function, we need to provide a group name, a data column, and an operation to perform. In our example:

  • group: the column Borough
  • data column: the num_calls column
  • operation: sum()
df.groupby('Borough')['num_calls'].sum()
pandas groupby function
pandas groupby function

So we found out that people in live Manhattan like to have the most fun by contributing to 39,926 complaint calls. The “unspecified” category is probably due to some missing data, which we won’t focus on here.

SUMIFS In Pandas

Of course, SUMIFS is another popular Excel formula that we can’t forget! It basically allows us to use multiple criteria when performing the sum calculation.

This time, we’ll pinpoint our search by combining Borough and Location Type columns. Note data in the Location Type column is randomly generated for demonstration purposes.

.Using boolean indexing

Let’s look at how many complaints are for the Manhattan borough and the location type “Store/Commercial”.

Since you are already familiar with the boolean index, the below should be straightforward. It’s essentially combining two criteria using the bitwise-and operator &. Note the brackets around the two criteria are essential.

Similar to the SUMIF example where we pass only 1 condition Borough == 'MANHATTAN', here in the SUMIFS, we pass in multiple conditions (as many as you need). In this example, we just needed two.

.Using groupby() method

If you are interested in all the Borough and Location Type combinations, we will still use the groupby() method instead of looping through all the possible combinations. We simply pass in a list of column names to the groupby function.

df.groupby(['Borough','Location Type'])['num_calls'].sum()

COUNTIF, COUNTIFs And Many Others In Pandas

Since you have already mastered SUMIF and SUMIFS in pandas, to do COUNTIF(S), all we need is to replace the sum() operation with the count() operation.

In fact, if we replace sum() in the above examples with:

  • mean() – will give AVERAGEIF(S)
  • max() – will give MAXIFS
  • min() – will give MINIFS
  • median() – will give MEDIANIF(S) although this formula doesn’t exist in Excel…
  • mode() – will give MODEIF(S) although this formula doesn’t exist in Excel…

Conclusion

Python and pandas are very versatile. Although there’s no SUMIF function in pandas, as long as we understand how the values are calculated, we can replicate the formula ourselves!

Leave a Reply

Your email address will not be published. Required fields are marked *