Last Updated on July 14, 2022 by Jay
This tutorial will walk through how to perform the SUMIF and COUNTIF Excel functions using the Python pandas library.
SUMIF is probably one of the most commonly used functions in Excel. As its name suggests, the function adds numbers when specific criteria are met.
Example Dataset
For this tutorial, I’ll use an interesting dataset that I found from Kaggle. It contains noise complaint calls related to “loud music/party” that the NYC police received in 2016. Let’s find out where to have some fun in NYC!
**NOTE** I modified the dataset a little bit (the “Location Type” column) to show the sumifs example later on. If you download the original dataset from Kaggle, you won’t replicate the results we show in this tutorial.
For convenience, I’ve uploaded the dataset to my Github, and you can read the file directly with pandas (see below code).
If you prefer to download from the source, click here: https://www.kaggle.com/somesnm/partynyc. Then download the “bar_locations.csv” file.
To load the dataset:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/pythoninoffice/pandas_sumif_tut/main/modified_bar_locations.csv')
The dataset and the labeling are pretty straightforward, so I won’t explain what they are. If you need clarification, feel free to leave a comment down below.
SUMIF In Pandas
.Using boolean indexing
We’ll find out the total number of calls for the Manhattan borough.
The boolean index is a very common technique in pandas. Essentially it applies a filter to the dataframe and selects only the records that meet our criteria. For example, if we want all records from the Manhattan borough:
df[df['Borough']=='MANHATTAN']
Out of the entire dataset, we see 1,076 records from Manhattan.
Inside the df[]
, this expression df['Borough']=='MANHATTAN'
returns a full-length (2,440 entries) list of True or False values, hence the name “boolean index”. Only the records with True
values will return once we pass this boolean index into the df[]
. That’s how we got the 1,076 entries in the above screenshot.
With the filtered dataframe, we can select the num_calls
column and calculate the sum()
.
df[df['Borough']=='MANHATTAN']['num_calls'].sum()
What if we want to find out the number of complaint calls for all 5 boroughs of New York? We could use the above method and loop through the five borough names and calculate one by one, but that’s a bit inefficient.
.Using groupby() method
pandas library has a groupby() method that allows easy operations (e.g., sum) on groups. To use this function, we need to provide a group name, a data column, and an operation to perform. In our example:
- group: the column
Borough
- data column: the
num_calls
column - operation:
sum()
df.groupby('Borough')['num_calls'].sum()
So we found out that people in live Manhattan like to have the most fun by contributing to 39,926 complaint calls. The “unspecified” category is probably due to some missing data, which we won’t focus on here.
SUMIFS In Pandas
Of course, SUMIFS is another popular Excel formula that we can’t forget! It basically allows us to use multiple criteria when performing the sum calculation.
This time, we’ll pinpoint our search by combining Borough
and Location Type
columns. Note data in the Location Type
column is randomly generated for demonstration purposes.
.Using boolean indexing
Let’s look at how many complaints are for the Manhattan borough and the location type “Store/Commercial”.
Since you are already familiar with the boolean index, the below should be straightforward. It’s essentially combining two criteria using the bitwise-and operator &
. Note the brackets around the two criteria are essential.
Similar to the SUMIF example where we pass only 1 condition Borough == 'MANHATTAN'
, here in the SUMIFS, we pass in multiple conditions (as many as you need). In this example, we just needed two.
.Using groupby() method
If you are interested in all the Borough and Location Type combinations, we will still use the groupby() method instead of looping through all the possible combinations. We simply pass in a list of column names to the groupby function.
df.groupby(['Borough','Location Type'])['num_calls'].sum()
COUNTIF, COUNTIFs And Many Others In Pandas
Since you have already mastered SUMIF and SUMIFS in pandas, to do COUNTIF(S), all we need is to replace the sum() operation with the count() operation.
In fact, if we replace sum() in the above examples with:
- mean() – will give AVERAGEIF(S)
- max() – will give MAXIFS
- min() – will give MINIFS
- median() – will give MEDIANIF(S) although this formula doesn’t exist in Excel…
- mode() – will give MODEIF(S) although this formula doesn’t exist in Excel…
Conclusion
Python and pandas are very versatile. Although there’s no SUMIF function in pandas, as long as we understand how the values are calculated, we can replicate the formula ourselves!