Last Updated on July 14, 2022 by Jay
In Excel, we can “easily” remove duplicates from a table by clicking on the “Remove Duplicates” button from the Data tab. Indeed it’s pretty easy. However, when the datasets are too big, or there are formulas in the spreadsheet, this can sometimes be slow. So today, we’ll explore how to use Python to remove duplicates from a data table. It’s super easy, fast, and flexible.
This tutorial is part of the “Integrate Python with Excel” series. You can find the table of contents here for easier navigation.
Prepare a dataframe for demo
You can download this sample Excel spreadsheet to follow along. First thing, let’s load the spreadsheet into Python. If you are not familiar with using Python to work with Excel files, check out here for a Python vs. Excel comparison.
import pandas as pd
df = pd.read_excel('users.xlsx')
>>> df
User Name Country City Gender Age
0 Forrest Gump USA New York M 50
1 Mary Jane CANADA Toronto F 30
2 Harry Porter UK London M 10
3 Jean Grey CHINA Shanghai F 30
4 Jean Grey CANADA Montreal F 30
5 Mary Jane CANADA Toronto F 30
A quick observation of the above small table:
- Rows 1 and 5 contain the exact same information
- Rows 3 and 4 contain the same User Name, but different Country and City
Remove duplicates
Depending on what you are trying to achieve, we can use different approaches to remove duplicates. The two most common scenarios are: removing duplicates from the entire table or finding unique values from a column. We’ll see how to handle both situations with different techniques.
Removing duplicates from the entire table
pandas provides a method .drop_duplicates() to help us easily remove duplicates! This method takes the following arguments:

- subset: column label(s) to consider when identifying duplicates; the default is all columns
- keep: which duplicates to keep. ‘first’ (default): keep the first occurrence. ‘last’: keep the last occurrence. False: drop all duplicates.
- inplace: whether to overwrite the original dataframe
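The keep=False option is easy to overlook: it drops every row that has a duplicate, keeping none of the copies. A minimal sketch on a small throwaway dataframe (not the sample spreadsheet):

```python
import pandas as pd

# A tiny dataframe where the 'Amy' row appears twice
df_demo = pd.DataFrame({'name': ['Amy', 'Bob', 'Amy'],
                        'score': [1, 2, 1]})

# keep=False drops ALL copies of any fully duplicated row,
# so both 'Amy' rows disappear and only 'Bob' survives
print(df_demo.drop_duplicates(keep=False))
```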
## 1 Remove duplicates on all columns
>>> df.drop_duplicates()
User Name Country City Gender Age
0 Forrest Gump USA New York M 50
1 Mary Jane CANADA Toronto F 30
2 Harry Porter UK London M 10
3 Jean Grey CHINA Shanghai F 30
4 Jean Grey CANADA Montreal F 30
In the above code, we didn’t pass any arguments, which means pandas checked all columns for duplicates. The only fully duplicated record was record #5 (a copy of record #1), so it got dropped, while the first occurrence remained.
## 2 Remove duplicates on a selected column
>>> df.drop_duplicates('User Name', keep='last')
User Name Country City Gender Age
0 Forrest Gump USA New York M 50
2 Harry Porter UK London M 10
4 Jean Grey CANADA Montreal F 30
5 Mary Jane CANADA Toronto F 30
In the second round, we passed in the column name “User Name” and told pandas to keep the last duplicates. pandas then checked for duplicates in the “User Name” column only, and dropped them accordingly. Records #1 and #3 got dropped because they were the earlier occurrences of duplicated values in that column.
Now let’s check the original dataframe. It hasn’t changed! That’s because we left the inplace argument at its default value, which is False. If we specify inplace=True, the original df will be replaced with the new dataframe with duplicates removed.
>>> df
User Name Country City Gender Age
0 Forrest Gump USA New York M 50
1 Mary Jane CANADA Toronto F 30
2 Harry Porter UK London M 10
3 Jean Grey CHINA Shanghai F 30
4 Jean Grey CANADA Montreal F 30
5 Mary Jane CANADA Toronto F 30
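To make the change stick, we can pass inplace=True. A quick sketch below, rebuilding the sample table by hand rather than reading the spreadsheet, so the snippet runs on its own:

```python
import pandas as pd

# Rebuild the sample table by hand (same rows as users.xlsx above)
df = pd.DataFrame({
    'User Name': ['Forrest Gump', 'Mary Jane', 'Harry Porter',
                  'Jean Grey', 'Jean Grey', 'Mary Jane'],
    'Country': ['USA', 'CANADA', 'UK', 'CHINA', 'CANADA', 'CANADA'],
    'City': ['New York', 'Toronto', 'London', 'Shanghai', 'Montreal', 'Toronto'],
    'Gender': ['M', 'F', 'M', 'F', 'F', 'F'],
    'Age': [50, 30, 10, 30, 30, 30],
})

# inplace=True modifies df directly and returns None
df.drop_duplicates(inplace=True)

print(len(df))  # record #5 (the copy of record #1) is gone, 5 rows remain
```

Note that with inplace=True the method returns None, so don’t write df = df.drop_duplicates(inplace=True), or df will end up as None.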
Finding unique values in a list or data table column
Sometimes we want to find the unique values in a list or in a dataframe column. In this case, we wouldn’t use .drop_duplicates(). We could, but there are better ways to find unique values.
pandas Series vs pandas Dataframe
For Excel users, it is easy to remember their difference. A pandas Dataframe is a table or a sheet. A pandas Series is a column in that table/sheet. In other words, a Dataframe consists of various Series.
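We can see this distinction directly by checking the types, sketched here with a tiny hand-built table standing in for the sample spreadsheet:

```python
import pandas as pd

# A two-column table: the DataFrame is the whole "sheet"
df = pd.DataFrame({'User Name': ['Forrest Gump', 'Mary Jane'],
                   'Age': [50, 30]})

# Selecting one column gives back a Series (a single "column")
print(type(df))               # <class 'pandas.core.frame.DataFrame'>
print(type(df['User Name']))  # <class 'pandas.core.series.Series'>
```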
pandas Series method .unique()
pandas Series has a .unique() method; however, pandas Dataframe doesn’t have this method.
When we call .unique() on a pandas Series object, it returns an array of unique elements from that column:
>>> df['User Name'].unique()
array(['Forrest Gump', 'Mary Jane', 'Harry Porter', 'Jean Grey'],
dtype=object)
When we call .unique()
on a pandas Dataframe, we’ll get an error message because this method doesn’t exist on Dataframe!
>>> df.unique()
Traceback (most recent call last):
File "<pyshell#10>", line 1, in <module>
df.unique()
File "C:\Program Files\Python38\lib\site-packages\pandas\core\generic.py", line 5274, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'unique'
Python set
Another way to get unique values is by using a set, a built-in Python data structure. A set is basically a collection of unique items, so any duplicates are automatically removed when we pass values into a set.
>>> set(['A','B','C','A','B'])
{'A', 'B', 'C'}
Here’s an example.
>>> df['User Name']
0 Forrest Gump
1 Mary Jane
2 Harry Porter
3 Jean Grey
4 Jean Grey
5 Mary Jane
Name: User Name, dtype: object
>>> set(df['User Name'])
{'Harry Porter', 'Mary Jane', 'Forrest Gump', 'Jean Grey'}
Our column (or pandas Series) contains two duplicates, “Mary Jane” and “Jean Grey”. By converting the column into a set, we are effectively removing the duplicates!
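One caveat worth knowing: a Python set does not preserve the original row order, while .unique() returns values in first-seen order. A Series also has its own .drop_duplicates() method, which likewise keeps the order. A quick comparison on a hand-built Series matching the sample column:

```python
import pandas as pd

names = pd.Series(['Forrest Gump', 'Mary Jane', 'Harry Porter',
                   'Jean Grey', 'Jean Grey', 'Mary Jane'])

# .unique() keeps first-seen order
print(list(names.unique()))
# ['Forrest Gump', 'Mary Jane', 'Harry Porter', 'Jean Grey']

# Series.drop_duplicates() returns another Series, also in order
print(names.drop_duplicates().tolist())
# ['Forrest Gump', 'Mary Jane', 'Harry Porter', 'Jean Grey']
```

So if the order of values matters for your output, prefer .unique() or .drop_duplicates() over a set.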