Remove duplicates from a data table


Last Updated on July 14, 2022 by Jay

In Excel, we can “easily” remove duplicates from a table by clicking on the “Remove Duplicates” button from the Data tab. Indeed it’s pretty easy. However, when the datasets are too big, or there are formulas in the spreadsheet, this can sometimes be slow. So today, we’ll explore how to use Python to remove duplicates from a data table. It’s super easy, fast, and flexible.

Excel remove duplicates

This tutorial is part of the “Integrate Python with Excel” series; you can find the table of contents here for easier navigation.

Prepare a dataframe for demo

You can download this sample Excel spreadsheet to follow along. First, let’s load the spreadsheet into Python. If you are not familiar with using Python to work with Excel files, check out here for a Python vs. Excel comparison.

import pandas as pd

df = pd.read_excel('users.xlsx')

>>> df
      User Name Country      City Gender  Age
0  Forrest Gump     USA  New York      M   50
1     Mary Jane  CANADA   Toronto      F   30
2  Harry Porter      UK    London      M   10
3     Jean Grey   CHINA  Shanghai      F   30
4     Jean Grey  CANADA  Montreal      F   30
5     Mary Jane  CANADA   Toronto      F   30

A quick observation of the above small table:

  • Lines 1 and 5 contain exactly the same information
  • Lines 3 and 4 contain the same User Name, but a different Country and City

Remove duplicates

Depending on what you are trying to achieve, we can use different approaches to remove duplicates. The two most common scenarios are: removing duplicates from the entire table or finding unique values from a column. We’ll see how to handle both situations with different techniques.

Removing duplicates from the entire table

pandas DataFrames provide a .drop_duplicates() method to help us easily remove duplicates! This method accepts the following arguments:

  • subset: column label(s) to consider when identifying duplicates; by default, all columns are used
  • keep: which duplicates to keep. ‘first’ (default): keep the first occurrence. ‘last’: keep the last occurrence. False: drop all duplicates.
  • inplace: whether to modify the original dataframe in place (default is False)
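The keep argument deserves a quick demo: besides ‘first’ and ‘last’, keep=False drops every copy of a duplicated row, keeping none. A minimal sketch that rebuilds the demo table in code:

```python
import pandas as pd

# Rebuild the demo table so the sketch is self-contained
df = pd.DataFrame({
    'User Name': ['Forrest Gump', 'Mary Jane', 'Harry Porter',
                  'Jean Grey', 'Jean Grey', 'Mary Jane'],
    'Country': ['USA', 'CANADA', 'UK', 'CHINA', 'CANADA', 'CANADA'],
    'City': ['New York', 'Toronto', 'London', 'Shanghai', 'Montreal', 'Toronto'],
    'Gender': ['M', 'F', 'M', 'F', 'F', 'F'],
    'Age': [50, 30, 10, 30, 30, 30],
})

# keep=False drops every row that has a full duplicate -- both
# "Mary Jane" rows disappear, while the two differing "Jean Grey" rows stay
no_dupes_at_all = df.drop_duplicates(keep=False)
print(no_dupes_at_all)
```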
## 1 Remove duplicates on all columns

>>> df.drop_duplicates()

      User Name Country      City Gender  Age
0  Forrest Gump     USA  New York      M   50
1     Mary Jane  CANADA   Toronto      F   30
2  Harry Porter      UK    London      M   10
3     Jean Grey   CHINA  Shanghai      F   30
4     Jean Grey  CANADA  Montreal      F   30

In the above code, we didn’t pass any arguments, which means all columns are checked for duplicates. The only fully duplicated record was record #5, which got dropped; its first occurrence (record #1) was kept as a result.

## 2 Remove duplicates on a selected column

>>> df.drop_duplicates('User Name', keep='last')
      User Name Country      City Gender  Age
0  Forrest Gump     USA  New York      M   50
2  Harry Porter      UK    London      M   10
4     Jean Grey  CANADA  Montreal      F   30
5     Mary Jane  CANADA   Toronto      F   30

In the 2nd round, we passed in the column name “User Name” and told pandas to keep the last duplicates. Now pandas checks for duplicates only in the “User Name” column and drops them accordingly. Records #1 and #3 got dropped because they were the first occurrences of duplicated values in that column.

Now let’s check the original dataframe. It hasn’t changed! That’s because we didn’t pass the inplace argument, which defaults to False. If we specify inplace=True, the original df will be replaced with the new dataframe with duplicates removed.

>>> df
      User Name Country      City Gender  Age
0  Forrest Gump     USA  New York      M   50
1     Mary Jane  CANADA   Toronto      F   30
2  Harry Porter      UK    London      M   10
3     Jean Grey   CHINA  Shanghai      F   30
4     Jean Grey  CANADA  Montreal      F   30
5     Mary Jane  CANADA   Toronto      F   30
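A minimal sketch of the inplace=True behavior, using a small made-up table (note that with inplace=True the method returns None and mutates the dataframe directly, so don’t assign its result):

```python
import pandas as pd

# A small made-up table with one exact duplicate row
df = pd.DataFrame({
    'User Name': ['Forrest Gump', 'Mary Jane', 'Mary Jane'],
    'Age': [50, 30, 30],
})

# inplace=True mutates df itself; the call returns None
df.drop_duplicates(inplace=True)
print(len(df))  # 2 rows remain

# The equivalent re-assignment pattern, often preferred:
# df = df.drop_duplicates()
```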

Finding unique values in a list or data table column

Sometimes we want to find unique values in a list or a dataframe column. In this case, we wouldn’t use .drop_duplicates(). I mean, we could, but there are better ways to find unique values.

pandas Series vs pandas Dataframe

For Excel users, the difference is easy to remember: a pandas DataFrame is a table or a sheet, and a pandas Series is a column in that table/sheet. In other words, a DataFrame is a collection of Series.
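A quick way to see this distinction in code, using a small made-up table:

```python
import pandas as pd

# A tiny made-up table: the whole thing is a DataFrame,
# a single column pulled out of it is a Series
df = pd.DataFrame({'User Name': ['Mary Jane', 'Jean Grey'], 'Age': [30, 30]})
col = df['User Name']

print(type(df).__name__)   # DataFrame
print(type(col).__name__)  # Series
```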

pandas Series method .unique()

pandas Series has a .unique() method; however, pandas Dataframe doesn’t have this method.

When we call .unique() on a pandas Series object, it returns an array of unique elements from that column:

>>> df['User Name'].unique()
array(['Forrest Gump', 'Mary Jane', 'Harry Porter', 'Jean Grey'],
      dtype=object)

When we call .unique() on a pandas DataFrame, we’ll get an error message because this method doesn’t exist on DataFrame!

>>> df.unique()
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    df.unique()
  File "C:\Program Files\Python38\lib\site-packages\pandas\core\generic.py", line 5274, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'unique'

Python set

Another way to get unique values is by using a set, a data structure in Python. A set is basically a collection of unique items. Since a set contains only unique items, duplicates are automatically removed when we pass them into a set.

>>> set(['A','B','C','A','B'])
{'A', 'B', 'C'}

Here’s an example.

>>> df['User Name']
0    Forrest Gump
1       Mary Jane
2    Harry Porter
3       Jean Grey
4       Jean Grey
5       Mary Jane
Name: User Name, dtype: object

>>> set(df['User Name'])
{'Harry Porter', 'Mary Jane', 'Forrest Gump', 'Jean Grey'}

Our column (or pandas Series) contains two duplicates, “Mary Jane” and “Jean Grey”. By converting the column into a set, we are effectively removing the duplicates!
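One caveat worth noting: a Python set is unordered, so the original row order is lost (you can see the names shuffled in the output above). If the order of first appearance matters, the standard dict.fromkeys idiom deduplicates while preserving it. A minimal sketch:

```python
import pandas as pd

# Rebuild the demo column so the sketch is self-contained
names = pd.Series(['Forrest Gump', 'Mary Jane', 'Harry Porter',
                   'Jean Grey', 'Jean Grey', 'Mary Jane'],
                  name='User Name')

unique_set = set(names)               # unique, but unordered
ordered = list(dict.fromkeys(names))  # unique, first-seen order preserved

print(ordered)
```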
