Remove Duplicate Data In Pandas

Our goal in this article is to show you how to find and remove duplicate data using Pandas, and how to do so efficiently.

Let’s take a moment to understand what duplicate data is before we begin the process of removing it.

In a DataFrame, duplicate data refers to the same data appearing in more than one row.



What are Duplicates in Pandas?

Duplicate values in pandas refer to rows with identical values across all columns.

They can occur for various reasons, including data entry errors, merging of data sets, or repeated data collection.

Duplicate rows can distort data analysis results, and it is, therefore, important to remove them before analysis.


Finding Duplicates

Duplicate rows are rows that appear more than once in the data set.

    EMPLOYEE_ID EMPLOYEE_NAME  SALARY($)
0             1         Harry      400.0
1             2      Jonathan      300.0
2             3        Miguel      320.0
3           '4'          Erin      250.0
4             5          Emma      280.0
5             6           Lia        NaN
6             7      Samantha      300.0
7             7      Samantha      300.0
8             8            69      280.0
9             9        Dustin      370.0
10           10         Steve        NaN

According to our empWrong_data.csv data set, rows 6 and 7 are exactly the same.
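If you do not have the empWrong_data.csv file at hand, you can rebuild an equivalent DataFrame directly in code. The sketch below is an assumption about the file's contents, with the values copied from the table above, so you can follow along without the CSV:

Example:

import pandas as pds

# Rebuild the empWrong_data.csv contents shown above as an in-memory DataFrame
mrx_df = pds.DataFrame({
    "EMPLOYEE_ID": ["1", "2", "3", "'4'", "5", "6", "7", "7", "8", "9", "10"],
    "EMPLOYEE_NAME": ["Harry", "Jonathan", "Miguel", "Erin", "Emma", "Lia",
                      "Samantha", "Samantha", "69", "Dustin", "Steve"],
    "SALARY($)": [400.0, 300.0, 320.0, 250.0, 280.0, None,
                  300.0, 300.0, 280.0, 370.0, None],
})
print(mrx_df)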

There is a method known as duplicated() that can be applied to identify identical records.

For every row in the table, the duplicated() method returns a Boolean value:

It returns True when a row is identical to an earlier row in the empWrong_data.csv file, and False otherwise:

Example: 

import pandas as pds

mrx_df = pds.read_csv("empWrong_data.csv")
print(mrx_df.duplicated())
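With the empWrong_data.csv contents shown above, the output is a Boolean Series that is True only at index 7, the second Samantha row, because duplicated() marks the first occurrence as False by default (output shown approximately):

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
dtype: bool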

In the following example, the output value is True for any repeated row, and False otherwise:

Example: 

import pandas as pds

mrx_df = pds.read_csv("ambiguous_data.csv")
print(mrx_df.duplicated())

Remove Duplicate Data

You can call the drop_duplicates() method if you want to eliminate duplicate rows from a DataFrame.

Remove the repeated row from the empWrong_data.csv file:

Example: 

import pandas as pds

mrx_df = pds.read_csv("empWrong_data.csv")
mrx_df.drop_duplicates(inplace=True)
print(mrx_df)
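Assuming the empWrong_data.csv contents shown earlier, the printed DataFrame keeps the first Samantha row (index 6) and drops the second one (index 7), since drop_duplicates() keeps the first occurrence by default; the output looks roughly like this:

    EMPLOYEE_ID EMPLOYEE_NAME  SALARY($)
0             1         Harry      400.0
1             2      Jonathan      300.0
2             3        Miguel      320.0
3           '4'          Erin      250.0
4             5          Emma      280.0
5             6           Lia        NaN
6             7      Samantha      300.0
8             8            69      280.0
9             9        Dustin      370.0
10           10         Steve        NaN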

Delete all of the duplicate entries from the ambiguous_data.csv file:

Example: 

import pandas as pds

mrx_df = pds.read_csv("ambiguous_data.csv")
mrx_df.drop_duplicates(inplace=True)
print(mrx_df)
Reminder: It is important to note that the (inplace=True) option ensures that the method does not return a new DataFrame, but instead removes the duplicate rows from the original DataFrame itself.

Example Explanation

The above example reads in a CSV file named “ambiguous_data.csv” using the read_csv() function from the Pandas library and stores it in a variable called mrx_df.

The drop_duplicates() method is then called on the mrx_df DataFrame to remove any duplicate rows. The inplace=True argument modifies the mrx_df DataFrame in place, i.e., the original DataFrame is changed rather than a new one being created.

Finally, the modified mrx_df DataFrame is printed to the console using the print() function.
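If you would rather keep the original DataFrame untouched, drop_duplicates() can also be called without inplace=True, in which case it returns a new DataFrame. The sketch below uses a small made-up DataFrame (demo_df and clean_df are illustrative names, not from the examples above):

Example:

import pandas as pds

# A tiny DataFrame with one repeated row (illustrative data only)
demo_df = pds.DataFrame({"name": ["Ana", "Ben", "Ben"], "score": [10, 20, 20]})

# Without inplace=True, drop_duplicates() returns a new DataFrame
# and leaves demo_df unchanged
clean_df = demo_df.drop_duplicates()
print(clean_df)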


Why remove duplicates?

Removing duplicate records is an essential step in data cleaning and preprocessing.

Duplicate records can lead to incorrect analysis and insights, as they can skew the results of statistical analysis and machine learning models.

Additionally, they can increase the size of the dataset, which can slow down processing and take up unnecessary storage space.

