Remove Duplicate Data In Pandas
This article demonstrates how to remove duplicate rows using Pandas, and how to do that efficiently.
Before removing duplicates, let’s take a moment to understand what duplicate data is.
In a DataFrame, duplicate data means that the same data appears in more than one row.
What are Duplicates in Pandas?
Duplicate values in pandas refer to rows with identical values across all columns.
They can occur due to various reasons, including data entry errors, data merging, or data collection.
Duplicate rows can distort data analysis results, and it is, therefore, important to remove them before analysis.
In other words, a duplicate row is a row that has been recorded more than once.
    EMPLOYEE_ID EMPLOYEE_NAME  SALARY($)
0             1         Harry      400.0
1             2      Jonathan      300.0
2             3        Miguel      320.0
3           '4'          Erin      250.0
4             5          Emma      280.0
5             6           Lia        NaN
6             7      Samantha      300.0
7             7      Samantha      300.0
8             8            69      280.0
9             9        Dustin      370.0
10           10         Steve        NaN
In our empWrong_data.csv data set, rows 6 and 7 are exactly the same.
The duplicated() method can be applied to identify identical records.
For every row in the table, duplicated() returns a Boolean value.
By default, the first occurrence of a row is marked False and every later row that is identical to it is marked True.
Rows that have no duplicate are also marked False, so for empWrong_data.csv only row 7 is flagged as True.
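The check described above can be sketched as follows. This is a minimal sketch, not the original example: because empWrong_data.csv itself is not included here, it assumes an in-memory copy of the table (csv_text is a hypothetical stand-in for the file), and dup_mask is an illustrative name.

```python
import io

import pandas as pd

# Assumption: csv_text stands in for the contents of empWrong_data.csv,
# copied from the table shown in this article.
csv_text = """EMPLOYEE_ID,EMPLOYEE_NAME,SALARY($)
1,Harry,400.0
2,Jonathan,300.0
3,Miguel,320.0
'4',Erin,250.0
5,Emma,280.0
6,Lia,
7,Samantha,300.0
7,Samantha,300.0
8,69,280.0
9,Dustin,370.0
10,Steve,
"""

# Read the stand-in data the same way a real CSV file would be read.
mrx_df = pd.read_csv(io.StringIO(csv_text))

# duplicated() returns one Boolean per row; with the default keep="first",
# only the second and later copies of a row are marked True.
dup_mask = mrx_df.duplicated()
print(dup_mask)

# Number of rows flagged as duplicates (1 here: the row at index 7).
print(dup_mask.sum())
```

With a real file, pd.read_csv("empWrong_data.csv") would replace the io.StringIO call. Passing keep=False to duplicated() would flag both row 6 and row 7 instead of only the later copy.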
Remove Duplicate Data
You can call the drop_duplicates() method if you want to eliminate identical rows from a DataFrame.
Remove the repeated row from the empWrong_data.csv file:
Delete each of the identical entries:
The above example reads in a CSV file named “ambiguous_data.csv” using the read_csv() function from the Pandas library and stores it in a variable called mrx_df.
The drop_duplicates() method is then called on the mrx_df DataFrame to remove any duplicate rows. The inplace=True argument modifies mrx_df in place, i.e., the original DataFrame is changed and no new DataFrame is created.
Finally, the modified mrx_df dataframe is printed to the console using the print function.
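The steps described above can be sketched roughly as follows. This is a hedged illustration rather than the original listing: it assumes an in-memory copy of the table from this article (csv_text is a hypothetical stand-in for the CSV file) so that it runs without the file on disk.

```python
import io

import pandas as pd

# Assumption: csv_text stands in for the CSV file read in the article's example.
csv_text = """EMPLOYEE_ID,EMPLOYEE_NAME,SALARY($)
1,Harry,400.0
2,Jonathan,300.0
3,Miguel,320.0
'4',Erin,250.0
5,Emma,280.0
6,Lia,
7,Samantha,300.0
7,Samantha,300.0
8,69,280.0
9,Dustin,370.0
10,Steve,
"""

# Load the data into a DataFrame (a file path would be passed here in practice).
mrx_df = pd.read_csv(io.StringIO(csv_text))
rows_before = len(mrx_df)

# Drop rows that repeat an earlier row; inplace=True changes mrx_df itself
# and the call returns None.
mrx_df.drop_duplicates(inplace=True)

print(mrx_df)
print(f"Removed {rows_before - len(mrx_df)} duplicate row(s)")
```

The surviving rows keep their original index labels, so index 7 is simply missing afterwards; calling mrx_df.reset_index(drop=True) would renumber them. Without inplace=True, drop_duplicates() leaves mrx_df untouched and returns a new DataFrame instead.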
Why remove duplicates?
Removing duplicate records is an essential step in data cleaning and preprocessing.
Duplicate records can lead to incorrect analysis and insights, as they can skew the results of statistical analysis and machine learning models.
Additionally, they can increase the size of the dataset, which can slow down processing and take up unnecessary storage space.