Clean Wrong Data Format Through Pandas
Wrong format data refers to data that is not in the expected format or structure. It can be in the form of missing values, wrong data types, duplicates, or inconsistent data.
Wrong format data can cause problems when trying to analyze or visualize data.
Our goal in this article is to explore how we can apply Pandas to clean wrong format data that has been incorrectly formatted utilizing Python.
Wrong Format Data
Cleaning wrong format data involves identifying the issues and correcting them.
Examining data can be challenging or difficult if cells are in the incorrect format.
There are two ways to resolve it:
- Eliminate the rows.
- Modify all cells in the columns to the identical format.
Convert Into a Correct Format
We have one cell with the wrong format in our Data Frame.
On row 3, there must be an int value that indicates an employee ID in the ‘EMPLOYEE_ID’ column:
EMPLOYEE_ID EMPLOYEE_NAME SALARY($) 0 1 Harry 400.0 1 2 Jonathan 300.0 2 3 Miguel 320.0 3 '4' Erin 250.0 4 5 Emma 280.0 5 6 Lia NaN 6 7 Samantha 300.0 7 7 Samantha 300.0 8 8 69 280.0 9 9 Dustin 370.0 10 10 Steve NaN
Utilizing the ‘EMPLOYEE_ID‘ column, let’s change every cell into an int value.
For this, Pandas has a replace() method:
Change the value ‘4’ to 4 in the ‘EMPLOYEE_ID’ column:
Apply the dtypes attribute to display the data type of each column:
You can observe from the outcome that the Employee ID in row 3 is fixed, but the empty salary in row 5 and 10 is NaN, or unfilled.
Rows with no values can be dealt with simply by eliminating the complete row.
By implementing the dropna() method, we can eliminate the row obtained from the operation above, and we can deal with the NaN value as an empty value.
In the “SALARY($)” column, eliminate rows with empty values:
First, delete the rows with empty values then sort the “EMPLOYEE_NAME” column in descending order:
The given code loads a CSV file named “empWrong_data.csv” using the Pandas library’s read_csv function and creates a DataFrame named mrx_df from the data. The data in the CSV file likely represents employee records, including their ID, name, and salary.
The code then uses the .replace() method to replace any occurrences of the string ‘4’ in the “EMPLOYEE_ID” column of the DataFrame with the integer value 4. This is likely done because the ID value was read in as a string rather than an integer.
Next, the code uses the .dropna() method to remove any rows with missing values in the “SALARY($)” column of the DataFrame. The subset parameter is set to [‘SALARY($)’], which means that the dropna() method will only consider missing values in this column. The inplace parameter is set to True, which means that the changes will be applied directly to the DataFrame mrx_df.
Finally, the code uses the .sort_values() method to sort the DataFrame by the “EMPLOYEE_NAME” column in descending order (ascending=False). This means that the employees with the highest names (in alphabetical order) will appear first in the sorted DataFrame. The sorted DataFrame is printed using the .to_string() method, which returns a string representation of the DataFrame.