Pandas Fix Wrong Data
In this article, we will explore how to fix wrong data in Pandas and explain how each fix works.
Fix Wrong Data
Incorrect data does not have to be a null cell or a badly formatted value. It can simply be wrong, such as “$400” in place of “$40.0”.
You can sometimes spot inaccurate data just by looking at the data set, because you already have a sense of what the data should look like.
The employee name in row 8 of our data set consists of digits, whereas every other row holds a name made of letters.
That is not necessarily an error, but since a person is unlikely to have a name made of digits, the value is most likely wrong.
   EMPLOYEE_ID EMPLOYEE_NAME  SALARY($)
0            1         Harry      400.0
1            2      Jonathan      300.0
2            3        Miguel      320.0
3          '4'          Erin      250.0
4            5          Emma      280.0
5            6           Lia        NaN
6            7      Samantha      300.0
7            7      Samantha      300.0
8            8            69      280.0
9            9        Dustin      370.0
10          10         Steve        NaN
What can we do to resolve incorrect values, such as the one for “EMPLOYEE_NAME” in row 8?
Replacing wrong values with correct ones is one way to resolve them.
In our example the value should be “Kate” rather than “69”, so we can simply set “Kate” in row 8:
In row 8, change “EMPLOYEE_NAME” to “Kate”:
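A minimal sketch of this single-cell fix. The DataFrame is rebuilt inline from the table above so the snippet runs on its own; the inline construction is an assumption, not the article's original loading code.

```python
import pandas as pd

# Assumed inline copy of the sample data shown in the table above.
mrx_df = pd.DataFrame({
    "EMPLOYEE_ID": [1, 2, 3, "'4'", 5, 6, 7, 7, 8, 9, 10],
    "EMPLOYEE_NAME": ["Harry", "Jonathan", "Miguel", "Erin", "Emma", "Lia",
                      "Samantha", "Samantha", 69, "Dustin", "Steve"],
    "SALARY($)": [400.0, 300.0, 320.0, 250.0, 280.0, float("nan"),
                  300.0, 300.0, 280.0, 370.0, float("nan")],
})

# Overwrite the single cell at row label 8, column EMPLOYEE_NAME.
mrx_df.loc[8, "EMPLOYEE_NAME"] = "Kate"

print(mrx_df.to_string())
```

Here .loc addresses the cell by row label and column name, so only that one value changes.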
In row 3, modify “EMPLOYEE_ID” to 4:
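The same pattern can be sketched for the quoted ID. The inline data is again an assumed copy of the table, and the final dtype conversion is an optional extra step that the article does not describe.

```python
import pandas as pd

# Assumed inline copy of the sample data; row 3 holds the quoted string '4'.
mrx_df = pd.DataFrame({
    "EMPLOYEE_ID": [1, 2, 3, "'4'", 5, 6, 7, 7, 8, 9, 10],
    "EMPLOYEE_NAME": ["Harry", "Jonathan", "Miguel", "Erin", "Emma", "Lia",
                      "Samantha", "Samantha", 69, "Dustin", "Steve"],
    "SALARY($)": [400.0, 300.0, 320.0, 250.0, 280.0, float("nan"),
                  300.0, 300.0, 280.0, 370.0, float("nan")],
})

# Replace the string value in row 3 with the integer 4.
mrx_df.loc[3, "EMPLOYEE_ID"] = 4

# Optional: every ID is now an integer, so give the column an integer dtype.
mrx_df["EMPLOYEE_ID"] = mrx_df["EMPLOYEE_ID"].astype(int)

print(mrx_df.to_string())
```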
You can update incorrect values one by one in a small data set, but that is impractical for a large one.
When working with large data sets, you can apply rules instead.
For example, define boundaries for legal values and replace any value that falls outside those boundaries with an appropriate one.
Assign a salary of $250 to employees whose salary is less than $320:
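One way to express this rule is a boolean mask combined with .loc. This is a sketch on an assumed inline copy of the sample data; a row-by-row loop over mrx_df.index would give the same result.

```python
import pandas as pd

# Assumed inline copy of the sample data shown in the table above.
mrx_df = pd.DataFrame({
    "EMPLOYEE_ID": [1, 2, 3, "'4'", 5, 6, 7, 7, 8, 9, 10],
    "EMPLOYEE_NAME": ["Harry", "Jonathan", "Miguel", "Erin", "Emma", "Lia",
                      "Samantha", "Samantha", 69, "Dustin", "Steve"],
    "SALARY($)": [400.0, 300.0, 320.0, 250.0, 280.0, float("nan"),
                  300.0, 300.0, 280.0, 370.0, float("nan")],
})

# True for rows whose salary is below 320; NaN compares as False.
below_limit = mrx_df["SALARY($)"] < 320

# Overwrite only the salaries selected by the mask.
mrx_df.loc[below_limit, "SALARY($)"] = 250

print(mrx_df.to_string())
```

Rows with a missing salary are left untouched, because a comparison with NaN evaluates to False.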
Set every employee’s salary to $350:
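Assigning a scalar to a column sets that value in every row. A minimal sketch, again on an assumed inline copy of the sample data:

```python
import pandas as pd

# Assumed inline copy of the sample data shown in the table above.
mrx_df = pd.DataFrame({
    "EMPLOYEE_ID": [1, 2, 3, "'4'", 5, 6, 7, 7, 8, 9, 10],
    "EMPLOYEE_NAME": ["Harry", "Jonathan", "Miguel", "Erin", "Emma", "Lia",
                      "Samantha", "Samantha", 69, "Dustin", "Steve"],
    "SALARY($)": [400.0, 300.0, 320.0, 250.0, 280.0, float("nan"),
                  300.0, 300.0, 280.0, 370.0, float("nan")],
})

# A scalar assignment is broadcast to every row of the column.
mrx_df["SALARY($)"] = 350

print(mrx_df.to_string())
```

Note that this also overwrites the rows where the salary was missing.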
The other way of dealing with incorrect data is to remove the rows that contain it from the data set.
With this approach you do not have to work out what to substitute, and there is a good chance you do not need those rows for your analysis in the first place.
If SALARY($) is greater than 350, eliminate the row:
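A sketch of this removal using a boolean mask and .drop. The inline data is an assumed copy of the table, and the mask-based form is one of several equivalent ways to write it.

```python
import pandas as pd

# Assumed inline copy of the sample data shown in the table above.
mrx_df = pd.DataFrame({
    "EMPLOYEE_ID": [1, 2, 3, "'4'", 5, 6, 7, 7, 8, 9, 10],
    "EMPLOYEE_NAME": ["Harry", "Jonathan", "Miguel", "Erin", "Emma", "Lia",
                      "Samantha", "Samantha", 69, "Dustin", "Steve"],
    "SALARY($)": [400.0, 300.0, 320.0, 250.0, 280.0, float("nan"),
                  300.0, 300.0, 280.0, 370.0, float("nan")],
})

# Row labels of every employee earning more than 350.
too_high = mrx_df[mrx_df["SALARY($)"] > 350].index

# Drop those rows; rows with a missing salary are kept.
mrx_df = mrx_df.drop(too_high)

print(mrx_df.to_string())
```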
Remove the rows where the employee salary is equal to $300:
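The following sketch loops over the rows and drops each one whose salary equals 300. The CSV text is embedded in the snippet as a stand-in for empWrong_data.csv, whose exact contents are assumed from the table shown earlier.

```python
import io

import pandas as pd

# Stand-in for the file empWrong_data.csv; the contents are an assumption
# reconstructed from the table in this article.
csv_text = """EMPLOYEE_ID,EMPLOYEE_NAME,SALARY($)
1,Harry,400.0
2,Jonathan,300.0
3,Miguel,320.0
'4',Erin,250.0
5,Emma,280.0
6,Lia,
7,Samantha,300.0
7,Samantha,300.0
8,69,280.0
9,Dustin,370.0
10,Steve,
"""

# With the real file this would be pd.read_csv("empWrong_data.csv").
mrx_df = pd.read_csv(io.StringIO(csv_text))

# Visit each row label and drop the row when its salary equals 300.
for x in mrx_df.index:
    if mrx_df.loc[x, "SALARY($)"] == 300:
        mrx_df.drop(x, inplace=True)

print(mrx_df.to_string())
```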
The example above loads a CSV file named empWrong_data.csv with the Pandas read_csv function and creates a DataFrame named mrx_df from the data. The data in the CSV file represents employee records, including their salaries.
It then iterates over the row labels of mrx_df using the DataFrame's .index attribute. For each row, it uses the .loc indexer to read the value in the SALARY($) column and checks whether it equals 300. If the condition is true, it drops the row with the .drop method. Because the inplace parameter is set to True, the change is applied directly to mrx_df rather than to a copy.
Finally, the code prints the updated DataFrame using the .to_string() method, which returns a string representation of the DataFrame.