Pandas Clean Empty Cells
This article explains how to clean data with Pandas, using empty cells as the example. Handling missing data is an essential part of data cleaning, because empty or null values can reduce the quality and accuracy of an analysis.
Identifying Empty Cells
Before we can clean empty cells, we need to identify them in our data.
Pandas provides several functions to check for missing or null values.
Here are some common functions used for identifying empty cells in Pandas:
isnull() | This function returns a boolean DataFrame showing which cells contain null values.
notnull() | This function returns the inverse of isnull(), showing which cells do not contain null values.
isna() | This function is an alias for isnull().
notna() | This function is an alias for notnull().
Below is an example of using the isnull() function to identify empty cells in a Pandas DataFrame:
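A minimal sketch of that check, assuming a small hand-built DataFrame (the column names and values are hypothetical and only serve the illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; None and np.nan both count as missing values.
df = pd.DataFrame({
    "NAME": ["Ann", "Bob", None],
    "AGE": [28, np.nan, 35],
})

# Boolean DataFrame: True wherever a cell is empty.
mask = df.isnull()
print(mask)

# Number of empty cells per column.
missing_per_column = mask.sum()
print(missing_per_column)

# isna() gives the same result as isnull(); notnull() gives the inverse.
same_as_isna = mask.equals(df.isna())
inverse_matches = df.notnull().equals(~mask)
```

Summing the boolean mask is a quick way to see how many cells are missing in each column before deciding how to clean them.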
Pandas Cleaning Empty Cells
Now that we have identified the empty cells in our data, we can proceed to clean them up.
Here are some techniques for cleaning empty cells in Pandas:
Removing Rows with Empty Cells
One way to deal with empty cells is to remove the rows that contain them.
This is usually acceptable when the data set is large, because removing a few rows will not have a great effect on the result.
For our cleaning examples, we will work with a CSV file called ‘ambiguous_data.csv’.
Remove the rows that contain empty cells from the ambiguous_data.csv data set:
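The sketch below shows one way to do this with dropna(). Because the contents of ambiguous_data.csv are not shown here, it reads a hypothetical inline stand-in instead of the file; only the “USE” column name comes from this article, and the other columns and all values are invented.

```python
import io
import pandas as pd

# Hypothetical stand-in for ambiguous_data.csv (invented rows and columns).
csv_text = """NAME,USE,DOWNLOADS
AppA,Mobile Application,1200
AppB,,800
AppC,Web Application,
AppD,Desktop Application,450
"""
df = pd.read_csv(io.StringIO(csv_text))

# dropna() returns a new DataFrame; df itself is left unchanged.
new_df = df.dropna()
print(new_df)
```

By default dropna() removes every row that has at least one empty cell, so only the fully populated rows remain in new_df.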
To modify the original DataFrame instead of getting a new one back, use the inplace=True argument.
Remove all rows that contain empty values:
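A short sketch of the in-place variant, again assuming hypothetical data in place of ambiguous_data.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for ambiguous_data.csv.
df = pd.DataFrame({
    "NAME": ["AppA", "AppB", "AppC", "AppD"],
    "USE": ["Mobile Application", None, "Web Application", "Desktop Application"],
    "DOWNLOADS": [1200, 800, np.nan, 450],
})

# With inplace=True, dropna() returns None and changes df directly.
result = df.dropna(inplace=True)
print(df)
```

Note that the remaining rows keep their original index labels, so the index may have gaps after the call.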
Filling Empty Cells with a Specific Value
Another way to deal with empty cells is to insert a new value in their place.
This way, you do not have to delete an entire row just because one of its cells is empty.
The fillna() method lets us replace empty cells with a value.
In the following example, we will replace empty values with “Mobile Application”:
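As a rough illustration, the sketch below fills every empty cell of a small hypothetical DataFrame. The “PLATFORM” column is an assumption added to show that the fill is not limited to one column.

```python
import pandas as pd

# Hypothetical data; "PLATFORM" is invented for this illustration.
df = pd.DataFrame({
    "NAME": ["AppA", "AppB", "AppC"],
    "USE": ["Web Application", None, "Desktop Application"],
    "PLATFORM": ["Android", "iOS", None],
})

# Replace every empty cell in the whole DataFrame with the same value.
df.fillna("Mobile Application", inplace=True)
print(df)
```

Here the empty “PLATFORM” cell also receives “Mobile Application”, which is rarely what you want for unrelated columns.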
Replace Only Specified Columns
Note that the example above replaces all empty cells in the whole DataFrame.
To replace empty values in only one column, specify the column name of the DataFrame.
In the “USE” column, replace empty values with “Mobile Application”:
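One way to restrict the fill to a single column is to pass fillna() a dictionary that maps column names to fill values. This is a sketch on hypothetical data, not necessarily the form the original example used:

```python
import pandas as pd

# Hypothetical data; "PLATFORM" is invented for this illustration.
df = pd.DataFrame({
    "NAME": ["AppA", "AppB", "AppC"],
    "USE": ["Web Application", None, "Desktop Application"],
    "PLATFORM": ["Android", "iOS", None],
})

# Only the "USE" column is filled; other columns keep their empty cells.
df.fillna({"USE": "Mobile Application"}, inplace=True)
print(df)
```

Calling fillna() on the DataFrame with a dictionary avoids selecting the column first. In recent Pandas versions, df["USE"].fillna(..., inplace=True) may act on a temporary copy and leave df unchanged.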
Replace Using Mean, Median, or Mode
A common way to replace empty cells in a column is to compute the mean, median, or mode of that column and use it as the fill value.
Pandas provides the mean(), median(), and mode() methods to compute these values for a particular column.
In the following examples, we will use a CSV file called ‘empWrong_data.csv’.
In the Employee Wrong data set, replace any null values with the MEAN.
Mean = the average value (the sum of all values divided by the number of values).
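A minimal sketch of the mean replacement. The inline CSV text is a hypothetical stand-in for empWrong_data.csv, and applying the fill to the “SALARY($)” column is an assumption based on the column named later in this article.

```python
import io
import pandas as pd

# Hypothetical stand-in for empWrong_data.csv (invented rows).
csv_text = """NAME,SALARY($)
Ann,4000
Bob,
Cara,5000
Dan,6000
"""
mrx_df = pd.read_csv(io.StringIO(csv_text))

# mean() skips missing values: (4000 + 5000 + 6000) / 3 = 5000.0
mean_salary = mrx_df["SALARY($)"].mean()

# Fill only the salary column with the computed mean.
mrx_df.fillna({"SALARY($)": mean_salary}, inplace=True)
print(mrx_df)
```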
Replace any null values with the MEDIAN.
Median = the value in the middle after all values have been sorted in ascending order.
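The same idea with the median, sketched on hypothetical rows that include one unusually large salary to show why the median is sometimes preferred over the mean:

```python
import io
import pandas as pd

# Hypothetical stand-in for empWrong_data.csv (invented rows).
csv_text = """NAME,SALARY($)
Ann,3000
Bob,4000
Cara,
Dan,5000
Eve,20000
"""
mrx_df = pd.read_csv(io.StringIO(csv_text))

# Sorted non-missing values: 3000, 4000, 5000, 20000.
# With an even count, the median is the average of the two middle values.
median_salary = mrx_df["SALARY($)"].median()

mrx_df.fillna({"SALARY($)": median_salary}, inplace=True)
print(mrx_df)
```

With these numbers the median is 4500, while the mean would be 8000 because of the single large value.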
Compute the MODE and use it to replace any null values.
Mode = the value that occurs most often in the data set.
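A sketch of the mode replacement, using the names mrx_df, ample, and “SALARY($)” mentioned in this article. The inline CSV text is a hypothetical stand-in for empWrong_data.csv, and the dictionary form of fillna() is one possible way to write the in-place fill.

```python
import io
import pandas as pd

# Hypothetical stand-in for empWrong_data.csv (invented rows).
csv_text = """NAME,SALARY($)
Ann,4000
Bob,5000
Cara,
Dan,4000
Eve,6000
"""
mrx_df = pd.read_csv(io.StringIO(csv_text))

# mode() returns a Series because several values can tie; take the first.
ample = mrx_df["SALARY($)"].mode()[0]

# Replace missing salaries with the mode, modifying mrx_df directly.
mrx_df.fillna({"SALARY($)": ample}, inplace=True)
print(mrx_df)
```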
The mode example loads a CSV file named “empWrong_data.csv” using the Pandas read_csv function and creates a DataFrame named mrx_df from the data. The data in the CSV file likely represents employee records, including their salaries.
The code then calculates the mode of the “SALARY($)” column of the DataFrame mrx_df using the mode() method and stores it in a variable named ample. The mode is the most frequently occurring value in the column. Because a column can have more than one mode, mode() returns a Series, and a single value (typically the first) has to be taken from it.
Next, the code uses the fillna() method to replace any missing values in the “SALARY($)” column of the DataFrame with the value stored in the ample variable. The inplace parameter is set to True, which means that the changes will be applied directly to the DataFrame mrx_df.