Pandas Clean Empty Cells

This article explains how to clean data in Pandas, using empty cells as the example. Dealing with missing data is an essential part of data cleaning, because empty or null values can affect the quality and accuracy of an analysis.



Identifying Empty Cells

Before we can clean empty cells, we need to identify them in our data.

Pandas provides several functions to check for missing or null values.

Here are some common functions used for identifying empty cells in Pandas:

Function    Overview
isnull()    Returns a boolean DataFrame showing which cells contain null values.
notnull()   Returns the inverse of isnull(), showing which cells do not contain null values.
isna()      An alias for isnull().
notna()     An alias for notnull().
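The relationship between these four functions can be checked directly. The following is a minimal sketch on a small, made-up DataFrame (the names and ages are illustrative only):

```python
import pandas as pd

# Hypothetical sample data; None becomes a missing value in the Age column
df = pd.DataFrame({"Name": ["John", "Jane", "Mike"], "Age": [25, None, 40]})

null_mask = df.isnull()       # True where a cell is missing
not_null_mask = df.notnull()  # True where a cell holds a value

# isna() and notna() are aliases, so they produce identical masks
aliases_match = df.isna().equals(null_mask) and df.notna().equals(not_null_mask)

# notnull() is the element-wise negation of isnull()
is_inverse = (~null_mask).equals(not_null_mask)

print(aliases_match, is_inverse)
```

Because the aliases behave identically, which pair to use is a matter of style.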

Below is an example of using the isnull() function to identify empty cells in a Pandas DataFrame:

Example: 

import pandas as pd

data = {'Name': ['John', 'Jane', 'Mike', 'Lisa', 'Tom'],
        'Age': [25, 30, None, 40, 50],
        'Salary': [50000, None, 70000, 80000, None]}
df = pd.DataFrame(data)

# True marks a cell that holds a null value
print(df.isnull())
Note: The example above outputs a boolean DataFrame with True in every cell that contains a null value.
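A boolean mask is hard to read on larger tables, so it is often summarized. The sketch below, which reuses the same sample data, counts the missing values per column and selects the rows that contain at least one missing value:

```python
import pandas as pd

data = {'Name': ['John', 'Jane', 'Mike', 'Lisa', 'Tom'],
        'Age': [25, 30, None, 40, 50],
        'Salary': [50000, None, 70000, 80000, None]}
df = pd.DataFrame(data)

# Summing a boolean mask counts the True values in each column
missing_per_column = df.isnull().sum()

# any(axis=1) is True for rows with at least one missing cell
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```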

Pandas Cleaning Empty Cells

Now that we have identified the empty cells in our data, we can proceed to clean them up.

Here are some techniques for cleaning empty cells in Pandas:

Removing Rows with Empty Cells

One way to deal with empty cells is to remove the rows that contain them.

This is usually acceptable when the data set is large, because removing a few rows will not have a great effect on the result.
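Whether dropping rows is acceptable depends on how many rows would be lost. As a rough check, assuming a small made-up DataFrame in place of a real data set, the share of rows that dropna() would remove can be computed before anything is deleted:

```python
import pandas as pd

# Hypothetical sample data used only for illustration
df = pd.DataFrame({
    "Name": ["John", "Jane", "Mike", "Lisa", "Tom"],
    "Age": [25, 30, None, 40, 50],
    "Salary": [50000, None, 70000, 80000, None],
})

rows_before = len(df)
rows_after = len(df.dropna())  # dropna() returns a copy; df is unchanged
share_lost = (rows_before - rows_after) / rows_before

print(f"dropna() would remove {rows_before - rows_after} of {rows_before} rows ({share_lost:.0%})")
```

In this tiny sample most rows would be lost, which suggests filling the values instead; on a large data set with few gaps the share is typically much smaller.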

For our cleaning examples, we will work with a CSV file named ‘ambiguous_data.csv’.


Remove every row that contains an empty cell from the ambiguous_data.csv data set:

Example: 

import pandas as pds

mrx_df = pds.read_csv('ambiguous_data.csv')

# dropna() returns a new DataFrame without the rows that contain empty cells
ample_df = mrx_df.dropna()

print(ample_df.to_string())
Reminder: By default, the dropna() method returns a new DataFrame and does not modify the original DataFrame.
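This behavior can be verified without the CSV file. The sketch below uses a small made-up DataFrame (the "Hours" column is an assumption for illustration) and compares the row counts before and after:

```python
import pandas as pd

# Hypothetical data; "USE" mirrors the tutorial's column name, "Hours" is invented
original = pd.DataFrame({"USE": ["Web", None, "Game"], "Hours": [10.0, 12.0, None]})

cleaned = original.dropna()

# The original still has all three rows; only the returned copy is shorter
print(len(original), len(cleaned))
```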

To modify the original DataFrame instead, pass the inplace = True argument:

Remove all rows that contain empty values:

Example: 

import pandas as pds

mrx_df = pds.read_csv('ambiguous_data.csv')

# inplace = True removes the rows from mrx_df itself
mrx_df.dropna(inplace = True)

print(mrx_df.to_string())
Reminder: With inplace = True, dropna() does not return a new DataFrame; instead, it removes every row that holds empty values from the original DataFrame.
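One consequence is easy to overlook: with inplace = True the method returns None, so its result should not be assigned to a variable. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data; "Hours" is an invented column for illustration
df = pd.DataFrame({"USE": ["Web", None, "Game"], "Hours": [10.0, 12.0, None]})

result = df.dropna(inplace=True)

print(result)   # the in-place call returns None
print(len(df))  # df itself now holds only the complete rows
```

Writing df = df.dropna(inplace=True) would therefore replace the DataFrame with None.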

Filling Empty Cells with a Specific Value

You can also deal with empty cells by inserting a new value instead of leaving them empty.

That way, you do not need to remove a complete row just because one of its cells is empty.

The fillna() method substitutes empty cells with a value that you specify.

In the following example, we will substitute empty values with “Mobile Application”:

Example: 

import pandas as pds

mrx_df = pds.read_csv('ambiguous_data.csv')

# Replace every empty cell in the DataFrame with the same text
mrx_df.fillna("Mobile Application", inplace = True)

print(mrx_df.to_string())

Replace Only Specified Columns

It is important to note that the example above substitutes all empty cells throughout the whole DataFrame.

It is possible to substitute empty values in only one column by specifying the column name, as follows:

In the “USE” column, substitute empty values with “Mobile Application”:
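The effect of filling the whole DataFrame can be seen on a small made-up example. In this sketch the numeric "Hours" column is an assumption, added to show that a text fill value also lands in numeric columns:

```python
import pandas as pd

# Hypothetical data; "Hours" is an invented numeric column
df = pd.DataFrame({"USE": ["Web", None, "Game"], "Hours": [10.0, None, 8.0]})

# Without inplace, fillna() returns a filled copy
filled = df.fillna("Mobile Application")

print(filled)
# "Hours" now mixes numbers and text, so it is no longer a numeric column
print(pd.api.types.is_numeric_dtype(filled["Hours"]))
```

This is usually a reason to fill text and numeric columns separately.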

Example: 

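import pandas as pds

mrx_df = pds.read_csv('ambiguous_data.csv')

# Fill only the "USE" column; passing a column-to-value mapping to
# DataFrame.fillna() avoids calling an in-place method on a column selection,
# which recent pandas versions warn may not update mrx_df
mrx_df.fillna({"USE": "Mobile Application"}, inplace = True)

print(mrx_df)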
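A self-contained variant, sketched on made-up data, assigns the filled column back to the DataFrame. This is an alternative way to limit the fill to one column and does not depend on inplace; the "Hours" column is invented to show that other columns stay untouched:

```python
import pandas as pd

# Hypothetical data; "Hours" is an invented column for illustration
df = pd.DataFrame({"USE": ["Web", None, "Game"], "Hours": [10.0, None, 8.0]})

# Fill the "USE" column only and store the result back in the DataFrame
df["USE"] = df["USE"].fillna("Mobile Application")

print(df)
```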

Replace Using Mean, Median, or Mode

One of the most popular ways to replace null cells in a column is to use the mean, median, or mode value of that column.

Pandas provides the mean(), median(), and mode() methods to compute these values for a particular column.

The following examples use a CSV file called ‘empWrong_data.csv’.


In the Employee Wrong data set, if any values are null, substitute them with the MEAN:

Example: 

import pandas as pds

mrx_df = pds.read_csv('empWrong_data.csv')

# Compute the mean of the column; missing values are ignored
ample = mrx_df["SALARY($)"].mean()

# Fill only the "SALARY($)" column, directly in mrx_df
mrx_df.fillna({"SALARY($)": ample}, inplace = True)

print(mrx_df.to_string())

# In rows 5 and 10, you can observe that the blank values from "SALARY($)" have been filled with the mean value of 311.111111

Mean = the average value (the sum of all values divided by the number of values).

Substitute the MEDIAN for any null values:

Example: 

import pandas as pds

mrx_df = pds.read_csv('empWrong_data.csv')

# Compute the median of the column; missing values are ignored
ample = mrx_df["SALARY($)"].median()

# Fill only the "SALARY($)" column, directly in mrx_df
mrx_df.fillna({"SALARY($)": ample}, inplace = True)

print(mrx_df.to_string())

# In rows 5 and 10, you can observe that the blank values from "SALARY($)" have been filled with the median value of 300.0

Median = the value in the middle after all values have been sorted in ascending order.

Substitute the MODE for any null values:

Example: 

import pandas as pds

mrx_df = pds.read_csv('empWrong_data.csv')

# mode() returns a Series of the most frequent value(s); [0] takes the first one
ample = mrx_df["SALARY($)"].mode()[0]

# Fill only the "SALARY($)" column, directly in mrx_df
mrx_df.fillna({"SALARY($)": ample}, inplace = True)

print(mrx_df.to_string())

# In rows 5 and 10, you can observe that the blank values from "SALARY($)" have been filled with the mode value of 300.0

Mode = the value that occurs most often in the data set.
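The three definitions can be compared on a small worked example. The salary values below are made up for illustration and are not taken from empWrong_data.csv:

```python
import pandas as pd

# Hypothetical salaries with one missing value
salary = pd.Series([200.0, 300.0, None, 300.0, 500.0], name="SALARY($)")

# All three methods skip the missing value by default
mean_value = salary.mean()      # (200 + 300 + 300 + 500) / 4 = 325.0
median_value = salary.median()  # middle of 200, 300, 300, 500 is 300.0
mode_values = salary.mode()     # a Series, because several values can tie
mode_value = mode_values[0]     # first (here the only) mode: 300.0

print(mean_value, median_value, mode_value)
```

Because mode() returns a Series rather than a single number, the examples index it with [0] before passing it to fillna().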

Example Explanation

The given code loads a CSV file named “empWrong_data.csv” using the Pandas library’s read_csv function and creates a DataFrame named mrx_df from the data. The data in the CSV file likely represents employee records, including their salary.

The code then calculates the mode of the “SALARY($)” column of the DataFrame mrx_df using the mode() function and stores it in a variable named ample. The mode is the most frequently occurring value in the column.

Next, the code uses the fillna() method to replace any missing values in the “SALARY($)” column of the DataFrame with the value stored in the ample variable. The inplace parameter is set to True, which means that the changes will be applied directly to the DataFrame mrx_df.
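The steps described above can be run end to end without the original file. In this sketch, an in-memory CSV text stands in for empWrong_data.csv (its rows and the NAME column are assumptions), and the fill value is passed to DataFrame.fillna() as a column-to-value mapping so that only “SALARY($)” is affected:

```python
import io

import pandas as pd

# Hypothetical stand-in for empWrong_data.csv; the real file's contents differ
csv_text = """NAME,SALARY($)
Ann,300
Bob,
Cid,300
Dee,400
"""

mrx_df = pd.read_csv(io.StringIO(csv_text))  # the blank salary is read as missing

ample = mrx_df["SALARY($)"].mode()[0]        # most frequent salary

# Replace missing salaries directly in mrx_df
mrx_df.fillna({"SALARY($)": ample}, inplace=True)

print(mrx_df.to_string())
```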
