There are many ways to address missing data, each with pros and cons.
Let’s take a look at the less complex options:
Option 1: Delete rows with missing data.
When we have a model that cannot handle missing data, the most prudent thing to do is to remove rows that have information missing.
Let’s remove the rows where the Embarked column is empty. Only two rows are affected.
# Create a "clean" dataset, where we cumulatively fix missing values
# Start by removing rows ONLY where "Embarked" has no values
print("The original size of our dataset was", dataset.shape)
clean_dataset = dataset.dropna(subset=["Embarked"])
# Renumber the remaining rows so the index has no gaps
clean_dataset = clean_dataset.reset_index(drop=True)
# How many rows do we have now?
print("The shape for the clean dataset is", clean_dataset.shape)
The original size of our dataset was (891, 13)
The shape for the clean dataset is (889, 13)
We can see that this removed the offending two rows from our new, clean dataset.
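Before deleting rows, it can help to check how many rows each column would cost. The sketch below uses a small made-up DataFrame (the `toy` name and its values are assumptions, not the real dataset) to compare the share of missing values per column with the number of rows that would survive a `dropna` on that column alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "Q", "S"],
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.46, 53.10],
})

# Fraction of rows that are missing in each column
missing_fraction = toy.isnull().mean()

# Number of rows left if we dropped rows based on each column alone
rows_left = {col: len(toy.dropna(subset=[col])) for col in toy.columns}

print(missing_fraction)
print(rows_left)
```

A column that is missing in only a small share of rows, like Embarked here, is a reasonable candidate for row deletion. A column with many gaps is usually better handled another way.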
Option 2: Replace empty values with the mean or median for that data.
Sometimes our model cannot handle missing values, and we also cannot afford to remove too much data. In this case, we can fill in the missing data with an average calculated from the rest of the dataset. Note that imputing data like this can hurt model performance. Usually, it’s better to simply remove missing data, or to use a model designed to handle missing values.
Below, we impute data for the Age field. We use the mean Age from the remaining rows, given that more than 80% of these aren’t empty:
# Calculate the mean value for the Age column
mean_age = clean_dataset["Age"].mean()
print("The mean age is", mean_age)
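# Replace empty values in "Age" with the mean calculated above.
# Assign the result back rather than using inplace=True on a column
# selection, which newer pandas versions warn about.
clean_dataset["Age"] = clean_dataset["Age"].fillna(mean_age)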
# Let's see what the clean dataset looks like now
print(clean_dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))
The mean age is 29.64209269662921
Empty Cells
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
Age_2 0
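Option 2 names the median as an alternative to the mean. As a minimal sketch, the example below uses a hypothetical toy column (not the Titanic data) to show how the two differ when one value is an outlier. The column names `Age_mean_filled` and `Age_median_filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one outlier (80)
toy = pd.DataFrame({"Age": [20.0, 22.0, 24.0, np.nan, 80.0]})

mean_age = toy["Age"].mean()      # NaN is skipped by default
median_age = toy["Age"].median()  # NaN is skipped by default

# Fill the gap both ways, keeping the original column for comparison
toy["Age_mean_filled"] = toy["Age"].fillna(mean_age)
toy["Age_median_filled"] = toy["Age"].fillna(median_age)

print("Mean:", mean_age, "Median:", median_age)
print(toy)
```

The outlier pulls the mean well above most of the observed ages, while the median stays close to them. For skewed columns, the median is often the safer fill value.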
Option 3: Assign a new category to unknown categorical data.
In our dataset, the Cabin field is a categorical field, because the cabins have a finite number of possible options. Unfortunately, many records have no cabin listed.
For this exercise, it makes perfect sense to create an Unknown category, and assign it to the cases where the cabin is unknown:
# Assign unknown to records where "Cabin" is empty.
# Assign the result back rather than using inplace=True on a column selection.
clean_dataset["Cabin"] = clean_dataset["Cabin"].fillna("Unknown")
# Let's see what the clean dataset looks like now
print(clean_dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))
Empty Cells
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
Age_2 0
That’s it! No more missing data!
We only lost two records (where Embarked was empty).
That said, we had to make some approximations to fill the gaps in the Age and Cabin columns, and those will influence the performance of any model we train on this data.
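The three options can also be gathered into one reusable step. The sketch below is one possible arrangement, not the method used above: `clean_missing` is a hypothetical helper name, and the toy DataFrame is made-up data that only borrows the Embarked, Age, and Cabin column names.

```python
import numpy as np
import pandas as pd

def clean_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply Options 1-3 in order and return a new DataFrame."""
    # Option 1: drop rows where Embarked is missing
    out = df.dropna(subset=["Embarked"]).reset_index(drop=True)
    # Option 2: fill missing ages with the mean of the remaining rows
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    # Option 3: give missing cabins their own category
    out["Cabin"] = out["Cabin"].fillna("Unknown")
    return out

# Made-up data for illustration only
toy = pd.DataFrame({
    "Embarked": ["S", None, "C", "Q"],
    "Age": [30.0, 40.0, np.nan, 50.0],
    "Cabin": ["C85", None, None, "E46"],
})

cleaned = clean_missing(toy)
print(cleaned)
print(cleaned.isnull().sum())
```

Because the helper returns a new DataFrame, the original data stays untouched and the cleaned result can be compared against it.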
Summary
Missing values can degrade the performance of a machine learning model. It’s important to check early whether a dataset has gaps, and where those gaps are.
You can now use lists and charts to get a “big picture” of what is missing, and select only those items that you must address. We covered:
- Finding and visualizing missing dataset values, using the pandas and missingno packages.
- Checking whether a dataset uses the value ‘0’ to represent missing values.
- Handling missing data in three ways: removing rows that contain missing values, replacing the missing values with the mean or median of that particular feature, and creating a new Unknown category when dealing with categorical data.