Handling missing data in a dataset – Beyond Knowledge Innovation

There are many ways to address missing data, each with pros and cons.

Let’s take a look at the less complex options:

Option 1: Delete rows with missing data.

When a model cannot handle missing data, the most straightforward fix is to remove the rows that have missing values.

Let’s remove the rows where the Embarked column is empty. Only two rows are affected.

# Create a "clean" dataset, where we cumulatively fix missing values
# Start by removing rows ONLY where "Embarked" has no values
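print("The original size of our dataset was", dataset.shape)
clean_dataset = dataset.dropna(subset=["Embarked"])
# Renumber the remaining rows 0..n-1 (reindex() with no arguments leaves the index unchanged)
clean_dataset = clean_dataset.reset_index(drop=True)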

# How many rows do we have now?
print("The shape for the clean dataset is", clean_dataset.shape)

The original size of our dataset was (891, 13)
The shape for the clean dataset is (889, 13)

We can see that this removed the offending two rows from our new, clean dataset.
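To see why the subset argument matters, here is a minimal sketch on a small, made-up DataFrame (the toy values are hypothetical and are not the Titanic data used above). It compares dropping rows only where Embarked is missing with dropping rows that have a gap in any column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the dataset used in this article
toy = pd.DataFrame({
    "Embarked": ["S", None, "C", "Q"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, None],
})

# Drop rows only where "Embarked" is missing: one row is lost
subset_drop = toy.dropna(subset=["Embarked"]).reset_index(drop=True)

# Drop rows with a missing value in ANY column: every row has a gap here
full_drop = toy.dropna()

print("subset only:", subset_drop.shape)
print("any column: ", full_drop.shape)
```

On this toy data, the unrestricted dropna() removes every row, which is why restricting the drop to a column with few gaps is usually the safer first step.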

Option 2: Replace empty values with the mean or median of that column.

Sometimes, our model cannot handle missing values, and we also cannot afford to remove too much data. In this case, we can fill in missing values with an average calculated from the rest of the dataset. Note that imputing data like this can hurt model performance, because the filled-in values are estimates rather than observations. Usually, it’s better to simply remove missing data, or to use a model designed to handle missing values.

Below, we impute data for the Age field. We use the mean Age of the rows that have a value, given that more than 80% of the rows do:

# Calculate the mean value for the Age column
mean_age = clean_dataset["Age"].mean()

print("The mean age is", mean_age)

# Replace empty values in "Age" with the mean calculated above
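# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update the DataFrame
clean_dataset["Age"] = clean_dataset["Age"].fillna(mean_age)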

# Let's see what the clean dataset looks like now
print(clean_dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))

The mean age is 29.64209269662921
             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                687
Embarked               0
Age_2                  0
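The example above uses the mean, but the median is the other choice named in this option. The following is a minimal sketch, on a hypothetical series of ages (not the article's dataset), of how the two differ when a column contains an outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one outlier (80) and two missing values
ages = pd.Series([20.0, 22.0, 24.0, 80.0, np.nan, np.nan], name="Age")

mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # middle of the observed values

# fillna returns a new Series; the original is left untouched
filled_with_mean = ages.fillna(mean_age)
filled_with_median = ages.fillna(median_age)

print("mean:", mean_age, "median:", median_age)
```

Here the mean is 36.5 while the median is 23.0, so the median stays closer to the typical observed age. For skewed columns, that can make it the more representative fill value.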

Option 3: Assign a new category to unknown categorical data

In our dataset, the Cabin field is a categorical field, because the cabins have a finite number of possible options. Unfortunately, many records have no cabin listed.

For this exercise, it makes perfect sense to create an Unknown category, and assign it to the cases where the cabin is unknown:

# Assign unknown to records where "Cabin" is empty
# Assign the result back rather than using inplace=True on a selected column
clean_dataset["Cabin"] = clean_dataset["Cabin"].fillna("Unknown")

# Let's see what the clean dataset looks like now
print(clean_dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))

             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                  0
Embarked               0
Age_2                  0

That’s it! No more missing data!

We only lost two records (where Embarked was empty).

That said, we had to make some approximations to fill the gaps in the Age and Cabin columns, and those can influence the performance of any model we train on this data.

Summary

Missing values can degrade the performance of a machine learning model. It’s important to quickly check whether a dataset has gaps, and where those gaps are.

Using lists and charts, you can now get a “big picture” of what is missing and select only the items that you must address. In this unit, we covered:

Finding and visualizing missing values in a dataset, using the pandas and missingno packages.
Checking whether a dataset uses the value ‘0’ to represent missing values.
Handling missing data in three ways: removing rows that contain missing values, replacing missing values with the mean or median of that feature, and creating a new Unknown category for categorical data.
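As a reminder of the zero check mentioned above, here is a minimal sketch on made-up data. The column values, and the assumption that an Age of 0 means "unknown", are hypothetical and should be verified against the real dataset's documentation before being applied:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Age": [22.0, 0.0, 35.0, 0.0],
    "SibSp": [1, 0, 0, 2],  # 0 is a legitimate value here
})

# Count zeros per column to spot suspicious placeholders
zero_counts = (toy == 0).sum()
print(zero_counts)

# Assumption: in this toy data, an Age of 0 stands for a missing value
toy["Age"] = toy["Age"].replace(0, np.nan)
print("Missing ages:", toy["Age"].isnull().sum())
```

Only the Age column is converted. The zeros in SibSp are left alone, because a count of zero siblings or spouses is a valid observation rather than a gap.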
