Finding missing data in a dataset – Beyond Knowledge Innovation

Do we have a complete dataset in a real-world scenario?

No. We know from history that there is missing information in our data! How can we tell if the data we have available is complete?

We could print the entire dataset and inspect it by eye, but that invites human error and becomes impractical with this many samples. A better option is to use pandas to report the columns that have “empty” cells:

# Calculate the number of empty cells in each column
missing_data = dataset.isnull().sum().to_frame()

# Rename column holding the sums
missing_data = missing_data.rename(columns={0:'Empty Cells'})

# Print the results
print(missing_data)

             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                  177
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                687
Embarked               2

It looks like we don’t know the age of 177 passengers, and we don’t know the port of embarkation for two of them. Cabin information is also missing for a whopping 687 passengers.
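Raw counts are easier to judge relative to the size of the dataset. The sketch below shows one way to express missing cells as a percentage per column. The `toy` DataFrame and its values are hypothetical stand-ins for the real dataset, used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks each cell True/False; the mean of booleans is the
# fraction of True values, so multiplying by 100 gives a percentage
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```

Applied to the real data, the same expression would be `dataset.isnull().mean() * 100`.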

Missing Data Visualizations

Sometimes it helps to see whether the missing data form some kind of pattern.

We can plot the absence of data in a few ways. One of the most helpful is to literally plot gaps in the dataset:

# import missingno package
import missingno as msno

# Plot a matrix chart, set chart and font size
msno.matrix(dataset, figsize=(10,5), fontsize=11)

The white gaps in the Age and Cabin columns of the chart show missing data.
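A pattern in the gaps can also be checked numerically. As a rough sketch using only pandas, the snippet below correlates the "is missing" indicators of two columns; a value near 1 would suggest the columns tend to be empty in the same rows. The `toy` data is hypothetical and only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Age and Cabin are often missing in the same rows
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0, 40.0],
    "Cabin": ["C85", np.nan, "E46", np.nan, np.nan, "B20"],
})

# Turn each cell into 1 (missing) or 0 (present), then correlate the columns
indicators = toy.isnull().astype(int)
missing_corr = indicators["Age"].corr(indicators["Cabin"])
print(missing_corr)
```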

Missing as Zero

Some datasets record missing values as zero, while in other datasets zero is a valid value. It is important to take the time to review your raw data before you run any analyses.
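If a review shows that zero really is a placeholder for "unknown", one option is to convert those zeros to NaN so that pandas skips them. This is a minimal sketch on hypothetical data; the assumption that 0 means "missing" has to be confirmed for each column, since it would be wrong for a column where zero is a legitimate value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where 0 was entered for unknown ages
toy = pd.DataFrame({"Age": [22.0, 0.0, 38.0, 0.0]})

# Assumption: 0 is a placeholder, not a real age, so mark it as missing
toy["Age_clean"] = toy["Age"].replace(0, np.nan)

mean_raw = toy["Age"].mean()          # zeros are counted as real ages
mean_clean = toy["Age_clean"].mean()  # NaN values are skipped
print(mean_raw, mean_clean)
```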

In the example below, the second calculation treats the filled-in values of 0 as actual ages rather than as ‘missing’:

import numpy as np

# Print out the average age of passengers for whom we have age data
mean_age = np.mean(dataset.Age)
print("The average age on the ship was", mean_age, "years old")

# Now, add another column where missing ages are replaced with 0
dataset['Age_2'] = dataset['Age'].fillna(0)
mean_age = np.mean(dataset.Age_2)
print("The average age on the ship was", mean_age, "years old")

The average age on the ship was 29.69911764705882 years old
The average age on the ship was 23.79929292929293 years old
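The drop of roughly six years comes entirely from the 177 zeros that were added to the denominator. As a quick sanity check, the arithmetic below recovers the number of known ages from the two printed means. It uses only the figures shown above; the variable names are chosen here for illustration.

```python
# Figures taken from the outputs above
mean_known = 29.69911764705882    # mean over passengers with a recorded age
mean_filled = 23.79929292929293   # mean after filling missing ages with 0
n_missing = 177                   # empty Age cells

# Filling with 0 leaves the sum of ages unchanged but enlarges the count:
#   mean_filled = mean_known * n_known / (n_known + n_missing)
# Solving for n_known:
ratio = mean_filled / mean_known
n_known = n_missing * ratio / (1 - ratio)

print(round(n_known))              # 714 passengers with a recorded age
print(round(n_known) + n_missing)  # 891 rows in total
```

The two means are consistent with each other, which confirms that the second figure is an artifact of treating missing values as zero rather than a real difference in the passengers' ages.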

