Handling missing data in a dataset

January 16, 2024

There are many ways to address missing data, each with pros and cons.

Let’s take a look at the simpler options:

Option 1: Delete rows with missing data.

When we have a model that cannot handle missing data, the most prudent option is to remove the rows that have missing values.

Let’s drop the rows where the Embarked column is empty; only two rows are affected.

# Create a "clean" dataset, where we cumulatively fix missing values
# Start by removing rows ONLY where "Embarked" has no values
print("The original size of our dataset was", dataset.shape)
clean_dataset = dataset.dropna(subset=["Embarked"])

# Reset the row index so it is contiguous again after dropping rows
clean_dataset = clean_dataset.reset_index(drop=True)

# How many rows do we have now?
print("The shape for the clean dataset is", clean_dataset.shape)
The original size of our dataset was (891, 13)
The shape for the clean dataset is (889, 13)

We can see that this removed the two offending rows from our new, clean dataset.
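
As an aside (not part of the original walkthrough), dropna has a few other variants that are worth knowing when more than one column contains gaps; the threshold below is just an illustrative assumption:

# A minimal sketch of other dropna variants (illustrative, not used in the walkthrough above)

# Drop any row that has at least one missing value, in any column
no_gaps_at_all = dataset.dropna(how="any")

# Drop rows only when EVERY column is missing
only_fully_empty = dataset.dropna(how="all")

# Keep rows that have at least 10 non-missing values (the threshold is an arbitrary example)
mostly_complete = dataset.dropna(thresh=10)

# Drop whole columns instead of rows (here, any column containing a missing value)
complete_columns = dataset.dropna(axis=1, how="any")

print(no_gaps_at_all.shape, only_fully_empty.shape, mostly_complete.shape, complete_columns.shape)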

Option 2: Replace empty values with the mean or median for that data.

Sometimes our model cannot handle missing values, but we also cannot afford to remove too much data. In that case, we can fill in the missing entries with an average calculated from the rest of the dataset. Note that imputing data like this can hurt model performance; where possible, it’s usually better to remove the missing data or to use a model designed to handle missing values.

Below, we impute data for the Age field. We use the mean Age from the remaining rows, given that more than 80% of them are not empty:

# Calculate the mean value for the Age column
mean_age = clean_dataset["Age"].mean()

print("The mean age is", mean_age)

# Replace empty values in "Age" with the mean calculated above
# (assigning back avoids chained-assignment issues with inplace=True)
clean_dataset["Age"] = clean_dataset["Age"].fillna(mean_age)

# Let's see what the clean dataset looks like now
print(clean_dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))
The mean age is 29.64209269662921
             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                687
Embarked               0
Age_2                  0
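
The option above also mentions the median; a minimal sketch of median imputation, applied to a fresh copy so it does not interfere with the steps already performed, might look like this:

# Alternative sketch: impute "Age" with the median instead of the mean
# (applied to a fresh copy so it does not interact with clean_dataset above)
median_example = dataset.dropna(subset=["Embarked"]).copy()
median_age = median_example["Age"].median()
print("The median age is", median_age)

# Assign the result back rather than using inplace=True
median_example["Age"] = median_example["Age"].fillna(median_age)
print(median_example["Age"].isnull().sum(), "empty Age cells remain")

The median is often preferred when the column is skewed, since it is less sensitive to outliers than the mean.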

Option 3: Assign a new category to unknown categorical data

In our dataset, the Cabin field is a categorical field, because the cabins have a finite number of possible options. Unfortunately, many records have no cabin listed.

For this exercise, it makes perfect sense to create an Unknown category, and assign it to the cases where the cabin is unknown:

# Assign "Unknown" to records where "Cabin" is empty
# (again, assigning back rather than using inplace=True)
clean_dataset["Cabin"] = clean_dataset["Cabin"].fillna("Unknown")

# Let's see what the clean dataset looks like now
print(clean_dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))
             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                  0
Embarked               0
Age_2                  0
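
As a quick sanity check (an extra step, not in the original post), we can confirm that the new category is present and see how common it is:

# Quick check: how often does each Cabin value appear now?
# "Unknown" should be by far the most frequent value, since 687 rows had no cabin listed
print(clean_dataset["Cabin"].value_counts().head())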

That’s it! No more missing data!

We only lost two records (where Embarked was empty).

That said, we had to make some approximations to fill the gaps in the Age and Cabin columns, and those will certainly influence the performance of any model we train on this data.

Summary

Missing values can negatively affect the way a machine learning model works. It’s important to quickly verify whether a dataset has gaps, and where those gaps are.

Using lists and charts, you can now get a “big picture” of what is missing and select only the items you need to address (a brief recap sketch follows this list):

  • Finding and visualizing missing values in a dataset, using the pandas and missingno packages.
  • Checking whether a dataset uses the value ‘0’ to represent missing values.
  • Handling missing data in three ways: removing rows that contain missing values, replacing missing values with the mean or median of that feature, and creating a new Unknown category when dealing with categorical data.
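
Here is that recap as a minimal sketch, assuming the missingno package is installed and the Titanic-style data is loaded into dataset:

import missingno as msno
import matplotlib.pyplot as plt

# 1. Count missing values per column with pandas
print(dataset.isnull().sum())

# 2. Visualize where the gaps are with missingno
msno.matrix(dataset)
plt.show()

# 3. Check whether '0' might be standing in for missing values in numeric columns
print((dataset.select_dtypes(include="number") == 0).sum())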