SimpleImputer is a class in scikit-learn, a popular machine learning library in Python, used for handling missing values in datasets. It provides a simple strategy for imputing missing values, such as filling missing entries with the mean, median, most frequent value, or a constant.
Here’s a basic example of how you might use SimpleImputer:
from sklearn.impute import SimpleImputer
import numpy as np
# Example dataset with missing values
X = np.array([[1, 2, np.nan],
              [3, np.nan, 4],
              [np.nan, 5, 6]])
# Create a SimpleImputer instance with strategy 'mean'
imputer = SimpleImputer(strategy='mean')
# Fit the imputer to the data and transform it
X_imputed = imputer.fit_transform(X)
print(X_imputed)
This code replaces the missing values in the dataset X with the mean of the observed values in the respective column (here 2, 3.5, and 5). You can replace 'mean' with 'median', 'most_frequent', or 'constant' as required. If you choose the 'constant' strategy, specify the replacement value with the fill_value parameter.
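As a minimal sketch of switching strategies, the snippet below reuses the same array with 'constant' and 'median'. The fill value of 0 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1, 2, np.nan],
              [3, np.nan, 4],
              [np.nan, 5, 6]])

# 'constant' replaces every missing entry with fill_value (0 is an arbitrary choice)
const_imputer = SimpleImputer(strategy="constant", fill_value=0)
X_const = const_imputer.fit_transform(X)

# 'median' uses the median of the observed values in each column
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)

print(X_const)
# statistics_ holds the per-column fill values learned during fit
print(median_imputer.statistics_)
```

After fitting, the statistics_ attribute is a convenient way to check which value was used for each column.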
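Here is another example, which assumes X_train is a pandas DataFrame with a numeric "income" column:
# impute the missing values with the median
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")
# fit the imputer on the train data and transform the train data
# (the double brackets pass a 2D, single-column DataFrame, as the imputer expects)
X_train["income"] = imp_median.fit_transform(X_train[["income"]])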
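If there is also a held-out set, a common pattern is to fit on the training data only and then call transform on the other set, so that both are filled with the training median. The sketch below assumes hypothetical X_train and X_test DataFrames with made-up income values. It also flattens the imputer's 2D output with ravel() before assigning it back to a single column, which is a defensive choice and not something the imputer requires.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the values are made up for illustration
X_train = pd.DataFrame({"income": [30000.0, np.nan, 50000.0, 70000.0]})
X_test = pd.DataFrame({"income": [np.nan, 45000.0]})

imp_median = SimpleImputer(missing_values=np.nan, strategy="median")

# Learn the median from the training data only, then fill the training data
X_train["income"] = imp_median.fit_transform(X_train[["income"]]).ravel()

# Reuse the training median on the test data: transform, not fit_transform
X_test["income"] = imp_median.transform(X_test[["income"]]).ravel()

print(imp_median.statistics_)
print(X_test)
```

Calling fit_transform again on the test set would recompute the median from the test data, which is usually not what you want when evaluating a model.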