- Mode Imputation:
- Replace missing categorical values with the mode (most frequent category) of the respective column.
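- Use `df['column'] = df['column'].fillna(df['column'].mode()[0])`. Assigning the result back is more reliable than calling `fillna(..., inplace=True)` on a selected column, which recent pandas versions no longer propagate to `df`.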
- Constant Imputation:
- Replace missing categorical values with a predefined constant category.
- Use `df['column'] = df['column'].fillna('Unknown')`, or any other relevant constant.
- Backfill (or Forward Fill):
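- Fill missing categorical values with the next non-null value in the same column (backfill) or the previous one (forward fill). This is mainly meaningful when row order carries information, as in time-ordered data.
- Use `df['column'] = df['column'].bfill()` for backfill or `df['column'] = df['column'].ffill()` for forward fill. The older `fillna(method='bfill')` form is deprecated in recent pandas versions.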
- Random Sample Imputation:
- Replace missing values with a randomly sampled value from the existing non-null values in the column.
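- Use `mask = df['column'].isna()` and then `df.loc[mask, 'column'] = df['column'].dropna().sample(n=int(mask.sum()), replace=True).to_numpy()`, which draws one value per missing row. Passing a bare `df['column'].sample()` to `fillna` does not work, because `fillna` aligns the single sampled value by index and leaves the missing rows untouched.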
- Imputation Based on Other Features:
- Use information from other features to impute missing categorical values. For example, if a similar observation has a known category, use that category for imputation.
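- Use `df['column'] = df['column'].fillna(df.groupby('another_column')['column'].transform(lambda s: s.mode().iloc[0]))`, assuming every group has at least one non-null value. The string `'mode'` is not accepted by `transform`, so the per-group mode has to be computed with a function.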
- Predictive Imputation:
- Train a machine learning model to predict missing categorical values based on other features.
- This is a more advanced approach and may involve using techniques like decision trees, random forests, or other models for imputation.
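A minimal sketch of predictive imputation, assuming scikit-learn is available and that the predictor columns are numeric and complete. The column names (`size`, `weight`, `color`) and the toy data are hypothetical, and a random forest is only one of the model families mentioned above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 'color' is the categorical column with gaps;
# 'size' and 'weight' are numeric predictors without missing values.
df = pd.DataFrame({
    "size":   [1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 1.05, 5.1],
    "weight": [10, 11, 9, 50, 52, 49, 10, 51],
    "color":  ["red", "red", "red", "blue", "blue", "blue", np.nan, np.nan],
})

features = ["size", "weight"]
known = df["color"].notna()

# Fit only on rows where the category is observed.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(df.loc[known, features], df.loc[known, "color"])

# Predict the category for the rows where it is missing and write it
# into a copy, so the original frame stays available for comparison.
imputed = df.copy()
imputed.loc[~known, "color"] = model.predict(df.loc[~known, features])
print(imputed)
```

In practice the model should be validated on held-out rows with known categories before its predictions are trusted as imputed values.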
The choice of imputation method depends on the nature of the data, the underlying patterns, and the goals of the analysis. Always consider the context of the data and the potential impact of imputation on the analysis results.
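To see how these choices differ on the same data, the simpler strategies can be sketched side by side. The toy frame and its column names (`country`, `city`) are hypothetical, and each result is kept in its own variable rather than written back to `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'city' has gaps, 'country' is fully observed.
df = pd.DataFrame({
    "country": ["FR", "FR", "FR", "FR", "DE", "DE", "DE"],
    "city":    ["Paris", np.nan, "Paris", "Paris", "Berlin", "Berlin", np.nan],
})
col = df["city"]
missing = col.isna()

# Mode imputation: the most frequent observed category overall.
mode_filled = col.fillna(col.mode()[0])

# Constant imputation: an explicit placeholder category.
constant_filled = col.fillna("Unknown")

# Forward fill copies the previous observed value; backfill copies the
# next one and cannot fill a gap in the last row.
ffilled = col.ffill()
bfilled = col.bfill()

# Random sample imputation: one draw (with replacement) per missing row.
draws = col.dropna().sample(n=int(missing.sum()), replace=True, random_state=0)
random_filled = col.copy()
random_filled.loc[missing] = draws.to_numpy()

# Imputation based on another feature: the mode within each country.
group_mode = df.groupby("country")["city"].transform(
    lambda s: s.mode().iloc[0] if s.notna().any() else np.nan
)
group_filled = col.fillna(group_mode)

print(pd.DataFrame({
    "mode": mode_filled, "constant": constant_filled,
    "ffill": ffilled, "bfill": bfilled,
    "random": random_filled, "by_country": group_filled,
}))
```

On this data the overall mode assigns the German row with a missing city to Paris, while the group-wise mode assigns it to Berlin, which illustrates why the context of the data matters when picking a method.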