Tomek Link Undersampling is a technique used to address class imbalance in machine learning datasets. It involves identifying Tomek links, which are pairs of instances from different classes that are nearest neighbors of each other, and removing instances from the majority class that form these links.
The main idea behind Tomek Link Undersampling is to selectively remove majority-class instances that sit close to minority-class instances, particularly ambiguous instances near the decision boundary. Removing them produces a cleaner class boundary and a somewhat less imbalanced dataset, which can improve classifier performance in situations where the majority class overwhelms the minority class.
Here’s how Tomek Link Undersampling works:
- Identifying Tomek Links: First, pairs of instances are identified as Tomek links if they belong to different classes and are each other's nearest neighbor. Formally, an instance x_i from the majority class and an instance x_j from the minority class form a Tomek link if no other instance is closer to x_i than x_j, and vice versa.
- Removing Instances: Once Tomek links are identified, the majority-class member of each link is removed. By deleting only those majority instances that crowd the minority class, Tomek Link Undersampling reduces overlap between the classes while preserving the overall structure of the dataset (a from-scratch sketch of both steps follows this list).
- Model Training: After undersampling, the cleaned dataset is used to train machine learning models. Note that Tomek link removal deletes only boundary instances, so it rarely equalizes the class counts outright; its benefit is that the classifier is less biased toward the majority class near the decision boundary and can better learn the patterns of the minority class.
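To make the first two steps concrete, here is a minimal from-scratch sketch built on scikit-learn's NearestNeighbors. The function name tomek_link_undersample and the majority_label parameter are illustrative, and distance ties and duplicate points are ignored for simplicity; the imbalanced-learn transformer used in the worked example further below is the practical choice.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_undersample(X, y, majority_label):
    # Illustrative helper, not a library API.
    # Find each sample's nearest neighbor (n_neighbors=2 because
    # the closest hit returned by kneighbors is the sample itself).
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]

    # A Tomek link is a pair of mutual nearest neighbors with different labels.
    links = [i for i in range(len(X))
             if y[i] != y[nearest[i]] and nearest[nearest[i]] == i]

    # Drop only the majority-class member of each link.
    drop = {i for i in links if y[i] == majority_label}
    keep = np.array([i for i in range(len(X)) if i not in drop])
    return X[keep], y[keep]

# Toy data: class 1 is the majority; samples 0 and 1 form the only Tomek link.
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 1, 1, 1, 1, 1])
X_clean, y_clean = tomek_link_undersample(X, y, majority_label=1)  # drops sample 1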
Tomek Link Undersampling is often combined with other techniques for handling class imbalance, such as oversampling the minority class with SMOTE (Synthetic Minority Over-sampling Technique) or ADASYN (Adaptive Synthetic Sampling). Integrating several strategies can yield a more balanced and representative dataset and, in turn, better model performance.
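imbalanced-learn packages one such combination as SMOTETomek, which applies SMOTE oversampling and then removes the Tomek links it creates; a brief sketch:
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification

# Imbalanced toy data: class 1 holds roughly 90% of the samples
X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=42)

# SMOTE oversampling followed by Tomek link cleaning in a single step
smt = SMOTETomek(random_state=42)
X_combined, y_combined = smt.fit_resample(X, y)
The complete worked example below applies TomekLinks on its own to a synthetic imbalanced dataset: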
from collections import Counter

from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate a synthetic imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Instantiate the TomekLinks object
tl = TomekLinks()
# Resample the training data using Tomek Links undersampling
X_resampled, y_resampled = tl.fit_resample(X_train, y_train)
# Print the class distribution before and after undersampling
print("Class distribution before Tomek Links undersampling:", {label: sum(y_train==label) for label in set(y_train)})
print("Class distribution after Tomek Links undersampling:", {label: sum(y_resampled==label) for label in set(y_resampled)})