Beyond Knowledge Innovation
Supervised Learning

Undersampling Technique – Tomek Links

April 23, 2024

Tomek Link Undersampling is a technique used to address class imbalance in machine learning datasets. It involves identifying Tomek links, which are pairs of instances from different classes that are nearest neighbors of each other, and removing instances from the majority class that form these links.

The main idea behind Tomek Link Undersampling is to selectively remove instances from the majority class that are close to instances of the minority class, particularly those that are ambiguous or near the decision boundary. By doing so, Tomek Link Undersampling aims to create a more balanced dataset and improve the performance of classifiers, especially in situations where the majority class overwhelms the minority class.

Here’s how Tomek Link Undersampling works:

  1. Identifying Tomek Links: First, pairs of instances are identified as Tomek links if they belong to different classes and are each other's nearest neighbors. Formally, for a pair of instances x_i and x_j, where x_i belongs to the majority class and x_j belongs to the minority class, if no other instance is closer to x_i than x_j, and no other instance is closer to x_j than x_i, then x_i and x_j form a Tomek link.
  2. Removing Instances: Once Tomek links are identified, instances from the majority class that form Tomek links are removed. By selectively removing instances that are close to the minority class, Tomek Link Undersampling aims to reduce the imbalance between classes while preserving the structure of the dataset.
  3. Model Training: After undersampling, the balanced dataset can be used to train machine learning models. Removing instances from the majority class helps prevent the classifier from being biased towards the majority class and encourages it to better learn the patterns in the minority class.
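To make steps 1 and 2 concrete, here is a small illustrative sketch (the toy 1-D data is invented for this example, not taken from the dataset above) that finds Tomek links by hand using scikit-learn's NearestNeighbors and then drops the majority-class member of each link:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 1-D dataset: class 1 is the majority, class 0 the minority
X = np.array([[0.0], [0.3], [1.0], [1.1], [2.5], [2.6], [4.0]])
y = np.array([0,     0,     0,     1,     1,     1,     1])

# Find each point's single nearest neighbor (column 0 is the point itself)
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)
nearest = idx[:, 1]

# A pair (i, j) is a Tomek link if the two points are mutual nearest
# neighbors and belong to different classes
tomek_links = [(i, j) for i, j in enumerate(nearest)
               if nearest[j] == i and y[i] != y[j] and i < j]
print("Tomek links:", tomek_links)

# Undersampling removes the majority-class (class 1) member of each link
to_remove = {i if y[i] == 1 else j for i, j in tomek_links}
mask = np.ones(len(X), dtype=bool)
mask[list(to_remove)] = False
X_clean, y_clean = X[mask], y[mask]
```

Here the boundary pair at 1.0 (minority) and 1.1 (majority) forms the only Tomek link, so the majority point at 1.1 is removed while the rest of the dataset's structure is preserved.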

Tomek Link Undersampling is often used in combination with other techniques for handling class imbalance, such as oversampling the minority class using methods like SMOTE (Synthetic Minority Over-sampling Technique) or ADASYN (Adaptive Synthetic Sampling). By integrating multiple strategies, it’s possible to create a more balanced and representative dataset, leading to improved model performance, especially for classifiers trained on imbalanced datasets.

from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1,
                           flip_y=0, n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the TomekLinks object
tl = TomekLinks()

# Resample the training data using Tomek Links undersampling
X_resampled, y_resampled = tl.fit_resample(X_train, y_train)

# Print the class distribution before and after undersampling
print("Class distribution before Tomek Links undersampling:", {label: sum(y_train==label) for label in set(y_train)})
print("Class distribution after Tomek Links undersampling:", {label: sum(y_resampled==label) for label in set(y_resampled)})

Tags: imbalance, tomek, undersampling
