Beyond Knowledge Innovation

What are the common Distance Measures in Clustering

March 11, 2024, by CEO, 162 views
Distance measures (or similarity measures, depending on the context) play a crucial role in clustering algorithms, as they determine the similarity or dissimilarity between data points. Here are some common distance measures used in clustering:

Euclidean Distance

Measures the straight-line distance between two points in Euclidean space. It is widely used and is the default distance metric for many clustering algorithms, including k-means:

\( \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \)
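As a minimal NumPy sketch (the function name is illustrative), the formula translates directly:

```python
import numpy as np

def euclidean_distance(x, y):
    # Straight-line (L2) distance: square root of the sum of squared differences
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# The classic 3-4-5 right triangle: distance from (0, 0) to (3, 4)
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

In practice, `np.linalg.norm(x - y)` or `scipy.spatial.distance.euclidean` computes the same quantity.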

Manhattan Distance (L1 Norm)

Also known as the L1 norm or taxicab distance, it measures the sum of absolute differences between corresponding coordinates:

\(\sum_{i=1}^{n} |x_i - y_i|\)
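A short sketch of the same idea (illustrative function name, assuming NumPy):

```python
import numpy as np

def manhattan_distance(x, y):
    # Taxicab (L1) distance: sum of absolute coordinate differences
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

# Walking the grid from (0, 0) to (3, 4): 3 blocks east + 4 blocks north
print(manhattan_distance([0, 0], [3, 4]))  # 7.0
```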

Minkowski Distance

A generalization of both the Euclidean and Manhattan distances:

\( \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p} \)

When \( p = 2 \), it is equivalent to the Euclidean distance; when \( p = 1 \), it is equivalent to the Manhattan distance.
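The two special cases can be checked with a small sketch (illustrative function name, assuming NumPy):

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    # (sum of |x_i - y_i|^p) ** (1/p); p=1 gives Manhattan, p=2 gives Euclidean
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
```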

Cosine Similarity

Measures the cosine of the angle between two vectors, providing a measure of similarity rather than distance. It is often used in text mining and recommendation systems:

\(\frac{\sum_{i=1}^{n} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \cdot \sqrt{\sum_{i=1}^{n} y_i^2}} \)
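A minimal sketch of the formula (illustrative function name, assuming NumPy):

```python
import numpy as np

def cosine_similarity(x, y):
    # Cosine of the angle between x and y: dot product over product of norms
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal vectors)
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # ~1.0 (same direction)
```

Note that clustering algorithms usually need a distance, so cosine similarity is typically converted via \( 1 - \text{cosine similarity} \).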

Jaccard Similarity (for Binary Data)

Used for binary data (presence or absence of features) and is commonly employed in text clustering and document similarity analysis:

\(\frac{|A \cap B|}{|A \cup B|} \)

i.e., the number of elements common to both sets divided by the number of unique elements across both sets.
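Because it is defined on sets, the computation needs no external libraries (illustrative function name):

```python
def jaccard_similarity(a, b):
    # Intersection over union of two sets of items
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Two documents as word sets: 2 shared words out of 4 unique words
print(jaccard_similarity({"data", "mining", "text"}, {"data", "text", "nlp"}))  # 0.5
```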

Hamming Distance (for Binary Data)

  • Measures the number of positions at which two binary strings of equal length differ.
  • For example, it is used in genetic studies for comparing DNA sequences.
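A quick sketch along those lines (illustrative function name; the example sequences are made up):

```python
def hamming_distance(s1, s2):
    # Count positions where the two equal-length sequences differ
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

# Two DNA strings differing at two positions
print(hamming_distance("GATTACA", "GACTATA"))  # 2
```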

Correlation Distance

Measures the correlation between two vectors:

\( 1 - \text{Correlation coefficient}\)

Suitable for data measured on different scales, since the correlation coefficient normalizes by the mean and standard deviation of each vector.
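A minimal sketch using the Pearson correlation coefficient (illustrative function name, assuming NumPy):

```python
import numpy as np

def correlation_distance(x, y):
    # 1 minus the Pearson correlation coefficient between x and y
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 1.0 - np.corrcoef(x, y)[0, 1]

print(correlation_distance([1, 2, 3], [2, 4, 6]))  # ~0.0 (perfectly correlated)
print(correlation_distance([1, 2, 3], [3, 2, 1]))  # ~2.0 (anti-correlated)
```

The distance ranges from 0 (perfect positive correlation) to 2 (perfect negative correlation).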

The choice of distance measure depends on the nature of your data and the specific requirements of your clustering task. It’s important to consider factors such as the scale of features, sparsity, and the underlying distribution of the data. Experimenting with different distance measures can help identify the one that best suits your clustering problem.
