Euclidean Distance
Measures the straight-line distance between two points in Euclidean space. It is widely used and is the default distance metric for many clustering algorithms, including k-means:
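For reference, the standard formula for two \(n\)-dimensional points \(x\) and \(y\) (this notation is assumed here, since the equation itself is not shown):

\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]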
Manhattan Distance (L1 Norm)
Also known as the L1 norm or taxicab distance, it measures the sum of absolute differences between corresponding coordinates:
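Using the same notation as above:

\[ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]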
Minkowski Distance
A generalization of both the Euclidean and Manhattan distances:
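Its usual form, for two \(n\)-dimensional points \(x\) and \(y\) and an order parameter \(p \geq 1\) (notation assumed as above):

\[ d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \]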
When \(p = 2\), it is equivalent to the Euclidean distance; when \(p = 1\), it is equivalent to the Manhattan distance.
Cosine Similarity
Measures the cosine of the angle between two vectors, providing a measure of similarity rather than distance. It is often used in text mining and recommendation systems:
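The usual definition for two non-zero vectors \(x\) and \(y\) (notation assumed):

\[ \text{sim}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} \]

When a distance is required, clustering implementations commonly use the cosine distance \(1 - \text{sim}(x, y)\).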
Jaccard Similarity (for Binary Data)
Used for binary data (presence or absence of features) and is commonly employed in text clustering and document similarity analysis:
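A standard formulation treats the two binary vectors as sets \(A\) and \(B\) of present features (this set notation is assumed, not taken from the original):

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]

Equivalently, for binary vectors it is the number of positions where both entries are 1 divided by the number of positions where at least one entry is 1; the corresponding Jaccard distance is \(1 - J(A, B)\).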
Hamming Distance (for Binary Data)
- Measures the number of positions at which two binary strings of equal length differ (a compact formula is given after this list).
- It is used, for example, in genetic studies to compare DNA sequences.
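One compact way to write it, assuming two equal-length binary vectors \(x\) and \(y\) of length \(n\):

\[ d_H(x, y) = \sum_{i=1}^{n} \mathbf{1}[x_i \neq y_i] \]

where \(\mathbf{1}[\cdot]\) equals 1 when the condition holds and 0 otherwise.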
Correlation Distance
Measures dissimilarity based on the correlation between two vectors, typically computed as one minus the Pearson correlation coefficient:
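Written out, with \(\bar{x}\) and \(\bar{y}\) denoting the means of the two vectors (notation assumed):

\[ d(x, y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]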
It is suitable for data measured on varying scales, since each vector is centered and normalized by its standard deviation before comparison.
The choice of distance measure depends on the nature of your data and the specific requirements of your clustering task. It’s important to consider factors such as the scale of features, sparsity, and the underlying distribution of the data. Experimenting with different distance measures can help identify the one that best suits your clustering problem.
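To make the comparison concrete, here is a minimal sketch (not from the original text) that computes each of these measures on small illustrative vectors with SciPy's scipy.spatial.distance module. Note that SciPy reports cosine, Jaccard, and correlation as distances (one minus the similarity) and Hamming as a fraction of differing positions rather than a count.

```python
# Minimal sketch: comparing the distance measures discussed above with SciPy.
# The sample vectors are made up for illustration.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print("Euclidean:", distance.euclidean(x, y))               # straight-line (L2) distance
print("Manhattan:", distance.cityblock(x, y))               # sum of absolute differences (L1)
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))    # generalizes L1 and L2
print("Cosine distance:", distance.cosine(x, y))            # 1 - cosine similarity (~0: same direction)
print("Correlation distance:", distance.correlation(x, y))  # 1 - Pearson correlation

# Binary vectors for the set- and string-based measures.
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])

print("Jaccard distance:", distance.jaccard(a, b))          # 1 - Jaccard similarity
print("Hamming (fraction):", distance.hamming(a, b))        # fraction of differing positions
print("Hamming (count):", distance.hamming(a, b) * len(a))  # number of differing positions
```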