Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a widely used linear dimensionality reduction technique (a form of feature extraction) for reducing the dimensionality of datasets containing many correlated variables while preserving most of the variability in the data. Here’s how PCA works: The “new” variables produced by PCA are all uncorrelated with one another. PCA has…

Unsupervised Learning Dimensionality Reduction – Feature Elimination vs Extraction

Feature Elimination and Feature Extraction are two common techniques used in dimensionality reduction, a process aimed at reducing the number of features (or dimensions) in a dataset while preserving the most important information. Both techniques are used to address the curse of dimensionality, improve computational efficiency, and potentially enhance model performance. However, they differ in…

Cophenetic coefficient

The cophenetic coefficient is a measure used to evaluate the quality of a hierarchical clustering solution. It quantifies how faithfully the hierarchical structure (dendrogram) preserves the original pairwise distances or dissimilarities between data points. Here’s how it works: A high cophenetic coefficient suggests that the hierarchical clustering solution accurately captures the underlying structure of the…

Complete linkage hierarchical clustering

Complete linkage hierarchical clustering is another method used in cluster analysis, like single linkage clustering, but with a different approach to determining the distance between clusters. In complete linkage clustering, the distance between two clusters is defined as the maximum distance between any two points in the two clusters. So, the distance between two clusters…
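To make the definition concrete, here is a minimal sketch in plain Python of the inter-cluster distance used by complete linkage, with the single linkage variant (minimum instead of maximum) alongside for contrast. The function names and the sample points are illustrative assumptions, and Euclidean distance is assumed as the point-to-point metric.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def complete_linkage(cluster_a, cluster_b):
    """Largest distance between any point in cluster_a and any point in cluster_b."""
    return max(dist(p, q) for p, q in product(cluster_a, cluster_b))


def single_linkage(cluster_a, cluster_b):
    """Smallest distance between any point in cluster_a and any point in cluster_b."""
    return min(dist(p, q) for p, q in product(cluster_a, cluster_b))


# Hypothetical sample clusters lying on the x-axis
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(complete_linkage(cluster_a, cluster_b))  # 6.0, from (0, 0) to (6, 0)
print(single_linkage(cluster_a, cluster_b))    # 3.0, from (1, 0) to (4, 0)
```

The same two clusters are 6.0 apart under complete linkage but only 3.0 apart under single linkage, which is why the two methods can merge clusters in a different order.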

Single linkage hierarchical clustering

Single linkage hierarchical clustering is a method used in cluster analysis to group similar data points into clusters based on their proximity or similarity. It is a bottom-up approach, starting with each data point as its own cluster and then iteratively merging the closest pairs of clusters until only one cluster remains. In single linkage…

CDF plot of Numerical columns

The code below generates a grid of subplots (as many rows as needed, 2 columns) and plots the cumulative distribution function (CDF) of each numerical variable in a DataFrame (df).
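As a rough sketch of the computation behind such a plot, the following plain-Python helpers compute the empirical CDF coordinates for one column and the number of rows a 2-column grid would need. It does not draw anything and does not use a DataFrame; the function names `ecdf` and `grid_rows` and the sample values are assumptions made for illustration.

```python
from math import ceil


def ecdf(values):
    """Return (xs, ys): sorted values and the cumulative proportion at each one."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


def grid_rows(n_plots, n_cols=2):
    """Number of subplot rows needed to fit n_plots in a grid with n_cols columns."""
    return ceil(n_plots / n_cols)


# Hypothetical values standing in for one numerical column
xs, ys = ecdf([3, 1, 2, 2])
print(xs)            # [1, 2, 2, 3]
print(ys)            # [0.25, 0.5, 0.75, 1.0]
print(grid_rows(5))  # 3 rows for 5 numerical columns
```

Plotting `ys` against `xs` as a step or line plot, once per numerical column, gives one CDF panel per subplot.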

Finding the optimal number of clusters (k) using Elbow Method

The elbow method is a technique used to find the optimal number of clusters (k) in a dataset for a clustering algorithm, such as k-means. The idea is to run the clustering algorithm for different values of k and plot the sum of squared distances (inertia) between data points and their assigned cluster centroids. The…
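The quantity plotted in the elbow method can be sketched in plain Python as follows. This is only the inertia computation for a given assignment, not a k-means implementation: the cluster labels here are supplied by hand as an assumption, and the function name `inertia` and the sample points are illustrative.

```python
def inertia(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster's members
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


# Hypothetical data: two obvious groups on the x-axis
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

inertia_k1 = inertia(points, [0, 0, 0, 0])  # everything in one cluster
inertia_k2 = inertia(points, [0, 0, 1, 1])  # hand-assigned split into two clusters

print(inertia_k1)  # 104.0
print(inertia_k2)  # 4.0
```

The sharp drop from k = 1 to k = 2 is the kind of bend the elbow plot is meant to reveal; in practice the labels for each k would come from running the clustering algorithm.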

What is Silhouette Coefficient

The silhouette coefficient is a measure of how well-separated clusters are in a clustering analysis. It provides a way to assess the quality of clustering by evaluating both the cohesion within clusters and the separation between clusters. The silhouette coefficient ranges from -1 to 1, with higher values indicating better-defined clusters. Here’s how the silhouette…
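A minimal plain-Python sketch of the usual per-point definition, s = (b - a) / max(a, b), where a is the mean distance to the other members of the point's own cluster (cohesion) and b is the smallest mean distance to the members of any other cluster (separation). Euclidean distance, the score of 0 for singleton clusters, the function name, and the sample data are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_scores(points, labels):
    """Return one silhouette value per point (expects at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical 1-D data with two well-separated groups
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]

scores = silhouette_scores(points, labels)
mean_score = sum(scores) / len(scores)
print(round(mean_score, 3))  # 0.9
```

A mean close to 1, as here, indicates tight and well-separated clusters; values near 0 or below suggest overlapping or misassigned points.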

What is Mahalanobis Distance

The Mahalanobis distance is a measure of the distance between a point and a distribution, taking into account the correlation between variables. It is often used in statistics and machine learning to identify outliers and to assess the dissimilarity between a data point and a distribution. The Mahalanobis distance is defined for a point (x)…

What is Jaccard Distance

Jaccard distance is a measure of dissimilarity between two sets. It is calculated as the complement of the Jaccard similarity coefficient and is particularly useful when dealing with binary data or sets. The Jaccard similarity coefficient measures the proportion of shared elements between two sets, and the Jaccard distance is essentially the complement of this…
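The relationship described above (distance = 1 - similarity, with similarity = |A ∩ B| / |A ∪ B|) can be sketched with Python's built-in sets. The function names and sample sets are illustrative, and treating two empty sets as identical is an assumed convention.

```python
def jaccard_similarity(a, b):
    """Proportion of shared elements: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Complement of the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical sets: 2 shared elements out of 4 distinct elements overall
set_a = {"a", "b", "c"}
set_b = {"b", "c", "d"}

print(jaccard_similarity(set_a, set_b))  # 0.5
print(jaccard_distance(set_a, set_b))    # 0.5
```

Identical sets give a distance of 0 and disjoint sets give a distance of 1.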