t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE, which stands for t-distributed Stochastic Neighbor Embedding, is a popular dimensionality reduction technique (of type Feature Extraction) used in machine learning and data visualization. It is particularly useful for visualizing high-dimensional data in a lower-dimensional space, typically two or three dimensions, while preserving the local structure of the data as much as possible. The…

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a widely used linear dimensionality reduction technique (of type Feature Extraction) for reducing the dimensionality of datasets containing many correlated variables while preserving most of the variability in the data. Here’s how PCA works: The “new” variables produced by PCA are uncorrelated with one another. PCA has…
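To make the idea concrete, here is a minimal sketch that finds only the first principal component of a small dataset. It uses power iteration on the covariance matrix, which is a swapped-in technique chosen to keep the example dependency-free and not necessarily how any particular library computes PCA. The function name `first_principal_component` and the sample data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, variance) of the first principal component.

    Sketch only: assumes the data is not constant and that the start
    vector is not orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance captured along v is v^T * cov * v.
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                   for a in range(d))
    return v, variance

# Two perfectly correlated variables (y = 2x) collapse onto one direction.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, variance = first_principal_component(data)
```

In this toy case the second component would carry no variance at all, which is the situation where dropping it loses nothing.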

Unsupervised Learning Dimensionality Reduction – Feature Elimination vs Extraction

Feature Elimination and Feature Extraction are two common techniques used in dimensionality reduction, a process aimed at reducing the number of features (or dimensions) in a dataset while preserving the most important information. Both techniques are used to address the curse of dimensionality, improve computational efficiency, and potentially enhance model performance. However, they differ in…

Cophenetic coefficient

The cophenetic coefficient is a measure used to evaluate the quality of a hierarchical clustering solution. It quantifies how faithfully the hierarchical structure (dendrogram) preserves the original pairwise distances or dissimilarities between data points. Here’s how it works: A high cophenetic coefficient suggests that the hierarchical clustering solution accurately captures the underlying structure of the…
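One way to see what the coefficient measures is to compute it by hand on a tiny dataset. The sketch below assumes Euclidean distances and single linkage (the linkage choice is an assumption made for illustration). It records the height at which each pair of points first ends up in the same cluster, then takes the Pearson correlation between those heights and the original distances. All function names are hypothetical.

```python
import math
from itertools import combinations

def cophenetic_distances(points):
    """Map each index pair (i, j) to the merge height that first joins them."""
    clusters = [[i] for i in range(len(points))]
    coph = {}
    while len(clusters) > 1:
        # Closest pair of clusters under single linkage (assumed here).
        d, a, b = min(
            (min(math.dist(points[i], points[j])
                 for i in clusters[x] for j in clusters[y]), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        # Every cross pair is joined for the first time at height d.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[(min(i, j), max(i, j))] = d
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return coph

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

def cophenetic_coefficient(points):
    coph = cophenetic_distances(points)
    pairs = sorted(coph)
    original = [math.dist(points[i], points[j]) for i, j in pairs]
    return pearson(original, [coph[p] for p in pairs])

# Two tight, well-separated groups on a line.
sample = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
coefficient = cophenetic_coefficient(sample)
```

For data with clear group structure like this sample, the value lands close to 1.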

Complete linkage hierarchical clustering

Complete linkage hierarchical clustering is another method used in cluster analysis, like single linkage clustering, but with a different approach to determining the distance between clusters. In complete linkage clustering, the distance between two clusters is defined as the maximum distance between any two points in the two clusters. So, the distance between two clusters…
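The definition above translates almost directly into code. This is a minimal sketch assuming Euclidean distance and clusters given as lists of coordinate tuples. The function name `complete_linkage_distance` is hypothetical.

```python
import math

def complete_linkage_distance(cluster_a, cluster_b):
    """Largest distance between any point in cluster_a and any point in cluster_b."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)

left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
farthest = complete_linkage_distance(left, right)  # 5.0, from (0, 0) to (5, 0)
```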

Single linkage hierarchical clustering

Single linkage hierarchical clustering is a method used in cluster analysis to group similar data points into clusters based on their proximity or similarity. It is a bottom-up approach, starting with each data point as its own cluster and then iteratively merging the closest pairs of clusters until only one cluster remains. In single linkage…
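The bottom-up procedure can be sketched in a few lines. This illustration assumes Euclidean distance and uses a brute-force search for the closest pair, which is fine for small inputs but slow for large ones. The function names are hypothetical.

```python
import math

def single_linkage_distance(cluster_a, cluster_b):
    """Smallest distance between any point in cluster_a and any point in cluster_b."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def single_linkage_merges(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

merge_history = single_linkage_merges([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)])
merge_heights = [d for _, _, d in merge_history]  # [1.0, 4.0]
```

The merge heights are what a dendrogram draws on its vertical axis.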

CDF plot of Numerical columns

The code below generates a grid of subplots (a dynamic number of rows and 2 columns) and draws a cumulative distribution function (CDF) plot for each numerical variable in a DataFrame (df).
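As a rough sketch of what such a plot computes, the empirical CDF of a column can be built by sorting the values and pairing each one with its cumulative fraction. The helper names `ecdf` and `grid_rows` are hypothetical. The drawing step (for example with matplotlib) is left out so the snippet stays dependency-free.

```python
import math

def ecdf(values):
    """Return sorted values and the fraction of observations <= each one."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

def grid_rows(num_columns_to_plot, plots_per_row=2):
    """Number of subplot rows needed for a grid with a fixed number of columns."""
    return math.ceil(num_columns_to_plot / plots_per_row)

xs, ys = ecdf([3, 1, 2])
rows_needed = grid_rows(5)  # five numerical columns need 3 rows of 2 plots
```

Each `(xs, ys)` pair would then be drawn as a step or line plot in one cell of the grid.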

Finding the optimal number of clusters (k) using Elbow Method

The elbow method is a technique used to find the optimal number of clusters (k) in a dataset for a clustering algorithm, such as k-means. The idea is to run the clustering algorithm for different values of k and plot the sum of squared distances (inertia) between data points and their assigned cluster centroids. The…
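The loop over k can be sketched without any libraries. The snippet below uses a deliberately simple k-means with deterministic starting centroids (evenly spaced picks from the input list), which is an assumption made so the example is reproducible. Real implementations typically use smarter initialization. The function name `kmeans_inertia` and the sample data are hypothetical.

```python
import math

def kmeans_inertia(points, k, iterations=20):
    """Run a basic k-means and return the final sum of squared distances."""
    n = len(points)
    # Deterministic start: evenly spaced points from the input (assumption).
    centroids = [points[i * n // k] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# Three tight groups of three points each.
sample = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
inertias = [kmeans_inertia(sample, k) for k in range(1, 6)]
```

On this sample the inertia falls sharply up to k = 3 and only slightly afterwards, so a plot of `inertias` against k would bend at 3.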

Standardizing features by StandardScaler

In scikit-learn (sklearn), StandardScaler is a preprocessing transformer used to standardize features by removing the mean and scaling them to unit variance. Standardization is a common step in many machine learning algorithms, especially those that involve distance-based calculations or optimization processes, as it helps ensure that all features contribute equally to the…
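The arithmetic behind standardization is small enough to write out by hand. This is a pure-Python sketch of the same idea for a single column, not scikit-learn's implementation. It uses the population standard deviation, and the function name `standardize` is hypothetical.

```python
import statistics

def standardize(column):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(column)
    # Fall back to 1.0 so a constant column does not cause division by zero.
    std = statistics.pstdev(column) or 1.0
    return [(x - mean) / std for x in column]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardize(heights_cm)
```

After scaling, the column has mean 0 and standard deviation 1 regardless of its original units.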

What is Silhouette Coefficient

The silhouette coefficient is a measure of how well-separated clusters are in a clustering analysis. It provides a way to assess the quality of clustering by evaluating both the cohesion within clusters and the separation between clusters. The silhouette coefficient ranges from -1 to 1, with higher values indicating better-defined clusters. Here’s how the silhouette…
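A hand-rolled version makes the cohesion and separation terms explicit. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The sketch below assumes Euclidean distance and distinct points, and gives a score of 0 to points in singleton clusters. The function name `mean_silhouette` is hypothetical.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster overall
            continue
        a = sum(own) / len(own)                                  # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())    # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

sample = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(sample, [0, 0, 1, 1])  # groups match the geometry
bad = mean_silhouette(sample, [0, 1, 0, 1])   # groups cut across the geometry
```

The well-matched labeling scores close to 1, while the mismatched labeling scores below 0.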