
Principal Component Analysis (PCA)

March 15, 2024

Principal Component Analysis (PCA) is a widely used linear dimensionality reduction technique (a form of feature extraction) that reduces the dimensionality of datasets containing many correlated variables while preserving most of the variability in the data.

Here’s how PCA works (a minimal code sketch of these steps follows the list):

  1. Data Standardization (Scaling): PCA is sensitive to the scale of the variables, so it’s common practice to standardize the data by centering each variable around its mean and scaling it to have unit variance.
  2. Compute Covariance Matrix: PCA analyzes the variance-covariance matrix (or the correlation matrix) of the standardized data. This matrix shows how each variable in the dataset varies with respect to every other variable.
  3. Eigenvalue Decomposition or Singular Value Decomposition (SVD): PCA then calculates the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the directions (or principal components) of maximum variance in the data, while the eigenvalues represent the magnitude of variance along these directions.
  4. Select Principal Components: PCA sorts the eigenvectors based on their corresponding eigenvalues in descending order. The principal components are then selected from these sorted eigenvectors. Typically, one retains only the top k eigenvectors (principal components) that capture the most variance in the data.
  5. Transform Data: Finally, PCA transforms the original data into the new lower-dimensional space spanned by the selected principal components. This transformation is achieved by projecting the data onto the principal components.
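
These five steps can be reproduced directly with NumPy; below is a minimal, self-contained sketch on a small synthetic dataset (the array X and all variable names here are illustrative, not part of the original example):

import numpy as np

# Hypothetical dataset: 100 samples, 4 correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] += 0.8 * X[:, 0]                  # introduce correlation

# 1. Standardize: zero mean, unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort eigenvectors by eigenvalue in descending order and keep the top k
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
W = eigvecs[:, :k]                        # projection matrix (4 x k)

# 5. Project the data onto the selected principal components
X_pca = X_std @ W                         # shape (100, k)

# The off-diagonal covariances of the projected data are ~0
print(np.round(np.cov(X_pca, rowvar=False), 6))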

The “new” variables (principal components) produced by PCA are uncorrelated with one another.

PCA has several applications:

  • Dimensionality Reduction: PCA is primarily used for reducing the dimensionality of high-dimensional datasets, making it easier to visualize and analyze data while retaining most of the variability.
  • Data Visualization: PCA can be used to visualize high-dimensional data in a lower-dimensional space (e.g., 2D or 3D) while preserving the structure and relationships between data points (see the sketch after this list).
  • Noise Reduction: PCA can help in removing noise and redundancy from the data by focusing on the principal components that capture the most significant variability.
  • Feature Engineering: PCA can also be used as a feature engineering technique to create new features that capture the most important information in the original dataset.
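
As an illustration of the visualization use case, here is a minimal sketch that projects the 4-feature Iris dataset down to two components and plots them (the dataset choice and variable names are illustrative only):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load a small 4-feature dataset and standardize it
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Project onto the first two principal components
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

# Scatter plot of the 2-D projection, coloured by class label
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target, alpha=0.7)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Iris projected onto the first two principal components')
plt.show()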

However, it’s essential to note that PCA is a linear technique and may not capture complex nonlinear relationships in the data. Additionally, interpretability of the principal components might be challenging, especially when dealing with a large number of features.
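
For data with pronounced nonlinear structure, a common extension is kernel PCA, which applies PCA in an implicit higher-dimensional feature space. A minimal sketch using scikit-learn's KernelPCA on a toy dataset (the kernel and gamma values are illustrative, not tuned):

from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# A toy nonlinear dataset that linear PCA cannot unfold
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=0)

# RBF-kernel PCA captures the nonlinear structure of the two half-moons
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X_moons)
print(X_kpca.shape)   # (200, 2)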

Python Library:

from sklearn.decomposition import PCA

Example:

import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# 'data' is assumed to be a pandas DataFrame loaded earlier
# (e.g., the auto-mpg dataset) with 'mpg' as the target column
X = data.drop('mpg', axis=1)
y = data['mpg']

# Standardize each feature to zero mean and unit variance
X_scaled = X.apply(zscore)

# Covariance matrix of the standardized features
covMatrix = np.cov(X_scaled, rowvar=False)
print(covMatrix)

# Fit PCA keeping all 6 components to inspect the explained variance
pca = PCA(n_components=6)
pca.fit(X_scaled)

# The eigenvalues
print(pca.explained_variance_)

# The eigenvectors
print(pca.components_)

# The percentage of variance explained by each principal component
print(pca.explained_variance_ratio_)

# Scree plot: variance explained by each component
plt.bar(list(range(1, 7)), pca.explained_variance_ratio_, alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Principal component')
plt.show()

# Cumulative variance explained
plt.step(list(range(1, 7)), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative variation explained')
plt.xlabel('Principal component')
plt.show()

Three components seem very reasonable: with 3 components we can explain over 95% of the variation in the original data.
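
As a side note, scikit-learn can also select the number of components automatically for a target fraction of explained variance; a minimal sketch, assuming the X_scaled DataFrame from above:

# Keep as many components as needed to explain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_scaled)
print(pca_95.n_components_)                    # components retained
print(pca_95.explained_variance_ratio_.sum())  # total variance explained

Returning to the explicit 3-component fit: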

pca3 = PCA(n_components=3)
pca3.fit(X_scaled)
print(pca3.components_)
print(pca3.explained_variance_ratio_)

# Project the standardized data onto the 3 retained components
Xpca3 = pca3.transform(X_scaled)

Output:

[[ 0.45509041  0.46913807  0.46318283  0.44618821 -0.32466834 -0.23188446]
 [ 0.18276349  0.16077095  0.0139189   0.25676595  0.21039209  0.9112425 ]
 [ 0.17104591  0.13443134 -0.12440857  0.27156481  0.86752316 -0.33294164]]
[0.70884563 0.13976166 0.11221664]

# Pairplot of the 3 (uncorrelated) components, then a linear regression
# of mpg on the components
sns.pairplot(pd.DataFrame(Xpca3))

regression_model_pca = LinearRegression()
regression_model_pca.fit(Xpca3, y)
regression_model_pca.score(Xpca3, y)

Output:

0.7799909620572006

An out-of-sample evaluation (on test data) of the model built on the 3 uncorrelated components is likely to do better than one using all of the original variables, since the reduced model is less of an over-fit.
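
To check this out-of-sample claim, a minimal sketch using a train/test split (it assumes the same X_scaled and y as above; the split ratio and random seed are illustrative):

from sklearn.model_selection import train_test_split

# Hold out a test set before fitting PCA to avoid information leakage
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=1)

# Fit PCA on the training data only, then apply it to both splits
pca_train = PCA(n_components=3).fit(X_train)
X_train_pca = pca_train.transform(X_train)
X_test_pca = pca_train.transform(X_test)

# Fit the regression on the 3 components and score it on unseen data
reg = LinearRegression().fit(X_train_pca, y_train)
print(reg.score(X_test_pca, y_test))   # out-of-sample R^2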

