
100 Days of ML Code - Day 9, 10 & 11: Exploring the World of Cars with Principal Component Analysis (PCA)

Sunday 26 May 2024

Introduction

This post took me a while to write as I wanted to make sure I understood PCA well enough. For this entire blog, I have tried to apply what I learnt by creating a custom dataset that can be used to demonstrate the power of PCA. I hope you enjoy it!

Imagine you have a dataset containing information about various car models, each with multiple features such as horsepower, fuel efficiency, weight, and price. How can you extract meaningful information from this complex data? The last thing we want to do is draw out each relationship between the features individually - as the number of features grows, the number of pairwise relationships grows quadratically and quickly becomes impractical to examine by hand.

This is where Principal Component Analysis (PCA) comes to the rescue. In this blog post, we'll see how it can help us understand and visualize the relationships among different car models.

Understanding PCA

Before we get our hands dirty with the car dataset, let's take a moment to grasp the concept of PCA. In simple terms, PCA is a technique that transforms a dataset with many variables into a new set of variables called principal components. These principal components are created in such a way that they capture the maximum amount of information from the original dataset while being uncorrelated with each other. Think of it as a tool that takes a complex, high-dimensional dataset and condenses it into a more manageable and interpretable form. PCA helps us identify patterns, reduce the dimensionality of the data, and visualize the relationships among the observations.
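To make the idea concrete, here is a minimal sketch of the core computation behind PCA using plain NumPy: standardize the data, compute the covariance matrix, and take its eigenvectors as the principal component directions. The tiny toy matrix and variable names below are purely illustrative and are not part of the car dataset.

import numpy as np

# A tiny toy matrix: 5 observations, 3 features (illustrative values only)
X = np.array([
    [2.0, 1.5, 0.3],
    [3.1, 2.9, 0.5],
    [1.2, 0.8, 0.2],
    [4.0, 3.7, 0.9],
    [2.5, 2.1, 0.4],
])

# Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# Eigen-decomposition: eigenvectors give the principal component directions,
# eigenvalues tell us how much variance each component captures
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort components from most to least explained variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the principal components
X_projected = X_std @ eigenvectors

print(eigenvalues / eigenvalues.sum())  # proportion of variance per component

In practice we will let scikit-learn do this for us, but the idea is exactly the same: find the uncorrelated directions that capture the most variance, and re-express the data along them.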

Applying PCA to the Car Dataset

from ml_code.utils import load_data

# Load the dataset
data = load_data('car_data.csv')

# Print the first few rows of the dataset
data.head()

+-------------+------------+-----------------+--------+--------+
|    model    | horsepower | fuel_efficiency | weight | price  |
+-------------+------------+-----------------+--------+--------+
| SportsCar1  |    450     |       12        |  3500  | 150000 |
+-------------+------------+-----------------+--------+--------+
| SportsCar2  |    500     |       10        |  3800  | 200000 |
+-------------+------------+-----------------+--------+--------+
| SportsCar3  |    400     |       14        |  3200  | 120000 |
+-------------+------------+-----------------+--------+--------+
| SportsCar4  |    480     |       11        |  3600  | 180000 |
+-------------+------------+-----------------+--------+--------+
| SportsCar5  |    420     |       13        |  3400  | 140000 |
+-------------+------------+-----------------+--------+--------+

We have information about various car models, including their horsepower, fuel efficiency, weight, and price. Our goal is to understand how these cars are related to each other based on their features.

To begin, we load the dataset into our Python environment and perform some basic preprocessing steps, such as separating the features from the target variable (car model) and standardizing the features to ensure they have zero mean and unit variance.

from sklearn.preprocessing import StandardScaler

# Separate the features and the target variable
X = data.drop('model', axis=1)
y = data['model']

# Standardize the features so each has zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Next, we apply PCA to our preprocessed dataset. In this case, we specify the number of principal components we want to keep. Since we have four features, we'll start by retaining all four components to capture the maximum amount of information. (Note: the maximum number of components PCA can extract is the minimum of the number of observations - here the number of car models - and the number of features.)

from sklearn.decomposition import PCA

# Apply PCA, keeping all four components for now
pca = PCA(n_components=4)
X_pca = pca.fit_transform(X_scaled)
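Before interpreting anything, it can help to look at what fit_transform actually returns: each row of X_pca is the same car, just expressed in the new principal component coordinates. A small sketch of how we might inspect it, assuming pandas is available as pd (the load_data helper above already gave us a pandas DataFrame):

import pandas as pd

# Wrap the transformed data in a DataFrame so each column is a principal component
pc_columns = ['PC1', 'PC2', 'PC3', 'PC4']
X_pca_df = pd.DataFrame(X_pca, columns=pc_columns, index=y)

print(X_pca_df.head())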

Interpreting the Results

After running PCA, we obtain the transformed dataset in terms of the principal components. Each principal component is a linear combination of the original features, and they are ranked in order of the amount of variance they explain. To understand the importance of each principal component, we can examine the explained variance ratio.

print(pca.explained_variance_ratio_)

[0.81957028 0.17267014 0.00658543 0.00117415]

This tells us that the first two principal components (PC1 and PC2) together account for roughly 99.2% of the variance in the dataset. In other words, we can focus on just these two components and still retain almost all of the information about the relationships among the car models.

PCA Scree Plot
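For reference, a scree plot like the one above can be produced with a few lines of matplotlib. This is a minimal sketch rather than the exact code used to generate the figure, and it assumes matplotlib is installed:

import numpy as np
import matplotlib.pyplot as plt

explained = pca.explained_variance_ratio_
components = np.arange(1, len(explained) + 1)

# Bar chart of per-component variance plus a cumulative line
plt.bar(components, explained, alpha=0.6, label='Individual explained variance')
plt.plot(components, np.cumsum(explained), marker='o', color='red',
         label='Cumulative explained variance')
plt.xticks(components, [f'PC{i}' for i in components])
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.title('PCA Scree Plot')
plt.legend()
plt.show()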

Visualizing the Results

To visualize the results, we create a scatter plot of the car models using the first two principal components (PC1 and PC2). Each point in the plot represents a car model, and we color-code them based on their category (e.g., sports cars, economy cars, family cars, luxury cars). The scatter plot reveals some fascinating insights:

PCA Scatter Plot
PCA Clusters
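Here is a sketch of how a scatter plot like this could be drawn. It assumes the category can be recovered by stripping the trailing digits from the model name (e.g. 'SportsCar1' becomes 'SportsCar'); your own dataset may store categories in a separate column instead:

import matplotlib.pyplot as plt

# Derive a category label from the model name (assumption: names end in a number)
categories = y.str.rstrip('0123456789')

# Plot each category with its own colour in the PC1-PC2 plane
for category in categories.unique():
    mask = (categories == category).to_numpy()
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=category, alpha=0.8)

plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Car models in the space of the first two principal components')
plt.legend()
plt.show()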

Conclusion

PCA has proven to be a powerful tool for exploring and understanding our car dataset. By reducing the dimensionality and visualizing the data in terms of the principal components, we gained valuable insights into the relationships among different car models.

We discovered distinct clusters representing various car categories, each with its own characteristics. The scatter plot allowed us to identify the key features that differentiate the car segments, such as performance, price, and fuel efficiency. Moreover, PCA helped us simplify the complex dataset and focus on the most important information. By retaining just the first two principal components, we were able to capture a significant portion of the variance and create an informative visualization.

In conclusion, PCA is a valuable technique for anyone dealing with high-dimensional datasets. So, the next time you encounter a dataset with numerous variables, remember the power of PCA! Apply it to your data, explore the relationships, and gain insights that can drive your analysis forward. Happy analyzing!