I've decided to embark on a journey of 100 Days of ML Code to level up my machine learning skills. While I've worked with neural networks before to solve specific problems, I realized my understanding of the underlying principles was a bit lacking. Over these 100 days, I aim to dive deeper and really strengthen my ML foundations.
To kick things off, I'm starting with the critical first step of any ML project - data preprocessing. It's not the most glamorous part, but getting your data in the right shape is absolutely essential. In my Day 1 notebook, I walk through the key data preprocessing steps with a basic dataset.
There's a little snippet of the dataset below. It's a simple CSV file with columns for 'Country', 'Age', 'Salary', and 'Purchased'. The goal is to predict whether a user will purchase a product based on their country, age, and salary.
+---------+-----+--------+-----------+
| Country | Age | Salary | Purchased |
+---------+-----+--------+-----------+
| France  | 44  | 72000  | No        |
| Spain   | 27  | 48000  | Yes       |
| Germany | 30  | 54000  | No        |
| Spain   | 38  | 61000  | No        |
| Germany | 40  |        | Yes       |
| France  | 35  | 58000  | Yes       |
| Spain   |     | 52000  | No        |
| France  | 48  | 79000  | Yes       |
| Germany | 50  | 83000  | No        |
| France  | 37  | 67000  | Yes       |
+---------+-----+--------+-----------+
The first thing I do after importing the necessary libraries is load in the data using pandas. With the dataset loaded, I extract the features into one variable (X) and the label I'm trying to predict into another (y).
import numpy as np
from ml_code.utils import load_data  # small project helper that returns a pandas DataFrame

# Split the DataFrame into the feature matrix (Country, Age, Salary) and the label (Purchased)
dataset = load_data("basic_data.csv")
features = dataset.iloc[:, :-1].values
labels = dataset.iloc[:, -1].values
Next comes handling missing data. Real-world datasets are messy, and it's common to have missing values. Common strategies for imputing missing data include the following (a quick scikit-learn sketch of a few of them follows the list):
Mean Imputation: Replaces missing values with the mean of the column.
Median Imputation: Replaces missing values with the median of the column.
Most Frequent Imputation: Replaces missing values with the mode of the column (common for categorical data).
Regression Imputation: Uses a regression model to predict missing values from other features.
Iterative Imputation: Iteratively imputes missing values and trains a model until convergence.
Removing Missing Values: Another option is to remove rows or columns with missing data entirely.
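To make a few of those concrete, here's a minimal, illustrative sketch of how they look in scikit-learn (the objects are only constructed here, not fit, and the variable names are my own):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- needed to unlock IterativeImputer
from sklearn.impute import IterativeImputer

# Mean / most-frequent imputation: same API as the median imputer used below
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Iterative imputation: models each incomplete column as a function of the other columns
iterative_imputer = IterativeImputer(max_iter=10, random_state=0)

# Dropping incomplete rows instead of imputing is a one-liner on the pandas DataFrame:
# dataset.dropna()

All of these share the same fit/transform interface, so switching strategies later is a one-line change.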
The choice depends on the data distribution, nature of missing values, and potential impact on the downstream models. In this case, I opted to use the median imputation strategy to fill in missing values in the 'Age' and 'Salary' columns.
from sklearn.impute import SimpleImputer

# Replace the missing Age and Salary values (columns 1 and 2) with each column's median
# (here that works out to an Age of 38 and a Salary of 61000)
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
imputer.fit(features[:, 1:3])
features[:, 1:3] = imputer.transform(features[:, 1:3])
Another key preprocessing step is encoding categorical variables. ML models work with numbers, not text labels. So I use OneHotEncoder to convert the country categories into binary variables. Similarly, I encode the "Yes"/"No" labels into 1s and 0s.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encode the Country column (index 0) and pass Age and Salary through unchanged;
# the encoded matrix comes out as [country dummy columns..., Age, Salary]
preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(), [0]),
        ("passthrough", "passthrough", [1, 2]),
    ]
)
features_encoded = preprocessor.fit_transform(features)

# Encode the "Yes"/"No" labels as 1/0
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)
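To make the one-hot step concrete, here's a tiny standalone sketch (the toy array and variable names are purely illustrative) of what the encoder produces for the three country values:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["France"], ["Germany"], ["Spain"]])
encoder = OneHotEncoder()
print(encoder.fit_transform(countries).toarray())
# Categories are sorted alphabetically (France, Germany, Spain), so:
# France  -> [1. 0. 0.]
# Germany -> [0. 1. 0.]
# Spain   -> [0. 0. 1.]

Each country gets its own binary column, so the model never sees an artificial ordering between France, Germany, and Spain.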
With the categorical variables handled, I split the data into training and test sets. This lets you develop the model on the training data while holding back some test data to evaluate performance. Scikit-learn's train_test_split function makes this easy.
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for evaluation (with 10 rows, that's 8 training / 2 test examples)
features_train, features_test, labels_train, labels_test = train_test_split(
    features_encoded, labels_encoded, test_size=0.2, random_state=0
)
Feature scaling is a technique used to put your features on a comparable scale. This matters because many algorithms, especially distance-based and gradient-based ones, are sensitive to the magnitude of feature values. Without scaling, features with larger values (like salary) could unfairly influence the model more than features with smaller values (like age).
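As a quick sanity check on what min-max scaling computes: each value x is mapped to (x - min) / (max - min). The numbers below use the full Age column's min and max purely for illustration; the real scaler below is fit on the training split only:

# Illustrative only: min-max scaling an Age of 38 using the column's min (27) and max (50)
age, age_min, age_max = 38, 27, 50
print((age - age_min) / (age_max - age_min))  # ~0.478, so 38 lands just below the middle of [0, 1]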
from sklearn.preprocessing import MinMaxScaler

# Scale Age and Salary (columns 3 and 4 after one-hot encoding) into [0, 1],
# passing the three country dummy columns through unchanged.
# Note: ColumnTransformer outputs the transformed columns first, so the result
# is laid out as [scaled Age, scaled Salary, country dummies...].
scaler = ColumnTransformer(
    transformers=[
        ("scaler", MinMaxScaler(), [3, 4]),
        ("passthrough", "passthrough", [0, 1, 2]),
    ]
)
features_train_scaled = scaler.fit_transform(features_train)
features_test_scaled = scaler.transform(features_test)
In this case, I normalized the age and salary with min-max scaling, so they all ended up between 0 and 1. But there are other scaling methods that could have been used here as well (see the sketch after this list), e.g.:
Standard Scaling: This technique would adjust the 'Age' and 'Salary' so their average is 0 and variance is 1. It would measure how far each value deviates from the average in units of standard deviation. It's effective for algorithms that assume data is normally distributed.
Robust Scaling: If there were outliers in our data, Robust Scaling would limit their influence. It centers each feature on the median and scales by the interquartile range, so, e.g., if an age of 100 or a salary of 200,000 appeared in the dataset, those values wouldn't skew the entire scale.
MaxAbs Scaling: This method scales each feature by its maximum absolute value. For this dataset, it would divide each 'Salary' by 83,000 and each 'Age' by 50, ensuring all values lie between -1 and 1. This is useful when the data is already centered at zero.
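All of these scalers share scikit-learn's fit/transform interface, so swapping one in is a one-line change. Here's a minimal sketch (this alternative ColumnTransformer is illustrative, not the one from my notebook):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler, MaxAbsScaler

# Any of these could replace MinMaxScaler in the scaling step above
standard_scaler = StandardScaler()  # rescale to mean 0, unit variance
robust_scaler = RobustScaler()      # center on the median, scale by the interquartile range
maxabs_scaler = MaxAbsScaler()      # divide by the max absolute value, giving a [-1, 1] range

alternative_scaler = ColumnTransformer(
    transformers=[
        ("scaler", RobustScaler(), [3, 4]),         # Age and Salary columns
        ("passthrough", "passthrough", [0, 1, 2]),  # country dummy columns
    ]
)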
Ultimately, it's up to us to decide which scaling method is most appropriate for our dataset and the model we're using, and it might even require some experimentation to find the best fit for our specific use case. This is usually the part where domain knowledge and experience come into play...
And that's data preprocessing in a nutshell! While this is all stuff I had some exposure to before, it was good to go step-by-step and really think about the purpose behind each transformation.
Cleaning and prepping data is such a critical skill in machine learning. Bad data is one of the biggest reasons models fail to perform in the real world. So I'm glad I devoted Day 1 to really nailing down this workflow. Stay tuned for more!