Continuing my journey in the 100 Days of ML Code challenge, Days 2 and 3 focus on exploring simple and multiple linear regression. These are foundational techniques in machine learning that help us understand the relationship between variables and make predictions. This post will cover the process of visualizing data, training linear regression models, making predictions, and evaluating model performance.
In Day 2, we used a simple dataset containing student scores and examined how they correlate with the number of study hours. Our aim is to find a linear function that predicts the response value (score) as accurately as possible from the feature (study hours). Let's dive into the steps involved.
The dataset used is studentscores.csv, which has two columns: "Hours" and "Scores". Our goal is to predict the "Scores" based on "Hours".
from ml_code.utils import load_data
# Load the student scores dataset into a DataFrame
dataset = load_data("studentscores.csv")
dataset.head()
+-------+--------+
| Hours | Scores |
+-------+--------+
| 2.5 | 21 |
| 5.1 | 47 |
| 3.2 | 27 |
| 8.5 | 75 |
| 3.5 | 30 |
+-------+--------+
First, I wanted to visualize the relationship between the number of study hours and the scores obtained. This helps us understand if there's a linear relationship between the two variables.
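Here is a rough sketch of how such a scatter plot could be produced with matplotlib (the "Hours" and "Scores" column names come from the dataset above; the styling is my own choice):
import matplotlib.pyplot as plt
# Scatter plot of study hours against exam scores
plt.scatter(dataset["Hours"], dataset["Scores"], color="steelblue")
plt.title("Study Hours vs Scores")
plt.xlabel("Hours studied")
plt.ylabel("Score")
plt.show()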
We could see a clear positive linear relationship between the number of study hours and the scores obtained. Therefore, it seems suitable to apply simple linear regression to this dataset.
We split the data into training and test sets, then train a simple linear regression model.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Feature matrix (study hours) and target vector (scores)
features = dataset[["Hours"]]
labels = dataset["Scores"]
# Hold out 25% of the data for testing
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.25, random_state=0)
regressor = LinearRegression()
regressor.fit(features_train, labels_train)
After training the model, we can make predictions on the test set and evaluate its performance by looking at metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
But what are these metrics? Let's break them down:
Mean Absolute Error: This metric measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight. It gives an idea of how wrong the predictions are on average.
Mean Squared Error: This metric measures the average of the squares of the errors — that is, the average squared difference between the estimated values and the actual value. MSE is more sensitive to larger errors because the squaring of each term effectively penalizes larger errors more than smaller ones.
Root Mean Squared Error: This is the square root of the mean of the squared errors. RMSE is a good measure of how accurately the model predicts the scores. It gives an idea of how large the errors are.
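To make these definitions concrete, here is a minimal sketch of how each metric could be computed by hand with numpy (the manual_metrics helper is purely illustrative); the sklearn functions used below do the same work for us:
import numpy as np
def manual_metrics(y_true, y_pred):
    errors = np.asarray(y_true) - np.asarray(y_pred)
    mae = np.mean(np.abs(errors))   # average absolute error
    mse = np.mean(errors ** 2)      # average squared error
    rmse = np.sqrt(mse)             # square root of MSE, back in the original units
    return mae, mse, rmse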
from sklearn import metrics
labels_pred = regressor.predict(features_test)
print("Mean Absolute Error:", metrics.mean_absolute_error(labels_test, labels_pred))
print("Mean Squared Error:", metrics.mean_squared_error(labels_test, labels_pred))
print(
    "Root Mean Squared Error:",
    metrics.mean_squared_error(labels_test, labels_pred) ** 0.5,
)
Mean Absolute Error: 4.130879918502482
Mean Squared Error: 20.33292367497996
Root Mean Squared Error: 4.509204328368805
The mean absolute error value hovers around 4.13. The RMSE value of 4.51 also indicates a similar range of error in the model's predictions. Let's see if our line of best fit passes the eye test, and looks like it fits the data well.
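A quick sketch of how the fitted line can be overlaid on the test data (variable names follow the training snippet above; the plot styling is my own choice):
import matplotlib.pyplot as plt
# Plot the test points and the fitted regression line
plt.scatter(features_test, labels_test, color="steelblue", label="Actual scores")
plt.plot(features_test, regressor.predict(features_test), color="red", label="Line of best fit")
plt.xlabel("Hours studied")
plt.ylabel("Score")
plt.legend()
plt.show()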
We can see that the line of best fit closely follows the trend of the data points, indicating a good fit.
Given the simplicity of the dataset, the simple linear regression model performed reasonably well at predicting student scores from study hours. It captures the linear relationship between the two variables, but its individual predictions are sometimes off by a few points. This could be due to the small size of the dataset, and to the fact that the score values are more spread out towards the higher end of the scale.
Multiple linear regression extends the concept of simple linear regression by considering multiple features to predict a response variable. The steps involved in multiple linear regression are very similar to those shown in the simple example above, but there are a few things to be mindful of that we will get into.
The dataset used for multiple linear regression is a bit more complex, containing multiple features that we can use to predict the response variable.
We are using the 50_Startups.csv dataset, which has the columns "R&D Spend", "Administration", "Marketing Spend", "State" and "Profit".
dataset = load_data("50_Startups.csv")
dataset.head()
+------------+----------------+----------------+------------+------------+
| R&D Spend | Administration | Marketing Spend| State | Profit |
+------------+----------------+----------------+------------+------------+
| 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
+------------+----------------+----------------+------------+------------+
Before we can start using a linear regression model, we need to describe the current data, and check some of our assumptions. We need to make sure that there is some linearity between the features and the response variable, and that the features are not highly correlated with each other. Most of this can be done with visualisations or correlation matrices. I will spare you the code implementation here and just focus on the results.
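For reference, a minimal version of such a check might look like the following (it only uses the numeric columns of the dataset loaded above; the column list and figure size are my own choices):
import pandas as pd
# Correlation matrix between the numeric features and the response
numeric_cols = ["R&D Spend", "Administration", "Marketing Spend", "Profit"]
print(dataset[numeric_cols].corr())
# Pairwise scatter plots to eyeball linearity and collinearity
pd.plotting.scatter_matrix(dataset[numeric_cols], figsize=(8, 8))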
We know that there is some categorical data in the "State" column. We will need to encode this data before we can use it in our model. We're going to do this through one-hot encoding to convert the categorical data into numerical data. This will allow us to use the "State" column in our model. Something to keep note of here is that we want to avoid the dummy variable trap, which occurs when the dummy columns are perfectly collinear: one of them can always be inferred from the others.
e.g. If we have a column for "New York" and "California", we don't need a column for "Florida" as well, as the model can infer that if the other two columns are 0, then the value must be "Florida". Dropping one dummy column (which the drop_first=True argument in the snippet below does for us) avoids this trap.
After preprocessing the data, we can split it into training and test sets, and train a multiple linear regression model.
import pandas as pd
# One-hot encode the "State" column; drop_first avoids the dummy variable trap
dataset = pd.get_dummies(dataset, columns=["State"], drop_first=True)
features = dataset.drop("Profit", axis=1)
labels = dataset["Profit"]
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
We can now make predictions on the test set and evaluate the model's performance. With multiple linear regression, we can use the same metrics as in simple linear regression to evaluate the model's accuracy, but we also incorporate the R-squared and adjusted R-squared values to understand how well the model fits the data.
What are these metrics? Let's break them down:
R-squared: This metric measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It provides an indication of the goodness of fit of the model. The higher the R-squared value, the better the model fits the data.
Adjusted R-squared: This metric adjusts the R-squared value based on the number of independent variables in the model. It provides a more accurate measure of the model's goodness of fit when there are multiple independent variables.
import numpy as np
y_pred = regressor.predict(X_test)
r2 = metrics.r2_score(y_test, y_pred)
# Adjust R-squared for the number of predictors (X_test.shape[1]) relative to the test sample size
adjusted_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = metrics.mean_absolute_error(y_test, y_pred)
print(f"R-squared value: {r2}")
print(f"Adjusted R-squared value: {adjusted_r2}")
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Absolute Error: {mae}")
R-squared value: 0.9347068473282424
Adjusted R-squared value: 0.8530904064885454
Mean Squared Error: 83502864.03257754
Root Mean Squared Error: 9137.990152794953
Mean Absolute Error: 7514.293659640612
The R-squared value is quite high, indicating that the model explains a significant portion of the variance in profit. However, the Adjusted R-squared is lower, suggesting that the model may include some less relevant predictors or could be improved by refining the feature set. The MSE seems large, but the RMSE provides a clearer picture. An RMSE of 9,138 means the model's typical prediction error is around 8.16% of the average profit ($112,012.64), which may (or may not) be an acceptable level of error depending on the context in which the model is used.
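For reference, the relative-error figure above can be reproduced with a quick back-of-the-envelope check (labels here is the "Profit" column defined earlier):
# RMSE as a fraction of the average profit across the dataset
average_profit = labels.mean()          # roughly 112,012.64 for this dataset
relative_error = rmse / average_profit  # roughly 0.0816, i.e. about 8.16%
print(f"RMSE is {relative_error:.2%} of the average profit")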