Simple Linear Regression – Deep Dive (Part 2)

Although it’s entirely possible to code a typical machine learning algorithm from scratch, most of the time we wouldn’t want to. The reason is that a bunch of libraries have already been built to solve the common problems you will encounter in the course of implementation. Hence, in practice, it’s usually best to use existing libraries as opposed to coding our algorithms from scratch.

In the previous post, I started a deep dive into Simple Linear Regression, which involved coding an implementation based on NumPy. In this post, I’ll use a library called Scikit-learn that ships with a number of machine learning algorithms.

Scikit-learn is a machine learning library built with Python that comes with ready-made ML algorithms for clustering, classification, regression, and even dimensionality reduction. If you have an Anaconda environment set up on your local machine, then you should already have Scikit-learn installed. However, to install Scikit-learn without Anaconda, run the following commands:

> pip install -U numpy 
> pip install -U scipy 
> pip install -U scikit-learn

Scikit-learn relies on two other libraries, NumPy and SciPy, in order to work well. NumPy is great for linear algebra, multi-dimensional matrix operations, and vectorized operations on large arrays. SciPy is another great Python module for mathematics and scientific computing.
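
For instance, here’s a tiny taste of NumPy’s vectorized style (a sketch with made-up numbers):

import numpy as np

# Apply an arithmetic expression to every element at once, with no explicit loop
amounts = np.array([10.0, 25.0, 40.0])   # hypothetical advertising spend
revenues = 2.5 * amounts + 3.0           # a made-up linear transform

print(revenues)         # [ 28.   65.5 103. ]
print(revenues.mean())  # 65.5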

Scikit-learn has a concept called an estimator which must implement a fit method. The fit method is used by the estimator to learn a model. Typically, the fit method will accept some kind of training data from which it can learn the model parameters. For classification and regression tasks, we have various classifiers and regressors which are also estimators. But in addition to the fit method, they also have predict and score methods. In most ML tasks involving classification and regression, you will first call the fit method on some training data, and then the predict method on input (X) samples to estimate output (y) samples. 
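
As a minimal sketch of this pattern (using LinearRegression on a tiny made-up dataset, purely to show the interface):

from sklearn.linear_model import LinearRegression

# A toy dataset: X must be 2-dimensional (n_samples x n_features), y may be 1-dimensional
X_train = [[1.0], [2.0], [3.0], [4.0]]
y_train = [2.0, 4.0, 6.0, 8.0]

estimator = LinearRegression()
estimator.fit(X_train, y_train)           # every estimator learns a model via fit
print(estimator.predict([[5.0]]))         # regressors add predict -> [10.]
print(estimator.score(X_train, y_train))  # and score -> 1.0 for a perfect fit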

You can check out more on the Scikit-learn API from this link.

With that out of the way, the next step is to code a Simple Linear Regression model to solve the example in the previous post. Remember, the goal is to identify a suitable model for predicting the gross revenue of a company based on the amount spent on advertising. Here’s the code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

First, we import the necessary modules. Notice we’re using the LinearRegression class defined in the package sklearn.linear_model. This is a regressor that will allow us to fit our training data. We can then call the predict method on the regressor to predict our gross revenue based on the amount spent on advertising. Here’s some more code:

def read_data(x_header, y_header):
    # Read the training data from a user-supplied CSV file
    path = input('Enter path to CSV file: ')
    frame = pd.read_csv(path)
    return frame[x_header].values, frame[y_header].values


def read_input_amount():
    return float(input('Enter amount: '))


x_train, y_train = read_data('amount (thousand naira)',
                             'gross revenue (millions)')

# Fit the model; reshape turns the flat arrays into (n_samples, 1) matrices
regressor = LinearRegression()
regressor.fit(x_train.reshape(-1, 1), y_train.reshape(-1, 1))

# Predict gross revenue for a user-supplied advertising spend
amount = read_input_amount()
revenue = regressor.predict(np.array(amount).reshape(-1, 1))[0][0]
print(f'Gross Revenue: {revenue}')

The fit method here receives two two-dimensional arrays representing the x (amount spent) and y (gross revenue) training data. Notice we have to reshape the arrays so they become two-dimensional; Scikit-learn requires x to be a matrix of shape (n_samples, n_features), while y may be one- or two-dimensional. Specifying the row value as -1 means we want NumPy to figure out the appropriate number of rows, while the column size is fixed at one. After fitting, we read the input amount using the function we defined. We then call the predict method on that amount, which is likewise wrapped in an array and reshaped. Because we fitted with a two-dimensional y, the output of predict is also a two-dimensional array, from which we extract the item in the first row and first column, which is our prediction.
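
To make the reshape step concrete, here’s what reshape(-1, 1) does to a flat array (made-up values):

import numpy as np

a = np.array([5.0, 10.0, 15.0])  # shape (3,): a flat, one-dimensional array
b = a.reshape(-1, 1)             # shape (3, 1): NumPy infers 3 rows, 1 column
print(b)
# [[ 5.]
#  [10.]
#  [15.]]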

Implementing the regression using Scikit-learn was pretty straightforward. We only had to call the fit and predict methods on the LinearRegression object; Scikit-learn handles the work of learning the model and applying it to make predictions.

Model evaluation using the R-Squared metric

Machine learning algorithms typically involve learning a model for predicting classes (classification) or continuous values (regression). In the process of learning a model, we also need appropriate metrics for evaluating the suitability of the learned model. For Simple Linear Regression, a widely used metric is R-squared, also called the Coefficient of Determination.

R-squared measures the proportion of the variance in the dependent variable that is explained by the variance in the independent variable. In essence, it gives an indication of how well our model fits the data.

The formula for R-squared is given as:

    \[R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\]

Where SSres is the residual sum of squares and SStot is the total sum of squares. We have already encountered the residual sum of squares in the previous post. To recap, SSres is the sum of the squared differences between y (the observed values) and f(x) (the predicted values).

    \[SS_{res} = \sum_{i=1}^{N}(y_i - f(x_i))^2\]

SStot is proportional to the variance of the observed data and is given by:

    \[SS_{tot} = \sum_{i=1}^{N}(y_i - \bar{y})^2\]

In the above equation, \bar{y} is the mean of the observed y values, and y_i is the i-th observed value of y.
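
To make the formulas concrete, here’s a sketch that computes R-squared by hand with NumPy, using made-up observed and predicted values:

import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])     # observed values (made up)
f_x = np.array([2.8, 5.1, 7.3, 8.6])   # predicted values f(x_i) (made up)

ss_res = np.sum((y - f_x) ** 2)        # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(r_squared)  # 0.985 for these numbers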

In Scikit-learn, we can use the regressor’s score method to get the value of r-squared, which for a reasonable model falls between 0 and 1 (it can even be negative if the model performs worse than simply predicting the mean). To evaluate the score of our model, we will use test data that is different from the training data. Assuming we have test data assembled in a CSV file, the code outlined below computes r-squared. I’ve made some slight modifications to the existing code.

import pandas as pd
from sklearn.linear_model import LinearRegression


def read_data(file_type, x_header, y_header):
    # Read a CSV file (training or test) supplied by the user
    path = input(f'Enter path to {file_type} CSV file: ')
    frame = pd.read_csv(path)
    return frame[x_header].values, frame[y_header].values


x_header = 'amount (thousand naira)'
y_header = 'gross revenue (millions)'

x_train, y_train = read_data('Training', x_header, y_header)
x_test, y_test = read_data('Test', x_header, y_header)

# Fit the model on the training data
regressor = LinearRegression()
regressor.fit(x_train.reshape(-1, 1), y_train.reshape(-1, 1))

# Evaluate the fitted model on the held-out test data
r_squared = regressor.score(x_test.reshape(-1, 1), y_test.reshape(-1, 1))
print(f'R-Squared: {r_squared}')

Running the linked training and test samples through the algorithm should give an R-Squared value of 0.9383265010822415.

A high r-squared value indicates a strong correlation between changes in the amount spent on advertising and changes in gross revenue. Bear in mind though, we can’t always assume a low R-squared is bad or a high R-squared is good. Sometimes, noise may obscure the correlation between the independent and dependent variables. In such cases, a graphical representation of the data can help us visualize what is going on. I’ll include a few links in the Further Reading section that address this.
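
For instance, a quick plot of the test data against the fitted line (a sketch using matplotlib, and assuming the x_test, y_test, and regressor variables from the code above):

import matplotlib.pyplot as plt

plt.scatter(x_test, y_test, label='observed')             # raw test data
plt.plot(x_test, regressor.predict(x_test.reshape(-1, 1)),
         color='red', label='fitted line')                # model predictions
plt.xlabel('amount (thousand naira)')
plt.ylabel('gross revenue (millions)')
plt.legend()
plt.show()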

Wrapping Up

In this post, I was able to build a Simple Linear Regression model using Scikit-learn. The approach was much simpler and more straightforward than in my previous post on SLR, where I used NumPy functions. I also used the score method to evaluate a metric for determining the suitability of our model. In many ML problems, we have more than one independent variable. This gives way to Multiple Linear Regression, which uses multiple explanatory variables to predict the dependent variable. Hopefully, in another post, we get to do a deep dive on MLR.

Further Reading