Linear Regression in Python

Winklix LLC
5 min read · Aug 8, 2020


Keeping up with our series of articles covering Python for machine learning, we are back with another article. This time, we will cover linear regression and its implementation using scikit-learn in Python. Linear regression forms the basis for many machine learning models and is an important tool for Python developers. We will first go through the basics of linear regression and how it calculates the line of best fit, then look at the cost function that is minimised to find the best model fit. In the end, we will explore some ways to improve model accuracy, potential pitfalls in regression modelling, and how a Python development company can build a regression model to help achieve its business goals.

So let us start.

What is Linear Regression?

Linear regression is a supervised learning algorithm that predicts the value of a target variable as a linear combination of one or more input variables. This is easy to understand by recalling linear equations: we fit a line through the data and then predict values of 'Y' for different values of 'X'. The same concept applies to linear regression as well.

Basic Concept

A basic regression equation looks like the following -

Y = aX1 + bX2 + cX3 + dX4 + …. + nXn + K

Where Y = Dependent Variable/Target Variable (this is the value to be predicted)

X1, X2, X3, X4, …, Xn = Independent Variables/Features/Input Variables (these help us predict the value of Y)

a, b, c, d, …, n = coefficients of each independent variable

K = constant (analogous to intercept in linear equations)

The aim of any linear regression model is to find appropriate and statistically significant values of a, b, c, d, …, n and 'K'.

The line that we get after fitting these values in the given model is termed the "line of best fit".

So, the next logical question that arises is: how do we find the line of best fit? Well, the answer is simple. We find it by a method called "Ordinary Least Squares", often abbreviated as OLS. The objective of OLS is to find the coefficient values that minimise the sum of squared differences between the actual values of 'Y' and the predicted values of 'Y'. To put it simply, to get the line of best fit, we do the following

Minimise Σ (Y_actual − Y_predicted)²

Once we have the line of best fit by minimising the above function, our regression equation is ready. The above function is also called the "cost function".
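As a quick illustration, the cost function can be computed directly. The numbers below are made up for the example:

```python
import numpy as np

# Toy data: actual targets and predictions from some candidate line
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.8, 5.3, 6.9, 9.4])

# Ordinary least squares cost: sum of squared residuals
cost = np.sum((y_actual - y_predicted) ** 2)
print(cost)  # ≈ 0.3 (0.04 + 0.09 + 0.01 + 0.16)
```

OLS picks the coefficients that make this quantity as small as possible over the whole dataset.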

Now that we have some clarity on the theory around linear regression, let us proceed to its implementation in python.

  1. Importing packages

We start the code by importing the desired packages
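The original code screenshots are not reproduced here, so each step below is shown as a minimal sketch. A typical set of imports for this exercise looks like this:

```python
# Data handling and numerics
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt

# Modelling utilities from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
```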

  2. Reading the data

Once we are done importing the packages, we proceed to reading the data. In this step, we use the read_csv() function from the pandas package to read the CSV file on our local machine
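A sketch of this step follows. The file name Advertising.csv is an assumption; since the file is not bundled with this article, a few inline rows (illustrative values in the shape of the well-known advertising dataset) stand in for it:

```python
from io import StringIO
import pandas as pd

# On a local machine this would simply be:
#   df = pd.read_csv('Advertising.csv')   # file name is an assumption
# A few inline stand-in rows replace the file here:
csv_text = """TV,Radio,Newspaper,Sales
230.1,37.8,69.2,22.1
44.5,39.3,45.1,10.4
17.2,45.9,69.3,9.3
151.5,41.3,58.5,18.5
180.8,10.8,58.4,12.9
"""
df = pd.read_csv(StringIO(csv_text))
print(df.head())   # first five rows of the data
```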

  3. Basic data exploration

After importing the data, we explore the features and check the measures of central tendency to understand the distribution of the data
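The exploration step can be sketched as below, again using a small stand-in DataFrame with illustrative values since the dataset is not bundled here:

```python
import pandas as pd

# Stand-in rows for the advertising data (illustrative values)
df = pd.DataFrame({
    'TV':        [230.1, 44.5, 17.2, 151.5, 180.8],
    'Radio':     [37.8, 39.3, 45.9, 41.3, 10.8],
    'Newspaper': [69.2, 45.1, 69.3, 58.5, 58.4],
    'Sales':     [22.1, 10.4, 9.3, 18.5, 12.9],
})

print(df.describe())        # count, mean, std, min, quartiles, max
print(df.isnull().sum())    # confirm there are no missing values
print(df.corr()['Sales'])   # correlation of each feature with Sales
```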

Following inferences can be made out of the given data -

  • TVs contribute more to sales than Radio and Newspapers
  • The high standard deviation of the TV column shows that the data is more spread out for TV than for Newspaper and Sales
  4. Preparing the data for modelling

Once we are done with the preliminary analysis of the data, we will now separate the data into train and test sets for model creation. We will use the train_test_split function from sklearn.model_selection for the same.
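A sketch of the split; the data here is synthetic (generated with made-up coefficients) because the article's CSV is not bundled:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data
rng = np.random.default_rng(42)
X = rng.uniform(0, 300, size=(200, 3))   # columns: TV, Radio, Newspaper
y = 4.8 + 0.05 * X[:, 0] + 0.11 * X[:, 1] + rng.normal(0, 1, 200)

# Hold out 30% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

print(X_train.shape, X_test.shape)  # (140, 3) (60, 3)
```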

  5. Model fitting

Once we are ready with the ‘train’ and ‘test’ data, we will now fit the training data into our linear regression model, to train the model.
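The fitting step can be sketched as follows, again on synthetic stand-in data with made-up coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (coefficients 0.05 and 0.11 chosen for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(0, 300, size=(200, 3))        # TV, Radio, Newspaper
y = 4.8 + 0.05 * X[:, 0] + 0.11 * X[:, 1] + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Fit ordinary least squares on the training data only
model = LinearRegression()
model.fit(X_train, y_train)
```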

As mentioned above, each linear regression model has the following terms -

  • Intercept
  • Coefficients

In line with the points above, we will have an intercept and a set of coefficients for our features. Let us have a look at them -
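Inspecting the intercept and coefficients looks like this (continuing the synthetic example from the previous step; the numbers printed here will differ from the article's 4.77, 0.05, 0.11 and -0.003, which come from the real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, fitted as in the previous step
rng = np.random.default_rng(42)
X = rng.uniform(0, 300, size=(200, 3))   # TV, Radio, Newspaper
y = 4.8 + 0.05 * X[:, 0] + 0.11 * X[:, 1] + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_train, y_train)

print('Intercept:', round(model.intercept_, 2))
for name, coef in zip(['TV', 'Radio', 'Newspaper'], model.coef_):
    print(name, 'coefficient:', round(coef, 4))
```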

As we can see, our regression equation is -

Sales = 4.77 + 0.05*TV + 0.11*Radio - 0.003*Newspaper

We can interpret the coefficients as follows -

For every unit increase in TV spend, sales are likely to go up by 0.05 units; similarly, for every unit increase in Radio spend, sales are likely to go up by 0.11 units.

  6. Making and analysing the predictions

Once we are done with the model making process, we now move to making predictions, based on the model we obtained above.
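A sketch of the prediction step and the parity plot, continuing the synthetic example (the Agg backend and the output file name are choices made here so the sketch runs headlessly; interactively you would call plt.show() instead):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line to show plots
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, fitted as in the earlier steps
rng = np.random.default_rng(42)
X = rng.uniform(0, 300, size=(200, 3))
y = 4.8 + 0.05 * X[:, 0] + 0.11 * X[:, 1] + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_train, y_train)

# Predict on the held-out data
y_pred = model.predict(X_test)

# Parity plot: actual vs predicted, with the x = y reference line
plt.scatter(y_test, y_pred)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims)                 # the x = y line
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.savefig('parity_plot.png')
```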

As is clear from the parity plot above, the model yields very good results. The plot shows the actual values of 'Sales' against the predicted values. Since most points lie close to the line x = y, we can say the model has done a good job of predicting sales.

  7. Evaluating the metrics

Once we are done building the model, we will now calculate the metrics to evaluate our model’s performance. The metrics we will use to evaluate the model are -

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • R-squared (R²)

Generally, to check how well the model fits, we look at the R² value. The closer it is to 1 (or 100%), the better the model. In our case, its value is 86.5%, which indicates a decent fit.
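The metric calculations can be sketched as below, continuing the synthetic example (so the printed values will differ from the article's 86.5%, which comes from the real dataset):

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, fitted as in the earlier steps
rng = np.random.default_rng(42)
X = rng.uniform(0, 300, size=(200, 3))
y = 4.8 + 0.05 * X[:, 0] + 0.11 * X[:, 1] + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)   # average absolute error
mse = metrics.mean_squared_error(y_test, y_pred)    # average squared error
rmse = np.sqrt(mse)                                 # back in the units of Sales
r2 = metrics.r2_score(y_test, y_pred)               # fraction of variance explained

print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse, 'R2:', r2)
```

RMSE is reported alongside MSE because it is in the same units as the target, which makes it easier to interpret.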

With this, we come to the end of another article.

Hope you all learnt something new.

Until next time, bye bye!
