A head start on Linear Regression
I’m sure most people reading this have already come across the term Linear Regression. We even study it in 10th grade, but back then we were unaware of how powerful this algorithm is.
Let’s dive deep into Linear Regression. So, what is Linear Regression? It is simply a linear approach to modelling the relationship between two variables. Let’s go term by term: Linear and Regression. Linearity is the property of a mathematical relationship that can be graphically represented as a straight line.

Don’t worry too much about the equation right now. Just look at the line drawn above. This is what a linear graph looks like.
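To make “linear” concrete, here is a tiny sketch (the slope and intercept values are arbitrary, chosen just for illustration) showing the defining property of a line: equal steps in x always produce equal steps in y.

```python
# Illustrative values for y = m*x + c; m and c are assumed, not from any dataset
m, c = 2.0, 1.0  # slope and y-intercept

xs = [0, 1, 2, 3, 4]
ys = [m * x + c for x in xs]
print(ys)  # [1.0, 3.0, 5.0, 7.0, 9.0]

# Linearity: consecutive differences are constant and equal to the slope m
diffs = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]
print(diffs)  # [2.0, 2.0, 2.0, 2.0]
```

If you plotted these (x, y) pairs, they would all fall exactly on one straight line.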
And regression analysis is a statistical process for estimating the relationships among variables.
If you still do not understand, don’t worry. By the end of this article, I’m pretty sure you will have a basic intuition of what Linear Regression is.
Types of Linear Regression
There are various types of linear regression but in this article we will mainly focus on:
1. Simple Linear Regression
2. Multiple Linear Regression
1. Simple Linear Regression
Suppose there is a real estate dataset where one column is the number of rooms and another is the price of the house. When we plot one against the other, we see a linear relationship between them. You might be wondering why. Suppose your house has 8 rooms and your friend’s house has 3. Whose house do you think will cost more? Obviously, your house will cost more than his. Similarly, in the real world, as the number of rooms increases, the house price also tends to increase. Let us see a dummy dataset implementation below.

In the above diagram you can see the linear graph. Now I’m pretty sure you understand what linear graphs look like.
Real-world datasets are rarely this perfectly linear, because they are affected by real-world factors that we cannot fully quantify with mathematical formulas.
I’m sure you’re wondering where the algorithms and mathematics are. Don’t worry, you will see them in a minute. But it is very important to get an overview and the definitions of an algorithm before diving into the mathematics.
As we saw above, we plotted number_of_rooms on the x-axis and price on the y-axis, and the points formed a straight line. Recall the equation of a straight line in 2D:
y = mx + c
Here y is the price (plotted on the y-axis), x is number_of_rooms (plotted on the x-axis), and m is the slope. With the help of this equation we can actually predict future values.
And c is the y-intercept: it lets the line shift up or down while keeping the same angle. If the y-intercept is zero, the straight line passes through the origin. But in many problems c should not be zero. For example, say we have a dataset with x as years of experience and y as salary, and we have to predict the salary of a future employee who has 0 years of experience. With c = 0 the equation becomes:
y = m·0 + 0
y = 0
This would mean employees with 0 years of experience should get no salary, which is not true. People with no experience also get paid; the intercept captures that base salary.
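A small illustration of the intercept acting as a base salary. All the numbers here are made up for the example; m is an assumed raise per year of experience and c an assumed base salary.

```python
# Hypothetical salary model: y = m*x + c, with x = years of experience
m = 5000.0   # assumed increase in salary per year of experience (slope)
c = 30000.0  # assumed base salary for a fresher (y-intercept)

def predict_salary(years_experience):
    return m * years_experience + c

print(predict_salary(0))  # 30000.0 -> a fresher still gets the base salary
print(predict_salary(4))  # 50000.0
```

With c = 0 the first prediction would collapse to 0, which is exactly the unrealistic case described above.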
Now it’s time for you to build a simple ML model using Linear Regression.

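A minimal sketch of this model-building workflow with sklearn. The dataset values below are dummy numbers assumed for illustration; only the overall steps (split, fit, predict) follow the walkthrough.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Dummy real-estate data: number of rooms vs. house price (values assumed)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([1000, 2050, 3010, 4100, 5000, 6020, 7100, 8500, 9050, 10020])

# Hold out 20% of the data for testing; random_state fixes the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)       # training learns slope m and intercept c
predictions = model.predict(X_test)
print(predictions)
```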
Here X is number_of_rooms and y is price.
First of all, we split the data into X and y variables. Then we split it into train and test sets: I have used 20 percent for testing, which you can see from test_size=0.2, and random_state acts like a seed, so the train/test split stays the same every time we split the data. Then we instantiate the LinearRegression class and fit it on X_train and y_train, which hold the 80 percent of the data used for training, keeping the remaining 20 percent for testing. Finally, we call predict on the test data and inspect the results.
What happens during training is that a line is fitted to the data in such a way that it makes the minimum error; we call that line the best-fit line. The values of m and c are adjusted during training to minimize the error, and during testing the same slope m and intercept c are applied to predict.
where the slope is:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

and the intercept, i.e. c, is:

c = ȳ − m·x̄

Here x̄ and ȳ are the means of x and y.
We can see above that 8500 was predicted as 8160.24 and 3010 as 3246.2195, which is fine, though we can reduce this error to some extent. You must be wondering why we are not getting exact results. Why is 8500 not predicted as exactly 8500? Because some error is expected in ML: if a model is 100 percent accurate, it is most likely overfitting, or there is data leakage.
We can also build our own Linear Regression class in Python from scratch using these formulas. Let us see the implementation below:

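A minimal from-scratch sketch using the ordinary least-squares formulas for the slope and intercept. The class and variable names here are my own choices, not from any library.

```python
import numpy as np

class MyLinearRegression:
    """Simple linear regression from scratch, via the least-squares formulas."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float).ravel()
        y = np.asarray(y, dtype=float)
        x_mean, y_mean = X.mean(), y.mean()
        # slope m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
        self.m = ((X - x_mean) * (y - y_mean)).sum() / ((X - x_mean) ** 2).sum()
        # intercept c = y_mean - m * x_mean
        self.c = y_mean - self.m * x_mean
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float).ravel()
        return self.m * X + self.c

# Usage on a tiny, perfectly linear dummy dataset
model = MyLinearRegression().fit([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(model.m, model.c)    # 2.0 0.0 for this data
print(model.predict([6]))  # [12.]
```

On perfectly linear data the fitted m and c reproduce the generating line exactly; on noisy data they give the best-fit line instead.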
Here we can see that the code we built from scratch predicts the same values as the sklearn library. So we now know that sklearn arrives at the same least-squares fit internally.
2. Multiple Linear Regression
This is the same as Simple Linear Regression, except there are more input columns. In the real estate dataset we considered above there were only two columns, but now there will be multiple independent features, i.e. more input columns, such as street, quality, area, etc. Most real-world datasets look like this, with many features.
In sklearn we can use the same LinearRegression class for multiple linear regression.
The formula for Multiple Linear Regression is:

y = B0 + B1X1 + B2X2 + … + BnXn + e
- y = the predicted value of the dependent variable
- B0 = the y-intercept (value of y when all other parameters are set to 0)
- B1X1= the regression coefficient (B1) of the first independent variable (X1) (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value)
- … = do the same for however many independent variables you are testing
- BnXn = the regression coefficient of the last independent variable
- e = model error (a.k.a. how much variation there is in our estimate of y)
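A short sketch of multiple linear regression with the same sklearn class, using two assumed features (number of rooms and area); all values are dummy numbers for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Dummy data with two features per house: [number_of_rooms, area_in_sq_m]
X = np.array([
    [2,  60],
    [3,  80],
    [3,  95],
    [4, 100],
    [5, 150],
    [6, 200],
])
y = np.array([3000, 4200, 4600, 5100, 7500, 9800])  # assumed prices

model = LinearRegression()
model.fit(X, y)

print(model.coef_)        # one coefficient B1..Bn per feature
print(model.intercept_)   # B0, the intercept
print(model.predict([[4, 120]]))  # price estimate for a new, unseen house
```

Nothing in the API changes compared to the simple case; sklearn infers the number of coefficients from the number of columns in X.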
Now try implementing linear regression yourself using a real-world dataset. You can find datasets on Kaggle.