Linear regression is easy to implement in Python and is useful for predicting continuous numbers, such as forecasting revenue or predicting a house price from qualities like the number of rooms. Almost all machine learning algorithms take numbers as input, but sometimes those numbers stand for categories or other non-numeric values. Continuous numbers, by contrast, represent quantities you would sum, multiply, or otherwise manipulate mathematically.

## Theory

You probably learned the formula for Simple Linear Regression in middle school math. It looks like this:

**y = mx + b**

…where y is the target we’re trying to solve for, x is a value we know (a “feature”), m is the slope of the line, and b is the value of y when x equals zero (the “intercept”). This formula is useful, but it’s limited to a single feature, x. If we want to solve for y with multiple features, the formula is a little different:
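To make the formula concrete, here is a minimal sketch that recovers m and b from a handful of points using NumPy's `np.polyfit`. The points are made up for this illustration and lie exactly on the line y = 2x + 1, so the fit should return a slope close to 2 and an intercept close to 1.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit returns the slope (m) first, then the intercept (b)
m, b = np.polyfit(x, y, 1)

# Use the fitted line to predict y for a new value of x
x_new = 5.0
y_new = m * x_new + b
print(m, b, y_new)
```

Real data never sits perfectly on a line, so in practice m and b describe the line that comes closest to the points rather than one that passes through all of them.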

**y = w[0]*x[0] + w[1]*x[1] + … + b**

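For each feature we want to use, we just add another w[num]*x[num] term. “w” is the weight (or coefficient) of that feature: how much y changes when that feature goes up by one unit and the others stay fixed. Weights only tell you how important a feature is compared to the others when the features are on similar scales.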
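As a small sketch of the multi-feature formula, the snippet below computes y for one sample with three features, first term by term and then as a dot product. The weights, intercept, and feature values are hypothetical numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical weights and intercept, chosen only for illustration
w = np.array([3.0, -2.0, 0.5])  # one weight per feature
b = 4.0                         # intercept

# One sample with three feature values
x = np.array([1.0, 2.0, 10.0])

# Written out term by term: w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b
y_manual = w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + b

# The same sum as a dot product, which works for any number of features
y_dot = np.dot(w, x) + b

print(y_manual, y_dot)
```

Both lines compute the same value (3 - 4 + 5 + 4 = 8). The dot-product form is the one that scales, since it doesn't change when you add features.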

## Coding basic Linear Regression in Python

Most tutorials use data that comes with Pandas or scikit-learn. That might be useful for learning the theory of the algorithm, but it’s not real life. In this example we’ll be using data from a CSV, which is something you’ll do almost every day as a data scientist.

*#First import our libraries*

*import numpy as np*

*import pandas as pd*

*from sklearn.linear_model import LinearRegression*

*from sklearn.model_selection import train_test_split*

*# Load the CSV with Pandas*

*df = pd.read_csv('data.csv')*

*#Define the Target and the Features*

*X = df.drop('columnname', axis=1) #drop the target column*

*y = df[['columnname']] #keep only the target column*

*#Split the data into training and test sets*

*X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)*

*#Define the algorithm and train the model*

*lr = LinearRegression().fit(X_train,y_train)*

*#Check the coefficients (“w”)*

*print(lr.coef_)*

*#Now score the model (R² on the held-out test set)*

*print(lr.score(X_test, y_test))*
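Since `data.csv` and `columnname` are placeholders, here is a self-contained sketch of the same workflow that you can run as-is. It assumes a synthetic DataFrame in place of the CSV: the column names (`feature_a`, `feature_b`, `target`) and the weights used to build the target are invented for this example. It imports `train_test_split` from `sklearn.model_selection`, which is where it lives in current scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data.csv: two features and a target built from
# known weights (3 and -2) and intercept (5), plus a little noise
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.uniform(0, 10, size=200),
    "feature_b": rng.uniform(0, 10, size=200),
})
noise = rng.normal(0, 0.1, size=200)
df["target"] = 3.0 * df["feature_a"] - 2.0 * df["feature_b"] + 5.0 + noise

# Same steps as above: separate the features from the target
X = df.drop("target", axis=1)
y = df["target"]  # a Series here, so coef_ comes back one-dimensional

# Hold out a test set (25% of the rows by default)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression().fit(X_train, y_train)

# The fitted weights and intercept should land close to 3, -2 and 5
print(lr.coef_, lr.intercept_)

# score() returns R^2 on the held-out data; compute it by hand to compare
y_pred = lr.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
print(lr.score(X_test, y_test), r2_manual)
```

Because the target was generated from known weights, you can check that the model recovers them. With real data you won't know the "true" weights, so the score on the held-out test set is your main signal of how well the model fits.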
