Linear Regression in Python is easy to implement and is useful for predicting continuous numbers (such as forecasting revenue or predicting a house price from qualities like the number of rooms). Almost all machine learning algorithms take numbers as input, but sometimes those numbers encode categories or other non-numeric values. Continuous numbers represent values you would sum, multiply, or otherwise manipulate mathematically.
You probably learned the formula for Simple Linear Regression in middle school math. It looks like this:
y = mx + b
…where y is the target we’re trying to solve for, x is a value we know (a “feature”), m is the slope of the line, and b is the value of y when x equals zero (the “intercept”). This formula is useful, but it’s limited to a single feature, x. If we want to solve for y with multiple features, the formula is a little different:
y = w1*x1 + w2*x2 + … + wn*xn + b
For each feature we want to use, we just add another w*x term (w1*x1, w2*x2, and so on). “w” represents the weight of that feature, or how important it is compared to the others.
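As a quick sanity check, the multi-feature formula is just a dot product of the weights and features, plus the intercept. Here’s a minimal sketch with made-up weights and feature values (none of these numbers come from a real model):

```python
import numpy as np

# Hypothetical weights and feature values for a three-feature model
w = np.array([2.0, 0.5, -1.0])  # one weight per feature
x = np.array([3.0, 4.0, 1.0])   # one value per feature
b = 10.0                        # the intercept, "b"

# y = w1*x1 + w2*x2 + w3*x3 + b
y = np.dot(w, x) + b
print(y)  # 2*3 + 0.5*4 + (-1)*1 + 10 = 17.0
```

Training a linear regression model is just the process of finding the values of w and b that best fit the data.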
Coding basic Linear Regression in Python
Most tutorials use data that comes with pandas or scikit-learn. That might be useful for learning the theory of the algorithm, but it’s not real life. In this example we’ll be using data from a CSV, which is something you’ll do almost every day as a data scientist.
# First, import our libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split  # cross_validation was removed in scikit-learn 0.20
# Load the CSV with pandas
df = pd.read_csv('data.csv')
# Define the target and the features
X = df.drop('columnname', axis=1)  # drop the target column, keeping the features
y = df['columnname']  # keep only the target column, as a Series
# Create our train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Define the algorithm and train the model
lr = LinearRegression().fit(X_train, y_train)
# Check the coefficients ("w") and the intercept ("b")
print(lr.coef_, lr.intercept_)
# Now score the model (R^2 on the held-out test set)
print(lr.score(X_test, y_test))
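To see the whole pipeline run end to end without a CSV on disk, here’s a self-contained sketch using a small made-up DataFrame (the column names and values are placeholders, not a real dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: price depends exactly linearly on rooms and sqft
df = pd.DataFrame({
    'rooms': [1, 2, 3, 4, 5, 6, 7, 8],
    'sqft':  [500, 750, 1000, 1250, 1500, 1750, 2000, 2250],
    'price': [100, 150, 200, 250, 300, 350, 400, 450],
})

X = df.drop('price', axis=1)  # features
y = df['price']               # target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
print(lr.coef_)                  # one learned weight ("w") per feature
print(lr.intercept_)             # the learned "b" term
print(lr.score(X_test, y_test))  # R^2 on held-out data; ~1.0 here because the data is exactly linear
```

With real data the test-set R^2 will be well below 1.0; this toy dataset only exists to show the shape of the workflow.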