Multiple Regression with Scikit-learn: When One Variable Isn’t Enough

So you’ve mastered simple linear regression and you’re feeling pretty good about yourself. You can predict house prices based on square footage, estimate salaries from years of experience, and impress your friends at parties with your newfound ML skills. But then reality hits: the real world is messy, and one variable rarely tells the whole story.

Enter multiple regression, where instead of relying on just one predictor variable, you can throw in as many as you want (within reason). Want to predict house prices using square footage AND number of bedrooms AND location AND age of the house? Now we’re talking. If you’re looking for other practical ML algorithms to add to your toolkit, check out my guide on Random Forest.

Why Multiple Regression?

Think about predicting someone’s salary. Sure, years of experience matter, but so do education level, location, industry, and probably whether they know how to negotiate. Multiple regression lets you capture all these factors simultaneously rather than pretending one variable tells the whole story.

The beauty is that the math isn’t much more complicated than simple regression. Instead of fitting a line to a 2D plot, you’re fitting a plane (or hyperplane, if we’re being fancy) to higher-dimensional data. Scikit-learn handles all the heavy lifting for you.
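
In plain terms, the fitted model is just a weighted sum of your inputs plus an intercept. Here's a rough sketch of that idea (the feature names match the example we build below; the b values are placeholders for whatever the fit ends up learning):

# Multiple regression prediction = intercept + weighted sum of the features
# price ≈ b0 + b1*square_feet + b2*bedrooms + b3*age + b4*distance_to_city
def predict_price(square_feet, bedrooms, age, distance_to_city, b):
    b0, b1, b2, b3, b4 = b
    return b0 + b1 * square_feet + b2 * bedrooms + b3 * age + b4 * distance_to_city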

Getting Your Hands Dirty

Let’s build a multiple regression model. I’ll cook up a synthetic housing dataset so it’s messy enough to be interesting but not so messy that we get lost in data cleaning. (The classic Boston housing dataset has been removed from recent versions of scikit-learn, so it’s no longer a good go-to.)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load your data (replace this with your actual dataset)
# For this example, I'll create some synthetic data
np.random.seed(42)
n_samples = 1000

data = pd.DataFrame({
    'square_feet': np.random.randint(800, 4000, n_samples),
    'bedrooms': np.random.randint(1, 6, n_samples),
    'age': np.random.randint(0, 100, n_samples),
    'distance_to_city': np.random.uniform(1, 50, n_samples)
})

# Create a target variable with some noise
data['price'] = (
    data['square_feet'] * 100 + 
    data['bedrooms'] * 10000 + 
    data['age'] * -500 + 
    data['distance_to_city'] * -2000 + 
    np.random.normal(0, 20000, n_samples)
)
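
Before modeling, it's worth a quick sanity check on the frame (nothing fancy, just a peek at the rows and ranges):

# Quick look at the synthetic data
print(data.head())
print(data.describe())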

Split Your Data

Always, always, always split your data. You need to know if your model actually works on data it hasn’t seen before.

X = data[['square_feet', 'bedrooms', 'age', 'distance_to_city']]
y = data['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
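
If you want to confirm the 80/20 split did what you expect:

# With test_size=0.2 and 1000 rows, that's 800 training rows and 200 test rows
print(X_train.shape, X_test.shape)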

Build and Train the Model

This part is almost anticlimactic:

model = LinearRegression()
model.fit(X_train, y_train)

Yep, that’s it. Two lines of code and you’ve got a multiple regression model. If you want to explore classification problems instead, I’ve also written about logistic regression with scikit-learn.

Understanding Your Model

Now let’s see what your model learned:

# Make predictions
y_pred = model.predict(X_test)

# Check performance
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):,.0f}")

# Look at coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})
print("nCoefficients:")
print(coefficients.sort_values('Coefficient', ascending=False))

The coefficients tell you how much each variable contributes to the prediction. A coefficient of 100 for square_feet means that for each additional square foot, the predicted price goes up by $100 (assuming everything else stays constant).
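
If you want to convince yourself that's really all the model is doing, you can rebuild a prediction by hand from the intercept and coefficients; it should match model.predict() for the same row (a quick sketch):

# Reconstruct one prediction manually: intercept + sum(coefficient * feature value)
first_row = X_test.iloc[0]
manual_prediction = model.intercept_ + np.dot(model.coef_, first_row)
print(f"Manual: {manual_prediction:,.0f}  sklearn: {model.predict(X_test.iloc[[0]])[0]:,.0f}")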

Common Pitfalls and How to Avoid Them

Multicollinearity: When your predictor variables are highly correlated with each other, the individual coefficients become unstable and hard to interpret, even if the overall predictions still look fine. Check correlations between your features and consider dropping redundant ones.
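
A quick way to screen for trouble is the correlation matrix of your features (a rough check, not a formal test; variance inflation factors are the more rigorous option):

# Pairwise correlations between predictors; values near +/-1 are a red flag
print(X.corr().round(2))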

Scale matters: Plain linear regression will happily fit features on wildly different scales, but the raw coefficients won't be directly comparable, and scale starts to matter a lot once you add regularization or use gradient-based solvers. Consider standardizing your features.
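
If you do standardize, one clean way is a Pipeline so the scaler is fit on the training data only (a minimal sketch):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale features to zero mean / unit variance, then fit the regression
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_train, y_train)
print(f"R² (scaled): {scaled_model.score(X_test, y_test):.3f}")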

Overfitting: More features isn’t always better. If you have 100 features and 50 data points, you’re gonna have a bad time. Use regularization (Ridge or Lasso regression) when you have lots of features.
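
Ridge is nearly a drop-in replacement for LinearRegression; alpha controls how hard the coefficients get shrunk (a minimal sketch, with alpha=1.0 as an arbitrary starting point rather than a tuned value):

from sklearn.linear_model import Ridge

# L2-regularized regression; larger alpha = more shrinkage of the coefficients
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"R² (ridge): {ridge.score(X_test, y_test):.3f}")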

The Bottom Line

Multiple regression is your bread-and-butter tool for prediction problems when you have multiple factors influencing an outcome. It’s interpretable, fast, and works well as a baseline before you try fancier algorithms.

The key is understanding what your coefficients mean, checking your assumptions (linear relationships, normally distributed residuals, etc.), and not throwing in every variable you can think of just because you can.
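
Checking those assumptions doesn't have to be elaborate; plotting residuals against predictions is a good first look, using the matplotlib import from the top (you want a shapeless cloud around zero, not a funnel or a curve):

# Residuals vs. predicted values; visible patterns suggest a violated assumption
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted price')
plt.ylabel('Residual')
plt.show()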

Start simple, validate your results, and iterate. That’s the scientific method, baby!