
Regression Models: When You Need to Predict Actual Numbers

Why Regression Instead of Classification?

Up to now, we’ve been doing binary classification—yes or no, churned or didn’t churn. But what if you need to predict an actual number? Like how much revenue a customer will generate, what their lifetime value is, or how many days until they churn?

That’s regression. Same scikit-learn workflow, different output. And honestly, once you understand classification, regression is easier.

This tutorial covers when to use regression instead of classification and how to build production-ready models for predicting continuous values. We’ll build a customer lifetime value (CLV) model using the same churn dataset from previous tutorials.

Classification vs Regression: When to Use Each

Use Classification When:

  • You need yes/no, true/false answers
  • You’re predicting categories (high/medium/low risk)
  • Business decisions are binary (call or don’t call)
  • Examples: Will churn? Is spam? Fraud or not?

Use Regression When:

  • You need actual numbers
  • You’re forecasting quantities
  • Business needs dollar amounts, days, counts
  • Examples: Revenue forecast, CLV, days until churn, sales volume

Sometimes you need both. You might predict if someone will buy (classification) and how much they’ll spend (regression).
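
When you need both, a common pattern is a two-stage model: a classifier for the purchase decision, then a regressor for the amount, trained only on buyers. Here is a minimal sketch of that pattern on synthetic data (every column and number below is invented for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic stand-in data (invented for this sketch)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                      # hypothetical customer features
will_buy = X[:, 0] + rng.normal(size=1000) > 0      # hypothetical purchase flag
spend = np.where(will_buy, 50 + 20 * X[:, 1] + 5 * rng.normal(size=1000), 0.0)

# Stage 1: classification (who buys?)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, will_buy)

# Stage 2: regression (how much do buyers spend?), fit on buyers only
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X[will_buy], spend[will_buy])

# Expected spend per customer = P(buy) * predicted spend given a purchase
expected_spend = clf.predict_proba(X)[:, 1] * reg.predict(X)
print(expected_spend[:5])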

What We’re Building

We’re predicting customer lifetime value using three regression approaches:

  1. Linear Regression – The baseline everyone should try first
  2. Ridge Regression – When features are correlated
  3. Random Forest Regressor – When relationships are non-linear

By the end, you’ll know which one to use and when.

The Dataset: Customer Lifetime Value

We’re using our churn dataset but now predicting total revenue instead of churn probability. Our target variable is total_revenue—the sum of all charges for each customer.
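
One note on data prep: the y_*_revenue.csv files loaded later in this tutorial weren’t created in the earlier tutorials, so here is one way you might derive the target. This is a sketch only; the raw file path and the monthly_charges / tenure_months column names are assumptions that may not match your data:

import pandas as pd

# Hypothetical raw file and column names; adjust to your dataset
df = pd.read_csv('data/churn_raw.csv')

# Sum of all charges, approximated as monthly charge x months as a customer
df['total_revenue'] = df['monthly_charges'] * df['tenure_months']

# Split with the same indices / random_state as Tutorial 2 so rows line up
# with X_train / X_val / X_test, then save each piece, for example:
# y_train.to_frame('total_revenue').to_csv('data/y_train_revenue.csv', index=False)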

Why CLV?

  • High-value customers get different treatment
  • Helps prioritize retention efforts
  • Informs marketing budget allocation
  • Drives strategic business decisions

Regression Model 1: Linear Regression

Why Start Here

Linear regression is your baseline for any regression problem. It assumes relationships are linear—as one variable increases, the target increases proportionally. Simple, fast, interpretable.

Advantages:

  • Trains in milliseconds
  • Easy to explain to stakeholders
  • Coefficients show feature importance
  • Works well for linear relationships

Limitations:

  • Assumes linear relationships (obviously)
  • Sensitive to outliers
  • Doesn’t handle feature interactions well

The Code

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load cleaned data from Tutorial 2
X_train = pd.read_csv('data/X_train_scaled.csv')
X_val = pd.read_csv('data/X_val_scaled.csv')
X_test = pd.read_csv('data/X_test_scaled.csv')

# Load revenue targets (instead of churn labels)
y_train = pd.read_csv('data/y_train_revenue.csv')['total_revenue']
y_val = pd.read_csv('data/y_val_revenue.csv')['total_revenue']
y_test = pd.read_csv('data/y_test_revenue.csv')['total_revenue']

# Train linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make predictions
y_pred_lr = lr.predict(X_val)

# Evaluate
print("Linear Regression Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_lr):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_lr)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_lr):.3f}")

# Feature coefficients (interpretability!)
coefficients = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': lr.coef_
}).sort_values('coefficient', key=abs, ascending=False)

print("\nTop 10 Most Important Features:")
print(coefficients.head(10))

Understanding Regression Metrics

Unlike classification metrics, regression metrics measure prediction error. Here are the three you’ll see everywhere, with a tiny worked example after the definitions:

MAE (Mean Absolute Error):

  • Average dollar amount you’re off by
  • If MAE = $150, predictions are off by $150 on average
  • Easy to explain to business stakeholders

RMSE (Root Mean Squared Error):

  • Penalizes large errors more heavily
  • Higher than MAE if you have outliers
  • Better for comparing models, less intuitive for stakeholders

R² Score (R-squared):

  • How much of the target’s variance your model explains (1.0 is perfect; below 0 means worse than predicting the mean)
  • 0.80 = model explains 80% of revenue variance
  • Higher is better, but context matters
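
To make these concrete, here is a tiny worked example on five hand-made predictions (the numbers are invented):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Five customers: actual revenue vs. a model's predictions (invented numbers)
y_true = np.array([1200.0, 800.0, 450.0, 3000.0, 150.0])
y_pred = np.array([1100.0, 900.0, 500.0, 2500.0, 200.0])

mae = mean_absolute_error(y_true, y_pred)           # average |error|       -> $160
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # big misses count more -> ~$234
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot   -> ~0.95

print(f"MAE: ${mae:.2f}  RMSE: ${rmse:.2f}  R²: {r2:.3f}")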

What to Expect

On a typical CLV dataset, linear regression gives:

  • MAE: $120-$200 (depending on revenue scale)
  • RMSE: $180-$300
  • R²: 0.65-0.75

If your R² is below 0.60, you probably need better features (Tutorial 6).

Regression Model 2: Ridge Regression

The Multicollinearity Problem

Linear regression struggles when features are correlated. If monthly_charges and total_charges move together, the model can’t tell which one actually matters, so their coefficient estimates become unstable.

Ridge regression adds regularization—it penalizes large coefficients, forcing the model to distribute importance more evenly across correlated features.
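
You can see the effect on a tiny synthetic example: two nearly identical features, where plain linear regression tends to assign large offsetting weights and Ridge splits the weight roughly evenly. This is synthetic data, not the churn set:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)          # almost a copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + rng.normal(scale=0.5, size=200)   # true signal comes from x1 only

# OLS typically produces large, offsetting coefficients on the correlated pair;
# Ridge shrinks them toward a stable, roughly even split
print("Linear:", LinearRegression().fit(X_demo, y_demo).coef_)
print("Ridge: ", Ridge(alpha=1.0).fit(X_demo, y_demo).coef_)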

When to Use Ridge

  • Many features are correlated
  • Linear regression coefficients look unstable
  • You’re getting weird results from linear regression
  • You want to prevent overfitting

The Code

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Try different regularization strengths
param_grid = {
    'alpha': [0.1, 1.0, 10.0, 100.0, 1000.0]
}

ridge = Ridge()
grid_search = GridSearchCV(
    ridge,
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
    n_jobs=-1
)

# Find best alpha
grid_search.fit(X_train, y_train)
best_ridge = grid_search.best_estimator_

print(f"Best alpha: {grid_search.best_params_['alpha']}")

# Make predictions
y_pred_ridge = best_ridge.predict(X_val)

# Evaluate
print("\nRidge Regression Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_ridge):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_ridge)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_ridge):.3f}")

# Compare coefficients to linear regression
ridge_coefs = pd.DataFrame({
    'feature': X_train.columns,
    'linear_coef': lr.coef_,
    'ridge_coef': best_ridge.coef_
})

print("\nCoefficient Comparison:")
print(ridge_coefs.head(10))

What to Expect

Ridge typically performs similarly to linear regression but with more stable coefficients:

  • MAE: $115-$195 (often slightly better)
  • RMSE: $175-$295
  • R²: 0.66-0.76

The real benefit is coefficient stability, not dramatic performance gains.

Regression Model 3: Random Forest Regressor

When Linear Assumptions Break Down

What if the relationship between features and revenue isn’t linear? What if high spenders behave completely differently than low spenders? Linear models can’t capture that complexity.

Random forest regression can. It builds decision trees that split customers into groups with similar revenue patterns.

Why Random Forest Regression

  • Handles non-linear relationships automatically
  • Captures feature interactions
  • Robust to outliers
  • Usually 5-10% more accurate than linear models
  • Minimal tuning needed

The Code

from sklearn.ensemble import RandomForestRegressor

# Train random forest regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    max_depth=15,
    min_samples_split=20,
    min_samples_leaf=10,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1
)

rf_reg.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_reg.predict(X_val)

# Evaluate
print("Random Forest Regressor Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_rf):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_rf)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_rf):.3f}")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_reg.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))

# Plot actual vs predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_val, y_pred_rf, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], 
         [y_val.min(), y_val.max()], 
         'r--', linewidth=2)
plt.xlabel('Actual Revenue')
plt.ylabel('Predicted Revenue')
plt.title('Random Forest: Actual vs Predicted')
plt.savefig('rf_predictions.png', dpi=150, bbox_inches='tight')
plt.close()

What to Expect

Random forest typically outperforms linear models:

  • MAE: $100-$170
  • RMSE: $150-$260
  • R²: 0.72-0.82

That 5-10% improvement in R² means significantly better revenue predictions.

Comparing All Three Models

# Create comparison DataFrame
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Ridge Regression', 'Random Forest'],
    'MAE': [
        mean_absolute_error(y_val, y_pred_lr),
        mean_absolute_error(y_val, y_pred_ridge),
        mean_absolute_error(y_val, y_pred_rf)
    ],
    'RMSE': [
        np.sqrt(mean_squared_error(y_val, y_pred_lr)),
        np.sqrt(mean_squared_error(y_val, y_pred_ridge)),
        np.sqrt(mean_squared_error(y_val, y_pred_rf))
    ],
    'R²': [
        r2_score(y_val, y_pred_lr),
        r2_score(y_val, y_pred_ridge),
        r2_score(y_val, y_pred_rf)
    ]
})

print("\nModel Comparison:")
print(results.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

results.plot(x='Model', y='MAE', kind='bar', ax=axes[0], legend=False)
axes[0].set_title('Mean Absolute Error (Lower is Better)')
axes[0].set_ylabel('MAE ($)')

results.plot(x='Model', y='RMSE', kind='bar', ax=axes[1], legend=False)
axes[1].set_title('Root Mean Squared Error (Lower is Better)')
axes[1].set_ylabel('RMSE ($)')

results.plot(x='Model', y='R²', kind='bar', ax=axes[2], legend=False)
axes[2].set_title('R² Score (Higher is Better)')
axes[2].set_ylabel('R² Score')

plt.tight_layout()
plt.savefig('regression_comparison.png', dpi=150, bbox_inches='tight')
plt.close()

Making the Choice: Decision Framework

Use Linear Regression if:

  • You need results in 30 minutes
  • Interpretability is critical
  • Relationships look mostly linear
  • Dataset is small (<10K rows)
  • Time investment: 30 minutes

Use Ridge Regression if:

  • Linear regression works but coefficients are unstable
  • Many features are correlated
  • You want regularization without complexity
  • Time investment: 1 hour (including tuning)

Use Random Forest Regressor if:

  • You have a few hours to train properly
  • Performance matters more than interpretability
  • Relationships are clearly non-linear
  • Dataset is 10K-1M rows
  • Time investment: 2-4 hours

Beyond the Basics: XGBoost for Regression

Just as with classification, XGBoost has a regressor. If you need the absolute best performance and have time to tune:

import xgboost as xgb

xgb_reg = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

xgb_reg.fit(X_train, y_train,
            eval_set=[(X_val, y_val)],
            verbose=False)

y_pred_xgb = xgb_reg.predict(X_val)

print("XGBoost Regressor Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_xgb):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_xgb)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_xgb):.3f}")

Typically gets you 2-3% better R² than random forest, but at the cost of tuning complexity.
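
If you do decide the extra tuning is worth it, a randomized search over the usual knobs is a reasonable starting point. A sketch reusing X_train and y_train from above; the parameter ranges are generic defaults, not values tuned for this dataset:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# Generic search ranges (assumptions, not dataset-specific recommendations)
param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.2),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    xgb.XGBRegressor(random_state=42),
    param_dist,
    n_iter=25,                          # 25 random combinations
    cv=3,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    random_state=42
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"Best CV MAE: ${-search.best_score_:.2f}")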

Real-World Application: Customer Lifetime Value

Let’s say your random forest model predicts CLV with R² = 0.78 and MAE = $145.

Business Impact:

  • Segment customers by predicted CLV (see the sketch after this list)
  • High CLV customers (>$5K) get VIP treatment
  • Medium CLV ($1K-$5K) get standard retention
  • Low CLV (<$1K) get automated outreach
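
Here is a minimal segmentation sketch with pd.cut, using the random forest predictions from earlier. The $1K and $5K tier boundaries are the illustrative ones above, not universal thresholds:

import pandas as pd

# Bin predicted CLV into the three tiers described above
clv = pd.DataFrame({'predicted_clv': rf_reg.predict(X_val)})
clv['tier'] = pd.cut(
    clv['predicted_clv'],
    bins=[-float('inf'), 1_000, 5_000, float('inf')],
    labels=['low: automated outreach', 'medium: standard retention', 'high: VIP']
)
print(clv['tier'].value_counts())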

ROI Calculation:

  • 10,000 customers at risk of churn
  • Retention call costs $50
  • Model identifies top 2,000 highest CLV customers
  • Retention spend drops from $500K (calling everyone) to $100K (calling only the top 2,000)
  • Better allocation: Focus on customers worth 3x more

That’s the difference between good and great model selection.

Common Pitfalls in Regression

1. Ignoring outliers

  • One customer with $50K revenue skews linear models
  • Consider log-transforming the target variable (sketch below)
  • Or use tree-based models (more robust)
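
A common version of the log-transform fix: train on log(1 + revenue), then convert predictions back to dollars before reporting metrics. A sketch reusing X_train, y_train, X_val, and y_val from earlier:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Fit on log(1 + revenue) so a single $50K customer can't dominate the loss
lr_log = LinearRegression()
lr_log.fit(X_train, np.log1p(y_train))

# Convert predictions back to dollars before computing business metrics
y_pred_dollars = np.expm1(lr_log.predict(X_val))
print(f"MAE (log-trained model): ${mean_absolute_error(y_val, y_pred_dollars):.2f}")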

2. Not checking residuals

  • Plot predicted vs actual
  • Look for patterns in errors
  • Non-random patterns mean the model is missing something (see the residual plot sketch below)
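
A quick residual check using the random forest predictions from earlier; you want a shapeless cloud around zero, not a curve or a funnel:

import matplotlib.pyplot as plt

# Residuals should look like random noise around zero across the prediction range
residuals = y_val - y_pred_rf

plt.figure(figsize=(8, 5))
plt.scatter(y_pred_rf, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--', linewidth=2)
plt.xlabel('Predicted Revenue')
plt.ylabel('Residual (Actual - Predicted)')
plt.title('Random Forest: Residuals vs Predictions')
plt.savefig('rf_residuals.png', dpi=150, bbox_inches='tight')
plt.close()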

3. Using wrong metrics

  • MAE for business stakeholders (dollars off)
  • RMSE for comparing models (penalizes big errors)
  • R² for explaining variance (how much model captures)

4. Overfitting to training data

  • Always validate on holdout set
  • If training R² = 0.95 but validation R² = 0.65, you’re overfitting (quick check below)
  • Add regularization or simplify model
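
The check itself is two lines, using the random forest from earlier; a large gap between the two numbers is the classic overfitting signature:

from sklearn.metrics import r2_score

print(f"Train R²:      {r2_score(y_train, rf_reg.predict(X_train)):.3f}")
print(f"Validation R²: {r2_score(y_val, rf_reg.predict(X_val)):.3f}")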

The Bottom Line

Regression is classification’s slightly easier sibling. Same workflow, different target. Start with linear regression, upgrade to random forest if you have time, and only reach for XGBoost when those extra percentage points matter.

Your feature engineering (Tutorial 6) will have a bigger impact than your algorithm choice 90% of the time. A random forest with great features beats XGBoost with mediocre features almost every time.

Now go predict some numbers. The code is all here. Use it.


Tutorial 4 Complete

Next: Tutorial 5 – Model Evaluation: Metrics That Actually Matter

Previous: Tutorial 3 – Classification Models: Pick the Right Tool

Series: Machine Learning Fundamentals Tutorial Series