
Regression Models: When You Need to Predict Actual Numbers

Why Regression Instead of Classification?

Up to now, we’ve been doing binary classification—yes or no, churned or didn’t churn. But what if you need to predict an actual number? Like how much revenue a customer will generate, what their lifetime value is, or how many days until they churn?

That’s regression. Same scikit-learn workflow, different output. And honestly, once you understand classification, regression is easier.

This tutorial covers when to use regression instead of classification and how to build production-ready models for predicting continuous values. We’ll build a customer lifetime value (CLV) model using the same churn dataset from previous tutorials.

Classification vs Regression: When to Use Each

Use Classification When:

  • You need yes/no, true/false answers
  • You’re predicting categories (high/medium/low risk)
  • Business decisions are binary (call or don’t call)
  • Examples: Will churn? Is spam? Fraud or not?

Use Regression When:

  • You need actual numbers
  • You’re forecasting quantities
  • Business needs dollar amounts, days, counts
  • Examples: Revenue forecast, CLV, days until churn, sales volume

Sometimes you need both. You might predict if someone will buy (classification) and how much they’ll spend (regression).
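
When you need both, a common pattern is a two-stage model: a classifier for the purchase decision, then a regressor for the amount, trained only on buyers. Here is a minimal sketch of that pattern on synthetic data (every column and number below is invented for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic stand-in data (invented for this sketch)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                      # hypothetical customer features
will_buy = X[:, 0] + rng.normal(size=1000) > 0      # hypothetical purchase flag
spend = np.where(will_buy, 50 + 20 * X[:, 1] + 5 * rng.normal(size=1000), 0.0)

# Stage 1: classification (who buys?)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, will_buy)

# Stage 2: regression (how much do buyers spend?), fit on buyers only
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X[will_buy], spend[will_buy])

# Expected spend per customer = P(buy) * predicted spend given a purchase
expected_spend = clf.predict_proba(X)[:, 1] * reg.predict(X)
print(expected_spend[:5])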

What We’re Building

We’re predicting customer lifetime value using three regression approaches:

  1. Linear Regression – The baseline everyone should try first
  2. Ridge Regression – When features are correlated
  3. Random Forest Regressor – When relationships are non-linear

By the end, you’ll know which one to use and when.

The Dataset: Customer Lifetime Value

We’re using our churn dataset but now predicting total revenue instead of churn probability. Our target variable is total_revenue—the sum of all charges for each customer.
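
One note on data prep: the y_*_revenue.csv files loaded later in this tutorial weren’t created in the earlier tutorials, so here is one way you might derive the target. This is a sketch only; the raw file path and the monthly_charges / tenure_months column names are assumptions that may not match your data:

import pandas as pd

# Hypothetical raw file and column names; adjust to your dataset
df = pd.read_csv('data/churn_raw.csv')

# Sum of all charges, approximated as monthly charge x months as a customer
df['total_revenue'] = df['monthly_charges'] * df['tenure_months']

# Split with the same indices / random_state as Tutorial 2 so rows line up
# with X_train / X_val / X_test, then save each piece, for example:
# y_train.to_frame('total_revenue').to_csv('data/y_train_revenue.csv', index=False)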

Why CLV?

  • High-value customers get different treatment
  • Helps prioritize retention efforts
  • Informs marketing budget allocation
  • Drives strategic business decisions

Regression Model 1: Linear Regression

Why Start Here

Linear regression is your baseline for any regression problem. It assumes relationships are linear—as one variable increases, the target increases proportionally. Simple, fast, interpretable.

Advantages:

  • Trains in milliseconds
  • Easy to explain to stakeholders
  • Coefficients show feature importance
  • Works well for linear relationships

Limitations:

  • Assumes linear relationships (obviously)
  • Sensitive to outliers
  • Doesn’t handle feature interactions well

The Code

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load cleaned data from Tutorial 2
X_train = pd.read_csv('data/X_train_scaled.csv')
X_val = pd.read_csv('data/X_val_scaled.csv')
X_test = pd.read_csv('data/X_test_scaled.csv')

# Load revenue targets (instead of churn labels)
y_train = pd.read_csv('data/y_train_revenue.csv')['total_revenue']
y_val = pd.read_csv('data/y_val_revenue.csv')['total_revenue']
y_test = pd.read_csv('data/y_test_revenue.csv')['total_revenue']

# Train linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make predictions
y_pred_lr = lr.predict(X_val)

# Evaluate
print("Linear Regression Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_lr):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_lr)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_lr):.3f}")

# Feature coefficients (interpretability!)
coefficients = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': lr.coef_
}).sort_values('coefficient', key=abs, ascending=False)

print("\nTop 10 Most Important Features:")
print(coefficients.head(10))

Understanding Regression Metrics

Unlike classification metrics, regression metrics measure prediction error. Here are the three you’ll see everywhere, with a tiny worked example after the definitions:

MAE (Mean Absolute Error):

  • Average dollar amount you’re off by
  • If MAE = $150, predictions are off by $150 on average
  • Easy to explain to business stakeholders

RMSE (Root Mean Squared Error):

  • Penalizes large errors more heavily
  • Higher than MAE if you have outliers
  • Better for comparing models, less intuitive for stakeholders

R² Score (R-squared):

  • How much of the target’s variance your model explains (1.0 is perfect; below 0 means worse than predicting the mean)
  • 0.80 = model explains 80% of revenue variance
  • Higher is better, but context matters
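
To make these concrete, here is a tiny worked example on five hand-made predictions (the numbers are invented):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Five customers: actual revenue vs. a model's predictions (invented numbers)
y_true = np.array([1200.0, 800.0, 450.0, 3000.0, 150.0])
y_pred = np.array([1100.0, 900.0, 500.0, 2500.0, 200.0])

mae = mean_absolute_error(y_true, y_pred)           # average |error|       -> $160
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # big misses count more -> ~$234
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot   -> ~0.95

print(f"MAE: ${mae:.2f}  RMSE: ${rmse:.2f}  R²: {r2:.3f}")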

What to Expect

On a typical CLV dataset, linear regression gives:

  • MAE: $120-$200 (depending on revenue scale)
  • RMSE: $180-$300
  • R²: 0.65-0.75

If your R² is below 0.60, you probably need better features (Tutorial 6).

Regression Model 2: Ridge Regression

The Multicollinearity Problem

Linear regression struggles when features are correlated. If monthly_charges and total_charges move together, the model can’t tell which one actually matters, so their coefficient estimates become unstable.

Ridge regression adds regularization—it penalizes large coefficients, forcing the model to distribute importance more evenly across correlated features.
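
You can see the effect on a tiny synthetic example: two nearly identical features, where plain linear regression tends to assign large offsetting weights and Ridge splits the weight roughly evenly. This is synthetic data, not the churn set:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)          # almost a copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + rng.normal(scale=0.5, size=200)   # true signal comes from x1 only

# OLS typically produces large, offsetting coefficients on the correlated pair;
# Ridge shrinks them toward a stable, roughly even split
print("Linear:", LinearRegression().fit(X_demo, y_demo).coef_)
print("Ridge: ", Ridge(alpha=1.0).fit(X_demo, y_demo).coef_)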

When to Use Ridge

  • Many features are correlated
  • Linear regression coefficients look unstable
  • You’re getting weird results from linear regression
  • You want to prevent overfitting

The Code

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Try different regularization strengths
param_grid = {
    'alpha': [0.1, 1.0, 10.0, 100.0, 1000.0]
}

ridge = Ridge()
grid_search = GridSearchCV(
    ridge,
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
    n_jobs=-1
)

# Find best alpha
grid_search.fit(X_train, y_train)
best_ridge = grid_search.best_estimator_

print(f"Best alpha: {grid_search.best_params_['alpha']}")

# Make predictions
y_pred_ridge = best_ridge.predict(X_val)

# Evaluate
print("\nRidge Regression Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_ridge):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_ridge)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_ridge):.3f}")

# Compare coefficients to linear regression
ridge_coefs = pd.DataFrame({
    'feature': X_train.columns,
    'linear_coef': lr.coef_,
    'ridge_coef': best_ridge.coef_
})

print("\nCoefficient Comparison:")
print(ridge_coefs.head(10))

What to Expect

Ridge typically performs similarly to linear regression but with more stable coefficients:

  • MAE: $115-$195 (often slightly better)
  • RMSE: $175-$295
  • R²: 0.66-0.76

The real benefit is coefficient stability, not dramatic performance gains.

Regression Model 3: Random Forest Regressor

When Linear Assumptions Break Down

What if the relationship between features and revenue isn’t linear? What if high spenders behave completely differently than low spenders? Linear models can’t capture that complexity.

Random forest regression can. It builds decision trees that split customers into groups with similar revenue patterns.

Why Random Forest Regression

  • Handles non-linear relationships automatically
  • Captures feature interactions
  • Robust to outliers
  • Usually 5-10% more accurate than linear models
  • Minimal tuning needed

The Code

from sklearn.ensemble import RandomForestRegressor

# Train random forest regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    max_depth=15,
    min_samples_split=20,
    min_samples_leaf=10,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1
)

rf_reg.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_reg.predict(X_val)

# Evaluate
print("Random Forest Regressor Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_rf):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_rf)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_rf):.3f}")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_reg.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))

# Plot actual vs predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_val, y_pred_rf, alpha=0.5)
plt.plot([y_val.min(), y_val.max()], 
         [y_val.min(), y_val.max()], 
         'r--', linewidth=2)
plt.xlabel('Actual Revenue')
plt.ylabel('Predicted Revenue')
plt.title('Random Forest: Actual vs Predicted')
plt.savefig('rf_predictions.png', dpi=150, bbox_inches='tight')
plt.close()

What to Expect

Random forest typically outperforms linear models:

  • MAE: $100-$170
  • RMSE: $150-$260
  • R²: 0.72-0.82

That 5-10% improvement in R² means significantly better revenue predictions.

Comparing All Three Models

# Create comparison DataFrame
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Ridge Regression', 'Random Forest'],
    'MAE': [
        mean_absolute_error(y_val, y_pred_lr),
        mean_absolute_error(y_val, y_pred_ridge),
        mean_absolute_error(y_val, y_pred_rf)
    ],
    'RMSE': [
        np.sqrt(mean_squared_error(y_val, y_pred_lr)),
        np.sqrt(mean_squared_error(y_val, y_pred_ridge)),
        np.sqrt(mean_squared_error(y_val, y_pred_rf))
    ],
    'R²': [
        r2_score(y_val, y_pred_lr),
        r2_score(y_val, y_pred_ridge),
        r2_score(y_val, y_pred_rf)
    ]
})

print("\nModel Comparison:")
print(results.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

results.plot(x='Model', y='MAE', kind='bar', ax=axes[0], legend=False)
axes[0].set_title('Mean Absolute Error (Lower is Better)')
axes[0].set_ylabel('MAE ($)')

results.plot(x='Model', y='RMSE', kind='bar', ax=axes[1], legend=False)
axes[1].set_title('Root Mean Squared Error (Lower is Better)')
axes[1].set_ylabel('RMSE ($)')

results.plot(x='Model', y='R²', kind='bar', ax=axes[2], legend=False)
axes[2].set_title('R² Score (Higher is Better)')
axes[2].set_ylabel('R² Score')

plt.tight_layout()
plt.savefig('regression_comparison.png', dpi=150, bbox_inches='tight')
plt.close()

Making the Choice: Decision Framework

Use Linear Regression if:

  • You need results in 30 minutes
  • Interpretability is critical
  • Relationships look mostly linear
  • Dataset is small (<10K rows)
  • Time investment: 30 minutes

Use Ridge Regression if:

  • Linear regression works but coefficients are unstable
  • Many features are correlated
  • You want regularization without complexity
  • Time investment: 1 hour (including tuning)

Use Random Forest Regressor if:

  • You have a few hours to train properly
  • Performance matters more than interpretability
  • Relationships are clearly non-linear
  • Dataset is 10K-1M rows
  • Time investment: 2-4 hours

Beyond the Basics: XGBoost for Regression

Just as with classification, XGBoost has a regressor. If you need the absolute best performance and have time to tune:

import xgboost as xgb

xgb_reg = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

xgb_reg.fit(X_train, y_train,
            eval_set=[(X_val, y_val)],
            verbose=False)

y_pred_xgb = xgb_reg.predict(X_val)

print("XGBoost Regressor Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_xgb):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_xgb)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_xgb):.3f}")

Typically gets you 2-3% better R² than random forest, but at the cost of tuning complexity.
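
If you do decide the extra tuning is worth it, a randomized search over the usual knobs is a reasonable starting point. A sketch reusing X_train and y_train from above; the parameter ranges are generic defaults, not values tuned for this dataset:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# Generic search ranges (assumptions, not dataset-specific recommendations)
param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.2),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    xgb.XGBRegressor(random_state=42),
    param_dist,
    n_iter=25,                          # 25 random combinations
    cv=3,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    random_state=42
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"Best CV MAE: ${-search.best_score_:.2f}")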

Real-World Application: Customer Lifetime Value

Let’s say your random forest model predicts CLV with R² = 0.78 and MAE = $145.

Business Impact:

  • Segment customers by predicted CLV (see the sketch after this list)
  • High CLV customers (>$5K) get VIP treatment
  • Medium CLV ($1K-$5K) get standard retention
  • Low CLV (<$1K) get automated outreach
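
Here is a minimal segmentation sketch with pd.cut, using the random forest predictions from earlier. The $1K and $5K tier boundaries are the illustrative ones above, not universal thresholds:

import pandas as pd

# Bin predicted CLV into the three tiers described above
clv = pd.DataFrame({'predicted_clv': rf_reg.predict(X_val)})
clv['tier'] = pd.cut(
    clv['predicted_clv'],
    bins=[-float('inf'), 1_000, 5_000, float('inf')],
    labels=['low: automated outreach', 'medium: standard retention', 'high: VIP']
)
print(clv['tier'].value_counts())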

ROI Calculation:

  • 10,000 customers at risk of churn
  • Retention call costs $50
  • Model identifies top 2,000 highest CLV customers
  • Retention spend drops from $500K (calling everyone) to $100K (calling only the top 2,000)
  • Better allocation: Focus on customers worth 3x more

That’s the difference between good and great model selection.

Common Pitfalls in Regression

1. Ignoring outliers

  • One customer with $50K revenue skews linear models
  • Consider log-transforming the target variable (sketch below)
  • Or use tree-based models (more robust)
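
A common version of the log-transform fix: train on log(1 + revenue), then convert predictions back to dollars before reporting metrics. A sketch reusing X_train, y_train, X_val, and y_val from earlier:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Fit on log(1 + revenue) so a single $50K customer can't dominate the loss
lr_log = LinearRegression()
lr_log.fit(X_train, np.log1p(y_train))

# Convert predictions back to dollars before computing business metrics
y_pred_dollars = np.expm1(lr_log.predict(X_val))
print(f"MAE (log-trained model): ${mean_absolute_error(y_val, y_pred_dollars):.2f}")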

2. Not checking residuals

  • Plot predicted vs actual
  • Look for patterns in errors
  • Non-random patterns mean the model is missing something (see the residual plot sketch below)
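
A quick residual check using the random forest predictions from earlier; you want a shapeless cloud around zero, not a curve or a funnel:

import matplotlib.pyplot as plt

# Residuals should look like random noise around zero across the prediction range
residuals = y_val - y_pred_rf

plt.figure(figsize=(8, 5))
plt.scatter(y_pred_rf, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--', linewidth=2)
plt.xlabel('Predicted Revenue')
plt.ylabel('Residual (Actual - Predicted)')
plt.title('Random Forest: Residuals vs Predictions')
plt.savefig('rf_residuals.png', dpi=150, bbox_inches='tight')
plt.close()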

3. Using wrong metrics

  • MAE for business stakeholders (dollars off)
  • RMSE for comparing models (penalizes big errors)
  • R² for explaining variance (how much model captures)

4. Overfitting to training data

  • Always validate on holdout set
  • If training R² = 0.95 but validation R² = 0.65, you’re overfitting (quick check below)
  • Add regularization or simplify model
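
The check itself is two lines, using the random forest from earlier; a large gap between the two numbers is the classic overfitting signature:

from sklearn.metrics import r2_score

print(f"Train R²:      {r2_score(y_train, rf_reg.predict(X_train)):.3f}")
print(f"Validation R²: {r2_score(y_val, rf_reg.predict(X_val)):.3f}")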

The Bottom Line

Regression is classification’s slightly easier sibling. Same workflow, different target. Start with linear regression, upgrade to random forest if you have time, and only reach for XGBoost when those extra percentage points matter.

Your feature engineering (Tutorial 6) will have a bigger impact than your algorithm choice 90% of the time. A random forest with great features beats XGBoost with mediocre features almost every time.

Now go predict some numbers. The code is all here. Use it.


Tutorial 4 Complete

Next: Tutorial 5 – Model Evaluation: Metrics That Actually Matter

Previous: Tutorial 3 – Classification Models: Pick the Right Tool

Series: Machine Learning Fundamentals Tutorial Series