Why Regression Instead of Classification?
Up to now, we’ve been doing binary classification—yes or no, churned or didn’t churn. But what if you need to predict an actual number? Like how much revenue a customer will generate, what their lifetime value is, or how many days until they churn?
That’s regression. Same scikit-learn workflow, different output. And honestly, once you understand classification, regression is easier.
This tutorial covers when to use regression instead of classification and how to build production-ready models for predicting continuous values. We’ll build a customer lifetime value (CLV) model using the same churn dataset from previous tutorials.
Classification vs Regression: When to Use Each
Use Classification When:
- You need yes/no, true/false answers
- You’re predicting categories (high/medium/low risk)
- Business decisions are binary (call or don’t call)
- Examples: Will churn? Is spam? Fraud or not?
Use Regression When:
- You need actual numbers
- You’re forecasting quantities
- Business needs dollar amounts, days, counts
- Examples: Revenue forecast, CLV, days until churn, sales volume
Sometimes you need both. You might predict if someone will buy (classification) and how much they’ll spend (regression).
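If you go the two-stage route, the pattern looks roughly like this. It's a minimal sketch with placeholder names (X, y_buy, y_spend_buyers, and X_new are hypothetical, not columns from our dataset): a classifier estimates purchase probability, a regressor estimates spend for buyers, and their product gives expected revenue.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Hypothetical inputs: X (features), y_buy (0/1 purchase labels),
# y_spend_buyers (spend amounts for the customers who bought), X_new (new customers)
clf = RandomForestClassifier(random_state=42).fit(X, y_buy)
reg = RandomForestRegressor(random_state=42).fit(X[y_buy == 1], y_spend_buyers)

buy_prob = clf.predict_proba(X_new)[:, 1]      # P(customer buys)
spend_if_buy = reg.predict(X_new)              # expected spend given a purchase
expected_revenue = buy_prob * spend_if_buy     # expected value combines both models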
What We’re Building
We’re predicting customer lifetime value using three regression approaches:
- Linear Regression – The baseline everyone should try first
- Ridge Regression – When features are correlated
- Random Forest Regressor – When relationships are non-linear
By the end, you’ll know which one to use and when.
The Dataset: Customer Lifetime Value
We’re using our churn dataset but now predicting total revenue instead of churn probability. Our target variable is total_revenue—the sum of all charges for each customer.
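If you need to recreate those target files rather than reuse the ones from Tutorial 2, here's a rough sketch of how they might be produced. The df variable, the column names, and the 60/20/20 split are assumptions; follow whatever Tutorial 2 actually did, including scaling the feature files.
from sklearn.model_selection import train_test_split

# Hypothetical sketch: df is the cleaned churn DataFrame from Tutorial 2
# with a total_revenue column; your column names may differ
y = df['total_revenue']
X = df.drop(columns=['total_revenue', 'churn'])

# 60/20/20 train/validation/test split to mirror the files loaded below
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

y_train.to_frame('total_revenue').to_csv('data/y_train_revenue.csv', index=False)
y_val.to_frame('total_revenue').to_csv('data/y_val_revenue.csv', index=False)
y_test.to_frame('total_revenue').to_csv('data/y_test_revenue.csv', index=False)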
Why CLV?
- High-value customers get different treatment
- Helps prioritize retention efforts
- Informs marketing budget allocation
- Drives strategic business decisions
Regression Model 1: Linear Regression
Why Start Here
Linear regression is your baseline for any regression problem. It assumes relationships are linear—as one variable increases, the target increases proportionally. Simple, fast, interpretable.
Advantages:
- Trains in milliseconds
- Easy to explain to stakeholders
- Coefficients show feature importance
- Works well for linear relationships
Limitations:
- Assumes linear relationships (obviously)
- Sensitive to outliers
- Doesn’t handle feature interactions well
The Code
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
# Load cleaned data from Tutorial 2
X_train = pd.read_csv('data/X_train_scaled.csv')
X_val = pd.read_csv('data/X_val_scaled.csv')
X_test = pd.read_csv('data/X_test_scaled.csv')
# Load revenue targets (instead of churn labels)
y_train = pd.read_csv('data/y_train_revenue.csv')['total_revenue']
y_val = pd.read_csv('data/y_val_revenue.csv')['total_revenue']
y_test = pd.read_csv('data/y_test_revenue.csv')['total_revenue']
# Train linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)
# Make predictions
y_pred_lr = lr.predict(X_val)
# Evaluate
print("Linear Regression Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_lr):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_lr)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_lr):.3f}")
# Feature coefficients (interpretability!)
coefficients = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': lr.coef_
}).sort_values('coefficient', key=abs, ascending=False)
print("\nTop 10 Most Important Features:")
print(coefficients.head(10))
Understanding Regression Metrics
Unlike classification metrics, regression metrics measure prediction error:
MAE (Mean Absolute Error):
- Average dollar amount you’re off by
- If MAE = $150, predictions are off by $150 on average
- Easy to explain to business stakeholders
RMSE (Root Mean Squared Error):
- Penalizes large errors more heavily
- Always at least as large as MAE; the gap grows when you have outliers
- Standard for optimization and model comparison, but less intuitive to explain
R² Score (R-squared):
- The share of variance in the target your model explains (1.0 is perfect; it can go negative if the model is worse than just predicting the mean)
- 0.80 = model explains 80% of revenue variance
- Higher is better, but context matters
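To make those definitions concrete, here's a tiny hand-rolled version on five made-up customers (toy numbers, purely illustrative), checked against the same formulas scikit-learn uses:
import numpy as np

# Toy example: five customers' actual vs predicted revenue
actual = np.array([500.0, 1200.0, 300.0, 2500.0, 900.0])
predicted = np.array([620.0, 1100.0, 410.0, 2300.0, 1000.0])

errors = actual - predicted
mae = np.mean(np.abs(errors))                     # average absolute miss, in dollars
rmse = np.sqrt(np.mean(errors ** 2))              # squaring punishes the big misses
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                          # share of variance explained

print(f"MAE:  ${mae:.2f}")
print(f"RMSE: ${rmse:.2f}")
print(f"R²:   {r2:.3f}")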
What to Expect
On a typical CLV dataset, linear regression gives:
- MAE: $120-$200 (depending on revenue scale)
- RMSE: $180-$300
- R²: 0.65-0.75
If your R² is below 0.60, you probably need better features (Tutorial 6).
Regression Model 2: Ridge Regression
The Multicollinearity Problem
Linear regression struggles when features are correlated. If monthly_charges and total_charges move together, the model can't reliably tell which one drives revenue, and the coefficient estimates become unstable.
Ridge regression adds regularization—it penalizes large coefficients, forcing the model to distribute importance more evenly across correlated features.
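You can see the effect on a small synthetic example with two nearly identical features (made-up data, not our churn set): plain linear regression can split the effect arbitrarily between the near-duplicates, while ridge shrinks both toward similar, stable values.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)      # nearly a copy of x1
y_demo = 3 * x1 + rng.normal(scale=0.5, size=500)
X_demo = np.column_stack([x1, x2])

print("Linear coefficients:", LinearRegression().fit(X_demo, y_demo).coef_)
print("Ridge coefficients: ", Ridge(alpha=10.0).fit(X_demo, y_demo).coef_)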
When to Use Ridge
- Many features are correlated
- Linear regression coefficients look unstable
- You’re getting weird results from linear regression
- You want to prevent overfitting
The Code
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# Try different regularization strengths
param_grid = {
    'alpha': [0.1, 1.0, 10.0, 100.0, 1000.0]
}
ridge = Ridge()
grid_search = GridSearchCV(
    ridge,
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
    n_jobs=-1
)
# Find best alpha
grid_search.fit(X_train, y_train)
best_ridge = grid_search.best_estimator_
print(f"Best alpha: {grid_search.best_params_['alpha']}")
# Make predictions
y_pred_ridge = best_ridge.predict(X_val)
# Evaluate
print("nRidge Regression Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_ridge):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_ridge)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_ridge):.3f}")
# Compare coefficients to linear regression
ridge_coefs = pd.DataFrame({
    'feature': X_train.columns,
    'linear_coef': lr.coef_,
    'ridge_coef': best_ridge.coef_
})
print("\nCoefficient Comparison:")
print(ridge_coefs.head(10))
What to Expect
Ridge typically performs similarly to linear regression but with more stable coefficients:
- MAE: $115-$195 (often slightly better)
- RMSE: $175-$295
- R²: 0.66-0.76
The real benefit is coefficient stability, not dramatic performance gains.
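If you want to see that stability for yourself, one way is to refit both models on bootstrap resamples of the training data and compare how much the coefficients move. A quick sketch reusing the variables above:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.utils import resample

lin_coefs, ridge_coefs_boot = [], []
for seed in range(20):
    Xb, yb = resample(X_train, y_train, random_state=seed)
    lin_coefs.append(LinearRegression().fit(Xb, yb).coef_)
    ridge_coefs_boot.append(Ridge(alpha=best_ridge.alpha).fit(Xb, yb).coef_)

# Lower standard deviation across resamples = more stable coefficients
stability = pd.DataFrame({
    'feature': X_train.columns,
    'linear_coef_std': np.std(lin_coefs, axis=0),
    'ridge_coef_std': np.std(ridge_coefs_boot, axis=0)
}).sort_values('linear_coef_std', ascending=False)
print(stability.head(10))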
Regression Model 3: Random Forest Regressor
When Linear Assumptions Break Down
What if the relationship between features and revenue isn’t linear? What if high spenders behave completely differently than low spenders? Linear models can’t capture that complexity.
Random forest regression can. It builds decision trees that split customers into groups with similar revenue patterns.
Why Random Forest Regression
- Handles non-linear relationships automatically
- Captures feature interactions
- Robust to outliers
- Usually 5-10% more accurate than linear models
- Minimal tuning needed
The Code
from sklearn.ensemble import RandomForestRegressor
# Train random forest regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    max_depth=15,
    min_samples_split=20,
    min_samples_leaf=10,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
# Predictions
y_pred_rf = rf_reg.predict(X_val)
# Evaluate
print("Random Forest Regressor Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_rf):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_rf)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_rf):.3f}")
# Feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_reg.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))
# Plot actual vs predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_val, y_pred_rf, alpha=0.5)
plt.plot([y_val.min(), y_val.max()],
         [y_val.min(), y_val.max()],
         'r--', linewidth=2)
plt.xlabel('Actual Revenue')
plt.ylabel('Predicted Revenue')
plt.title('Random Forest: Actual vs Predicted')
plt.savefig('rf_predictions.png', dpi=150, bbox_inches='tight')
plt.close()
What to Expect
Random forest typically outperforms linear models:
- MAE: $100-$170
- RMSE: $150-$260
- R²: 0.72-0.82
That 5-10% improvement in R² means significantly better revenue predictions.
Comparing All Three Models
# Create comparison DataFrame
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Ridge Regression', 'Random Forest'],
    'MAE': [
        mean_absolute_error(y_val, y_pred_lr),
        mean_absolute_error(y_val, y_pred_ridge),
        mean_absolute_error(y_val, y_pred_rf)
    ],
    'RMSE': [
        np.sqrt(mean_squared_error(y_val, y_pred_lr)),
        np.sqrt(mean_squared_error(y_val, y_pred_ridge)),
        np.sqrt(mean_squared_error(y_val, y_pred_rf))
    ],
    'R²': [
        r2_score(y_val, y_pred_lr),
        r2_score(y_val, y_pred_ridge),
        r2_score(y_val, y_pred_rf)
    ]
})
print("\nModel Comparison:")
print(results.to_string(index=False))
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
results.plot(x='Model', y='MAE', kind='bar', ax=axes[0], legend=False)
axes[0].set_title('Mean Absolute Error (Lower is Better)')
axes[0].set_ylabel('MAE ($)')
results.plot(x='Model', y='RMSE', kind='bar', ax=axes[1], legend=False)
axes[1].set_title('Root Mean Squared Error (Lower is Better)')
axes[1].set_ylabel('RMSE ($)')
results.plot(x='Model', y='R²', kind='bar', ax=axes[2], legend=False)
axes[2].set_title('R² Score (Higher is Better)')
axes[2].set_ylabel('R² Score')
plt.tight_layout()
plt.savefig('regression_comparison.png', dpi=150, bbox_inches='tight')
plt.close()
Making the Choice: Decision Framework
Use Linear Regression if:
- You need results in 30 minutes
- Interpretability is critical
- Relationships look mostly linear
- Dataset is small (<10K rows)
- Time investment: 30 minutes
Use Ridge Regression if:
- Linear regression works but coefficients are unstable
- Many features are correlated
- You want regularization without complexity
- Time investment: 1 hour (including tuning)
Use Random Forest Regressor if:
- You have a few hours to train properly
- Performance matters more than interpretability
- Relationships are clearly non-linear
- Dataset is 10K-1M rows
- Time investment: 2-4 hours
Beyond the Basics: XGBoost for Regression
Just like classification, XGBoost also has a regressor. If you need the absolute best performance and have time to tune:
import xgboost as xgb
xgb_reg = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
xgb_reg.fit(X_train, y_train,
            eval_set=[(X_val, y_val)],
            verbose=False)
y_pred_xgb = xgb_reg.predict(X_val)
print("XGBoost Regressor Results:")
print(f"MAE: ${mean_absolute_error(y_val, y_pred_xgb):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_val, y_pred_xgb)):.2f}")
print(f"R² Score: {r2_score(y_val, y_pred_xgb):.3f}")
Typically gets you 2-3% better R² than random forest, but at the cost of tuning complexity.
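If you do go down the tuning path, a randomized search is a reasonable starting point. The search space below is illustrative, not a tuned recommendation:
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

# Candidate values are assumptions for demonstration purposes
param_dist = {
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}

search = RandomizedSearchCV(
    xgb.XGBRegressor(random_state=42),
    param_dist,
    n_iter=20,
    cv=5,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    random_state=42
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print(f"Best CV MAE: ${-search.best_score_:.2f}")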
Real-World Application: Customer Lifetime Value
Let’s say your random forest model predicts CLV with R² = 0.78 and MAE = $145.
Business Impact:
- Segment customers by predicted CLV
- High CLV customers (>$5K) get VIP treatment
- Medium CLV ($1K-$5K) get standard retention
- Low CLV (<$1K) get automated outreach
ROI Calculation:
- 10,000 customers at risk of churn
- Retention call costs $50
- Model identifies top 2,000 highest CLV customers
- Retention budget: $500K to call everyone vs. $100K to call only the top 2,000 (a $400K saving)
- Better allocation: Focus on customers worth 3x more
That’s the difference between good and great model selection.
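The arithmetic behind those numbers, spelled out:
customers_at_risk = 10_000
cost_per_call = 50
top_clv_customers = 2_000

call_everyone = customers_at_risk * cost_per_call    # $500,000
call_top_only = top_clv_customers * cost_per_call    # $100,000
print(f"Budget saved: ${call_everyone - call_top_only:,}")  # $400,000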
Common Pitfalls in Regression
1. Ignoring outliers
- One customer with $50K revenue skews linear models
- Consider log-transforming target variable
- Or use tree-based models (more robust)
2. Not checking residuals
- Plot predicted vs actual
- Look for patterns in errors
- Non-random patterns mean the model is missing something (see the sketch after this list)
3. Using wrong metrics
- MAE for business stakeholders (dollars off)
- RMSE for comparing models (penalizes big errors)
- R² for explaining variance (how much model captures)
4. Overfitting to training data
- Always validate on holdout set
- If training R² = 0.95 but validation R² = 0.65, you’re overfitting
- Add regularization or simplify model
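Here is a short sketch that runs the residual, log-transform, and overfitting checks against the models from earlier in this tutorial:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Pitfall 2: residuals should scatter randomly around zero
residuals = y_val - y_pred_rf
plt.figure(figsize=(8, 5))
plt.scatter(y_pred_rf, residuals, alpha=0.5)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Predicted Revenue')
plt.ylabel('Residual (Actual - Predicted)')
plt.title('Residuals vs Predicted')
plt.savefig('rf_residuals.png', dpi=150, bbox_inches='tight')
plt.close()

# Pitfall 1: fit on a log-transformed target, then invert predictions with expm1
lr_log = LinearRegression().fit(X_train, np.log1p(y_train))
y_pred_log = np.expm1(lr_log.predict(X_val))
print(f"Log-target MAE: ${mean_absolute_error(y_val, y_pred_log):.2f}")

# Pitfall 4: a large train/validation gap in R² signals overfitting
print(f"Train R²: {r2_score(y_train, rf_reg.predict(X_train)):.3f}")
print(f"Val R²:   {r2_score(y_val, y_pred_rf):.3f}")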
The Bottom Line
Regression is classification’s slightly easier sibling. Same workflow, different target. Start with linear regression, upgrade to random forest if you have time, and only reach for XGBoost when those extra percentage points matter.
Your feature engineering (Tutorial 6) will have a bigger impact than your algorithm choice 90% of the time. A random forest with great features beats XGBoost with mediocre features nearly every time.
Now go predict some numbers. The code is all here. Use it.
Tutorial 4 Complete
Next: Tutorial 5 – Model Evaluation: Metrics That Actually Matter
Previous: Tutorial 3 – Classification Models: Pick the Right Tool
