The Reality of Algorithm Selection
Here’s the thing nobody tells you when you’re starting out: the algorithm usually matters less than you think. I’ve seen data scientists spend weeks optimizing model selection when their real problem was dirty data or bad features.
But—and this is important—sometimes the algorithm choice makes a massive difference. Knowing when it matters (and when it doesn’t) is what separates people who ship working ML systems from people who endlessly tweak hyperparameters.
In this tutorial, we’re going to build the same churn prediction model four different ways. You’ll see exactly where each algorithm shines and where it falls flat. No theory. No proofs. Just working code and honest comparisons.
What We’re Building
We’re predicting customer churn using the same dataset we cleaned in Tutorial 2. We’ll implement:
- Logistic Regression – The workhorse
- Decision Tree – The explainer
- Random Forest – The reliable performer
- Gradient Boosting – The competition winner
By the end, you’ll know which one to reach for first.
Algorithm 1: Logistic Regression
Why Start Here?
Logistic regression is where you should start for binary classification. Always. Here’s why:
- Fast to train – Seconds, not minutes
- Easy to interpret – You can explain it to non-technical stakeholders
- Good baseline – Shows you what simple relationships exist
- Robust – Doesn’t overfit easily
- Works well – Surprisingly good on clean data
It’s not sexy. It won’t win Kaggle competitions. But it’s shipped in more production systems than any fancy deep learning model.
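To make the "easy to interpret" claim concrete, here’s a minimal sketch of reading a fitted model’s coefficients as odds ratios. It runs on synthetic data, and the column names (`monthly_charge`, `support_tickets`, and so on) are hypothetical stand-ins, not the real columns from Tutorial 2.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data; the column names are hypothetical.
X_raw, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    random_state=42,
)
feature_names = ["monthly_charge", "support_tickets", "tenure_months", "late_payments"]
X = pd.DataFrame(StandardScaler().fit_transform(X_raw), columns=feature_names)

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-standard-deviation increase in that (scaled) feature.
odds_ratios = pd.DataFrame({
    "feature": feature_names,
    "coefficient": model.coef_[0],
    "odds_ratio": np.exp(model.coef_[0]),
}).sort_values("odds_ratio", ascending=False)

print(odds_ratios.to_string(index=False))
```

An odds ratio above 1 means the feature pushes toward churn, below 1 means it pushes away. That’s a sentence you can say out loud in a stakeholder meeting.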
The Code
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Load our cleaned data from Tutorial 2
X_train = pd.read_csv('data/X_train_scaled.csv')
X_val = pd.read_csv('data/X_val_scaled.csv')
X_test = pd.read_csv('data/X_test_scaled.csv')
y_train = pd.read_csv('data/y_train.csv')['churned']
y_val = pd.read_csv('data/y_val.csv')['churned']
y_test = pd.read_csv('data/y_test.csv')['churned']
# Train logistic regression
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train, y_train)
# Make predictions
y_pred = log_reg.predict(X_val)
y_pred_proba = log_reg.predict_proba(X_val)[:, 1]
# Evaluate
print("Logistic Regression Results:")
print(f"Accuracy: {accuracy_score(y_val, y_pred):.3f}")
print(f"Precision: {precision_score(y_val, y_pred):.3f}")
print(f"Recall: {recall_score(y_val, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_val, y_pred):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_val, y_pred_proba):.3f}")
What to Expect
On a well-prepped churn dataset, logistic regression typically gives:
- Accuracy: 75-80%
- Precision: 60-70% (of predicted churners, 60-70% actually churn)
- Recall: 50-65% (catches 50-65% of actual churners)
- ROC-AUC: 0.75-0.82
If you’re getting worse than this, your data prep needs work. Go back to Tutorial 2.
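If the precision and recall definitions above feel slippery, it can help to compute them by hand once. This is a small sketch on ten made-up customers (the labels are invented purely for illustration), checked against scikit-learn’s own functions.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score,
)

# Ten made-up customers: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of predicted churners, how many actually churned
recall = tp / (tp + fn)      # of actual churners, how many we caught
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Precision: {precision:.3f}  (sklearn: {precision_score(y_true, y_hat):.3f})")
print(f"Recall:    {recall:.3f}  (sklearn: {recall_score(y_true, y_hat):.3f})")
print(f"Accuracy:  {accuracy:.3f}  (sklearn: {accuracy_score(y_true, y_hat):.3f})")
```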
Algorithm 2: Decision Tree
The Explainability Champion
Decision trees ask yes/no questions about your features until they reach a prediction. Think of it like a flowchart:
- “Is monthly charge > $70?” → Yes → “Has support tickets > 3?” → Yes → CHURN
This makes them incredibly easy to explain. You can literally draw the decision path for any prediction.
The downside? They overfit like crazy if you’re not careful.
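You can see the overfitting for yourself with a quick experiment. The sketch below uses noisy synthetic data (an assumption, not the Tutorial 2 dataset) and compares an unconstrained tree against one with the same limits used in the code that follows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 15% label noise, so there is something to overfit to.
X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=5, flip_y=0.15,
    random_state=42,
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=42)

unconstrained = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
constrained = DecisionTreeClassifier(
    max_depth=8, min_samples_split=50, min_samples_leaf=20, random_state=42,
).fit(X_tr, y_tr)

for name, model in [("Unconstrained", unconstrained), ("Constrained", constrained)]:
    train_acc = model.score(X_tr, y_tr)
    val_acc = model.score(X_va, y_va)
    print(f"{name:14s} depth={model.get_depth():2d}  "
          f"train={train_acc:.3f}  val={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

The unconstrained tree memorizes the training set, noise included. The number to watch is the gap between training and validation accuracy: a big gap is what overfitting looks like.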
The Code
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Train decision tree with depth limit to prevent overfitting
dt = DecisionTreeClassifier(
    max_depth=8,           # Limit tree depth
    min_samples_split=50,  # Need at least 50 samples to split
    min_samples_leaf=20,   # Each leaf needs 20+ samples
    random_state=42
)
dt.fit(X_train, y_train)
# Predictions
y_pred_dt = dt.predict(X_val)
y_pred_proba_dt = dt.predict_proba(X_val)[:, 1]
# Evaluate
print("Decision Tree Results:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_dt):.3f}")
print(f"Precision: {precision_score(y_val, y_pred_dt):.3f}")
print(f"Recall: {recall_score(y_val, y_pred_dt):.3f}")
print(f"F1 Score: {f1_score(y_val, y_pred_dt):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_val, y_pred_proba_dt):.3f}")
# Visualize the tree (optional, but cool)
plt.figure(figsize=(20,10))
tree.plot_tree(dt,
               feature_names=list(X_train.columns),  # plain list works across scikit-learn versions
               class_names=['Stayed', 'Churned'],
               filled=True,
               max_depth=3)  # Only show top 3 levels
plt.savefig('decision_tree_viz.png', dpi=150, bbox_inches='tight')
plt.close()
What to Expect
A properly constrained decision tree will give:
- Accuracy: 72-78%
- Precision: 55-65%
- Recall: 55-70%
- ROC-AUC: 0.70-0.78
Usually slightly worse than logistic regression, but way easier to explain.
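If you’d rather have the flowchart as text than as an image, scikit-learn’s `export_text` prints the learned rules directly. A minimal sketch, again on synthetic data with hypothetical feature names:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature names on synthetic data, for illustration only.
feature_names = ["monthly_charge", "support_tickets", "tenure_months"]
X, y = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0,
    random_state=42,
)

# Keep the tree tiny so the printed rules stay readable.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

Each indented line is one yes/no question, and each `class:` line is a leaf. That text can go straight into an email or a compliance document.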
Algorithm 3: Random Forest
The Reliable Workhorse
Random forests are what you get when you train hundreds of decision trees on random subsets of your data, then average their predictions. It’s ensemble learning, and it works stupidly well.
Why random forests are great:
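- Hard to overfit (averaging many decorrelated trees cancels out much of each tree’s noise)
- Usually 2-5 percentage points more accurate than single trees
- Still somewhat interpretable (feature importance)
- Robust to outliers (missing values may still need imputing first, depending on your scikit-learn version)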
- Works well out-of-the-box with minimal tuning
This is my default algorithm for most classification problems. Not because it’s the best, but because it’s reliably good with minimal effort.
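The "average their predictions" part is literal, at least in scikit-learn. Here’s a short sketch on synthetic data that rebuilds a forest’s predicted probabilities by averaging its individual trees:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

forest = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=42)
forest.fit(X, y)

# Ask every tree in the forest for its own class probabilities...
per_tree = np.stack([t.predict_proba(X) for t in forest.estimators_])

# ...and average them across trees (axis 0 is the tree axis).
manual_average = per_tree.mean(axis=0)

print("Per-tree array shape:", per_tree.shape)
print("Matches forest.predict_proba:",
      np.allclose(manual_average, forest.predict_proba(X)))
```

Any single tree can be badly wrong about a customer. The average is much harder to fool.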
The Code
from sklearn.ensemble import RandomForestClassifier
# Train random forest
rf = RandomForestClassifier(
    n_estimators=100,      # 100 trees
    max_depth=10,          # Deeper than single tree
    min_samples_split=20,
    min_samples_leaf=10,
    max_features='sqrt',   # Randomness in feature selection
    random_state=42,
    n_jobs=-1              # Use all CPU cores
)
rf.fit(X_train, y_train)
# Predictions
y_pred_rf = rf.predict(X_val)
y_pred_proba_rf = rf.predict_proba(X_val)[:, 1]
# Evaluate
print("Random Forest Results:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_rf):.3f}")
print(f"Precision: {precision_score(y_val, y_pred_rf):.3f}")
print(f"Recall: {recall_score(y_val, y_pred_rf):.3f}")
print(f"F1 Score: {f1_score(y_val, y_pred_rf):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_val, y_pred_proba_rf):.3f}")
# Feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))
What to Expect
Random forests typically improve on single decision trees:
- Accuracy: 78-83%
- Precision: 65-75%
- Recall: 60-72%
- ROC-AUC: 0.78-0.85
That 3-5 point accuracy bump might not sound like much, but on a 100,000-customer dataset, that’s 3,000-5,000 more correct predictions.
Algorithm 4: Gradient Boosting (XGBoost)
The Competition Winner
Gradient boosting is what wins Kaggle competitions. It’s what to use when you need the absolute best performance and have time to tune it properly.
Instead of training trees independently of each other (like random forest, which is why it parallelizes so easily), gradient boosting trains them sequentially. Each tree tries to fix the mistakes of the previous trees. It’s brilliant, and it works really well.
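To see what "fix the mistakes of the previous trees" means, here’s a stripped-down boosting loop written by hand. It’s a simplified sketch, not how XGBoost is actually implemented (XGBoost adds second-order information and regularization, among other things), and it runs on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    """Map raw scores (log-odds) to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=42,
)

learning_rate = 0.1
n_rounds = 50

# Round 0: predict the same log-odds for everyone (the base positive rate).
base_rate = y.mean()
scores = np.full(len(y), np.log(base_rate / (1 - base_rate)))
losses = [log_loss(y, sigmoid(scores))]

trees = []
for _ in range(n_rounds):
    # The "mistakes" so far: actual label minus current predicted probability.
    residuals = y - sigmoid(scores)
    # Fit a small regression tree to those mistakes...
    small_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X, residuals)
    # ...and nudge every score a little in the direction the tree suggests.
    scores += learning_rate * small_tree.predict(X)
    trees.append(small_tree)
    losses.append(log_loss(y, sigmoid(scores)))

print(f"Training log loss before boosting: {losses[0]:.3f}")
print(f"Training log loss after {n_rounds} rounds: {losses[-1]:.3f}")
```

Note that this tracks training loss only. In a real project you’d watch validation loss too, because a boosted model will happily keep improving on the training set long after it stops improving on new data.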
Installation First
pip install xgboost
The Code
import xgboost as xgb
# Train XGBoost model
xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss'
    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
)
xgb_model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              verbose=False)
# Predictions
y_pred_xgb = xgb_model.predict(X_val)
y_pred_proba_xgb = xgb_model.predict_proba(X_val)[:, 1]
# Evaluate
print("XGBoost Results:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_xgb):.3f}")
print(f"Precision: {precision_score(y_val, y_pred_xgb):.3f}")
print(f"Recall: {recall_score(y_val, y_pred_xgb):.3f}")
print(f"F1 Score: {f1_score(y_val, y_pred_xgb):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_val, y_pred_proba_xgb):.3f}")
What to Expect
XGBoost usually edges out random forest:
- Accuracy: 80-85%
- Precision: 68-77%
- Recall: 62-75%
- ROC-AUC: 0.80-0.87
Those extra few percentage points of accuracy? They come at a cost in complexity and tuning time.
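If you want all four models side by side, one loop does it. The sketch below makes two assumptions: it uses synthetic data instead of the Tutorial 2 files, and it swaps in scikit-learn’s `GradientBoostingClassifier` as a stand-in for XGBoost so it runs without the extra install. The numbers it prints are not the numbers quoted in this tutorial.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for a churn dataset (about 25% positives).
X, y = make_classification(
    n_samples=3000, n_features=15, n_informative=6, weights=[0.75, 0.25],
    random_state=42,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42,
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(
        max_depth=8, min_samples_split=50, min_samples_leaf=20, random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=10, random_state=42, n_jobs=-1),
    # Stand-in for XGBoost; same idea, different implementation.
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    proba = model.predict_proba(X_va)[:, 1]
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_va, pred),
        "f1": f1_score(y_va, pred),
        "roc_auc": roc_auc_score(y_va, proba),
    })

comparison = pd.DataFrame(rows).sort_values("roc_auc", ascending=False)
print(comparison.to_string(index=False, float_format="%.3f"))
```

Every model sees the exact same split, which is the only way the comparison means anything.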
Making the Choice: A Decision Framework
Here’s how to actually choose in real projects:
Start with Logistic Regression if:
- Dataset less than 10K rows
- You need results in 30 minutes
- Stakeholders need to understand predictions
- Problem seems mostly linear
- Time investment: 30 minutes
Use Random Forest if:
- You have a few hours for proper training
- Dataset is 10K – 1M rows
- Performance matters more than perfect interpretability
- You want good results without much tuning
- Time investment: 2-4 hours
Use XGBoost if:
- This is a critical production system
- You have days/weeks for proper tuning
- Every percentage point of accuracy matters
- Dataset is 100K+ rows
- Time investment: 1-3 days
Use Decision Tree if:
- You need to show the decision process visually
- Regulations require complete explainability
- You’re okay with slightly lower performance
- Time investment: 1-2 hours
The Reality Check
Let’s be honest about what we just did: we improved from 77% to 83% accuracy by switching from logistic regression to XGBoost. That’s a gain of 6 percentage points. Is that worth it? Depends on your business problem.
If you’re predicting which customers to call for retention:
- 10,000 customers predicted to churn
- 77% accuracy = 7,700 correct predictions
- 83% accuracy = 8,300 correct predictions
- That’s 600 more correct predictions
If each of those turns into a saved customer worth $1,000/year, that’s up to $600K in additional revenue. Suddenly those 6 points matter a lot.
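Here’s that back-of-the-envelope math as code, so you can plug in your own numbers. It bakes in a generous assumption: every additional correct prediction becomes a retained customer. Treat the result as an upper bound.

```python
# Inputs from the example above; swap in your own.
flagged_customers = 10_000
value_per_saved_customer = 1_000   # dollars per year

baseline_accuracy_pct = 77         # logistic regression
improved_accuracy_pct = 83         # XGBoost

# Integer math avoids floating-point rounding surprises.
baseline_correct = flagged_customers * baseline_accuracy_pct // 100
improved_correct = flagged_customers * improved_accuracy_pct // 100

extra_correct = improved_correct - baseline_correct

# Assumption: every extra correct prediction turns into a saved customer.
extra_revenue = extra_correct * value_per_saved_customer

print(f"Extra correct predictions: {extra_correct:,}")
print(f"Upper-bound extra revenue: ${extra_revenue:,}/year")
```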
But if you’re building an internal demo:
- Save yourself the time
- Use logistic regression
- Ship it in an afternoon
Context matters. Always.
The Bottom Line
Algorithm selection matters less than people think, but when it matters, it matters a lot. Start with logistic regression, upgrade to random forest if you have time, and only reach for XGBoost when performance is critical.
Your data prep (Tutorial 2) and feature engineering (coming in Tutorial 6) will have bigger impacts than your algorithm choice 90% of the time.
Now go train some models. The code is all here. Use it.
Tutorial 3 Complete
Next: Tutorial 4 – Regression Models: When You Need to Predict Actual Numbers
Previous: Tutorial 2 – Data Prep: Where ML Projects Live or Die
