
ML Model Evaluation: Beyond Accuracy

First of all, let me apologize for this being such a long post. But if you can truly grasp the material presented here, you'll be able to distinguish a truly useful model from one that needs more work. And isn’t that really the point of what we do?

Accuracy Is a Trap

You could deploy a 95% accurate model that costs millions in missed opportunities: it looks great in testing, then fails spectacularly in production.

The truth is, most of the time accuracy tells you almost nothing about whether your machine learning model will work. A model predicting fraud with 99% accuracy sounds impressive until you realize 99% of transactions aren’t fraudulent anyway. You could build a model that says “not fraud” every single time and hit that metric.

In this tutorial, we’re evaluating a customer churn model using metrics that actually matter. You’ll learn which metrics to trust, which ones lie (and when), and how to catch problems before they hit production.

If you haven’t built the foundation yet, start with Tutorial 1: ML Fundamentals and work through the series.

What We’re Building

We’re evaluating a binary classification model that predicts customer churn. We’ll use:

Confusion Matrix – What’s actually happening
Precision and Recall – The tradeoffs that matter
F1 Score – When you need balance
ROC-AUC – Model discrimination ability
Precision-Recall Curve – For imbalanced data
Business Metrics – What actually drives decisions

By the end, you should know exactly which metrics matter for your specific problem.

The Dataset and Problem

Let’s say we’re predicting customer churn. Here’s what we’re working with:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load data
df = pd.read_csv('customer_churn.csv')

# Features and target
X = df.drop(['customer_id', 'churned'], axis=1)
y = df['churned']  # 1 = churned, 0 = retained

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a baseline model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

print(f"Model trained on {len(X_train)} customers")
print(f"Testing on {len(X_test)} customers")

Now let’s see how good this model actually is.

Metric 1: Accuracy (And Why It Lies)

Let’s start with the metric everyone knows:

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

# Check the baseline
baseline_accuracy = (y_test == 0).sum() / len(y_test)
print(f"Baseline (predict no churn): {baseline_accuracy:.3f}")

Output:

Accuracy: 0.847
Baseline (predict no churn): 0.825

Looks good, right? The model is 84.7% accurate. But here’s the problem:

The class distribution is imbalanced. Only about 17% of customers actually churn, so a model that predicts “no churn” for everyone gets roughly 83% accuracy without learning anything. Our model is barely better than that do-nothing baseline, and measuring only accuracy hides the real issue.
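
To make that concrete, here’s a quick sanity check with a majority-class dummy model (a minimal sketch using scikit-learn’s DummyClassifier and the train/test variables from the setup above):

from sklearn.dummy import DummyClassifier

# A "model" that always predicts the majority class (no churn)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)

dummy_acc = accuracy_score(y_test, dummy.predict(X_test))
print(f"Dummy accuracy: {dummy_acc:.3f}")  # matches the no-churn baseline above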

Metric 2: Confusion Matrix

The confusion matrix shows what the model actually predicts versus what’s true:

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.xticks([0.5, 1.5], ['Retained', 'Churned'])
plt.yticks([0.5, 1.5], ['Retained', 'Churned'])
plt.tight_layout()
plt.savefig('confusion_matrix.png')
plt.show()

print("Confusion Matrix:")
print(cm)

Output:

Confusion Matrix:
[[1580   70]
 [ 235  115]]

Broken down, the results mean this:

True Negatives (1580): Correctly predicted retained customers
False Positives (70): Predicted churn, but customer stayed (false alarm)
False Negatives (235): Predicted retained, but customer churned (missed churn)
True Positives (115): Correctly predicted churn

The problem should be obvious: we’re missing 235 churning customers. That’s 67% of all churners that this model fails to catch.
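
If you prefer working with the raw counts, you can unpack the matrix directly (a small convenience snippet; for binary labels 0/1, sklearn lays the matrix out as [[TN, FP], [FN, TP]]):

# Unpack counts: rows are actual (0, 1), columns are predicted (0, 1)
tn, fp, fn, tp = cm.ravel()

missed_rate = fn / (fn + tp)
print(f"Missed churners: {fn} of {fn + tp} ({missed_rate:.0%})")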

Metric 3: Precision and Recall

Now we get to metrics that actually matter for most business decisions.

from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")

# Calculate what this means in business terms
total_predicted_churn = cm[0, 1] + cm[1, 1]
actual_churners_caught = cm[1, 1]
total_actual_churners = cm[1, 0] + cm[1, 1]

print(f"nOf {total_predicted_churn} customers we predicted would churn:")
print(f"  {actual_churners_caught} actually churned (precision)")
print(f"nOf {total_actual_churners} customers who actually churned:")
print(f"  {actual_churners_caught} we caught (recall)")

Output:

Precision: 0.622
Recall: 0.329

Of 185 customers we predicted would churn:
  115 actually churned (precision)

Of 350 customers who actually churned:
  115 we caught (recall)

Precision: When we predict churn, we’re right 62% of the time. 38% are false alarms.

Recall: We only catch 33% of customers who actually churn. We miss 67%.

This is the central tradeoff in classification evaluation: you can usually improve one only at the expense of the other.
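
You can see the tradeoff directly by re-thresholding the predicted probabilities yourself (a quick sketch; the threshold values here are arbitrary examples, not recommendations):

# Sweep a few decision thresholds and watch precision and recall move in opposite directions
for t in [0.3, 0.4, 0.5, 0.6]:
    preds = (y_pred_proba >= t).astype(int)
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds)
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")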

So, Which Metric Matters?

It depends on your business. High precision matters when false positives are expensive (retention offers cost money), when you have limited resources to intervene, or when customer annoyance from unnecessary outreach is a concern.

High recall matters when missing churners is very expensive (high customer lifetime value), when you can afford to contact more customers, or when the cost of intervention is low compared to losing a customer.

For most churn problems, recall matters more. Missing a churner costs you their entire lifetime value.
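
If you want precision, recall, and F1 for both classes in one table, scikit-learn’s classification_report is a handy summary (shown here as an aside; it doesn’t replace thinking about which error costs more):

from sklearn.metrics import classification_report

# Per-class precision, recall, F1, and support in one table
print(classification_report(y_test, y_pred, target_names=['Retained', 'Churned']))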

Metric 4: F1 Score

F1 score is the “harmonic mean” of precision and recall:

from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.3f}")

# Compare to individual metrics
print(f"nPrecision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1: {f1:.3f}")
print(f"Average of P and R: {(precision + recall) / 2:.3f}")

Output:

F1 Score: 0.430

Precision: 0.622
Recall: 0.329
F1: 0.430
Average of P and R: 0.476

The F1 score penalizes models with imbalanced precision and recall. Notice it’s lower than the simple average because recall is so bad.
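
As a sanity check, you can reproduce the F1 number straight from the harmonic-mean formula:

# Harmonic mean of precision and recall, computed by hand
f1_manual = 2 * precision * recall / (precision + recall)
print(f"Manual F1: {f1_manual:.3f}")  # should match f1_score above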

You should use F1 when you care about both precision and recall equally, you need a single metric for model comparison, or you’re doing hyperparameter tuning and need one number to optimize.

You should NOT use F1 when precision and recall have different business costs, you need to understand the actual tradeoffs, or you’re presenting to stakeholders who need interpretable metrics (it can be a bit hard to explain to non-techies).
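
When precision and recall carry different business costs, one common option is the more general F-beta score, which weights recall more heavily (beta > 1) or less heavily (beta < 1). A minimal sketch with scikit-learn’s fbeta_score:

from sklearn.metrics import fbeta_score

# F2 weights recall twice as heavily as precision; F0.5 does the reverse
f2 = fbeta_score(y_test, y_pred, beta=2)
f05 = fbeta_score(y_test, y_pred, beta=0.5)
print(f"F2:   {f2:.3f}")
print(f"F0.5: {f05:.3f}")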

Metric 5: ROC Curve and AUC (Area Under the Curve)

The ROC (Receiver Operating Characteristic) curve shows how well your model separates classes at different thresholds:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, label=f'Model (AUC = {auc:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curve.png')
plt.show()

print(f"AUC: {auc:.3f}")

Output:

AUC: 0.742

What AUC means:

  • 0.5 = Random guessing (no better than a coin flip)
  • 0.7-0.8 = Acceptable
  • 0.8-0.9 = Good
  • 0.9+ = Excellent

Our model has an AUC of 0.742, which is acceptable but not great. It can somewhat discriminate between churners and non-churners, but there’s room for improvement.
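
Another way to read AUC: it’s the probability that a randomly chosen churner gets a higher predicted score than a randomly chosen non-churner. You can estimate that directly from the test predictions (a rough Monte Carlo sketch that ignores ties):

import numpy as np

# Estimate P(score of a random churner > score of a random non-churner)
rng = np.random.default_rng(42)
pos_scores = y_pred_proba[np.asarray(y_test) == 1]
neg_scores = y_pred_proba[np.asarray(y_test) == 0]
pairs = 100_000
est = (rng.choice(pos_scores, pairs) > rng.choice(neg_scores, pairs)).mean()
print(f"Estimated ranking probability: {est:.3f}  (AUC = {auc:.3f})")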

Metric 6: Precision-Recall Curve (For Imbalanced Data)

For imbalanced datasets like churn (where those who churned are usually far outnumbered by those who didn’t), the precision-recall curve is often more informative than ROC:

from sklearn.metrics import precision_recall_curve, average_precision_score

# Calculate precision-recall curve
precision_vals, recall_vals, thresholds = precision_recall_curve(y_test, y_pred_proba)
avg_precision = average_precision_score(y_test, y_pred_proba)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(recall_vals, precision_vals, label=f'Model (AP = {avg_precision:.3f})', linewidth=2)

# Baseline (proportion of positive class)
baseline = y_test.sum() / len(y_test)
plt.axhline(y=baseline, color='k', linestyle='--', label=f'Baseline (AP = {baseline:.3f})', linewidth=1)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('precision_recall_curve.png')
plt.show()

print(f"Average Precision: {avg_precision:.3f}")
print(f"Baseline: {baseline:.3f}")

Output:

Average Precision: 0.456
Baseline: 0.175

This curve shows the precision-recall tradeoff at different thresholds. The average precision (0.456) is significantly better than baseline (0.175), which confirms the model adds value.
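
One practical use of this curve is picking an operating threshold. For example, here’s a minimal sketch that finds the highest-precision point subject to a recall floor (the 0.60 target is an arbitrary assumption for illustration):

import numpy as np

# precision_recall_curve returns one more precision/recall point than thresholds
target_recall = 0.60
ok = recall_vals[:-1] >= target_recall
if ok.any():
    best = np.argmax(np.where(ok, precision_vals[:-1], -np.inf))
    print(f"Pick threshold {thresholds[best]:.3f}: "
          f"precision={precision_vals[best]:.3f}, recall={recall_vals[best]:.3f}")
else:
    print("No threshold reaches the target recall")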

The Business Metrics That Actually Matter

Now let’s translate model performance into business impact. To catch more churners, we’ll also lower the decision threshold from the default 0.5 to 0.345:

# Business assumptions
avg_customer_ltv = 1200  # Average customer lifetime value
retention_offer_cost = 50  # Cost to send retention offer
retention_success_rate = 0.30  # 30% of contacted churners stay

# Model performance at a lowered threshold of 0.345 (instead of the default 0.5)
threshold = 0.345
y_pred_adjusted = (y_pred_proba >= threshold).astype(int)

# Calculate confusion matrix with new threshold
cm_adjusted = confusion_matrix(y_test, y_pred_adjusted)
tn, fp, fn, tp = cm_adjusted.ravel()

# Business calculations
print("Business Impact Analysis:n")

# Cost of false positives (unnecessary offers)
fp_cost = fp * retention_offer_cost
print(f"False Positives: {fp}")
print(f"  Cost of unnecessary offers: ${fp_cost:,}n")

# Cost of false negatives (missed churners)
fn_cost = fn * avg_customer_ltv
print(f"False Negatives: {fn}")
print(f"  Lost customer value: ${fn_cost:,}n")

# Value of true positives (saved customers)
saved_customers = tp * retention_success_rate
saved_value = saved_customers * avg_customer_ltv
intervention_cost = tp * retention_offer_cost
net_value_tp = saved_value - intervention_cost

print(f"True Positives: {tp}")
print(f"  Customers contacted: {tp}")
print(f"  Expected saves (30% success): {saved_customers:.0f}")
print(f"  Value saved: ${saved_value:,.0f}")
print(f"  Intervention cost: ${intervention_cost:,}")
print(f"  Net value: ${net_value_tp:,.0f}n")

# Total impact
total_cost = fp_cost + fn_cost
total_benefit = net_value_tp
net_impact = total_benefit - total_cost

print(f"Total Model Impact:")
print(f"  Costs: ${total_cost:,}")
print(f"  Benefits: ${total_benefit:,.0f}")
print(f"  Net Impact: ${net_impact:,.0f}")

Output:

Business Impact Analysis:

False Positives: 370
  Cost of unnecessary offers: $18,500

False Negatives: 140
  Lost customer value: $168,000

True Positives: 210
  Customers contacted: 210
  Expected saves (30% success): 63
  Value saved: $75,600
  Intervention cost: $10,500
  Net value: $65,100

Total Model Impact:
  Costs: $186,500
  Benefits: $65,100
  Net Impact: -$121,400

This is the reality check. Turn stats into dollars and everything becomes clearer. Even though our model metrics look okay, we’re losing money because we’re missing too many churners (false negatives).

This tells us we need to:

  • Lower the threshold even more to catch more churners (see the quick sensitivity check sketched below)
  • Improve the model with better features
  • Reconsider the business case
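
Before deciding between those options, it’s worth checking how sensitive the result is to the threshold. Here’s a rough sketch under the same business assumptions (LTV, offer cost, 30% save rate) used above:

import numpy as np

def net_impact(threshold):
    # Net dollars at a given decision threshold, using the assumptions above
    preds = (y_pred_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    benefit = tp * retention_success_rate * avg_customer_ltv - tp * retention_offer_cost
    cost = fp * retention_offer_cost + fn * avg_customer_ltv
    return benefit - cost

for t in np.arange(0.10, 0.55, 0.05):
    print(f"threshold={t:.2f}  net impact=${net_impact(t):,.0f}")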

What We’ve Learned About Model Evaluation

Accuracy is almost never the right metric
Always look deeper, especially with imbalanced classes. This isn’t the metric you’re looking for; move along.

Understand the confusion matrix
False positives and false negatives have different business costs. Know which one matters more in your particular situation.

Precision-recall curves are better for imbalanced data
ROC curves can be overly optimistic when classes are imbalanced, as churn classes usually are.

Always calculate business metrics
Model metrics don’t matter if the business case doesn’t work. For more on this, see how we engineer features to improve model performance.

The Bottom Line

Model evaluation in machine learning is where theory meets reality. Accuracy doesn’t matter; what matters is understanding the confusion matrix, choosing the right metrics for your business problem, and translating model performance into business impact.

For churn prediction, recall usually matters more than precision because missing churners costs you their entire lifetime value. Learn to calculate the actual business impact.

A model that looks good on paper but loses money in production is a failed model.

