So you want to do some logistic regression? Cool! It’s like linear regression’s slightly more complicated cousin who went to business school. Instead of predicting continuous values, logistic regression predicts probabilities and categories. Perfect for questions like “Will this email be spam?” or “Is this customer going to buy something?”
Let’s dive into how to do this properly with scikit-learn, because there are definitely some gotchas that can trip you up.
The Basic Setup
First things first – let’s get our imports sorted:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
Loading and Preparing Your Data
Here’s where people often mess up. Let’s say you’ve got some data about whether customers will churn:
# Load your data (this could be from CSV, database, whatever)
df = pd.read_csv('customer_data.csv')
# Separate features and target
X = df[['age', 'income', 'months_subscribed', 'support_calls']]
y = df['churned'] # This should be 0s and 1s, or True/False
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
Pro tip: Always use stratify=y when splitting classification data. This ensures your train and test sets have roughly the same proportion of each class. Trust me on this one.
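If you want to sanity-check what stratification buys you, comparing the class proportions in each split takes two lines (this assumes y is a pandas Series, as above):
# With stratify=y, these two should print nearly identical proportions
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))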
The Scaling Situation
Here’s something that catches a lot of people: logistic regression is sensitive to the scale of your features. scikit-learn applies L2 regularization by default, which penalizes every coefficient on the same scale, and gradient-based solvers like lbfgs converge slowly on badly conditioned data. So if one feature ranges from 0-1 and another from 0-100000, your model is going to have a bad time.
# Scale your features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Note: only transform, don't fit!
Important: Only fit the scaler on training data, then transform both train and test sets. Fitting on the full dataset is a classic data leakage mistake.
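One way to make this mistake impossible is to wrap the scaler and the model in a Pipeline, so fitting and transforming always happen on the right data. A minimal sketch, doing the same thing we build step by step below:
from sklearn.pipeline import make_pipeline

# fit() fits the scaler on the training data only, then trains the model;
# predict()/score() reuse the already-fitted scaler on new data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)         # raw, unscaled features go in
print(pipe.score(X_test, y_test))  # accuracy on the test set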
Actually Fitting the Model
Now for the fun part:
# Create and train the model
model = LogisticRegression(
    random_state=42,
    max_iter=1000,  # Increase if you get convergence warnings
    solver='lbfgs'  # Good default choice
)
model.fit(X_train_scaled, y_train)
A few things to note:
- max_iter=1000 prevents those annoying convergence warnings (there’s a quick convergence check after this list)
- solver='lbfgs' works well for most problems. For very large datasets, try 'saga'
- Set random_state for reproducible results
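If you want to confirm the solver actually converged rather than just silencing warnings, the fitted model exposes n_iter_, the number of iterations the solver ran. A quick check on the model from above:
# n_iter_ is an array (one entry per optimization run); if it equals
# max_iter, the solver probably stopped at the cap without converging
print(model.n_iter_)
if model.n_iter_[0] >= model.max_iter:
    print("Hit the iteration cap; try raising max_iter or rescaling")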
Making Predictions and Checking Performance
# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)
# Check accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
# More detailed metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
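One thing worth knowing: predict() is just predict_proba() thresholded at 0.5. If one kind of error is costlier than the other (say, missing a churner), you can pick your own cutoff from the probabilities above. The 0.3 below is a made-up illustrative value, not a recommendation; tune it on validation data:
# Column 1 of predict_proba is the probability of class 1 (churn)
threshold = 0.3  # hypothetical cutoff for illustration only
y_pred_lenient = (y_pred_proba[:, 1] >= threshold).astype(int)
print(f"Customers flagged at threshold {threshold}: {y_pred_lenient.sum()}")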
Understanding What Your Model Learned
One of the nice things about logistic regression is interpretability:
# Get feature importance (coefficients)
feature_names = X.columns
coefficients = model.coef_[0]
# Create a DataFrame for easier viewing
coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values('coefficient', key=abs, ascending=False)
print(coef_df)
Positive coefficients push predictions toward the positive class, negative coefficients push away from it. And because the features were standardized, the magnitudes are directly comparable: each coefficient is the change in the log-odds of the positive class per one standard deviation increase in that feature.
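Since the coefficients live on the log-odds scale, exponentiating them gives odds ratios, which are often easier to explain to non-statisticians: an odds ratio of 1.5 means a one-standard-deviation increase in that feature multiplies the odds of churning by 1.5.
# Odds ratios: exp(coefficient); values > 1 raise the odds, < 1 lower them
coef_df['odds_ratio'] = np.exp(coef_df['coefficient'])
print(coef_df)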
Common Pitfalls to Avoid
- Forgetting to scale: Your features should be on similar scales
- Data leakage: Don’t fit your scaler on the full dataset
- Ignoring class imbalance: If you have way more of one class, consider using class_weight='balanced' (see the sketch after this list)
- Not checking convergence: Increase max_iter if you get warnings
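Here’s what the class-imbalance fix looks like in practice. class_weight='balanced' reweights each class inversely to its frequency in the training data, so mistakes on the rare class count for more during training:
# Same model as before, but minority-class errors are penalized more
balanced_model = LogisticRegression(
    class_weight='balanced',
    max_iter=1000,
    random_state=42
)
balanced_model.fit(X_train_scaled, y_train)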
A Complete Example
Here’s everything put together:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Create sample data
X, y = make_classification(n_samples=1000, n_features=4, random_state=42)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)
# Make predictions and evaluate
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
And there you have it! You’re now equipped to do logistic regression without the common headaches. Remember: scale your features, split your data properly, and don’t overthink it. Logistic regression is a robust algorithm that works well out of the box for most problems.