So you want to do some logistic regression? Cool! It’s like linear regression’s slightly more complicated cousin who went to business school. Instead of predicting continuous values, logistic regression predicts probabilities and categories. Perfect for questions like “Will this email be spam?” or “Is this customer going to buy something?”
Let’s dive into how to do this properly with scikit-learn, because there are definitely some gotchas that can trip you up.
The Basic Setup
First things first – let’s get our imports sorted:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
Loading and Preparing Your Data
Here’s where people often mess up. Let’s say you’ve got some data about whether customers will churn:
# Load your data (this could be from CSV, database, whatever)
df = pd.read_csv('customer_data.csv')
# Separate features and target
X = df[['age', 'income', 'months_subscribed', 'support_calls']]
y = df['churned'] # This should be 0s and 1s, or True/False
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
Pro tip: Always use stratify=y when splitting classification data. This ensures your train and test sets have roughly the same proportion of each class. Trust me on this one.
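If you want to sanity-check what stratification buys you, comparing the class proportions in each split takes two lines (this assumes y is a pandas Series, as above):
# With stratify=y, these two should print nearly identical proportions
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))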
The Scaling Situation
Here’s something that catches a lot of people: logistic regression is sensitive to the scale of your features. scikit-learn applies L2 regularization by default, which penalizes every coefficient on the same scale, and gradient-based solvers like lbfgs converge slowly on badly conditioned data. So if one feature ranges from 0-1 and another from 0-100000, your model is going to have a bad time.
# Scale your features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Note: only transform, don't fit!
Important: Only fit the scaler on training data, then transform both train and test sets. Fitting on the full dataset is a classic data leakage mistake.
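One way to make this mistake impossible is to wrap the scaler and the model in a Pipeline, so fitting and transforming always happen on the right data. A minimal sketch, doing the same thing we build step by step below:
from sklearn.pipeline import make_pipeline

# fit() fits the scaler on the training data only, then trains the model;
# predict()/score() reuse the already-fitted scaler on new data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)         # raw, unscaled features go in
print(pipe.score(X_test, y_test))  # accuracy on the test set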
Actually Fitting the Model
Now for the fun part:
# Create and train the model
model = LogisticRegression(
    random_state=42,
    max_iter=1000,  # Increase if you get convergence warnings
    solver='lbfgs'  # Good default choice
)
model.fit(X_train_scaled, y_train)
A few things to note:
- max_iter=1000 prevents those annoying convergence warnings (there’s a quick convergence check after this list)
- solver='lbfgs' works well for most problems. For very large datasets, try 'saga'
- Set random_state for reproducible results
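If you want to confirm the solver actually converged rather than just silencing warnings, the fitted model exposes n_iter_, the number of iterations the solver ran. A quick check on the model from above:
# n_iter_ is an array (one entry per optimization run); if it equals
# max_iter, the solver probably stopped at the cap without converging
print(model.n_iter_)
if model.n_iter_[0] >= model.max_iter:
    print("Hit the iteration cap; try raising max_iter or rescaling")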
Making Predictions and Checking Performance
# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)
# Check accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
# More detailed metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
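One thing worth knowing: predict() is just predict_proba() thresholded at 0.5. If one kind of error is costlier than the other (say, missing a churner), you can pick your own cutoff from the probabilities above. The 0.3 below is a made-up illustrative value, not a recommendation; tune it on validation data:
# Column 1 of predict_proba is the probability of class 1 (churn)
threshold = 0.3  # hypothetical cutoff for illustration only
y_pred_lenient = (y_pred_proba[:, 1] >= threshold).astype(int)
print(f"Customers flagged at threshold {threshold}: {y_pred_lenient.sum()}")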
Understanding What Your Model Learned
One of the nice things about logistic regression is interpretability:
# Get feature importance (coefficients)
feature_names = X.columns
coefficients = model.coef_[0]
# Create a DataFrame for easier viewing
coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values('coefficient', key=abs, ascending=False)
print(coef_df)
Positive coefficients push predictions toward the positive class, negative coefficients push away from it. And because the features were standardized, the magnitudes are directly comparable: each coefficient is the change in the log-odds of the positive class per one standard deviation increase in that feature.
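Since the coefficients live on the log-odds scale, exponentiating them gives odds ratios, which are often easier to explain to non-statisticians: an odds ratio of 1.5 means a one-standard-deviation increase in that feature multiplies the odds of churning by 1.5.
# Odds ratios: exp(coefficient); values > 1 raise the odds, < 1 lower them
coef_df['odds_ratio'] = np.exp(coef_df['coefficient'])
print(coef_df)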
Common Pitfalls to Avoid
- Forgetting to scale: Your features should be on similar scales
- Data leakage: Don’t fit your scaler on the full dataset
- Ignoring class imbalance: If you have way more of one class, consider using class_weight='balanced' (see the sketch after this list)
- Not checking convergence: Increase max_iter if you get warnings
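Here’s what the class-imbalance fix looks like in practice. class_weight='balanced' reweights each class inversely to its frequency in the training data, so mistakes on the rare class count for more during training:
# Same model as before, but minority-class errors are penalized more
balanced_model = LogisticRegression(
    class_weight='balanced',
    max_iter=1000,
    random_state=42
)
balanced_model.fit(X_train_scaled, y_train)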
A Complete Example
Here’s everything put together:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Create sample data
X, y = make_classification(n_samples=1000, n_features=4, random_state=42)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)
# Make predictions and evaluate
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
And there you have it! You’re now equipped to do logistic regression without the common headaches. Remember: scale your features, split your data properly, and don’t overthink it. Logistic regression is a robust algorithm that works well out of the box for most problems.