
ML Fundamentals: Stop Overthinking, Start Building

What Machine Learning Actually Is

Let’s be real: whether you’re trying to predict which customers will cancel or which employees will quit, you’re solving the same problem. It’s called churn prediction, and it’s one of the most valuable ML use cases in business.

This tutorial series teaches practical, beginner-friendly machine learning using churn prediction as our core example. But here's the thing—the exact same code, same models, same techniques work for:

  • Customer churn: Predicting subscription cancellations, customer defection
  • Employee attrition: Predicting turnover, identifying flight risks (HR use case)

Same algorithms. Different features. Massive business impact either way.

I’ll call out both use cases throughout so you can see how to adapt the code for your specific problem. Sound good? Let’s build something.

Supervised vs Unsupervised Learning

Supervised Learning (what we’re doing in this series):

  • You have labeled data: “This customer churned” or “This employee stayed”
  • Algorithm learns from examples
  • Makes predictions on new data
  • Most business problems are supervised learning

Unsupervised Learning (not covering today):

  • No labels: “Here’s data, find patterns”
  • Customer segmentation, anomaly detection
  • Harder to validate, less common in business

We’re focusing on supervised learning because that’s where the money is.

What You’ll Build Today

A spam classifier. Yes, I know—everyone builds spam classifiers in ML tutorials. But here’s why this beginner-friendly approach works:

  1. Simple to understand: Email is spam or not spam (binary classification)
  2. Quick to build: Working model in ~50 lines of code
  3. Teaches core concepts: Training, testing, evaluation
  4. Same logic as churn: Binary classification (churn/don’t churn, quit/stay)

Once you understand this, you can build churn models using the exact same approach.

The Data

We’ll use a simple SMS spam dataset. It has messages labeled as spam or ham (not spam).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample data (in reality, load from CSV or database)
data = {
    'message': [
        'Free entry in 2 a wkly comp to win FA Cup final tickets',
        'Nah I dont think he goes to usf, he lives around here though',
        'WINNER!! As a valued network customer you have been selected',
        'Hey so Im not gonna be able to make it tonight',
        'Congratulations! You have won a $1000 Walmart gift card',
        'Can you pick up some milk on your way home?'
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']
}

df = pd.DataFrame(data)

In a real project you'd have thousands of rows; for learning purposes, six is enough to see how the pieces fit together.

Step 1: Split Your Data

The most important rule in ML: Never test on data you trained on.

Think about it: if you memorize the answers to a practice test, how do you know if you actually learned the material? You don’t. You need a separate test.

# Split 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], 
    df['label'], 
    test_size=0.2, 
    random_state=42  # Makes results reproducible
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

Why 80/20? Convention. Some people use 70/30 or 90/10. As long as your test set is big enough to be meaningful (at least a few hundred samples for most problems), you’re fine.

random_state=42: This makes your split reproducible. Without it, you get different splits each time. 42 is a joke (Hitchhiker’s Guide), but any number works.
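
One refinement worth knowing: for classification problems, train_test_split can preserve the class balance in both splits. A minimal sketch, using the same df as above:

# Stratified split: keeps the spam/ham ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    df['message'],
    df['label'],
    test_size=0.2,
    random_state=42,
    stratify=df['label']  # preserve class proportions in both splits
)

This matters most when one class is rare; more on that under "What Could Go Wrong" below.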

Step 2: Convert Text to Numbers

ML algorithms can’t read. They need numbers.

We use CountVectorizer to convert text into a matrix of word counts:

# Create vectorizer
vectorizer = CountVectorizer()

# Fit on training data and transform
X_train_counts = vectorizer.fit_transform(X_train)

# Transform test data (don't fit again!)
X_test_counts = vectorizer.transform(X_test)

print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Training matrix shape: {X_train_counts.shape}")

What just happened?

  • CountVectorizer found every unique word in the training data
  • Created a vocabulary (dictionary of words)
  • Converted each message to a vector of word counts

Critical mistake to avoid: Don’t fit the vectorizer on test data! Only fit on training data, then transform test data. Otherwise you’re “leaking” information from your test set.
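
To make this concrete, here's a tiny standalone sketch (two toy messages, not the dataset above) showing the vocabulary and count matrix CountVectorizer produces:

from sklearn.feature_extraction.text import CountVectorizer

toy = ['free prize now', 'call me now']
vec = CountVectorizer()
counts = vec.fit_transform(toy)

print(vec.get_feature_names_out())  # ['call' 'free' 'me' 'now' 'prize']
print(counts.toarray())
# [[0 1 0 1 1]   <- 'free prize now'
#  [1 0 1 1 0]]  <- 'call me now'

Each row is a message, each column is a vocabulary word, and each cell counts how often that word appears.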

Step 3: Train a Model

Now the fun part. We'll use Naive Bayes, a probabilistic algorithm that's a classic choice for text classification.

# Create and train model
model = MultinomialNB()
model.fit(X_train_counts, y_train)

print("Model trained!")

That’s it. Three lines. The model learned patterns from your training data.

What Naive Bayes does:

  • Calculates word frequency in spam vs ham messages
  • Uses Bayes’ theorem to compute probabilities
  • Makes predictions based on which words appear

Why Naive Bayes?

  • Fast to train
  • Works well with text
  • Good baseline for classification
  • Easy to interpret
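
Naive Bayes also gives you probabilities, not just labels, which is handy when you want a confidence score. A quick sketch, reusing the model and vectorizer from above (the sample message is made up):

# Inspect the model's probability estimates for a new message
sample = vectorizer.transform(['You have won a free prize, call now'])
for label, p in zip(model.classes_, model.predict_proba(sample)[0]):
    print(f"P({label}) = {p:.3f}")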

Want to compare different classification algorithms? Check out our guide on picking the right classification model.

Step 4: Make Predictions

Let’s see how well it works:

# Predict on test set
y_pred = model.predict(X_test_counts)

# Look at some predictions
for message, actual, predicted in zip(X_test[:5], y_test[:5], y_pred[:5]):
    print(f"nMessage: {message}")
    print(f"Actual: {actual}, Predicted: {predicted}")

Step 5: Evaluate Performance

How good is your model? Let’s check:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"nAccuracy: {accuracy:.2%}")

# Detailed metrics
print("nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

What these metrics mean:

Accuracy: Percentage of correct predictions

  • 95% accuracy = 95 correct out of 100
  • Sounds great, but can be misleading (we’ll cover why in Tutorial 5)

Precision: Of all messages you predicted as spam, how many were actually spam?

  • High precision = few false alarms

Recall: Of all actual spam messages, how many did you catch?

  • High recall = you’re catching most spam

F1-Score: Balance between precision and recall

  • Good overall metric when classes are balanced
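
To see how these relate, here's a tiny worked example on six hypothetical labels (not the model's actual output):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']
y_pred = ['spam', 'spam', 'ham', 'ham', 'ham', 'spam']

# Treat 'spam' as the positive class
print(precision_score(y_true, y_pred, pos_label='spam'))  # 0.667: 2 of 3 spam predictions were right
print(recall_score(y_true, y_pred, pos_label='spam'))     # 0.667: caught 2 of 3 actual spam
print(f1_score(y_true, y_pred, pos_label='spam'))         # 0.667: harmonic mean of the two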

Complete Code

Here’s everything together:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load data
data = {
    'message': [
        'Free entry in 2 a wkly comp to win FA Cup final tickets',
        'Nah I dont think he goes to usf, he lives around here though',
        'WINNER!! As a valued network customer you have been selected',
        'Hey so Im not gonna be able to make it tonight',
        'Congratulations! You have won a $1000 Walmart gift card',
        'Can you pick up some milk on your way home?',
        'Call now to claim your prize',
        'Are we still meeting for lunch tomorrow?',
        'Click here for your free iPhone',
        'Did you finish that report?'
    ] * 100,  # Repeat to get more samples (caveat: exact duplicates land in both train and test, which inflates the accuracy below)
    'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham'] * 100
}

df = pd.DataFrame(data)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], 
    df['label'], 
    test_size=0.2, 
    random_state=42
)

# Vectorize text
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Train model
model = MultinomialNB()
model.fit(X_train_counts, y_train)

# Make predictions
y_pred = model.predict(X_test_counts)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("nClassification Report:")
print(classification_report(y_test, y_pred))
print("nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

How This Applies to Churn Prediction

Everything you just learned transfers directly to real business problems:

Customer Churn:

# Instead of text features, you'd have numerical/categorical features
X_train, X_test, y_train, y_test = train_test_split(
    df[['tenure_months', 'monthly_spend', 'support_tickets', 'login_frequency']], 
    df['churned'],  # 0 = stayed, 1 = left
    test_size=0.2, 
    random_state=42
)

# Different model: MultinomialNB expects word counts, while logistic regression handles general numeric features
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

# Same evaluation approach
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")

Employee Attrition:

# Same structure, different features
X_train, X_test, y_train, y_test = train_test_split(
    df[['tenure_months', 'salary_vs_market', 'years_since_promotion', 'performance_score']], 
    df['left_company'],  # 0 = stayed, 1 = quit
    test_size=0.2, 
    random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Same workflow. Different data. That’s the beauty of ML—once you learn the pattern, you can apply it everywhere.

What Could Go Wrong

Problem: “My model has 99% accuracy but doesn’t work in production”
Cause: Imbalanced classes. If 99% of customers don’t churn, predicting “no churn” for everyone gives 99% accuracy but is useless.
Fix: Use precision, recall, and F1-score instead of accuracy (Tutorial 5).
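
Here's a short demonstration of that trap on made-up labels where only 1 customer in 100 churns:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 99 + [1])    # 99 stayed, 1 churned
y_pred = np.zeros(100, dtype=int)    # lazy model: predict 'no churn' for everyone

print(accuracy_score(y_true, y_pred))             # 0.99, looks impressive
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0, catches zero churners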

Problem: “Model performs great on training data, terrible on test data”
Cause: Overfitting. Model memorized training data instead of learning patterns.
Fix: More data, simpler model, or regularization (Tutorial 7).
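
A quick way to spot overfitting: compare training accuracy to test accuracy (a sketch reusing the spam model from earlier; score() reports accuracy for classifiers):

# A large train-test gap is the classic overfitting signature
print(f"Train accuracy: {model.score(X_train_counts, y_train):.2%}")
print(f"Test accuracy: {model.score(X_test_counts, y_test):.2%}")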

Problem: “Training takes forever”
Cause: Too many features or too much data for your algorithm.
Fix: Use a faster algorithm, reduce features, or sample your data.
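
If sampling is your quick fix, pandas makes it one line (a sketch; assumes a large df like a real churn table):

# Iterate on a 10% random sample, then scale back up once the pipeline works
df_small = df.sample(frac=0.1, random_state=42)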

Next Steps

You just built your first ML model. Here’s what’s coming in the rest of the series:

  • Tutorial 2: Data preparation—the 80% of ML work nobody likes
  • Tutorial 3: Classification models—when to use what
  • Tutorial 4: Regression models—predicting numbers instead of categories
  • Tutorial 5: Model evaluation—metrics that actually matter
  • Tutorial 6: Feature engineering—turning raw data into predictive features
  • Tutorial 7: Hyperparameter tuning—making models actually work
  • Tutorial 8: Production deployment—getting models into Streamlit apps

The Bottom Line

Machine learning isn’t magic. It’s pattern recognition with training wheels.

You don’t need a PhD in statistics. You don’t need to understand backpropagation. You need to:

  1. Split your data (train/test)
  2. Convert it to the right format
  3. Train a model
  4. Evaluate performance
  5. Iterate

Everything else is refinement.

In the next tutorial, we’ll tackle the part that actually takes 80% of your time: cleaning and preparing your data. Because no matter how good your model is, garbage in = garbage out.

