Data Prep: Where ML Projects Actually Live or Die

The 80/20 Rule Nobody Talks About

Let’s be real: data preparation is where you’ll spend 80% of your time on a machine learning project, not building models. And if your model performs terribly, it’s probably your data preparation: not the algorithm, not the hyperparameters, your data prep.

This tutorial covers the critical data preparation steps that happen before you even think about training a model. Proper data preparation practices separate successful ML projects from failures. This is Tutorial 2 in our Machine Learning Fundamentals Tutorial Series.

What You’ll Learn

  • Loading data from SQL databases
  • Handling missing values (the right way)
  • Feature scaling and why it matters
  • Creating proper train/test/validation splits
  • Avoiding data leakage (this will bite you)

Part 1: Loading Data from SQL

In the real world, your data lives in databases, not CSV files. Here’s how to load it properly. If you’re working with cloud data warehouses, check out the guide on Snowflake Snowpark for processing data at scale.

import pandas as pd
import sqlalchemy

def load_from_sql(query, connection_string):
    """
    Load data from SQL database.

    Args:
        query: SQL query string
        connection_string: Database connection string

    Returns:
        pandas DataFrame
    """
    engine = sqlalchemy.create_engine(connection_string)
    df = pd.read_sql(query, engine)
    engine.dispose()
    return df

# Example connection strings (replace with your actual credentials)
# PostgreSQL: 'postgresql://user:password@localhost:5432/database'
# MySQL: 'mysql+pymysql://user:password@localhost:3306/database'
# SQLite: 'sqlite:///path/to/database.db'
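
For illustration, here’s what a call might look like; the query, table name, and credentials are placeholders, not something from this tutorial’s dataset:

# Hypothetical usage – the table name and connection string are placeholders
df = load_from_sql(
    "SELECT * FROM customers",
    "postgresql://user:password@localhost:5432/database"
)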

Part 2: Initial Data Exploration

Before doing anything, understand your data. This takes 10 minutes and saves hours later.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Basic info
print("Dataset Shape:", df.shape)
print("nData Types:")
print(df.dtypes)
print("nMissing Values:")
print(df.isnull().sum())
print("nBasic Statistics:")
print(df.describe())

# Visualize missing data
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.title('Missing Data Heatmap')
plt.show()
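
Since the target later in this tutorial is the churned column, it’s also worth checking class balance up front; a quick sketch, assuming that column exists in df:

# Check class balance of the target (assumes the 'churned' column used later)
print(df['churned'].value_counts(normalize=True))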

Part 3: Handling Missing Values

Three strategies:

  1. Drop – Only if less than 5% of data and randomly distributed
  2. Impute with median – For numerical data (robust to outliers)
  3. Impute with mode – For categorical data

Never use mean imputation; it’s sensitive to outliers and skews your distribution.
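
A tiny illustration of why: a single outlier drags the mean far more than the median.

import pandas as pd

# One extreme value distorts the mean but barely moves the median
s = pd.Series([1, 2, 3, 1000])
print(s.mean())    # 251.5 – pulled toward the outlier
print(s.median())  # 2.5 – robust to the outlier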

from sklearn.impute import SimpleImputer

def handle_missing_values(df, strategy='median'):
    """
    Handle missing values in dataset.

    Args:
        df: pandas DataFrame
        strategy: 'median', 'mean', 'mode', or 'drop'

    Returns:
        Cleaned DataFrame
    """
    df_clean = df.copy()

    if strategy == 'drop':
        # Only drop if very few missing values
        missing_pct = df_clean.isnull().sum().sum() / (df_clean.shape[0] * df_clean.shape[1])
        if missing_pct < 0.05:
            df_clean = df_clean.dropna()
            print(f"Dropped {len(df) - len(df_clean)} rows with missing values")
        else:
            print(f"Too much missing data ({missing_pct:.2%}). Use imputation instead.")
            return df_clean

    elif strategy in ['median', 'mean']:
        # Impute numerical columns
        numerical_cols = df_clean.select_dtypes(include=[np.number]).columns
        imputer = SimpleImputer(strategy=strategy)
        df_clean[numerical_cols] = imputer.fit_transform(df_clean[numerical_cols])
        print(f"Imputed numerical columns with {strategy}")

    elif strategy == 'mode':
        # Impute categorical columns with their most frequent value
        categorical_cols = df_clean.select_dtypes(include=['object', 'category']).columns
        imputer = SimpleImputer(strategy='most_frequent')
        df_clean[categorical_cols] = imputer.fit_transform(df_clean[categorical_cols])
        print("Imputed categorical columns with mode")

    return df_clean

# Apply missing value handling
df_clean = handle_missing_values(df, strategy='median')

Part 4: Encoding Categorical Variables

Models need numbers, not text. Two approaches:

  • Label Encoding: For ordinal categories (e.g., low/medium/high) – see the sketch after the code below
  • One-Hot Encoding: For nominal categories (e.g., colors, contract types)

# Identify categorical columns
categorical_cols = ['contract_type', 'internet_service', 'tech_support', 
                   'online_security', 'paperless_billing', 'payment_method']

# One-hot encode
df_encoded = pd.get_dummies(df_clean, columns=categorical_cols, drop_first=True)

print(f"Original columns: {df_clean.shape[1]}")
print(f"After encoding: {df_encoded.shape[1]}")
print(f"New dummy columns created: {df_encoded.shape[1] - df_clean.shape[1]}")

Part 5: Train/Test Split – The Right Way

Critical rules:

  1. Split BEFORE any transformations (to avoid data leakage)
  2. Use stratify for imbalanced datasets
  3. Set random_state for reproducibility
  4. Keep 20% for testing (or up to 30% if your dataset is small)

from sklearn.model_selection import train_test_split

# Separate features and target
X = df_encoded.drop(['customer_id', 'churned'], axis=1)
y = df_encoded['churned']

# Create train/test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y  # Maintains class distribution in both sets
)

print(f"Training set size: {len(X_train)} ({len(X_train)/len(X):.1%})")
print(f"Test set size: {len(X_test)} ({len(X_test)/len(X):.1%})")

print(f"nChurn rate in training set: {y_train.mean():.1%}")
print(f"Churn rate in test set: {y_test.mean():.1%}")
print("nNotice: stratify maintains the same churn rate in both sets!")

Part 6: Feature Scaling

Why scale?

  • Algorithms like logistic regression and SVM are sensitive to feature magnitude
  • Features with larger values can dominate the learning
  • Gradient descent converges faster with scaled features

CRITICAL: Fit scaler on training data only, then transform both train and test!

from sklearn.preprocessing import StandardScaler

def scale_features(X_train, X_test):
    """
    Scale features using StandardScaler.

    CRITICAL: Fit on training data only!

    Args:
        X_train: Training features
        X_test: Test features

    Returns:
        X_train_scaled, X_test_scaled, scaler
    """
    scaler = StandardScaler()

    # Fit on training data only
    X_train_scaled = scaler.fit_transform(X_train)

    # Transform test data using training statistics
    X_test_scaled = scaler.transform(X_test)

    # Convert back to DataFrame to keep column names
    X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

    return X_train_scaled, X_test_scaled, scaler

# Apply scaling
X_train_scaled, X_test_scaled, scaler = scale_features(X_train, X_test)

print("Features scaled successfully!")
print("nNotice: All features now have mean ~0 and std ~1")

Part 7: Data Leakage – What NOT to Do

Data leakage happens when information from the test set influences training. This gives you falsely optimistic results that fall apart in production.

Common mistakes:

  1. Scaling before splitting
  2. Filling missing values using statistics from the entire dataset
  3. Feature selection using the entire dataset
  4. Using future information to predict the past

# WRONG WAY (causes data leakage)
print("WRONG: Scaling before split")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Leakage! Test rows influence the scaling statistics
X_train, X_test = train_test_split(X_scaled)

# Problem: Test set statistics influenced the scaler

# RIGHT WAY
print("RIGHT: Split first, then scale")
X_train, X_test = train_test_split(X)
scaler.fit(X_train)  # Fit on training only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Result: Test set never influences training
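
The same principle covers imputation (mistake #2 above). One way to make leakage hard to commit, if you want to add it, is a scikit-learn Pipeline: it fits every step on the training data you pass to fit, then reuses those statistics on the test set. A minimal sketch for numerical features:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# The pipeline learns imputation and scaling statistics from the training data only,
# then applies those same statistics to the test data
preprocessing = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

X_train_prepared = preprocessing.fit_transform(X_train)
X_test_prepared = preprocessing.transform(X_test)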

Part 8: Save Processed Data

Save your cleaned data so you don’t have to reprocess it every time.

import joblib

# Save processed data
X_train_scaled.to_csv('data/churn_X_train.csv', index=False)
X_test_scaled.to_csv('data/churn_X_test.csv', index=False)
y_train.to_csv('data/churn_y_train.csv', index=False)
y_test.to_csv('data/churn_y_test.csv', index=False)

# Save the scaler for later use in production
joblib.dump(scaler, 'data/scaler.pkl')
print("Scaler saved as scaler.pkl")
print("nYou'll need this scaler to transform new data in production!")

Data Preparation Checklist for Machine Learning

Before training any model, make sure you have:

  1. Explored the data
    • Check shapes, types, distributions
    • Identify missing values
    • Look for outliers
  2. Handled missing values
    • Use median for numerical (robust to outliers)
    • Use mode for categorical
    • Only drop if less than 5% missing
  3. Created useful features
    • Domain knowledge is your friend
    • Simple features often work best (see the sketch after this checklist)
  4. Encoded categorical variables
    • One-hot encoding for nominal categories
    • Label encoding for ordinal categories
  5. Split the data FIRST
    • Before any transformations
    • Use stratify for imbalanced data
    • Set random_state for reproducibility
  6. Scaled features properly
    • Fit scaler on training data only
    • Transform both train and test
    • Save the scaler for production
  7. Avoided data leakage
    • No peeking at test data
    • No using future information
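
Feature creation isn’t demonstrated above, so here is a hedged illustration of checklist item 3, with hypothetical column names (tenure_months and monthly_charges are not columns from the churn data used earlier):

import pandas as pd

# Hypothetical derived features – these column names are illustrative only
example = pd.DataFrame({'tenure_months': [3, 24, 60], 'monthly_charges': [70.0, 55.5, 99.9]})
example['total_spend'] = example['tenure_months'] * example['monthly_charges']
example['is_new_customer'] = (example['tenure_months'] < 6).astype(int)
print(example)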

The Bottom Line

If your model performs poorly:

  1. Check your data preparation process first
  2. Then check for data leakage
  3. Only then worry about the model

Good data preparation beats fancy algorithms every time.

In Tutorial 3, we’ll use this preprocessed data to train and compare four different classification algorithms: Logistic Regression, Decision Trees, Random Forest, and XGBoost. You’ll see how proper data preparation makes the modeling part straightforward.


Tutorial 2 Complete
Next: Tutorial 3 – Classification Models: Pick the Right Tool
Previous: Tutorial 1 – ML Fundamentals: Stop Overthinking, Start Building