
Feature Engineering in Machine Learning: The Complete Tutorial

Feature engineering in machine learning is where you actually make money. Not the algorithm. Not the hyperparameters. The features you engineer determine whether your ML model succeeds or fails in production.

Simple logistic regression with good features beats gradient boosting with raw data, every single time. I’ve watched data scientists spend three weeks optimizing XGBoost when twenty minutes of feature engineering would have doubled their accuracy.

The problem? Feature engineering is messy, domain-specific, and requires actual thinking. There’s no sklearn.FeatureEngineer() you can import. You have to understand your data and your problem.

In this tutorial, we’re building a customer lifetime value predictor. We’ll start with raw transaction data and engineer features that actually predict customer behavior. You’ll see exactly what works and what doesn’t.

What We’re Building

We’re predicting customer lifetime value (CLV) using transaction history. We’ll engineer features in four categories:

Recency Features – How recently did they buy?
Frequency Features – How often do they buy?
Monetary Features – How much do they spend?
Behavioral Features – What patterns emerge?

By the end, you’ll have a framework for feature engineering that works across different ML problems. If you’re new to machine learning, work through Tutorial 1: ML Fundamentals and Tutorial 2: Data Preparation first.

The Raw Data

Let’s start with what we actually have, which is raw customer transaction data:

-- Sample transactions table
SELECT 
    customer_id,
    transaction_date,
    transaction_amount,
    product_category,
    payment_method
FROM transactions
WHERE transaction_date >= '2024-01-01'
LIMIT 5;

This is typical e-commerce data; each row is one transaction. To predict CLV, we need to aggregate this into customer-level features.

Feature Category 1: Recency Features

Recency matters in machine learning. A customer who bought yesterday is more valuable than one who bought six months ago. Let’s quantify that.

Days Since Last Purchase

import pandas as pd
from datetime import datetime

# Calculate days since last purchase
def create_recency_features(df, reference_date=None):
    if reference_date is None:
        reference_date = df['transaction_date'].max()

    customer_features = df.groupby('customer_id').agg({
        'transaction_date': 'max'
    }).reset_index()

    customer_features['days_since_last_purchase'] = (
        reference_date - customer_features['transaction_date']
    ).dt.days

    # Drop the raw date so only numeric features flow into the model
    return customer_features.drop(columns=['transaction_date'])

# Load your data
df = pd.read_sql("SELECT * FROM transactions", connection)
df['transaction_date'] = pd.to_datetime(df['transaction_date'])

recency_features = create_recency_features(df)

Purchase Acceleration

Is the customer buying more frequently over time? Let’s measure that:

def calculate_purchase_acceleration(df):
    customer_features = df.groupby('customer_id').apply(
        lambda x: calculate_acceleration(x)
    ).reset_index()

    return customer_features

def calculate_acceleration(customer_df):
    # Need at least three purchases to compare earlier vs. recent gaps
    if len(customer_df) < 3:
        return pd.Series({'purchase_acceleration': 0})

    # Compare the average gap between purchases in the earlier half vs. the recent half
    gaps = customer_df['transaction_date'].sort_values().diff().dt.days.dropna()
    mid = len(gaps) // 2
    early_gap, recent_gap = gaps.iloc[:mid].mean(), gaps.iloc[mid:].mean()

    # Positive when gaps are shrinking (the customer is buying more often)
    acceleration = (early_gap - recent_gap) / early_gap if early_gap > 0 else 0

    return pd.Series({'purchase_acceleration': acceleration})

A customer buying every 30 days is good. A customer who went from 60 days to 30 days between purchases is even better, since it means they’re accelerating.
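
As a quick sanity check, here is a toy example (hypothetical dates) of what the acceleration logic above produces for exactly that pattern:

# Toy customer whose gap between purchases shrinks from 60 days to 30 days
toy = pd.DataFrame({
    'customer_id': [1, 1, 1],
    'transaction_date': pd.to_datetime(['2024-01-01', '2024-03-01', '2024-03-31'])
})

print(calculate_acceleration(toy))
# purchase_acceleration = 0.5: the recent gap (30 days) is half the earlier gap (60 days)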

Feature Category 2: Frequency Features

How often someone buys tells you a lot about their value.

Total Purchase Count

Simple but effective:

def create_frequency_features(df):
    customer_features = df.groupby('customer_id').agg({
        'transaction_date': 'count'
    }).reset_index()

    customer_features.columns = ['customer_id', 'total_purchases']

    return customer_features

Purchases Per Month (Normalized Frequency)

Raw counts are misleading if customers have been active for different lengths of time:

def create_normalized_frequency(df):
    customer_df = df.groupby('customer_id').agg({
        'transaction_date': ['count', 'min', 'max']
    }).reset_index()

    customer_df.columns = ['customer_id', 'total_purchases', 'first_purchase', 'last_purchase']

    # Calculate customer lifespan in months
    customer_df['customer_age_months'] = (
        (customer_df['last_purchase'] - customer_df['first_purchase']).dt.days / 30
    )

    # Avoid division by zero
    customer_df['customer_age_months'] = customer_df['customer_age_months'].clip(lower=1)

    # Purchases per month
    customer_df['purchases_per_month'] = (
        customer_df['total_purchases'] / customer_df['customer_age_months']
    )

    # Return only the new columns: total_purchases already comes from
    # create_frequency_features, and the raw dates aren't model features
    return customer_df[['customer_id', 'customer_age_months', 'purchases_per_month']]

Feature Category 3: Monetary Features

Money talks. Let’s measure it properly.

Total Spend and Average Order Value

def create_monetary_features(df):
    customer_features = df.groupby('customer_id').agg({
        'transaction_amount': ['sum', 'mean', 'median', 'std']
    }).reset_index()

    customer_features.columns = [
        'customer_id', 
        'total_spend', 
        'avg_order_value', 
        'median_order_value',
        'order_value_std'
    ]

    return customer_features

Spending Trend

Is the customer spending more over time?

def calculate_spending_trend(df):
    customer_features = df.groupby('customer_id').apply(
        lambda x: calculate_trend(x)
    ).reset_index()

    return customer_features

def calculate_trend(customer_df):
    # Need at least two purchases to compare earlier vs. recent spend
    if len(customer_df) < 2:
        return pd.Series({'spending_trend': 0})

    # Compare average order value in the recent half vs. the earlier half
    amounts = customer_df.sort_values('transaction_date')['transaction_amount']
    mid = len(amounts) // 2
    early_spend, recent_spend = amounts.iloc[:mid].mean(), amounts.iloc[mid:].mean()

    # Positive when recent orders are larger than earlier ones
    trend = (recent_spend - early_spend) / early_spend if early_spend > 0 else 0

    return pd.Series({'spending_trend': trend})

Feature Category 4: Behavioral Features

These capture patterns that aren’t obvious from simple aggregations.

Product Category Diversity

Does the customer buy from multiple categories or just one?

def create_diversity_features(df):
    customer_features = df.groupby('customer_id').agg({
        'product_category': lambda x: x.nunique()
    }).reset_index()

    customer_features.columns = ['customer_id', 'category_diversity']

    return customer_features

Payment Method Consistency

Customers who use the same payment method are more engaged:

def create_payment_features(df):
    customer_features = df.groupby('customer_id').agg({
        'payment_method': lambda x: x.value_counts().iloc[0] / len(x) if len(x) > 0 else 0
    }).reset_index()

    customer_features.columns = ['customer_id', 'payment_consistency']

    return customer_features

Purchase Day Patterns

Do they buy on weekends? Weekdays?

def create_temporal_features(df):
    df = df.copy()  # avoid mutating the caller's DataFrame
    df['day_of_week'] = df['transaction_date'].dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

    customer_features = df.groupby('customer_id').agg({
        'is_weekend': 'mean'
    }).reset_index()

    customer_features.columns = ['customer_id', 'weekend_purchase_ratio']

    return customer_features

Combining All Engineered Features

Now we can finally merge all our engineered features into a single dataset:

def create_all_features(df):
    # Create each feature set
    recency = create_recency_features(df)
    recency_accel = calculate_purchase_acceleration(df)
    frequency = create_frequency_features(df)
    frequency_norm = create_normalized_frequency(df)
    monetary = create_monetary_features(df)
    spending = calculate_spending_trend(df)
    diversity = create_diversity_features(df)
    payment = create_payment_features(df)
    temporal = create_temporal_features(df)

    # Merge everything
    features = recency
    features = features.merge(recency_accel, on='customer_id', how='left')
    features = features.merge(frequency, on='customer_id', how='left')
    features = features.merge(frequency_norm, on='customer_id', how='left')
    features = features.merge(monetary, on='customer_id', how='left')
    features = features.merge(spending, on='customer_id', how='left')
    features = features.merge(diversity, on='customer_id', how='left')
    features = features.merge(payment, on='customer_id', how='left')
    features = features.merge(temporal, on='customer_id', how='left')

    # Fill NAs
    features = features.fillna(0)

    return features

# Create the feature set
customer_features = create_all_features(df)
print(customer_features.head())

Training the Machine Learning Model

Now let’s use these engineered features to train a machine learning model that predicts CLV:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Assume we have actual CLV values (future 12-month spend)
# You'd calculate this from future transaction data
target = calculate_actual_clv(df)  # Your function to calculate CLV

# Merge target with features
model_data = customer_features.merge(target, on='customer_id', how='inner')

# Split features and target
X = model_data.drop(['customer_id', 'clv'], axis=1)
y = model_data['clv']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: ${mae:.2f}")
print(f"R² Score: {r2:.3f}")

Analyzing Feature Importance

Which engineered features actually matter for predicting customer lifetime value?

import matplotlib.pyplot as plt

# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 10
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
plt.xlabel('Importance')
plt.title('Top 10 Most Important Features')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.savefig('feature_importance.png')
plt.show()

print(feature_importance[:10])

What We’ve Learned About Feature Engineering

Domain knowledge beats fancy techniques
The best features come from understanding your business. No amount of automated feature engineering replaces knowing that “days since last purchase” is critical for retention.

Simple features often win
Recency, frequency, and monetary value beat 90% of complex engineered features. Start there.

Ratios and trends matter more than raw numbers
Total spend is okay. Spending trend is better. It captures direction, not just magnitude.

Normalize by customer tenure
Raw counts are misleading. A customer with 10 purchases in 2 months is very different from 10 purchases in 2 years.

Test feature importance
Don’t guess. Train a model and check which features actually matter. Drop the ones that don’t.
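
If you want a mechanical way to do that last step, one option (a sketch reusing the fitted model and the train/test split from earlier) is scikit-learn's SelectFromModel:

from sklearn.feature_selection import SelectFromModel

# Keep only features whose importance is above the median, then retrain
selector = SelectFromModel(model, threshold='median', prefit=True)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

slim_model = RandomForestRegressor(n_estimators=100, random_state=42)
slim_model.fit(X_train_sel, y_train)
print(f"R² with fewer features: {slim_model.score(X_test_sel, y_test):.3f}")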

Common Feature Engineering Mistakes

Mistake 1: Data leakage
Don’t include information from the future. If you’re predicting 12-month CLV, don’t use features calculated from those 12 months.

Mistake 2: Too many features
More features don’t always help. They can cause overfitting and slow training. Start simple, add complexity only if it helps.

Mistake 3: Not handling missing values
Decide how to handle NAs deliberately. Zero-fill? Mean imputation? Separate “missing” category? It matters.
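
If you go the imputation route, here is a minimal sketch with scikit-learn's SimpleImputer, reusing the X_train/X_test split from earlier (which this tutorial has already zero-filled, so treat it as illustrative):

from sklearn.impute import SimpleImputer

# Mean imputation as one deliberate choice; add_indicator keeps a
# "was missing" flag column the model can use as a signal
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)   # fit on training data only
X_test_imp = imputer.transform(X_test)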

Mistake 4: Ignoring feature scaling
If you’re using distance-based models (KNN, SVM), scale your features. Tree-based models don’t care. Learn more about choosing the right machine learning algorithm.
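
A quick sketch of what that looks like if you swapped in a KNN regressor on the same split (a hypothetical alternative to the random forest above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# The pipeline fits the scaler on training data only, so no statistics
# leak from the test set into the scaling step
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_train, y_train)
print(f"KNN R²: {knn.score(X_test, y_test):.3f}")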

Mistake 5: Engineering features without checking importance
Just because you can create a feature doesn’t mean you should. Check if it actually helps.

The Bottom Line

Feature engineering in machine learning is where projects succeed or fail. The algorithm matters less than you think. The features you engineer matter more than you think.

Spend your time here. Understand your data. Engineer features that capture real patterns. And test everything to see what actually works.

A simple model with good engineered features beats a complex model with raw data, every time.

