Feature engineering in machine learning is where you actually make money. Not the algorithm. Not the hyperparameters. The features you engineer determine whether your ML model succeeds or fails in production.
Simple logistic regression with good features beats gradient boosting with raw data, every single time. I’ve watched data scientists spend three weeks optimizing XGBoost when twenty minutes of feature engineering would have doubled their accuracy.
The problem? Feature engineering is messy, domain-specific, and requires actual thinking. There’s no sklearn.FeatureEngineer() you can import. You have to understand your data and your problem.
In this tutorial, we’re building a customer lifetime value predictor. We’ll start with raw transaction data and engineer features that actually predict customer behavior. You’ll see exactly what works and what doesn’t.
What We’re Building
We’re predicting customer lifetime value (CLV) using transaction history. We’ll engineer features in four categories:
Recency Features – How recently did they buy?
Frequency Features – How often do they buy?
Monetary Features – How much do they spend?
Behavioral Features – What patterns emerge?
By the end, you’ll have a framework for feature engineering that works across different ML problems. If you’re new to machine learning, start with Tutorial 1: ML Fundamentals and Tutorial 2: Data Preparation first.
The Raw Data
Let’s start with what we actually have, which is raw customer transaction data:
-- Sample transactions table
SELECT
customer_id,
transaction_date,
transaction_amount,
product_category,
payment_method
FROM transactions
WHERE transaction_date >= '2024-01-01'
LIMIT 5;
This is typical e-commerce data; each row is one transaction. To predict CLV, we need to aggregate this into customer-level features.
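If you don't have a database handy, a small made-up stand-in with the same columns lets you run the feature-engineering snippets in this tutorial (skip the pd.read_sql call below if you use it):
import pandas as pd
# Hypothetical sample data matching the transactions table above
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'transaction_date': pd.to_datetime([
        '2024-01-05', '2024-03-01', '2024-04-10', '2024-02-15', '2024-02-20'
    ]),
    'transaction_amount': [42.50, 55.00, 61.25, 19.99, 24.99],
    'product_category': ['books', 'books', 'electronics', 'books', 'books'],
    'payment_method': ['card', 'card', 'card', 'paypal', 'card'],
})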
Feature Category 1: Recency Features
Recency is one of the strongest signals of customer value. A customer who bought yesterday is more valuable than one who bought six months ago. Let’s quantify that.
Days Since Last Purchase
import pandas as pd
from datetime import datetime

# Calculate days since last purchase
def create_recency_features(df, reference_date=None):
    if reference_date is None:
        reference_date = df['transaction_date'].max()
    customer_features = df.groupby('customer_id').agg({
        'transaction_date': 'max'
    }).reset_index()
    customer_features['days_since_last_purchase'] = (
        reference_date - customer_features['transaction_date']
    ).dt.days
    return customer_features

# Load your data
df = pd.read_sql("SELECT * FROM transactions", connection)
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
recency_features = create_recency_features(df)
Purchase Acceleration
Is the customer buying more frequently over time? Let’s measure that:
def calculate_purchase_acceleration(df):
    customer_features = df.groupby('customer_id').apply(
        lambda x: calculate_acceleration(x)
    ).reset_index()
    return customer_features

def calculate_acceleration(customer_df):
    # Need at least three purchases to compare two gaps
    if len(customer_df) < 3:
        return pd.Series({'purchase_acceleration': 0})
    dates = customer_df['transaction_date'].sort_values()
    gaps = dates.diff().dt.days.dropna()
    early_gap, recent_gap = gaps.iloc[0], gaps.iloc[-1]
    # Positive value = gaps shrinking = the customer is buying more often
    acceleration = (early_gap - recent_gap) / early_gap if early_gap > 0 else 0
    return pd.Series({'purchase_acceleration': acceleration})
A customer buying every 30 days is good. A customer who went from 60 days to 30 days between purchases is even better, since it means they’re accelerating.
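Here’s a quick sanity check on a single made-up customer whose purchase gaps shrink from 60 to 45 to 30 days, using the function above:
# Hypothetical customer with purchase gaps of 60, 45, and 30 days
toy = pd.DataFrame({
    'customer_id': [1, 1, 1, 1],
    'transaction_date': pd.to_datetime(['2024-01-01', '2024-03-01', '2024-04-15', '2024-05-15'])
})
print(calculate_purchase_acceleration(toy))
# purchase_acceleration = (60 - 30) / 60 = 0.5: the gap between purchases halved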
Feature Category 2: Frequency Features
How often someone buys tells you a lot about their value.
Total Purchase Count
Simple but effective:
def create_frequency_features(df):
    customer_features = df.groupby('customer_id').agg({
        'transaction_date': 'count'
    }).reset_index()
    customer_features.columns = ['customer_id', 'total_purchases']
    return customer_features
Purchases Per Month (Normalized Frequency)
Raw counts are misleading if customers have been active for different lengths of time:
def create_normalized_frequency(df):
    customer_df = df.groupby('customer_id').agg({
        'transaction_date': ['count', 'min', 'max']
    }).reset_index()
    customer_df.columns = ['customer_id', 'total_purchases', 'first_purchase', 'last_purchase']
    # Calculate customer lifespan in months
    customer_df['customer_age_months'] = (
        (customer_df['last_purchase'] - customer_df['first_purchase']).dt.days / 30
    )
    # Avoid division by zero
    customer_df['customer_age_months'] = customer_df['customer_age_months'].clip(lower=1)
    # Purchases per month
    customer_df['purchases_per_month'] = (
        customer_df['total_purchases'] / customer_df['customer_age_months']
    )
    return customer_df
Feature Category 3: Monetary Features
Money talks. Let’s measure it properly.
Total Spend and Average Order Value
def create_monetary_features(df):
    customer_features = df.groupby('customer_id').agg({
        'transaction_amount': ['sum', 'mean', 'median', 'std']
    }).reset_index()
    customer_features.columns = [
        'customer_id',
        'total_spend',
        'avg_order_value',
        'median_order_value',
        'order_value_std'
    ]
    return customer_features
Spending Trend
Is the customer spending more over time?
import numpy as np
def calculate_spending_trend(df):
    customer_features = df.groupby('customer_id').apply(
        lambda x: calculate_trend(x)
    ).reset_index()
    return customer_features

def calculate_trend(customer_df):
    # Need at least two purchases to fit a direction
    if len(customer_df) < 2:
        return pd.Series({'spending_trend': 0})
    ordered = customer_df.sort_values('transaction_date')
    # Slope of order value over the purchase sequence: positive = spending more over time
    slope = np.polyfit(np.arange(len(ordered)), ordered['transaction_amount'], 1)[0]
    return pd.Series({'spending_trend': slope})
Feature Category 4: Behavioral Features
These capture patterns that aren’t obvious from simple aggregations.
Product Category Diversity
Does the customer buy from multiple categories or just one?
def create_diversity_features(df):
    customer_features = df.groupby('customer_id').agg({
        'product_category': lambda x: x.nunique()
    }).reset_index()
    customer_features.columns = ['customer_id', 'category_diversity']
    return customer_features
Payment Method Consistency
Customers who consistently use the same payment method tend to be more engaged:
def create_payment_features(df):
    customer_features = df.groupby('customer_id').agg({
        'payment_method': lambda x: x.value_counts().iloc[0] / len(x) if len(x) > 0 else 0
    }).reset_index()
    customer_features.columns = ['customer_id', 'payment_consistency']
    return customer_features
Purchase Day Patterns
Do they buy on weekends? Weekdays?
def create_temporal_features(df):
    # Work on a copy so we don't add helper columns to the caller's DataFrame
    df = df.copy()
    df['day_of_week'] = df['transaction_date'].dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    customer_features = df.groupby('customer_id').agg({
        'is_weekend': 'mean'
    }).reset_index()
    customer_features.columns = ['customer_id', 'weekend_purchase_ratio']
    return customer_features
Combining All Engineered Features
Now we can finally merge all our engineered features into a single dataset:
def create_all_features(df):
    # Create each feature set
    recency = create_recency_features(df)
    recency_accel = calculate_purchase_acceleration(df)
    frequency = create_frequency_features(df)
    frequency_norm = create_normalized_frequency(df)
    monetary = create_monetary_features(df)
    spending = calculate_spending_trend(df)
    diversity = create_diversity_features(df)
    payment = create_payment_features(df)
    temporal = create_temporal_features(df)
    # Merge everything on customer_id
    features = recency
    features = features.merge(recency_accel, on='customer_id', how='left')
    features = features.merge(frequency, on='customer_id', how='left')
    # frequency_norm repeats total_purchases, so drop the duplicate before merging
    features = features.merge(
        frequency_norm.drop(columns=['total_purchases']), on='customer_id', how='left'
    )
    features = features.merge(monetary, on='customer_id', how='left')
    features = features.merge(spending, on='customer_id', how='left')
    features = features.merge(diversity, on='customer_id', how='left')
    features = features.merge(payment, on='customer_id', how='left')
    features = features.merge(temporal, on='customer_id', how='left')
    # Drop raw date columns; the model only needs the numeric features derived from them
    features = features.drop(columns=['transaction_date', 'first_purchase', 'last_purchase'])
    # Fill NAs (e.g. order_value_std for single-purchase customers)
    features = features.fillna(0)
    return features
# Create the feature set
customer_features = create_all_features(df)
print(customer_features.head())
Training the Machine Learning Model
Now let’s use these engineered features to train a machine learning model that predicts CLV:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
# Assume we have actual CLV values (future 12-month spend)
# You'd calculate this from future transaction data
target = calculate_actual_clv(df) # Your function to calculate CLV
# Merge target with features
model_data = customer_features.merge(target, on='customer_id', how='inner')
# Split features and target
X = model_data.drop(['customer_id', 'clv'], axis=1)
y = model_data['clv']
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: ${mae:.2f}")
print(f"R² Score: {r2:.3f}")
Analyzing Feature Importance
Which engineered features actually matter for predicting customer lifetime value?
import matplotlib.pyplot as plt
# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
# Plot top 10
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
plt.xlabel('Importance')
plt.title('Top 10 Most Important Features')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.savefig('feature_importance.png')
plt.show()
print(feature_importance[:10])
What We’ve Learned About Feature Engineering
Domain knowledge beats fancy techniques
The best features come from understanding your business. No amount of automated feature engineering replaces knowing that “days since last purchase” is critical for retention.
Simple features often win
Recency, frequency, and monetary value beat 90% of complex engineered features. Start there.
Ratios and trends matter more than raw numbers
Total spend is okay. Spending trend is better. It captures direction, not just magnitude.
Normalize by customer tenure
Raw counts are misleading. A customer with 10 purchases in 2 months is very different from 10 purchases in 2 years.
Test feature importance
Don’t guess. Train a model and check which features actually matter. Drop the ones that don’t.
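For example, using the feature_importance table from the previous section (the 1% cutoff is an arbitrary choice, not a rule):
# Keep only features contributing at least 1% of total importance
keep = feature_importance.loc[feature_importance['importance'] >= 0.01, 'feature'].tolist()
model_slim = RandomForestRegressor(n_estimators=100, random_state=42)
model_slim.fit(X_train[keep], y_train)
print(f"R² with {len(keep)} features: {model_slim.score(X_test[keep], y_test):.3f}")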
Common Feature Engineering Mistakes
Mistake 1: Data leakage
Don’t include information from the future. If you’re predicting 12-month CLV, don’t use features calculated from those 12 months.
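In practice that means splitting on a cutoff date: features come only from the observation window, the target only from what happens afterwards. A sketch using the same assumed cutoff as the CLV helper above:
cutoff = pd.Timestamp('2024-07-01')  # assumed cutoff; pick one that fits your data
observation = df[df['transaction_date'] <= cutoff]   # features see only this window
customer_features = create_all_features(observation)
target = calculate_actual_clv(df, cutoff_date=cutoff)  # target sees only the future window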
Mistake 2: Too many features
More features don’t always help. They can cause overfitting and slow training. Start simple, add complexity only if it helps.
Mistake 3: Not handling missing values
Decide how to handle NAs deliberately. Zero-fill? Mean imputation? Separate “missing” category? It matters.
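For example, order_value_std is NaN for customers with a single purchase. Instead of the blanket fillna(0) in create_all_features, you could decide per column (a sketch; pick one option, not both):
# Option A: zero-fill, which treats "no data" as "no variation"
features['order_value_std'] = features['order_value_std'].fillna(0)
# Option B: impute the median and keep an explicit "was missing" flag
# features['order_value_std_missing'] = features['order_value_std'].isna().astype(int)
# features['order_value_std'] = features['order_value_std'].fillna(features['order_value_std'].median())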
Mistake 4: Ignoring feature scaling
If you’re using distance-based models (KNN, SVM), scale your features. Tree-based models don’t care. Learn more about choosing the right machine learning algorithm.
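For instance, if you swapped the random forest for a KNN regressor (purely illustrative, not part of this tutorial’s pipeline), you’d want the scaler inside a pipeline so it’s fit on training data only:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Scaler is fit on the training data only, then applied to test data
knn_model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_model.fit(X_train, y_train)
print(f"KNN R²: {knn_model.score(X_test, y_test):.3f}")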
Mistake 5: Engineering features without checking importance
Just because you can create a feature doesn’t mean you should. Check if it actually helps.
The Bottom Line
Feature engineering in machine learning is where projects succeed or fail. The algorithm matters less than you think. The features you engineer matter more than you think.
Spend your time here. Understand your data. Engineer features that capture real patterns. And test everything to see what actually works.
A simple model with good engineered features beats a complex model with raw data, every time.
