Feature Engineering Best Practices for ML Models: From Raw Data to Predictive Power
The Most Underrated Skill in Machine Learning
Andrew Ng famously said: “Applied machine learning is basically feature engineering.” Despite the rise of deep learning and automated ML, feature engineering remains the secret weapon of data scientists who consistently build high-performing models.
Good features can make a simple model outperform a complex one. Bad features will cripple even the most sophisticated algorithms. This guide distills years of practical experience into actionable patterns you can apply immediately.
Why Feature Engineering Matters
The Cold, Hard Truth
Consider these real-world scenarios:
Scenario 1: E-commerce Conversion Prediction
- Raw features only: 72% accuracy
- With engineered features: 89% accuracy
- Same model, same data: a 17-percentage-point improvement
Scenario 2: Credit Risk Modeling
- Basic features: AUC 0.78
- Engineered features: AUC 0.91
- Made the difference between rejected and deployed model
Scenario 3: Time Series Forecasting
- Without temporal features: RMSE 45.2
- With temporal features: RMSE 18.7
- 58% reduction in error
What Makes a Good Feature?
Great features share these characteristics (a quick code check follows the list):
- Informative: Strong correlation with target variable
- Independent: Low correlation with other features (reduces redundancy)
- Simple: Easy to compute and explain
- Robust: Stable across different data distributions
- Generalizable: Works well on unseen data
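A minimal sketch of how to check the first two properties empirically, using the same breast-cancer data that the workflow section below loads (the thresholds and choice of dataset are only illustrative):
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Example data: the breast-cancer set also used in the workflow below
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Informative: absolute correlation with the target
corr_with_target = df.corr()['target'].drop('target').abs().sort_values(ascending=False)
print(corr_with_target.head(5))

# Independent: highly correlated feature pairs signal redundancy
feature_corr = df.drop(columns='target').corr().abs()
upper = feature_corr.where(np.triu(np.ones(feature_corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(5))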
The Feature Engineering Workflow
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif
import warnings
warnings.filterwarnings('ignore')
# Load example dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
# Our feature engineering pipeline:
# 1. Understand → 2. Create → 3. Transform → 4. Select → 5. Validate
Let’s explore each step with real examples.
1. Understanding Your Data
Before engineering features, understand what you have:
def analyze_data(df, target_col='target'):
    """
    Comprehensive data analysis
    """
    print("=" * 60)
    print("DATASET OVERVIEW")
    print("=" * 60)
    print(f"Shape: {df.shape}")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

    print("\n" + "=" * 60)
    print("DATA TYPES")
    print("=" * 60)
    print(df.dtypes.value_counts())

    print("\n" + "=" * 60)
    print("MISSING VALUES")
    print("=" * 60)
    missing = df.isnull().sum()
    missing_pct = 100 * missing / len(df)
    missing_table = pd.DataFrame({
        'Missing': missing,
        'Percent': missing_pct
    })
    print(missing_table[missing_table['Missing'] > 0].sort_values('Percent', ascending=False))

    print("\n" + "=" * 60)
    print("STATISTICAL SUMMARY")
    print("=" * 60)
    print(df.describe())

    # Feature-target correlations
    if target_col in df.columns:
        print("\n" + "=" * 60)
        print("TOP CORRELATIONS WITH TARGET")
        print("=" * 60)
        correlations = df.corr()[target_col].sort_values(ascending=False)
        print(correlations[1:11])  # Top 10, excluding the target itself

analyze_data(df)
Key Insights to Extract (a short sketch for a few of these follows the list):
- Distribution shapes (normal, skewed, multimodal)
- Outliers and anomalies
- Missing patterns (random vs. systematic)
- Correlation structure
- Class imbalance
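A minimal sketch for a few of these checks on the same df loaded above (cutoffs are illustrative):
# Skewness: large absolute values suggest a log or sqrt transform may help
numeric_cols = df.select_dtypes(include=[np.number]).columns.drop('target')
print(df[numeric_cols].skew().abs().sort_values(ascending=False).head(5))

# Outliers: simple IQR rule, counted per feature
q1, q3 = df[numeric_cols].quantile(0.25), df[numeric_cols].quantile(0.75)
iqr = q3 - q1
outliers = ((df[numeric_cols] < q1 - 1.5 * iqr) | (df[numeric_cols] > q3 + 1.5 * iqr)).sum()
print(outliers.sort_values(ascending=False).head(5))

# Class imbalance
print(df['target'].value_counts(normalize=True))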
2. Creating New Features
A. Domain-Inspired Features
The most powerful features come from domain knowledge:
# Example: E-commerce transaction data
def create_domain_features(transactions_df):
"""
Create business-relevant features
"""
df = transactions_df.copy()
# Recency, Frequency, Monetary (RFM) features
df['days_since_last_purchase'] = (
pd.Timestamp.now() - df['last_purchase_date']
).dt.days
df['purchase_frequency'] = df['total_purchases'] / df['customer_age_days']
df['avg_order_value'] = df['total_spent'] / df['total_purchases']
# Behavioral patterns
df['weekend_shopper'] = df['weekend_purchases'] / df['total_purchases']
df['discount_sensitivity'] = df['discounted_purchases'] / df['total_purchases']
# Customer lifetime value estimate
df['estimated_clv'] = (
df['avg_order_value'] *
df['purchase_frequency'] *
365 # Annualized
)
# Engagement metrics
df['cart_abandonment_rate'] = (
df['carts_created'] - df['purchases']
) / df['carts_created']
return df
B. Mathematical Transformations
Transform features to better match model assumptions:
def apply_mathematical_transforms(df, numeric_cols):
    """
    Apply common mathematical transformations
    """
    transformed = df.copy()
    for col in numeric_cols:
        # Skip if column has zeros or negatives (for log/sqrt)
        if (transformed[col] <= 0).any():
            continue

        # Log transformation (reduces right skew)
        transformed[f'{col}_log'] = np.log1p(transformed[col])

        # Square root (mild skew reduction)
        transformed[f'{col}_sqrt'] = np.sqrt(transformed[col])

        # Square (emphasize larger values)
        transformed[f'{col}_squared'] = transformed[col] ** 2

        # Cube root (handles negative values)
        transformed[f'{col}_cbrt'] = np.cbrt(transformed[col])

        # Reciprocal (inverse relationship)
        if (transformed[col] == 0).any():
            transformed[f'{col}_reciprocal'] = 1 / (transformed[col] + 1e-6)
        else:
            transformed[f'{col}_reciprocal'] = 1 / transformed[col]

    return transformed
# Example: apply to the first few numeric features
# (in practice, target the columns with high skewness)
example_features = df.select_dtypes(include=[np.number]).columns[:5]
df_transformed = apply_mathematical_transforms(df, example_features)
C. Interaction Features
Capture relationships between features:
def create_interactions(df, feature_pairs):
    """
    Create interaction features
    """
    interactions = df.copy()
    for feat1, feat2 in feature_pairs:
        # Multiplicative interaction
        interactions[f'{feat1}_x_{feat2}'] = (
            interactions[feat1] * interactions[feat2]
        )
        # Ratio (division)
        interactions[f'{feat1}_div_{feat2}'] = (
            interactions[feat1] / (interactions[feat2] + 1e-6)
        )
        # Difference
        interactions[f'{feat1}_minus_{feat2}'] = (
            interactions[feat1] - interactions[feat2]
        )
        # Sum
        interactions[f'{feat1}_plus_{feat2}'] = (
            interactions[feat1] + interactions[feat2]
        )
    return interactions

# Example: Medical diagnosis features
feature_pairs = [
    ('mean radius', 'mean texture'),
    ('mean area', 'mean smoothness'),
    ('worst radius', 'worst concavity')
]
df_interactions = create_interactions(df, feature_pairs)
D. Polynomial Features
Systematically create polynomial combinations:
from sklearn.preprocessing import PolynomialFeatures
def create_polynomial_features(df, numeric_cols, degree=2):
    """
    Generate polynomial features
    """
    poly = PolynomialFeatures(
        degree=degree,
        include_bias=False,
        interaction_only=False  # Set True for interactions only
    )
    # Apply to numeric columns
    X_poly = poly.fit_transform(df[numeric_cols])

    # Create feature names
    feature_names = poly.get_feature_names_out(numeric_cols)

    # Create DataFrame
    df_poly = pd.DataFrame(X_poly, columns=feature_names, index=df.index)
    return df_poly

# Warning: Can create many features quickly!
# Use with selected features, not all
important_features = ['mean radius', 'mean texture', 'mean area']
df_poly = create_polynomial_features(df, important_features, degree=2)
print(f"Created {df_poly.shape[1]} polynomial features from {len(important_features)} base features")
3. Temporal Feature Engineering
Time is a goldmine for features:
def create_temporal_features(df, date_col):
    """
    Extract comprehensive temporal features
    """
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col])

    # Basic temporal components
    df['year'] = df[date_col].dt.year
    df['month'] = df[date_col].dt.month
    df['day'] = df[date_col].dt.day
    df['dayofweek'] = df[date_col].dt.dayofweek
    df['dayofyear'] = df[date_col].dt.dayofyear
    df['quarter'] = df[date_col].dt.quarter
    df['week'] = df[date_col].dt.isocalendar().week

    # Time-based patterns
    df['is_weekend'] = (df['dayofweek'] >= 5).astype(int)
    df['is_month_start'] = df[date_col].dt.is_month_start.astype(int)
    df['is_month_end'] = df[date_col].dt.is_month_end.astype(int)
    df['is_quarter_start'] = df[date_col].dt.is_quarter_start.astype(int)
    df['is_quarter_end'] = df[date_col].dt.is_quarter_end.astype(int)

    # Cyclical encoding (preserves circular nature of time)
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    df['day_sin'] = np.sin(2 * np.pi * df['day'] / 31)
    df['day_cos'] = np.cos(2 * np.pi * df['day'] / 31)
    df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
    df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

    # Hour-based (if datetime includes time)
    if df[date_col].dt.hour.max() > 0:
        df['hour'] = df[date_col].dt.hour
        df['is_business_hours'] = (
            (df['hour'] >= 9) & (df['hour'] <= 17)
        ).astype(int)
        df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
        df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

    return df
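A small usage example on a made-up order table (the order_date and amount columns are purely illustrative):
# Hypothetical data: five weekly orders
events = pd.DataFrame({
    'order_date': pd.date_range('2024-01-01', periods=5, freq='7D'),
    'amount': [120.0, 85.5, 42.0, 230.0, 19.9]
})
events_feat = create_temporal_features(events, 'order_date')
print(events_feat[['order_date', 'dayofweek', 'is_weekend', 'month_sin', 'month_cos']])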
Why Cyclical Encoding?
December (12) and January (1) are adjacent months, but numerically they’re far apart. Cyclical encoding preserves this relationship:
# Bad: Linear encoding
months = [1, 2, 11, 12]
# Distance between Dec and Jan: |12 - 1| = 11 (wrong!)
# Good: Cyclical encoding
month_sin = np.sin(2 * np.pi * np.array(months) / 12)
month_cos = np.cos(2 * np.pi * np.array(months) / 12)
# Now Dec and Jan are close in 2D space
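A quick check makes the claim concrete: in (sin, cos) space, the December-to-January distance is the same as for any other pair of adjacent months.
# Distance between months under each encoding
def month_to_xy(m):
    return np.array([np.sin(2 * np.pi * m / 12), np.cos(2 * np.pi * m / 12)])

print(abs(12 - 1))                                                  # linear: 11
print(round(np.linalg.norm(month_to_xy(12) - month_to_xy(1)), 2))   # cyclical: ~0.52
print(round(np.linalg.norm(month_to_xy(6) - month_to_xy(7)), 2))    # Jun-Jul: also ~0.52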
4. Categorical Feature Engineering
A. Basic Encoding
def encode_categorical_features(df, categorical_cols, method='label'):
    """
    Encode categorical variables
    """
    df_encoded = df.copy()
    if method == 'label':
        # Label Encoding: Good for ordinal features
        for col in categorical_cols:
            le = LabelEncoder()
            df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
    elif method == 'onehot':
        # One-Hot Encoding: Good for nominal features with few categories
        df_encoded = pd.get_dummies(
            df_encoded,
            columns=categorical_cols,
            drop_first=True  # Avoid multicollinearity
        )
    elif method == 'frequency':
        # Frequency Encoding: Replace with occurrence count
        for col in categorical_cols:
            freq_map = df_encoded[col].value_counts().to_dict()
            df_encoded[f'{col}_freq'] = df_encoded[col].map(freq_map)
    return df_encoded
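A quick usage example on a small made-up frame (the color and size columns are illustrative):
cats = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                     'size': ['S', 'M', 'L', 'M']})
# One-hot for nominal columns, frequency encoding as an alternative
print(encode_categorical_features(cats, ['color', 'size'], method='onehot'))
print(encode_categorical_features(cats, ['color'], method='frequency'))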
B. Target Encoding
Powerful but dangerous (can leak target information):
def target_encode(df, categorical_col, target_col, smoothing=10):
    """
    Target encoding with smoothing to prevent overfitting
    """
    # Calculate the global mean
    global_mean = df[target_col].mean()

    # Calculate category statistics
    agg = df.groupby(categorical_col)[target_col].agg(['mean', 'count'])

    # Smoothing: blend the category mean with the global mean
    # More counts = more weight to the category mean
    smoothed_mean = (
        agg['mean'] * agg['count'] + global_mean * smoothing
    ) / (agg['count'] + smoothing)

    # Map back to the original dataframe
    df[f'{categorical_col}_target_encoded'] = df[categorical_col].map(smoothed_mean)
    return df

# Critical: Only fit on training data, apply to validation/test
# Otherwise you're leaking target information!
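A minimal sketch of that leakage-safe pattern (the city and churned columns are hypothetical): learn the mapping on the training fold only, then apply it to validation/test, letting unseen categories fall back to the global mean.
def fit_target_encoding(train_df, categorical_col, target_col, smoothing=10):
    # Learn the smoothed per-category means from training data only
    global_mean = train_df[target_col].mean()
    agg = train_df.groupby(categorical_col)[target_col].agg(['mean', 'count'])
    mapping = (agg['mean'] * agg['count'] + global_mean * smoothing) / (agg['count'] + smoothing)
    return mapping, global_mean

def apply_target_encoding(df, categorical_col, mapping, global_mean):
    # Categories unseen during training fall back to the global mean
    return df[categorical_col].map(mapping).fillna(global_mean)

# mapping, global_mean = fit_target_encoding(train, 'city', 'churned')
# train['city_te'] = apply_target_encoding(train, 'city', mapping, global_mean)
# test['city_te'] = apply_target_encoding(test, 'city', mapping, global_mean)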
C. High-Cardinality Categories
When you have many categories:
import hashlib

def handle_high_cardinality(df, col, method='frequency', top_n=10):
    """
    Handle categorical features with many unique values
    """
    df_handled = df.copy()
    if method == 'top_n':
        # Keep the top N categories, group the rest as 'Other'
        top_categories = df[col].value_counts().nlargest(top_n).index
        df_handled[col] = df_handled[col].apply(
            lambda x: x if x in top_categories else 'Other'
        )
    elif method == 'frequency':
        # Replace with frequency
        freq_map = df[col].value_counts().to_dict()
        df_handled[f'{col}_freq'] = df_handled[col].map(freq_map)
    elif method == 'hash':
        # Hash to a fixed number of buckets
        # (hashlib is stable across runs, unlike Python's salted built-in hash)
        df_handled[f'{col}_hash'] = df_handled[col].apply(
            lambda x: int(hashlib.md5(str(x).encode()).hexdigest(), 16) % 100  # 100 buckets
        )
    return df_handled
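A short usage example with made-up zip codes:
zips = pd.DataFrame({'zip_code': ['10001', '94103', '10001', '60601', '73301']})
print(handle_high_cardinality(zips, 'zip_code', method='top_n', top_n=2))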
5. Handling Missing Values
Missing values are often informative in their own right, so capture the fact of missingness as well as filling it:
def handle_missing_values(df, strategy='comprehensive'):
    """
    Sophisticated missing value handling
    """
    df_filled = df.copy()
    for col in df.columns:
        if df[col].isnull().sum() > 0:
            # Create a missingness indicator
            df_filled[f'{col}_is_missing'] = df[col].isnull().astype(int)
            if df[col].dtype in [np.float64, np.int64]:
                if strategy == 'simple':
                    # Simple: use the median
                    df_filled[col] = df[col].fillna(df[col].median())
                elif strategy == 'comprehensive':
                    # Better: keep several candidate fills as separate columns
                    df_filled[f'{col}_filled_median'] = df[col].fillna(df[col].median())
                    df_filled[f'{col}_filled_mean'] = df[col].fillna(df[col].mean())
                    df_filled[f'{col}_filled_mode'] = df[col].fillna(df[col].mode()[0])
                    # Forward fill and backward fill for time series
                    if 'date' in col.lower() or 'time' in col.lower():
                        df_filled[f'{col}_filled_ffill'] = df[col].ffill()
                        df_filled[f'{col}_filled_bfill'] = df[col].bfill()
            else:
                # Categorical: use the mode or create a 'Missing' category
                fill_value = df[col].mode()[0] if len(df[col].mode()) > 0 else 'Missing'
                df_filled[col] = df[col].fillna(fill_value)
    return df_filled
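A small usage example with deliberately missing values (the income and segment columns are made up):
messy = pd.DataFrame({
    'income': [52000.0, np.nan, 61000.0, np.nan],
    'segment': ['A', None, 'B', 'A']
})
print(handle_missing_values(messy, strategy='simple'))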
6. Scaling and Normalization
Many models, especially linear models and neural networks, need features on similar scales (tree-based models are largely insensitive to scaling):
def scale_features(df, numeric_cols, method='standard'):
    """
    Scale numerical features
    """
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

    df_scaled = df.copy()
    if method == 'standard':
        # Standardization: mean=0, std=1
        # Good for: Most ML algorithms, especially linear models
        scaler = StandardScaler()
    elif method == 'minmax':
        # Min-Max: scale to [0, 1]
        # Good for: Neural networks, algorithms sensitive to scale
        scaler = MinMaxScaler()
    elif method == 'robust':
        # Robust: median and IQR (resistant to outliers)
        # Good for: Data with outliers
        scaler = RobustScaler()

    df_scaled[numeric_cols] = scaler.fit_transform(df_scaled[numeric_cols])
    return df_scaled, scaler
# Example: scale every numeric column except the target
numeric_features = df.select_dtypes(include=[np.number]).columns.drop('target').tolist()
df_scaled, scaler = scale_features(df, numeric_features, method='standard')
7. Feature Selection
More features isn’t always better:
def select_features(X, y, method='statistical', k=10):
    """
    Select most important features
    """
    if method == 'statistical':
        # Statistical test (ANOVA F-value)
        selector = SelectKBest(score_func=f_classif, k=k)
        X_selected = selector.fit_transform(X, y)
        selected_features = X.columns[selector.get_support()].tolist()
    elif method == 'model_based':
        # Use model feature importance
        from sklearn.ensemble import RandomForestClassifier
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X, y)

        # Get feature importance
        importances = pd.DataFrame({
            'feature': X.columns,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=False)
        selected_features = importances.head(k)['feature'].tolist()
        X_selected = X[selected_features]
    elif method == 'correlation':
        # Remove highly correlated features
        corr_matrix = X.corr().abs()
        upper_triangle = corr_matrix.where(
            np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
        )
        # Find features with correlation > 0.95
        to_drop = [col for col in upper_triangle.columns
                   if any(upper_triangle[col] > 0.95)]
        selected_features = [col for col in X.columns if col not in to_drop]
        X_selected = X[selected_features]
    return X_selected, selected_features

# Example
X = df_scaled.drop('target', axis=1)
y = df_scaled['target']
X_selected, selected_features = select_features(X, y, method='model_based', k=15)
print(f"Selected features: {selected_features}")
8. Validation and Testing
Always validate your features:
def validate_features(X_train, X_test, y_train, y_test):
    """
    Validate feature engineering with cross-validation
    """
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=100, random_state=42)

    # Cross-validation on the training set
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print("Cross-Validation Results:")
    print(f"  Mean Accuracy: {cv_scores.mean():.4f}")
    print(f"  Std Deviation: {cv_scores.std():.4f}")

    # Test set performance
    model.fit(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"\nTest Set Accuracy: {test_score:.4f}")

    # Check for overfitting
    train_score = model.score(X_train, y_train)
    print(f"Train Set Accuracy: {train_score:.4f}")
    if train_score - test_score > 0.1:
        print("\n⚠️ Warning: Possible overfitting detected!")
    else:
        print("\n✓ Good generalization")

    return cv_scores, test_score
Complete Pipeline Example
Putting it all together:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def create_feature_engineering_pipeline(numeric_features, categorical_features):
    """
    Create a complete feature engineering pipeline
    """
    # Numeric features pipeline
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    # Categorical features pipeline
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])

    # Combine transformers
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])
    return preprocessor

# Full workflow
def full_feature_engineering_workflow(df, target_col):
    """
    Complete feature engineering workflow
    """
    # 1. Separate features and target
    X = df.drop(target_col, axis=1)
    y = df[target_col]

    # 2. Identify feature types
    numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
    categorical_features = X.select_dtypes(include=['object']).columns.tolist()

    # 3. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # 4. Create and apply the pipeline
    preprocessor = create_feature_engineering_pipeline(
        numeric_features, categorical_features
    )
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)

    # 5. Validate
    cv_scores, test_score = validate_features(
        X_train_processed, X_test_processed, y_train, y_test
    )
    return X_train_processed, X_test_processed, y_train, y_test, preprocessor

# Execute
X_train, X_test, y_train, y_test, preprocessor = full_feature_engineering_workflow(
    df, 'target'
)
Anti-Patterns to Avoid
1. Target Leakage
# ❌ WRONG: Using future information
df['next_month_sales'] = df['sales'].shift(-1) # Leaks future data!
# ✓ CORRECT: Only use past information
df['prev_month_sales'] = df['sales'].shift(1)
2. Fitting on Entire Dataset
# ❌ WRONG: Fit scaler on all data
scaler = StandardScaler().fit(df[numeric_cols])
df_scaled = scaler.transform(df[numeric_cols])
train, test = train_test_split(df_scaled)
# ✓ CORRECT: Fit only on training data
train, test = train_test_split(df)
scaler = StandardScaler().fit(train[numeric_cols])
train_scaled = scaler.transform(train[numeric_cols])
test_scaled = scaler.transform(test[numeric_cols])
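An alternative sketch: wrap the scaler and model in a Pipeline so cross-validation refits the scaler on each training fold automatically, which rules out this leak inside CV (reusing the X and y from the feature-selection example above):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

leak_free = Pipeline([
    ('scaler', StandardScaler()),        # fit on each CV training fold only
    ('clf', LogisticRegression(max_iter=1000))
])
print(cross_val_score(leak_free, X, y, cv=5).mean())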
3. Creating Too Many Features
# ❌ WRONG: Creating thousands of features blindly
for i in numeric_cols:
    for j in numeric_cols:
        df[f'{i}_x_{j}'] = df[i] * df[j]  # Combinatorial explosion!

# ✓ CORRECT: Be selective based on domain knowledge
important_interactions = [
    ('feature_a', 'feature_b'),  # Known to interact
    ('feature_c', 'feature_d')   # Business logic suggests an interaction
]
Conclusion
Feature engineering is where data science becomes an art. The techniques in this guide provide a solid foundation, but the real magic happens when you combine them with deep domain knowledge and iterative experimentation.
Key Takeaways:
- Understand first: Explore data thoroughly before engineering
- Domain knowledge: Best features come from understanding the problem
- Iterate: Try multiple approaches, measure, refine
- Validate rigorously: Avoid leakage, use proper train/test splits
- Keep it simple: Start with simple features, add complexity only if needed
- Document: Keep track of what works and why
The difference between a mediocre model and an exceptional one often lies not in the algorithm choice, but in the quality of features you feed it. Master feature engineering, and you’ll master machine learning.
Resources
- Books: “Feature Engineering for Machine Learning” by Alice Zheng & Amanda Casari
- Kaggle: Study feature engineering in winning solutions
- Papers: “Feature Engineering and Selection: A Practical Approach” (Kuhn & Johnson)
- Tools: Featuretools for automated feature engineering
Remember: Models are only as good as their features. Engineer wisely.