
Feature Engineering Best Practices for ML Models: From Raw Data to Predictive Power

Ryan Dahlberg
October 22, 2025 · 12 min read

The Most Underrated Skill in Machine Learning

Andrew Ng famously said: “Applied machine learning is basically feature engineering.” Despite the rise of deep learning and automated ML, feature engineering remains the secret weapon of data scientists who consistently build high-performing models.

Good features can make a simple model outperform a complex one. Bad features will cripple even the most sophisticated algorithms. This guide distills years of practical experience into actionable patterns you can apply immediately.


Why Feature Engineering Matters

The Cold, Hard Truth

Consider these real-world scenarios:

Scenario 1: E-commerce Conversion Prediction

  • Raw features only: 72% accuracy
  • With engineered features: 89% accuracy
  • Same model, same data, a 17-percentage-point improvement

Scenario 2: Credit Risk Modeling

  • Basic features: AUC 0.78
  • Engineered features: AUC 0.91
  • Made the difference between a rejected and a deployed model

Scenario 3: Time Series Forecasting

  • Without temporal features: RMSE 45.2
  • With temporal features: RMSE 18.7
  • 58% reduction in error

What Makes a Good Feature?

Great features share these characteristics:

  1. Informative: Strong correlation with target variable
  2. Independent: Low correlation with other features (reduces redundancy)
  3. Simple: Easy to compute and explain
  4. Robust: Stable across different data distributions
  5. Generalizable: Works well on unseen data

The Feature Engineering Workflow

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif
import warnings
warnings.filterwarnings('ignore')

# Load example dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Our feature engineering pipeline:
# 1. Understand → 2. Create → 3. Transform → 4. Select → 5. Validate

Let’s explore each step with real examples.


1. Understanding Your Data

Before engineering features, understand what you have:

def analyze_data(df, target_col='target'):
    """
    Comprehensive data analysis
    """
    print("=" * 60)
    print("DATASET OVERVIEW")
    print("=" * 60)
    print(f"Shape: {df.shape}")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

    print("\n" + "=" * 60)
    print("DATA TYPES")
    print("=" * 60)
    print(df.dtypes.value_counts())

    print("\n" + "=" * 60)
    print("MISSING VALUES")
    print("=" * 60)
    missing = df.isnull().sum()
    missing_pct = 100 * missing / len(df)
    missing_table = pd.DataFrame({
        'Missing': missing,
        'Percent': missing_pct
    })
    print(missing_table[missing_table['Missing'] > 0].sort_values('Percent', ascending=False))

    print("\n" + "=" * 60)
    print("STATISTICAL SUMMARY")
    print("=" * 60)
    print(df.describe())

    # Feature-target correlations
    if target_col in df.columns:
        print("\n" + "=" * 60)
        print("TOP CORRELATIONS WITH TARGET")
        print("=" * 60)
        correlations = df.corr(numeric_only=True)[target_col].sort_values(ascending=False)
        print(correlations[1:11])  # Top 10 excluding target itself

analyze_data(df)

Key Insights to Extract:

  • Distribution shapes (normal, skewed, multimodal)
  • Outliers and anomalies
  • Missing patterns (random vs. systematic)
  • Correlation structure
  • Class imbalance
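
To make these checks concrete, here is a minimal sketch that flags skewed features, counts outliers with the common 1.5 * IQR rule, and reports class balance (the thresholds are heuristics, not hard rules):

def quick_distribution_checks(df, target_col='target', skew_threshold=1.0):
    """
    Quick skew, outlier, and class-balance report
    """
    numeric_cols = df.drop(columns=[target_col]).select_dtypes(include=[np.number]).columns

    # Heavily skewed features are candidates for log/sqrt transforms
    skew = df[numeric_cols].skew()
    print("Most skewed features:")
    print(skew[skew.abs() > skew_threshold].sort_values(ascending=False).head(10))

    # Outliers via the 1.5 * IQR rule
    q1, q3 = df[numeric_cols].quantile(0.25), df[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    outliers = ((df[numeric_cols] < q1 - 1.5 * iqr) | (df[numeric_cols] > q3 + 1.5 * iqr)).sum()
    print("\nFeatures with the most outliers:")
    print(outliers.sort_values(ascending=False).head(10))

    # Class balance for the target
    print("\nTarget distribution:")
    print(df[target_col].value_counts(normalize=True))

quick_distribution_checks(df)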

2. Creating New Features

A. Domain-Inspired Features

The most powerful features come from domain knowledge:

# Example: E-commerce transaction data
def create_domain_features(transactions_df):
    """
    Create business-relevant features
    """
    df = transactions_df.copy()

    # Recency, Frequency, Monetary (RFM) features
    df['days_since_last_purchase'] = (
        pd.Timestamp.now() - df['last_purchase_date']
    ).dt.days

    df['purchase_frequency'] = df['total_purchases'] / df['customer_age_days']
    df['avg_order_value'] = df['total_spent'] / df['total_purchases']

    # Behavioral patterns
    df['weekend_shopper'] = df['weekend_purchases'] / df['total_purchases']
    df['discount_sensitivity'] = df['discounted_purchases'] / df['total_purchases']

    # Customer lifetime value estimate
    df['estimated_clv'] = (
        df['avg_order_value'] *
        df['purchase_frequency'] *
        365  # Annualized
    )

    # Engagement metrics
    df['cart_abandonment_rate'] = (
        df['carts_created'] - df['purchases']
    ) / df['carts_created']

    return df
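
The function above assumes a transactions table that already has columns like last_purchase_date, total_purchases, and carts_created. A tiny synthetic example (purely illustrative data) shows the kind of output it produces:

# Hypothetical usage on a tiny synthetic transactions table
transactions = pd.DataFrame({
    'last_purchase_date': pd.to_datetime(['2025-09-01', '2025-10-15']),
    'total_purchases': [12, 3],
    'total_spent': [540.0, 90.0],
    'customer_age_days': [400, 60],
    'weekend_purchases': [4, 1],
    'discounted_purchases': [6, 0],
    'carts_created': [20, 5],
    'purchases': [12, 3]
})
transactions_fe = create_domain_features(transactions)
print(transactions_fe[['avg_order_value', 'purchase_frequency', 'estimated_clv']])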

B. Mathematical Transformations

Transform features to better match model assumptions:

def apply_mathematical_transforms(df, numeric_cols):
    """
    Apply common mathematical transformations
    """
    transformed = df.copy()

    for col in numeric_cols:
        # Skip if column has zeros or negatives (for log/sqrt)
        if (transformed[col] <= 0).any():
            continue

        # Log transformation (reduces right skew)
        transformed[f'{col}_log'] = np.log1p(transformed[col])

        # Square root (mild skew reduction)
        transformed[f'{col}_sqrt'] = np.sqrt(transformed[col])

        # Square (emphasize larger values)
        transformed[f'{col}_squared'] = transformed[col] ** 2

        # Cube root (also works for negatives, though those columns are skipped above)
        transformed[f'{col}_cbrt'] = np.cbrt(transformed[col])

        # Reciprocal (inverse relationship); safe here because columns
        # containing zeros were skipped above
        transformed[f'{col}_reciprocal'] = 1 / transformed[col]

    return transformed

# Example: Apply to skewed features
skewed_features = df.select_dtypes(include=[np.number]).columns[:5]
df_transformed = apply_mathematical_transforms(df, skewed_features)

C. Interaction Features

Capture relationships between features:

def create_interactions(df, feature_pairs):
    """
    Create interaction features
    """
    interactions = df.copy()

    for feat1, feat2 in feature_pairs:
        # Multiplicative interaction
        interactions[f'{feat1}_x_{feat2}'] = (
            interactions[feat1] * interactions[feat2]
        )

        # Ratio (division)
        interactions[f'{feat1}_div_{feat2}'] = (
            interactions[feat1] / (interactions[feat2] + 1e-6)
        )

        # Difference
        interactions[f'{feat1}_minus_{feat2}'] = (
            interactions[feat1] - interactions[feat2]
        )

        # Sum
        interactions[f'{feat1}_plus_{feat2}'] = (
            interactions[feat1] + interactions[feat2]
        )

    return interactions

# Example: Medical diagnosis features
feature_pairs = [
    ('mean radius', 'mean texture'),
    ('mean area', 'mean smoothness'),
    ('worst radius', 'worst concavity')
]

df_interactions = create_interactions(df, feature_pairs)

D. Polynomial Features

Systematically create polynomial combinations:

from sklearn.preprocessing import PolynomialFeatures

def create_polynomial_features(df, numeric_cols, degree=2):
    """
    Generate polynomial features
    """
    poly = PolynomialFeatures(
        degree=degree,
        include_bias=False,
        interaction_only=False  # Set True for interactions only
    )

    # Apply to numeric columns
    X_poly = poly.fit_transform(df[numeric_cols])

    # Create feature names
    feature_names = poly.get_feature_names_out(numeric_cols)

    # Create DataFrame
    df_poly = pd.DataFrame(X_poly, columns=feature_names, index=df.index)

    return df_poly

# Warning: Can create many features quickly!
# Use with selected features, not all
important_features = ['mean radius', 'mean texture', 'mean area']
df_poly = create_polynomial_features(df, important_features, degree=2)
print(f"Created {df_poly.shape[1]} polynomial features from {len(important_features)} base features")

3. Temporal Feature Engineering

Time is a goldmine for features:

def create_temporal_features(df, date_col):
    """
    Extract comprehensive temporal features
    """
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col])

    # Basic temporal components
    df['year'] = df[date_col].dt.year
    df['month'] = df[date_col].dt.month
    df['day'] = df[date_col].dt.day
    df['dayofweek'] = df[date_col].dt.dayofweek
    df['dayofyear'] = df[date_col].dt.dayofyear
    df['quarter'] = df[date_col].dt.quarter
    df['week'] = df[date_col].dt.isocalendar().week

    # Time-based patterns
    df['is_weekend'] = (df['dayofweek'] >= 5).astype(int)
    df['is_month_start'] = df[date_col].dt.is_month_start.astype(int)
    df['is_month_end'] = df[date_col].dt.is_month_end.astype(int)
    df['is_quarter_start'] = df[date_col].dt.is_quarter_start.astype(int)
    df['is_quarter_end'] = df[date_col].dt.is_quarter_end.astype(int)

    # Cyclical encoding (preserves circular nature of time)
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    df['day_sin'] = np.sin(2 * np.pi * df['day'] / 31)
    df['day_cos'] = np.cos(2 * np.pi * df['day'] / 31)
    df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
    df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

    # Hour-based (if datetime includes time)
    if df[date_col].dt.hour.max() > 0:
        df['hour'] = df[date_col].dt.hour
        df['is_business_hours'] = (
            (df['hour'] >= 9) & (df['hour'] <= 17)
        ).astype(int)
        df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
        df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

    return df
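
A quick, hypothetical run on a synthetic date range shows the kinds of columns this produces:

# Hypothetical usage on a synthetic date range
events = pd.DataFrame({'event_time': pd.date_range('2025-01-01', periods=5, freq='7D')})
events_fe = create_temporal_features(events, 'event_time')
print(events_fe[['event_time', 'dayofweek', 'is_weekend', 'month_sin', 'month_cos']])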

Why Cyclical Encoding?

December (12) and January (1) are adjacent months, but numerically they’re far apart. Cyclical encoding preserves this relationship:

# Bad: Linear encoding
months = [1, 2, 11, 12]
# Distance between Dec and Jan: |12 - 1| = 11 (wrong!)

# Good: Cyclical encoding
month_sin = np.sin(2 * np.pi * np.array(months) / 12)
month_cos = np.cos(2 * np.pi * np.array(months) / 12)
# Now Dec and Jan are close in 2D space
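
A minimal distance check confirms the point: in (sin, cos) space December sits next to January, while June ends up on the opposite side of the circle.

# Distance check in (sin, cos) space
def month_point(m):
    return np.array([np.sin(2 * np.pi * m / 12), np.cos(2 * np.pi * m / 12)])

print(np.linalg.norm(month_point(12) - month_point(1)))   # ~0.52 (adjacent months)
print(np.linalg.norm(month_point(12) - month_point(6)))   # 2.0 (half a year apart)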

4. Categorical Feature Engineering

A. Basic Encoding

def encode_categorical_features(df, categorical_cols, method='label'):
    """
    Encode categorical variables
    """
    df_encoded = df.copy()

    if method == 'label':
        # Label Encoding: Good for ordinal features
        for col in categorical_cols:
            le = LabelEncoder()
            df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))

    elif method == 'onehot':
        # One-Hot Encoding: Good for nominal features with few categories
        df_encoded = pd.get_dummies(
            df_encoded,
            columns=categorical_cols,
            drop_first=True  # Avoid multicollinearity
        )

    elif method == 'frequency':
        # Frequency Encoding: Replace with occurrence count
        for col in categorical_cols:
            freq_map = df_encoded[col].value_counts().to_dict()
            df_encoded[f'{col}_freq'] = df_encoded[col].map(freq_map)

    return df_encoded
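
A hypothetical call on a small frame with illustrative categorical columns ('city' and 'device' are made-up names):

# Hypothetical usage with illustrative categorical columns
users = pd.DataFrame({
    'city': ['Oslo', 'Lima', 'Oslo', 'Pune'],
    'device': ['mobile', 'desktop', 'mobile', 'mobile']
})
print(encode_categorical_features(users, ['city', 'device'], method='onehot'))
print(encode_categorical_features(users, ['city', 'device'], method='frequency'))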

B. Target Encoding

Powerful but dangerous (can leak target information):

def target_encode(df, categorical_col, target_col, smoothing=10):
    """
    Target encoding with smoothing to prevent overfitting
    """
    # Calculate global mean
    global_mean = df[target_col].mean()

    # Calculate category statistics
    agg = df.groupby(categorical_col)[target_col].agg(['mean', 'count'])

    # Smoothing: blend category mean with global mean
    # More counts = more weight to category mean
    smoothed_mean = (
        agg['mean'] * agg['count'] + global_mean * smoothing
    ) / (agg['count'] + smoothing)

    # Map back to original dataframe
    df[f'{categorical_col}_target_encoded'] = df[categorical_col].map(smoothed_mean)

    return df

# Critical: Only fit on training data, apply to validation/test
# Otherwise you're leaking target information!
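
One leakage-safe pattern is sketched below: learn the smoothed means on the training split only, then map them onto validation/test, falling back to the global mean for unseen categories (the 'segment' column in the usage comments is hypothetical):

# Sketch: fit the target encoding on the training split only
def fit_target_encoding(train_df, categorical_col, target_col, smoothing=10):
    global_mean = train_df[target_col].mean()
    agg = train_df.groupby(categorical_col)[target_col].agg(['mean', 'count'])
    smoothed = (agg['mean'] * agg['count'] + global_mean * smoothing) / (agg['count'] + smoothing)
    return smoothed, global_mean

def apply_target_encoding(df, categorical_col, mapping, global_mean):
    # Categories never seen in training fall back to the global mean
    return df[categorical_col].map(mapping).fillna(global_mean)

# Hypothetical usage with a 'segment' column:
# mapping, global_mean = fit_target_encoding(train_df, 'segment', 'target')
# train_df['segment_te'] = apply_target_encoding(train_df, 'segment', mapping, global_mean)
# test_df['segment_te'] = apply_target_encoding(test_df, 'segment', mapping, global_mean)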

C. High-Cardinality Categories

When you have many categories:

def handle_high_cardinality(df, col, method='frequency', top_n=10):
    """
    Handle categorical features with many unique values
    """
    df_handled = df.copy()

    if method == 'top_n':
        # Keep top N categories, group rest as 'Other'
        top_categories = df[col].value_counts().nlargest(top_n).index
        df_handled[col] = df_handled[col].apply(
            lambda x: x if x in top_categories else 'Other'
        )

    elif method == 'frequency':
        # Replace with frequency
        freq_map = df[col].value_counts().to_dict()
        df_handled[f'{col}_freq'] = df_handled[col].map(freq_map)

    elif method == 'hash':
        # Hash to a fixed number of buckets
        # (use a stable hash: Python's built-in hash() is salted per process)
        import hashlib
        df_handled[f'{col}_hash'] = df_handled[col].apply(
            lambda x: int(hashlib.md5(str(x).encode()).hexdigest(), 16) % 100  # 100 buckets
        )

    return df_handled
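
A hypothetical example with an illustrative high-cardinality column ('merchant_id' is a made-up name):

# Hypothetical usage: keep the top 2 merchants, group the rest as 'Other'
orders = pd.DataFrame({'merchant_id': ['m1', 'm2', 'm1', 'm3', 'm1', 'm4']})
reduced = handle_high_cardinality(orders, 'merchant_id', method='top_n', top_n=2)
print(reduced['merchant_id'].value_counts())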

5. Handling Missing Values

Missing data is often informative in its own right:

def handle_missing_values(df, strategy='comprehensive'):
    """
    Sophisticated missing value handling
    """
    df_filled = df.copy()

    for col in df.columns:
        if df[col].isnull().sum() > 0:
            # Create missingness indicator
            df_filled[f'{col}_is_missing'] = df[col].isnull().astype(int)

            if pd.api.types.is_numeric_dtype(df[col]):
                if strategy == 'simple':
                    # Simple: fill with the median
                    df_filled[col] = df[col].fillna(df[col].median())

                elif strategy == 'comprehensive':
                    # Better: keep several candidate fills as extra columns
                    df_filled[f'{col}_filled_median'] = df[col].fillna(df[col].median())
                    df_filled[f'{col}_filled_mean'] = df[col].fillna(df[col].mean())
                    df_filled[f'{col}_filled_mode'] = df[col].fillna(df[col].mode()[0])

                    # Forward fill and backward fill for time series
                    if 'date' in col.lower() or 'time' in col.lower():
                        df_filled[f'{col}_filled_ffill'] = df[col].ffill()
                        df_filled[f'{col}_filled_bfill'] = df[col].bfill()

            else:
                # Categorical: use the mode, or a dedicated 'Missing' category
                fill_value = df[col].mode()[0] if len(df[col].mode()) > 0 else 'Missing'
                df_filled[col] = df[col].fillna(fill_value)

    return df_filled
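
The breast cancer dataset has no missing values, so here is a hypothetical run on a small frame with gaps:

# Hypothetical usage on a small frame with missing values
demo = pd.DataFrame({
    'age': [34, np.nan, 51, 29],
    'plan': ['basic', None, 'pro', 'basic']
})
print(handle_missing_values(demo, strategy='simple'))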

6. Scaling and Normalization

Many models, particularly linear models, neural networks, and distance-based algorithms, need features on similar scales:

def scale_features(df, numeric_cols, method='standard'):
    """
    Scale numerical features
    """
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

    df_scaled = df.copy()

    if method == 'standard':
        # Standardization: mean=0, std=1
        # Good for: Most ML algorithms, especially linear models
        scaler = StandardScaler()

    elif method == 'minmax':
        # Min-Max: scale to [0, 1]
        # Good for: Neural networks, algorithms sensitive to scale
        scaler = MinMaxScaler()

    elif method == 'robust':
        # Robust: median and IQR (resistant to outliers)
        # Good for: Data with outliers
        scaler = RobustScaler()

    df_scaled[numeric_cols] = scaler.fit_transform(df_scaled[numeric_cols])

    return df_scaled, scaler

# Example
numeric_features = df.drop('target', axis=1).select_dtypes(include=[np.number]).columns.tolist()  # exclude the target from scaling
df_scaled, scaler = scale_features(df, numeric_features, method='standard')

7. Feature Selection

More features aren't always better:

def select_features(X, y, method='statistical', k=10):
    """
    Select most important features
    """
    if method == 'statistical':
        # Statistical test (ANOVA F-value)
        selector = SelectKBest(score_func=f_classif, k=k)
        X_selected = selector.fit_transform(X, y)
        selected_features = X.columns[selector.get_support()].tolist()

    elif method == 'model_based':
        # Use model feature importance
        from sklearn.ensemble import RandomForestClassifier

        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X, y)

        # Get feature importance
        importances = pd.DataFrame({
            'feature': X.columns,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=False)

        selected_features = importances.head(k)['feature'].tolist()
        X_selected = X[selected_features]

    elif method == 'correlation':
        # Remove highly correlated features
        corr_matrix = X.corr().abs()
        upper_triangle = corr_matrix.where(
            np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
        )

        # Find features with correlation > 0.95
        to_drop = [col for col in upper_triangle.columns
                   if any(upper_triangle[col] > 0.95)]

        selected_features = [col for col in X.columns if col not in to_drop]
        X_selected = X[selected_features]

    return X_selected, selected_features

# Example
X = df_scaled.drop('target', axis=1)
y = df_scaled['target']

X_selected, selected_features = select_features(X, y, method='model_based', k=15)
print(f"Selected features: {selected_features}")

8. Validation and Testing

Always validate your features:

def validate_features(X_train, X_test, y_train, y_test):
    """
    Validate feature engineering with cross-validation
    """
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=100, random_state=42)

    # Cross-validation on training set
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')

    print("Cross-Validation Results:")
    print(f"  Mean Accuracy: {cv_scores.mean():.4f}")
    print(f"  Std Deviation: {cv_scores.std():.4f}")

    # Test set performance
    model.fit(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"\nTest Set Accuracy: {test_score:.4f}")

    # Check for overfitting
    train_score = model.score(X_train, y_train)
    print(f"Train Set Accuracy: {train_score:.4f}")

    if train_score - test_score > 0.1:
        print("\n⚠️  Warning: Possible overfitting detected!")
    else:
        print("\n✓ Good generalization")

    return cv_scores, test_score
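
For example, splitting the selected features from the previous step and running the check:

# Validate the selected features from the previous step
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(
    X_selected, y, test_size=0.2, random_state=42, stratify=y
)
cv_scores, test_score = validate_features(X_tr, X_te, y_tr, y_te)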

Complete Pipeline Example

Putting it all together:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def create_feature_engineering_pipeline(numeric_features, categorical_features):
    """
    Create a complete feature engineering pipeline
    """
    # Numeric features pipeline
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    # Categorical features pipeline
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])

    # Combine transformers
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])

    return preprocessor

# Full workflow
def full_feature_engineering_workflow(df, target_col):
    """
    Complete feature engineering workflow
    """
    # 1. Separate features and target
    X = df.drop(target_col, axis=1)
    y = df[target_col]

    # 2. Identify feature types
    numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
    categorical_features = X.select_dtypes(include=['object']).columns.tolist()

    # 3. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # 4. Create and apply pipeline
    preprocessor = create_feature_engineering_pipeline(
        numeric_features, categorical_features
    )

    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)

    # 5. Validate
    cv_scores, test_score = validate_features(
        X_train_processed, X_test_processed, y_train, y_test
    )

    return X_train_processed, X_test_processed, y_train, y_test, preprocessor

# Execute
X_train, X_test, y_train, y_test, preprocessor = full_feature_engineering_workflow(
    df, 'target'
)

Anti-Patterns to Avoid

1. Target Leakage

# ❌ WRONG: Using future information
df['next_month_sales'] = df['sales'].shift(-1)  # Leaks future data!

# ✓ CORRECT: Only use past information
df['prev_month_sales'] = df['sales'].shift(1)

2. Fitting on Entire Dataset

# ❌ WRONG: Fit scaler on all data
scaler = StandardScaler().fit(df[numeric_cols])
df_scaled = scaler.transform(df[numeric_cols])
train, test = train_test_split(df_scaled)

# ✓ CORRECT: Fit only on training data
train, test = train_test_split(df)
scaler = StandardScaler().fit(train[numeric_cols])
train_scaled = scaler.transform(train[numeric_cols])
test_scaled = scaler.transform(test[numeric_cols])
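
A simple way to avoid this mistake entirely is to wrap the scaler and the model in a scikit-learn Pipeline, so the scaler is re-fit inside every training fold during cross-validation. A minimal sketch (X and y stand for your unscaled feature matrix and target):

# A Pipeline re-fits the scaler inside each training fold, so nothing
# from the validation data leaks into the scaling statistics
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)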

3. Creating Too Many Features

# ❌ WRONG: Creating thousands of features blindly
for i in numeric_cols:
    for j in numeric_cols:
        df[f'{i}_x_{j}'] = df[i] * df[j]  # Combinatorial explosion!

# ✓ CORRECT: Be selective based on domain knowledge
important_interactions = [
    ('feature_a', 'feature_b'),  # Known to interact
    ('feature_c', 'feature_d')   # Business logic suggests interaction
]
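
Those vetted pairs can then go straight into the create_interactions helper defined earlier, instead of looping over every combination. For instance, reusing the three medically motivated pairs from before:

# Reuse the vetted pairs from the earlier medical example
df_selected = create_interactions(df, feature_pairs)
print(f"Added {df_selected.shape[1] - df.shape[1]} interaction columns")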

Conclusion

Feature engineering is where data science becomes an art. The techniques in this guide provide a solid foundation, but the real magic happens when you combine them with deep domain knowledge and iterative experimentation.

Key Takeaways:

  1. Understand first: Explore data thoroughly before engineering
  2. Domain knowledge: Best features come from understanding the problem
  3. Iterate: Try multiple approaches, measure, refine
  4. Validate rigorously: Avoid leakage, use proper train/test splits
  5. Keep it simple: Start with simple features, add complexity only if needed
  6. Document: Keep track of what works and why

The difference between a mediocre model and an exceptional one often lies not in the algorithm choice, but in the quality of features you feed it. Master feature engineering, and you’ll master machine learning.


Resources

  • Books: “Feature Engineering for Machine Learning” by Alice Zheng & Amanda Casari
  • Kaggle: Study feature engineering in winning solutions
  • Book: “Feature Engineering and Selection: A Practical Approach” (Kuhn & Johnson)
  • Tools: Featuretools for automated feature engineering

Remember: Models are only as good as their features. Engineer wisely.

#Machine Learning #Feature Engineering #Data Science #Model Performance #Python #scikit-learn