Applied ML · beginner

Feature Engineering & Pipelines

“Garbage in, garbage out — the art of turning raw data into model-ready signals”

The full preprocessing pipeline: imputation (Simple, MICE), categorical encoding (OHE, Target, Ordinal), scaling (Standard, MinMax, Robust), feature creation (polynomial, interactions, log transforms), and sklearn Pipelines for leakage-free evaluation.

Prerequisites

→ Linear Regression

→ Model Evaluation

Concepts Covered

Imputation, OneHotEncoder, StandardScaler, RobustScaler, PolynomialFeatures, ColumnTransformer, Data Leakage

∑ Key Formulas

StandardScaler

Zero mean, unit variance — sensitive to outliers

MinMaxScaler

Scales to [0,1]; sensitive to outliers. Zeros stay zero only when the feature minimum is 0, so it does not preserve sparsity in general (MaxAbsScaler does)

RobustScaler

Scales using median and IQR — robust to outliers

Log Transform

Compresses skewed distributions — useful for income, population counts
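
The four entries above, written out in their usual textbook forms. The symbols ($\mu$ and $\sigma$ for the training mean and standard deviation, $Q_1$ and $Q_3$ for the quartiles) are notation assumed here, and $\log(1 + x)$ is one common variant of the log transform rather than the only one:

```latex
\begin{aligned}
\text{StandardScaler:}\quad & z = \frac{x - \mu}{\sigma} \\
\text{MinMaxScaler:}\quad   & x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \\
\text{RobustScaler:}\quad   & x' = \frac{x - \operatorname{median}(x)}{Q_3 - Q_1} \\
\text{Log transform:}\quad  & x' = \log(1 + x), \qquad x \ge 0
\end{aligned}
```

In every case the statistics ($\mu$, $\sigma$, $x_{\min}$, $x_{\max}$, median, quartiles) are estimated on the training data and reused as-is on new data.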

🎯

Why Feature Engineering Wins Competitions

motivation

Andrew Ng famously said, 'Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.' In Kaggle competitions on tabular data, top-ranked solutions tend to stand out through their feature engineering more than through their model architectures. A linear model with well-designed features can often match or beat a deep network fed raw features on tabular problems. Features encode human knowledge about the problem domain: they are the bridge between raw measurements and the mathematical structure a model can exploit.

In the Netflix Prize ($1M), the winning team's solution relied heavily on engineered signals such as temporal patterns, implicit user feedback, and movie metadata interactions, not on model sophistication alone.

💡

The Pipeline Mindset

intuition

Think of feature engineering as a sequence of transformations: Raw Data → Imputation (fill missing values) → Encoding (convert categoricals to numbers) → Scaling (put features on comparable scales) → Feature creation (derive new signals) → Selection (drop noisy/redundant features). Each step must be fit on training data only and then applied unchanged to test data; scikit-learn Pipelines enforce this as long as every fitted step lives inside the pipeline. A Pipeline is also serializable, so the preprocessing ships bundled with the model at deployment.

Data leakage is one of the most dangerous bugs in ML: if test data influences any preprocessing step, your evaluation is optimistically biased and cannot be trusted. Keeping every fitted step inside a Pipeline prevents this by design.
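
A minimal sketch of both properties, fit-on-train-only and serializability, on synthetic data. The data, the step names "scaler" and "clf", and the choice of LogisticRegression are assumptions made for illustration:

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)  # scaler statistics come from X_train only

# The scaler inside the pipeline stored the training mean, not the full-data mean
scaler_mean = pipe.named_steps["scaler"].mean_
uses_train_mean = np.allclose(scaler_mean, X_train.mean(axis=0))

# predict() reuses the stored training statistics on the test rows
test_pred = pipe.predict(X_test)

# Preprocessing and model serialize together as a single object
restored = pickle.loads(pickle.dumps(pipe))
same_predictions = np.array_equal(restored.predict(X_test), test_pred)

print(uses_train_mean, same_predictions)
```

The test rows are only ever transformed, never fitted on, and the unpickled object reproduces the same predictions without any separate preprocessing code.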

⚙️

The 5-Stage Pipeline

algorithm

Imputation: SimpleImputer (mean/median/mode/constant) or IterativeImputer (MICE multivariate)

Encoding: OrdinalEncoder for ordered categories, OneHotEncoder for nominal (use drop='first' to avoid the dummy variable trap in unregularized linear models; tree-based models do not need it)

Scaling: StandardScaler for Gaussian-ish data, RobustScaler when outliers exist, MinMaxScaler for bounded inputs

Feature creation: PolynomialFeatures (x², x·y interactions), date decomposition (day/month/weekday), domain transforms (log, sqrt)

Selection: VarianceThreshold, SelectKBest (mutual info / chi²), SelectFromModel (tree importances), RFECV
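
Of the five stages, feature creation is the least mechanical. The sketch below shows date decomposition and a log transform on a hypothetical toy frame; the column names `signup` and `income` are illustrative assumptions, and `log1p` is one reasonable choice of log transform:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy frame; column names are illustrative only
df = pd.DataFrame({
    "signup": pd.to_datetime(["2024-01-15", "2024-03-02", "2024-12-31"]),
    "income": [0.0, 9.0, 99999.0],
})

# Date decomposition: one numeric column per calendar component
df["signup_day"] = df["signup"].dt.day
df["signup_month"] = df["signup"].dt.month
df["signup_weekday"] = df["signup"].dt.weekday  # Monday = 0

# Log transform wrapped as a transformer so it can sit inside a Pipeline;
# log1p computes log(1 + x), which is defined at x = 0
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1)
df["income_log"] = np.asarray(log_tf.fit_transform(df[["income"]])).ravel()

print(df.drop(columns="signup"))
```

Wrapping the transform in FunctionTransformer keeps it usable as a Pipeline step, so the same transform is applied at training and prediction time.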

🔬

Categorical Encoding Strategies

deepdive

One-Hot Encoding creates one binary column per category, which suits unordered categories with few distinct values. With high-cardinality categoricals (cities, zip codes, product IDs), OHE explodes dimensionality. Target Encoding is an alternative: replace each category with the mean target value observed for that category. Computed naively it leaks the target, so compute the means out-of-fold, encoding each row from folds that exclude it. CatBoost's ordered target encoding addresses the same problem by encoding each sample using only the samples that precede it in a random permutation. For ordinal features (Low/Medium/High), use OrdinalEncoder with an explicit category order; the default order is the sorted one (High, Low, Medium), which is not the intended ranking.

High cardinality + OHE = disaster. 10,000 zip codes → 10,000 columns, most nearly empty. Use target encoding, embedding layers, or feature hashing instead.
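
A minimal sketch of out-of-fold target encoding, followed by an OrdinalEncoder with an explicit order. The helper `oof_target_encode`, its parameters, and the toy data are hypothetical names introduced for illustration; this is not CatBoost's ordered scheme, and there is no smoothing:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import OrdinalEncoder


def oof_target_encode(categories, target, n_splits=5, seed=0):
    """Encode each row with category means computed on the other folds."""
    cat = pd.Series(categories).reset_index(drop=True)
    y = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = np.empty(len(cat), dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(cat):
        # Category means from rows outside the fold being encoded
        fold_means = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).mean()
        # Categories absent from the fitting rows fall back to their overall mean
        fallback = y.iloc[fit_idx].mean()
        encoded[enc_idx] = (
            cat.iloc[enc_idx].map(fold_means).fillna(fallback).to_numpy()
        )
    return encoded


# Toy data: "rare" appears once, so it can never be encoded from its own target
city = ["a"] * 10 + ["b"] * 10 + ["rare"]
y = [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2 + [1]
enc = oof_target_encode(city, y)

# Ordinal feature with the order stated explicitly
levels = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
codes = levels.fit_transform(pd.DataFrame({"risk": ["Low", "High", "Medium"]}))
print(enc.round(2), codes.ravel())
```

This out-of-fold scheme is for encoding the training rows. At prediction time, category means fitted on the full training set are applied to new rows.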

∑

Scaling Choices and Their Effects

math

StandardScaler: sets mean=0 and std=1. It does not require Gaussian data, but it behaves best on roughly symmetric features without heavy outliers. Use it for scale-sensitive models: SVMs, regularized linear models (Lasso/Ridge), PCA, KNN, neural networks. It is not needed for tree-based models (Random Forest, XGBoost), because tree splits depend only on the ordering of feature values, not their magnitude.

MinMaxScaler: use when the algorithm expects bounded inputs, for example [0,1] features feeding a neural network.

RobustScaler: use when outliers are present. It centers on the median and scales by the IQR, so extreme values barely affect the fitted statistics.
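
To see the effect of a single outlier on each scaler, here is a small sketch on a made-up five-value column (the data is an assumption chosen so the arithmetic is easy to follow):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Four ordinary values and one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x).ravel()
minmax = MinMaxScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()  # median = 3, IQR = 4 - 2 = 2

for name, values in [("standard", standard), ("minmax", minmax), ("robust", robust)]:
    print(f"{name:>8}: {np.round(values, 3)}")

# Spread of the four ordinary values after each scaling
inlier_spread = {
    "standard": standard[3] - standard[0],
    "minmax": minmax[3] - minmax[0],
    "robust": robust[3] - robust[0],
}
print(inlier_spread)
```

With StandardScaler and MinMaxScaler the outlier dominates the fitted statistics, so the four ordinary values are squeezed into a narrow band. RobustScaler keeps them spread over 1.5 units and leaves the outlier far away at 48.5.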

</>

sklearn Pipeline — Full Example

code

python

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# ── Sample DataFrame ───────────────────────────────────────────────────
np.random.seed(42)
n = 300
df = pd.DataFrame({
    'age':        np.random.randint(18, 70, n).astype(float),
    'income':     np.random.exponential(40000, n),
    'score':      np.random.uniform(300, 850, n),
    'city':       np.random.choice(['Paris', 'Lyon', 'Toulouse'], n),
    'occupation': np.random.choice(['engineer', 'teacher', 'doctor'], n),
    'target':     np.random.randint(0, 2, n),
})
# Add some missing values
df.loc[np.random.choice(n, 20, replace=False), 'age'] = np.nan
df.loc[np.random.choice(n, 15, replace=False), 'city'] = np.nan

X_train = df.drop('target', axis=1)
y_train = df['target']

# ── Define column groups ───────────────────────────────────────────
num_features = ['age', 'income', 'score']
cat_features = ['city', 'occupation']

# ── Preprocessing for numeric columns ─────────────────────────────
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler()),
])

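# ── Preprocessing for categorical columns ─────────────────────────
# No drop='first' here: the downstream models are tree-based, and with
# handle_unknown='ignore' an unseen category encodes to all zeros,
# which would be indistinguishable from the dropped category.
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])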

# ── Combine with ColumnTransformer ─────────────────────────────────
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, num_features),
    ('cat', categorical_transformer, cat_features),
])

# ── Full pipeline: preprocess → interactions → feature select → model ──
# random_state is fixed on both estimators so repeated runs give the same scores
pipe = Pipeline([
    ('prep', preprocessor),
    ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42),
                               threshold='median')),
    ('clf', GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                       random_state=42)),
])

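# Evaluate: cross_val_score clones the pipeline and refits every step on each
# training fold, so the held-out fold never influences preprocessing.
# The target here is random noise, so expect an AUC near 0.5.
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")

# Then fit the final pipeline on all training data for deployment
pipe.fit(X_train, y_train)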

⚠️

Preprocessing Pitfalls

pitfall

First: fitting scalers on the full dataset (before splitting) is data leakage, because test statistics contaminate training. Always fit inside a Pipeline or on X_train only.

Second: OneHotEncoder may meet categories in test data that it never saw during fit; use handle_unknown='ignore'.

Third: imputing with the mean before splitting leaks the test mean into training.

Fourth: polynomial features explode memory. 100 features at degree=2 produce 5,151 columns with the default bias term, and interaction_only=True removes only the 100 squared terms (5,051 columns). Restrict the expansion to a few chosen columns and apply feature selection downstream.

Fifth: target encoding without cross-validation leaks target information.
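
The size of a polynomial expansion can be checked without building the expanded matrix. A small sketch, where the helper `n_columns` is a hypothetical convenience function, reads the column count from PolynomialFeatures after fitting on a single dummy row:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One dummy row with 100 features: fit() only needs the input width
X = np.zeros((1, 100))


def n_columns(**kwargs):
    """Number of output columns PolynomialFeatures would produce for X."""
    return PolynomialFeatures(**kwargs).fit(X).n_output_features_


full_d2 = n_columns(degree=2)  # bias + 100 linear + 100 squares + 4,950 pairs
inter_d2 = n_columns(degree=2, interaction_only=True, include_bias=False)
full_d3 = n_columns(degree=3)

print(full_d2, inter_d2, full_d3)  # 5151 5050 176851
```

At degree 2, interaction_only=True saves little; the growth from degree 2 to degree 3 is what makes unrestricted expansion impractical on wide data.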

The Pipeline object in scikit-learn is not just convenient: it is the practical way to get correct cross-validation. Any preprocessing that 'learns' from data (scalers, encoders, imputers, feature selectors) must be inside the pipeline so that it is refit on each training fold.
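
A sketch of what goes wrong when a fitted step sits outside the pipeline. The features and labels are pure noise (an assumption chosen so that any score well above chance must come from leakage), and supervised feature selection stands in for the fitted preprocessing step:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: no feature carries real information about y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)

# Leaky: the selector sees every label, including those of future validation folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: the selector is refit inside each training fold
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky accuracy:  {leaky:.2f}")
print(f"honest accuracy: {honest:.2f}")
```

The leaky score lands well above chance even though there is nothing to learn, while the pipeline version stays close to 0.5. The only difference between the two is where the selector was fitted.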
