Pipelines and cross-validation
Hook
Once you have preprocessing + a model, you have two problems: keeping them in sync between train and test, and trusting that the test score isn’t just luck. Pipelines fix the first. Cross-validation fixes the second.
Concept
A Pipeline is a chain of transformations ending in a model. The whole
thing acts like a single model — fit, predict, score.

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression

    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression()),
    ])

    pipe.fit(X_train, y_train)
    pipe.score(X_test, y_test)

The scaler fits on train data and transforms both train and test correctly, with no manual bookkeeping.
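To make the "no manual bookkeeping" claim concrete, here is a minimal sketch on synthetic data (the arrays and the random seed are assumptions for illustration, not the lesson's dataset). It checks that the scaler inside the pipeline learned its means from the training rows only, and that doing the same steps by hand gives the same test score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lesson's dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=30.0, scale=8.0, size=(200, 2))
y = 7.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=5.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)

# The fitted scaler's statistics come from the training rows only.
scaler_means = pipe.named_steps['scaler'].mean_
means_from_train_only = np.allclose(scaler_means, X_train.mean(axis=0))

# The same steps done by hand: fit the scaler on train, reuse it on test.
manual_scaler = StandardScaler().fit(X_train)
manual_model = LinearRegression().fit(manual_scaler.transform(X_train), y_train)
manual_score = manual_model.score(manual_scaler.transform(X_test), y_test)
pipeline_score = pipe.score(X_test, y_test)

print("scaler fit on train only:", means_from_train_only)
print("pipeline matches manual steps:", np.isclose(pipeline_score, manual_score))
```

The manual version works, but it has three places where you could accidentally call fit on the wrong data. The pipeline has one.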
For mixed numeric + categorical features, use ColumnTransformer:

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    preprocess = ColumnTransformer([
        ('num', StandardScaler(), ['carries', 'attempts']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['position_group']),
    ])

    pipe = Pipeline([
        ('prep', preprocess),
        ('model', LinearRegression()),
    ])

Cross-validation trains the model on K-1 folds and evaluates on the held-out fold, K times. The result is K scores instead of one. The standard deviation across folds tells you how reliable the score is.

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
    print(f"R² = {scores.mean():.3f} ± {scores.std():.3f}")

A model with R² = 0.62 ± 0.03 scores about the same on every fold, so the number is trustworthy. R² = 0.62 ± 0.20 swings widely from fold to fold, so it is not.
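If the K-fold procedure feels abstract, it can help to write the loop out once. The sketch below uses synthetic data (an assumption for illustration) and a hand-written loop over KFold splits, then compares the result with cross_val_score given the same splitter. It is one way to picture what happens on each fold, not a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lesson's dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))
y = X @ np.array([3.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=150)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])

# Without shuffle=True, KFold cuts the rows into contiguous blocks.
kf = KFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in kf.split(X):
    fold_pipe = clone(pipe)                        # fresh, unfitted copy per fold
    fold_pipe.fit(X[train_idx], y[train_idx])      # train on K-1 folds
    manual_scores.append(fold_pipe.score(X[test_idx], y[test_idx]))  # R² on held-out fold
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(pipe, X, y, cv=kf, scoring='r2')

print(f"manual : {manual_scores.mean():.3f} ± {manual_scores.std():.3f}")
print(f"library: {library_scores.mean():.3f} ± {library_scores.std():.3f}")
```

Because the whole pipeline is refit inside each fold, the scaler never sees the held-out rows. That is the reason to pass the pipeline, not a pre-scaled X, to cross_val_score.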
Lions example
A full pipeline predicting Lions QB passing yards, with cross-validation:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sqlalchemy import create_engine

    eng = create_engine("postgresql+psycopg://onepride:lions@localhost:5432/onepride")

    df = pd.read_sql(
        """
        SELECT ws.season, ws.week, ws.passing_yards, ws.attempts,
               ws.completions, ws.interceptions, sc.roof
        FROM weekly_stats ws
        JOIN schedules sc
          ON sc.season = ws.season AND sc.week = ws.week
         AND (sc.home_team = ws.recent_team OR sc.away_team = ws.recent_team)
        WHERE ws.position = 'QB'
          AND ws.season BETWEEN 2021 AND 2024
          AND ws.season_type = 'REG'
          AND ws.attempts >= 15
        """,
        eng,
    )

    X = df[['attempts', 'completions', 'interceptions', 'roof']]
    y = df['passing_yards']

    pipe = Pipeline([
        ('prep', ColumnTransformer([
            ('num', StandardScaler(), ['attempts', 'completions', 'interceptions']),
            ('cat', OneHotEncoder(handle_unknown='ignore'), ['roof']),
        ])),
        ('model', LinearRegression()),
    ])

    # NB: cv=5 is plain KFold, which ignores time order, so some folds train on
    # later games and are evaluated on earlier ones. For game data the right tool
    # is TimeSeriesSplit (used in Challenge 5), so the number below is optimistic.
    # The Common mistakes section calls this out.
    scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
    print(f"R² (KFold, optimistic) = {scores.mean():.3f} ± {scores.std():.3f}")

Try it
Take the Lesson 4 model. Wrap it in a Pipeline with StandardScaler.
Run 5-fold cross-validation. Report the mean R² and standard deviation.
Compare to the single-split R² from Lesson 4 — is it within one standard
deviation?
Common mistakes
- Scaling before train_test_split. The scaler sees the test rows, so their mean and variance leak into training. Put the scaler in a Pipeline and it is fit on the training rows only.
- Time-series leakage with cross_val_score. Default K-fold CV ignores time order, so a fold can train on later games and be tested on earlier ones. For game-by-game data, sort by date and use TimeSeriesSplit so test folds always come after train folds.
- One-hot encoding without handle_unknown='ignore'. Test set has a category the train set didn’t? Default behavior errors out. Set handle_unknown='ignore' and that row gets all zeros in the one-hot columns.
- Cherry-picking the best fold. Report mean ± std, not the best single score.
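Two of the mistakes above are easy to check in code. The sketch below is illustrative only: the data is synthetic, the rows are assumed to be sorted oldest to newest, and the roof labels ('dome', 'outdoors', 'retractable') are hypothetical values, not taken from the lesson's database. It first confirms that every TimeSeriesSplit fold trains strictly on earlier rows than it tests on, then shows what an unseen category looks like with handle_unknown='ignore'.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; rows are assumed to be sorted oldest to newest.
rng = np.random.default_rng(2)
n_games = 120
X = rng.normal(size=(n_games, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n_games)

# 1) TimeSeriesSplit: each test fold comes after its training rows.
tscv = TimeSeriesSplit(n_splits=5)
ordered = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in tscv.split(X)
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
scores = cross_val_score(pipe, X, y, cv=tscv, scoring='r2')

print("train always precedes test:", ordered)
print(f"R² (TimeSeriesSplit) = {scores.mean():.3f} ± {scores.std():.3f}")

# 2) handle_unknown='ignore': an unseen category encodes as all zeros.
# The roof labels here are hypothetical examples.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(np.array([['dome'], ['outdoors']]))
unseen_row = encoder.transform(np.array([['retractable']])).toarray()

print("unseen category encodes to:", unseen_row)
```

TimeSeriesSplit only respects row order, so it is on you to sort the data by date (for example by season and week) before splitting.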
Quick check
- What problem does a Pipeline solve that you’d hit doing scaling manually?
- When should you use TimeSeriesSplit instead of regular KFold?
- How do you interpret R² = 0.50 ± 0.18?