Pipelines and cross-validation

Level 4 · Lesson 6

Hook

Once you have preprocessing + a model, you have two problems: keeping them in sync between train and test, and trusting that the test score isn’t just luck. Pipelines fix the first. Cross-validation fixes the second.

Concept

A Pipeline is a chain of transformations ending in a model. The whole thing acts like a single model — fit, predict, score.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

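When you call fit, the scaler learns its mean and standard deviation from the training data only. When you call predict or score, the pipeline applies that same fitted transformation to the new data before it reaches the model, with no manual bookkeeping.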
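To check that the scaler really sees only the training rows, you can inspect the fitted step. The sketch below uses synthetic stand-in data (an assumption, not the course database) and compares the scaler's stored means against the training-set means.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not from the course database).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=5.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)

# The fitted scaler's statistics come from the training rows only.
scaler_means = pipe.named_steps['scaler'].mean_
train_only = np.allclose(scaler_means, X_train.mean(axis=0))
uses_all_rows = np.allclose(scaler_means, X.mean(axis=0))
print(train_only, uses_all_rows)
```

If the second value were ever True, test rows would be influencing the scaling, which is the leak a Pipeline is meant to prevent.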

For mixed numeric + categorical features, use ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['carries', 'attempts']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['position_group']),
])
pipe = Pipeline([
    ('prep', preprocess),
    ('model', LinearRegression()),
])

Cross-validation splits the data into K folds, trains the model on K-1 of them, and evaluates on the held-out fold, repeating K times so each fold is held out once. The result is K scores instead of one. The standard deviation across folds tells you how much the score depends on the particular split.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"R² = {scores.mean():.3f} ± {scores.std():.3f}")

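A model with R² = 0.62 ± 0.03 scores about the same on every fold, so the estimate is stable. At R² = 0.62 ± 0.20 the score depends heavily on which rows land in the test fold, so no single number can be trusted.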
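The fold-by-fold procedure can be written out by hand to show what cross_val_score is doing. The sketch below assumes synthetic data and a plain KFold with no shuffling, which is what cv=5 resolves to for a regressor; it should produce the same five scores as the helper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

pipe = Pipeline([('scaler', StandardScaler()), ('model', LinearRegression())])

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    fold_model = clone(pipe)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])  # scaler + model see K-1 folds
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))  # R² on held-out fold
manual_scores = np.array(manual_scores)

helper_scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(manual_scores.round(3))
print(helper_scores.round(3))
```

Because the whole pipeline is refit inside each fold, the scaler never sees the held-out rows.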

Lions example

A full pipeline predicting Lions QB passing yards, with cross-validation:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sqlalchemy import create_engine
eng = create_engine("postgresql+psycopg://onepride:lions@localhost:5432/onepride")
df = pd.read_sql(
    """
    SELECT ws.season, ws.week, ws.passing_yards, ws.attempts,
           ws.completions, ws.interceptions, sc.roof
    FROM weekly_stats ws
    JOIN schedules sc
      ON sc.season = ws.season
     AND sc.week = ws.week
     AND (sc.home_team = ws.recent_team OR sc.away_team = ws.recent_team)
    WHERE ws.position = 'QB'
      AND ws.season BETWEEN 2021 AND 2024
      AND ws.season_type = 'REG'
      AND ws.attempts >= 15
    """,
    eng,
)
X = df[['attempts', 'completions', 'interceptions', 'roof']]
y = df['passing_yards']
pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('num', StandardScaler(), ['attempts', 'completions', 'interceptions']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['roof']),
    ])),
    ('model', LinearRegression()),
])
# NB: cv=5 is regular KFold, which ignores time order, so some folds train on
# later weeks/seasons and test on earlier ones. For game data the right tool is
# TimeSeriesSplit (used in Challenge 5), so the number below is optimistic.
# The Common mistakes section calls this out.
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"R² (KFold, optimistic) = {scores.mean():.3f} ± {scores.std():.3f}")
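For comparison, here is a sketch of the time-aware alternative. It runs on a hypothetical stand-in DataFrame called games (made-up values, one row per week) rather than the real query result, and it assumes the rows must be sorted by season and week first, because TimeSeriesSplit splits by row position.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df: 4 seasons x 17 weeks, one row per game.
rng = np.random.default_rng(2)
games = pd.DataFrame({
    'season': np.repeat([2021, 2022, 2023, 2024], 17),
    'week': np.tile(np.arange(1, 18), 4),
})
games['attempts'] = rng.integers(15, 50, size=len(games))
games['passing_yards'] = (
    7.0 * games['attempts'] + rng.normal(scale=30.0, size=len(games))
)

# TimeSeriesSplit splits by row position, so sort chronologically first.
games = games.sort_values(['season', 'week']).reset_index(drop=True)

pipe = Pipeline([('scaler', StandardScaler()), ('model', LinearRegression())])
tscv = TimeSeriesSplit(n_splits=5)

# Every test fold starts after its training fold ends.
ordered = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in tscv.split(games)
)

scores = cross_val_score(
    pipe, games[['attempts']], games['passing_yards'], cv=tscv, scoring='r2'
)
print(ordered, len(scores))
```

On the real data the same pattern would apply after sorting df by season and week. The early folds train on fewer rows, so expect their scores to be noisier.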

Try it

Take the model from Level 4, Lesson 4. Wrap it in a Pipeline with StandardScaler. Run 5-fold cross-validation. Report the mean R² and standard deviation. Compare to the single-split R² from Lesson 4: is it within one standard deviation of the cross-validated mean?

Common mistakes

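  • Scaling before train_test_split. The scaler's mean and standard deviation then include test rows, which leaks information into training. Put the scaler inside a Pipeline and it is fit on training rows only.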
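  • Time-series leakage with cross_val_score. The default KFold ignores time order, so a model can be trained on later games and tested on earlier ones. For game-by-game data, sort the rows chronologically and use TimeSeriesSplit so test folds always come after train folds.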
  • One-hot encoding without handle_unknown='ignore'. Test set has a category the train set didn’t? Default behavior errors out. Set handle_unknown='ignore' and you get a row of zeros.
  • Cherry-picking the best fold. Report mean ± std, not the best single score.
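The one-hot encoding mistake in the list above is easy to reproduce. This sketch uses made-up roof values, with 'retractable' as a hypothetical category that never appears in the training rows, and contrasts the default encoder with handle_unknown='ignore'.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'roof': ['dome', 'outdoors', 'dome']})
test = pd.DataFrame({'roof': ['retractable']})  # hypothetical unseen category

# Default encoder (handle_unknown='error') rejects the unseen category.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown='ignore' the unseen category becomes all zeros.
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
unknown_row = lenient.transform(test).toarray()
print(strict_failed, unknown_row)
```

A row of zeros means the model treats the unseen category as "none of the known ones", which is usually a safer failure mode than a crash at prediction time.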

Quick check

  1. What problem does a Pipeline solve that you’d hit doing scaling manually?
  2. When should you use TimeSeriesSplit instead of regular KFold?
  3. How do you interpret R² = 0.50 ± 0.18?