
Pipeline + cross-validation

Level 4 · Challenge 4
Starter

Prompt

Extend Challenge 3:

  1. Add position_group as a categorical feature.
  2. Wrap the preprocessing + model in a single Pipeline using ColumnTransformer.
  3. Run 5-fold cross-validation.
  4. Compare two final models — LinearRegression and RandomForestRegressor(n_estimators=200).

Report mean R² ± std for both. Comment on whether the extra complexity of the RF is justified by the gain.
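As a small illustration of step 3: for a regressor, scikit-learn documents that an integer `cv` is treated as an unshuffled `KFold`. The sketch below uses a toy 10-row array (a made-up stand-in, not the challenge data) to show how the rows are split into five contiguous test blocks.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for the feature matrix: 10 rows, 1 column (illustrative only).
X_toy = np.arange(10).reshape(-1, 1)

# Unshuffled 5-fold split, which is what an integer cv=5 resolves to for regressors.
cv = KFold(n_splits=5)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in cv.split(X_toy)
]

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train on {len(train_idx)} rows, test on rows {test_idx}")
```

Each row lands in exactly one test fold, so the five R² values come from disjoint held-out sets.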

Expected output

Linear R² = 0.0XX ± 0.0XX
RandomForest R² = 0.0XX ± 0.0XX

Plus your read on whether the RF earns its complexity.

Hint
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['prev_yards', 'prev_targets']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['position_group']),
])
pipe = Pipeline([('prep', preprocess), ('model', LinearRegression())])

Same SQL as Challenge 3, just expand the SELECT to include position_group and broaden the position filter to IN ('WR', 'TE').

Solution
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sqlalchemy import create_engine
eng = create_engine("postgresql+psycopg://onepride:lions@localhost:5432/onepride")
df = pd.read_sql(
    """
    SELECT
        player_display_name, recent_team, season, week,
        receiving_yards, position_group,
        LAG(receiving_yards) OVER w AS prev_yards,
        LAG(targets) OVER w AS prev_targets
    FROM weekly_stats
    WHERE season BETWEEN 2021 AND 2024
      AND season_type = 'REG'
      AND position_group IN ('WR', 'TE')
      AND targets > 0
    WINDOW w AS (PARTITION BY player_display_name, season ORDER BY week)
    """,
    eng,
)
df = df.dropna(subset=['prev_yards', 'prev_targets'])
X = df[['prev_yards', 'prev_targets', 'position_group']]
y = df['receiving_yards']
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['prev_yards', 'prev_targets']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['position_group']),
])
for name, est in [
    ('Linear', LinearRegression()),
    ('RandomForest', RandomForestRegressor(n_estimators=200, random_state=42)),
]:
    pipe = Pipeline([('prep', preprocess), ('model', est)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
    print(f"{name:13s} R² = {scores.mean():.3f} ± {scores.std():.3f}")

If RF beats Linear by less than one standard deviation of the fold scores, the gain is within cross-validation noise and the simpler model wins: fewer hyperparameters, faster to fit, and more interpretable.
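The rule of thumb above can be sketched as a small helper. Everything here is illustrative: `rf_earns_complexity` is a hypothetical name, the fold scores are made-up numbers rather than results from this dataset, and using the larger of the two standard deviations as the noise estimate is an assumption (the text does not say which one to use).

```python
from statistics import mean, pstdev

def rf_earns_complexity(linear_scores, rf_scores):
    """Return (justified, gain, noise) for two lists of per-fold R² scores.

    Assumption: noise is the larger of the two population standard
    deviations (pstdev matches NumPy's default scores.std()).
    """
    gain = mean(rf_scores) - mean(linear_scores)
    noise = max(pstdev(linear_scores), pstdev(rf_scores))
    return gain > noise, gain, noise

# Made-up fold scores for illustration only; not output from the challenge.
linear_scores = [0.10, 0.12, 0.08, 0.11, 0.09]
rf_scores = [0.11, 0.13, 0.07, 0.12, 0.10]

justified, gain, noise = rf_earns_complexity(linear_scores, rf_scores)
print(f"gain = {gain:.3f}, noise = {noise:.3f}, RF justified: {justified}")
```

In practice you would pass the two `scores` arrays from the loop in the solution (for example via `scores.tolist()`) instead of the placeholder lists.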