Prompt
Extend Challenge 3:
- Add position_group as a categorical feature.
- Wrap the preprocessing + model in a single Pipeline using ColumnTransformer.
- Run 5-fold cross-validation.
- Compare two final models: LinearRegression and RandomForestRegressor(n_estimators=200).
Report mean R² ± std for both. Comment on whether the extra complexity of the RF is justified by the gain.
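To make "5-fold cross-validation" concrete, here is a minimal pure-Python sketch of how rows can be split into five contiguous, unshuffled folds. The helper name kfold_indices is hypothetical. The split rule (the first n % k folds get one extra row) is an assumption meant to mirror scikit-learn's default KFold behavior, which is what cross_val_score should fall back to for a regressor when cv is an integer; check the documentation for your installed version.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # Spread the remainder over the first `extra` folds.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Toy example: 12 rows split into 5 folds.
folds = list(kfold_indices(12, n_splits=5))
fold_sizes = [len(test) for _, test in folds]
```

Each row lands in exactly one test fold, so every model is scored on data it did not see during fitting. Because the folds are contiguous, the row order of the query result affects which rows end up together; passing KFold(n_splits=5, shuffle=True, random_state=42) as cv is one way to remove that dependence.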
Expected output
    Linear        R² = 0.0XX ± 0.0XX
    RandomForest  R² = 0.0XX ± 0.0XX

Plus your read on whether the RF earns its complexity.
Hint
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    preprocess = ColumnTransformer([
        ('num', StandardScaler(), ['prev_yards', 'prev_targets']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['position_group']),
    ])
    pipe = Pipeline([('prep', preprocess), ('model', LinearRegression())])

Same SQL as Challenge 3, just expand the SELECT to include position_group and broaden the position filter to IN ('WR', 'TE').
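The handle_unknown='ignore' setting in the hint matters during cross-validation: a category that is absent from a training fold can still show up in the test fold. The sketch below is a pure-Python illustration of that behavior, not scikit-learn's implementation. The helpers fit_categories and one_hot are hypothetical names, and 'RB' is an illustrative unseen value.

```python
def fit_categories(values):
    """Return the sorted distinct categories seen during fitting."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value as a 0/1 list; an unseen value becomes all zeros."""
    return [1 if value == cat else 0 for cat in categories]


# Categories learned from a (toy) training fold.
categories = fit_categories(["WR", "TE", "WR", "WR"])

encoded_wr = one_hot("WR", categories)  # known category
encoded_rb = one_hot("RB", categories)  # never seen during fitting
```

With only WR and TE in the data, an unseen category should not occur here, so the setting acts mainly as a safeguard if the position filter is broadened later.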
Solution
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sqlalchemy import create_engine

    eng = create_engine("postgresql+psycopg://onepride:lions@localhost:5432/onepride")

    # Same query as Challenge 3, plus position_group and the wider WR/TE filter.
    df = pd.read_sql(
        """
        SELECT player_display_name, recent_team, season, week,
               receiving_yards, position_group,
               LAG(receiving_yards) OVER w AS prev_yards,
               LAG(targets)         OVER w AS prev_targets
        FROM weekly_stats
        WHERE season BETWEEN 2021 AND 2024
          AND season_type = 'REG'
          AND position_group IN ('WR', 'TE')
          AND targets > 0
        WINDOW w AS (PARTITION BY player_display_name, season ORDER BY week)
        """,
        eng,
    )
    # The first game of each player-season has no previous week, so LAG is NULL.
    df = df.dropna(subset=['prev_yards', 'prev_targets'])

    X = df[['prev_yards', 'prev_targets', 'position_group']]
    y = df['receiving_yards']

    # Scale the numeric lag features; one-hot encode the position group.
    preprocess = ColumnTransformer([
        ('num', StandardScaler(), ['prev_yards', 'prev_targets']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['position_group']),
    ])

    # Same preprocessing for both models, so only the estimator differs.
    for name, est in [
        ('Linear', LinearRegression()),
        ('RandomForest', RandomForestRegressor(n_estimators=200, random_state=42)),
    ]:
        pipe = Pipeline([('prep', preprocess), ('model', est)])
        scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
        print(f"{name:13s} R² = {scores.mean():.3f} ± {scores.std():.3f}")

If RF beats Linear by less than one standard deviation of the fold scores, the simpler model wins: fewer hyperparameters, faster, more interpretable.
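The one-standard-deviation rule can be written down as a small check. This is a sketch under stated assumptions, not a formal significance test: the helper name rf_gain_is_meaningful is hypothetical, the yardstick is taken to be the larger of the two fold-score spreads (the text does not say which one to use), and the fold scores below are made up purely for illustration.

```python
from statistics import mean, pstdev


def rf_gain_is_meaningful(linear_scores, rf_scores):
    """Return (gain, threshold, verdict) for the one-standard-deviation rule."""
    gain = mean(rf_scores) - mean(linear_scores)
    # Assumption: compare against the larger of the two spreads.
    # pstdev is the population std (ddof=0), matching NumPy's default .std().
    threshold = max(pstdev(linear_scores), pstdev(rf_scores))
    return gain, threshold, gain > threshold


# Made-up fold scores for illustration only; these are not real results.
linear_scores = [0.10, 0.12, 0.08, 0.11, 0.09]
rf_scores = [0.11, 0.13, 0.08, 0.12, 0.10]

gain, threshold, verdict = rf_gain_is_meaningful(linear_scores, rf_scores)
print(f"gain = {gain:.3f}, threshold = {threshold:.3f}, RF justified: {verdict}")
```

In practice you would pass the two scores arrays returned by cross_val_score. With only five folds the spread is itself a noisy estimate, so treat the verdict as a rule of thumb.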