Prompt
Train a RandomForestRegressor to predict per-game receiving_yards for
all NFL receivers (WR + TE) in the 2021-2024 regular seasons. Use these
features:
prev_yards (LAG receiving yards)
prev_targets (LAG targets)
prev_catches (LAG receptions)
position_group (categorical)
is_home (joined from schedules)
week (numeric)
After fitting, return a table of feature importances sorted descending. Plot them as a horizontal bar chart in Lions blue.
Expected output
A (feature, importance) table plus a horizontal bar chart with title,
labels, and source caption.
Hint
RandomForestRegressor exposes feature_importances_ after fitting. Be
aware: for one-hot encoded categoricals, you get one importance per
encoded column, not per original feature. To group them, sum across the
encoded columns, or use permutation_importance on the whole pipeline,
which scores each original column directly and is less biased toward
features with many distinct values.
Solution
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sqlalchemy import create_engine

eng = create_engine("postgresql+psycopg://onepride:lions@localhost:5432/onepride")
df = pd.read_sql(
    """
    WITH lagged AS (
        SELECT
            ws.player_display_name,
            ws.recent_team,
            ws.season,
            ws.week,
            ws.receiving_yards,
            ws.position_group,
            LAG(ws.receiving_yards) OVER w AS prev_yards,
            LAG(ws.targets) OVER w AS prev_targets,
            LAG(ws.receptions) OVER w AS prev_catches
        FROM weekly_stats ws
        WHERE ws.season BETWEEN 2021 AND 2024
          AND ws.season_type = 'REG'
          AND ws.position_group IN ('WR', 'TE')
          AND ws.targets > 0
        WINDOW w AS (
            PARTITION BY ws.player_display_name, ws.season
            ORDER BY ws.week
        )
    )
    SELECT
        l.*,
        (CASE WHEN sc.home_team = l.recent_team THEN 1 ELSE 0 END)::int AS is_home
    FROM lagged l
    JOIN schedules sc
      ON sc.season = l.season
     AND sc.week = l.week
     AND (sc.home_team = l.recent_team OR sc.away_team = l.recent_team)
    """,
    eng,
).dropna(subset=['prev_yards', 'prev_targets', 'prev_catches'])

num_cols = ['prev_yards', 'prev_targets', 'prev_catches', 'week', 'is_home']
cat_cols = ['position_group']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])

pipe = Pipeline([
    ('prep', preprocess),
    ('model', RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)),
])

X = df[num_cols + cat_cols]
y = df['receiving_yards']
pipe.fit(X, y)

# Extract feature names + importances
feat_names = (
    num_cols
    + list(
        pipe.named_steps['prep']
        .named_transformers_['cat']
        .get_feature_names_out(cat_cols)
    )
)
importances = pipe.named_steps['model'].feature_importances_

# Sort descending so the printed table matches what the prompt asks for.
imp_df = (
    pd.DataFrame({'feature': feat_names, 'importance': importances})
    .sort_values('importance', ascending=False)
)

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(imp_df['feature'], imp_df['importance'], color='#0076B6')
ax.invert_yaxis()  # barh draws the first row at the bottom; flip so the largest bar is on top
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Random forest feature importance: predicting weekly receiving yards')
ax.set_xlabel('Importance')
ax.set_ylabel('Feature')
fig.text(0.99, 0.01, 'Source: nflverse', ha='right', fontsize=8, color='gray')
plt.tight_layout()
plt.savefig('feature-importance.png', dpi=150)
print(imp_df.to_string(index=False))

Expect prev_targets and prev_yards to dominate, with is_home and week
contributing only a small amount. The one-hot position columns barely
register: once the model has volume features, they already capture most
of what position would tell it.
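One way to cross-check that ranking is the permutation_importance route mentioned in the hint, scored on held-out rows. The sketch below is illustrative only: the `toy` frame is synthetic, its yardage depends on `prev_targets` alone by construction, and the 75/25 split and `n_repeats=5` are arbitrary choices rather than part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (an assumption for illustration, not real NFL stats).
rng = np.random.default_rng(1)
n = 800
toy = pd.DataFrame({
    "prev_targets": rng.integers(1, 15, size=n),
    "week": rng.integers(2, 19, size=n),
    "is_home": rng.integers(0, 2, size=n),
    "position_group": rng.choice(["WR", "TE"], size=n),
})
toy["receiving_yards"] = 8.0 * toy["prev_targets"] + rng.normal(0, 10, size=n)

features = ["prev_targets", "week", "is_home", "position_group"]
X_train, X_test, y_train, y_test = train_test_split(
    toy[features], toy["receiving_yards"], test_size=0.25, random_state=1
)

toy_pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["position_group"])],
        remainder="passthrough",
    )),
    ("model", RandomForestRegressor(n_estimators=50, random_state=1)),
])
toy_pipe.fit(X_train, y_train)

# Shuffle one raw column at a time and measure the drop in held-out R^2.
# Permuting the pipeline's input means position_group gets a single score
# instead of one per one-hot column.
result = permutation_importance(
    toy_pipe, X_test, y_test, n_repeats=5, random_state=1
)
perm_df = (
    pd.DataFrame({
        "feature": features,
        "importance": result.importances_mean,
        "std": result.importances_std,
    })
    .sort_values("importance", ascending=False)
)
print(perm_df.to_string(index=False))
```

Features the model does not rely on should land near zero here, and can come out slightly negative from shuffling noise alone.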