scikit-learn basics
Hook
A GM doesn’t just describe last season — they predict next season. scikit-learn
is the tool. Three methods (fit, predict, score) handle almost everything;
the rest is feature engineering and not lying to yourself about how well it
works.
Concept
Every sklearn model has the same shape:

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    r2 = model.score(X_test, y_test)

- X is a 2-D array (rows = samples, columns = features).
- y is a 1-D array of targets (what you’re predicting).
- fit(X, y) learns from training data.
- predict(X) outputs predictions for new data.
- score(X, y) returns a default metric (R² for regression, accuracy for classification).
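To see the fit / predict / score shape run end to end without a database, here is a minimal sketch on synthetic data. The data-generating rule (y = 3·x1 − 2·x2 plus noise), the sample count, and the seeds are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data, invented for this sketch: y = 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # 2-D: 200 samples, 2 features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)  # 1-D targets

# Hold out 20% of the rows so scoring uses data the model never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)          # learn coefficients from the training rows
predictions = model.predict(X_test)  # one prediction per test row
r2 = model.score(X_test, y_test)     # R² on the held-out rows

print("train shape:", X_train.shape, "test shape:", X_test.shape)
print("predictions shape:", predictions.shape)
print(f"R² = {r2:.3f}")
print("coefficients:", model.coef_)
```

Because the target really is a linear function of the features here, the fitted coefficients should land close to 3 and −2. Real data rarely cooperates this neatly.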
The model menu, starting points:
| Problem | Default first model |
|---|---|
| Regression (predict a number) | LinearRegression, then RandomForestRegressor |
| Classification (predict a category) | LogisticRegression, then RandomForestClassifier |
| When in doubt | Try both linear and tree-based; compare |
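The "try both and compare" row can be sketched as a loop over candidate models scored on the same split. The target below is synthetic and deliberately nonlinear (an assumption chosen so the two model families disagree); the dictionary keys and hyperparameters are illustrative, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic nonlinear target, invented for this sketch
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same split, same metric, different model families
candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    scores[name] = estimator.score(X_test, y_test)  # R² on held-out rows
    print(f"{name}: R² = {scores[name]:.3f}")
```

On data like this the forest should win clearly, because a straight line cannot follow the curve. On a target that is close to linear, the ranking can flip, which is the reason to compare rather than assume.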
Lions example
Predict QB passing yards in a game from pass attempts and completions. (Toy example: the real version of this needs more features.)

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sqlalchemy import create_engine

    eng = create_engine("postgresql+psycopg://onepride:lions@localhost:5432/onepride")

    # One row per QB game, limited to games with at least 15 attempts
    df = pd.read_sql(
        """
        SELECT player_display_name, season, week,
               passing_yards, attempts, completions
        FROM weekly_stats
        WHERE position = 'QB'
          AND season BETWEEN 2021 AND 2024
          AND season_type = 'REG'
          AND attempts >= 15
        """,
        eng,
    )

    X = df[['attempts', 'completions']].to_numpy()
    y = df['passing_yards'].to_numpy()

    # Hold out 20% of games for testing; fixed seed keeps the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LinearRegression()
    model.fit(X_train, y_train)
    print(f"R² = {model.score(X_test, y_test):.3f}")
    print(f"Coefficients: {dict(zip(['attempts', 'completions'], model.coef_))}")

You’ll get an R² in the 0.8+ range. Completions and attempts predict yards well because they’re mechanically tied: every completion feeds directly into the yardage total. That’s a high R² for the wrong reason. It’s a target-leakage near-miss, since completions is observed at the same time as passing_yards, so you would never have it in hand before the game you want to predict. Lesson 5 covers leakage formally; for now, just notice that this number looks too clean. Real predictive problems are much harder.
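One common way around the same-game problem is to build features only from games that were already played. The sketch below does that with a per-player shift in pandas. The frame is hand-made with invented numbers and placeholder player names ("QB A", "QB B"), and the prev_* column names are hypothetical. This is not necessarily how later lessons handle it.

```python
import pandas as pd

# Tiny hand-made frame; every value here is invented, not real stats
df = pd.DataFrame({
    "player_display_name": ["QB A"] * 3 + ["QB B"] * 3,
    "season": [2024] * 6,
    "week": [1, 2, 3, 1, 2, 3],
    "attempts": [30, 35, 28, 40, 22, 33],
    "completions": [20, 24, 17, 27, 13, 21],
    "passing_yards": [240, 290, 180, 310, 150, 250],
})

# Sort so that shift(1) really means "this player's previous game"
df = df.sort_values(["player_display_name", "season", "week"])

# Shift within each player so one QB's stats never spill into another's rows
by_player = df.groupby("player_display_name")
df["prev_attempts"] = by_player["attempts"].shift(1)
df["prev_completions"] = by_player["completions"].shift(1)

# A player's first game has no previous game, so those rows are dropped
lagged = df.dropna(subset=["prev_attempts", "prev_completions"])

# Features come from the prior game; the target is the current game
X = lagged[["prev_attempts", "prev_completions"]].to_numpy()
y = lagged["passing_yards"].to_numpy()

print(lagged[["player_display_name", "week", "prev_attempts",
              "prev_completions", "passing_yards"]])
```

Expect a much lower R² from features like these than from same-game completions. That drop is the honest price of only using information available before kickoff.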
Try it
Predict rushing yards per game for NFL running backs from carries, using
the 2024 regular season only. Use a LinearRegression. Report R² on the test
set, plus the coefficient on carries. Interpret the coefficient: what does
it physically mean?
Common mistakes
- Fitting and scoring on the same data. Always split train/test. A model that perfectly predicts its training data tells you nothing about how it will do on new data.
- Not setting random_state. Train/test splits should be reproducible.
- R² on classification. R² is for regression. For classification, use accuracy, precision, recall, or F1.
- Forgetting feature scaling. Regularized linear models (Ridge, Lasso, and LogisticRegression with its default L2 penalty), SVMs, and KNN are scale-sensitive, so use StandardScaler before fitting. Plain LinearRegression predictions don’t change under rescaling, though its coefficients do. Tree-based models don’t care.
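The scaling point is easiest to see with KNN, which relies on raw distances. This is a minimal sketch on synthetic data: one informative feature on a scale of about 1 and one pure-noise feature on a scale of about 1000 (both invented for illustration). Wrapping StandardScaler and the model in a pipeline also means the scaler is fit on the training rows only.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for this sketch
rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)               # informative feature, scale ~1
noise = rng.normal(scale=1000.0, size=n)  # uninformative feature, scale ~1000
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)              # the class depends only on `signal`

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Unscaled: distances are dominated by the large-scale noise column
raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Scaled: the pipeline fits StandardScaler on the training rows, then KNN
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled.fit(X_train, y_train)

raw_acc = raw.score(X_test, y_test)        # score() is accuracy for classifiers
scaled_acc = scaled.score(X_test, y_test)
scaled_f1 = f1_score(y_test, scaled.predict(X_test))

print(f"KNN without scaling: accuracy = {raw_acc:.3f}")
print(f"KNN with scaling:    accuracy = {scaled_acc:.3f}, F1 = {scaled_f1:.3f}")
```

The unscaled model should hover near coin-flip accuracy because the noise column swamps the distance calculation. Swapping KNN for a RandomForestClassifier in the same comparison would show little difference between the two versions.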
Quick check
- What do fit, predict, and score do?
- Why do you need separate train and test sets?
- When is feature scaling necessary?