SQL is great for pulling the data. Charts, models, and reproducible analysis live
in Python. pandas is the bridge: it gives you SQL-shaped operations as Python
methods on a DataFrame.
Concept
A DataFrame is a table — rows, columns, a header, an index. The four operations
that map cleanly from SQL:
SQL
pandas
SELECT a, b FROM t
df[['a', 'b']]
WHERE a > 100
df[df['a'] > 100] (boolean mask)
ORDER BY a DESC
df.sort_values('a', ascending=False)
LIMIT 10
df.head(10)
You can pull straight from Postgres using read_sql, then chain pandas methods:
import pandas as pd
from sqlalchemy import create_engine
eng =create_engine("postgresql+psycopg://onepride:lions@localhost:5432/onepride")
raw[...] is a boolean mask. .loc[:, [...]] picks columns. Method chains read
top to bottom like a query plan.
Try it
Using read_sql and pandas (no WHERE clause in the SQL), pull every 2024
regular-season Lions row into df, then find the single highest fantasy
points game using only pandas methods.
Common mistakes
df[col > 100] instead of df[df[col] > 100]. The first compares a
string to a number. The second creates a boolean mask of the right length.
Forgetting .copy() when slicing for later mutation. If you plan to add
columns to a filtered subset, do subset = df[mask].copy() to silence the
SettingWithCopyWarning.
== vs =. SQL uses =; pandas uses Python’s ==. Mix them up and
you’ll get assignments where you wanted comparisons.
String comparisons.df['recent_team'] == 'DET' works. 'DET' in df['recent_team'] looks like English but is wrong. Use .isin(['DET']) for
multi-value matches.
Quick check
What does df.head(10) correspond to in SQL?
How do you express WHERE recent_team IN ('DET', 'GB') in pandas?
Given a DataFrame df with a rushing_yards column, write the pandas
equivalent of SELECT * FROM ... WHERE rushing_yards >= 100 ORDER BY rushing_yards DESC LIMIT 5.