Skip to content

pandas merge and groupby

Level 2 · Lesson 5

Hook

Everything you just learned in SQL has a pandas twin. pd.merge is JOIN. df.groupby(...).agg(...) is GROUP BY. Knowing both gives you flexibility: heavy aggregation in SQL (database is faster), shaping for charts in pandas.

Concept

The translation table:

SQLpandas
INNER JOINpd.merge(a, b, on='key', how='inner')
LEFT JOINpd.merge(a, b, on='key', how='left')
GROUP BY col SELECT agg(...)df.groupby('col').agg(...)
HAVING agg(...) > Ndf.groupby('col').filter(lambda g: g[col].sum() > N)

Aggregation in pandas uses .agg() with a dict mapping output names to functions:

totals = (
weekly.groupby('player_display_name', as_index=False)
.agg(games=('week', 'count'),
yards=('rushing_yards', 'sum'),
tds=('rushing_tds', 'sum'))
.sort_values('yards', ascending=False)
)

as_index=False keeps the group column as a column (instead of an index), which makes downstream work — sorting, filtering, plotting — feel like SQL.

Lions example

The Level 2 Lesson 1 join (Lions stats + game scores) in pandas:

import pandas as pd
from sqlalchemy import create_engine
eng = create_engine("postgresql+psycopg://onepride:lions@localhost:5432/onepride")
stats = pd.read_sql(
"SELECT * FROM weekly_stats WHERE recent_team = 'DET' AND season = 2024",
eng,
)
games = pd.read_sql(
"""SELECT season, week, home_team, away_team, home_score, away_score
FROM schedules WHERE season = 2024 AND game_type = 'REG'""",
eng,
)
# Build a Lions-side game frame
games['det_pts'] = games.apply(
lambda r: r['home_score'] if r['home_team'] == 'DET' else r['away_score'],
axis=1,
)
games['opp_pts'] = games.apply(
lambda r: r['away_score'] if r['home_team'] == 'DET' else r['home_score'],
axis=1,
)
games = games[(games['home_team'] == 'DET') | (games['away_team'] == 'DET')]
# Tag the games frame with the team so the merge is self-correcting if you
# ever pull non-DET stats into the same DataFrame.
games['recent_team'] = 'DET'
games = games[['season', 'week', 'recent_team', 'det_pts', 'opp_pts']]
merged = pd.merge(
stats, games, on=['season', 'week', 'recent_team'], how='inner',
)
# Sum receiving yards per player, with game context preserved
totals = (
merged.groupby('player_display_name', as_index=False)
.agg(games=('week', 'count'),
rec_yards=('receiving_yards', 'sum'),
points_for=('det_pts', 'mean'))
.sort_values('rec_yards', ascending=False)
)

Same shape as the SQL version — the database can do the work too. In practice you’ll lean on whichever tool feels more natural for the analysis you’re running.

Try it

Using pandas (no SQL aggregation), produce a DataFrame of Lions 2024 regular- season receivers with: name, games played (rows with non-null receptions), total catches, total yards, total TDs, yards per catch. Sort by yards descending.

Lean on groupby('player_display_name').agg(...) and add the yards-per-catch column with .assign(ypc=lambda d: d['yards'] / d['catches']).

Common mistakes

  • Forgetting as_index=False. Without it, the group column becomes the DataFrame index, and you can’t merge or sort on it without .reset_index().
  • .agg('sum') on a non-numeric column. Filter the input frame to numeric columns first, or use the dict-of-tuples form (agg(yards=('rushing_yards', 'sum'))) to be explicit.
  • pd.merge without on=. It guesses based on shared column names, which is fine until two tables share more columns than you intended.
  • Chained inplace operations. df.merge(...).fillna(0, inplace=True) modifies a temporary, not df. Skip inplace=True and rebind: df = df.merge(...).fillna(0).

Quick check

  1. How do pd.merge(a, b, how='inner') and pd.merge(a, b, how='left') differ?
  2. What does as_index=False do on groupby?
  3. Translate this SQL to pandas: SELECT recent_team, SUM(rushing_yards) FROM weekly_stats GROUP BY recent_team HAVING SUM(rushing_yards) > 1000.