ETL orchestration

Level 5 · Lesson 3

Hook

The data layer is only useful if it stays fresh. A scheduled ETL job pulls nflverse on a cadence, writes the new rows to Postgres, and tells you when something breaks. Build it once and don’t think about it again.

Concept

ETL = Extract, Transform, Load. The loader you’ve been running from the CLI is already most of an ETL — load.py extracts (pulls nflverse), transforms (picks columns, coerces types), and loads (writes to Postgres). What it lacks for production:

A schedule. Cron, GitHub Actions, or a cloud scheduler.
Idempotency. Re-running should not duplicate rows. (Yours already handles this via DELETE-by-year.)
Failure alerting. Something should yell if the job exits non-zero.
A status table. “Last successful run was at X. Loaded Y rows.” Lets the API surface freshness.

The simplest production setup, in increasing order of magic:

Option	Pros	Cons
Cron on a laptop	free, simple	only runs when the laptop’s on
GitHub Actions cron	free, runs in cloud	5-min minimum schedule, public logs
Cloud scheduler (Vercel, Render, etc.)	tied to your deploy	usually paid past a free tier
Airflow / Dagster	full DAGs, deps	heavy for one job

For 1PRIDE, GitHub Actions is the right answer.

Lions example

A GitHub Actions workflow that runs every Tuesday at 6am UTC (Monday night games are in by then):

name: refresh-data
on:
  schedule:
    - cron: '0 6 * * 2'   # Tuesdays 06:00 UTC
  workflow_dispatch:       # also runnable on-demand

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install uv
      - run: uv sync --extra api
        working-directory: data
      - run: uv run python -m onepride_data.load --years 2024 --tables all
        working-directory: data
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}

Two things to do once:

In GitHub repo settings: add DATABASE_URL as a secret pointing at your prod Postgres (Neon connection string).
In Neon/Supabase: allowlist GitHub Actions IPs, or use a connection pooler URL.

Try it

Add a last_loaded table to the schema:

CREATE TABLE IF NOT EXISTS load_status (
    table_name   TEXT PRIMARY KEY,
    last_loaded  TIMESTAMP NOT NULL,
    rows_loaded  INT
);

Modify load.py so each load_* function writes a row to load_status after a successful insert. Add an endpoint to the FastAPI service:

@app.get("/api/status")
def status() -> dict:
    """Show when each table was last refreshed."""
    ...

Expose this in a small footer on the L5 app so users (and you) can see data freshness.

Common mistakes

No idempotency. Running the loader twice doubles your rows. Always delete-by-key (your loader does this) or use INSERT ... ON CONFLICT.
No alerting. A silent failure looks the same as success. Wire the workflow to send a Slack / email notification on failure.
Pulling all years on every run. Pull only the current season for weekly refreshes; backfill historical years once.
Long-running jobs in serverless. Vercel functions have a hard timeout. ETL jobs that take more than a minute belong in a real runner, not a serverless function.

Quick check

What’s the smallest change that makes a CLI loader “production”?
Why pull only the current season weekly, instead of all years?
What does idempotency mean in this context?