# PostgreSQL Slow Log Analysis Checklist

## Minimum Evidence

- PostgreSQL major/minor version.
- Incident start/end time and timezone.
- CPU graph, load average, I/O wait, memory, swap, and connection count for the same window.
- Slow logs with enough context before and after the spike.
- `pg_stat_statements` snapshot if installed. Column availability varies by PostgreSQL version; remove `wal_*` or `jit_*` columns if the server does not expose them:

```sql
SELECT queryid, calls, total_exec_time, mean_exec_time, max_exec_time,
       rows,
       round(rows::numeric / NULLIF(calls, 0), 2) AS rows_per_call,
       shared_blks_hit, shared_blks_read,
       round(shared_blks_hit::numeric / NULLIF(shared_blks_hit + shared_blks_read, 0), 4) AS shared_hit_ratio,
       shared_blks_dirtied, shared_blks_written,
       temp_blks_read, temp_blks_written,
       wal_records, wal_fpi, wal_bytes,
       jit_generation_time + jit_inlining_time + jit_optimization_time + jit_emission_time AS jit_time,
       left(query, 500) AS query_sample
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 30;
```

- Relation size and bloat indicators:

```sql
SELECT s.relname,
       pg_size_pretty(pg_total_relation_size(s.relid)) AS total_size,
       s.n_live_tup, s.n_dead_tup,
       round(s.n_dead_tup::numeric / NULLIF(s.n_live_tup + s.n_dead_tup, 0), 4) AS dead_tuple_ratio,
       s.last_vacuum, s.last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables s
ORDER BY pg_total_relation_size(s.relid) DESC
LIMIT 30;
```

## Read-Only Queries For Active Incidents

```sql
SELECT pid, usename, application_name, client_addr, state,
       wait_event_type, wait_event,
       now() - xact_start AS xact_age,
       now() - query_start AS query_age,
       left(query, 500) AS query_sample
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_age DESC
LIMIT 50;
```

```sql
SELECT wait_event_type, wait_event, state, count(*) AS sessions
FROM pg_stat_activity
GROUP BY wait_event_type, wait_event, state
ORDER BY sessions DESC;
```

```sql
SELECT state, count(*) AS sessions,
       count(*) FILTER (WHERE now() - xact_start > interval '5 minutes') AS old_transactions
FROM pg_stat_activity
GROUP BY state
ORDER BY sessions DESC;
```

```sql
SELECT blocked.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       left(blocked_activity.query, 300) AS blocked_query,
       left(blocking_activity.query, 300) AS blocking_query
FROM pg_locks blocked
JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked.pid
JOIN pg_locks blocking
  ON blocking.locktype = blocked.locktype
 AND blocking.database IS NOT DISTINCT FROM blocked.database
 AND blocking.relation IS NOT DISTINCT FROM blocked.relation
 AND blocking.page IS NOT DISTINCT FROM blocked.page
 AND blocking.tuple IS NOT DISTINCT FROM blocked.tuple
 AND blocking.virtualxid IS NOT DISTINCT FROM blocked.virtualxid
 AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
 AND blocking.classid IS NOT DISTINCT FROM blocked.classid
 AND blocking.objid IS NOT DISTINCT FROM blocked.objid
 AND blocking.objsubid IS NOT DISTINCT FROM blocked.objsubid
 AND blocking.pid <> blocked.pid
JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking.pid
WHERE NOT blocked.granted AND blocking.granted;
```

For PostgreSQL 16+, use `pg_stat_io` to separate CPU-bound query churn from I/O pressure. For older versions, inspect `pg_stat_bgwriter`, `pg_stat_wal` when available, checkpoint logs, OS I/O wait, and storage metrics.

## Diagnosis Heuristics

- High `calls` plus low mean time can still burn CPU. Fix with caching, batching, fewer round trips, or app-side deduplication.
- Estimate calls/sec from two `pg_stat_statements` snapshots around the incident. A query can dominate CPU through call rate even when mean time is small.
- Compare rows/call, shared block hit/read ratio, temp blocks, WAL bytes, JIT time, and parallel worker usage before deciding whether the pressure is CPU, I/O, memory spill, or write amplification.
- High mean/max time with low calls points to plan quality, missing index, stale stats, bloated tables, or large result sets.
- `temp_blks_written` or `temporary file` logs point to sort/hash spills. Inspect `work_mem`, query shape, indexes, and row estimates.
- Lock waits usually make CPU look high indirectly through blocked backlogs and retry storms. Identify blockers before tuning SQL.
- Many active sessions running the same query can be a connection pool or application retry problem, not only a query problem.
- Sequential scans are not automatically bad. Compare scanned rows, selected rows, table size, cache hit ratio, and whether the query is intentionally analytical.
- Avoid global config changes until query/index/app causes are ruled out.

## Safe Fix Patterns

- Add a targeted `CREATE INDEX CONCURRENTLY` only after sizing the table, checking write rate, confirming the predicate/join/order pattern, and defining monitoring plus rollback. Treat it as change-window work on large or busy tables.
- Rewrite non-sargable predicates into indexable forms.
- Limit returned columns and rows; remove accidental N+1 patterns.
- Batch writes and reads with bounded batch sizes.
- Run `ANALYZE` on tables with stale statistics.
- Tune one query before increasing server-wide memory or parallelism settings.

## Verification

- Compare p95/p99 latency, CPU, calls/sec, rows/call, buffers read, temp writes, and lock waits before and after.
- Re-run the exact fingerprint in `pg_stat_statements`.
- Confirm the new plan uses the intended index and does not regress another common predicate.
- Define rollback: drop concurrent index, revert query change, restore pool settings, or disable a feature flag.
