---
name: postgresql-log-analysis
description: PostgreSQL production slow-query and database log analysis for diagnosing high CPU, slow SQL, lock waits, temp-file spills, autovacuum pressure, checkpoint/I/O symptoms, and query-plan regressions. Use when asked to analyze PostgreSQL slow logs, pg logs, pg_stat_statements exports, EXPLAIN plans, CPU spikes, production database performance incidents, or SQL optimization work.
---

# PostgreSQL Log Analysis

## Operating Rules

Treat production data as sensitive. Ask the user to redact secrets, customer PII, tokens, hostnames, and literal business identifiers before sharing logs. Do not request direct production credentials. Prefer read-only evidence: PostgreSQL logs, `pg_stat_statements` exports, safe execution plans, schema/index DDL, table sizes, and PostgreSQL version.

Be explicit about `EXPLAIN ANALYZE` risk. `ANALYZE` executes the statement. On production, prefer replica or staging. For DML, use `EXPLAIN` without `ANALYZE` unless the user has a change window and rollback plan. For production SELECT analysis, require a low-traffic window, `statement_timeout`, read-only transaction when possible, expected row bounds, and a stop condition.

Never recommend risky production changes as a first step. Label every recommendation by risk:

- `safe-now`: observation, read-only query, adding missing evidence, timeout guard, or `EXPLAIN` without `ANALYZE`.
- `low-risk`: tuning a single statement, increasing targeted logging, lowering batch size after validation, or running `ANALYZE` on a clearly stale table during a safe window.
- `requires-change-window`: `EXPLAIN ANALYZE` for DML, `CREATE INDEX CONCURRENTLY` on large or busy tables, VACUUM FULL, REINDEX without CONCURRENTLY, major config changes, migration rewrites, partition changes, or app behavior changes.

## Workflow

1. Clarify the incident window, timezone, PostgreSQL version, workload type, and symptom: CPU saturation, latency spike, lock pileup, I/O wait, connection storm, or all of the above.
2. Collect evidence. Prefer:
   - Slow logs with `duration:` lines.
   - Error logs around the same timestamp.
   - `pg_stat_statements` top queries by total time, mean time, calls, rows, shared/local/temp block metrics.
   - Safe plans for top SQL fingerprints: start with `EXPLAIN (BUFFERS, VERBOSE)`; use `EXPLAIN (ANALYZE, BUFFERS, VERBOSE)` only under the production guardrails above.
   - Index definitions, table row estimates, `n_dead_tup`, autovacuum status, and table sizes.
3. Run `scripts/parse_postgres_slowlog.py` for text slow-log `duration:` lines when useful. It groups statements by normalized fingerprint and reports total time, mean time, max time, and calls. Raw SQL examples are hidden unless `--show-examples` is explicitly used after redaction.
4. Classify the root cause pattern before proposing fixes:
   - high call count cheap query causing CPU burn;
   - expensive sequential scan or bad join plan;
   - missing/unused index;
   - stale statistics or parameter-sensitive plan;
   - lock wait or transaction backlog;
   - temp-file sort/hash spill;
   - autovacuum or bloat pressure;
   - checkpoint/WAL/I/O amplification;
   - connection pool storm or too much parallelism.
5. Produce a concise incident report with findings, evidence, immediate mitigations, durable fixes, and verification steps.

## Output Format

Use this structure for production incidents. Respond in Chinese by default unless the user asks otherwise.

```markdown
## 结论
一句话说明最可能原因、影响面和置信度。

## 证据
- 时间窗口：
- Top SQL 指纹：
- CPU/日志关联：
- 缺失证据：

## 主要发现
1. 发现标题
   证据：
   原因分析（为什么会导致 CPU 或延迟升高）：
   风险：

## 建议动作
- safe-now:
- low-risk:
- requires-change-window:

## 需要补充的信息
- （列出当前缺少但对定论必要的证据，没有则省略此段）

## 验证方式
- 观察指标：
- 复查 SQL 或日志：
- 回滚或停止条件：
```

## Script Usage

Run the parser against one or more PostgreSQL text log files. Paths below assume you are running from the project root directory; adjust if running from elsewhere.

```bash
python3 .trae/skills/postgresql-log-analysis/scripts/parse_postgres_slowlog.py /path/to/postgresql.log --top 30
```

Use `--min-duration` to skip statements below a threshold (milliseconds) and reduce noise when `log_min_duration_statement` is set low:

```bash
python3 .trae/skills/postgresql-log-analysis/scripts/parse_postgres_slowlog.py /path/to/postgresql.log --top 30 --min-duration 100
```

For compressed logs, decompress outside the script first:

```bash
zcat postgresql.log.gz > /tmp/postgresql.log
python3 .trae/skills/postgresql-log-analysis/scripts/parse_postgres_slowlog.py /tmp/postgresql.log
```

The script is a triage helper, not a final diagnosis. Cross-check grouped fingerprints against `pg_stat_statements` and execution plans before recommending schema or config changes.

The script only parses text slow-log entries shaped like `duration: ... statement:` or prepared-statement `execute/parse/bind` lines. It does not parse CSV logs, JSON logs, syslog fields, `auto_explain` plans, temporary-file records, lock-wait records, checkpoint records, or autovacuum records. Interpret those with `references/postgresql-log-patterns.md` and database metrics.

To include raw SQL examples after logs have been redacted:

```bash
python3 .trae/skills/postgresql-log-analysis/scripts/parse_postgres_slowlog.py /tmp/postgresql.log --show-examples
```

## References

Read `references/analysis-checklist.md` when preparing an incident report or deciding what evidence to request.

Read `references/postgresql-log-patterns.md` when interpreting specific log signatures such as `duration`, `temporary file`, lock waits, checkpoints, autovacuum, and cancelled statements.
