When CloudWatch shows RDS CPU above 80% during peak traffic and application queries start timing out, the cause is almost always one of four things: a single high-frequency query doing a sequential scan, connection saturation forcing the database to context-switch across hundreds of idle connections, an autovacuum worker consuming CPU on a high-write table, or a missing or unused index after a recent schema change or minor version upgrade. This runbook isolates which one you are dealing with in under 10 minutes.
Step 1 — Find what is running right now
The fastest signal is pg_stat_activity. Run this the moment you notice
elevated CPU — active queries are already in flight:
SELECT pid, now() - query_start AS duration, state, wait_event_type, wait_event, left(query, 120) AS query FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - INTERVAL '3 seconds' ORDER BY query_start ASC;
Queries running for more than 3 seconds during peak traffic are usually the problem.
Focus on queries with wait_event_type = NULL — they are actively burning CPU,
not waiting on I/O or locks. Note the query text; you will need it for Step 3.
Step 2 — Check connection count
High connection counts cause CPU overhead even when individual queries are fast. PostgreSQL spawns a backend process per connection. At 200+ connections on a 4-core instance, context-switching alone adds meaningful latency:
SELECT state, count(*) AS total, count(*) FILTER (WHERE state = 'active') AS active, count(*) FILTER (WHERE state = 'idle') AS idle, count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_txn FROM pg_stat_activity WHERE backend_type = 'client backend' GROUP BY state ORDER BY total DESC;
If you have a high idle_in_txn count, those connections are holding locks
and preventing autovacuum. Terminating them is safe in most cases during an incident.
If idle is very high relative to your max_connections, you
need a connection pooler — but that is a post-incident fix, not an incident fix.
Step 3 — Check if autovacuum is consuming CPU
On high-write tables, autovacuum workers can consume significant CPU — especially after a write burst. Check for active vacuum workers:
SELECT pid, wait_event_type, wait_event, left(query, 80) AS query FROM pg_stat_activity WHERE backend_type = 'autovacuum worker';
If autovacuum workers appear here during a CPU incident, it is almost always because
a table accumulated millions of dead tuples during the traffic spike (high delete/update
rates) and autovacuum is now catching up. This will resolve naturally but you can
temporarily increase autovacuum_vacuum_cost_delay to throttle it if necessary.
Do not cancel autovacuum workers unless the table is known to be non-critical — cancelling
just delays the work and allows bloat to grow.
Step 4 — Identify the top CPU-consuming queries
If pg_stat_statements is enabled (it should be — see the
pg_stat_statements guide),
this query identifies which query shapes have consumed the most CPU since the last reset:
SELECT round(total_exec_time::numeric, 2) AS total_ms, calls, round(mean_exec_time::numeric, 2) AS mean_ms, round(stddev_exec_time::numeric, 2) AS stddev_ms, rows, left(query, 120) AS query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 15;
Sort by total_exec_time first, not mean_exec_time. A query
taking 1ms but called 500,000 times per hour is a far bigger CPU burden than a 200ms
query called 100 times. Look for queries where calls is extremely high
and mean_ms has crept up compared to normal — that's the sign of a
plan regression or a table that's grown past its index's efficiency point.
Step 5 — Check for sequential scans on large tables
A sequential scan on a table with tens of millions of rows will consume CPU for the entire duration of the scan. They show up in Step 1 as long-running active queries. Confirm by running EXPLAIN (without ANALYZE — never EXPLAIN ANALYZE on a production system during a CPU incident) on the query from Step 1:
EXPLAIN (FORMAT TEXT) /* paste the query from Step 1 here */;
If the plan shows Seq Scan on a table you would expect to use an index,
the index either doesn't exist, is not being used because statistics are stale, or was
recently dropped. Stale statistics after a data migration are a common cause of CPU spikes
— the planner thinks the table has 10,000 rows, not 10 million, and chooses a seq scan.
Immediate relief options (during the incident)
| Cause | Immediate action | Risk |
|---|---|---|
| Single high-frequency slow query | SELECT pg_cancel_backend(pid) — cancels the query, not the connection |
Low — the query will restart unless you address the root cause |
| Connection saturation | Terminate idle-in-transaction connections older than 5 min | Low — those sessions were already stalled |
| Missing index causing seq scan | CREATE INDEX CONCURRENTLY — safe on live tables |
Medium — adds I/O during creation; CPU may spike briefly |
| Stale statistics after migration | ANALYZE schema.table_name on the affected table |
Low — runs quickly on a single table |
| Autovacuum consuming CPU unexpectedly | Temporarily set autovacuum_vacuum_cost_delay = 20 in parameter group |
Low short-term — increases bloat risk if left too long |
Post-incident: root cause prevention
Once you are back to normal CPU levels, the real work starts. CPU spikes during peak traffic almost always have an underlying structural cause:
- If the culprit was a high-frequency query doing a seq scan: add the appropriate index.
Profile your index usage weekly using
pg_stat_user_indexes. - If connection saturation contributed: implement pgBouncer in transaction mode. See the connection pool exhaustion guide.
- If stale statistics caused a plan regression: set up
autovacuum_analyze_scale_factor = 0.01on high-write tables so ANALYZE runs more frequently. - If autovacuum workers are consuming CPU regularly during peak: recalibrate autovacuum cost parameters. See the autovacuum guide.
Enable log_min_duration_statement = 1000 (1 second) in your parameter group
so that slow queries during future incidents are automatically captured in CloudWatch
Logs — you won't need to be in psql when the next spike happens.
Had a high-CPU incident this week?
PGFlare's Emergency Response session is
designed exactly for this. We join your incident call, run the full diagnostic within minutes,
and deliver a root cause report with specific fixes — not a list of things to investigate.
Available within 2 hours during UK business hours.