Knowledge Base › P1 Runbook: High RDS CPU
Emergency 4 min read · Incident Response · AWS RDS

P1 Runbook: High RDS CPU During Peak Traffic

In an active incident? Skip to Step 1 below. Run the queries in order. The whole runbook takes under 10 minutes to execute.

When CloudWatch shows RDS CPU above 80% during peak traffic and application queries start timing out, the cause is almost always one of four things: a single high-frequency query doing a sequential scan, connection saturation forcing the database to context-switch across hundreds of idle connections, an autovacuum worker consuming CPU on a high-write table, or a missing or unused index after a recent schema change or minor version upgrade. This runbook isolates which one you are dealing with in under 10 minutes.

Step 1 — Find what is running right now

The fastest signal is pg_stat_activity. Run this the moment you notice elevated CPU — active queries are already in flight:

SELECT
  pid,
  now() - query_start        AS duration,
  state,
  wait_event_type,
  wait_event,
  left(query, 120)             AS query
FROM   pg_stat_activity
WHERE  state = 'active'
  AND  query_start < now() - INTERVAL '3 seconds'
ORDER BY query_start ASC;

Queries running for more than 3 seconds during peak traffic are usually the problem. Focus on queries with wait_event_type = NULL — they are actively burning CPU, not waiting on I/O or locks. Note the query text; you will need it for Step 3.

Step 2 — Check connection count

High connection counts cause CPU overhead even when individual queries are fast. PostgreSQL spawns a backend process per connection. At 200+ connections on a 4-core instance, context-switching alone adds meaningful latency:

SELECT
  state,
  count(*)                                                AS total,
  count(*) FILTER (WHERE state = 'active')               AS active,
  count(*) FILTER (WHERE state = 'idle')                 AS idle,
  count(*) FILTER (WHERE state = 'idle in transaction')  AS idle_in_txn
FROM   pg_stat_activity
WHERE  backend_type = 'client backend'
GROUP BY state
ORDER BY total DESC;

If you have a high idle_in_txn count, those connections are holding locks and preventing autovacuum. Terminating them is safe in most cases during an incident. If idle is very high relative to your max_connections, you need a connection pooler — but that is a post-incident fix, not an incident fix.

Step 3 — Check if autovacuum is consuming CPU

On high-write tables, autovacuum workers can consume significant CPU — especially after a write burst. Check for active vacuum workers:

SELECT
  pid,
  wait_event_type,
  wait_event,
  left(query, 80) AS query
FROM   pg_stat_activity
WHERE  backend_type = 'autovacuum worker';

If autovacuum workers appear here during a CPU incident, it is almost always because a table accumulated millions of dead tuples during the traffic spike (high delete/update rates) and autovacuum is now catching up. This will resolve naturally but you can temporarily increase autovacuum_vacuum_cost_delay to throttle it if necessary. Do not cancel autovacuum workers unless the table is known to be non-critical — cancelling just delays the work and allows bloat to grow.

Step 4 — Identify the top CPU-consuming queries

If pg_stat_statements is enabled (it should be — see the pg_stat_statements guide), this query identifies which query shapes have consumed the most CPU since the last reset:

SELECT
  round(total_exec_time::numeric, 2)     AS total_ms,
  calls,
  round(mean_exec_time::numeric, 2)      AS mean_ms,
  round(stddev_exec_time::numeric, 2)    AS stddev_ms,
  rows,
  left(query, 120)                        AS query
FROM   pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT  15;

Sort by total_exec_time first, not mean_exec_time. A query taking 1ms but called 500,000 times per hour is a far bigger CPU burden than a 200ms query called 100 times. Look for queries where calls is extremely high and mean_ms has crept up compared to normal — that's the sign of a plan regression or a table that's grown past its index's efficiency point.

Step 5 — Check for sequential scans on large tables

A sequential scan on a table with tens of millions of rows will consume CPU for the entire duration of the scan. They show up in Step 1 as long-running active queries. Confirm by running EXPLAIN (without ANALYZE — never EXPLAIN ANALYZE on a production system during a CPU incident) on the query from Step 1:

EXPLAIN (FORMAT TEXT)
  /* paste the query from Step 1 here */;

If the plan shows Seq Scan on a table you would expect to use an index, the index either doesn't exist, is not being used because statistics are stale, or was recently dropped. Stale statistics after a data migration are a common cause of CPU spikes — the planner thinks the table has 10,000 rows, not 10 million, and chooses a seq scan.

Never run EXPLAIN ANALYZE during an incident. EXPLAIN ANALYZE actually executes the query. On a table scan that's already contributing to CPU load, running EXPLAIN ANALYZE makes it worse. Use plain EXPLAIN during live incidents. Save ANALYZE for post-incident investigation on a read replica.

Immediate relief options (during the incident)

Cause Immediate action Risk
Single high-frequency slow query SELECT pg_cancel_backend(pid) — cancels the query, not the connection Low — the query will restart unless you address the root cause
Connection saturation Terminate idle-in-transaction connections older than 5 min Low — those sessions were already stalled
Missing index causing seq scan CREATE INDEX CONCURRENTLY — safe on live tables Medium — adds I/O during creation; CPU may spike briefly
Stale statistics after migration ANALYZE schema.table_name on the affected table Low — runs quickly on a single table
Autovacuum consuming CPU unexpectedly Temporarily set autovacuum_vacuum_cost_delay = 20 in parameter group Low short-term — increases bloat risk if left too long

Post-incident: root cause prevention

Once you are back to normal CPU levels, the real work starts. CPU spikes during peak traffic almost always have an underlying structural cause:

Enable log_min_duration_statement = 1000 (1 second) in your parameter group so that slow queries during future incidents are automatically captured in CloudWatch Logs — you won't need to be in psql when the next spike happens.

Had a high-CPU incident this week?
PGFlare's Emergency Response session is designed exactly for this. We join your incident call, run the full diagnostic within minutes, and deliver a root cause report with specific fixes — not a list of things to investigate. Available within 2 hours during UK business hours.

Emergency Response — £350/hr →