The default RDS monitoring setup that most teams inherit alarms on CPUUtilization
and not much else. It's a reasonable starting point. The problem is that CPU is a lagging
indicator for most PostgreSQL failure modes. By the time CPU hits 80%, queries have already
been timing out for several minutes. The metrics that matter are the ones that give you
signal before the incident is in progress — not after.
These are the five alarms we set on every RDS PostgreSQL instance we work with, and why each one matters more than CPU alone.
1. DatabaseConnections — the most important one you're probably not watching
Connection exhaustion is one of the most common causes of "database unavailable" pages
in production PostgreSQL. Unlike CPU spikes, which often self-correct, connection
exhaustion causes immediate application errors: every new connection attempt gets
rejected with FATAL: remaining connection slots are reserved for non-replication
superuser connections.
The alarm should trigger at 80% of max_connections, not when you're
already at 100% and application errors are happening. The threshold varies by instance
class because the RDS default for max_connections is derived from instance memory:
| Instance class | RAM | Default max_connections | Alarm threshold (80%) |
|---|---|---|---|
| db.t3.medium | 4 GB | ~170 | 136 |
| db.r5.large | 16 GB | ~435 | 348 |
| db.r5.xlarge | 32 GB | ~855 | 684 |
| db.r5.2xlarge | 64 GB | ~1700 | 1360 |
| db.r5.4xlarge | 128 GB | ~3400 | 2720 |
To check your current max_connections value and set the alarm threshold correctly:
    SHOW max_connections;
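The same arithmetic can be scripted once you have the value returned by SHOW max_connections. The following is a minimal sketch, not a definitive setup: `connection_alarm_kwargs` is a hypothetical helper, and the alarm name, 60-second period, and three evaluation periods are assumptions to adjust. It only builds keyword arguments in the shape that boto3's CloudWatch `put_metric_alarm` call accepts and does not contact AWS.

```python
def connection_alarm_kwargs(db_instance_id: str, max_connections: int,
                            percent: int = 80) -> dict:
    """Build put_metric_alarm keyword arguments for a DatabaseConnections alarm.

    Hypothetical helper: the name, period, and evaluation periods are
    illustrative choices, not values taken from the article.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    threshold = max_connections * percent // 100
    return {
        "AlarmName": f"rds-connections-high-{db_instance_id}",
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


kwargs = connection_alarm_kwargs("your-db-id", max_connections=855)
print(kwargs["Threshold"])  # 684.0
# To create the alarm (requires boto3 and AWS credentials):
#   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

Using the Maximum statistic here is a judgment call: it reacts to short connection bursts that an average over the period would smooth out.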
2. FreeStorageSpace — set it early, not at zero
A full RDS volume does not degrade gracefully. PostgreSQL stops accepting writes as soon as storage is exhausted, which means no new rows, no transaction commits, and no autovacuum (which also writes). The instance becomes effectively read-only and applications start failing hard.
Set the alarm at 20% of allocated storage remaining, not 5%. You need time to react — storage can fill faster than expected after a write surge, a bloated migration, or a large batch job. Twenty percent gives you hours to respond; 5% gives you minutes.
For a 500GB gp3 volume, the alarm threshold is 100GB free (i.e., 400GB used). The metric is in bytes, so:
    # 100GB in bytes = 107374182400
    # Alarm: FreeStorageSpace < 107374182400
    aws cloudwatch put-metric-alarm \
      --alarm-name rds-storage-low \
      --metric-name FreeStorageSpace \
      --namespace AWS/RDS \
      --dimensions Name=DBInstanceIdentifier,Value=your-db-id \
      --statistic Average \
      --period 300 \
      --threshold 107374182400 \
      --comparison-operator LessThanThreshold \
      --evaluation-periods 2 \
      --alarm-actions arn:aws:sns:eu-west-1:YOUR_ACCOUNT:your-alert-topic
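For other volume sizes, the byte threshold can be derived rather than typed by hand. This is a small sketch under one assumption: that "GB" means GiB (1024³ bytes), which is what the 107374182400 figure above implies. The function name is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def free_storage_threshold_bytes(allocated_gib: int, percent_free: int = 20) -> int:
    """Return the FreeStorageSpace alarm threshold in bytes."""
    if allocated_gib <= 0 or not 0 < percent_free < 100:
        raise ValueError("expected a positive size and a percentage between 0 and 100")
    return allocated_gib * GIB * percent_free // 100


print(free_storage_threshold_bytes(500))  # 107374182400
```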
3. ReadLatency / WriteLatency — the early warning for I/O problems
Elevated I/O latency shows up before CPU spikes when the root cause is storage-related.
A missing index causing a sequential scan on a large table will show up as sustained
high ReadLatency for minutes before CPU climbs. A write storm from an ORM
doing per-row inserts without batching will register in WriteLatency before
the queue depth saturates.
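For gp3 volumes, normal read/write latency for PostgreSQL should be below 2ms for random I/O. CloudWatch reports ReadLatency and WriteLatency in seconds, so a 5ms threshold is entered as 0.005. Alarm thresholds: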
| Metric | Warn threshold | Critical threshold | Notes |
|---|---|---|---|
| ReadLatency | 5ms | 20ms | Above 20ms consistently means sequential scan or I/O saturation |
| WriteLatency | 3ms | 10ms | Write latency is usually lower than read on gp3 with fsync |
| DiskQueueDepth | 1 | 5 | Any sustained queue depth above 1 means IOPS are saturated |
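A unit mismatch is an easy mistake here, because the latency thresholds are stated in milliseconds while CloudWatch publishes the two latency metrics in seconds. The sketch below converts the table's values and classifies a single datapoint. The function names are hypothetical, and treating a value exactly at the threshold as not breaching is an assumption.

```python
# Warn/critical thresholds in milliseconds, copied from the table above.
THRESHOLDS_MS = {
    "ReadLatency": (5, 20),
    "WriteLatency": (3, 10),
}


def thresholds_in_seconds(metric: str) -> tuple:
    """Convert a metric's (warn, critical) thresholds from ms to seconds."""
    warn_ms, critical_ms = THRESHOLDS_MS[metric]
    return warn_ms / 1000, critical_ms / 1000


def classify_latency(metric: str, value_seconds: float) -> str:
    """Return 'ok', 'warn', or 'critical' for one datapoint in seconds."""
    warn, critical = thresholds_in_seconds(metric)
    if value_seconds > critical:
        return "critical"
    if value_seconds > warn:
        return "warn"
    return "ok"


print(thresholds_in_seconds("ReadLatency"))     # (0.005, 0.02)
print(classify_latency("WriteLatency", 0.004))  # warn
```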
4. DBLoad — the best single metric if you have Performance Insights
DBLoad is published to CloudWatch when Performance Insights is enabled (Enhanced
Monitoring, by contrast, provides OS-level metrics). It measures the average number of
active sessions at any given moment, similar to load average on Linux but for database
sessions. A DBLoad greater than the number of vCPUs on your instance means more sessions
are active than the CPUs can serve at once, so some of them are waiting on CPU, I/O, or
locks rather than executing.
This is the metric that catches the scenario CPU doesn't: 200 connections, all
queued on lock waits, zero CPU, application timing out. CPU shows 5% and nothing
looks wrong — DBLoad shows 180 and tells you something is very wrong.
Alarm threshold: set at 1× vCPU count for warning, 2× for critical. For a
db.r5.xlarge (4 vCPUs): warn at DBLoad = 4, critical at DBLoad = 8.
If Performance Insights is not enabled on your instance, the DBLoad
metric alone justifies turning it on: it's the most useful single signal available for
PostgreSQL on RDS.
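The vCPU-relative thresholds are simple enough to express directly. The following sketch uses a hypothetical `dbload_severity` function and assumes that reaching a threshold counts as a breach, which matches "warn at DBLoad = 4" above. The vCPU count is passed in by the caller and is not looked up from AWS.

```python
def dbload_severity(db_load: float, vcpus: int) -> str:
    """Classify a DBLoad value: warn at 1x vCPUs, critical at 2x vCPUs."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    if db_load >= 2 * vcpus:
        return "critical"
    if db_load >= vcpus:
        return "warn"
    return "ok"


# db.r5.xlarge has 4 vCPUs: warn at 4, critical at 8.
print(dbload_severity(3.2, 4))  # ok
print(dbload_severity(5.0, 4))  # warn
print(dbload_severity(180, 4))  # critical (the lock-wait pile-up scenario)
```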
5. FreeableMemory — catches the gradual memory leak
PostgreSQL uses free OS memory for the page cache — the OS keeps recently-accessed
data pages in memory so that repeat reads don't hit disk. As FreeableMemory
drops, the effective cache size shrinks, read IOPS increase, and query latency creeps up.
This happens gradually over weeks on instances that are sized too tight, and it's easy
to miss without a trend alarm.
Alarm at 15% of total instance RAM remaining. For a db.r5.xlarge (32GB):
alarm when FreeableMemory drops below ~4.8GB (about 5,153,960,755 bytes). If this alarm
fires regularly, you're either memory-constrained and need a larger instance, or a query
is doing large hash joins / sort operations without an appropriate work_mem
setting.
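Because the decline is gradual, it can help to estimate how long remains before the threshold is crossed. The sketch below is one possible approach and not part of CloudWatch: it fits a least-squares line to daily minimum FreeableMemory samples, which you would have to export yourself, and extrapolates. Both function names are hypothetical, "GB" is assumed to mean GiB, and a straight-line trend is a simplifying assumption.

```python
from typing import Optional, Sequence

GIB = 1024 ** 3  # bytes per GiB


def freeable_memory_threshold(ram_gib: int, percent: int = 15) -> int:
    """Alarm threshold in bytes: `percent` of total instance RAM."""
    return ram_gib * GIB * percent // 100


def days_until_threshold(daily_min_bytes: Sequence[float],
                         threshold_bytes: float) -> Optional[float]:
    """Estimate days from the last sample until the fitted trend line
    reaches the threshold. Returns None if memory is not declining."""
    n = len(daily_min_bytes)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_min_bytes) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_min_bytes))
    slope = sxy / sxx  # bytes per day
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold_bytes - intercept) / slope
    # Clamp to zero when the trend line is already below the threshold.
    return max(0.0, crossing_day - (n - 1))


threshold = freeable_memory_threshold(32)
print(threshold)  # 5153960755
# Illustrative samples only: one daily minimum per day, in bytes.
samples = [10 * GIB, 9 * GIB, 8 * GIB, 7 * GIB]
print(days_until_threshold(samples, threshold))
```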
To identify the queries that touch the most buffer pages (this requires the pg_stat_statements extension):

    SELECT
      left(query, 100) AS query,
      calls,
      rows,
      round(mean_exec_time::numeric, 2) AS mean_ms,
      shared_blks_hit + shared_blks_read AS total_blocks_touched
    FROM pg_stat_statements
    ORDER BY total_blocks_touched DESC
    LIMIT 10;

High total_blocks_touched combined with a low shared_blks_hit / total_blocks_touched
ratio indicates cache misses: queries that are reading from disk rather than memory.
This is a sign that either the working set has grown beyond available cache, or an
index is missing and full table scans are evicting hot data from the cache.
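The ratio described above can be computed from the two pg_stat_statements columns once the rows are fetched. This is a minimal sketch: `cache_hit_ratio` is a hypothetical helper, and the sample rows use made-up numbers and hypothetical table names purely for illustration.

```python
from typing import Optional


def cache_hit_ratio(shared_blks_hit: int, shared_blks_read: int) -> Optional[float]:
    """Fraction of touched blocks served from shared buffers.

    Returns None when the query touched no blocks at all.
    """
    total = shared_blks_hit + shared_blks_read
    if total == 0:
        return None
    return shared_blks_hit / total


# Illustrative rows only, not real measurements.
rows = [
    {"query": "SELECT ... FROM orders", "shared_blks_hit": 900, "shared_blks_read": 100},
    {"query": "SELECT ... FROM events", "shared_blks_hit": 200, "shared_blks_read": 800},
]

# Keep only rows that touched blocks, then sort worst hit ratio first.
scored = [
    (cache_hit_ratio(r["shared_blks_hit"], r["shared_blks_read"]), r["query"])
    for r in rows
    if r["shared_blks_hit"] + r["shared_blks_read"] > 0
]
worst_first = sorted(scored)
for ratio, query in worst_first:
    print(f"{ratio:.2f}  {query}")
```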
What about CPU?
Still set a CPU alarm — just don't make it your only one. 80% is a reasonable critical threshold. The important thing is to treat a CPU alarm as "go look at what's running" rather than "page everyone." Most CPU spikes on well-configured PostgreSQL instances have a specific cause that takes 5–10 minutes to diagnose and fix. The P1 CPU runbook covers that exact investigation.
Alarm summary
| Metric | Threshold | Why it matters |
|---|---|---|
| DatabaseConnections | 80% of max_connections | Predicts connection exhaustion before errors start |
| FreeStorageSpace | 20% remaining | Gives time to respond before writes fail hard |
| ReadLatency | > 5ms warn, > 20ms critical | Early indicator of seq scans and I/O saturation |
| DBLoad | > vCPU count warn, > 2× vCPU critical | Catches lock waits and queue buildup CPU misses |
| FreeableMemory | < 15% of total RAM | Catches slow cache pressure increase before latency impact |
| CPUUtilization | > 80% | Still useful; treat as "go investigate" not "panic" |