The most common scenario we encounter: a team provisioned a db.r5.4xlarge
eighteen months ago because they expected traffic to grow into it. Traffic grew, but not
nearly that much. CloudWatch shows CPU sitting at 8%, FreeableMemory never dropping below
40GB, and the monthly bill is £2,800. The actual workload would fit comfortably on a
db.r5.xlarge at £700 — but nobody wants to be the person who triggered
a production incident to save money.
Here is how to validate the headroom, choose the right target, and execute the migration with under 2 minutes of downtime — or zero downtime if you use blue/green deployment.
Phase 1 — Validate you actually have headroom
Never make a right-sizing decision from a single CloudWatch metric. CPU alone is misleading — an instance can look idle on CPU while being heavily constrained by memory, IOPS, or connection count. Run all four checks before committing to a target.
CPU: Look at the p95 of CPUUtilization across the last
30 days, not the average. Averages hide peak behaviour. A 30-day p95 below 30% is a
reasonable signal that a smaller instance won't hit CPU limits during normal peaks.
Below 20% p95 and a half-size instance is almost certainly safe.
Memory: Check FreeableMemory at its minimum over the
last 30 days. PostgreSQL uses free OS memory for shared_buffers and OS-level page
cache — both directly affect query performance. The rule of thumb: the minimum
FreeableMemory should be comfortably above 15% of the instance's total RAM. If your
db.r5.4xlarge has 128GB RAM and the minimum FreeableMemory over 30 days
was 52GB, there's clearly room to downsize. If it was 8GB, there isn't.
IOPS: Check ReadIOPS and WriteIOPS at their
peak versus your gp3 provisioned baseline. gp3 volumes default to 3,000 IOPS — if
your peak IOPS never exceeds 1,500, halving the instance class won't change this constraint.
If you're provisioned above 3,000 IOPS specifically to handle peaks, you need to
understand whether those IOPS are instance-constrained or volume-constrained before
downsizing the instance.
Connections: Check peak DatabaseConnections against the
max_connections value for your current and target instance. In PostgreSQL,
max_connections defaults to a formula based on available memory. Downsizing
from db.r5.4xlarge (128GB, max_connections ~5000) to db.r5.xlarge
(32GB, max_connections ~1250) can hit a connection limit you never noticed before.
Check your current peak connection count:
SELECT max(numbackends) AS peak_connections, avg(numbackends) AS avg_connections, current_setting('max_connections')::int AS max_connections FROM pg_stat_database WHERE datname = current_database();
If peak connections are above 60% of the target instance's max_connections,
install pgBouncer before downsizing rather than after. See the
connection pool guide.
Phase 2 — Choose the target instance class
Based on the four checks above, the practical sizing heuristic:
| Observation | Recommended move |
|---|---|
| CPU p95 < 20%, FreeableMemory min > 40% of total RAM | Drop one full size (e.g. r5.4xl → r5.2xl). Validate for 2 weeks, then consider going one more. |
| CPU p95 < 30%, memory constrained (FreeableMemory min < 15%) | Don't downsize the instance class. Investigate query-level memory usage first (sort_mem, hash joins). The bottleneck is the workload, not the instance. |
| CPU fine, connections near max_connections limit | Install pgBouncer first, then downsize. Without a pooler, the smaller instance's lower max_connections will become the binding constraint. |
| Workload has clear daily peaks (e.g. business hours only) | Consider Aurora Serverless v2 with min/max ACU configured to match peak and trough. Often cheaper than a fixed r5 instance when utilisation pattern is spiky. |
Phase 3 — Execute with minimum downtime
There are two options: modify-in-place (brief downtime) or blue/green deployment (near-zero downtime). Choose based on your tolerance.
Modify in place — apply the instance class change during a maintenance window. RDS will stop the instance, resize it, and restart. For modern instance families (r6g, r7g), the restart typically takes 90–120 seconds. The AWS CLI command:
aws rds modify-db-instance \ --db-instance-identifier your-db-id \ --db-instance-class db.r5.xlarge \ --apply-immediately # omit to defer to maintenance window
--apply-immediately in production unless you're actively in a maintenance window.
Blue/green deployment — RDS creates a green (new) instance, replicates all data from your production (blue) instance, and lets you test the green instance before switching traffic over. The cutover itself is a short DNS-level switch with typically under 60 seconds of unavailability. This is the right choice for any instance carrying more than £100/month of traffic value.
# Create the green instance at the target class aws rds create-blue-green-deployment \ --blue-green-deployment-name resize-to-xlarge \ --source your-db-identifier \ --target-db-instance-class db.r5.xlarge # After validating the green instance, switch over aws rds switchover-blue-green-deployment \ --blue-green-deployment-identifier bgd-xxxxxxxx
After the switchover, the old blue instance is retained for a period — don't delete it until you've verified application behaviour under real load on the green instance. Keep it for at least 48 hours before terminating.
Real numbers from four migrations
These are actual savings from right-sizing work done in the last 12 months:
| Before | After | Monthly saving | What enabled it |
|---|---|---|---|
db.r5.4xlarge (Multi-AZ) |
db.r5.xlarge (Multi-AZ) |
~£1,900/month | pgBouncer installed first, CPU p95 was 11% |
db.r5.2xlarge (Single-AZ) |
db.r5.large + read replica |
~£580/month | Read-heavy workload shifted to replica |
db.r6g.4xlarge (Multi-AZ) |
db.r6g.2xlarge (Multi-AZ) |
~£1,100/month | Slow queries fixed first; CPU dropped from 35% to 14% |
db.r5.xlarge (Multi-AZ) |
Aurora Serverless v2 (0.5–4 ACU) | ~£340/month | Business-hours-only workload; near-zero overnight |
In all four cases the validation step took longer than the actual migration. The common mistake is skipping the 30-day CloudWatch review and making a decision based on "it looks quiet" rather than verified p95 metrics.
Think you're on an oversized instance?
The Diagnostic Session reviews your CloudWatch
metrics, query profile, and connection patterns, then gives you a specific recommendation on
whether to resize, tune, or both — with the expected monthly saving clearly stated.