Knowledge Base › Right-Sizing RDS Without Downtime
Cost 5 min read · AWS RDS · Cost Optimisation

Right-Sizing Your RDS Instance Without Downtime

The most common scenario we encounter: a team provisioned a db.r5.4xlarge eighteen months ago because they expected traffic to grow into it. Traffic grew, but not nearly that much. CloudWatch shows CPU sitting at 8%, FreeableMemory never dropping below 40GB, and the monthly bill is £2,800. The actual workload would fit comfortably on a db.r5.xlarge at £700 — but nobody wants to be the person who triggered a production incident to save money.

Here is how to validate the headroom, choose the right target, and execute the migration with under 2 minutes of downtime — or zero downtime if you use blue/green deployment.

Phase 1 — Validate you actually have headroom

Never make a right-sizing decision from a single CloudWatch metric. CPU alone is misleading — an instance can look idle on CPU while being heavily constrained by memory, IOPS, or connection count. Run all four checks before committing to a target.

CPU: Look at the p95 of CPUUtilization across the last 30 days, not the average. Averages hide peak behaviour. A 30-day p95 below 30% is a reasonable signal that a smaller instance won't hit CPU limits during normal peaks. Below 20% p95 and a half-size instance is almost certainly safe.

Memory: Check FreeableMemory at its minimum over the last 30 days. PostgreSQL uses free OS memory for shared_buffers and OS-level page cache — both directly affect query performance. The rule of thumb: the minimum FreeableMemory should be comfortably above 15% of the instance's total RAM. If your db.r5.4xlarge has 128GB RAM and the minimum FreeableMemory over 30 days was 52GB, there's clearly room to downsize. If it was 8GB, there isn't.

IOPS: Check ReadIOPS and WriteIOPS at their peak versus your gp3 provisioned baseline. gp3 volumes default to 3,000 IOPS — if your peak IOPS never exceeds 1,500, halving the instance class won't change this constraint. If you're provisioned above 3,000 IOPS specifically to handle peaks, you need to understand whether those IOPS are instance-constrained or volume-constrained before downsizing the instance.

Connections: Check peak DatabaseConnections against the max_connections value for your current and target instance. In PostgreSQL, max_connections defaults to a formula based on available memory. Downsizing from db.r5.4xlarge (128GB, max_connections ~5000) to db.r5.xlarge (32GB, max_connections ~1250) can hit a connection limit you never noticed before. Check your current peak connection count:

SELECT
  max(numbackends)      AS peak_connections,
  avg(numbackends)      AS avg_connections,
  current_setting('max_connections')::int AS max_connections
FROM   pg_stat_database
WHERE  datname = current_database();

If peak connections are above 60% of the target instance's max_connections, install pgBouncer before downsizing rather than after. See the connection pool guide.

Phase 2 — Choose the target instance class

Based on the four checks above, the practical sizing heuristic:

Observation Recommended move
CPU p95 < 20%, FreeableMemory min > 40% of total RAM Drop one full size (e.g. r5.4xl → r5.2xl). Validate for 2 weeks, then consider going one more.
CPU p95 < 30%, memory constrained (FreeableMemory min < 15%) Don't downsize the instance class. Investigate query-level memory usage first (sort_mem, hash joins). The bottleneck is the workload, not the instance.
CPU fine, connections near max_connections limit Install pgBouncer first, then downsize. Without a pooler, the smaller instance's lower max_connections will become the binding constraint.
Workload has clear daily peaks (e.g. business hours only) Consider Aurora Serverless v2 with min/max ACU configured to match peak and trough. Often cheaper than a fixed r5 instance when utilisation pattern is spiky.

Phase 3 — Execute with minimum downtime

There are two options: modify-in-place (brief downtime) or blue/green deployment (near-zero downtime). Choose based on your tolerance.

Modify in place — apply the instance class change during a maintenance window. RDS will stop the instance, resize it, and restart. For modern instance families (r6g, r7g), the restart typically takes 90–120 seconds. The AWS CLI command:

aws rds modify-db-instance \
  --db-instance-identifier your-db-id \
  --db-instance-class db.r5.xlarge \
  --apply-immediately  # omit to defer to maintenance window
--apply-immediately means right now, not "at next maintenance window." If you run this during peak hours, your instance will restart immediately. Always omit --apply-immediately in production unless you're actively in a maintenance window.

Blue/green deployment — RDS creates a green (new) instance, replicates all data from your production (blue) instance, and lets you test the green instance before switching traffic over. The cutover itself is a short DNS-level switch with typically under 60 seconds of unavailability. This is the right choice for any instance carrying more than £100/month of traffic value.

# Create the green instance at the target class
aws rds create-blue-green-deployment \
  --blue-green-deployment-name resize-to-xlarge \
  --source your-db-identifier \
  --target-db-instance-class db.r5.xlarge

# After validating the green instance, switch over
aws rds switchover-blue-green-deployment \
  --blue-green-deployment-identifier bgd-xxxxxxxx

After the switchover, the old blue instance is retained for a period — don't delete it until you've verified application behaviour under real load on the green instance. Keep it for at least 48 hours before terminating.

Real numbers from four migrations

These are actual savings from right-sizing work done in the last 12 months:

Before After Monthly saving What enabled it
db.r5.4xlarge (Multi-AZ) db.r5.xlarge (Multi-AZ) ~£1,900/month pgBouncer installed first, CPU p95 was 11%
db.r5.2xlarge (Single-AZ) db.r5.large + read replica ~£580/month Read-heavy workload shifted to replica
db.r6g.4xlarge (Multi-AZ) db.r6g.2xlarge (Multi-AZ) ~£1,100/month Slow queries fixed first; CPU dropped from 35% to 14%
db.r5.xlarge (Multi-AZ) Aurora Serverless v2 (0.5–4 ACU) ~£340/month Business-hours-only workload; near-zero overnight

In all four cases the validation step took longer than the actual migration. The common mistake is skipping the 30-day CloudWatch review and making a decision based on "it looks quiet" rather than verified p95 metrics.

One important note: fixing slow queries before right-sizing almost always increases the savings. In the third example above, CPU dropped from 35% to 14% after a query fix — making a downsize possible that wasn't viable before. The Diagnostic Session typically identifies both performance and right-sizing opportunities in the same session.

Think you're on an oversized instance?
The Diagnostic Session reviews your CloudWatch metrics, query profile, and connection patterns, then gives you a specific recommendation on whether to resize, tune, or both — with the expected monthly saving clearly stated.

Diagnostic Session — £497 →