Knowledge Base › Why Your RDS Bill Doubled
Cost 6 min read · AWS RDS · Schema Migrations

Why Your RDS Bill Doubled After a Schema Migration

The pattern is consistent: a development team runs a schema migration on a large production table — adding a column, changing a type, adding a constraint — and within 24 hours the AWS bill has a visible spike. Storage has grown by 30–80GB. IOPS are elevated for hours after the migration completed. Occasionally the next billing cycle shows a permanent increase that nobody can explain. This guide covers the three most common causes and how to check which one hit you.

Cause 1: ALTER TABLE triggered a full table rewrite

Not all ALTER TABLE operations are equal. PostgreSQL can add a nullable column with no default in-place (just a catalogue update, essentially instant). But certain operations require rewriting every row in the table to disk — which means a full sequential read of the existing table followed by a full sequential write of the new version. On a 100GB table, that's 200GB of I/O before the migration is even done.

The operations that force a full rewrite:

ALTER TABLE operation Rewrites table? Notes
ADD COLUMN col TEXT DEFAULT 'x' NOT NULL Yes (PG < 11) / No (PG 11+) PG 11+ stores the default in the catalogue and rewrites lazily. PG 10 and below rewrote immediately.
ALTER COLUMN type USING … Yes Any type change with a USING clause rewrites all rows.
ALTER COLUMN SET NOT NULL No (PG 18) / Scan only (PG < 18) PG 18 uses a constraint check without a rewrite. Earlier versions scan to verify.
ADD COLUMN col SERIAL Yes Populates a sequence default for every existing row.
ADD COLUMN col TEXT (nullable, no default) No Just a system catalogue update. Safe on large tables.
ADD COLUMN col TEXT DEFAULT 'x' (nullable) No (PG 11+) Default stored in catalogue, applied at read time.

After a rewrite, you have two copies of the table on disk briefly — the old and the new. On a 100GB table, peak storage during a rewrite can reach 200GB. The old version is then deleted, but the storage release is not instantaneous. If autovacuum is lagging, the old dead pages can persist for hours before the space is reclaimed. In the meantime, RDS provisioned storage may have auto-scaled up to accommodate the spike — and auto-scaled storage does not shrink back.

RDS storage auto-scaling is one-way. If a migration causes a storage spike that triggers RDS to increase your allocated storage, that increase is permanent — you cannot reduce allocated storage on a running RDS instance without a snapshot-restore. If your migration caused a 40GB spike and auto-scaling kicked in, that's a permanent cost increase until the next planned DR exercise.

Cause 2: pg_toast table bloat

PostgreSQL stores large column values (text over ~2KB, jsonb, bytea) in a separate TOAST table. When you migrate a column containing large values — changing a text column to jsonb, for example — PostgreSQL creates new TOAST entries for every row that was updated. The old TOAST entries become dead tuples, but they live in a separate heap that autovacuum has to clean up independently.

The problem is that the TOAST table for a large primary table can itself be hundreds of gigabytes, and its autovacuum parameters are inherited from the parent table's settings (which are often conservative). After a migration touching many rows with large values, the TOAST table can sit bloated for days while autovacuum works through it.

To check TOAST table sizes after a migration:

SELECT
  c.relname                                  AS table_name,
  t.relname                                  AS toast_table,
  pg_size_pretty(pg_relation_size(c.reltoastrelid))
                                             AS toast_size,
  pg_size_pretty(pg_table_size(c.oid))       AS table_size
FROM   pg_class c
JOIN   pg_class t ON t.oid = c.reltoastrelid
WHERE  c.relkind = 'r'
  AND  c.reltoastrelid IS NOT NULL
  AND  pg_relation_size(c.reltoastrelid) > 104857600  -- > 100MB
ORDER BY pg_relation_size(c.reltoastrelid) DESC;

If a table's TOAST size looks disproportionate to the actual data volume, autovacuum hasn't cleaned up the dead TOAST tuples yet. You can accelerate this by running VACUUM directly on the TOAST table — though you need to find the TOAST table OID first:

SELECT 'VACUUM pg_toast.pg_toast_' || c.oid::text
FROM   pg_class c
WHERE  c.relname = 'your_table_name';
-- then run the VACUUM command it outputs

Cause 3: Index rebuilds consuming IOPS for hours

A full table rewrite also rebuilds every index on that table. On a table with 8–12 indexes (not uncommon on a busy operational table), each index is built sequentially after the table write. Index builds are sequential scans of the new table, producing sorted output. On a 100GB table with 10 indexes, you are looking at the I/O equivalent of reading and writing the table 10 additional times.

The IOPS cost doesn't show up as a clear spike in CloudWatch's WriteIOPS metric because it's spread over the duration of the migration. What it looks like is sustained elevated I/O for the entire migration window — often 30 minutes to several hours on tables above 50GB — which can push gp3 volumes past their provisioned IOPS baseline and into burst territory. If your instance has gp3 volumes provisioned at 3,000 IOPS (the default), a prolonged migration can drain the I/O credit balance, slowing everything else on the instance during the migration window.

Check your provisioned IOPS vs actual IOPS during a migration window:

-- Check how many indexes exist on a table and their sizes
SELECT
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname = 'your_table_name'
ORDER BY pg_relation_size(indexrelid) DESC;

Diagnosing what happened after the fact

If the migration has already run and you're trying to understand the bill spike, the most useful signals are in CloudWatch. Look at the 24-hour window around the migration time for:

You can also check current table and index sizes against what you expect:

SELECT
  relname,
  pg_size_pretty(pg_table_size(oid))       AS table_size,
  pg_size_pretty(pg_indexes_size(oid))     AS indexes_size,
  pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
  n_live_tup,
  n_dead_tup,
  round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct
FROM   pg_stat_user_tables
JOIN   pg_class USING (relname)
ORDER BY pg_total_relation_size(oid) DESC
LIMIT  20;

A high dead_pct on the migrated table means autovacuum hasn't finished cleaning up the old rows yet. If it's been more than a few hours since the migration, you can speed up reclamation by running VACUUM ANALYZE your_table_name directly — this won't lock the table and runs faster than waiting for autovacuum.

Preventing this on the next migration

The three most effective interventions:

1. Use pg_repack instead of ALTER TABLE for large column type changes. pg_repack rewrites the table online, in the background, without a full exclusive lock and without the storage spike. It's not available as an RDS extension by default but can be compiled and run as a client-side tool against an RDS endpoint. For tables above 10GB, the reduced I/O cost alone justifies the setup time.

2. Split large migrations into two steps. Add the new column as nullable, backfill it in batches, then add the NOT NULL constraint separately. This avoids the full rewrite and spreads the I/O over time rather than concentrating it in a single migration window. On a 500M-row table, batching at 10,000 rows with a short sleep between batches keeps the IOPS profile flat and doesn't touch the burst credit balance.

3. Run large migrations during off-peak hours and monitor FreeStorageSpace during the window. Set a CloudWatch alarm on FreeStorageSpace at 20% of allocated storage before starting. If storage auto-scaling is enabled, disable it temporarily and watch the metric manually — the auto-scale is a safety net but its one-directional nature makes it an expensive one.

Trying to understand a recent bill spike?
The Diagnostic Session covers exactly this — we review your CloudWatch metrics, table sizes, and autovacuum history to pinpoint what happened and what it will cost going forward. Most clients recover the session fee in avoided storage costs within a quarter.

Diagnostic Session — £497 →