Re: RDS restore failed due to WAL log and disk space-- any tidy fixes?

Wells Oliver <wells.oliver@xxxxxxxxx> · Sun, 17 Nov 2024 09:12:05 -0800

Interesting. I am migrating a pg_dump archive to a new server, in a single go. Does it make sense to disable (or speed up?) WAL archiving during the restore, then reenable it after the restore so a future replica could work? What would be the steps here? Would disabling or "speeding up" be faster?

max_slot_wal_keep_size is -1 at the moment so I think that's why it kept a ton of WAL and ran out of space.
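
For a rough sense of scale, here is a minimal sketch of the headroom arithmetic using the figures from the original post. It assumes the default 16 MB WAL segment size and treats 1 GB as 1024 MB; neither is confirmed for this instance. Note also that max_slot_wal_keep_size only limits WAL retained by replication slots, so it is the culprit only if a slot actually exists on the instance.

```python
# Figures from the thread; everything else is an assumption for illustration.
PROVISIONED_GB = 2500   # RDS storage provisioned
DATABASE_GB = 1750      # approximate size of the restored database
WAL_SEGMENT_MB = 16     # PostgreSQL default segment size (assumed unchanged)


def wal_headroom_gb(provisioned_gb: int, database_gb: int) -> int:
    """Space left for WAL (and everything else) once the data is fully restored."""
    return provisioned_gb - database_gb


def segments_that_fit(headroom_gb: int, segment_mb: int = WAL_SEGMENT_MB) -> int:
    """Number of WAL segments that fit in the headroom, treating 1 GB as 1024 MB."""
    return (headroom_gb * 1024) // segment_mb


headroom = wal_headroom_gb(PROVISIONED_GB, DATABASE_GB)
print(f"headroom: {headroom} GB")
print(f"WAL segments that fit: {segments_that_fit(headroom)}")
```

Under those assumptions, anything that keeps WAL from being recycled has at most about 750 GB to consume before the disk fills, which is less than the size of the data being restored.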

On Sun, Nov 17, 2024 at 7:41 AM Laurenz Albe <laurenz.albe@xxxxxxxxxxx> wrote:
On Sat, 2024-11-16 at 16:33 -0800, Wells Oliver wrote:

> I provisioned an RDS instance with 2500GB space and began the restore of a database I know to be about 1750 GB using 16 jobs.
>
> Unfortunately, it died very near the end when it ran out of disk space due to WAL log usage. Lots of:
>
> 2024-11-17 00:07:09 UTC::@:[19861]:PANIC:  could not write to file "pg_wal/xlogtemp.19861": No space left on device
>
> And then kaboom.
>
> I'm wondering what my course of action should be. Can I disable/reduce WAL during a restore?
> wal_level is set to replica, can this temporarily be set to minimal? Should I just eat the extra
> costs to add headroom for the WAL? Would using fewer jobs during a restore reduce the amount of WAL
> created?

If you are using minimal WAL logging and you restore the dump in a single transaction, you
should see way less WAL generated, because data inserted into the table in the same transaction
as the CREATE TABLE statement need not be WAL logged.

But you might more easily solve the problem by speeding up or disabling the WAL archiver,
so that PostgreSQL removes old WAL after the next checkpoint.
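
The single-transaction route has a trade-off against the 16 jobs mentioned in the original post: pg_restore does not allow multiple jobs together with --single-transaction. A minimal sketch of the two invocations, built as argument lists; the archive name `backup.dump` and database name `newdb` are hypothetical, and connection options are omitted.

```python
from typing import List, Optional


def pg_restore_args(
    archive: str,
    dbname: str,
    jobs: Optional[int] = None,
    single_transaction: bool = False,
) -> List[str]:
    """Build a pg_restore argument list for one of the two strategies."""
    # pg_restore does not allow multiple jobs together with
    # --single-transaction, so fail early instead of at run time.
    if single_transaction and jobs is not None and jobs > 1:
        raise ValueError("--single-transaction cannot be combined with multiple jobs")
    args = ["pg_restore", f"--dbname={dbname}"]
    if single_transaction:
        args.append("--single-transaction")
    if jobs is not None:
        args.append(f"--jobs={jobs}")
    args.append(archive)
    return args


# Hypothetical archive and database names.
parallel = pg_restore_args("backup.dump", "newdb", jobs=16)
one_txn = pg_restore_args("backup.dump", "newdb", single_transaction=True)
print(" ".join(parallel))
print(" ".join(one_txn))
```

The single-transaction form only reduces WAL when wal_level is minimal, as described above; with wal_level set to replica it gives up parallelism without that benefit.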

Yours,
Laurenz Albe

-- 
Wells Oliver
wells.oliver@xxxxxxxxx