On 8/2/22 09:26, Don Seiler wrote:
I'm on PG 12 on Ubuntu 18.04.
I have a very large database (~18TB) for which we have a separate
pg_wal volume configured. The pg_wal volume is around 50GB, but typical
daily usage rarely exceeds 10GB. We use pgbackrest to archive WALs
as well as to perform backups.
The problem is when we are performing restore and recovery (e.g. to set
up a new physical replica), also via pgbackrest. The DB restore works
fine but when it comes to the WAL restore and recovery, the pg_wal
volume will fill up before PG can clear out the already-recovered WAL
files. This means I have to restart the database and start the recovery
process over again. The last time I ended up writing a cron job to
delete around 100 logs per minute just via an `rm` command based on the
recovery rate I saw.
My understanding is that the recovered WAL files would be cleared when
the replica hits a restartpoint. My primary is configured with a
10 minute checkpoint_timeout. The replica's pg_wal fills up in under 10
minutes. The primary and replica have the same size pg_wal volume, but
the primary never comes close to filling up, as I said before.
I believe two options (aside from the ugly rm cron job) would be to
either shorten the checkpoint_timeout on the primary, which would be
hard to do due to the activity level, or make a larger pg_wal volume
(trial and error to determine just how much larger?).
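
As a rough alternative to trial and error, the sizing arithmetic could be
sketched as below. This is only a back-of-the-envelope estimate under
stated assumptions: the 16MB segment size is the PostgreSQL default and is
assumed unchanged, no segments are assumed to be removed between two
restartpoints, and the restore rate of 400 segments per minute is a
hypothetical placeholder, not a measurement from this system. The function
names are made up for illustration.

```python
# Back-of-the-envelope pg_wal sizing during recovery.
# Assumption: default 16MB WAL segments (check wal_segment_size on your cluster).
WAL_SEGMENT_MB = 16


def wal_backlog_gb(segments_per_minute, minutes_between_restartpoints,
                   segment_mb=WAL_SEGMENT_MB):
    """WAL (in GB) restored between two restartpoints, assuming
    nothing is removed from pg_wal in between."""
    return segments_per_minute * minutes_between_restartpoints * segment_mb / 1024


def minutes_until_full(volume_gb, segments_per_minute,
                       segment_mb=WAL_SEGMENT_MB):
    """Minutes until an initially empty volume fills at a steady restore rate."""
    return volume_gb * 1024 / (segments_per_minute * segment_mb)


if __name__ == "__main__":
    rate = 400  # hypothetical segments restored per minute
    print(wal_backlog_gb(rate, 10))      # GB accumulated over a 10 minute interval
    print(minutes_until_full(50, rate))  # minutes before a 50GB volume is full
```

With these placeholder inputs the backlog over 10 minutes comes to 62.5GB
and a 50GB volume fills in 8 minutes, so the volume would need headroom
above the first figure. Plugging in a restore rate actually observed from
the pgbackrest logs would give a more meaningful number.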
I'm interested to know if there's anything else I can do to avoid the
toil when we do these restores, and also whether something is
wrong and PG shouldn't be filling up the volume blindly.
This appears to be related to [1], which we have been discussing over on
that thread.
Regards,
-David
[1]
https://www.postgresql.org/message-id/flat/20210202151416.GB3304930%40rfd.leadboat.com