Re: replication primary writing infinite number of WAL files

Adrian Klaver <adrian.klaver@xxxxxxxxxxx> · Fri, 24 Nov 2023 08:50:19 -0800

On 11/24/23 03:39, Les wrote:
Hello,
Yesterday, the primary server suddenly started 
writing to the pg_wal directory at a crazy pace: 1.5GB/sec, sometimes 
over 3GB/sec. The pg_wal directory kept growing 
and didn't stop until it ran out of disk space. It happened so fast that 
we didn't have time to react. We then stopped all applications 
(postgresql clients) because we thought one of them was causing the 
problem. 
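
A rough way to put a number on the WAL write rate described above is to
sample the current WAL position twice (for example with
SELECT pg_current_wal_lsn()) and divide the byte difference by the
elapsed time. The sketch below only does the arithmetic; the function
names are made up for illustration, and it assumes the two LSN strings
were collected by hand or by whatever client is in use.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def wal_rate(lsn_start: str, lsn_end: str, seconds: float) -> float:
    """Average WAL bytes generated per second between two samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (lsn_to_bytes(lsn_end) - lsn_to_bytes(lsn_start)) / seconds


if __name__ == "__main__":
    # Hypothetical samples taken 10 seconds apart, not values from this thread.
    rate = wal_rate("16/00000000", "16/40000000", 10.0)
    print(f"{rate / 1024 ** 2:.1f} MiB/sec")
```

Logging this every few seconds alongside the server log would show when
the rate changed, not just that it did.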

The only exception is a sequence 
value that was moved millions of steps within a single minute. Of 

Did you determine this by looking at select * from some_seq?

This new instance worked for about 12 hours.  This morning, the 
error occurred again, in the same form. Based on our previous 
experience, we immediately deleted the standby and its replication slot, 
and the problem resolved itself (except that the standby had to be 
deleted again). Without rebooting or restarting anything else, the 
problem went away. I managed to save a small part of the pg_wal before 
deleting the slot. We looked into it and saw something like this:

Are the servers open to the world and if so have you explored whether 
there has been an intrusion?

Do you have logs that cover the period from when it transitioned from 
working normally to going haywire?

We looked at the PostgreSQL release history, and we see some bug fixes 
in version 14.7 that might have something to do with this:

https://www.postgresql.org/docs/release/14.7/ 

 > Ignore invalidated logical-replication slots while determining oldest 
catalog xmin (Sirisha Chamarthi) A replication slot could prevent 
cleanup of dead tuples in the system catalogs even after it becomes 
invalidated due to exceeding max_slot_wal_keep_size. Thus, failure of a 
replication consumer could lead to indefinitely-large catalog bloat.

You are using repmgr, which, as I understand it, uses streaming, not 
logical, replication.
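
One way to confirm which kind of slots exist is to look at the slot_type
column of pg_replication_slots, which is either 'physical' (streaming)
or 'logical'. The sketch below is only an illustration: the query uses
real columns of that view, but the helper names and the sample rows are
hypothetical, and it assumes the rows have already been fetched as
dictionaries by whatever client is in use.

```python
# Columns here exist in pg_replication_slots on PostgreSQL 14.
SLOT_QUERY = """
SELECT slot_name, slot_type, active, wal_status
FROM pg_replication_slots
"""


def split_slots(rows):
    """Group slot names by slot_type ('physical' or 'logical')."""
    groups = {"physical": [], "logical": []}
    for row in rows:
        groups.setdefault(row["slot_type"], []).append(row["slot_name"])
    return groups


def inactive_slots(rows):
    """Names of slots with no connected consumer; these can hold back WAL."""
    return [row["slot_name"] for row in rows if not row["active"]]


if __name__ == "__main__":
    # Hypothetical rows, not taken from the servers in this thread.
    sample = [
        {"slot_name": "standby_1", "slot_type": "physical",
         "active": True, "wal_status": "reserved"},
        {"slot_name": "old_standby", "slot_type": "physical",
         "active": False, "wal_status": "extended"},
    ]
    print(split_slots(sample))
    print(inactive_slots(sample))
```

If the logical list comes back empty, the 14.7 fix quoted above, which
concerns logical slots, would not apply to this setup.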

Thank you,

    Laszlo

--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx