On 11/24/23 03:39, Les wrote:
Hello,
Yesterday, the primary server suddenly started
writing to the pg_wal directory at a crazy pace, 1.5GB/sec, but
sometimes it went up to over 3GB/sec. The pg_wal started fattening up
and didn't stop until it ran out of disk space. It happened so fast that
we didn't have time to react. We then stopped all applications
(postgresql clients) because we thought one of them was causing the
problem.
The only exception is a sequence
value that was moved millions of steps within a single minute. Of
Did you determine this by looking at select * from some_seq?
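For reference, a minimal sketch of how the sequence state could be inspected, assuming a sequence literally named some_seq (a placeholder, not a name from your system). Sampling it twice, a minute apart, would show how fast it is actually moving:

```sql
-- Current state of one sequence (some_seq is a placeholder name).
SELECT last_value, log_cnt, is_called FROM some_seq;

-- The same information for all sequences visible to the current role.
SELECT schemaname, sequencename, last_value, increment_by, cache_size
FROM pg_sequences
ORDER BY schemaname, sequencename;
```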
This new instance worked for about 12 hours. This morning, the
error occurred again, in the same form. Based on our previous
experience, we immediately deleted the standby and its replication slot,
and the problem resolved itself (except that the standby had to be
deleted again). Without rebooting or restarting anything else, the
problem went away. I managed to save a small part of the pg_wal before
deleting the slot. We looked into it and saw something like this:
Are the servers open to the world and if so have you explored whether
there has been an intrusion?
Do you have logs that cover the period from when it transitioned from
working normally to going haywire?
We looked at the PostgreSQL release history, and we see some bug fixes
in version 14.7 that might have something to do with this:
https://www.postgresql.org/docs/release/14.7/
> Ignore invalidated logical-replication slots while determining oldest
catalog xmin (Sirisha Chamarthi) A replication slot could prevent
cleanup of dead tuples in the system catalogs even after it becomes
invalidated due to exceeding max_slot_wal_keep_size. Thus, failure of a
replication consumer could lead to indefinitely-large catalog bloat.
You are using repmgr which, as I understand it, uses streaming, not
logical, replication.
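One way to confirm which kind of slot is in play, and how much WAL each slot is holding back, is a query along these lines. This is only a sketch, assuming PostgreSQL 13 or later (for the wal_status column) and that it is run on the primary:

```sql
-- slot_type is 'physical' for streaming replication, 'logical' otherwise.
-- retained_wal is the WAL between the slot's restart_lsn and the current
-- write position, i.e. roughly what the slot forces pg_wal to keep.
SELECT slot_name,
       slot_type,
       active,
       restart_lsn,
       wal_status,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

Note that a slot only explains WAL being retained, not the rate at which new WAL is being written.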
Thank you,
Laszlo
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx