Re: replication primary writing infinite number of WAL files

Les <nagylzs@xxxxxxxxx> · Fri, 24 Nov 2023 16:59:55 +0100

Laurenz Albe <laurenz.albe@xxxxxxxxxxx> wrote (Fri, 24 Nov 2023, 16:00):
> On Fri, 2023-11-24 at 12:39 +0100, Les wrote:
> > Under normal circumstances, the number of write operations is relatively low, with an
> > average of 4-5 MB/sec total write speed on the disk associated with the data directory.
> > Yesterday, the primary server suddenly started writing to the pg_wal directory at a
> > crazy pace, 1.5GB/sec, but sometimes it went up to over 3GB/sec.
> > [...]
> > Upon further analysis of the database, we found that we did not see any mass data
> > changes in any of the tables. The only exception is a sequence value that was moved
> > millions of steps within a single minute.
>
> That looks like some application went crazy and inserted millions of rows, but the
> inserts were rolled back. But it is hard to be certain with the clues given.

Writing of WAL files continued after we shut down all clients, and restarted the primary PostgreSQL server.

The order was:

1. shut down all clients
2. stop the primary
3. start the primary
4. primary started to write like mad again
5. removed replication slot
6. primary stopped madness and deleted all WAL files (except for a few)

How can the primary server generate more and more WAL files (writes) after all clients have been shut down and the server was restarted? My only guess was autovacuum, but I ruled that out, because removing a replication slot has no effect on autovacuum (am I wrong?). Now you are saying that this looks like a huge rollback. Does rolling back changes require even more data to be written to the WAL after a server restart? As far as I know, if something was not written to the WAL, then it is not something that can be rolled back. Does removing a replication slot reduce the amount of data that needs to be written for a rollback (or for anything else)? It is a fact that the primary stopped writing at 1.5GB/sec the moment we removed the slot.

I'm not saying that you are wrong. Maybe there was a crazy application. I'm just saying that a crazy application cannot be the whole picture. It cannot explain this behaviour as a whole. Or maybe I have a deep misunderstanding about how WAL files work.

On the second occasion, the primary had been running for a few minutes when pg_wal started to grow. We noticed that early, shut down all clients, and then restarted the primary server. After the restart, the primary kept writing out more WAL files for many more minutes, until we dropped the slot again. I.e. it wrote much more data after the restart than before the restart, and it stopped only (and exactly) when we removed the slot.
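
For a sense of scale, here is a rough back-of-the-envelope sketch of what the write rates quoted above would mean in WAL segments per second. It assumes the default 16 MB WAL segment size and reads "GB" and "MB" as binary units; neither is confirmed for this server, and the function name is only illustrative.

```python
# Rough estimate: how many WAL segments per second a given write rate implies.
# ASSUMPTION: 16 MB segments (the PostgreSQL default); the actual
# wal_segment_size of this server is not stated in the thread.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024

MIB = 1024 ** 2
GIB = 1024 ** 3


def segments_per_second(write_rate_bytes_per_sec, segment_bytes=WAL_SEGMENT_BYTES):
    """Return the number of WAL segments filled per second at the given rate."""
    return write_rate_bytes_per_sec / segment_bytes


# Rates taken from the message: 4-5 MB/sec normally, 1.5 GB/sec during the
# incident, peaks above 3 GB/sec.
normal = segments_per_second(5 * MIB)
incident = segments_per_second(1.5 * GIB)
peak = segments_per_second(3 * GIB)

print(f"normal:   {normal:.2f} segments/sec")
print(f"incident: {incident:.0f} segments/sec")
print(f"peak:     {peak:.0f} segments/sec")
```

Under those assumptions the incident rate corresponds to roughly a hundred new segments every second, which is why pg_wal fills up so quickly while a slot prevents the old segments from being removed.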

Regards,

   Laszlo