On Tue, May 28, 2024 at 03:00:23PM -0400, Mohan NBSPS wrote:
> [...]
> Thank you Johannes for the advice.
>
> We are looking at moving to 16.
> We did not implement slots to avoid disk space issues on primary (possible
> network disconnect may fill up primary `pg_xlog`).

Yes, replication slots can take down your primary by filling its disk.
Relying on wal_keep_segments alone can kill your replicas. Having a WAL
archive can be the best of both worlds, but also needs careful monitoring and
management.

> We have changed the WAL settings to retain more WAL files.
>
> Since we have not seen this issue before, (have been running postgresql for
> over 10 years), what kind
> of scenario would trigger this ?

Every time you interrupt the replication stream (such as when a replica
reboots, or its postmaster process is stopped), you enter a race between WAL
segments accumulating on the primary and the replication stream picking up
again once the replica is back up.

So if, during your replica restart, enough WAL was produced to exceed
wal_keep_segments, the lineage is broken, and the replica cannot ever catch up
again.

Also, the "invalid resource manager" log line you reported *might* hint at
data corruption in your WAL segments. I think that data checksums and WAL
compression could both make detection of such conditions more reliable.

-- 
with best regards:
- Johannes Truschnigg ( johannes@xxxxxxxxxxxxxxx )

www: https://johannes.truschnigg.info/
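
The race described above can be put into rough numbers. The following is only a back-of-the-envelope sketch, not anything taken from the thread: the function names and the example write rate are made up for illustration, the 16 MiB segment size is the PostgreSQL default, and the model deliberately ignores checkpoint timing and any WAL archive the replica could fall back on.

```python
import math

# Default PostgreSQL WAL segment size (16 MiB); clusters can be
# initialised with a different value, so treat this as an assumption.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def segments_written(wal_bytes_per_sec: float, downtime_sec: float,
                     segment_bytes: int = WAL_SEGMENT_BYTES) -> int:
    """Estimate how many WAL segments the primary fills during downtime."""
    return math.ceil(wal_bytes_per_sec * downtime_sec / segment_bytes)


def replica_can_resume(wal_bytes_per_sec: float, downtime_sec: float,
                       wal_keep_segments: int,
                       segment_bytes: int = WAL_SEGMENT_BYTES) -> bool:
    """Simplified check: the replica can resume streaming only if the WAL
    written while it was away still fits within wal_keep_segments."""
    written = segments_written(wal_bytes_per_sec, downtime_sec, segment_bytes)
    return written <= wal_keep_segments


if __name__ == "__main__":
    rate = 1024 * 1024  # hypothetical primary writing 1 MiB of WAL per second
    for downtime in (160, 600):
        ok = replica_can_resume(rate, downtime, wal_keep_segments=32)
        print(f"downtime={downtime}s resume_possible={ok}")
```

With these assumed numbers, a restart of a few minutes stays within the retained segments, while a ten-minute outage does not. Real retention also depends on when checkpoints recycle old segments, so the estimate is best read as a lower bound on the safety margin.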