Re: Postgresql 9.5: Streaming Replication: Secondaries Fail To Start Post WAL Error

Johannes Truschnigg <johannes@xxxxxxxxxxxxxxx> · Tue, 28 May 2024 21:11:36 +0200

On Tue, May 28, 2024 at 03:00:23PM -0400, Mohan NBSPS wrote:
> [...]
> Thank you Johannes for the advice.
> 
> We are looking at moving to 16.
> We did not implement slots to avoid disk space issues on primary (possible
> network disconnect may fill up primary `pg_xlog`).

Yes, replication slots can take down your primary: a disconnected replica
makes it retain WAL until the disk fills up. Relying on wal_keep_segments
alone can kill your replicas instead. Having a WAL archive can be the best of
both worlds, but it also needs careful monitoring and management.
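
As one example of what such monitoring could look like, here is a minimal
sketch that turns two WAL locations into a lag in bytes. It assumes you fetch
the locations yourself, e.g. with pg_current_xlog_location() on the primary
and pg_last_xlog_replay_location() on the replica (the 9.5 names); the helper
names lsn_to_bytes and replay_lag_bytes are made up for illustration.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a WAL location such as '0/3000060' to an absolute byte position.

    The textual form is two hexadecimal numbers separated by a slash: the
    high 32 bits and the low 32 bits of a 64-bit position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Return how many bytes of WAL the replica still has to replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from any real system.
    print(replay_lag_bytes("1/0", "0/FF000000"))
```

Alerting when that number approaches what the primary is configured to retain
gives you some warning before a replica falls off the end.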

> We have changed the WAL settings to retain more WAL files.
> 
> Since we have not seen this issue before, (have been running postgresql for
> over 10 years), what kind
> of scenario would trigger this ?

Every time you interrupt the replication stream (such as when a replica
reboots, or its postmaster process is stopped), you enter a race between WAL
segments accumulating on the primary and the replication stream picking up
again once the replica is back. If the primary produces more WAL during the
outage than wal_keep_segments retains, it recycles segments the replica still
needs. The lineage is then broken, and the replica can never catch up again
without a WAL archive or a fresh base backup.
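
To put a rough number on that race, here is a back-of-the-envelope sketch. It
assumes the default 16 MB WAL segment size and a WAL generation rate that you
have measured yourself; the function name and the sample figures are made up
for illustration. Treat the result as an estimate, not a guarantee, since the
primary may keep somewhat more WAL around than wal_keep_segments asks for.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # default segment size; assumed unchanged


def tolerable_outage_seconds(wal_keep_segments: int,
                             wal_bytes_per_second: float) -> float:
    """Estimate how long a replica may stay disconnected before the primary
    has produced more WAL than wal_keep_segments is guaranteed to retain."""
    if wal_bytes_per_second <= 0:
        raise ValueError("wal_bytes_per_second must be positive")
    return wal_keep_segments * WAL_SEGMENT_BYTES / wal_bytes_per_second


if __name__ == "__main__":
    # Hypothetical figures: 64 segments kept, 1 MiB of WAL written per second.
    seconds = tolerable_outage_seconds(64, 1024 * 1024)
    print(f"replica may be away for roughly {seconds / 60:.0f} minutes")
```

A burst of writes (bulk load, VACUUM FULL, reindexing) during the restart can
shrink that window considerably compared to your average rate.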

Also, the "invalid resource manager" log line you reported *might* hint at
data corruption in your WAL segments. I think that data checksums and WAL
compression could both make detection of such conditions more reliable.

-- 
with best regards:
- Johannes Truschnigg ( johannes@xxxxxxxxxxxxxxx )

www:   https://johannes.truschnigg.info/