On Thu, 2024-03-28 at 15:52 +0100, Emond Papegaaij wrote:
> * we detach the primary database backend, forcing a failover
> * pgpool selects a new primary database and promotes it
> * the other 2 nodes (the old primary and the other standby) are rewound
>   and streaming is resumed from the new primary
> * the node that needed to be taken out of the cluster (the old primary)
>   is shutdown and rebooted
>
> This works fine most of the time, but sometimes we see this message on one of the nodes:
> pg_rewind: source and target cluster are on the same timeline
> pg_rewind: no rewind required
> This message seems timing related, as the first node might report that,
> while the second reports something like:
> pg_rewind: servers diverged at WAL location 5/F28AB1A8 on timeline 21
> pg_rewind: rewinding from last common checkpoint at 5/F27FCA98 on timeline 21
> pg_rewind: Done!
>
> If we ignore the response from pg_rewind, streaming will break on the node that reported
> no rewind was required. On the new primary, we do observe the database moving from timeline
> 21 to 22, but it seems this takes some time to materialize to be observable by pg_rewind.
>
> 1. Is my observation about the starting of a new timeline correct?
> 2. If yes, is there anything we can do during to block promotion process until the new
>    timeline has fully materialized, either by waiting or preferably forcing the new
>    timeline to be started?

This must be the problem addressed by commit 009eeee746 [1].

You'd have to upgrade to PostgreSQL v16, which would be a good idea anyway, given
that you are running v12.

A temporary workaround could be to explicitly trigger a checkpoint right after
promotion.

Yours,
Laurenz Albe

 [1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=009eeee746825090ec7194321a3db4b298d6571e
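
To illustrate the checkpoint workaround, here is a minimal sketch of what a
failover script could run against the new primary before it starts pg_rewind
on the other nodes. It is only a sketch under assumptions: it shells out to
psql (assumed to be on the PATH and able to connect as a superuser named
"postgres" without a password prompt), and the function names and the host
name in the usage note are made up for the example. It waits until
pg_is_in_recovery() reports false, issues CHECKPOINT, and then reads the
timeline from pg_control_checkpoint() so the caller can check that the
control file already shows the new timeline.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def run_command(args: Sequence[str]) -> str:
    """Run a command and return its stripped stdout; raises on non-zero exit."""
    result = subprocess.run(list(args), check=True, capture_output=True, text=True)
    return result.stdout.strip()


def psql_cmd(host: str, sql: str, user: str = "postgres",
             dbname: str = "postgres") -> List[str]:
    """Build a psql invocation for a single statement.

    -X skips ~/.psqlrc, -A/-t give unaligned, tuples-only output.
    The user and database names are assumptions for this sketch.
    """
    return ["psql", "-X", "-At", "-h", host, "-U", user, "-d", dbname, "-c", sql]


def checkpoint_after_promotion(host: str,
                               runner: Callable[[Sequence[str]], str] = run_command,
                               attempts: int = 30,
                               delay: float = 1.0,
                               sleep: Callable[[float], None] = time.sleep) -> int:
    """Wait for promotion to finish, force a checkpoint, return the timeline ID.

    The returned value is the timeline recorded in the control file's latest
    checkpoint, which should be the new timeline after the CHECKPOINT.
    """
    for _ in range(attempts):
        # psql prints booleans as "t"/"f" in unaligned mode.
        if runner(psql_cmd(host, "SELECT pg_is_in_recovery()")) == "f":
            break
        sleep(delay)
    else:
        raise TimeoutError(f"{host} is still in recovery after {attempts} attempts")

    # CHECKPOINT needs superuser privileges on v12.
    runner(psql_cmd(host, "CHECKPOINT"))
    return int(runner(psql_cmd(host, "SELECT timeline_id FROM pg_control_checkpoint()")))
```

A script would call something like checkpoint_after_promotion("new-primary.example")
and only start pg_rewind on the other nodes once the returned timeline is the
one it expects (22 in the example above). The runner and sleep parameters are
there only so the logic can be exercised without a real server.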