On Thu, 28 Mar 2024 at 16:21, Laurenz Albe <laurenz.albe@xxxxxxxxxxx> wrote:
> On Thu, 2024-03-28 at 15:52 +0100, Emond Papegaaij wrote:
> > This works fine most of the time, but sometimes we see this message on one of the nodes:
> > pg_rewind: source and target cluster are on the same timeline
> > pg_rewind: no rewind required
> > This message seems timing related, as the first node might report that,
> > while the second reports something like:
> > pg_rewind: servers diverged at WAL location 5/F28AB1A8 on timeline 21
> > pg_rewind: rewinding from last common checkpoint at 5/F27FCA98 on timeline 21
> > pg_rewind: Done!
> >
> > If we ignore the response from pg_rewind, streaming will break on the node that reported
> > no rewind was required. On the new primary, we do observe the database moving from timeline
> > 21 to 22, but it seems this takes some time to materialize to be observable by pg_rewind.
>
> This must be the problem addressed by commit 009eeee746 [1].

Thanks for the quick help!
This commit does seem to exactly address the problem we are seeing. Great to hear it's fixed in the latest version!
> You'd have to upgrade to PostgreSQL v16, which would be a good idea anyway, given
> that you are running v12.
This is quite high on our roadmap. We were on v12 when we introduced our HA setup. Before then, upgrading PostgreSQL was as simple as running pg_upgrade, but now we need to deal with upgrading an entire cluster. We are thinking about setting up logical replication to a single v16 node and resyncing the cluster from that node. We will make sure to upgrade before v12 reaches EOL (November this year).
> A temporary workaround could be to explicitly trigger a checkpoint right after
> promotion.
Would this be as simple as sending a CHECKPOINT to the new primary just after promoting? This would work fine for us until we've migrated to v16.
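
A minimal sketch of that workaround, to make the question concrete. It assumes promotion is done through SQL with pg_promote() (available since v12) and that psql can connect to the new primary as a superuser. The host name and the "postgres" database are placeholders, not our actual setup. The idea, as far as I understand it, is that the explicit CHECKPOINT makes the new timeline visible to pg_rewind before the other nodes run it.

```python
import subprocess


def promote_then_checkpoint(host, run=subprocess.run):
    """Promote the standby at `host`, then force an immediate checkpoint.

    `host` and the "postgres" database are placeholders for illustration.
    `run` is injectable so the command sequence can be checked without a server.
    """
    base = ["psql", "-X", "-At", "-v", "ON_ERROR_STOP=1",
            "-h", host, "-d", "postgres", "-c"]

    # pg_promote() waits (60 seconds by default) and returns true once
    # promotion has completed; with -At psql prints that as "t".
    result = run(base + ["SELECT pg_promote()"],
                 check=True, capture_output=True, text=True)
    if result.stdout.strip() != "t":
        raise RuntimeError("promotion did not complete in time")

    # Explicit checkpoint on the freshly promoted primary, issued before
    # any other node is rewound against it.
    run(base + ["CHECKPOINT"], check=True, capture_output=True, text=True)
```

Calling promote_then_checkpoint("new-primary.example") would then be followed by pg_rewind on the remaining nodes; whether this reliably closes the timing window on v12 is exactly what I am asking.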
Best regards,
Emond Papegaaij