Hi,
A client has have been having problems with pg_rewind.
They have two PostgreSQL (Oracle Enterprise Linux 7, 9.6.4) nodes in streaming replication and follow these steps:
1. Stop the master
2. Promote the standby
3. After successful failover wait some time (a lot of data is written)
4. Issue a checkpoint on the new master
5. Issue a pg_rewind on the old master
6. Start up the old master with a recovery.conf pointing at the new master.
The recovery.conf is identical in both cases (it points at a VIP).
They are seeing the old master come up as a standby after replaying some logs from the archive, then connect to the master as a streaming replica but fail when it asks for a log on the new timeline which has never existed. It isn't in the archive, or in the master pg_xlog. The numbering does exist in the old timeline (not sure if this is relevant or a coincidence).
This has been hit three times now with identical results. The old master needs to be destroyed at this stage.
In the most recent example the old master is asking for 0000001500000C73000000A9 from the streaming session. Looking at the 15.history file (timeline 20) I can see:
20 C74/45000000 no recovery target specified+
So this makes no sense, as the file it's asking for is before 0000001500000C7400000044 so never existed on either machine (the old master has never been in timeline 15). You can see that this file is in the archive on the new master.
They are using pg_backrest for archive management, which works fine the rest of the time.
Can anyone think of a way this could be happening? Are we missing a step above?
Cheers,
James Sewell,
The contents of this email are confidential and may be subject to legal or professional privilege and copyright. No representation is made that this email is free of viruses or other defects. If you have received this communication in error, you may not copy or distribute any part of it or otherwise disclose its contents to anyone. Please advise the sender of your incorrect receipt of this correspondence.