On Wed, Aug 01, 2018 at 09:09:30PM +0000, Richard Schmidt wrote:
> Our procedure that runs on machine A and B is as follows:
>
> 1. Build new databases on A and B, and configure A as Primary and B
>    as Standby databases.
> 2. Make some changes to the A (the primary) and check that they are
>    replicated to the B (the standby)
> 3. Promote B to be the new primary
> 4. Switch off the A (the original primary)
> 5. Add the replication slot to B (the new primary) for A (soon to
>    be standby)
> 6. Add a recovery.conf to A (soon to be standby). File contains
>    recovery_target_timeline = 'latest' and restore_command = 'cp
>    /ice-dev/wal_archive/%f "%p"
> 7. Run pg_rewind on A - this appears to work as it returns the
>    message 'source and target cluster are on the same timeline no
>    rewind required';
> 8. Start up server A (now a slave)

Step 7 is incorrect here: after promotion of B you should see pg_rewind
actually do its work.  The problem is that your flow is missing a piece:
a checkpoint on the promoted standby, run after step 3 and before
step 7.  This makes the promoted standby update its timeline number in
the on-disk control file, which pg_rewind uses to check whether a rewind
needs to happen or not.

We see too many reports of such mistakes, so I am going to propose a
patch on the -hackers mailing list to mention that in the
documentation...
--
Michael
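
For illustration, the corrected ordering could be sketched roughly as
below.  This is only a sketch, not a tested procedure: the function
name, the connection string and the data directory path are
placeholders, and it only builds the command lines rather than running
them.  It assumes B has already been promoted and A has been shut down
cleanly.

```python
import shlex


def rewind_commands(new_primary_conninfo, old_primary_pgdata):
    """Return, in order, the commands for the checkpoint-then-rewind step.

    new_primary_conninfo: libpq connection string for B (hypothetical value),
                          using a role allowed to run CHECKPOINT.
    old_primary_pgdata:   data directory of the stopped server A (hypothetical).
    """
    return [
        # Force a checkpoint on the promoted standby (B) so that its new
        # timeline number is written to the on-disk control file.
        ["psql", "-d", new_primary_conninfo, "-c", "CHECKPOINT;"],
        # Only after that, rewind the stopped old primary (A) against B.
        ["pg_rewind",
         "--target-pgdata", old_primary_pgdata,
         "--source-server", new_primary_conninfo],
    ]


if __name__ == "__main__":
    # Placeholder values; adjust to the real hosts and paths.
    conninfo = "host=B port=5432 dbname=postgres user=postgres"
    for argv in rewind_commands(conninfo, "/path/to/A/pgdata"):
        print(shlex.join(argv))
```

Running pg_controldata on B's data directory before and after the
checkpoint is one way to see the timeline value in the control file
change.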