On Tue, Aug 22, 2017 at 9:52 AM, Dylan Luong <Dylan.Luong@xxxxxxxxxxxx> wrote:
> I have a 1 master and 1 slave WAL streaming replication setup, and the
> application connects via a load balancer (LTM) where all connections are
> redirected to the master member (master db).
>
> We have archive_mode enabled.

First things first. What is the version of PostgreSQL involved here?

> I am trying to test using pg_rewind to restore the new slave (old master)
> after a failover while the system is under load.

Don't worry. pg_rewind works :)

> Here are the steps I take to test:
>
> 1. Disable the master LTM member (all connections redirected to slave
>    member)
> 2. Promote slave (touch promote.me)
> 3. Stop the master db (old master)
> 4. Do pg_rewind on the new slave (old master)
> 5. Start the new slave.

That flow looks correct to me. That said, I think you should manually
trigger a checkpoint on the promoted standby after step 2, so that its
control file is forcibly updated with its new timeline number. This is
a small but critical point that people usually miss. The pg_rewind
documentation does not mention it for the case of a live source server,
and many people have fallen into this trap up to now... We should
really mention that in the docs. What do others think?

> Checking on the new master, I see that the checkpoint it is trying
> to restore is the file 000000040000009C0000006F, but the file does not
> exist anywhere on the new master. Not in pg_xlog or the archive folder
> (as specified in postgresql.conf).

4 is the number of the last timeline the promoted standby has been
using, right?

> Please see attached psql.jpg.
>
> Here is my recovery.conf:
> standby_mode = 'on'
> primary_conninfo = 'host=10.69.19.18 user=replicant'
> trigger_file = '/var/run/promote_me'
> restore_command = 'cp /pg_backup/backup/archive_sync/%f "%p"'
>
> Does anyone know why?

What are the contents of /pg_backup/backup/archive_sync/?
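
To make the timeline question concrete: a WAL segment file name is 24
hexadecimal digits, where the first 8 are the timeline ID, the next 8 the
log number and the last 8 the segment number. The sketch below decodes
the name from the report and checks whether the segment and the matching
timeline history file exist in a directory. The function names are
hypothetical, and the archive path is only the one taken from the
restore_command above.

```python
import os
import re

# A WAL segment file name is exactly 24 upper-case hexadecimal digits.
WAL_SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")


def parse_wal_segment_name(name):
    """Split a WAL segment file name into timeline, log and segment numbers."""
    if not WAL_SEGMENT_RE.match(name):
        raise ValueError("not a WAL segment file name: %r" % name)
    return {
        "timeline": int(name[0:8], 16),
        "log": int(name[8:16], 16),
        "segment": int(name[16:24], 16),
    }


def find_in_archive(archive_dir, name):
    """Report whether the segment and its timeline history file exist."""
    history = name[0:8] + ".history"  # e.g. 00000004.history
    return {
        f: os.path.exists(os.path.join(archive_dir, f))
        for f in (name, history)
    }


if __name__ == "__main__":
    segment = "000000040000009C0000006F"
    print(parse_wal_segment_name(segment))
    # Path taken from the restore_command quoted above.
    print(find_in_archive("/pg_backup/backup/archive_sync", segment))
```

For 000000040000009C0000006F the timeline field is 00000004, so the
segment belongs to timeline 4. If the second call reports the segment as
missing from the archive, restore_command cannot fetch it.
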
Are you sure that the promoted standby has correctly archived the first
segment of its new timeline, for example?

> Under what conditions will pg_rewind not work?

A single missing WAL segment is enough to prevent any base backup or
rewound node from reaching a consistent point. You need to be careful
about the contents of your archives. Also, a failover done correctly is
a tricky thing: it can easily fail if you do not issue a checkpoint on
the promoted standby immediately and pg_rewind kicks in before an
automatic checkpoint happens (because of timeout or volume, whichever
comes first).
--
Michael
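
As a rough illustration of the ordering discussed in this thread, the
sketch below only builds and prints the command sequence, with the
checkpoint placed between promotion and pg_rewind. It does not execute
anything, since the commands run on two different hosts. The function
name, the data directory and the connection string are hypothetical; the
trigger file path is the one from the quoted recovery.conf. It assumes
that promotion has completed before the checkpoint is issued, that the
connection uses a superuser role, and that a recovery.conf pointing at
the new master is in place before the final start.

```python
import shlex


def failover_rewind_plan(old_master_pgdata, new_master_conninfo,
                         trigger_file="/var/run/promote_me"):
    """Return (host role, argv) pairs in the order they should be run."""
    return [
        # Step 2: promote the standby via its trigger file.
        ("promoted standby", ["touch", trigger_file]),
        # Extra step: force a checkpoint so the control file records the
        # new timeline. Assumption: promotion has finished by now (for
        # example pg_is_in_recovery() returns false); no waiting is done.
        ("promoted standby",
         ["psql", "-d", new_master_conninfo, "-c", "CHECKPOINT;"]),
        # Step 3: stop the old master cleanly.
        ("old master", ["pg_ctl", "stop", "-D", old_master_pgdata,
                        "-m", "fast"]),
        # Step 4: rewind the old master against the live new master.
        ("old master", ["pg_rewind",
                        "--target-pgdata=" + old_master_pgdata,
                        "--source-server=" + new_master_conninfo]),
        # Step 5: start the rewound node as the new standby.
        ("old master", ["pg_ctl", "start", "-D", old_master_pgdata]),
    ]


def format_plan(plan):
    """Render the plan as shell-quoted lines, one command per line."""
    return "\n".join(
        "[%s] %s" % (role, " ".join(shlex.quote(a) for a in argv))
        for role, argv in plan
    )


if __name__ == "__main__":
    # Hypothetical data directory and connection string.
    print(format_plan(failover_rewind_plan(
        "/var/lib/pgsql/data",
        "host=new-master.example dbname=postgres user=postgres")))
```

Keeping the plan as data makes the ordering easy to review before any of
it is run by hand or wired into a failover script.
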