Re: pg_rewind - restore new slave failed to startup during recovery

Michael Paquier <michael.paquier@xxxxxxxxx> · Tue, 22 Aug 2017 10:06:44 +0900

On Tue, Aug 22, 2017 at 9:52 AM, Dylan Luong <Dylan.Luong@xxxxxxxxxxxx> wrote:
> I have 1 master and 1 slave wal streaming replication setup and the
> Application connects via a load balancer (LTM) where all connections are
> redirected to the master member (master db).
>
> We have archive_mode enabled.

First things first. What is the version of PostgreSQL involved here?

> I am trying to test to use pg_rewind to restore the new slave (old master)
> after a failover while the system is under load.

Don't worry. pg_rewind works :)

> Here are the steps I take to test:
>
> 1.       Disable the master ltm member (all connections redirected to slave
> member)
> 2.       Promote slave (touch promote.me)
> 3.       Stop the master db (old master)
> 4.       Do pg_rewind on the new slave (old master)
> 5.       Start the new slave.

That flow looks correct to me. Now, I think that you should manually
trigger a checkpoint on the promoted standby after step 2, so that its
control file is forcibly updated with its new timeline number. This is
a small but critical point that people usually miss. The documentation
of pg_rewind does not mention it for the case of a live source server,
and many people have fallen into this trap up to now... We should
really mention that in the docs. What do others think?
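
To make the ordering concrete, here is a rough sketch of where the extra
checkpoint would sit in the flow above. It only builds the command lines
and runs nothing. The function name, host, user and data directory are
placeholders of mine, not values from this thread, and it assumes the
standard client tools (psql, pg_ctl, pg_rewind) are on the PATH:

```python
def failover_rewind_plan(new_master_host, old_master_pgdata, superuser="postgres"):
    """Return the commands to run, in order, as argv lists.

    Hypothetical helper for illustration: nothing is executed here.
    """
    # Connection string pg_rewind uses to reach the live source server.
    source = f"host={new_master_host} user={superuser} dbname=postgres"
    return [
        # After step 2 (promotion): force a checkpoint on the promoted
        # standby so its control file carries the new timeline number.
        ["psql", "-h", new_master_host, "-U", superuser, "-d", "postgres",
         "-c", "CHECKPOINT;"],
        # Step 3: stop the old master cleanly.
        ["pg_ctl", "-D", old_master_pgdata, "stop", "-m", "fast"],
        # Step 4: rewind the old master against the promoted standby.
        ["pg_rewind", f"--target-pgdata={old_master_pgdata}",
         f"--source-server={source}"],
        # Step 5: start the rewound node (recovery.conf must be in place).
        ["pg_ctl", "-D", old_master_pgdata, "start"],
    ]


if __name__ == "__main__":
    for argv in failover_rewind_plan("new-master.example", "/path/to/old/master/data"):
        print(" ".join(argv))
```

The point of the sketch is only the position of the CHECKPOINT: it comes
after promotion and before pg_rewind is run.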

> Checking on the new master, I see that the checkpoint it is trying
> to restore is the file 000000040000009C0000006F, but the file does not exist
> anywhere on the new master. Not in the pg_xlog or the archive folder. (as
> specified in the postgresql.conf)

4 is the number of the last timeline the promoted standby has been using, right?
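
For reference, a WAL segment file name is 24 hex digits: 8 for the
timeline, 8 for the log number and 8 for the segment within that log.
A small sketch of pulling those fields apart (the function name is mine,
purely for illustration):

```python
import string


def parse_wal_segment_name(name):
    """Split a 24-character WAL segment file name into
    (timeline, log, segment), each as an integer."""
    if len(name) != 24 or any(c not in string.hexdigits for c in name):
        raise ValueError(f"not a WAL segment file name: {name!r}")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log, segment


if __name__ == "__main__":
    # The file name from the report above.
    print(parse_wal_segment_name("000000040000009C0000006F"))
    # prints (4, 156, 111)
```

So the leading 00000004 in the missing file is timeline 4.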

> Please see attached  psql.jpg.
>
> Here is my recovery.conf :
> standby_mode = 'on'
> primary_conninfo = 'host=10.69.19.18  user=replicant'
> trigger_file = '/var/run/promote_me'
> restore_command = 'cp /pg_backup/backup/archive_sync/%f "%p"'
>
> does anyone know why?

What are the contents of /pg_backup/backup/archive_sync/? Are you sure
that the promoted standby has correctly archived the first segment of
its new timeline, for example?
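
One way to answer that question is to list what the archive holds for a
given timeline. The sketch below is only an illustration: the function
name is made up, and it assumes the archive is a flat directory of
files named the way PostgreSQL names them (24-hex-digit segments and
an 8-hex-digit ".history" file per timeline):

```python
import os


def archive_files_for_timeline(archive_dir, timeline):
    """Report whether the timeline history file is archived and which
    WAL segments of that timeline are present in archive_dir."""
    prefix = f"{timeline:08X}"          # e.g. 4 -> "00000004"
    names = sorted(os.listdir(archive_dir))
    segments = [n for n in names if len(n) == 24 and n.startswith(prefix)]
    return {
        "history_present": f"{prefix}.history" in names,
        "segments": segments,
    }


if __name__ == "__main__":
    import sys
    # Usage: python check_archive.py <archive_dir> <timeline>
    print(archive_files_for_timeline(sys.argv[1], int(sys.argv[2])))
```

An empty "segments" list for the new timeline would suggest that the
promoted standby has not archived anything on it yet.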

> Under what conditions will pg_rewind not work?

A single missing WAL segment is enough to prevent any base backup or
rewound node from reaching a consistent point, so you need to be careful
about the contents of your archives. Beyond that, a failover done
correctly is a tricky thing: it is likely to fail if pg_rewind kicks in
before the promoted standby has completed a checkpoint, so issue one
immediately after promotion rather than waiting for an automatic
checkpoint (triggered by timeout or WAL volume, whichever comes first).
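
To illustrate the "one missing segment" point, here is a minimal sketch
that looks for holes in a list of archived file names on one timeline.
It is an assumption-laden example, not a tool from PostgreSQL: the
function name is mine, and it assumes the default 16MB segment size,
which gives 256 segments per log file number.

```python
# Assumption: default 16MB WAL segments, i.e. 256 segments per log number.
SEGMENTS_PER_LOG = 0x100000000 // (16 * 1024 * 1024)


def missing_segments(names, timeline):
    """Return the segment file names missing between the lowest and the
    highest segment present for the given timeline."""
    prefix = f"{timeline:08X}"
    present = set()
    for name in names:
        if len(name) != 24 or not name.startswith(prefix):
            continue
        try:
            log = int(name[8:16], 16)
            seg = int(name[16:24], 16)
        except ValueError:
            continue  # not a plain segment file name
        present.add(log * SEGMENTS_PER_LOG + seg)
    if not present:
        return []
    return [
        f"{prefix}{n // SEGMENTS_PER_LOG:08X}{n % SEGMENTS_PER_LOG:08X}"
        for n in range(min(present), max(present) + 1)
        if n not in present
    ]


if __name__ == "__main__":
    archived = ["000000040000009C0000006F", "000000040000009C00000071"]
    print(missing_segments(archived, 4))
    # prints ['000000040000009C00000070']
```

Note that this only finds holes between the first and last segment it
sees. It cannot tell you that the very first segment recovery needs was
never archived at all.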
-- 
Michael

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)