Re: pg_rewind - restore new slave failed to startup during recovery

Dylan Luong <Dylan.Luong@xxxxxxxxxxxx> · Tue, 22 Aug 2017 04:11:49 +0000

Thanks Michael.

> First things first. What is the version of PostgreSQL involved here?

The PostgreSQL version is 9.6.

> 4 is the number of the last timeline the promoted standby has been using, right?

The history file in pg_xlog is dated at the time of the promotion on the standby (now the current master):
-rw-------. 1 postgres postgres      131 Aug 21 13:26 00000004.history

$ more 00000004.history
1       20/5C000098     no recovery target specified

2       76/F8000098     no recovery target specified

3       9C/7CC50680     no recovery target specified
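
For reference, a switch point from a history file can be mapped to the WAL segment that contains it. This is only a sketch: `wal_segment_name` is a hypothetical helper (not a PostgreSQL tool), and it assumes the default 16 MB segment size.

```shell
# Hypothetical helper: map a timeline ID and an LSN (as printed in a
# .history file) to the name of the WAL segment file containing that LSN.
# Assumes the default 16 MB segment size (256 segments per log id).
wal_segment_name() {
    tli=$1
    lsn=$2
    hi=${lsn%/*}    # hex digits before the slash
    lo=${lsn#*/}    # hex digits after the slash
    printf '%08X%08X%08X\n' "$tli" "$((0x$hi))" "$((0x$lo / 16777216))"
}

# Timeline 3 ended, and timeline 4 began, at 9C/7CC50680:
wal_segment_name 4 9C/7CC50680   # prints 000000040000009C0000007C
```

By the same arithmetic, a name such as 000000040000009C0000006F decodes to timeline 4, log 9C, segment 6F.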

> What are the contents of /pg_backup/backup/archive_sync/?

The archive folder is /pg_backup/backup/archive. I copied all of its contents via FTP from the new master to /pg_backup/backup/archive_sync on the new slave.

-----Original Message-----
From: Michael Paquier [mailto:michael.paquier@xxxxxxxxx] 
Sent: Tuesday, 22 August 2017 10:37 AM
To: Dylan Luong <Dylan.Luong@xxxxxxxxxxxx>
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re:  pg_rewind - restore new slave failed to startup during recovery

On Tue, Aug 22, 2017 at 9:52 AM, Dylan Luong <Dylan.Luong@xxxxxxxxxxxx> wrote:
> I have a WAL streaming replication setup with 1 master and 1 slave, and
> the application connects via a load balancer (LTM) where all
> connections are redirected to the master member (master db).
>
> We have archive_mode enabled.

First things first. What is the version of PostgreSQL involved here?

> I am trying to test to use pg_rewind to restore the new slave (old 
> master) after a failover while the system is under load.

Don't worry. pg_rewind works :)

> Here are the steps I take to test:
>
> 1.       Disable the master ltm member (all connections redirected to slave
> member)
> 2.       Promote slave (touch promote.me)
> 3.       Stop the master db (old master)
> 4.       Do pg_rewind on the new slave (old master)
> 5.       Start the new slave.

That flow looks correct to me. Now, I think that you should manually trigger a checkpoint on the promoted standby after step 2, so that its control file is forcibly updated with the new timeline number. This is a small but critical point people usually miss. The documentation of pg_rewind does not mention this when using a live source server, and many people have fallen into this trap up to now... We should really mention that in the docs. What do others think?
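
The checkpoint step and the rewind could be scripted roughly as follows. This is a sketch, not a tested procedure: the function names are hypothetical, every path and connection string is supplied by the caller, and the parsing assumes the English output of pg_controldata.

```shell
# Force a checkpoint on the promoted standby (needs superuser rights).
# $1 = connection string of the promoted standby
force_checkpoint() {
    psql -At -c 'CHECKPOINT;' -d "$1"
}

# Print the timeline recorded by the latest checkpoint in a data directory.
# $1 = data directory (run this on the host that owns it)
checkpoint_timeline() {
    pg_controldata "$1" |
        awk -F: '/Latest checkpoint.s TimeLineID/ { gsub(/ /, "", $2); print $2 }'
}

# Rewind the stopped old master against the live new master.
# $1 = data directory of the old master
# $2 = connection string of the new master
rewind_old_master() {
    pg_rewind --target-pgdata="$1" --source-server="$2" --progress
}

# Possible order of calls (all values are placeholders):
#   pg_ctl promote -D /path/to/standby/data            # step 2, on the standby
#   force_checkpoint 'host=NEW_MASTER dbname=postgres'  # right after promotion
#   checkpoint_timeline /path/to/standby/data          # should show the new timeline
#   pg_ctl stop -D /path/to/old-master/data -m fast    # step 3, on the old master
#   rewind_old_master /path/to/old-master/data 'host=NEW_MASTER dbname=postgres'
```

Checking the control file before running pg_rewind gives a quick confirmation that the new timeline number has actually been written.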

> Checking on the new master, I see that the checkpoint it is trying
> to restore is the file 000000040000009C0000006F, but the file does
> not exist anywhere on the new master: not in pg_xlog and not in the
> archive folder (as specified in postgresql.conf).

4 is the number of the last timeline the promoted standby has been using, right?

> Please see attached  psql.jpg.
>
> Here is my recovery.conf :
> standby_mode = 'on'
> primary_conninfo = 'host=10.69.19.18  user=replicant'
> trigger_file = '/var/run/promote_me'
> restore_command = 'cp /pg_backup/backup/archive_sync/%f "%p"'
>
> does anyone know why?

What are the contents of /pg_backup/backup/archive_sync/? Are you sure, for example, that the promoted standby has correctly archived the first segment of its new timeline?
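
One way to answer that question is to summarize what the archive directory holds for a given timeline. The helper below is a hypothetical sketch; it only looks at plain 24-character segment names and the timeline's .history file.

```shell
# Hypothetical helper: summarize what an archive directory holds for a timeline.
# $1 = archive directory, $2 = timeline ID (decimal)
archive_summary() {
    dir=$1
    tli=$(printf '%08X' "$2")
    if [ -f "$dir/$tli.history" ]; then
        echo "history: present"
    else
        echo "history: missing"
    fi
    # Count files named with exactly 24 hex digits that start with the timeline.
    count=$(ls "$dir" | grep -c -E "^${tli}[0-9A-F]{16}$")
    echo "segments: $count"
    # Print the lowest-numbered segment of that timeline, if there is one.
    ls "$dir" | grep -E "^${tli}[0-9A-F]{16}$" | sort | head -n 1
}

# Example call with the directory and timeline from this thread:
#   archive_summary /pg_backup/backup/archive_sync 4
```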

> Under what conditions will pg_rewind not work?

A single missing WAL segment is enough to prevent a base backup or a rewound node from reaching a consistent point, so you need to be careful about the contents of your archives. Beyond that, a failover is tricky to get right: if pg_rewind is run before the promoted standby has completed a checkpoint (either one you issue immediately after promotion, or an automatic one triggered by timeout or WAL volume, whichever comes first), it is likely to fail.
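
A rough way to look for a missing segment in an archive is to check that the segment numbers for one timeline are consecutive. This is a sketch with a hypothetical helper name; it assumes the default 16 MB segment size and ignores .partial and .backup files.

```shell
# Hypothetical helper: report gaps in the WAL segments archived for one timeline.
# $1 = archive directory, $2 = timeline ID (decimal)
# Assumes the default 16 MB segment size, i.e. 256 segments per log id.
find_wal_gaps() {
    tli=$(printf '%08X' "$2")
    prev=""
    for name in $(ls "$1" | grep -E "^${tli}[0-9A-F]{16}$" | sort); do
        log=$(printf '%s' "$name" | cut -c9-16)    # middle 8 hex digits
        seg=$(printf '%s' "$name" | cut -c17-24)   # last 8 hex digits
        segno=$(( 0x$log * 256 + 0x$seg ))
        # Each segment should follow the previous one directly.
        if [ -n "$prev" ] && [ "$segno" -ne $((prev + 1)) ]; then
            echo "gap before $name"
        fi
        prev=$segno
    done
}

# Example call with the directory and timeline from this thread:
#   find_wal_gaps /pg_backup/backup/archive_sync 4
```

No output means no gap was found between the first and last archived segment of that timeline; it does not prove that the first segment needed for recovery is present.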
--
Michael
